Writing a Zhihu Crawler in Java from Scratch: Crawling the Editors' Recommended Content

Source: Internet
Author: User
Tags: readline

Zhihu is a real-name online question-and-answer community with a friendly, rational, and serious atmosphere that connects elites from all walks of life. Its users share their expertise, experience, and insights with one another, providing a steady stream of high-quality content for the Chinese Internet.

First, I spent 35 minutes designing a logo. =. = As a programmer, I've always had the heart of an artist!


Well, it'll do for now; it just has to work.

Next, we start building the Zhihu crawler.

First, set our first goal: the editors' recommendations page.

Page link: http://www.zhihu.com/explore/recommendations

We make a slight change to the code from last time so that it fetches this page's content first:

import java.io.*;
import java.net.*;
import java.util.regex.*;

public class Main {
    static String sendGet(String url) {
        // A string to store the web page content
        String result = "";
        // A buffered character input stream
        BufferedReader in = null;
        try {
            // Convert the string into a URL object
            URL realUrl = new URL(url);
            // Open a connection to that URL
            URLConnection connection = realUrl.openConnection();
            // Establish the actual connection
            connection.connect();
            // Initialize a BufferedReader to read the URL's response
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream()));
            // Temporarily stores each crawled line
            String line;
            while ((line = in.readLine()) != null) {
                // Append each crawled line to result
                result += line;
            }
        } catch (Exception e) {
            System.out.println("Exception while sending GET request! " + e);
            e.printStackTrace();
        } finally {
            // Use finally to close the input stream
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        return result;
    }

    static String regexString(String targetStr, String patternStr) {
        // Define a pattern; the parentheses in the regex capture the content we want.
        // This is where we lay the trap, and the match falls right into it.
        Pattern pattern = Pattern.compile(patternStr);
        // Define a matcher to perform the match
        Matcher matcher = pattern.matcher(targetStr);
        // If a match is found,
        if (matcher.find()) {
            // return the captured group
            return matcher.group(1);
        }
        return "Nothing";
    }

    public static void main(String[] args) {
        // Define the link to visit
        String url = "http://www.zhihu.com/explore/recommendations";
        // Visit the link and get the page content
        String result = sendGet(url);
        // Use a regex to match the src content of img tags
        String imgSrc = regexString(result, "src=\"(.+?)\"");
        // Print the result
        System.out.println(result);
    }
}

Running it works fine; next comes the problem of the regular expression match.

First we'll get all the questions on the page.

Right-click a title and inspect the element:

Aha, you can see that the title is actually an a tag, that is, a hyperlink, and what distinguishes it from the other hyperlinks is its class, that is, the class selector.
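For reference, the title link in the page source looks something like this (a hypothetical snippet; the real attributes on Zhihu may differ slightly):

<a class="question_link" href="/question/21234567" target="_blank">Some question title</a>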

So our regex statement comes out as: "question_link.+?href=\"(.+?)\"" (the parentheses capture the href value).

Call the regexString function and pass it the arguments:

public static void main(String[] args) {
    // Define the link to visit
    String url = "http://www.zhihu.com/explore/recommendations";
    // Visit the link and get the page content
    String result = sendGet(url);
    // Use a regex to match the question title
    String imgSrc = regexString(result, "question_link.+?>(.+?)<");
    // Print the result
    System.out.println(imgSrc);
}

Aha, we can see that we've successfully captured a title (note: just one):

Wait a minute, what the hell is this mess?

Don't panic. =. = It's just a character-encoding problem.

For encoding problems, see: HTML character sets.

Generally speaking, the mainstream encodings with Chinese support are UTF-8, GB2312, and GBK.
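If you'd like to see how this kind of mojibake arises, here's a tiny standalone illustration (my own example, not part of the crawler; the class name is arbitrary):

import java.io.UnsupportedEncodingException;

public class EncodingDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // The UTF-8 bytes of the Chinese name for Zhihu
        byte[] utf8Bytes = "知乎".getBytes("UTF-8");
        // Decoding those bytes with the wrong charset yields garbled text
        System.out.println(new String(utf8Bytes, "ISO-8859-1"));
        // Decoding them as UTF-8 prints 知乎 correctly
        System.out.println(new String(utf8Bytes, "UTF-8"));
    }
}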

A web page can declare its encoding through the charset attribute of a meta tag, for example:

<meta charset="utf-8" />

We right-click to view the page source code:

As you can see, Zhihu uses UTF-8 encoding.

Here's a quick explanation of the difference between viewing the page source and inspecting elements.

Viewing the page source shows all the code of the entire page, not laid out according to its HTML tags; it is equivalent to looking at the raw source directly, which is useful for checking page-wide information such as meta tags.

Inspecting elements (some browsers call it "view element") shows just the element you right-clicked, such as a div or img, so you can examine that one object's attributes and tags in isolation.

Okay, now we know the problem is the encoding, so we need to decode the crawled content correctly.

The fix in Java is simple: just specify the encoding in the InputStreamReader:

// Initialize a BufferedReader to read the URL's response
in = new BufferedReader(new InputStreamReader(
        connection.getInputStream(), "UTF-8"));

When you run the program again, you will see that the title is displayed correctly:

Good! Very good!

But now there's only one title, and what we need is all of them.

We'll make a slight modification to store the matched results in an ArrayList:

import java.io.*;
import java.net.*;
import java.util.ArrayList;
import java.util.regex.*;

public class Main {
    static String sendGet(String url) {
        // A string to store the web page content
        String result = "";
        // A buffered character input stream
        BufferedReader in = null;
        try {
            // Convert the string into a URL object
            URL realUrl = new URL(url);
            // Open a connection to that URL
            URLConnection connection = realUrl.openConnection();
            // Establish the actual connection
            connection.connect();
            // Initialize a BufferedReader to read the URL's response
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream(), "UTF-8"));
            // Temporarily stores each crawled line
            String line;
            while ((line = in.readLine()) != null) {
                // Append each crawled line to result
                result += line;
            }
        } catch (Exception e) {
            System.out.println("Exception while sending GET request! " + e);
            e.printStackTrace();
        } finally {
            // Use finally to close the input stream
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        return result;
    }

    static ArrayList<String> regexString(String targetStr, String patternStr) {
        // Pre-define an ArrayList to store the results
        ArrayList<String> results = new ArrayList<String>();
        // Define a pattern; the parentheses in the regex capture the content we want
        Pattern pattern = Pattern.compile(patternStr);
        // Define a matcher to perform the match
        Matcher matcher = pattern.matcher(targetStr);
        // Whether a match has been found
        boolean isFind = matcher.find();
        // Loop over all matches in the content and collect them
        while (isFind) {
            // Add the successfully matched group
            results.add(matcher.group(1));
            // Look for the next match
            isFind = matcher.find();
        }
        return results;
    }

    public static void main(String[] args) {
        // Define the link to visit
        String url = "http://www.zhihu.com/explore/recommendations";
        // Visit the link and get the page content
        String result = sendGet(url);
        // Use a regex to match the question titles
        ArrayList<String> imgSrc = regexString(result, "question_link.+?>(.+?)<");
        // Print the results
        System.out.println(imgSrc);
    }
}

This matches all the results (the ArrayList is printed directly, hence the square brackets and commas):
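If you'd rather print one title per line instead of relying on ArrayList's bracketed toString output, a small optional tweak at the end of main would do it (a sketch using the same imgSrc list as above):

// Print each matched title on its own line
for (String title : imgSrc) {
    System.out.println(title);
}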


OK, that's the first step of our Zhihu crawler done.

But we can see that this approach alone won't let us capture all the questions and answers.

We need to design a Zhihu wrapper class to store all the crawled objects.

Zhihu.java Source:

import java.util.ArrayList;

public class Zhihu {
    public String question;            // The question title
    public String zhihuUrl;            // The web page link
    public ArrayList<String> answers;  // Stores all the answers

    // Constructor: initialize the data
    public Zhihu() {
        question = "";
        zhihuUrl = "";
        answers = new ArrayList<String>();
    }

    @Override
    public String toString() {
        return "Question: " + question + "\nLink: " + zhihuUrl
                + "\nAnswers: " + answers + "\n";
    }
}
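As a quick sanity check of the class, you can build an object by hand and print it (the values below are made up for illustration):

// Hypothetical values, just to exercise toString()
Zhihu z = new Zhihu();
z.question = "How do I write a crawler in Java?";
z.zhihuUrl = "http://www.zhihu.com/question/00000000";
System.out.println(z);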

Next, create a Spider class to hold the crawler functions we'll use most often.

Spider.java Source:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Spider {
    static String sendGet(String url) {
        // A string to store the web page content
        String result = "";
        // A buffered character input stream
        BufferedReader in = null;
        try {
            // Convert the string into a URL object
            URL realUrl = new URL(url);
            // Open a connection to that URL
            URLConnection connection = realUrl.openConnection();
            // Establish the actual connection
            connection.connect();
            // Initialize a BufferedReader to read the URL's response
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream(), "UTF-8"));
            // Temporarily stores each crawled line
            String line;
            while ((line = in.readLine()) != null) {
                // Append each crawled line to result
                result += line;
            }
        } catch (Exception e) {
            System.out.println("Exception while sending GET request! " + e);
            e.printStackTrace();
        } finally {
            // Use finally to close the input stream
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        return result;
    }

    static ArrayList<Zhihu> getZhihu(String content) {
        // Pre-define an ArrayList to store the results
        ArrayList<Zhihu> results = new ArrayList<Zhihu>();
        // Used to match the titles
        Pattern questionPattern = Pattern.compile("question_link.+?>(.+?)<");
        Matcher questionMatcher = questionPattern.matcher(content);
        // Used to match the URL, i.e. the link to the question
        Pattern urlPattern = Pattern.compile("question_link.+?href=\"(.+?)\"");
        Matcher urlMatcher = urlPattern.matcher(content);
        // Both the question and its link must be found
        boolean isFind = questionMatcher.find() && urlMatcher.find();
        while (isFind) {
            // Define a Zhihu object to store the crawled information
            Zhihu zhihuTemp = new Zhihu();
            zhihuTemp.question = questionMatcher.group(1);
            zhihuTemp.zhihuUrl = "http://www.zhihu.com" + urlMatcher.group(1);
            // Add the successfully matched result
            results.add(zhihuTemp);
            // Look for the next match
            isFind = questionMatcher.find() && urlMatcher.find();
        }
        return results;
    }
}

Lastly, the main method is responsible for calling everything:

import java.util.ArrayList;

public class Main {
    public static void main(String[] args) {
        // Define the link to visit
        String url = "http://www.zhihu.com/explore/recommendations";
        // Visit the link and get the page content
        String content = Spider.sendGet(url);
        // Get all the Zhihu objects on the page
        ArrayList<Zhihu> myZhihu = Spider.getZhihu(content);
        // Print the results
        System.out.println(myZhihu);
    }
}


OK, that's it. Run it and take a look at the results:
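Given the toString defined in Zhihu.java, each entry should print in roughly this shape (the title and link here are placeholders; answers is still empty at this stage):

Question: <some question title>
Link: http://www.zhihu.com/question/xxxxxxxx
Answers: []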

The results look good.

The next step is to visit those links and grab all the answers.
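If you want a head start, the rough idea looks like this (only a sketch: it reuses Spider.sendGet from above, and the actual answer-extraction pattern is left open, since we haven't examined the answer page's markup yet):

// For each crawled question, fetch its page;
// extracting the answers still needs a real pattern for the answer markup
for (Zhihu z : myZhihu) {
    String answerPage = Spider.sendGet(z.zhihuUrl);
    // TODO: populate z.answers by matching the answer markup (covered next time)
}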

We'll cover that next time.

Well, that's a simple walkthrough of the whole process of using Java to crawl Zhihu's editor-recommended content. It's detailed, but also simple and easy to follow. Anyone who needs it is welcome to use it as a reference, and feel free to extend it as well.
