Use WebCollector to create a crawler (Java) that crawls Zhihu and accurately extracts questions

Source: Internet
Author: User

Brief introduction:

WebCollector is a Java crawler framework (kernel) that requires no configuration and is easy to build on, providing a streamlined API. A powerful crawler can be implemented with just a small amount of code.

For how to import the WebCollector project, see the following tutorial:

Java Web Crawler WebCollector In-Depth Analysis -- Crawler Kernel


Parameters:

WebCollector does not require cumbersome configuration. You can start a crawler simply by supplying the following parameters in your code:

1. Seeds (required):

Seeds are the crawler's start pages. A crawler can have one or more seeds.

2. Regexes (optional):

Regexes are regular expressions that constrain the crawl range.

A regex does not have to be given. If the user does not supply one, the system automatically limits the crawl range to the seeds' domain names.

3. Number of threads (optional):

WebCollector is a multi-threaded crawler that runs 10 threads by default. Developers can set the number of threads themselves.
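
Putting the three parameters together, configuration might look like the following sketch (ZhihuCrawler is the crawler class defined later in this tutorial; the setThreads call is an assumption, since the exact method for setting the thread count may differ between WebCollector versions):

// Configuration sketch; the Controller section below shows these calls in full.
ZhihuCrawler crawler = new ZhihuCrawler();

// 1. Seed: the crawl's start page (required)
crawler.addSeed("http://www.zhihu.com/question/21003086");

// 2. Regex constraining the crawl range (optional; defaults to the seeds' domain names)
crawler.addRegex("http://www.zhihu.com/.*");

// 3. Thread count (optional, 10 by default); setThreads is an assumption and may
//    be named differently in your WebCollector version
// crawler.setThreads(10);

crawler.start(5);   // start the crawl; 5 is the traversal depth (see the Controller section)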


Requirements:

Briefly, the goal of the code in this tutorial is to customize a crawler that crawls the Zhihu site. We do not need to download every page and file; we only want to extract the "questions" from all of the question pages.

We need to extract the question title, for example: "After watching Batman: if I broke into the NYSE and forced everyone there to sell all the stocks, futures, and bonds they hold, what devastating consequences would follow?",

And the question content: "I'm not going to discuss the feasibility; I just want to hear what these devastating consequences would be. For example, what about the Chinese companies listed there, such as Baidu, 58, or New Oriental? What would the impact be on other exchanges, for example the London exchange? On China's stock market? And what about the effects on other currencies?"


Code:

The code is divided into two parts, the crawler and the controller.

The crawler customizes its crawl task by overriding the visit method of the parent class; visit defines what should be done for each crawled page.

The controller sets the parameters (seeds, regexes, number of threads) on the crawler and starts it, completing the control function.

1. Crawler:

WebCollector provides several crawlers (they differ mainly in their traversal algorithms). The most frequently used one is BreadthCrawler, which crawls using a breadth-first traversal. We create a new Java class, ZhihuCrawler, that extends BreadthCrawler to customize our crawler.


import java.util.regex.Pattern;
// plus the WebCollector imports for BreadthCrawler and Page
// (the package path depends on the WebCollector version in use)

public class ZhihuCrawler extends BreadthCrawler {

    /* The visit method customizes what is done for each crawled page */
    @Override
    public void visit(Page page) {
        String questionRegex = "^http://www.zhihu.com/question/[0-9]+";
        if (Pattern.matches(questionRegex, page.getUrl())) {
            System.out.println("Extracting " + page.getUrl());
            /* Extract the title */
            String title = page.getDoc().title();
            System.out.println(title);
            /* Extract the question body */
            String question = page.getDoc().select("div[id=zh-question-detail]").text();
            System.out.println(question);
        }
    }
}


Code parsing:

There are many kinds of pages on Zhihu: question pages, users' profile pages, and others. For now we only want to process the question pages.

The URL of the question page is generally as follows: http://www.zhihu.com/question/21962447

questionRegex is the regular expression that matches all question-page URLs. In the code:

if (Pattern.matches(questionRegex, page.getUrl())) {
    // processing code
}

ensures that we only extract from question pages.
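
As a quick standalone illustration of this check (independent of WebCollector; the sample URLs are taken from elsewhere in this tutorial), Pattern.matches requires the entire URL to match the regex:

import java.util.regex.Pattern;

public class RegexCheck {
    public static void main(String[] args) {
        String questionRegex = "^http://www.zhihu.com/question/[0-9]+";

        // A question-page URL matches the pattern, so visit() processes it
        System.out.println(Pattern.matches(questionRegex,
                "http://www.zhihu.com/question/21962447"));   // true

        // A topic-page URL does not match, so visit() skips it
        System.out.println(Pattern.matches(questionRegex,
                "http://www.zhihu.com/topic/19559450"));      // false
    }
}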

The parameter of the visit method, Page page, is a page that has already been downloaded and parsed into a DOM tree. Page's main methods are:

page.getUrl() returns the URL of the downloaded page

page.getContent() returns the raw data of the page

page.getDoc() returns an instance of org.jsoup.nodes.Document

page.getResponse() returns the HTTP response of the page

page.getFetchTime() returns the time the page was fetched, as generated by System.currentTimeMillis()
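
To see these accessors together, here is a minimal sketch of a crawler whose visit method only logs page metadata (MetadataCrawler is a hypothetical name introduced for illustration; the WebCollector package paths are assumed and may need adjusting to your version):

import java.util.Date;

// WebCollector imports; the package path is assumed and depends on the version in use
import cn.edu.hfut.dmic.webcollector.crawler.BreadthCrawler;
import cn.edu.hfut.dmic.webcollector.model.Page;

public class MetadataCrawler extends BreadthCrawler {

    @Override
    public void visit(Page page) {
        // URL of the downloaded page
        System.out.println("URL:     " + page.getUrl());
        // Fetch time is a System.currentTimeMillis() value, shown here as a date
        System.out.println("Fetched: " + new Date(page.getFetchTime()));
        // getDoc() returns a Jsoup Document; title() reads the <title> element
        System.out.println("Title:   " + page.getDoc().title());
    }
}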


page.getDoc() (the DOM tree) deserves special attention. The object it returns is a Jsoup Document; if you need HTML parsing and extraction, page.getDoc() is the natural choice. For how to use Jsoup, see the Jsoup tutorial:

http://www.brieftools.info/document/jsoup/
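
As a small standalone illustration of the Jsoup calls used below (this is plain Jsoup, independent of WebCollector; the HTML snippet is made up for the example):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupDemo {
    public static void main(String[] args) {
        // Parse an HTML string into a Document, the same type that page.getDoc() returns
        Document doc = Jsoup.parse(
                "<html><head><title>Sample question - Zhihu</title></head>"
              + "<body><div id=\"zh-question-detail\">Question body text</div></body></html>");

        // title() reads the <title> element
        System.out.println(doc.title());

        // select() takes a CSS selector; text() returns the text of the matched elements
        System.out.println(doc.select("div[id=zh-question-detail]").text());
    }
}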



ZhihuCrawler uses page.getUrl() and page.getDoc().

We can see that on Zhihu question pages, the HTML page title is the question title, so:

String title = page.getDoc().title();

gets us the title of the question.

To extract the question body from a question page, we need to look for patterns in the HTML source:

<div data-action="/question/detail" data-resourceid="965792" class="zm-item-rich-text" id="zh-question-detail">
    <div class="zm-editable-content">I'm not going to discuss the feasibility; I just want to hear what these devastating consequences would be.<br>
        For example, what about the Chinese companies listed there, such as Baidu, 58, or New Oriental?<br>
        What would the impact be on other exchanges, for example the London exchange? On China's stock market?<br>
        What about the effects on other currencies?
    </div>
</div>


For the "know" all the question interface. We found that the questions were placed in a id= "Zh-question-detail" div. This is one of the most suitable cases for jsoup. We just need to find this div and get the text out of it:

String question=page.getdoc (). Select ("Div[id=zh-question-detail]"). Text ();


2. Controller:

We need a controller to start the crawler:


import java.io.IOException;

public class Controller {

    public static void main(String[] args) throws IOException {
        ZhihuCrawler crawler = new ZhihuCrawler();
        crawler.addSeed("http://www.zhihu.com/question/21003086");
        crawler.addRegex("http://www.zhihu.com/.*");
        crawler.start(5);
    }
}


First, instantiate the newly defined ZhihuCrawler.

Give the crawler one seed: http://www.zhihu.com/question/21003086

crawler.start(5) does not mean starting 5 crawl threads; the 5 is the crawl depth (the number of layers of the breadth-first traversal).

When you run the Controller class, you will see continuous output, but it looks cluttered. That is because we are printing not only the extracted questions but also the crawler's logs (fetch records and so on):

fetch:http://www.zhihu.com/people/lxjts
fetch:http://www.zhihu.com/question/24597698
Extracting http://www.zhihu.com/question/24597698
Besides enhancing the feel and the "cool factor", are there other practical advantages to the Xiaomi Mi 4 using 304 stainless steel? - Zhihu
Will the signal be as tragic as the iPhone 4's...
fetch:http://www.zhihu.com/topic/20006139
fetch:http://www.zhihu.com/topic/19559450
fetch:http://www.zhihu.com/question/20014415#
fetch:http://www.zhihu.com/collection/31102864
fetch:http://www.zhihu.com/topic/19663238
fetch:http://www.zhihu.com/collection/20021567

In the output fragment above, we can see the visit method at work: it visited http://www.zhihu.com/question/24597698 and extracted the question title "Besides enhancing the feel and the 'cool factor', are there other practical advantages to the Xiaomi Mi 4 using 304 stainless steel? - Zhihu", as well as the question content "Will the signal be as tragic as the iPhone 4's...".

The "fetch:http://www.zhihu.com/question/24597698" lines are log output; each one represents a web page that has been crawled.


If you want to see clean crawl results, there are several options:

1. In ZhihuCrawler's visit method, add code that writes the title and question strings to a file (a sketch follows this list).

2. In ZhihuCrawler's visit method, add code that writes the title and question strings to a database (recommended).
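
For option 1, here is a minimal sketch of the extra code inside visit, placed after title and question have been extracted ("results.txt" is just an illustrative output path):

// Inside ZhihuCrawler.visit(), after title and question have been extracted
try {
    java.nio.file.Files.write(
            java.nio.file.Paths.get("results.txt"),            // example output path
            (title + "\n" + question + "\n\n").getBytes(java.nio.charset.StandardCharsets.UTF_8),
            java.nio.file.StandardOpenOption.CREATE,
            java.nio.file.StandardOpenOption.APPEND);
} catch (java.io.IOException e) {
    e.printStackTrace();
}

Since WebCollector calls visit from multiple threads, concurrent appends may interleave; synchronizing the write (or using a thread-safe writer) keeps the output tidy.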


You might also wonder why we did not add the Zhihu home page as a seed. The reason is that, when not logged in, the Zhihu home page redirects to the login screen by default.





