The use of "Java" Java Crawler framework WebMagic


See http://webmagic.io/ for specific details.

Implementation of PageProcessor:

Implement the PageProcessor interface.

You can define your own crawler rules in it.

In WebMagic, the page crawling process in a PageProcessor is divided into three parts:

1. Set the crawler's site parameters: retry count, sleep time between requests, etc.

2. Set the extraction rules: that is, which information you want to extract from an HTML page.

3. Find the links on the current page that have not been visited yet and add them to the crawl queue, to be crawled later.

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class GithubRepoPageProcessor implements PageProcessor {

    // Part One: crawl-related configuration of the site, including encoding, crawl interval, retry count, etc.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    // process() is the core interface of the custom crawler logic; the extraction logic is written here
    @Override
    public void process(Page page) {
        // Part Two: define how to extract page information and save it
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            // skip this page
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));

        // Part Three: find the subsequent URLs to crawl on this page
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor())
                // start crawling from "https://github.com/code4craft"
                .addUrl("https://github.com/code4craft")
                // open 5 threads to crawl
                .thread(5)
                // start the crawler
                .run();
    }
}

What is the Selectable interface:

Classes that implement the Selectable interface support chained extraction of page elements.

page.getHtml() returns an Html object, which implements the Selectable interface, so you can keep extracting from it.

That is, you can extract elements directly in a chained style such as page.getHtml().xpath(...).links().all() (a short sketch follows after the result methods below).

  

Getting results:

Once the chain has located what you want, use the get() method or the toString() method to retrieve the result.

get(): returns a String

toString(): returns a String

all(): returns all extracted results as a List<String>

match(): returns a boolean indicating whether there is any matching result
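
A minimal sketch of chained extraction together with these result methods, assuming a hypothetical blog-list page; the class name and the XPath expressions are made up for illustration and are not from the article:

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;
import java.util.List;

public class BlogListPageProcessor implements PageProcessor {

    private Site site = Site.me().setSleepTime(1000);

    @Override
    public void process(Page page) {
        // chained extraction: each call returns a Selectable, so steps can be strung together
        Selectable titles = page.getHtml().xpath("//div[@class='post']").xpath("//h2/a/text()");
        String first = titles.get();        // get(): the first matched result as a String
        List<String> all = titles.all();    // all(): every matched result
        boolean any = titles.match();       // match(): whether anything matched at all
        page.putField("firstTitle", first);
        page.putField("allTitles", all);
        page.putField("hasPosts", any);
    }

    @Override
    public Site getSite() {
        return site;
    }
}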

Saving results:

The process above already obtains the results you want; now these results need to be handled.

Choose whether to print them, save them to a database, or save them to a file.

For this I used the Pipeline component.

This component is responsible for processing the extraction results.

For example, printing results to the console is done with ConsolePipeline.

If you want to save the results to a directory on disk, the code below makes it easy.

  

public static void main(String[] args) {
    Spider.create(new GithubRepoPageProcessor())
            // start crawling from "https://github.com/code4craft"
            .addUrl("https://github.com/code4craft")
            // save results as JSON files under d:\webmagic\
            .addPipeline(new JsonFilePipeline("d:\\webmagic\\"))
            // open 5 threads to crawl
            .thread(5)
            // start the crawler
            .run();
}

Configuration of the crawler:

Spider is the class that serves as the entry point for starting a crawler.

Its create() method needs to be given a strategy, that is, an implementation of PageProcessor.

Then configure it (start URLs, pipelines, thread count, and so on),

and then .run() starts it.

  

Configuration of the website:

There is some configuration information that belongs to the site itself.

For example, some sites require login, so a cookie has to be set.

The Site class is used to configure the various properties required for a site.
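
A minimal sketch of such a Site configuration, meant as the site field inside a PageProcessor; the domain, cookie, and user agent below are placeholder values, not from the article:

import us.codecraft.webmagic.Site;

// a site field inside a PageProcessor; all concrete values here are placeholders
private Site site = Site.me()
        .setDomain("example.com")                  // domain this configuration applies to
        .addCookie("JSESSIONID", "placeholder")    // e.g. a cookie copied from a logged-in browser session
        .setUserAgent("Mozilla/5.0 (example)")     // user agent sent with each request
        .setCharset("utf-8")                       // page encoding
        .setRetryTimes(3)                          // retry failed downloads up to 3 times
        .setSleepTime(1000);                       // wait 1000 ms between requests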

Crawler monitoring:

It lets you view the crawler's execution status:

how many pages are queued and how many pages have already been downloaded.

It is implemented with JMX,

so you can use tools such as jconsole to view it.

Adding a monitor is easy:

SpiderMonitor.instance().register(oschinaSpider);
SpiderMonitor.instance().register(githubSpider);
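
For context, here is a sketch of how those two Spider instances might be created, registered, and started; OschinaBlogPageProcessor and its start URL are assumptions borrowed from WebMagic's own examples and are not defined in this article:

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.monitor.SpiderMonitor;

public class MonitorDemo {
    public static void main(String[] args) throws Exception {
        // two independent spiders; GithubRepoPageProcessor is the class defined earlier,
        // OschinaBlogPageProcessor is assumed to be another PageProcessor implementation
        Spider oschinaSpider = Spider.create(new OschinaBlogPageProcessor())
                .addUrl("http://my.oschina.net/flashsword/blog");
        Spider githubSpider = Spider.create(new GithubRepoPageProcessor())
                .addUrl("https://github.com/code4craft");

        // register both spiders with the JMX monitor, then start them
        SpiderMonitor.instance().register(oschinaSpider);
        SpiderMonitor.instance().register(githubSpider);
        oschinaSpider.start();
        githubSpider.start();
    }
}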

Components of WebMagic:

There are four: PageProcessor, Scheduler, Downloader, and Pipeline.

Each can be customized separately.

Custom Pipeline:

Implement the Pipeline interface to write your own (a rough sketch follows below).
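
As a rough sketch (not from the article) of what a custom Pipeline can look like, here is one that simply prints each extracted field; the class name is made up:

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;
import java.util.Map;

// A made-up example Pipeline that prints every extracted field to standard output.
public class PrintFieldsPipeline implements Pipeline {

    @Override
    public void process(ResultItems resultItems, Task task) {
        System.out.println("Results for task " + task.getUUID() + ":");
        for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
            System.out.println("  " + entry.getKey() + " = " + entry.getValue());
        }
    }
}

It would be attached with .addPipeline(new PrintFieldsPipeline()) when configuring the Spider, the same way JsonFilePipeline is added above.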

Several default Pipelines are already provided:

  

ConsolePipeline: outputs results to the console. Extracted results need to implement the toString() method.
FilePipeline: saves results to files. Extracted results need to implement the toString() method.
JsonFilePipeline: saves results to files in JSON format.
ConsolePageModelPipeline: (annotation mode) outputs results to the console.
FilePageModelPipeline: (annotation mode) saves results to files.
JsonFilePageModelPipeline: (annotation mode) saves results to files in JSON format. Fields you want to persist need getter methods.

Custom Scheduler

The Scheduler is the component that manages the URLs to be crawled.

By customizing it you can change where the URL queue is kept.

The existing Schedulers:

  

DuplicateRemovedScheduler: an abstract base class that provides some template methods; inherit from it to implement your own Scheduler.
QueueScheduler: uses an in-memory queue to hold the URLs to crawl.
PriorityScheduler: uses an in-memory priority queue to hold the URLs to crawl. Consumes more memory than QueueScheduler, but when Request.priority is set, only PriorityScheduler makes the priority take effect.
FileCacheQueueScheduler: uses files to record the crawled URLs, so you can shut the program down and, on the next start, continue crawling from where the previous run left off. A path must be specified, under which the .urls.txt and .cursor.txt files are created.
RedisScheduler: uses Redis to hold the crawl queue, so multiple machines can crawl together at the same time. Requires Redis to be installed and running.
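
A minimal sketch of switching Schedulers, reusing the GithubRepoPageProcessor from earlier; the cache directory is an illustrative choice:

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.scheduler.FileCacheQueueScheduler;

public class SchedulerDemo {
    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor())
                .addUrl("https://github.com/code4craft")
                // persist the URL queue on disk so the crawl can resume after a restart;
                // the directory below is only an example
                .setScheduler(new FileCacheQueueScheduler("d:\\webmagic\\"))
                .thread(5)
                .run();
    }
}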

The de-duplication strategy can also be specified separately.

    

HashSetDuplicateRemover: uses a HashSet for de-duplication; occupies a lot of memory.
BloomFilterDuplicateRemover: uses a BloomFilter for de-duplication; occupies much less memory, but may cause some pages to be missed.
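
For example, a sketch of pairing a QueueScheduler with a BloomFilterDuplicateRemover; the expected-insertions figure of 10,000,000 is only an illustrative guess:

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.scheduler.QueueScheduler;
import us.codecraft.webmagic.scheduler.component.BloomFilterDuplicateRemover;

public class DuplicateRemoverDemo {
    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor())
                .addUrl("https://github.com/code4craft")
                // keep the in-memory queue, but de-duplicate with a BloomFilter sized for
                // roughly 10,000,000 URLs (an illustrative figure)
                .setScheduler(new QueueScheduler()
                        .setDuplicateRemover(new BloomFilterDuplicateRemover(10000000)))
                .thread(5)
                .run();
    }
}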


The use of "Java" Java Crawler framework WebMagic

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.