See http://webmagic.io/ for specific details.
Implementation of PageProcessor:
Implement the PageProcessor interface.
Inside it you can define your own crawler rules.
In WebMagic, the page-crawling logic inside a PageProcessor is divided into three parts:
1. Configure the crawler: retry count, crawl interval, and so on.
2. Define the extraction rules: given an HTML page, specify which information you want to extract.
3. Discover links on the current page that have not been visited yet, and add them to the crawl queue.
```java
public class GithubRepoPageProcessor implements PageProcessor {

    // Part one: site-related crawl configuration, including encoding, crawl interval, retry count, etc.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    // process() is the core interface for custom crawler logic; the extraction logic goes here
    @Override
    public void process(Page page) {
        // Part two: define how to extract page information and save it
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            // skip this page
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));

        // Part three: discover follow-up URLs on the page to crawl
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor())
                // start crawling from "https://github.com/code4craft"
                .addUrl("https://github.com/code4craft")
                // crawl with 5 threads
                .thread(5)
                // start the crawler
                .run();
    }
}
```
What is the Selectable interface:
Implementing the Selectable interface enables chained extraction of page elements.
page.getHtml() returns an Html object that implements Selectable, so extraction can continue from it.
That is, you can chain calls such as page.getHtml().xpath(...).regex(...) to extract elements directly.
Getting results:
Once the chain has selected what you want, retrieve it with one of these methods (see the sketch after this list):
get() returns a String
toString() also returns a String
all() returns all extracted results as a List
match() returns a boolean indicating whether any result matched
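A minimal sketch of chained extraction as it would appear inside a PageProcessor (the selector expressions and field names here are illustrative, not taken from a real page):

```java
import java.util.List;

import us.codecraft.webmagic.Page;

class SelectableDemo {
    // Illustrative: chained Selectable calls inside process(Page page)
    void extract(Page page) {
        String title = page.getHtml().xpath("//h1/text()").toString();            // first match, or null
        List<String> links = page.getHtml().links().regex(".*/article/.*").all(); // every match
        boolean hasNext = page.getHtml().xpath("//a[@class='next']").match();     // true if anything matched
        page.putField("title", title);
    }
}
```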
Saving results:
The steps above already produce the results you want; now those results need to be handled.
You may want to print them, save them to a database, or save them to a file.
WebMagic uses the Pipeline component for this: a Pipeline is responsible for handling the extraction results.
For example, ConsolePipeline prints the results to the console.
If you want to store them in a directory instead, the code below does it easily.
```java
public static void main(String[] args) {
    Spider.create(new GithubRepoPageProcessor())
            // start crawling from "https://github.com/code4craft"
            .addUrl("https://github.com/code4craft")
            // save results as JSON files under d:\webmagic\
            .addPipeline(new JsonFilePipeline("d:\\webmagic\\"))
            // crawl with 5 threads
            .thread(5)
            // start the crawler
            .run();
}
```
Configuration of the crawler:
Spider is the class that serves as the entry point for starting a crawler.
Its create() method takes the crawl strategy, i.e. a PageProcessor implementation.
You then chain configuration calls,
and finally .run() starts the crawl.
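As a rough sketch, the commonly chained configuration points look like this (the scheduler and pipeline shown here are just the defaults made explicit):

```java
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.scheduler.QueueScheduler;

public class SpiderConfigDemo {
    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor())  // the crawl strategy
                .addUrl("https://github.com/code4craft") // seed URL(s)
                .setScheduler(new QueueScheduler())      // URL queue management (the default)
                .addPipeline(new ConsolePipeline())      // result handling (the default)
                .thread(5)                               // number of worker threads
                .run();                                  // blocks until the crawl finishes
    }
}
```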
Configuration of the site:
A site itself also has configuration options.
For example, some sites require you to log in or to set a cookie,
so the Site class is used to configure the various properties a site needs.
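A minimal sketch of a Site configuration (the domain, cookie, and header values are placeholders, not required settings):

```java
private Site site = Site.me()
        .setDomain("github.com")                      // domain the cookie belongs to
        .addCookie("sessionid", "<your-session-id>")  // placeholder login cookie
        .addHeader("User-Agent", "Mozilla/5.0")       // custom request header
        .setCharset("utf-8")                          // page encoding
        .setRetryTimes(3)                             // retry failed requests up to 3 times
        .setSleepTime(1000);                          // wait 1s between requests
```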
Crawler monitoring:
You can view the crawler's execution status:
how many pages are in the queue and how many have already been fetched.
This is implemented with JMX,
so you can inspect it with tools such as jconsole.
Adding a monitor is easy:
```java
SpiderMonitor.instance().register(oschinaSpider);
SpiderMonitor.instance().register(githubSpider);
```
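Put together, a self-contained sketch might look like the following (it reuses the GithubRepoPageProcessor from above; `throws Exception` covers the JMX checked exceptions):

```java
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.monitor.SpiderMonitor;

public class MonitorExample {
    public static void main(String[] args) throws Exception {
        Spider githubSpider = Spider.create(new GithubRepoPageProcessor())
                .addUrl("https://github.com/code4craft");

        // register the spider with the JMX-based monitor before starting it
        SpiderMonitor.instance().register(githubSpider);

        // start() runs asynchronously, so the JVM stays alive for jconsole to attach
        githubSpider.start();
    }
}
```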
Components of WebMagic:
There are four: PageProcessor, Scheduler, Downloader, and Pipeline.
Each can be customized separately.
Custom Pipeline:
Implement the Pipeline interface (see the sketch after the table below).
Several default Pipelines are already provided:
| Class | Description | Notes |
| --- | --- | --- |
| ConsolePipeline | Output results to the console | Extracted results need to implement toString() |
| FilePipeline | Save results to a file | Extracted results need to implement toString() |
| JsonFilePipeline | Save results to a file in JSON format | |
| ConsolePageModelPipeline | (annotation mode) Output results to the console | |
| FilePageModelPipeline | (annotation mode) Save results to a file | |
| JsonFilePageModelPipeline | (annotation mode) Save results to a file in JSON format | Fields to be persisted need getter methods |
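A minimal sketch of a custom Pipeline that just prints every extracted field (the class name is illustrative):

```java
import java.util.Map;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

public class PrintFieldsPipeline implements Pipeline {
    @Override
    public void process(ResultItems resultItems, Task task) {
        // resultItems holds everything stored via page.putField(key, value)
        for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
            System.out.println(entry.getKey() + ":\t" + entry.getValue());
        }
    }
}
```

Register it with spider.addPipeline(new PrintFieldsPipeline()) just like the built-in ones.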
Custom Scheduler:
The Scheduler is the component that manages URLs:
it maintains the queue of URLs to be crawled and deduplicates URLs that have already been seen.
The existing Schedulers (plugging one in is shown after the table):
| Class | Description | Notes |
| --- | --- | --- |
| DuplicateRemovedScheduler | Abstract base class that provides some template methods | Inherit it to implement your own Scheduler |
| QueueScheduler | Keeps the URLs to crawl in an in-memory queue | |
| PriorityScheduler | Keeps the URLs to crawl in an in-memory priority queue | Uses more memory than QueueScheduler, but Request.priority only takes effect with PriorityScheduler |
| FileCacheQueueScheduler | Saves the crawled URLs to files, so you can stop the program and resume the crawl from where it left off the next time you start | A path must be specified, where the two files .urls.txt and .cursor.txt are created |
| RedisScheduler | Stores the crawl queue in Redis so multiple machines can crawl together | Redis must be installed and running |
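Plugging a Scheduler into a Spider is a single call; a sketch assuming the cache directory is writable:

```java
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.scheduler.FileCacheQueueScheduler;

public class ResumableCrawl {
    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor())
                .addUrl("https://github.com/code4craft")
                // persist the URL queue so the crawl can resume after a restart
                .setScheduler(new FileCacheQueueScheduler("d:\\webmagic\\urls\\"))
                .run();
    }
}
```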
The deduplication strategy can also be chosen separately (see the sketch below):

| Class | Description |
| --- | --- |
| HashSetDuplicateRemover | Deduplicates with a HashSet; uses a lot of memory |
| BloomFilterDuplicateRemover | Deduplicates with a BloomFilter; uses much less memory, but may cause some pages to be skipped |
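A sketch of swapping in the BloomFilter-based remover (the capacity of 10000000 expected URLs is just an example; BloomFilterDuplicateRemover lives in the webmagic-extension module):

```java
import us.codecraft.webmagic.scheduler.QueueScheduler;
import us.codecraft.webmagic.scheduler.component.BloomFilterDuplicateRemover;

// assumes an existing Spider instance named "spider"
spider.setScheduler(new QueueScheduler()
        // deduplicate with a BloomFilter sized for roughly 10 million distinct URLs
        .setDuplicateRemover(new BloomFilterDuplicateRemover(10000000)));
```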
The use of "Java" Java Crawler framework WebMagic