See http://webmagic.io/ for specific details.
Implementation of PageProcessor:
Implement the PageProcessor interface.
Inside it you can define your own crawler rules.
In WebMagic, the page-crawling logic inside a PageProcessor is divided into three parts:
1. Configure the crawler: retry count, crawl interval, and so on.
2. Define the extraction rules: given an HTML page, specify which information you want to extract.
3. Discover links on the current page that have not been visited yet, and add them to the crawl queue.
```java
public class GithubRepoPageProcessor implements PageProcessor {

    // Part one: site-related crawl configuration, including encoding, crawl interval, retry count, etc.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    // process() is the core interface for custom crawler logic; the extraction logic goes here
    @Override
    public void process(Page page) {
        // Part two: define how to extract page information and save it
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            // skip this page
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));

        // Part three: discover follow-up URLs on the page to crawl
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor())
                // start crawling from "https://github.com/code4craft"
                .addUrl("https://github.com/code4craft")
                // crawl with 5 threads
                .thread(5)
                // start the crawler
                .run();
    }
}
```
What is the Selectable interface:
Implementing the Selectable interface enables chained extraction of page elements.
page.getHtml() returns an Html object that implements Selectable, so extraction can continue from it.
That is, you can chain calls such as page.getHtml().xpath(...).regex(...) to extract elements directly.
Getting results:
Once the chain has selected what you want, retrieve it with one of these methods (see the sketch after this list):
get() returns a String
toString() also returns a String
all() returns all extracted results as a List
match() returns a boolean indicating whether any result matched
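A minimal sketch of chained extraction as it would appear inside a PageProcessor (the selector expressions and field names here are illustrative, not taken from a real page):

```java
import java.util.List;

import us.codecraft.webmagic.Page;

class SelectableDemo {
    // Illustrative: chained Selectable calls inside process(Page page)
    void extract(Page page) {
        String title = page.getHtml().xpath("//h1/text()").toString();            // first match, or null
        List<String> links = page.getHtml().links().regex(".*/article/.*").all(); // every match
        boolean hasNext = page.getHtml().xpath("//a[@class='next']").match();     // true if anything matched
        page.putField("title", title);
    }
}
```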
Saving results:
The steps above already produce the results you want; now those results need to be handled.
You may want to print them, save them to a database, or save them to a file.
WebMagic uses the Pipeline component for this: a Pipeline is responsible for handling the extraction results.
For example, ConsolePipeline prints the results to the console.
If you want to store them in a directory instead, the code below does it easily.
```java
public static void main(String[] args) {
    Spider.create(new GithubRepoPageProcessor())
            // start crawling from "https://github.com/code4craft"
            .addUrl("https://github.com/code4craft")
            // save results as JSON files under d:\webmagic\
            .addPipeline(new JsonFilePipeline("d:\\webmagic\\"))
            // crawl with 5 threads
            .thread(5)
            // start the crawler
            .run();
}
```
Configuration of the crawler:
Spider is the class that serves as the entry point for starting a crawler.
Its create() method takes the crawl strategy, i.e. a PageProcessor implementation.
You then chain configuration calls,
and finally .run() starts the crawl.
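As a rough sketch, the commonly chained configuration points look like this (the scheduler and pipeline shown here are just the defaults made explicit):

```java
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.scheduler.QueueScheduler;

public class SpiderConfigDemo {
    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor())  // the crawl strategy
                .addUrl("https://github.com/code4craft") // seed URL(s)
                .setScheduler(new QueueScheduler())      // URL queue management (the default)
                .addPipeline(new ConsolePipeline())      // result handling (the default)
                .thread(5)                               // number of worker threads
                .run();                                  // blocks until the crawl finishes
    }
}
```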
Configuration of the site:
A site itself also has configuration options.
For example, some sites require you to log in or to set a cookie,
so the Site class is used to configure the various properties a site needs.
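A minimal sketch of a Site configuration (the domain, cookie, and header values are placeholders, not required settings):

```java
private Site site = Site.me()
        .setDomain("github.com")                      // domain the cookie belongs to
        .addCookie("sessionid", "<your-session-id>")  // placeholder login cookie
        .addHeader("User-Agent", "Mozilla/5.0")       // custom request header
        .setCharset("utf-8")                          // page encoding
        .setRetryTimes(3)                             // retry failed requests up to 3 times
        .setSleepTime(1000);                          // wait 1s between requests
```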
Crawler monitoring:
You can view the crawler's execution status:
how many pages are in the queue and how many have already been fetched.
This is implemented with JMX,
so you can inspect it with tools such as jconsole.
Adding a monitor is easy:
```java
SpiderMonitor.instance().register(oschinaSpider);
SpiderMonitor.instance().register(githubSpider);
```
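Put together, a self-contained sketch might look like the following (it reuses the GithubRepoPageProcessor from above; `throws Exception` covers the JMX checked exceptions):

```java
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.monitor.SpiderMonitor;

public class MonitorExample {
    public static void main(String[] args) throws Exception {
        Spider githubSpider = Spider.create(new GithubRepoPageProcessor())
                .addUrl("https://github.com/code4craft");

        // register the spider with the JMX-based monitor before starting it
        SpiderMonitor.instance().register(githubSpider);

        // start() runs asynchronously, so the JVM stays alive for jconsole to attach
        githubSpider.start();
    }
}
```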
Components of WebMagic:
There are four: PageProcessor, Scheduler, Downloader, and Pipeline.
Each can be customized separately.
Custom Pipeline:
Implement the Pipeline interface (see the sketch after the table below).
Several default Pipelines are already provided:
| Class | Description | Notes |
| --- | --- | --- |
| ConsolePipeline | Output results to the console | Extracted results need to implement toString() |
| FilePipeline | Save results to a file | Extracted results need to implement toString() |
| JsonFilePipeline | Save results to a file in JSON format | |
| ConsolePageModelPipeline | (annotation mode) Output results to the console | |
| FilePageModelPipeline | (annotation mode) Save results to a file | |
| JsonFilePageModelPipeline | (annotation mode) Save results to a file in JSON format | Fields to be persisted need getter methods |
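A minimal sketch of a custom Pipeline that just prints every extracted field (the class name is illustrative):

```java
import java.util.Map;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

public class PrintFieldsPipeline implements Pipeline {
    @Override
    public void process(ResultItems resultItems, Task task) {
        // resultItems holds everything stored via page.putField(key, value)
        for (Map.Entry<String, Object> entry : resultItems.getAll().entrySet()) {
            System.out.println(entry.getKey() + ":\t" + entry.getValue());
        }
    }
}
```

Register it with spider.addPipeline(new PrintFieldsPipeline()) just like the built-in ones.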
Custom Scheduler:
The Scheduler is the component that manages URLs:
it maintains the queue of URLs to be crawled and deduplicates URLs that have already been seen.
The existing Schedulers (plugging one in is shown after the table):
| Class | Description | Notes |
| --- | --- | --- |
| DuplicateRemovedScheduler | Abstract base class that provides some template methods | Inherit it to implement your own Scheduler |
| QueueScheduler | Keeps the URLs to crawl in an in-memory queue | |
| PriorityScheduler | Keeps the URLs to crawl in an in-memory priority queue | Uses more memory than QueueScheduler, but Request.priority only takes effect with PriorityScheduler |
| FileCacheQueueScheduler | Saves the crawled URLs to files, so you can stop the program and resume the crawl from where it left off the next time you start | A path must be specified, where the two files .urls.txt and .cursor.txt are created |
| RedisScheduler | Stores the crawl queue in Redis so multiple machines can crawl together | Redis must be installed and running |
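Plugging a Scheduler into a Spider is a single call; a sketch assuming the cache directory is writable:

```java
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.scheduler.FileCacheQueueScheduler;

public class ResumableCrawl {
    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor())
                .addUrl("https://github.com/code4craft")
                // persist the URL queue so the crawl can resume after a restart
                .setScheduler(new FileCacheQueueScheduler("d:\\webmagic\\urls\\"))
                .run();
    }
}
```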
The deduplication strategy can also be chosen separately (see the sketch below):

| Class | Description |
| --- | --- |
| HashSetDuplicateRemover | Deduplicates with a HashSet; uses a lot of memory |
| BloomFilterDuplicateRemover | Deduplicates with a BloomFilter; uses much less memory, but may cause some pages to be skipped |
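A sketch of swapping in the BloomFilter-based remover (the capacity of 10000000 expected URLs is just an example; BloomFilterDuplicateRemover lives in the webmagic-extension module):

```java
import us.codecraft.webmagic.scheduler.QueueScheduler;
import us.codecraft.webmagic.scheduler.component.BloomFilterDuplicateRemover;

// assumes an existing Spider instance named "spider"
spider.setScheduler(new QueueScheduler()
        // deduplicate with a BloomFilter sized for roughly 10 million distinct URLs
        .setDuplicateRemover(new BloomFilterDuplicateRemover(10000000)));
```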
The use of "Java" Java Crawler framework WebMagic