Heritrix source code analysis (4) Descriptions of various classes (2)


  This is an original article and may be reprinted. If you reprint it, please be sure to cite the source: http://guoyunsky.javaeye.com/blog/632191

You are welcome to join the Heritrix QQ groups: 109148319, 10447185 (full), and the Lucene/Solr QQ group: 118972724

 

9. org.archive.crawler.fetcher

No. | Class | Description
1 | FetchDNS | Fetches DNS data, such as a host's IP address.
2 | FetchFTP | Fetches data over FTP.
3 | FetchHTTP | Fetches data over HTTP.
4 | HeritrixHttpMethodRetryHandler | HTTP retry handler; reconnects when an HTTP request needs to be retried.

 

 

10. org.archive.crawler.framework
No. | Class | Description
1 | AbstractTracker | Parent class of the statistics trackers that collect crawl statistics. The specific statistics are implemented by subclasses.
2 | AlertManager | UI message manager. Displays crawler-related messages, such as exceptions.
3 | Checkpointer | Periodically backs up Heritrix data, such as logs and BDB files.
4 | CrawlController | The controller. Controls the start, pause, and stop of the entire crawl. A core Heritrix class.
5 | CrawlScope | URL scope manager. Covers the seeds and determines which URLs fall within the crawl scope and which do not.
6 | Filter | Decides which URLs may be crawled and which may not. This is the parent class; concrete filters are implemented by subclasses.
7 | Frontier | The scheduler. Schedules incoming URLs so that they can be crawled.
8 | Processor | A processor. Each URL is handled jointly by several processors (components). This is the processor parent class; each component provides its own implementation.
9 | ProcessorChain | A processor chain. Holds processors of the same type; for example, during link extraction, ExtractorHTML extracts links from HTML and ExtractorJS extracts them from JavaScript.
10 | ProcessorChainList | The collection of processor chains. Each URL is run through this collection: a chain is taken from the collection, each processor is taken from that chain, and every processor does its own work in turn to complete the whole fetch.
11 | Scoper | Scope checker. Checks whether a URL falls within the scope configured by the user (read from order.xml).
12 | StatisticsTracking | Statistics tracker. Mainly collects crawl statistics such as bandwidth usage, the number of URLs crawled, and crawl speed.
13 | ToePool | Thread pool that manages the crawler threads.
14 | ToeThread | A crawler thread; each one represents a single crawl worker. A core Heritrix class that runs throughout the entire crawl. It will be analyzed in detail later.
15 | WriterPoolProcessor | Manages a pool of writers. Used to manage multiple writers, and can be used in distributed scenarios.
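
The traversal described for ProcessorChainList (take each chain from the collection, then each processor from the chain, and let every processor do its work on the URL) can be pictured with a minimal sketch. This is not Heritrix code: `ProcessorChainSketch`, `Step`, `step`, `run`, and `exampleChains` are hypothetical names invented for illustration, and the "processors" here only record that they ran.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ProcessorChainSketch {

    /** Hypothetical stand-in for a processor: one step applied to a URL. */
    interface Step {
        void process(String url, List<String> log);
    }

    /** Builds a named step that only records that it ran (no real fetching). */
    static Step step(String name) {
        return (url, log) -> log.add(name + " <- " + url);
    }

    /** Walks every chain in order, and every step inside each chain in order. */
    static List<String> run(String url, List<List<Step>> chainList) {
        List<String> log = new ArrayList<>();
        for (List<Step> chain : chainList) {      // take a chain from the collection
            for (Step s : chain) {                // take each processor from the chain
                s.process(url, log);              // let it do its part of the work
            }
        }
        return log;
    }

    /** Two illustrative chains; the step names only echo classes from the tables. */
    static List<List<Step>> exampleChains() {
        List<Step> fetchChain = Arrays.asList(step("FetchDNS"), step("FetchHTTP"));
        List<Step> extractChain = Arrays.asList(step("ExtractorHTML"), step("ExtractorJS"));
        return Arrays.asList(fetchChain, extractChain);
    }

    public static void main(String[] args) {
        for (String line : run("http://example.com/", exampleChains())) {
            System.out.println(line);
        }
    }
}
```

The nested loop is the whole point of the sketch: the order of chains and the order of processors inside each chain together fix the order in which a URL is handled.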

 

 

11. org.archive.crawler.frontier
No. | Class | Description
1 | AbstractFrontier | Base implementation class of the scheduler. One of the most complex parts of Heritrix; it will be analyzed in detail later.
2 | BdbFrontier | The BDB scheduler. Uses the BDB database to manage all URLs, such as URLs waiting to be crawled and URLs already crawled. One of the most complex parts of Heritrix; it will be analyzed in detail later.
3 | BdbMultipleWorkQueues | Manages all queues; all queue data is stored in the BDB database. One of the most complex parts of Heritrix; it will be analyzed in detail later.
4 | BdbWorkQueue | A crawl queue stored in BDB. URLs with the same class key form one queue. The class key is determined by user configuration; by default, Heritrix puts URLs from the same host into one queue. One of the most complex parts of Heritrix; it will be analyzed later.
5 | FrontierJournal | Scheduler journal. Records the running state of the scheduler, such as URLs that were inserted and URLs whose insertion failed.
6 | HostnameQueueAssignmentPolicy | Class-key assignment policy, and Heritrix's default. Derives a URL's class key from its domain name; URLs with the same class key are stored in the same queue.
7 | IPQueueAssignmentPolicy | Class-key assignment policy that derives a URL's class key from its IP address.
8 | QueueAssignmentPolicy | Class-key assignment policy. This is an abstract class; the different policies (by domain name, by IP address, and so on) are implemented by subclasses. Users can add their own.
9 | RecoveryJournal | Manages /logs/recover.gz. This file records the outcome of every URL fetch, with different line formats for successes and failures. It is mainly used the next time Heritrix recovers: if Heritrix is interrupted abnormally and then restarted, it would otherwise crawl everything again. Starting from this file avoids that, and URLs that failed because of the earlier exception or interruption are crawled first.
10 | RecyclingSerialBinding | Manages the data output stream allocated to each thread. It uses ThreadLocal to hold each thread's output stream, which saves a lot of repeated serialization work.
11 | WorkQueue | Represents a queue. An abstract class with different subclass implementations, for example BdbWorkQueue, which is stored in BDB. One of the most complex parts of Heritrix; it will be analyzed in detail later.
12 | WorkQueueFrontier | The queue-based scheduler. Manages all queues, using different collections for queues in different states; for example, queues that are not active are held in Queue<String> inactiveQueues. It is arguably the most complex and most critical class in Heritrix.
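
To make the class-key idea concrete, here is a minimal sketch of grouping URLs into per-key queues, with a host-based policy in the spirit of the default described above. It is not the Heritrix implementation: `ClassKeyQueueSketch`, `KeyPolicy`, `BY_HOST`, `schedule`, `queueCount`, and `queueSize` are hypothetical names, the queues live in memory rather than in BDB, and the `"default"` fallback key for URLs without a host is an assumption made only for this example.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Queue;

public class ClassKeyQueueSketch {

    /** Hypothetical policy: derives a class key from a URL. */
    interface KeyPolicy {
        String classKey(String url);
    }

    /** Host-based policy: URLs from the same host share one class key. */
    static final KeyPolicy BY_HOST = url -> {
        String host = URI.create(url).getHost();
        // Assumption for this sketch: URLs without a host go to one "default" queue.
        return host == null ? "default" : host.toLowerCase(Locale.ROOT);
    };

    private final KeyPolicy policy;
    // One in-memory queue per class key, kept in first-seen order.
    private final Map<String, Queue<String>> queues = new LinkedHashMap<>();

    ClassKeyQueueSketch(KeyPolicy policy) {
        this.policy = policy;
    }

    /** Puts the URL into the queue selected by its class key. */
    void schedule(String url) {
        queues.computeIfAbsent(policy.classKey(url), k -> new ArrayDeque<>()).add(url);
    }

    int queueCount() {
        return queues.size();
    }

    int queueSize(String key) {
        Queue<String> q = queues.get(key);
        return q == null ? 0 : q.size();
    }

    public static void main(String[] args) {
        ClassKeyQueueSketch frontier = new ClassKeyQueueSketch(BY_HOST);
        frontier.schedule("http://example.com/a");
        frontier.schedule("http://example.com/b");
        frontier.schedule("http://example.org/");
        System.out.println(frontier.queueCount() + " queues");
    }
}
```

Swapping `BY_HOST` for another `KeyPolicy` (for example, one keyed on a resolved IP address) changes how URLs are grouped without touching the queueing code, which is the reason the policy is a separate abstraction.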
