This is an original article. If you repost it, please credit the source: http://guoyunsky.javaeye.com/blog/632191
Welcome to join the Heritrix QQ groups: 109148319, 10447185 (full), and the Lucene/SOLR QQ group: 118972724
9. org.archive.crawler.fetcher

| # | Class | Description |
|---|-------|-------------|
| 1 | FetchDNS | Fetches DNS data for a host, such as its IP address |
| 2 | FetchFTP | Fetches FTP data |
| 3 | FetchHTTP | Fetches HTTP data |
| 4 | HeritrixHttpMethodRetryHandler | HTTP retry handler: decides whether a failed HTTP request should be reissued |
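The retry handler above decides whether a failed HTTP request is worth reissuing. A minimal standalone sketch of that decision logic follows; the class name, method signature, and retry limit here are illustrative assumptions, not Heritrix's actual API (the real class implements Commons HttpClient's `HttpMethodRetryHandler` interface):

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.UnknownHostException;

// Illustrative sketch of a retry decision similar in spirit to
// HeritrixHttpMethodRetryHandler: retry transient I/O failures a
// bounded number of times, but never retry hopeless ones.
public class RetryPolicy {
    // Assumed retry budget for illustration, not Heritrix's actual value.
    private static final int MAX_RETRIES = 3;

    public static boolean shouldRetry(int executionCount, IOException cause) {
        if (executionCount > MAX_RETRIES) {
            return false;                 // retry budget exhausted: give up
        }
        if (cause instanceof UnknownHostException) {
            return false;                 // DNS lookup failed: retrying won't help
        }
        if (cause instanceof InterruptedIOException) {
            return false;                 // timeout: treated as fatal in this sketch
        }
        return true;                      // other transient I/O errors: retry
    }
}
```

The key design point is classifying exceptions: a connection reset is usually transient, while an unknown host is not, so a retry budget alone is not enough.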
10. org.archive.crawler.framework

| # | Class | Description |
|---|-------|-------------|
| 1 | AbstractTracker | Parent class of the statistics tracker that gathers crawl statistics; the concrete statistics are implemented by subclasses |
| 2 | AlertManager | UI message manager that surfaces crawl-related messages, such as exceptions |
| 3 | Checkpointer | Periodically backs up Heritrix state, such as logs and BDB files |
| 4 | CrawlController | Controller that starts, pauses, and stops the whole crawl; a core Heritrix class |
| 5 | CrawlScope | URL scope manager: decides which URLs are in scope, e.g. seeds and URLs matching the scope are crawled, while non-matching URLs are not |
| 6 | Filter | Filter that decides which URLs may be crawled and which may not; this parent class is implemented by subclasses |
| 7 | Frontier | The scheduler: queues incoming URLs and hands them out to be fetched |
| 8 | Processor | A URL is processed jointly by several processors (components); this is the processor parent class, and each component provides its own implementation |
| 9 | ProcessorChain | A processor chain groups processors of the same kind; for example, during link extraction, ExtractorHTML extracts links from HTML and ExtractorJS extracts them from JavaScript |
| 10 | ProcessorChainList | A set of processor chains. Each URL is handled by this set: each chain is taken from the list in turn, each processor is taken from its chain, and each processor does its own work, together completing the crawl of that URL |
| 11 | Scoper | Scope checker: tests whether a URL falls within the scope the user configured (read from order.xml) |
| 12 | StatisticsTracking | Statistics tracker that gathers crawl statistics such as bandwidth usage, URLs fetched, and crawl speed |
| 13 | ToePool | Thread pool that manages the crawler threads |
| 14 | ToeThread | A crawler thread; each one represents a crawl worker and runs for the life of the crawl. A core Heritrix class, analyzed in detail later |
| 15 | WriterPoolProcessor | Pool manager for write processors; manages multiple writers and can be used in distributed setups |
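The Processor, ProcessorChain, and ProcessorChainList rows above describe a two-level dispatch: a URL walks every chain in order, and every processor in each chain gets a chance to handle it. A minimal sketch of that pattern, assuming simplified names and signatures (Heritrix's real Processor API differs):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the chain-of-processors idea from the table above:
// a URL passes through each chain, then each processor in the chain.
public class ChainDemo {

    // Simplified stand-in for Heritrix's Processor parent class.
    interface Processor {
        void process(StringBuilder trace, String url);
    }

    // A chain groups processors of one stage (fetch, extract, write...).
    static class ProcessorChain {
        final List<Processor> processors = new ArrayList<>();
        ProcessorChain add(Processor p) { processors.add(p); return this; }
    }

    // Stand-in for ProcessorChainList: run a URL through every chain,
    // and every processor within each chain, in order.
    static String runAll(List<ProcessorChain> chains, String url) {
        StringBuilder trace = new StringBuilder();
        for (ProcessorChain chain : chains) {
            for (Processor p : chain.processors) {
                p.process(trace, url);
            }
        }
        return trace.toString();
    }

    static String demo(String url) {
        ProcessorChain fetch = new ProcessorChain()
            .add((t, u) -> t.append("fetch;"));
        ProcessorChain extract = new ProcessorChain()
            .add((t, u) -> t.append("extract-html;"))   // cf. ExtractorHTML
            .add((t, u) -> t.append("extract-js;"));    // cf. ExtractorJS
        return runAll(Arrays.asList(fetch, extract), url);
    }
}
```

The trace shows the ordering guarantee the table describes: chains run in list order, and processors run in chain order.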
11. org.archive.crawler.frontier

| # | Class | Description |
|---|-------|-------------|
| 1 | AbstractFrontier | Base implementation of the scheduler; one of the most complex parts of Heritrix, analyzed in detail later |
| 2 | BdbFrontier | BDB-backed scheduler that manages all URLs (those still to be crawled, those already crawled, and so on) in a BDB database; one of the most complex parts of Heritrix, analyzed in detail later |
| 3 | BdbMultipleWorkQueues | Manages all work queues, with all queue data stored in the BDB database; one of the most complex parts of Heritrix, analyzed in detail later |
| 4 | BdbWorkQueue | A BDB-backed crawl queue; URLs with the same class key form one queue. The class key is determined by user configuration; by default Heritrix puts URLs of the same host into one queue. One of the most complex parts of Heritrix, analyzed in detail later |
| 5 | FrontierJournal | Journal for the scheduler; records its activity, such as URLs inserted and URLs that failed to insert |
| 6 | HostnameQueueAssignmentPolicy | Class-key policy; Heritrix's default policy derives a URL's class key from its hostname, and URLs with the same class key go into the same queue |
| 7 | IPQueueAssignmentPolicy | Class-key policy that derives a URL's class key from its IP address |
| 8 | QueueAssignmentPolicy | Abstract class-key policy; different subclasses implement different strategies, such as by hostname or IP address, and users can extend it with their own |
| 9 | RecoveryJournal | Manages logs/recover.gz, which records every URL crawl, with different formats for successes and failures. It mainly supports recovery on the next run: if Heritrix is interrupted abnormally and simply restarted, everything would be recrawled, but starting from this file avoids that, and URLs that failed because of the earlier interruption are crawled first |
| 10 | RecyclingSerialBinding | Per-thread serialization binding; uses a ThreadLocal to give each thread its own binding, saving much repeated serialization work |
| 11 | WorkQueue | Represents one queue; an abstract class with different subclass implementations, such as the BDB-backed BdbWorkQueue. One of the most complex parts of Heritrix, analyzed in detail later |
| 12 | WorkQueueFrontier | Queue-based scheduler that manages all queues, keeping queues in different collections according to their state (for example, inactive queues in `Queue<String> inactiveQueues`). Arguably the most complex and critical class in Heritrix |
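The queue-assignment rows above describe how a class key derived from each URL decides which work queue it lands in: URLs sharing a key share a queue. A sketch of the hostname flavor, assuming simplified names (Heritrix's real HostnameQueueAssignmentPolicy handles many more edge cases, and its frontier stores queues in BDB rather than in memory):

```java
import java.net.URI;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

// Sketch of hostname-based queue assignment: the class key is the
// URL's host, and every URL with the same key goes into one queue.
public class HostQueueDemo {

    // Simplified stand-in for HostnameQueueAssignmentPolicy.getClassKey.
    static String getClassKey(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "unknown" : host.toLowerCase();
    }

    // In-memory stand-in for the frontier's per-key work queues.
    private final Map<String, Queue<String>> queues = new HashMap<>();

    void schedule(String url) {
        queues.computeIfAbsent(getClassKey(url), k -> new LinkedList<>())
              .add(url);
    }

    int queueCount() { return queues.size(); }

    int queueSize(String classKey) {
        Queue<String> q = queues.get(classKey);
        return q == null ? 0 : q.size();
    }
}
```

Grouping by host is what lets the frontier enforce per-host politeness: only one URL per queue is handed out at a time, so a single site is never hammered by many threads at once.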