This is an original article. If you reprint it, please cite the source: http://guoyunsky.javaeye.com/blog/630347
Welcome to the Heritrix QQ groups: 109148319, 10447185 (full), and the Lucene/Solr QQ group: 118972724
Heritrix's class hierarchy is genuinely cumbersome: classes inherit layer upon layer, and the deepest chains seem to run about seven levels. The sections below describe the role of each class, package by package. Heritrix's packages map fairly cleanly onto its components, but quite a few classes appear unused or are ones I don't understand, so some are skipped here; if you know them, please add a comment, thank you! If you are not yet familiar with the packages themselves, see my earlier article: http://guoyunsky.javaeye.com/admin/blogs/613249
1. org.archive.crawler

| # | Class | Description |
|---|-------|-------------|
| 1 | CommandLineParser | Heritrix can also be driven from the command line; this class parses the command-line arguments. |
| 2 | Heritrix | Heritrix's main class, used to start Heritrix. |
| 3 | SimpleHttpServer | Heritrix's embedded web server, which lets you manage Heritrix through a browser. |
| 4 | WebappLifecycle | Wraps the servlet environment so that Heritrix can be started from the web, loading the Heritrix object. |
2. org.archive.crawler.admin

| # | Class | Description |
|---|-------|-------------|
| 1 | CrawlJob | A Heritrix core class representing one crawl job. Most of the attributes in order.xml are configured around it. |
| 2 | CrawlJobErrorHandler | Handles a crawl job's error log, which is shown in the UI. |
| 3 | CrawlJobHandler | Crawl-job handler; through it Heritrix can manage multiple crawl jobs. |
| 4 | InvalidJobFileException | Thrown when a crawl-job file is invalid; of little significance. |
| 5 | SeedRecord | Records how a seed was processed; for example, when a URL in seeds.txt gets redirected, the redirect value comes from here. |
| 6 | StatisticsSummary | Statistical summary class; not used much. |
| 7 | StatisticsTracker | A Heritrix core class: the statistics tracker, active throughout a crawl, e.g. counting fetched URLs. I will cover it in detail later. |
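To make the tracker's role concrete, here is a minimal stand-alone counter in the spirit of StatisticsTracker/CrawlSubstats. It is not the actual Heritrix class; the names and fields are illustrative only:

```java
// Illustrative stand-in for a crawl statistics tracker: count fetched
// URIs, successes, and downloaded bytes. The real
// org.archive.crawler.admin.StatisticsTracker is far richer.
public class CrawlStats {
    private long fetched;
    private long succeeded;
    private long bytes;

    public void recordFetch(boolean success, long byteCount) {
        fetched++;
        if (success) {
            succeeded++;
            bytes += byteCount;
        }
    }

    public long getFetched()   { return fetched; }
    public long getSucceeded() { return succeeded; }
    public long getBytes()     { return bytes; }

    public static void main(String[] args) {
        CrawlStats stats = new CrawlStats();
        stats.recordFetch(true, 2048);
        stats.recordFetch(false, 0);
        System.out.println(stats.getFetched() + " fetched, "
                + stats.getSucceeded() + " ok, "
                + stats.getBytes() + " bytes");
    }
}
```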
3.org. archive. crawler. admin. UI
Serial number |
Class |
Description |
1 |
Cookieutils |
Cookie tool class, mainly used to access cookies |
2 |
Jobconfigureutils |
This class is used when you configure a crawler job through the Web UI. |
3 |
Rootfilter |
Unfamiliar |
4. org.archive.crawler.datamodel

| # | Class | Description |
|---|-------|-------------|
| 1 | CandidateURI | A Heritrix core class representing a URL throughout the crawl. It differs from CrawlURI in that it has not yet passed the scheduler (Frontier); only URLs that have been through the scheduler go on to download page content. |
| 2 | CandidateURITest | Test class for CandidateURI; useful, for example, for learning how to construct a CandidateURI. |
| 3 | Checkpoint | Heritrix periodically backs up its data, such as logs and fetched URL content, at the lowest level. When Heritrix is interrupted by a failure, the checkpoint can be used for recovery, much like a database checkpoint. |
| 4 | CoreAttributeConstants | Holds the variable names of Heritrix's basic attributes, generally the tag names in order.xml. |
| 5 | CrawlHost | A Heritrix core class representing a host, mainly holding its domain name and IP address. Heritrix can throttle crawling per host, e.g. the crawl rate for a host; this class represents that host. |
| 6 | CrawlOrder | A Heritrix core class that corresponds more or less one-to-one to the attribute values in order.xml, apart from the detailed attributes of the individual components. I will cover it in detail later. |
| 7 | CrawlServer | Another Heritrix core class representing a host; it holds Heritrix's per-host data, such as statistics and the robots policy. |
| 8 | CrawlSubstats | Crawl statistics class, mainly counting fetched URLs, successful URLs, and downloaded bytes. |
| 9 | CrawlURI | Subclass of CandidateURI; compared with CandidateURI it additionally carries the page-content fingerprint, its queue, and the component processors. |
| 10 | CredentialStore | Credential store, responsible for holding the various credentials, e.g. for logins. |
| 11 | FetchStatusCodes | Fetch status codes; each constant denotes a fetch status, e.g. S_DNS_SUCCESS for a successful DNS lookup. |
| 12 | RobotsHonoringPolicy | Robots policy, representing the different ways of honoring robots rules. |
| 13 | Robotstxt | Robots exclusion support, used for handling robots.txt. |
| 14 | ServerCache | Server cache, mainly caching CrawlHost and CrawlServer objects. |
| 15 | UriUniqFilter | Interface for filtering out URLs that have already been crawled. |
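The idea behind UriUniqFilter can be sketched in a few lines: admit a URI only the first time it is seen. Heritrix's real implementations are disk-backed for scale; this HashSet version, with an invented class name, only shows the shape of the contract:

```java
import java.util.HashSet;
import java.util.Set;

// Stand-alone sketch of an already-seen filter (not the actual
// Heritrix UriUniqFilter interface): a URI is admitted only on
// its first appearance.
public class SeenUriFilter {
    private final Set<String> seen = new HashSet<>();

    /** Returns true if the URI was not seen before, i.e. should be scheduled. */
    public boolean add(String uri) {
        return seen.add(uri);
    }

    public long count() {
        return seen.size();
    }

    public static void main(String[] args) {
        SeenUriFilter filter = new SeenUriFilter();
        System.out.println(filter.add("http://example.com/a")); // true: new URI
        System.out.println(filter.add("http://example.com/a")); // false: duplicate
    }
}
```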
5. org.archive.crawler.datamodel.credential

| # | Class | Description |
|---|-------|-------------|
| 1 | Credential | Parent credential class; represents a credential whose data comes from the order.xml configuration file. |
| 2 | CredentialAvatar | Represents a concrete credential instance. |
| 3 | HtmlFormCredential | Subclass of Credential representing the credentials needed to submit an HTML form. |
| 4 | Rfc2617Credential | Subclass of Credential representing RFC 2617 HTTP authentication credentials. |
6. org.archive.crawler.deciderules

| # | Class | Description |
|---|-------|-------------|
| 1 | AcceptDecideRule | URL rule that always accepts. |
| 2 | ConfiguredDecideRule | URL rule whose decision (ACCEPT or REJECT) is taken from the configuration in order.xml. |
| 3 | DecideRule | Parent class of the URL rules. It decides whether a URL is accepted (ACCEPT), rejected (REJECT), or passed over (PASS) via its decisionFor(Object object) method, which subclasses implement. |
| 4 | DecidingScope | Checks whether a URL is in scope by running the rules to decide accept, reject, or pass. |
| 5 | MatchesRegExpDecideRule | Decides accept/reject/pass by matching the URL against a configured regular expression. |
| 6 | NotMatchesRegExpDecideRule | Subclass of MatchesRegExpDecideRule that applies its decision when the URL does not match the regular expression. |
| 7 | PathologicalPathDecideRule | Rejects a URL when the same directory name repeats more times than the configured limit, e.g. when the number of "a" segments in http://www.xxx.com/a/a/a/a/a exceeds the threshold. |
| 8 | PrerequisiteAcceptDecideRule | Accepts a URL that has a prerequisite URL, i.e. the pathFromSeed attribute in the CandidateURI contains "P", meaning another URL must be fetched before this one. |
| 9 | RejectDecideRule | URL rule that always rejects. |
| 10 | TooManyHopsDecideRule | Rejects a URL whose hop count exceeds the max-hops value in the configuration file. |
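The ACCEPT/REJECT/PASS mechanics described above can be modeled in a small stand-alone sketch. The real org.archive.crawler.deciderules.DecideRule exposes decisionFor(Object) and chains rules so that later non-PASS decisions override earlier ones; everything else below (class names, enum, the rule factories) is illustrative, not the actual Heritrix API:

```java
import java.util.List;
import java.util.regex.Pattern;

// Toy model of a decide-rule chain: each rule returns ACCEPT, REJECT,
// or PASS, and the last non-PASS decision in the chain wins.
public class DecideRuleDemo {
    enum Decision { ACCEPT, REJECT, PASS }

    interface Rule { Decision decisionFor(String uri); }

    // Always accept, in the spirit of AcceptDecideRule.
    static final Rule ACCEPT_ALL = uri -> Decision.ACCEPT;

    // Reject URLs matching a regexp, in the spirit of a
    // reject-configured MatchesRegExpDecideRule.
    static Rule rejectMatching(String regexp) {
        Pattern p = Pattern.compile(regexp);
        return uri -> p.matcher(uri).matches() ? Decision.REJECT : Decision.PASS;
    }

    // Run the chain: PASS leaves the previous decision standing.
    static Decision decide(List<Rule> rules, String uri) {
        Decision result = Decision.PASS;
        for (Rule r : rules) {
            Decision d = r.decisionFor(uri);
            if (d != Decision.PASS) result = d;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Rule> chain = List.of(ACCEPT_ALL, rejectMatching(".*\\.pdf"));
        System.out.println(decide(chain, "http://example.com/page.html")); // ACCEPT
        System.out.println(decide(chain, "http://example.com/file.pdf"));  // REJECT
    }
}
```

Note how PASS makes a rule neutral: a rule only changes the outcome when it has an opinion, which is why accept-all typically sits first in the chain with narrower reject rules after it.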
7. org.archive.crawler.event

| # | Class | Description |
|---|-------|-------------|
| 1 | CrawlStatusListener | Crawl status listener, e.g. for whether the crawl is running or paused. |
| 2 | CrawlURIDispositionListener | URL disposition listener; for example, when it observes that a URL's fetch failed, the URL can be scheduled for re-fetching. |
8. org.archive.crawler.extractor

| # | Class | Description |
|---|-------|-------------|
| 1 | Extractor | Parent class of all extractors, which extract new URLs from a fetched URL's content. |
| 2 | ExtractorCSS | Extracts new URLs from CSS. |
| 3 | ExtractorDOC | Extracts new URLs from Word documents. |
| 4 | ExtractorHTML | Extracts new URLs from HTML; a Heritrix core class. |
| 5 | ExtractorHTTP | Extracts new URLs from HTTP. |
| 6 | ExtractorJS | Extracts new URLs from JavaScript. |
| 7 | ExtractorPDF | Extracts new URLs from PDF. |
| 8 | ExtractorSWF | Extracts new URLs from Flash (SWF). |
| 9 | ExtractorXML | Extracts new URLs from XML. |
| 10 | HTTPContentDigest | Page-content digest; in practice it fingerprints the content with the MD5 or SHA-1 algorithm. |
| 11 | Link | A link, representing an extracted URL. |
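To show the general shape of the extractor idea, here is a toy regexp-based link extractor. ExtractorHTML handles far more (relative URLs, base tags, attributes beyond href); this sketch, with an invented class name, only illustrates "pull candidate URLs out of fetched content":

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration of an HTML extractor (not the actual Heritrix
// ExtractorHTML): collect absolute href values from a page.
public class TinyHtmlExtractor {
    private static final Pattern HREF =
            Pattern.compile("href=\"(http[^\"]+)\"");

    public static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">a</a>"
                + "<a href=\"http://example.com/b\">b</a>";
        System.out.println(extract(html));
    }
}
```

In Heritrix, extracted URLs like these would then flow back through the scope rules and the Frontier as new CandidateURIs.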