This is an original article. If you reprint it, please cite the source: http://guoyunsky.javaeye.com/blog/630347
Welcome to the Heritrix QQ groups: 109148319, 10447185 (full), and the Lucene/Solr QQ group: 118972724
Heritrix's class hierarchy is genuinely cumbersome: classes inherit layer upon layer, and the deepest chains seem to run about seven levels. The sections below describe the role of each class, package by package. Heritrix's packages map fairly cleanly onto its components, but quite a few classes appear unused or are ones I don't understand, so some are skipped here; if you know them, please add a comment, thank you! If you are not yet familiar with the packages themselves, see my earlier article: http://guoyunsky.javaeye.com/admin/blogs/613249
1. org.archive.crawler

| # | Class | Description |
|---|-------|-------------|
| 1 | CommandLineParser | Heritrix can also be driven from the command line; this class parses the command-line arguments. |
| 2 | Heritrix | Heritrix's main class, used to start Heritrix. |
| 3 | SimpleHttpServer | Heritrix's embedded web server, which lets you manage Heritrix through a browser. |
| 4 | WebappLifecycle | Wraps the servlet environment so that Heritrix can be started from the web, loading the Heritrix object. |
2. org.archive.crawler.admin

| # | Class | Description |
|---|-------|-------------|
| 1 | CrawlJob | A Heritrix core class representing one crawl job. Most of the attributes in order.xml are configured around it. |
| 2 | CrawlJobErrorHandler | Handles a crawl job's error log, which is shown in the UI. |
| 3 | CrawlJobHandler | Crawl-job handler; through it Heritrix can manage multiple crawl jobs. |
| 4 | InvalidJobFileException | Thrown when a crawl-job file is invalid; of little significance. |
| 5 | SeedRecord | Records how a seed was processed; for example, when a URL in seeds.txt gets redirected, the redirect value comes from here. |
| 6 | StatisticsSummary | Statistical summary class; not used much. |
| 7 | StatisticsTracker | A Heritrix core class: the statistics tracker, active throughout a crawl, e.g. counting fetched URLs. I will cover it in detail later. |
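To make the tracker's role concrete, here is a minimal stand-alone counter in the spirit of StatisticsTracker/CrawlSubstats. It is not the actual Heritrix class; the names and fields are illustrative only:

```java
// Illustrative stand-in for a crawl statistics tracker: count fetched
// URIs, successes, and downloaded bytes. The real
// org.archive.crawler.admin.StatisticsTracker is far richer.
public class CrawlStats {
    private long fetched;
    private long succeeded;
    private long bytes;

    public void recordFetch(boolean success, long byteCount) {
        fetched++;
        if (success) {
            succeeded++;
            bytes += byteCount;
        }
    }

    public long getFetched()   { return fetched; }
    public long getSucceeded() { return succeeded; }
    public long getBytes()     { return bytes; }

    public static void main(String[] args) {
        CrawlStats stats = new CrawlStats();
        stats.recordFetch(true, 2048);
        stats.recordFetch(false, 0);
        System.out.println(stats.getFetched() + " fetched, "
                + stats.getSucceeded() + " ok, "
                + stats.getBytes() + " bytes");
    }
}
```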
3.org. archive. crawler. admin. UI
Serial number |
Class |
Description |
1 |
Cookieutils |
Cookie tool class, mainly used to access cookies |
2 |
Jobconfigureutils |
This class is used when you configure a crawler job through the Web UI. |
3 |
Rootfilter |
Unfamiliar |
4. org.archive.crawler.datamodel

| # | Class | Description |
|---|-------|-------------|
| 1 | CandidateURI | A Heritrix core class representing a URL throughout the crawl. It differs from CrawlURI in that it has not yet passed the scheduler (Frontier); only URLs that have been through the scheduler go on to download page content. |
| 2 | CandidateURITest | Test class for CandidateURI; useful, for example, for learning how to construct a CandidateURI. |
| 3 | Checkpoint | Heritrix periodically backs up its data, such as logs and fetched URL content, at the lowest level. When Heritrix is interrupted by a failure, the checkpoint can be used for recovery, much like a database checkpoint. |
| 4 | CoreAttributeConstants | Holds the variable names of Heritrix's basic attributes, generally the tag names in order.xml. |
| 5 | CrawlHost | A Heritrix core class representing a host, mainly holding its domain name and IP address. Heritrix can throttle crawling per host, e.g. the crawl rate for a host; this class represents that host. |
| 6 | CrawlOrder | A Heritrix core class that corresponds more or less one-to-one to the attribute values in order.xml, apart from the detailed attributes of the individual components. I will cover it in detail later. |
| 7 | CrawlServer | Another Heritrix core class representing a host; it holds Heritrix's per-host data, such as statistics and the robots policy. |
| 8 | CrawlSubstats | Crawl statistics class, mainly counting fetched URLs, successful URLs, and downloaded bytes. |
| 9 | CrawlURI | Subclass of CandidateURI; compared with CandidateURI it additionally carries the page-content fingerprint, its queue, and the component processors. |
| 10 | CredentialStore | Credential store, responsible for holding the various credentials, e.g. for logins. |
| 11 | FetchStatusCodes | Fetch status codes; each constant denotes a fetch status, e.g. S_DNS_SUCCESS for a successful DNS lookup. |
| 12 | RobotsHonoringPolicy | Robots policy, representing the different ways of honoring robots rules. |
| 13 | Robotstxt | Robots exclusion support, used for handling robots.txt. |
| 14 | ServerCache | Server cache, mainly caching CrawlHost and CrawlServer objects. |
| 15 | UriUniqFilter | Interface for filtering out URLs that have already been crawled. |
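The idea behind UriUniqFilter can be sketched in a few lines: admit a URI only the first time it is seen. Heritrix's real implementations are disk-backed for scale; this HashSet version, with an invented class name, only shows the shape of the contract:

```java
import java.util.HashSet;
import java.util.Set;

// Stand-alone sketch of an already-seen filter (not the actual
// Heritrix UriUniqFilter interface): a URI is admitted only on
// its first appearance.
public class SeenUriFilter {
    private final Set<String> seen = new HashSet<>();

    /** Returns true if the URI was not seen before, i.e. should be scheduled. */
    public boolean add(String uri) {
        return seen.add(uri);
    }

    public long count() {
        return seen.size();
    }

    public static void main(String[] args) {
        SeenUriFilter filter = new SeenUriFilter();
        System.out.println(filter.add("http://example.com/a")); // true: new URI
        System.out.println(filter.add("http://example.com/a")); // false: duplicate
    }
}
```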
5. org.archive.crawler.datamodel.credential

| # | Class | Description |
|---|-------|-------------|
| 1 | Credential | Parent credential class; represents a credential whose data comes from the order.xml configuration file. |
| 2 | CredentialAvatar | Represents a concrete credential instance. |
| 3 | HtmlFormCredential | Subclass of Credential representing the credentials needed to submit an HTML form. |
| 4 | Rfc2617Credential | Subclass of Credential representing RFC 2617 HTTP authentication credentials. |
6. org.archive.crawler.deciderules

| # | Class | Description |
|---|-------|-------------|
| 1 | AcceptDecideRule | URL rule that always accepts. |
| 2 | ConfiguredDecideRule | URL rule whose decision (ACCEPT or REJECT) is taken from the configuration in order.xml. |
| 3 | DecideRule | Parent class of the URL rules. It decides whether a URL is accepted (ACCEPT), rejected (REJECT), or passed over (PASS) via its decisionFor(Object object) method, which subclasses implement. |
| 4 | DecidingScope | Checks whether a URL is in scope by running the rules to decide accept, reject, or pass. |
| 5 | MatchesRegExpDecideRule | Decides accept/reject/pass by matching the URL against a configured regular expression. |
| 6 | NotMatchesRegExpDecideRule | Subclass of MatchesRegExpDecideRule that applies its decision when the URL does not match the regular expression. |
| 7 | PathologicalPathDecideRule | Rejects a URL when the same directory name repeats more times than the configured limit, e.g. when the number of "a" segments in http://www.xxx.com/a/a/a/a/a exceeds the threshold. |
| 8 | PrerequisiteAcceptDecideRule | Accepts a URL that has a prerequisite URL, i.e. the pathFromSeed attribute in the CandidateURI contains "P", meaning another URL must be fetched before this one. |
| 9 | RejectDecideRule | URL rule that always rejects. |
| 10 | TooManyHopsDecideRule | Rejects a URL whose hop count exceeds the max-hops value in the configuration file. |
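The ACCEPT/REJECT/PASS mechanics described above can be modeled in a small stand-alone sketch. The real org.archive.crawler.deciderules.DecideRule exposes decisionFor(Object) and chains rules so that later non-PASS decisions override earlier ones; everything else below (class names, enum, the rule factories) is illustrative, not the actual Heritrix API:

```java
import java.util.List;
import java.util.regex.Pattern;

// Toy model of a decide-rule chain: each rule returns ACCEPT, REJECT,
// or PASS, and the last non-PASS decision in the chain wins.
public class DecideRuleDemo {
    enum Decision { ACCEPT, REJECT, PASS }

    interface Rule { Decision decisionFor(String uri); }

    // Always accept, in the spirit of AcceptDecideRule.
    static final Rule ACCEPT_ALL = uri -> Decision.ACCEPT;

    // Reject URLs matching a regexp, in the spirit of a
    // reject-configured MatchesRegExpDecideRule.
    static Rule rejectMatching(String regexp) {
        Pattern p = Pattern.compile(regexp);
        return uri -> p.matcher(uri).matches() ? Decision.REJECT : Decision.PASS;
    }

    // Run the chain: PASS leaves the previous decision standing.
    static Decision decide(List<Rule> rules, String uri) {
        Decision result = Decision.PASS;
        for (Rule r : rules) {
            Decision d = r.decisionFor(uri);
            if (d != Decision.PASS) result = d;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Rule> chain = List.of(ACCEPT_ALL, rejectMatching(".*\\.pdf"));
        System.out.println(decide(chain, "http://example.com/page.html")); // ACCEPT
        System.out.println(decide(chain, "http://example.com/file.pdf"));  // REJECT
    }
}
```

Note how PASS makes a rule neutral: a rule only changes the outcome when it has an opinion, which is why accept-all typically sits first in the chain with narrower reject rules after it.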
7. org.archive.crawler.event

| # | Class | Description |
|---|-------|-------------|
| 1 | CrawlStatusListener | Crawl status listener, e.g. for whether the crawl is running or paused. |
| 2 | CrawlURIDispositionListener | URL disposition listener; for example, when it observes that a URL's fetch failed, the URL can be scheduled for re-fetching. |
8. org.archive.crawler.extractor

| # | Class | Description |
|---|-------|-------------|
| 1 | Extractor | Parent class of all extractors, which extract new URLs from a fetched URL's content. |
| 2 | ExtractorCSS | Extracts new URLs from CSS. |
| 3 | ExtractorDOC | Extracts new URLs from Word documents. |
| 4 | ExtractorHTML | Extracts new URLs from HTML; a Heritrix core class. |
| 5 | ExtractorHTTP | Extracts new URLs from HTTP. |
| 6 | ExtractorJS | Extracts new URLs from JavaScript. |
| 7 | ExtractorPDF | Extracts new URLs from PDF. |
| 8 | ExtractorSWF | Extracts new URLs from Flash (SWF). |
| 9 | ExtractorXML | Extracts new URLs from XML. |
| 10 | HTTPContentDigest | Page-content digest; in practice it fingerprints the content with the MD5 or SHA-1 algorithm. |
| 11 | Link | A link, representing an extracted URL. |
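To show the general shape of the extractor idea, here is a toy regexp-based link extractor. ExtractorHTML handles far more (relative URLs, base tags, attributes beyond href); this sketch, with an invented class name, only illustrates "pull candidate URLs out of fetched content":

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration of an HTML extractor (not the actual Heritrix
// ExtractorHTML): collect absolute href values from a page.
public class TinyHtmlExtractor {
    private static final Pattern HREF =
            Pattern.compile("href=\"(http[^\"]+)\"");

    public static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">a</a>"
                + "<a href=\"http://example.com/b\">b</a>";
        System.out.println(extract(html));
    }
}
```

In Heritrix, extracted URLs like these would then flow back through the scope rules and the Frontier as new CandidateURIs.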