Heritrix Source Code Analysis (4): Descriptions of the Classes (1)


This blog post is an original article, and reprints are welcome! When reprinting, please be sure to cite the source: http://guoyunsky.javaeye.com/blog/630347

You are welcome to join the Heritrix QQ groups: 109148319, 10447185 (full), and the Lucene/Solr QQ group: 118972724

 

Heritrix's class hierarchy is indeed cumbersome. Classes often inherit layer upon layer, and the deepest inheritance chain seems to be seven levels. Below I describe the role of each class, one package at a time. Heritrix's components are cleanly separated by package, but there are many components I have never used and whose classes I have barely touched, so I skip some of them here. If you know them, please fill them in. Thank you! If you are not familiar with the packages, see my previous article: http://guoyunsky.javaeye.com/admin/blogs/613249

 

1. org.archive.crawler
No. | Class | Description
1 | CommandLineParser | Heritrix can also be operated from the command line; this class parses the command-line arguments.
2 | Heritrix | The Heritrix main class, used to start Heritrix.
3 | SimpleHttpServer | Heritrix's web server, which lets you manage Heritrix through the web UI.
4 | WebappLifecycle | Wraps a servlet so that Heritrix can be started through the web; it loads the Heritrix object.

 

2. org.archive.crawler.admin
No. | Class | Description
1 | CrawlJob | Heritrix core class. Represents a crawl job; most of the settings in order.xml are configured around it.
2 | CrawlJobErrorHandler | Holds the error log of a crawl job so that it can be displayed in the UI.
3 | CrawlJobHandler | Crawl job handler. Heritrix can hold multiple crawl jobs, and this class manages them.
4 | InvalidJobFileException | Thrown when a crawl job file is invalid. Of little significance.
5 | SeedRecord | Records how a seed was handled, for example which URL a seed from seeds.txt was redirected to, so that the redirect target can be traced back to its seed.
6 | StatisticsSummary | Statistics summary class. Rarely used.
7 | StatisticsTracker | Heritrix core class. The statistics tracker runs throughout a crawl, for example counting the URLs that have been fetched. It will be covered in detail later.

 

 

3. org.archive.crawler.admin.ui
No. | Class | Description
1 | CookieUtils | Cookie utility class, mainly used to read cookies.
2 | JobConfigureUtils | Used when you configure a crawl job through the web UI.
3 | RootFilter | I am not familiar with this class.

 

 

4. org.archive.crawler.datamodel
No. | Class | Description
1 | CandidateURI | Heritrix core class. Represents a URL and is used throughout the whole crawl. Unlike CrawlURI, it has not yet passed through the scheduler (Frontier); only a URL that has passed through the Frontier can have its page content downloaded.
2 | CandidateURITest | Test class for CandidateURI. You can read it to learn, for example, how to create a CandidateURI.
3 | Checkpoint | Heritrix periodically backs up its data, such as logs and the state of fetched URLs, in the background. If Heritrix is interrupted by an exception, the checkpoint can be used for recovery. This is similar to the checkpoints of a database.
4 | CoreAttributeConstants | Holds the names of Heritrix's basic attributes; these are generally the tag names in order.xml.
5 | CrawlHost | Heritrix core class. Represents a host and mainly holds its domain name and IP address. Heritrix can throttle the crawl speed per host, and "host" in that sense refers to this class.
6 | CrawlOrder | Heritrix core class. It essentially corresponds to the attribute values in order.xml, except for the detailed attributes of each component. It will be covered in detail later.
7 | CrawlServer | Heritrix core class. It also corresponds to a host and holds Heritrix's per-host data, such as statistics and the robots.txt rules.
8 | CrawlSubstats | Crawl statistics class. It mainly counts the number of URLs fetched, the number fetched successfully, and the number of bytes downloaded.
9 | CrawlURI | Subclass of CandidateURI. Compared with CandidateURI, it additionally carries the page content fingerprint, the queue it belongs to, and the processors that handle it.
10 | CredentialStore | Credential store, responsible for storing the various credentials, such as login credentials.
11 | FetchStatusCodes | Fetch status codes. Each constant indicates a different fetch status, for example S_DNS_SUCCESS for a successful DNS lookup.
12 | RobotsHonoringPolicy | Robots policy. Represents the different policies for honoring robots.txt.
13 | Robotstxt | Robots protocol class, used to represent a site's robots.txt.
14 | ServerCache | Server cache, which mainly caches CrawlHost and CrawlServer objects.
15 | UriUniqFilter | Interface used to filter out URLs that have already been crawled.
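
To make the role of UriUniqFilter more concrete, here is a minimal sketch of the "filter out URLs that were already seen" idea. It is not Heritrix's actual interface: the class name SeenUrlFilterSketch and the method addIfNew are hypothetical names invented for this illustration, and a plain in-memory HashSet is assumed as the backing store.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; this is NOT the real UriUniqFilter interface.
final class SeenUrlFilterSketch {
    // Assumption: an in-memory set is enough for the example.
    private final Set<String> seen = new HashSet<>();

    // Returns true only the first time a given URL is offered.
    public boolean addIfNew(String url) {
        return seen.add(url);
    }

    // Number of distinct URLs offered so far.
    public int size() {
        return seen.size();
    }

    public static void main(String[] args) {
        SeenUrlFilterSketch filter = new SeenUrlFilterSketch();
        String[] discovered = {
            "http://www.xxx.com/",
            "http://www.xxx.com/a",
            "http://www.xxx.com/",   // duplicate of the first entry
        };
        for (String url : discovered) {
            String action = filter.addIfNew(url) ? "schedule" : "skip";
            System.out.println(url + " -> " + action);
        }
    }
}
```

A real crawler would normally normalize URLs before the check and would not keep every URL string in memory, but the contract is the same: a URL passes the filter once and is skipped afterwards.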

 

 

5. org.archive.crawler.datamodel.credential
No. | Class | Description
1 | Credential | Represents a credential; it reads its data from the order.xml configuration file.
2 | CredentialAvatar | Represents one concrete credential.
3 | HtmlFormCredential | Subclass of Credential, representing the credential needed to submit an HTML form.
4 | Rfc2617Credential | Subclass of Credential, representing an RFC 2617 HTTP authentication credential.
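
For readers who have not met RFC 2617, the sketch below shows what such a credential ends up as on the wire in the simplest case, the Basic scheme: the login and password are joined with a colon and Base64-encoded into an Authorization header. This only illustrates the RFC; it is not taken from Rfc2617Credential, and the class and method names (BasicAuthSketch, basicAuthHeader) are hypothetical. RFC 2617 also defines the Digest scheme, which is not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical illustration of the RFC 2617 Basic scheme; not Heritrix code.
final class BasicAuthSketch {
    // Builds the value of the Authorization header for Basic authentication.
    public static String basicAuthHeader(String login, String password) {
        String pair = login + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Login and password are the example pair used in RFC 2617 itself.
        // prints: Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
        System.out.println("Authorization: " + basicAuthHeader("Aladdin", "open sesame"));
    }
}
```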

 

 

6. org.archive.crawler.deciderules
No. | Class | Description
1 | AcceptDecideRule | URL rule that always accepts.
2 | ConfiguredDecideRule | URL rule whose decision, reject (REJECT) or accept (ACCEPT), is set through the configuration in order.xml.
3 | DecideRule | Parent class of the URL rules. It decides whether a URL is accepted (ACCEPT), rejected (REJECT), or passed over (PASS) through the decisionFor(Object object) method, which its subclasses implement.
4 | DecidingScope | Checks whether a URL is in scope in order to decide whether to accept, reject, or pass.
5 | MatchesRegExpDecideRule | Decides whether to accept, reject, or pass a URL by matching it against a configured regular expression.
6 | NotMatchesRegExpDecideRule | Subclass of MatchesRegExpDecideRule. It applies when the URL does not match the regular expression.
7 | PathologicalPathDecideRule | Rejects a URL if the same directory name repeats in it more often than the number set in the configuration file, for example if the number of "a" segments in http://www.xxx.com/a/a/a/a/a exceeds the limit.
8 | PrerequisiteAcceptDecideRule | Accepts a URL if it is a prerequisite URL, that is, if the pathFromSeed attribute of its CandidateURI contains P, which indicates that this URL must be fetched before another URL can be.
9 | RejectDecideRule | URL rule that always rejects.
10 | TooManyHopsDecideRule | Rejects a URL if its number of hops exceeds the max-hops value in the configuration file.
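
The three-valued ACCEPT / REJECT / PASS decision is easier to follow with a small example. The sketch below is a simplified stand-in, not the Heritrix classes: the Rule interface, its decisionFor(String url, int hops) signature, and the decide helper are all hypothetical. It assumes that rules are applied in order and that a later non-PASS decision overrides an earlier one, and it assumes that "repeats" means consecutive identical path segments.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-ins for the decide-rule idea; not Heritrix code.
final class DecideRuleSketch {
    enum Decision { ACCEPT, REJECT, PASS }

    interface Rule {
        Decision decisionFor(String url, int hops);
    }

    // Always accepts, like the role described for AcceptDecideRule.
    static Rule acceptAll() {
        return (url, hops) -> Decision.ACCEPT;
    }

    // Rejects when the hop count exceeds maxHops; otherwise has no opinion.
    static Rule tooManyHops(int maxHops) {
        return (url, hops) -> hops > maxHops ? Decision.REJECT : Decision.PASS;
    }

    // Rejects when one path segment repeats consecutively more than
    // maxRepetitions times; otherwise has no opinion.
    static Rule pathologicalPath(int maxRepetitions) {
        return (url, hops) -> {
            String path = URI.create(url).getPath();
            if (path == null) {
                return Decision.PASS; // opaque URI such as mailto:
            }
            String previous = null;
            int run = 0;
            for (String segment : path.split("/")) {
                if (segment.isEmpty()) {
                    continue;
                }
                run = segment.equals(previous) ? run + 1 : 1;
                if (run > maxRepetitions) {
                    return Decision.REJECT;
                }
                previous = segment;
            }
            return Decision.PASS;
        };
    }

    // Assumption: rules run in order and the last non-PASS decision wins.
    static Decision decide(List<Rule> rules, String url, int hops) {
        Decision result = Decision.PASS;
        for (Rule rule : rules) {
            Decision d = rule.decisionFor(url, hops);
            if (d != Decision.PASS) {
                result = d;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(acceptAll(), tooManyHops(20), pathologicalPath(3));
        System.out.println(decide(rules, "http://www.xxx.com/a/b", 1));
        System.out.println(decide(rules, "http://www.xxx.com/a/a/a/a/a", 1));
        System.out.println(decide(rules, "http://www.xxx.com/a/b", 25));
    }
}
```

With these rules, the first URL is accepted, the second is rejected for its repeated "a" segments, and the third is rejected for exceeding the hop limit. PASS lets a rule stay silent so that an earlier decision survives.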

 

 

7. org.archive.crawler.event
No. | Class | Description
1 | CrawlStatusListener | Crawl status listener, notified of state changes such as whether the crawler is running or paused.
2 | CrawlURIDispositionListener | URL disposition listener, notified of what happened to a URL, for example that fetching it failed and it needs to be fetched again.

 

 

8. org.archive.crawler.extractor
No. | Class | Description
1 | Extractor | Parent class of all extractor classes, used to extract new URLs from the content fetched for a URL.
2 | ExtractorCSS | Extracts new URLs from CSS.
3 | ExtractorDOC | Extracts new URLs from DOC documents.
4 | ExtractorHTML | Extracts new URLs from HTML. Heritrix core class.
5 | ExtractorHTTP | Extracts new URLs from HTTP.
6 | ExtractorJS | Extracts new URLs from JavaScript.
7 | ExtractorPDF | Extracts new URLs from PDF.
8 | ExtractorSWF | Extracts new URLs from SWF.
9 | ExtractorXML | Extracts new URLs from XML.
10 | HTTPContentDigest | Digest of the page content, in effect a fingerprint computed with the MD5 or SHA1 algorithm.
11 | Link | A link, representing an extracted URL.
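
As a rough illustration of what a content fingerprint is, the sketch below hashes a page body with MD5 or SHA-1 using the JDK's MessageDigest. It is not the HTTPContentDigest source: the class name ContentDigestSketch and the method digestHex are hypothetical, and the page body is assumed to be available as a byte array.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration of a content fingerprint; not Heritrix code.
final class ContentDigestSketch {
    // Returns the lowercase hex digest of content for "MD5" or "SHA-1".
    public static String digestHex(String algorithm, byte[] content) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("Unsupported algorithm: " + algorithm, e);
        }
        byte[] hash = md.digest(content);
        StringBuilder hex = new StringBuilder(hash.length * 2);
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        byte[] pageA = "<html>same body</html>".getBytes(StandardCharsets.UTF_8);
        byte[] pageB = "<html>same body</html>".getBytes(StandardCharsets.UTF_8);
        // Identical content yields identical fingerprints, whatever the URL was.
        System.out.println(digestHex("SHA-1", pageA).equals(digestHex("SHA-1", pageB)));
        System.out.println(digestHex("MD5", pageA));
    }
}
```

Two URLs that return byte-identical content get the same fingerprint, which is what makes a digest useful for spotting duplicate pages.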