Web Crawler Heritrix Source Code Analysis (I): Package Introduction

Source: Internet
Author: User

Welcome to join the Heritrix QQ group (10447185) and the Lucene/Solr QQ group (118972724).

I have long said that I would share my crawler experience, but I never found a good place to start, and now that I try, I find that writing things down is genuinely hard. So I am truly grateful to the selfless predecessors whose articles on the Internet have given the rest of us some guidance.
After thinking it over for a long time, I decided to start with Heritrix's packages, then cover its classes, and finally discuss how to customize Heritrix, that is, how to turn it into the crawler we want. One note: the version I use is 1.14.3.

| No. | Package | Description |
|---|---|---|
| 1 | org.apache.commons.httpclient | Heritrix's wrapper around Apache HttpClient, used to fetch web page content |
| 2 | org.apache.commons.httpclient.cookie | Cookie handling for the wrapped Apache HttpClient, used while fetching web page content |
| 3 | org.apache.commons.pool.impl | Apache Commons Pool (object pooling) implementation classes bundled with Heritrix |
| 4 | org.archive.crawler | Entry-point package of the Heritrix program; for example, the Heritrix class can be run directly to start crawling |
| 5 | org.archive.crawler.admin | Heritrix management package. For example, CrawlJob represents a crawl job, and CrawlJobHandler manages jobs; logging and statistics also live here |
| 6 | org.archive.crawler.admin.ui | Supports the management UI, for example job parameter settings |
| 7 | org.archive.crawler.datamodel | Heritrix data model package, for example CandidateURI, which represents a URL inside Heritrix |
| 8 | org.archive.crawler.datamodel.credential | Credential management in the Heritrix data model, for example the user name and password needed to crawl some websites |
| 9 | org.archive.crawler.deciderules | Heritrix decide rules, for example deciding which URLs may be crawled and scheduled |
| 10 | org.archive.crawler.deciderules.recrawl | Decides which URLs need to be crawled again |
| 11 | org.archive.crawler.event | Event management, for example pausing, restarting, and stopping Heritrix |
| 12 | org.archive.crawler.extractor | Heritrix's extractors, the crawl's "blood supply": they extract new URLs from fetched content so that these can be crawled in turn |
| 13 | org.archive.crawler.fetcher | Heritrix fetchers, for example for HTTP, DNS, and FTP data |
| 14 | org.archive.crawler.filter | Heritrix filters, for example rules that filter out unwanted URLs |
| 15 | org.archive.crawler.framework | Heritrix framework package. It holds the core classes, most of them parent classes, such as the controller class CrawlController and the scheduler class Frontier |
| 16 | org.archive.crawler.framework.exceptions | Heritrix framework exceptions. An exception thrown from here generally causes Heritrix to stop |
| 17 | org.archive.crawler.frontier | Heritrix's scheduler, which decides which URL to crawl next |
| 18 | org.archive.crawler.io | Heritrix's I/O format package. The name is arguably a poor fit, because it only defines formats, such as the statistics format and the error-log format |
| 19 | org.archive.crawler.postprocessor | Post-processor package. The name is arguably a poor fit, because it only handles URLs before and after processing, for example URL redirection |
| 20 | org.archive.crawler.prefetch | Heritrix pre-processors, for example checking whether a URL has been resolved through DNS |
| 21 | org.archive.crawler.processor | Not used yet; to be studied |
| 22 | org.archive.crawler.processor.recrawl | Not used yet; to be studied |
| 23 | org.archive.crawler.scope | Heritrix crawl scope management, for example seeds |
| 24 | org.archive.crawler.selftest | Manages Heritrix's own self-test web project (self.war) |
| 25 | org.archive.crawler.settings | Manages the settings in Heritrix's configuration file order.xml |
| 26 | org.archive.crawler.settings.refinements | Refinements of the settings: criteria-based overrides, for example a time span during which different settings apply |
| 27 | org.archive.crawler.url | Not used yet; to be studied |
| 28 | org.archive.crawler.url.canonicalize | Heritrix's URL canonicalization, used to normalize every URL |
| 29 | org.archive.crawler.util | Utility classes used by the crawler, for example for BDB and I/O |
| 30 | org.archive.crawler.writer | Heritrix writers, which save fetched content to disk |
| 31 | org.archive.extractor | Not used yet; to be studied |
| 32 | org.archive.httpclient | Heritrix's customizations of HttpClient, intended to fetch page content more reliably |
| 33 | org.archive.io | Heritrix I/O package: I/O helper classes that Heritrix wraps itself |
| 34 | org.archive.io.arc | I/O classes for the ARC format |
| 35 | org.archive.io.warc | I/O classes for the WARC format |
| 36 | org.archive.net | Heritrix's extensions to java.net, mainly around the java.net.URI class |
| 37 | org.archive.net.md5 | MD5 (a hash, not encryption) support for URLs; Heritrix makes little use of it |
| 38 | org.archive.net.rsync | Not used yet; to be studied |
| 39 | org.archive.net.s3 | Not used yet; to be studied |
| 40 | org.archive.queue | Not used yet; to be studied |
| 41 | org.archive.uid | Heritrix ID management, mainly for URIs |
| 42 | org.archive.util | General utility classes for all of Heritrix |
| 43 | org.archive.util.anvl | Not used yet; to be studied |
| 44 | org.archive.util.bdbje | Heritrix's wrappers around BDB (Berkeley DB Java Edition) |
| 45 | org.archive.util.fingerprint | Not used yet; to be studied |
| 46 | org.archive.util.iterator | Iterators that Heritrix wraps itself |
| 47 | org.archive.util.ms | Not used yet; to be studied |
| 48 | st.ata.util | Other extension package; to be studied |
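
To make the division of labor among the main packages more concrete, here is a minimal sketch in plain Java. It does not use Heritrix at all: every class and method name in it (MiniCrawlSketch, canonicalize, inScope, extractLinks, Frontier, crawl) is hypothetical and only mirrors the role of a package from the table, and the "web" is an in-memory map so that the example runs without a network. The real Heritrix classes are far more elaborate.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy model of the package roles; none of these names are Heritrix classes. */
final class MiniCrawlSketch {

    /** Role of url.canonicalize: normalize absolute URLs so duplicates compare equal. */
    static String canonicalize(String url) {
        URI u = URI.create(url.trim()).normalize();        // also resolves "/a/../b"
        String scheme = u.getScheme().toLowerCase(Locale.ROOT);
        String host = u.getHost().toLowerCase(Locale.ROOT);
        String port = u.getPort() == -1 ? "" : ":" + u.getPort();
        String rawPath = u.getRawPath();
        String path = (rawPath == null || rawPath.isEmpty()) ? "/" : rawPath;
        String query = u.getRawQuery() == null ? "" : "?" + u.getRawQuery();
        return scheme + "://" + host + port + path + query; // fragment is dropped
    }

    /** Role of deciderules/scope: may this (already canonical) URL be scheduled? */
    static boolean inScope(String canonicalUrl, String allowedHost) {
        return allowedHost.equals(URI.create(canonicalUrl).getHost());
    }

    private static final Pattern HREF =
            Pattern.compile("href=\"(http[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Role of extractor: pull absolute links out of fetched content. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /** Role of frontier: remember what was seen and hand out the next URL. */
    static final class Frontier {
        private final Queue<String> pending = new ArrayDeque<>();
        private final Set<String> seen = new HashSet<>();

        boolean schedule(String url) {
            String canonical = canonicalize(url);
            if (!seen.add(canonical)) {
                return false;                               // already scheduled once
            }
            pending.add(canonical);
            return true;
        }

        String next() {
            return pending.poll();                          // null when empty
        }
    }

    /** Role of framework: drive fetch -> write -> extract -> schedule. */
    static List<String> crawl(String seed, Map<String, String> web) {
        Frontier frontier = new Frontier();
        frontier.schedule(seed);
        String allowedHost = URI.create(canonicalize(seed)).getHost();
        List<String> written = new ArrayList<>();
        String url;
        while ((url = frontier.next()) != null) {
            String body = web.get(url);                     // stand-in for fetcher
            if (body == null) {
                continue;                                   // "fetch" failed
            }
            written.add(url);                               // stand-in for writer
            for (String link : extractLinks(body)) {
                if (inScope(canonicalize(link), allowedHost)) {
                    frontier.schedule(link);
                }
            }
        }
        return written;
    }

    public static void main(String[] args) {
        Map<String, String> web = new HashMap<>();
        web.put("http://example.com/",
                "<a href=\"http://example.com/a\">a</a> <a href=\"http://other.org/x\">x</a>");
        web.put("http://example.com/a", "<a href=\"HTTP://EXAMPLE.com/#top\">home</a>");
        System.out.println(crawl("http://example.com", web));
    }
}
```

Running main prints `[http://example.com/, http://example.com/a]`: the link to other.org is rejected by the scope check, and the upper-case link back to the home page is recognized as a duplicate after canonicalization. Keeping each role in its own method mirrors why Heritrix splits these concerns into separate packages.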

Heritrix has more than 48 packages of its own, and it imports more than 30 third-party packages. The complexity is obvious.
