| Serial number | Package name | Description |
|---|---|---|
| 1 | org.apache.commons.httpclient | Wrapper around Apache HttpClient, used to fetch web page content. |
| 2 | org.apache.commons.httpclient.cookie | Wrapper around Apache HttpClient, used to fetch web page content (cookie-related classes). |
| 3 | org.apache.commons.pool.impl | Object-pool implementation classes from Apache Commons Pool. |
| 4 | org.archive.crawler | Entry-point package for running Heritrix; for example, the Heritrix class can be run directly to start crawling. |
| 5 | org.archive.crawler.admin | Heritrix management package. For example, CrawlJob represents a crawl job, and CrawlJobHandler manages jobs and log statistics. |
| 6 | org.archive.crawler.admin.ui | Backs the management UI, for example job parameter settings. |
| 7 | org.archive.crawler.datamodel | Heritrix data model package; for example, CandidateURI represents a URL inside Heritrix. |
| 8 | org.archive.crawler.datamodel.credential | Credential management within the Heritrix data model; for example, some websites require a user name and password before they can be crawled. |
| 9 | org.archive.crawler.deciderules | Heritrix decide rules, for example rules that determine which URLs may be crawled and scheduled. |
| 10 | org.archive.crawler.deciderules.recrawl | Decide rules for which URLs need to be crawled again. |
| 11 | org.archive.crawler.event | Event management, such as pausing, restarting, and stopping Heritrix. |
| 12 | org.archive.crawler.extractor | Heritrix's link extractors: they extract new URLs from fetched content so that these can be crawled in turn. |
| 13 | org.archive.crawler.fetcher | Heritrix fetcher package, for example for HTTP, DNS, and FTP data. |
| 14 | org.archive.crawler.filter | Heritrix filters, for example rule-based filtering of unwanted URLs. |
| 15 | org.archive.crawler.framework | Heritrix framework package. It holds core classes, most of them parent classes, such as the crawl control class CrawlController and the scheduler class Frontier. |
| 16 | org.archive.crawler.framework.exceptions | Heritrix framework exception package. An exception thrown from here generally causes Heritrix to stop. |
| 17 | org.archive.crawler.frontier | Heritrix's scheduler, which decides which URL to crawl next. |
| 18 | org.archive.crawler.io | Heritrix I/O format package (the name is arguably a poor fit). It only defines formats, such as the format of statistics and of error logs. |
| 19 | org.archive.crawler.postprocessor | Auxiliary processor package (the name is arguably a poor fit). It only handles URLs before and after processing, for example URL redirects. |
| 20 | org.archive.crawler.prefetch | Heritrix pre-processor package, for example checking whether a URL has already been resolved by DNS. |
| 21 | org.archive.crawler.processor | Not yet used; to be studied. |
| 22 | org.archive.crawler.processor.recrawl | Not yet used; to be studied. |
| 23 | org.archive.crawler.scope | Heritrix crawl scope management, for example seeds. |
| 24 | org.archive.crawler.selftest | Manages Heritrix's self-test web project (self.war). |
| 25 | org.archive.crawler.settings | Manages the settings in order.xml, the Heritrix configuration file. |
| 26 | org.archive.crawler.settings.refinements | Manages Heritrix's own data format conventions, such as the time format. |
| 27 | org.archive.crawler.url | Not yet used; to be studied. |
| 28 | org.archive.crawler.url.canonicalize | Heritrix URL canonicalization, used to normalize each URL. |
| 29 | org.archive.crawler.util | Heritrix utilities used during crawling, such as BDB and I/O helpers. |
| 30 | org.archive.crawler.writer | Heritrix writer package, used to write fetched URL content to disk. |
| 31 | org.archive.extractor | Not yet used; to be studied. |
| 32 | org.archive.httpclient | Heritrix's customizations of HttpClient, intended to improve how web page content is fetched. |
| 33 | org.archive.io | Heritrix I/O package: I/O classes that Heritrix wraps itself. |
| 34 | org.archive.io.arc | I/O package for the ARC format. |
| 35 | org.archive.io.warc | I/O package for the WARC format. |
| 36 | org.archive.net | Heritrix's extensions to java.net, mainly to the java.net.URI class. |
| 37 | org.archive.net.md5 | MD5 hashing package for URLs; Heritrix makes little use of it. |
| 38 | org.archive.net.rsync | Not yet used; to be studied. |
| 39 | org.archive.net.s3 | Not yet used; to be studied. |
| 40 | org.archive.queue | Not yet used; to be studied. |
| 41 | org.archive.uid | Heritrix ID management, mainly for URIs. |
| 42 | org.archive.util | General-purpose utility classes for Heritrix as a whole. |
| 43 | org.archive.util.anvl | Not yet used; to be studied. |
| 44 | org.archive.util.bdbje | Heritrix's wrapper around BDB. |
| 45 | org.archive.util.fingerprint | Not yet used; to be studied. |
| 46 | org.archive.util.iterator | Iterators implemented by Heritrix itself. |
| 47 | org.archive.util.ms | Not yet used; to be studied. |
| 48 | st.ata.util | Other extension package; to be studied. |
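
To make the division of labour in the table easier to picture, here is a minimal toy sketch in Java of how a scheduler (frontier), a scope rule, a fetcher, an extractor, a writer, and URL canonicalization might cooperate in one crawl loop. It is an illustration under stated assumptions, not Heritrix code: the class MiniCrawl and all of its methods are hypothetical names, the "web" is an in-memory map rather than real HTTP, and canonicalization is reduced to stripping the fragment.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical toy crawler; none of these names are Heritrix classes. */
public class MiniCrawl {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    /** Canonicalization stand-in: only drops the "#fragment" part (a simplification). */
    public static String canonicalize(String url) {
        int hash = url.indexOf('#');
        return hash >= 0 ? url.substring(0, hash) : url;
    }

    /** Scope / decide-rule stand-in: accept only URLs under the given prefix. */
    public static boolean inScope(String url, String prefix) {
        return url.startsWith(prefix);
    }

    /** Extractor stand-in: pull href targets out of fetched content. */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /** Runs the loop over an in-memory "web" (URL to content); returns URLs in the order written. */
    public static List<String> crawl(String seed, String scopePrefix, Map<String, String> web) {
        Deque<String> frontier = new ArrayDeque<>();   // scheduler stand-in (FIFO)
        Set<String> seen = new HashSet<>();            // already-scheduled URLs
        List<String> written = new ArrayList<>();      // writer stand-in
        String first = canonicalize(seed);
        frontier.add(first);
        seen.add(first);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            String content = web.get(url);             // fetcher stand-in
            if (content == null) {
                continue;                              // nothing fetched, nothing to write
            }
            written.add(url);
            for (String link : extractLinks(content)) {
                String c = canonicalize(link);
                // Schedule only in-scope URLs that have not been seen before.
                if (inScope(c, scopePrefix) && seen.add(c)) {
                    frontier.add(c);
                }
            }
        }
        return written;
    }

    public static void main(String[] args) {
        Map<String, String> web = Map.of(
            "http://a.test/", "<a href=\"http://a.test/x#top\">x</a> <a href=\"http://b.test/\">b</a>",
            "http://a.test/x", "<a href=\"http://a.test/\">home</a>");
        System.out.println(crawl("http://a.test/", "http://a.test/", web));
    }
}
```

In this sketch the out-of-scope link to b.test is never scheduled, and the link back to the seed is dropped because it has already been seen. A real crawler would add politeness delays, per-host queues, and persistent state, which the toy version leaves out on purpose.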