Configuring Heritrix in Eclipse

Source: Internet
Author: User
Tags: tld list


I. Creating a new project and importing the Heritrix source code

1. Download the two archives heritrix-1.14.4-src.zip and heritrix-1.14.4.zip and extract them; below they are referred to as the src package and the zip package respectively.
2. In Eclipse, create a new Java project named heritrix.1.14.4.
3. Copy the org and st folders under src/java in the src package into the project's src folder.
4. Copy the conf folder from the src package into the project root directory.
5. Copy the lib folder from the src package into the project root directory.
6. Copy the webapps folder from the zip package into the project root directory.



7. Modify the conf/heritrix.properties file in the project:

    heritrix.version = 1.14.4
    # location of the heritrix jobs directory.
    heritrix.jobsdir = jobs
    # Default commandline startup values.
    # Below values are used if unspecified on the command line.
    heritrix.cmdline.admin = admin:admin
    heritrix.cmdline.port = 8080

The main changes are the version number, the user name and password, and the port number.

8. Right-click the project and choose Build Path -> Configure Build Path -> Add JARs, select all the .jar files under the lib directory, and click OK.
9. Right-click Heritrix.java in the project's src/org.archive.crawler package, choose Run As -> Run Configurations -> Classpath -> User Entries -> Advanced -> Add Folders, select the conf folder under the project, and then click Run. (A command-line equivalent is sketched below.)
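If you prefer to start Heritrix outside Eclipse, the equivalent is to put the compiled classes, the conf folder, and every jar under lib on the classpath and launch the org.archive.crawler.Heritrix main class. A minimal sketch, assuming the Eclipse output folder is bin and a Java 6+ JVM (which understands the lib/* classpath wildcard); on Windows replace ":" with ";":

    java -Xmx512m -cp "bin:conf:lib/*" org.archive.crawler.Heritrix

The admin credentials and port are then read from conf/heritrix.properties as configured in step 7.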


You can then log in to the Web UI at http://127.0.0.1:8080/.



II. Configuring the crawl job and starting the download

1. Log in to the system with admin/admin.


2. Click Jobs -> Create new job -> With defaults.


Each time a new job is created, a new order.xml is created along with it. In Heritrix, every job corresponds to an order.xml that describes the job's properties: it specifies the processor classes used by the job, the Frontier class, the Fetcher classes, the maximum number of crawl threads, the longest timeout, and so on.
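For orientation, the overall shape of an order.xml is roughly as follows. This is a hand-abbreviated sketch of the Heritrix 1.x format, not a complete or valid file; the real file contains many more settings generated by the Web UI:

    <crawl-order>
      <meta>
        <name>myjob</name>            <!-- job name, example value -->
        <description>...</description>
      </meta>
      <controller>
        <map name="http-headers"> ... </map>
        <newObject name="scope"
                   class="org.archive.crawler.deciderules.DecidingScope"> ... </newObject>
        <newObject name="frontier"
                   class="org.archive.crawler.frontier.BdbFrontier"> ... </newObject>
        <map name="pre-fetch-processors"> ... </map>
        <map name="fetch-processors"> ... </map>
        <map name="extract-processors"> ... </map>
        <map name="write-processors"> ... </map>
        <map name="post-processors"> ... </map>
      </controller>
    </crawl-order>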



3. Enter the basic information. Note that the seed URLs must end with a "/".



4. Select "Modules" below to enter the module configuration page (Heritrix's extension points are implemented through the module concept; you can write your own modules to add the functionality you need). For the first item, "Select Crawl Scope", keep the default org.archive.crawler.deciderules.DecidingScope. For the second-to-last item, "Select Writers", remove the default org.archive.crawler.writer.ARCWriterProcessor and add org.archive.crawler.writer.MirrorWriterProcessor, so that the pages crawled while the job runs are mirrored into a local directory structure instead of being packed into ARC archive files (see the sketch below).
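In the generated order.xml this choice shows up in the write-processors chain. A rough sketch of what the map looks like after the change, assuming the processor is registered under a name such as "MirrorWriter" (the exact name and nested attributes are produced by the Web UI and may differ):

    <map name="write-processors">
      <newObject name="MirrorWriter"
                 class="org.archive.crawler.writer.MirrorWriterProcessor">
        ...
      </newObject>
    </map>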




5, select "Modules" to the right of "submodules", in the first content "Crawl-order->scope->decide-rules->rules" Delete the " Acceptiftranscluded "(org.archive.crawler.deciderules.TransclusionDecideRule) is a rule that fetches scopes. Otherwise, when the HTTP request returns 301 or 302, Heritrix will crawl the page under the other domain.

6, select "Settings" to enter the job configuration page in Wui's second line navigation bar, which mainly modifies two items: Http-headers user-agent and from, their "project_url_here" and "contact_ Email_address_here "Replace with your own content (" Project_url_here "to start with" http://")

7. Select "Submit job" at the far right of the second row of the WUI navigation bar.

8. In the first row of the WUI navigation bar, select the first item, "Console", and click "Start"; the crawl job then officially begins. How long it takes depends on network conditions and the depth of the crawled site. Click "Refresh" to monitor the download progress.



You can also click "Logs" and the other tabs to inspect the crawl logs.

9. By default, the files are downloaded to the jobs folder under the project directory (a sketch of the layout follows).
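With MirrorWriterProcessor selected, the downloaded pages land in a mirror directory inside the job's folder, organized by host name. A rough illustration of the layout; the job folder name and host are invented examples:

    jobs/
      myjob-20100101000000/
        order.xml
        seeds.txt
        logs/
        mirror/
          www.example.com/
            index.html
            ...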


III. Notes

1. After the project is created, Heritrix reports an error on sun.net.www.protocol.file.FileURLConnection. This happens because the sun.* packages are restricted packages that, by default, only Sun's own software is supposed to use; Eclipse flags such access as an error, and it should be downgraded to a warning.

The steps are as follows: Window -> Preferences -> Java -> Compiler -> Errors/Warnings -> Deprecated and restricted API -> Forbidden reference (access rules): change to Warning.



2. On the module configuration page, you may find that every configuration entry can be deleted or moved but nothing can be added or modified, and no drop-down box of options appears. This happens because the configuration files cannot be found; add the path to the configuration folder on the Classpath tab of the run configuration.

That is step 9 of part I.

3. Problem: Thread-10 org.archive.util.ArchiveUtils.<clinit>() TLD list unavailable
java.lang.NullPointerException
    at java.io.Reader.<init>(Unknown Source)
    at java.io.InputStreamReader.<init>(Unknown Source)
    at org.archive.util.ArchiveUtils.<clinit>(ArchiveUtils.java:759)
Solution: copy the tlds-alpha-by-domain.txt file from src/resources/org/archive/util in heritrix-1.14.4-src.zip into the org.archive.util package of the project.
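The static initializer of ArchiveUtils loads this file as a classpath resource, so the error simply means the resource is missing. A quick, hypothetical check you can run from a scratch class to confirm the file is now visible on the classpath:

    import java.io.InputStream;

    public class TldListCheck {
        public static void main(String[] args) throws Exception {
            // A relative resource name resolves against the org.archive.util package.
            InputStream in = org.archive.util.ArchiveUtils.class
                    .getResourceAsStream("tlds-alpha-by-domain.txt");
            System.out.println(in != null ? "TLD list found on classpath"
                                          : "TLD list still missing");
            if (in != null) {
                in.close();
            }
        }
    }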


IV. Configuration items on the Modules page

On the Modules page, a total of 8 options need to be configured, as follows:

1. Crawl Scope

Configures the crawl scope; see the drop-down box for the available options.

The names indicate the scope fairly intuitively; the default is BroadScope, i.e., an unrestricted scope.

When you select an item from the drop-down box and click the Change button, the explanation above the drop-down box updates accordingly, describing the characteristics of the current option.



2. URI Frontier

Determines the order in which URIs are crawled, i.e., the scheduling algorithm. The default is BdbFrontier.


3. Pre Processors: processors that should run before any fetching

Before fetching, these processors check certain preconditions, such as evaluating robots.txt; they form the entry point of the whole processor chain.


4. Fetchers: processors that fetch documents using various protocols

Specify which types of documents are fetched and interpreted.


5. Extractors: processors that extract links from URIs

Used to extract links and other information from the documents that have been fetched.


6. Writers: processors that write documents to archive files

Choose how downloads are saved; two are commonly used:

org.archive.crawler.writer.MirrorWriterProcessor: saves a mirror copy, downloading the files directly into a local directory tree.

org.archive.crawler.writer.ARCWriterProcessor: saves the downloads in the ARC archive format, so the files cannot be viewed directly. This is the default option.


7. Post Processors: processors that do cleanup and feed the Frontier with new URIs

The finishing-up work after fetching, which also feeds newly discovered URIs back to the Frontier.


8. Statistics Tracking

Used to track statistics about the crawl.






