Configuring Heritrix in Eclipse

Source: Internet
Author: User
Tags: tld list


I. Creating a new project and importing the Heritrix source code

1. Download the two archives heritrix-1.14.4-src.zip and heritrix-1.14.4.zip and extract them; below they are referred to as the src package and the zip package respectively.
2. In Eclipse, create a new Java project named heritrix.1.14.4.
3. Copy the org and st folders under src/java in the src package into the project's src folder.
4. Copy the conf folder from the src package into the project root directory.
5. Copy the lib folder from the src package into the project root directory.
6. Copy the webapps folder from the zip package into the project root directory.



7. Modify the conf/heritrix.properties file in the project:

    heritrix.version = 1.14.4
    # location of the heritrix jobs directory.
    heritrix.jobsdir = jobs
    # Default commandline startup values.
    # Below values are used if unspecified on the command line.
    heritrix.cmdline.admin = admin:admin
    heritrix.cmdline.port = 8080

The main changes are the version number, the user name and password, and the port number.

8. Right-click the project and choose Build Path -> Configure Build Path -> Add JARs, select all the .jar files under the lib directory, and click OK.
9. Right-click Heritrix.java in the project's src/org.archive.crawler package, choose Run As -> Run Configurations -> Classpath -> User Entries -> Advanced -> Add Folders, select the conf folder under the project, and then click Run. (A command-line equivalent is sketched below.)
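If you prefer to start Heritrix outside Eclipse, the equivalent is to put the compiled classes, the conf folder, and every jar under lib on the classpath and launch the org.archive.crawler.Heritrix main class. A minimal sketch, assuming the Eclipse output folder is bin and a Java 6+ JVM (which understands the lib/* classpath wildcard); on Windows replace ":" with ";":

    java -Xmx512m -cp "bin:conf:lib/*" org.archive.crawler.Heritrix

The admin credentials and port are then read from conf/heritrix.properties as configured in step 7.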


You can then log in to the Web UI at http://127.0.0.1:8080/.



II. Configuring the crawl job and starting the download

1. Log in to the system with admin/admin.


2. Click Jobs -> Create new job -> With defaults.


Each time a new job is created, a new order.xml is created along with it. In Heritrix, every job corresponds to an order.xml that describes the job's properties: it specifies the processor classes used by the job, the Frontier class, the Fetcher classes, the maximum number of crawl threads, the longest timeout, and so on.
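For orientation, the overall shape of an order.xml is roughly as follows. This is a hand-abbreviated sketch of the Heritrix 1.x format, not a complete or valid file; the real file contains many more settings generated by the Web UI:

    <crawl-order>
      <meta>
        <name>myjob</name>            <!-- job name, example value -->
        <description>...</description>
      </meta>
      <controller>
        <map name="http-headers"> ... </map>
        <newObject name="scope"
                   class="org.archive.crawler.deciderules.DecidingScope"> ... </newObject>
        <newObject name="frontier"
                   class="org.archive.crawler.frontier.BdbFrontier"> ... </newObject>
        <map name="pre-fetch-processors"> ... </map>
        <map name="fetch-processors"> ... </map>
        <map name="extract-processors"> ... </map>
        <map name="write-processors"> ... </map>
        <map name="post-processors"> ... </map>
      </controller>
    </crawl-order>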



3. Enter the basic information. Note that the seed URLs must end with a "/".



4. Select "Modules" below to enter the module configuration page (Heritrix's extension points are implemented through the module concept; you can write your own modules to add the functionality you need). For the first item, "Select Crawl Scope", keep the default org.archive.crawler.deciderules.DecidingScope. For the second-to-last item, "Select Writers", remove the default org.archive.crawler.writer.ARCWriterProcessor and add org.archive.crawler.writer.MirrorWriterProcessor, so that the pages crawled while the job runs are mirrored into a local directory structure instead of being packed into ARC archive files (see the sketch below).
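In the generated order.xml this choice shows up in the write-processors chain. A rough sketch of what the map looks like after the change, assuming the processor is registered under a name such as "MirrorWriter" (the exact name and nested attributes are produced by the Web UI and may differ):

    <map name="write-processors">
      <newObject name="MirrorWriter"
                 class="org.archive.crawler.writer.MirrorWriterProcessor">
        ...
      </newObject>
    </map>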




5, select "Modules" to the right of "submodules", in the first content "Crawl-order->scope->decide-rules->rules" Delete the " Acceptiftranscluded "(org.archive.crawler.deciderules.TransclusionDecideRule) is a rule that fetches scopes. Otherwise, when the HTTP request returns 301 or 302, Heritrix will crawl the page under the other domain.

6, select "Settings" to enter the job configuration page in Wui's second line navigation bar, which mainly modifies two items: Http-headers user-agent and from, their "project_url_here" and "contact_ Email_address_here "Replace with your own content (" Project_url_here "to start with" http://")

7. Select "Submit job" at the far right of the second row of the WUI navigation bar.

8. In the first row of the WUI navigation bar, select the first item, "Console", and click "Start"; the crawl job then officially begins. How long it takes depends on network conditions and the depth of the crawled site. Click "Refresh" to monitor the download progress.



You can also click "Logs" and the other tabs to inspect the crawl logs.

9. By default, the files are downloaded to the jobs folder under the project directory (a sketch of the layout follows).
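With MirrorWriterProcessor selected, the downloaded pages land in a mirror directory inside the job's folder, organized by host name. A rough illustration of the layout; the job folder name and host are invented examples:

    jobs/
      myjob-20100101000000/
        order.xml
        seeds.txt
        logs/
        mirror/
          www.example.com/
            index.html
            ...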


III. Notes

1. After the project is created, Heritrix reports an error on sun.net.www.protocol.file.FileURLConnection. This happens because the sun.* packages are restricted packages that, by default, only Sun's own software is supposed to use; Eclipse flags such access as an error, and it should be downgraded to a warning.

The steps are as follows: Window -> Preferences -> Java -> Compiler -> Errors/Warnings -> Deprecated and restricted API -> Forbidden reference (access rules): change to Warning.



2. On the module configuration page, you may find that every configuration entry can be deleted or moved but nothing can be added or modified, and no drop-down box of options appears. This happens because the configuration files cannot be found; add the path to the configuration folder on the Classpath tab of the run configuration.

That is step 9 of part I.

3. Problem: Thread-10 org.archive.util.ArchiveUtils.<clinit>() TLD list unavailable
java.lang.NullPointerException
    at java.io.Reader.<init>(Unknown Source)
    at java.io.InputStreamReader.<init>(Unknown Source)
    at org.archive.util.ArchiveUtils.<clinit>(ArchiveUtils.java:759)
Solution: copy the tlds-alpha-by-domain.txt file from src/resources/org/archive/util in heritrix-1.14.4-src.zip into the org.archive.util package of the project.
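The static initializer of ArchiveUtils loads this file as a classpath resource, so the error simply means the resource is missing. A quick, hypothetical check you can run from a scratch class to confirm the file is now visible on the classpath:

    import java.io.InputStream;

    public class TldListCheck {
        public static void main(String[] args) throws Exception {
            // A relative resource name resolves against the org.archive.util package.
            InputStream in = org.archive.util.ArchiveUtils.class
                    .getResourceAsStream("tlds-alpha-by-domain.txt");
            System.out.println(in != null ? "TLD list found on classpath"
                                          : "TLD list still missing");
            if (in != null) {
                in.close();
            }
        }
    }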


IV. Configuration items on the Modules page

On the Modules page, a total of 8 options need to be configured, as follows:

1. Crawl Scope

Configures the crawl scope; see the drop-down box for the available options.

The names indicate the scope fairly intuitively; the default is BroadScope, i.e., an unrestricted scope.

When you select an item from the drop-down box and click the Change button, the explanation above the drop-down box updates accordingly, describing the characteristics of the current option.



2. URI Frontier

Determines the order in which URIs are crawled, i.e., the scheduling algorithm. The default is BdbFrontier.


3. Pre Processors: processors that should run before any fetching

Before fetching, these processors check certain preconditions, such as evaluating robots.txt; they form the entry point of the whole processor chain.


4. Fetchers: processors that fetch documents using various protocols

Specify which types of documents are fetched and interpreted.


5. Extractors: processors that extract links from URIs

Used to extract links and other information from the documents that have been fetched.


6. Writers: processors that write documents to archive files

Choose how downloads are saved; two are commonly used:

org.archive.crawler.writer.MirrorWriterProcessor: saves a mirror copy, downloading the files directly into a local directory tree.

org.archive.crawler.writer.ARCWriterProcessor: saves the downloads in the ARC archive format, so the files cannot be viewed directly. This is the default option.


7. Post Processors: processors that do cleanup and feed the Frontier with new URIs

The finishing-up work after fetching, which also feeds newly discovered URIs back to the Frontier.


8. Statistics Tracking

Used to track statistics about the crawl.






