"Heritrix Basic Tutorial 1" In Eclipse configuration Heritrix

Source: Internet
Author: User
Tags: TLD list


I. Create a new project and import the Heritrix source code

1. Download the heritrix-1.14.4-src.zip and heritrix-1.14.4.zip archives and extract both; below they are referred to as the src package and the zip package respectively.
2. Create a new Java project in Eclipse named heritrix.1.14.4.
3. Copy the org and st directories under src/java in the src package into the project's src folder.
4. Copy the conf folder from the src package to the project root.
5. Copy the lib folder from the src package to the project root.
6. Copy the webapps folder from the zip package to the project root.



7. Edit the conf/heritrix.properties file in the project:

heritrix.version = 1.14.4
# Location of the Heritrix jobs directory.
heritrix.jobsdir = jobs
# Default commandline startup values.
# Below values are used if unspecified on the command line.
heritrix.cmdline.admin = admin:admin
heritrix.cmdline.port = 8080
The main changes are the version, the username and password, and the port number.
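As a side note, the keys edited above are standard java.util.Properties entries. The following sketch (not Heritrix code; the inlined values are just the ones set in step 7) shows how such a file can be read and how the admin value splits into username and password:

```java
import java.io.StringReader;
import java.util.Properties;

public class HeritrixProps {
    // Inline copy of the edited heritrix.properties values (illustrative only).
    static final String CONF =
        "heritrix.version = 1.14.4\n"
      + "heritrix.jobsdir = jobs\n"
      + "heritrix.cmdline.admin = admin:admin\n"
      + "heritrix.cmdline.port = 8080\n";

    // Load the properties from the inline string instead of a file on disk.
    public static Properties load() throws Exception {
        Properties p = new Properties();
        p.load(new StringReader(CONF));
        return p;
    }

    public static void main(String[] args) throws Exception {
        Properties p = load();
        // The admin value has the form "user:password".
        String[] cred = p.getProperty("heritrix.cmdline.admin").split(":");
        System.out.println("user=" + cred[0]
            + " port=" + p.getProperty("heritrix.cmdline.port"));
    }
}
```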

8. Right-click the project and choose Build Path -> Configure Build Path. On the Libraries tab, click Add JARs, select all the .jar files in the lib folder, and click Finish.
9. Right-click Heritrix.java in the project's src/org.archive.crawler package and select Run As -> Run Configurations -> Classpath -> User Entries -> Advanced -> Add Folders, choose the project's conf directory, then click Run.


You can then log in to the console at http://127.0.0.1:8080/.



Running Heritrix on Linux

(1) export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:/home/jediael/heritirx1.14.4/lib/*:/home/jediael/heritirx1.14.4/bin/*:/home/jediael/heritirx1.14.4/conf/*
(2) cp -r webapps/ bin/
(3) cd bin
(4) java org.archive.crawler.Heritrix

Note that with this method the console can only be accessed from the local machine via 127.0.0.1, because the code reads:

final String rootUri = "127.0.0.1:" + Integer.toString(port);
String selfTestUrl = "http://" + rootUri + '/';

That is, it is intended only for local testing.
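To illustrate why this is local-only, here is a minimal sketch (class and method names are my own, not Heritrix's) that rebuilds the same URL string and confirms that 127.0.0.1 is a loopback address, i.e. never reachable from another machine:

```java
import java.net.InetAddress;

public class SelfTestUrl {
    // Rebuilds the root URI exactly as the quoted Heritrix code does.
    public static String build(int port) {
        final String rootUri = "127.0.0.1:" + Integer.toString(port);
        return "http://" + rootUri + '/';
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build(8080)); // http://127.0.0.1:8080/
        // Loopback addresses are only routable on the local host.
        System.out.println(InetAddress.getByName("127.0.0.1").isLoopbackAddress()); // true
    }
}
```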

II. Configure a crawl job and start downloading

1. Log in to the system with admin/admin.


2. Click Jobs -> Create new job -> With defaults.


Each time a new job is created, a new order.xml is generated. In Heritrix, each job corresponds to an order.xml file that describes the job's properties: it specifies the processor classes, the Frontier class, the Fetcher classes, the maximum number of crawl threads, the longest timeout, and so on.
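Since order.xml is plain XML, its values can be inspected with the standard DOM API. The fragment below is a hypothetical, heavily simplified stand-in for order.xml (the real Heritrix schema is much more deeply nested); it only shows the idea of reading one job property, the maximum number of toe threads:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class OrderXmlSketch {
    // Hypothetical fragment, not the real order.xml layout.
    static final String XML =
        "<crawl-order><max-toe-threads>50</max-toe-threads></crawl-order>";

    // Parse the fragment and return the thread-count setting as text.
    public static String maxThreads() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(XML.getBytes("UTF-8")));
        return doc.getElementsByTagName("max-toe-threads").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("max-toe-threads = " + maxThreads());
    }
}
```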



3. Enter the basic information. Note that each seed URL must end with a "/".
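The trailing-slash rule above is easy to enforce mechanically. This small helper (my own, not part of Heritrix) appends the "/" when a seed lacks it:

```java
public class SeedNormalizer {
    // Ensure a seed URL ends with the trailing "/" the tutorial requires.
    public static String normalize(String seed) {
        return seed.endsWith("/") ? seed : seed + "/";
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://example.com"));  // http://example.com/
        System.out.println(normalize("http://example.com/")); // unchanged
    }
}
```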



4. Select "Modules" to enter the module configuration page (Heritrix's extension points are implemented as modules; you can write your own module to add the functionality you need). For the first item, "Select Crawl Scope", keep the default org.archive.crawler.deciderules.DecidingScope. For the penultimate item, "Select Writers", remove the default org.archive.crawler.writer.ARCWriterProcessor and add org.archive.crawler.writer.MirrorWriterProcessor, so that pages crawled by the job are mirrored into a local folder structure instead of being stored in ARC archive files.




5, select "Modules" to the right of "submodules", in the first content "Crawl-order->scope->decide-rules->rules" deleted the " Acceptiftranscluded "(org.archive.crawler.deciderules.TransclusionDecideRule) is a rule that fetches scopes. Otherwise, when the HTTP request returns 301 or 302, Heritrix will crawl the Web page under the other domain.

6, select "Settings" to enter the job configuration page in Wui's second line navigation bar, which mainly changes two items: Http-headers user-agent and from, their "project_url_here" and "contact_" Email_address_here "Replace with your own content (" Project_url_here "to start with" http://")

7. Select "Submit job" at the far right of the second row of the WUI navigation bar.

8. In the first row of the WUI navigation bar, select "Console" and click "Start". The crawl job officially begins; how long it takes depends on network conditions and the depth of the crawled site. Click "Refresh" to monitor the download progress.



You can also click "Logs" and the other tabs to observe the logs.

9. By default, files are downloaded to the jobs directory under the project location.


III. Some notes

1. After the project is created, Heritrix reports an error on sun.net.www.protocol.file.FileURLConnection. The sun.* packages are restricted, intended only for internal use, and by default Eclipse treats access to them as an error; change the rule so it is only a warning.

The steps are as follows: Window -> Preferences -> Java -> Compiler -> Errors/Warnings -> Deprecated and restricted API -> Forbidden reference (access rules): change to Warning.



2. On the module configuration page, if you find that all the configurations can be deleted and moved but none can be added or changed (there is no drop-down box), it is because the configuration files cannot be found; the path to the conf directory must be added on the Classpath tab.

That is step 9 of the first part.

3. Problem: Thread-10 org.archive.util.ArchiveUtils.<clinit>() TLD list unavailable
java.lang.NullPointerException
    at java.io.Reader.<init>(Unknown Source)
    at java.io.InputStreamReader.<init>(Unknown Source)
    at org.archive.util.ArchiveUtils.<clinit>(ArchiveUtils.java:759)
Solution: copy the tlds-alpha-by-domain.txt file from src/resources/org/archive/util in heritrix-1.14.4-src.zip into the project's org.archive.util package.
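The root cause is how classpath resources are looked up: Class.getResourceAsStream returns null when the file is not on the classpath next to the class, and passing that null into an InputStreamReader throws the NullPointerException seen above. A minimal demonstration of the null return (the file name here is deliberately nonexistent; ArchiveUtils itself looks up tlds-alpha-by-domain.txt the same way):

```java
import java.io.InputStream;

public class ResourceCheck {
    // Returns true only if the named resource is found on the classpath
    // relative to this class; null means "missing", which is what caused
    // the NullPointerException in ArchiveUtils.<clinit>.
    public static boolean present(String name) {
        InputStream in = ResourceCheck.class.getResourceAsStream(name);
        return in != null;
    }

    public static void main(String[] args) {
        System.out.println(present("no-such-file.txt")); // false
    }
}
```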


IV. Some of the configuration items in Modules

On the Modules page there are eight common options to configure, as follows.

1. Crawl Scope

Used to configure the crawl scope; see the options in the drop-down list.

The names intuitively indicate each option's crawl scope; the default is BroadScope, i.e., unlimited scope.

After selecting an item from the drop-down box, click the Change button; the explanation above the drop-down box updates to describe the current option.



2. URI Frontier

Used to determine the order in which URIs are crawled, i.e., the scheduling algorithm. The default is BdbFrontier.


3. Pre Processors: processors that should run before any fetching

Before fetching, these processors check preconditions, such as robots.txt; they are the entry point of the whole processor chain.


4. Fetchers: processors that fetch documents using various protocols

Specify which protocols are used to fetch documents and which document types are interpreted.


5. Extractors: processors that extract links from URIs

Used to extract links from the content that has been fetched.


6. Writers: processors that write documents to archive files

Choose how to save the downloads; two options are commonly used:

org.archive.crawler.writer.MirrorWriterProcessor: saves a mirror, i.e., downloads the files directly as-is.

org.archive.crawler.writer.ARCWriterProcessor: saves downloads in the ARC archive format; the files cannot be viewed directly. This is the default option.
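To make the mirror layout concrete, here is a hypothetical sketch of the basic idea behind MirrorWriterProcessor: a URL maps to a host/path location on disk. The real processor handles many more cases (query strings, default file names, character escaping); this is only an illustration:

```java
import java.net.URL;

public class MirrorPath {
    // Map a URL to a relative mirror path of the form host/path.
    // A URL with no path gets a placeholder file name (my assumption,
    // not necessarily what Heritrix chooses).
    public static String toPath(String url) throws Exception {
        URL u = new URL(url);
        String path = u.getPath().isEmpty() ? "/index.html" : u.getPath();
        return u.getHost() + path;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toPath("http://example.com/a/b.html")); // example.com/a/b.html
    }
}
```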


7. Post Processors: processors that do cleanup and feed the Frontier with new URIs

The finishing touches after the crawl.


8. Statistics Tracking

Used for some statistical information.



V. Some of the configuration items in Settings


Change the path of the downloaded file



"Heritrix Basic Tutorial 1" In Eclipse configuration Heritrix

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.