Install and run Heritrix on Windows

Source: Internet
Author: User

1 Installation
1.1 Prerequisites
Windows XP/2003. A Java environment is required, so install a J2SE JRE/JDK first. The version I used is Java SE 1.6.0_02.
1.2 Download Heritrix
Heritrix homepage: http://crawler.archive.org/
Download page: http://crawler.archive.org/downloads.html
On this page, select the SourceForge downloads link to go to the download list, then download the zip package of the version you want. The latest version at the time of writing is heritrix-1.12.1.
1.3 Install and configure Heritrix
1. Decompress the downloaded Heritrix package to a directory. I chose D:/heritrix.
2. Unpack the heritrix-1.12.1.jar file in the /heritrix directory and copy the two files order.xml and seeds.txt from profiles/default to the /heritrix/conf directory.
3. Open the heritrix.properties file and find the "heritrix.cmdline.admin =" line. After "admin =", add the administrator username and password you want to set, separated by ":", for example:
heritrix.cmdline.admin = admin:pwd1234
4. Copy the jmxremote.password.template file from /heritrix/conf to the home directory /heritrix and rename it jmxremote.password. Edit this file and change "@PASSWORD@" in the last two lines, "monitorRole @PASSWORD@" and "controlRole @PASSWORD@", to the administrator password. For example:
monitorRole pwd1234
controlRole pwd1234
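
Steps 3 and 4 are plain text substitutions, so they can be scripted. The following Python sketch is not part of Heritrix. The function names are hypothetical, and the placeholder spelling "@PASSWORD@" and the key "heritrix.cmdline.admin" are assumed from the steps above. It works on text in memory and leaves reading and writing the files to the caller.

```python
def fill_jmx_password(template_text, password, placeholder="@PASSWORD@"):
    """Return the template text with every placeholder replaced by the password."""
    return template_text.replace(placeholder, password)


def set_admin_login(properties_text, username, password,
                    key="heritrix.cmdline.admin"):
    """Return the properties text with the admin line set to 'username:password'."""
    lines = []
    for line in properties_text.splitlines():
        # Compare only the part before "=" so commented-out lines are left alone.
        name = line.split("=", 1)[0].strip()
        if name == key:
            line = "%s = %s:%s" % (key, username, password)
        lines.append(line)
    return "\n".join(lines)
```

For example, fill_jmx_password("monitorRole @PASSWORD@", "pwd1234") gives "monitorRole pwd1234".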
2 Run
2.1 Run the bundled script
The heritrix.cmd script file in the /heritrix/bin directory is the startup script of Heritrix. If you double-click it, the command window flashes and Heritrix fails to start, because the script must be run with parameters. You can create a script file in the /heritrix/bin directory, for example run.cmd, with the following content:
heritrix.cmd --admin=admin:pwd1234
Double-click this script to run Heritrix.
On XP, two windows open when Heritrix is started this way. The first is the script window. On the first run it shows the prompt "You need to change the jmxremote.password attribute to read-only. Are you sure you want to change it? Y, N"; select Y. The second window displays the Heritrix running status. After a successful start, the first window closes automatically and the Heritrix version is displayed in the second window. See Figure 1.
On Windows Server 2003 Standard, however, running this script does not prompt you to change the read-only attribute of the file. After Heritrix starts, the script window closes and only the Heritrix running status window remains.
2.2 Run a hand-written command
Another method is to write your own script to run Heritrix. Create the file run.bat in the /heritrix/bin directory. The script content is as follows [Reference 1]. The command is wrapped here for readability; enter it as a single line in the file:
java -Xmx512m -Dheritrix.home=D://heritrix -cp "D://heritrix//lib//commons-codec-1.3.jar;
D://heritrix//lib//commons-collections-3.1.jar;D://heritrix//lib//dnsjava-2.0.3.jar;
D://heritrix//lib//poi-scratchpad-2.0-RC1-20031102.jar;
D://heritrix//lib//commons-logging-1.0.4.jar;D://heritrix//lib//commons-httpclient-3.0.1.jar;
D://heritrix//lib//commons-cli-1.0.jar;D://heritrix//lib//mg4j-1.0.1.jar;
D://heritrix//lib//javaswf-CVS-SNAPSHOT-1.jar;D://heritrix//lib//bsh-2.0b4.jar;
D://heritrix//lib//servlet-tomcat-4.1.30.jar;D://heritrix//lib//junit-3.8.2.jar;
D://heritrix//lib//jasper-compiler-tomcat-4.1.30.jar;D://heritrix//lib//commons-lang-2.3.jar;
D://heritrix//lib//itext-1.2.0.jar;D://heritrix//lib//poi-2.0-RC1-20031102.jar;
D://heritrix//lib//jetty-4.2.23.jar;D://heritrix//lib//commons-net-1.4.1.jar;
D://heritrix//lib//libidn-0.5.9.jar;D://heritrix//lib//ant-1.6.2.jar;
D://heritrix//lib//fastutil-5.0.3-heritrix-subset-1.0.jar;D://heritrix//lib//je-3.2.23.jar;
D://heritrix//lib//commons-pool-1.3.jar;D://heritrix//lib//jasper-runtime-tomcat-4.1.30.jar;
D://heritrix//heritrix-1.12.1.jar" org.archive.crawler.Heritrix
If Heritrix is run this way, no prompts or extra windows are displayed. The Heritrix running status window is displayed directly.
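
Typing the long classpath by hand is error-prone. As a rough sketch, the same command line can be assembled from a list of jar names. The helper name build_command is hypothetical, and the sketch assumes the layout described above (jars under lib, the main jar in the home directory) and uses single forward slashes, which Java accepts on Windows.

```python
def build_command(home, jar_names, main_jar, heap="512m"):
    """Build the java command line that starts Heritrix with an explicit classpath."""
    # Library jars first (sorted for a stable order), then the main Heritrix jar.
    entries = ["%s/lib/%s" % (home, name) for name in sorted(jar_names)]
    entries.append("%s/%s" % (home, main_jar))
    classpath = ";".join(entries)  # ";" is the classpath separator on Windows
    return 'java -Xmx%s -Dheritrix.home=%s -cp "%s" org.archive.crawler.Heritrix' % (
        heap, home, classpath)
```

The jar names could come from listing the lib directory, for example with os.listdir, and the returned string can then be written to run.bat as a single line.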

Figure 1 Heritrix running status window

3 Task
3.1 Login
Once Heritrix is running, you can access its web UI through a browser. Enter http://127.0.0.1:8080/ in the address bar of the browser. The login page of Heritrix is displayed, as shown in Figure 2:

Figure 2 Login page of the Heritrix web UI
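
If the page does not load, it can help to check whether anything is answering on that address at all. The following is a minimal sketch using only the Python standard library. The function name webui_reachable is hypothetical, and the default URL is the one given above. It only tests that an HTTP server responds and does not log in.

```python
import urllib.error
import urllib.request


def webui_reachable(url="http://127.0.0.1:8080/", timeout=5):
    """Return True if an HTTP server answers at url, False otherwise."""
    # Skip any system proxy: the web UI runs on the local machine.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        opener.open(url, timeout=timeout).close()
        return True
    except urllib.error.HTTPError:
        # An error status (for example a login challenge) still means
        # that a server is listening.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a similar network failure.
        return False
```

A False result usually means Heritrix did not start or is listening on a different port.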

Enter the username and password set earlier to log in. After a successful login, the page shown in Figure 3 is displayed:

Figure 3 Page displayed after logging in to the Heritrix web UI

3.2 Create a crawl task
3.2.1 Create a task
Click the "Jobs" menu in the web UI shown in Figure 3 to go to the task settings page, as shown in Figure 4:

Figure 4 Task page

Click "With defaults" as shown in Figure 4 to create a new crawl task based on the default settings. You can fill it in as shown in Figure 5:

Figure 5 Create task settings

The task name can be anything. Several URLs can be added as seeds, but each must be a complete URL, including "http://" and the trailing slash. Here we use Baidu's URL for the test. Click the "Modules" button in the lower left corner to go to the processing chain settings page.
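
The seed rule above (scheme present, host present, and at least a trailing slash after the host) can be checked before the seeds are pasted into the form. This is a small sketch, not Heritrix's own validation. The function name is_complete_seed is hypothetical.

```python
from urllib.parse import urlparse


def is_complete_seed(url):
    """Return True if url has an http(s) scheme, a host, and a non-empty path."""
    parts = urlparse(url.strip())
    has_scheme = parts.scheme in ("http", "https")
    has_host = bool(parts.netloc)
    # "http://www.baidu.com" has an empty path; "http://www.baidu.com/" does not.
    has_path = parts.path != ""
    return has_scheme and has_host and has_path
```

With this check, "http://www.baidu.com/" passes, while "http://www.baidu.com" and "www.baidu.com/" are rejected.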
3.2.2 Processing chain settings
The specific settings are not described here. For details, refer to Reference 1 (freely available on CSDN). Configure the chain as shown in Figure 6:

Figure 6 Processing chain settings

The setup is simple: select an item from the drop-down list and click Add. The remaining settings can be read from the figure. Note that the order of the processors matters. The name of each item reveals its role, so they are not described further here. For more information in Chinese, see Reference 1.
3.2.3 Running parameter settings
After setting the processing chain, click "Settings" in the menu, as shown in Figure 7, to set the running parameters.

Figure 7 Select "Settings" in the Jobs menu to go to the running parameter settings page

The running parameter settings page offers many parameters. For details about a setting, click the "?" to the left of its box to view the pop-up help. In the simplest case, you only need to change the "http-header" item, specifically its "user-agent" and "from" values. Modify them as shown in Figure 8:

Figure 8 "http-header" in the running parameter settings

The changes are marked with the red box in Figure 8. Replace "PROJECT_URL_HERE" in "user-agent" with the complete URL of your project, and set "from" to your own valid email address. Both values can be chosen freely, as long as their format is valid.
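
To make "valid format" more concrete, here is a rough check of the two values. The patterns are assumptions chosen for illustration (a "+http://..." URL inside parentheses for the user agent, and a simple name@host.tld shape for the address). They are not taken from Heritrix, whose own rules may be stricter or looser. The function name check_http_header is hypothetical.

```python
import re

# Assumed shapes, for illustration only.
USER_AGENT_PATTERN = re.compile(r"^\S+.*\(.*\+https?://\S+\.\S+.*\).*$")
FROM_PATTERN = re.compile(r"^\S+@\S+\.\S+$")


def check_http_header(user_agent, from_address):
    """Return a list of problems found in the two values (empty if none)."""
    problems = []
    if not USER_AGENT_PATTERN.match(user_agent):
        problems.append("user-agent should contain a +http:// URL in parentheses")
    if not FROM_PATTERN.match(from_address):
        problems.append("from should look like an email address")
    return problems
```

An unedited user agent that still contains the PROJECT_URL_HERE placeholder fails the first check, because it has no "+http://" URL.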
After the settings are complete, click the task submission menu item shown in Figure 9 to finish creating the task.

Figure 9 Task submission menu

3.3 Run the task
After the configured task is submitted, the page returns to the main Jobs menu, as shown in Figure 10:

Figure 10 Task created

In Figure 10, the message in red shows that a task has been created, and task information and settings menus appear at the bottom of the page. Click the "Console" menu in the upper left corner to return to the home page, as shown in Figure 11:

Figure 11 Task awaiting start

Following the red marking in Figure 11, click "Start" to start the task, as shown in Figure 12:

Figure 12 Started task

Click the "Refresh" option marked in red in the lower left corner to refresh the task status. Figure 13 shows the initial task status:

Figure 13 The task just after starting

As shown in Figure 13, to pause the task, click the "Pause" option next to the task status. Figure 14 shows the task after running for about two minutes:

Figure 14 The task after running for two minutes

As shown in Figure 14, crawling is fast. The number of crawl threads was left at the default of 100 in the earlier settings, and all of them are in use. The download speed is reported in KB/s, and 7.7 URLs are crawled per second.
The saved website structure and files are shown in Figure 15:

Figure 15 Saved website and file structure

Figure 15 shows that each site's URL address is used as the name of its save directory, and that the files and the corresponding server directory structure are saved under it. This is convenient for search engine use. However, Heritrix feels more like a powerful website download tool.
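
The layout in Figure 15 amounts to mapping each URL to a path made of the host name followed by the URL path. The sketch below illustrates that idea only. The function name mirror_path, the root directory name "mirror", and the use of "index.html" for directory URLs are assumptions, and Heritrix's own writer may name files differently.

```python
import posixpath
from urllib.parse import urlparse


def mirror_path(url, root="mirror", index_name="index.html"):
    """Map a URL to a local path of the form root/host/path."""
    parts = urlparse(url)
    path = parts.path
    # A URL that ends at a directory gets a default file name.
    if path == "" or path.endswith("/"):
        path = path + index_name
    return posixpath.join(root, parts.netloc, path.lstrip("/"))
```

For example, "http://www.baidu.com/img/logo.gif" maps to "mirror/www.baidu.com/img/logo.gif".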


This completes the description of a simple Heritrix crawl task. I hope it helps.
