Build site-specific crawlers with Heritrix


Heritrix is an open-source web crawler written in Java that users can employ to crawl the resources they want from the web. Its greatest strengths are its good scalability and the ease with which users can implement their own crawl logic. This article introduces the configuration and operation of Heritrix in Eclipse and finally, taking the website of Beijing Forestry University as an example, describes how to extend it to crawl only the pages of a specific site.


Guo Yanfen, Intern, IBM

November 29, 2010


This article introduces the configuration and operation of Heritrix in Eclipse in detail, and then describes how to extend it so that it crawls only the pages of a specific site.

Through this article, readers can learn about the characteristics of Heritrix and how to configure and run it in Eclipse, and then build a dedicated crawler for a specific site from scratch, thereby adding a full-text search service to that site.

Background

As the content of a website grows, adding a search function to it becomes a common requirement, and search engines have become one of the most important applications on the Internet. Do you find that ordinary database retrieval cannot meet your query requirements? Do you want to build a Google- or Baidu-like full-text search engine for your site at minimal cost? Do you want to create your own dedicated search engine instead of doing SEO (search engine optimization) and waiting for Google or Baidu to index your site? With the power of open-source tools, you can easily achieve these goals.

The implementation of a search engine can be seen as three steps: 1. crawl web pages from the Internet; 2. process the pages and build an index database; 3. answer queries against the index. So no matter what kind of search engine you build, it must be supported by a well-designed crawler. Heritrix is a Java-based open-source crawler hosted on SourceForge. It can be launched, configured, and monitored through a web user interface, and developers can freely extend its components to implement their own crawl logic. Because it is so convenient to extend, it is loved by many search engine enthusiasts.

Although Heritrix is powerful, its configuration is complex, and the official tests were done only on Linux, which makes it difficult for newcomers to get started. This article describes in detail how to configure Heritrix in Eclipse on Windows and then extends it in a simple way to crawl only a specific website, laying a good foundation for building a full-text search engine for the site.


Heritrix Download

At the time of writing, the latest version of Heritrix is 1.14.4 (released on 2010-05-10), which you can download from SourceForge (http://sourceforge.net/projects/archive-crawler/files/). Each version comes as four compressed packages: two .tar.gz packages for Linux and two .zip packages for Windows. heritrix-1.14.4.zip contains the compiled and packaged files, while heritrix-1.14.4-src.zip contains the original source code, which is convenient for secondary development. This article uses heritrix-1.14.4-src.zip; download it and unzip it into the heritrix-1.14.4-src folder.


Configuration in Eclipse

First, create a new Java project named MyHeritrix in Eclipse. Then use the downloaded source code package to configure the project according to the following steps.

1. Import the class library

The tool libraries that Heritrix depends on are in the heritrix-1.14.4-src\lib directory and need to be imported into the MyHeritrix project.

1) Copy the lib folder under heritrix-1.14.4-src to the MyHeritrix project root directory;

2) Right-click the MyHeritrix project and select "Build Path -> Configure Build Path ...", then select the Libraries tab and click "Add JARs ...", as shown in Figure 1.

Figure 1. Import the class library - before importing

3) In the "JAR Selection" dialog that pops up, select all the JAR files under the lib folder of the MyHeritrix project and click the OK button, as shown in Figure 2.

Figure 2. Select the class library

After the setup is complete, the result is as shown in Figure 3:

Figure 3. Import the class library - after importing

2. Copy the source code

1) Copy the three folders com, org, and st under heritrix-1.14.4-src\src\java into the src directory of the MyHeritrix project. These three folders contain the core source code needed to run Heritrix;

2) Copy the file tlds-alpha-by-domain.txt under heritrix-1.14.4-src\src\resources\org\archive\util to MyHeritrix\src\org\archive\util. This file is a list of top-level domains that is read when Heritrix starts;

3) Copy the conf folder under heritrix-1.14.4-src\src to the MyHeritrix project root directory. It contains the configuration files required to run Heritrix;

4) Copy the webapps folder under heritrix-1.14.4-src\src to the MyHeritrix project root directory. This folder holds the web UI files served by the servlet engine embedded in Heritrix. Note that it does not contain the help documentation; if you want to use Help, copy the articles folder from heritrix-1.14.4.zip\docs into MyHeritrix\webapps\admin\docs (you need to create the docs folder first). Alternatively, replace the webapps folder from heritrix-1.14.4-src\src directly with the webapps folder in heritrix-1.14.4.zip; the drawback is that the latter is packaged as a .war file whose source cannot be modified.

After the copying is complete, the directory structure of the MyHeritrix project is as shown in Figure 4. The source code needed to run Heritrix is now in place; next we need to modify the configuration file and add the run parameters.

Figure 4. Directory structure of the MyHeritrix project

3. Modify the configuration file

The conf folder provides the configuration files and contains a very important one: heritrix.properties. Many parameters closely related to the operation of Heritrix are configured in heritrix.properties; they determine some of the default tool classes used at run time, the startup parameters of the web UI, and the Heritrix log format. When running Heritrix for the first time, you only need to set the user name and password for the web UI in this file. As shown in Figure 5, set heritrix.cmdline.admin = admin:admin, where "admin:admin" is the user name and password, respectively. Then set the version parameter to 1.14.4.
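For reference, a minimal sketch of the two relevant entries in heritrix.properties is shown below; the property names reflect the 1.14.x file as I recall it, so double-check them against the conf/heritrix.properties you copied in step 3:

    # web UI login, in the form user:password
    heritrix.cmdline.admin = admin:admin

    # version string reported by the crawler
    heritrix.version = 1.14.4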

Figure 5. Set the login user name and password

4. Configure the run configuration

Right-click the MyHeritrix project and select "Run As -> Run Configurations ...", and make sure that the Project and Main class fields on the Main tab are filled in correctly, as shown in Figure 6. The Name field can be set to any easily recognizable name.

Figure 6. Configure the run configuration - set the project and main class

Then select the User Entries item on the Classpath tab, at which point the Advanced button on the right becomes active. Click it, select "Add Folders" in the pop-up dialog, and choose the conf folder under the MyHeritrix project, as shown in Figure 7.

Figure 7. Add a configuration file

At this point our MyHeritrix project is ready to run. Let's take a look at how to start Heritrix and set up a specific crawl task.


Create a Web Crawl task

Locate the Heritrix.java file in the org.archive.crawler package; it is the entry point for starting the Heritrix crawler. Right-click it and select "Run As -> Java Application". If everything is configured correctly, the console will show the startup information in Figure 8.

Figure 8. Console output when running successfully

Enter http://localhost:8080 in a browser to open the web UI login page shown in Figure 9.

Figure 9. Heritrix Login Interface

Enter the user name and password you set earlier, admin/admin, to enter the Heritrix admin console, as shown in Figure 10. Because we have not created a crawl task yet, Jobs shows 0.

Figure 10. Heritrix Console

Heritrix uses its web user interface to start crawls, set crawl parameters, and monitor crawling, which is simple, intuitive, and easy to manage. Below we use the Beijing Forestry University homepage (http://www.bjfu.edu.cn/) as the seed site to create a crawl example.

Create a new crawl task on the Jobs page; as shown in Figure 11, you can create four types of tasks.

Figure 11. Create a crawl task
    • Based on existing job: use an existing crawl task as a template to generate the new crawl task.
    • Based on a recovery: a previous task may have set checkpoints, and the new task will start from one of those saved state points.
    • Based on a profile: dedicated templates have been set up for different kinds of tasks, and the new task will be generated according to a chosen template.
    • With defaults: the simplest option; the task is generated with the default configuration.

Here we select "With defaults" and enter the task-related information, as shown in Figure 12.

Figure 12. Create the crawl task "BJFU"

Note the buttons at the bottom of Figure 11, which allow you to configure the crawl task in detail; here we make only the necessary settings.

First click the "Modules" button. On the corresponding page you can set the processing modules for this task; there are seven configurable items in total. Here we set only two of them, Crawl Scope and Writers. The meaning of each item is briefly described below.

1) Select Crawl Scope: the crawl scope configures which page links should be crawled within the current scope. For example, selecting BroadScope means the crawl range is unrestricted, while selecting HostScope limits the crawl range to the current host. Here we select org.archive.crawler.scope.BroadScope and click the Change button on the right to save the setting.

2) Select URI Frontier: the Frontier is the processor of URLs; it determines which URL is to be processed next, and it also adds the URLs resolved by the processor chain to the queue of URLs waiting to be processed. Here we use the default value.

3) Select Pre Processors: these processors check some preconditions of crawling, such as robots.txt information; they form the entry point of the whole processor chain. Here we use the default values.

4) Select Fetchers: these processors handle the network transport protocols, such as DNS, HTTP, or FTP. Here we use the default values.

5) Select Extractors: these are mainly used to parse the content returned by the server and extract the URLs on the page, which then wait to be crawled. Here we use the default values.

6) Select Writers: these determine the form in which the fetched information is written to disk. One option is the compressed format (ARC) and the other is the mirror format. Here we choose the simple and intuitive mirror format: org.archive.crawler.writer.MirrorWriterProcessor.

7) Select Post Processors: these processors run after the fetching and parsing of a page is complete, for example conditionally adding the URLs parsed out by the Extractors to the queue of URLs waiting to be processed. Here we use the default values.

The effect after setup is shown in Figure 13:

Figure 13. Set Modules

After setting "Modules", click the "Settings" button. Here we only need to set user-agent and from, where:

    • "@[email protected]" string needs to be replaced with Heritrix version information.
    • "Project_url_here" can be replaced with any full URL address.
    • The "from" property does not need to set the real e-mail address, as long as the correct format of the email address is OK.

You can click the question mark in front of each parameter to see its explanation. The settings for this task are shown in Figure 14.

Figure 14. Set Settings

After completing the above settings, click the "Submit job" link and then return to the console, where you can see that the task we just created is in the pending state, as shown in Figure 15.

Figure 15. Start a task

Click "Start" to start the task, refresh to see the crawl progress and related parameters. You can also pause or terminate the crawl process, as shown in 16. It is important to note that the percentage of the progress bar is not accurate, and this percentage is the ratio of the number of links actually processed and the total number of links analyzed. As the crawl work continues, this percentage of the figure is constantly changing.

Figure 16. Start crawl

At the same time, a "jobs" folder is automatically generated under the MyHeritrix project directory, containing this crawl task. The crawled pages are stored as a mirror: each URL is split on "/", and the resulting segments are used as the directory levels for storage, as shown in Figure 17.

Figure 17. Crawled web pages

As can be seen from Figure 17, because we chose the BroadScope crawl scope, the crawler crawls every URL it encounters. This makes the URL queue expand without limit, so the crawl never finishes and can only be terminated forcibly. Although Heritrix provides some classes for controlling the crawl scope, practical experience shows that if you want to implement your own crawl logic fully, the crawl controls provided by Heritrix are not enough; you have to extend and modify the source code.

Below, we take crawling the pages related to www.bjfu.edu.cn of Beijing Forestry University as an example to explain how to extend Heritrix to implement your own crawl logic.


Extending Heritrix

Let us first analyze the overall structure of Heritrix and the processing chain of a URI.

The overall structure of Heritrix

Heritrix uses a modular design that lets the user choose which modules to use at run time. It consists of core classes and pluggable modules. Core classes can be configured but cannot be overridden, while pluggable modules can be replaced by third-party modules. So we can replace a default pluggable module with a third-party module that implements our specific crawl logic to meet our own crawl needs.

The overall structure of Heritrix is shown in Figure 18. The CrawlController (download controller) is the overall controller of the whole download process and the starting point of the entire crawl task; it determines when the crawl starts and ends. Each URI has its own processing thread, which obtains a new URI from the Frontier (boundary controller) and then passes it through the Processor chains, where it is handled by a series of Processors.

Figure 18. Heritrix overall structure

URI processing flow

A processing chain consists of multiple processors, which together complete the processing of a URI, as shown in Figure 19.

Figure 19. URI processing Chain

1) Pre-fetch processing chain: checks some preconditions of the crawl, such as the robots protocol and DNS.

2) Fetch processing chain: parses the network transport protocol and obtains the data from the remote server.

3) Extractor processing chain: extracts new URLs from the fetched web page.

4) Write/index processing chain: writes the data to local disk.

5) Post-processing chain: after the whole fetch-and-parse process is complete, it does some cleanup work, such as conditionally adding the URLs extracted by the Extractors to the queue of URLs waiting to be processed. It is here that we can control the crawl scope simply by controlling which URLs are added to the pending queue.

Extend FrontierScheduler to crawl specific site content

FrontierScheduler is a class in the org.archive.crawler.postprocessor package. Its role is to add the links extracted by the Extractors to the Frontier for further processing. In the innerProcess(CrawlURI) method of this class, it first checks whether the current URI carries any high-priority (prerequisite) links; if it does, they are scheduled immediately, and if not, it iterates over all the extracted links and calls the schedule() method to add them to the queue for processing, as shown in Figure 20.

Figure 20. The innerProcess() and schedule() methods in the FrontierScheduler class

As you can see from the code, the innerProcess() method does not call Frontier's schedule() method directly; instead it calls its own internal schedule() method, which in turn calls Frontier's schedule() method. The schedule() method in FrontierScheduler simply adds the current candidate link to the crawl queue without any checking. This design leaves a good hook for extending FrontierScheduler.
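In outline, the two methods look roughly like the following simplified sketch; it is paraphrased from the 1.14.x source rather than copied verbatim, so the prerequisite check and method names may differ slightly in your copy of the class:

    protected void innerProcess(final CrawlURI curi) {
        // Prerequisite URIs (for example dns: lookups or robots.txt fetches)
        // are scheduled first, before any ordinary outlinks.
        if (curi.hasPrerequisiteUri() && curi.getFetchStatus() == S_DEFERRED) {
            handlePrerequisites(curi);
            return;
        }
        // Every candidate link extracted from this page is handed to our
        // own schedule() method, one by one.
        synchronized (this) {
            for (Iterator iter = curi.getOutCandidates().iterator(); iter.hasNext();) {
                schedule((CandidateURI) iter.next());
            }
        }
    }

    protected void schedule(CandidateURI caUri) {
        // No filtering here: the candidate goes straight into the Frontier's
        // crawl queue. This is the hook that a subclass can override.
        getController().getFrontier().schedule(caUri);
    }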

Here we construct a class FrontierSchedulerForBjfu derived from FrontierScheduler, which overrides the schedule(CandidateURI caUri) method and restricts the scheduled URIs to those containing "bjfu", ensuring that the crawled links are all internal addresses of Beijing Forestry University. The code of the derived class FrontierSchedulerForBjfu is shown in Figure 21.

Figure 21. The derived class FrontierSchedulerForBjfu
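A minimal sketch of such a derived class is given below. The restriction to URIs containing "bjfu" is what the article describes; the extra checks for "dns:" and "robots.txt" are an assumption of mine, added because prerequisite URIs normally have to pass the filter as well or the crawl stalls. Verify it against your Heritrix 1.14.4 source before using it:

    package org.archive.crawler.postprocessor;

    import org.archive.crawler.datamodel.CandidateURI;

    public class FrontierSchedulerForBjfu extends FrontierScheduler {

        public FrontierSchedulerForBjfu(String name) {
            super(name);
        }

        protected void schedule(CandidateURI caUri) {
            String uri = caUri.toString();
            // Let prerequisites (dns:, robots.txt) and Beijing Forestry
            // University addresses through; everything else is dropped.
            if (uri.indexOf("dns:") != -1
                    || uri.indexOf("robots.txt") != -1
                    || uri.indexOf("bjfu.edu.cn") != -1) {
                super.schedule(caUri);
            }
            // URIs that do not match are simply never scheduled.
        }
    }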

Then add the line "org.archive.crawler.postprocessor.FrontierSchedulerForBjfu|FrontierSchedulerForBjfu" to the Processor.options file in the modules folder, so that our extension class org.archive.crawler.postprocessor.FrontierSchedulerForBjfu can be selected in the crawler's web UI, as shown in Figure 22.
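Assuming the derived class lives in the org.archive.crawler.postprocessor package as in the sketch above, the added line would look like the following; as far as I recall, the text after the "|" is only the display name shown in the web UI:

    org.archive.crawler.postprocessor.FrontierSchedulerForBjfu|FrontierSchedulerForBjfu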

Figure 22. Replace FrontierScheduler with FrontierSchedulerForBjfu

The final crawl result is shown in Figure 23: all the pages are under http://www.bjfu.edu.cn. Isn't that simple? Of course, if you only want to achieve this particular crawl goal, you do not have to modify the source code; you can also meet the requirement by setting a crawl rule in the web UI. This article merely uses it as an example to illustrate how to extend Heritrix.

Figure 23. Crawl result after the extension


FAQ

1. Access restriction error

Error message:

Access restriction: The type FileURLConnection is not accessible due to restriction on required library C:\Program Files\Java\jdk1.6.0_20\jre\lib\rt.jar, as shown in Figure 24.

Figure 24. Access Restriction Error

Solution:

This error is caused by the JRE access rules. Right-click the MyHeritrix project, select "Build Path -> Configure Build Path ...", then select the Libraries tab, remove the JRE System Library, and re-import it. Alternatively, select "Window -> Preferences -> Java -> Compiler -> Errors/Warnings", find "Forbidden reference (access rules)" under "Deprecated and restricted API", and change the default setting "Error" to "Warning" or "Ignore".

2. NullPointerException error

The error message is shown in Figure 25:

Figure 25. NullPointerException Error

Solution:

The reason for this error is that the file "tlds-alpha-by-domain.txt" is missing. The file can be found under heritrix-1.14.4-src\src\resources\org\archive\util; copy it to MyHeritrix\src\org\archive\util to fix the problem.

3. Selections cannot be changed on the Modules page

The error message is shown in Figure 26.

Figure 26. Selections cannot be changed on the Modules page

Solution:

This is because the configuration files required at run time have not been added to the classpath. Add the conf folder to the classpath as described in step "4. Configure the run configuration" above.


Precautions

Heritrix is a multi-threaded downloading crawler; when it is used on a company intranet, crawling may be subject to restrictions.


Summary

In the development of a search engine, using a good crawler to obtain the required web page information is the first step, and it is also key to the success of the whole system. Heritrix is a powerful and efficient crawler with good scalability. This article has described how to configure, run, and extend it in Eclipse on Windows, so that you can get started with Heritrix and enjoy your own crawler journey as quickly as possible.

References and learning
    • Visit the Heritrix home page to learn more about Heritrix.
    • Download Heritrix 1.14.4 from SourceForge.
    • See the article "Using HttpClient and HtmlParser to implement simple crawlers" to learn how to use open-source tools to write your own crawler.
    • Download the Eclipse IDE.
    • Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates, as well as popular articles and tutorials, to help you develop with open-source technologies and use them with IBM products.
    • Stay current with developerWorks technical events and webcasts.
