Configure and run a simple job in heritrix

Source: Internet
Author: User

After a long vacation, I officially entered the development phase of the full-text search product. Although it is a full-text search for portal websites, It is a silly product, therefore, you need to use crawlers to obtain the website content. However
Then use Lucene to search. Here Lucene is a good saying, But crawlers are not easy to do. Although there are many open-source items, it is a crawler.ProgramI have never touched on it. What can I do? Fortunately, there is a document about
In a search engine book, crawlers in the book use "heritrix". Simply use this in the book.
From SourceForge
Download the latest cookbook --heritrix-1.14.3.zip, decompress it, and put it in the root directory of my computer's e drive, and rename the folder as heritrix. This is a benefit, and you will know it later.
.
After decompression, follow the content in the book and start it. Well, the startup command is really long enough, and I can't enter it manually every time. Forget it. Write it in Notepad first. Copy and paste
Post? It is also troublesome. If so, change it to the BAT file. Open CMD and enter E:/heritrix to run the BAT file. Yes, good. Congratulations!
Ji

Before running the command, modify "heritrix. cmdline. Admin =" in the heritrix. properties file under the conf directory.
In this line, enter "username: Password" after "=", restart once, and enter "http: // localhost: 8080/" in the address bar of the browser /"
Enter the user name and password on the page, for example:


Click the login button to go to the main interface, such:

Click "Jobs" and select "with ults" on the page to create a new job. You can enter the name and description as needed. The specific configuration is as follows:

Click the modules button to go to the configuration menu at the next level. After "Crawl scope" and "URI frontier" are selected, click "change". After other selections, click "add, the specific configuration is as follows:

Note that crawl Scope No
Select "broadscope". We recommend that you select "hostscope" because the "broadscope" method crawls all websites, rather than based on the entered URL. After setting modules, click Settings to go to the page shown in


The Marked Area in the figure is the content to be modified. You do not need to modify it elsewhere. The default value is enough. Click "submit job" to save the job. After saving, you can see the following information on the job page:

In this way, a simple job is created, and then click console to return to the Management page. We can see a "start", as shown in figure

After clicking it, this simple job will be able to work. Haha, success! Happy!
. You can use pause to pause the capture, or use terminate to stop the capture.

Through the test, I found that although the hostscope method is captured based on the host, it still parses other URLs on the page to continue crawling, this also results in the capture of content not only the input URL, but also other URL content. I will share it with you after testing all the methods.

Please go to my blog on 51cto (http://jerrysun.blog.51cto.com/745955/211772) to download my bat file, use 1.14.3 version of friends, modify the corresponding path to run

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.