Install the JDK and Tomcat first; see the previous two blog posts.
Download
Get the latest release, apache-nutch-1.2-bin.tar.gz, from the Apache Nutch official website.
Installation
Decompress the package into a directory, such as /home/username/nutch.
Preparations
(1) Create a new file weburls.txt and write the initial (seed) URL into it, e.g. http://www.csdn.net/.
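The seed file is plain text with one start URL per line; it can be created straight from the shell (run this inside the Nutch install directory):

```shell
# Create the seed file with one start URL per line
echo "http://www.csdn.net/" > weburls.txt
# Verify its contents
cat weburls.txt
```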
(2) Open nutch-1.2/conf/crawl-urlfilter.txt, delete the original accept rule, and add:
+^http://([a-z0-9]*\.)*csdn.net/
This rule allows access to pages on the csdn website. To allow access to all websites instead, change the rule above to: +^
Note: you must delete (or replace) the original MY.DOMAIN.NAME rule.
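The accept rule is an extended regular expression (the leading + means "allow"). As a quick local sanity check, separate from Nutch itself, the regex body can be tried against URLs with grep -E:

```shell
# Regex body of the accept rule (everything after the leading '+'),
# with the dots in the host name escaped for strictness
pattern='^http://([a-z0-9]*\.)*csdn\.net/'

# A csdn URL matches the pattern...
echo "http://www.csdn.net/" | grep -qE "$pattern" && echo "accepted"
# ...while an unrelated site does not
echo "http://www.example.com/" | grep -qE "$pattern" || echo "rejected"
```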
(3) Open nutch-1.2/conf/nutch-site.xml and add the two properties shown below (http.agent.name is required):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
  <name>http.agent.name</name>
  <value>HD nutch agent</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>1.0</value>
</property>
</configuration>
Otherwise, an error is reported:
Fetcher: No agents listed in 'http.agent.name' property.
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property
Crawl web pages
bin/nutch crawl weburls.txt -dir localweb -depth 2 -topN 100 -threads 2
-dir localweb: the directory for storing the downloaded data; it is created automatically if it does not exist.
-depth 2: crawl to a link depth of 2 from the seed URLs.
-topN 100: fetch at most the top 100 eligible pages at each level.
-threads 2: the number of fetcher threads to start.
While the crawler runs, it prints a large amount of log output. When crawling finishes, you will find that the localweb directory has been created, containing several subdirectories.
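With the one-step crawl command above, the localweb directory typically contains these subdirectories (the standard Nutch 1.2 layout; your run should look similar):

```text
localweb/
├── crawldb/    # database of fetched and discovered URLs
├── linkdb/     # inverted (incoming) link information
├── segments/   # one timestamped directory per fetch round
├── indexes/    # per-segment indexes
└── index/      # merged Lucene index used by the search web app
```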
Configure nutch in Tomcat
(1) Grant nutch-1.2 the permissions it needs: open tomcat6/conf/catalina.policy and add:
grant {
    permission java.security.AllPermission;
};
Otherwise, an error is reported:
Exception sending context initialized event to listener instance of class org.apache.nutch.searcher.NutchBean$NutchBeanConstructor
java.lang.RuntimeException: java.security.AccessControlException: access denied
(2) Start Tomcat manually: cd /home/username/tomcat/tomcat6; bin/startup.sh
(3) Copy nutch-1.2.war from the nutch-1.2 directory to tomcat6/webapps/. While running, Tomcat automatically unpacks the war. Then open the unpacked application's nutch-site.xml (under webapps/nutch-1.2/WEB-INF/classes/) and add:
<property>
  <name>searcher.dir</name>
  <value>/home/username/nutch-1.2/localweb</value>
  <description></description>
</property>
The value is the path where the crawled data is stored; the search engine looks in this path for content to search.
Run the nutch search on the Web
In the address bar, enter: http://localhost:8080/nutch-1.2
Enter a search keyword on the page that appears to get results. (If the results are garbled, refer to the previous blog post on configuring Tomcat.)
View the search results
(1) Use the readdb tool to inspect the crawl database and see how many pages and links were gathered.
To print simple statistics:
$ bin/nutch readdb localweb/crawldb -stats (the -stats option prints statistics about the database)
Use the -dump option to export each URL's information to text files in the pageurl directory:
$ bin/nutch readdb localweb/crawldb -dump pageurl
Use the -topN option to write the highest-scoring URLs, sorted by score, to a text file in the urlpath directory:
$ bin/nutch readdb localweb/crawldb -topN 3 urlpath
(2) Use the readseg tool to read information about the downloaded segments.
To list them:
$ bin/nutch readseg -list -dir localweb/segments/
For full details, pick one segment and dump it (readseg -dump needs an output directory; segdump is used here):
$ s=`ls -d localweb/segments/* | head -1`
$ bin/nutch readseg -dump $s segdump
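The back-quoted one-liner just selects the first segment directory; because Nutch names segments with timestamps, `head -1` picks the oldest one. The idiom can be tried on its own with scratch directories (the paths below are made up for illustration):

```shell
# Simulate two timestamped segment directories
mkdir -p /tmp/segdemo/20101101010101 /tmp/segdemo/20101202020202
# 'ls -d' lists them in lexicographic (= chronological) order,
# so 'head -1' selects the oldest segment
s=$(ls -d /tmp/segdemo/* | head -1)
echo "$s"    # prints: /tmp/segdemo/20101101010101
```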