Configure Nutch 1.2 under Ubuntu 10.04


Install the JDK and Tomcat first; see the previous two blog posts.

 

Download

Get the latest release, apache-nutch-1.2-bin.tar.gz, from the Apache official website.

 

Installation

Decompress the package to a directory, such as /home/username/nutch-1.2.
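A minimal sketch of the download and unpack steps (assuming wget is available and fetching from the Apache archive server):

$ cd /home/username
$ wget http://archive.apache.org/dist/nutch/apache-nutch-1.2-bin.tar.gz
$ tar xzf apache-nutch-1.2-bin.tar.gz    # extracts to nutch-1.2 (the directory name may vary by packaging)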

 

Preparations

(1) Create a new file weburls.txt and write the seed URL into it, e.g. http://www.csdn.net/.
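For example, from the Nutch installation directory:

$ cd /home/username/nutch-1.2
$ echo "http://www.csdn.net/" > weburls.txt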

(2) Open nutch-1.2/conf/crawl-urlfilter.txt, delete the existing rules, and add:

+^http://([a-z0-9]*\.)*csdn.net/

This rule allows pages from the csdn.net site and its subdomains.

To allow any website, change the preceding rule to simply: +^

Note: be sure to delete the default MY.DOMAIN.NAME rule.
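Alternatively, to allow a handful of specific sites rather than opening the filter up completely, list one rule per line (a sketch; substitute your own domains):

+^http://([a-z0-9]*\.)*csdn.net/
+^http://([a-z0-9]*\.)*apache.org/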

(3) Open nutch-1.2/conf/nutch-site.xml and add the two properties shown below:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>HD nutch agent</value>
  </property>
  <property>
    <name>http.agent.version</name>
    <value>1.0</value>
  </property>
</configuration>

Otherwise, an error is returned:

Fetcher: No agents listed in 'http.agent.name' property.
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property

 

Crawl web pages

bin/nutch crawl weburls.txt -dir localweb -depth 2 -topN 100 -threads 2
-dir localweb: directory where the downloaded data is stored; created automatically if it does not exist.
-depth 2: crawl to a link depth of 2.
-topN 100: fetch at most the first 100 eligible pages per round.
-threads 2: number of fetcher threads to start.
While it runs, the crawler prints a large amount of log output. When the crawl finishes, you will find a newly created localweb directory containing several subdirectories.
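Assuming the crawl command above, the resulting layout typically looks like this:

$ ls localweb
crawldb  index  indexes  linkdb  segments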

 

Configure Nutch in Tomcat

(1) Give the nutch-1.2 webapp the permissions it needs. Open tomcat6/conf/catalina.policy and add:

grant {
    permission java.security.AllPermission;
};

Otherwise, an error is returned:

Exception sending context initialized event to listener instance of class org.apache.nutch.searcher.NutchBean$NutchBeanConstructor
java.lang.RuntimeException: java.security.AccessControlException: access denied
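Note that catalina.policy is only consulted when Tomcat runs with the security manager enabled. If you would rather not grant AllPermission globally, a grant scoped to the Nutch webapp should also work (a sketch using standard Java policy-file syntax):

grant codeBase "file:${catalina.home}/webapps/nutch-1.2/-" {
    permission java.security.AllPermission;
};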

(2) Start Tomcat manually: cd /home/username/tomcat/tomcat6; bin/startup.sh
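A quick way to confirm Tomcat is up (assuming curl is installed):

$ curl -I http://localhost:8080/    # expect an HTTP 200 response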

(3) Copy nutch-1.2.war from the nutch-1.2 directory to tomcat6/webapps/. While it is running, Tomcat automatically expands the WAR. Then open the expanded webapp's configuration file, webapps/nutch-1.2/WEB-INF/classes/nutch-site.xml, and add:

<property>
  <name>searcher.dir</name>
  <value>/home/username/nutch-1.2/localweb</value>
  <description></description>
</property>

The value is the path where the crawled data is stored; the search engine looks there for the content it serves.
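After editing nutch-site.xml, restarting Tomcat is the safest way to make sure the new searcher.dir value is picked up:

$ bin/shutdown.sh
$ bin/startup.sh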

Run the Nutch search on the Web

In the browser address bar, enter: http://localhost:8080/nutch-1.2

Enter a search keyword on the page that appears to get results. (If the results are garbled, see the previous blog post on configuring Tomcat's character encoding.)

 

 

View the crawl results

 

(1) Use the readdb tool to parse the web page database and see how many pages and links were collected.

To view summary information:

$ bin/nutch readdb localweb/crawldb -stats

(-stats prints statistics about the crawl database)

Use the -dump option to export each URL's information to text files in the pageurl directory:

$ bin/nutch readdb localweb/crawldb -dump pageurl

Use the -topN option to write the highest-scoring URLs (here, the top 3) to text files in the urlpath directory:

$ bin/nutch readdb localweb/crawldb -topN 3 urlpath
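The -dump output is written in Hadoop's part-file format, so with the default single-process setup the exported URLs end up in a file such as pageurl/part-00000:

$ head pageurl/part-00000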

 

(2) Use the readseg tool to read information from the downloaded segments.

For a quick listing:

$ bin/nutch readseg -list -dir localweb/segments/

 

For more details, dump the first segment with a short shell snippet (readseg -dump requires an output directory; segdump here is an arbitrary name):

s=`ls -d localweb/segments/* | head -1`
bin/nutch readseg -dump $s segdump
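The dump is written as plain text inside the output directory (typically a file named dump), so you can page through it:

$ less segdump/dump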

 

 
