Install the JDK and Tomcat first; see the previous two blog posts.
Download
Get the latest release, apache-nutch-1.2-bin.tar.gz, from the Apache Nutch official website.
Installation
Decompress the package into a directory, such as /home/username/nutch.
Preparations
(1) Create a new file weburls.txt and write the initial (seed) URL into it, e.g. http://www.csdn.net/.
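The seed file is plain text with one start URL per line; it can be created straight from the shell (run this inside the Nutch install directory):

```shell
# Create the seed file with one start URL per line
echo "http://www.csdn.net/" > weburls.txt
# Verify its contents
cat weburls.txt
```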
(2) Open nutch-1.2/conf/crawl-urlfilter.txt, delete the original accept rule, and add:
+^http://([a-z0-9]*\.)*csdn.net/
This rule allows access to pages on the csdn website. To allow access to all websites instead, change the rule above to: +^
Note: you must delete (or replace) the original MY.DOMAIN.NAME rule.
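The accept rule is an extended regular expression (the leading + means "allow"). As a quick local sanity check, separate from Nutch itself, the regex body can be tried against URLs with grep -E:

```shell
# Regex body of the accept rule (everything after the leading '+'),
# with the dots in the host name escaped for strictness
pattern='^http://([a-z0-9]*\.)*csdn\.net/'

# A csdn URL matches the pattern...
echo "http://www.csdn.net/" | grep -qE "$pattern" && echo "accepted"
# ...while an unrelated site does not
echo "http://www.example.com/" | grep -qE "$pattern" || echo "rejected"
```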
(3) Open nutch-1.2/conf/nutch-site.xml and add the two properties shown below (http.agent.name is required):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
  <name>http.agent.name</name>
  <value>HD nutch agent</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>1.0</value>
</property>
</configuration>
Otherwise, an error is reported:
Fetcher: No agents listed in 'http.agent.name' property.
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property
Crawl web pages
bin/nutch crawl weburls.txt -dir localweb -depth 2 -topN 100 -threads 2
-dir localweb: the directory for storing the downloaded data; it is created automatically if it does not exist.
-depth 2: crawl to a link depth of 2 from the seed URLs.
-topN 100: fetch at most the top 100 eligible pages at each level.
-threads 2: the number of fetcher threads to start.
While the crawler runs, it prints a large amount of log output. When crawling finishes, you will find that the localweb directory has been created, containing several subdirectories.
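With the one-step crawl command above, the localweb directory typically contains these subdirectories (the standard Nutch 1.2 layout; your run should look similar):

```text
localweb/
├── crawldb/    # database of fetched and discovered URLs
├── linkdb/     # inverted (incoming) link information
├── segments/   # one timestamped directory per fetch round
├── indexes/    # per-segment indexes
└── index/      # merged Lucene index used by the search web app
```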
Configure nutch in Tomcat
(1) Grant nutch-1.2 the permissions it needs: open tomcat6/conf/catalina.policy and add:
grant {
    permission java.security.AllPermission;
};
Otherwise, an error is reported:
Exception sending context initialized event to listener instance of class org.apache.nutch.searcher.NutchBean$NutchBeanConstructor
java.lang.RuntimeException: java.security.AccessControlException: access denied
(2) Start Tomcat manually: cd /home/username/tomcat/tomcat6; bin/startup.sh
(3) Copy nutch-1.2.war from the nutch-1.2 directory to tomcat6/webapps/. While running, Tomcat automatically unpacks the war. Then open the unpacked application's nutch-site.xml (under webapps/nutch-1.2/WEB-INF/classes/) and add:
<property>
  <name>searcher.dir</name>
  <value>/home/username/nutch-1.2/localweb</value>
  <description></description>
</property>
The value is the path where the crawled data is stored; the search engine looks in this path for content to search.
Run the nutch search on the Web
In the address bar, enter: http://localhost:8080/nutch-1.2
Enter a search keyword on the page that appears to get results. (If the results are garbled, refer to the previous blog post on configuring Tomcat.)
View the search results
(1) Use the readdb tool to inspect the crawl database and see how many pages and links were gathered.
To print simple statistics:
$ bin/nutch readdb localweb/crawldb -stats (the -stats option prints statistics about the database)
Use the -dump option to export each URL's information to text files in the pageurl directory:
$ bin/nutch readdb localweb/crawldb -dump pageurl
Use the -topN option to write the highest-scoring URLs, sorted by score, to a text file in the urlpath directory:
$ bin/nutch readdb localweb/crawldb -topN 3 urlpath
(2) Use the readseg tool to read information about the downloaded segments.
To list them:
$ bin/nutch readseg -list -dir localweb/segments/
For full details, pick one segment and dump it (readseg -dump needs an output directory; segdump is used here):
$ s=`ls -d localweb/segments/* | head -1`
$ bin/nutch readseg -dump $s segdump
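The back-quoted one-liner just selects the first segment directory; because Nutch names segments with timestamps, `head -1` picks the oldest one. The idiom can be tried on its own with scratch directories (the paths below are made up for illustration):

```shell
# Simulate two timestamped segment directories
mkdir -p /tmp/segdemo/20101101010101 /tmp/segdemo/20101202020202
# 'ls -d' lists them in lexicographic (= chronological) order,
# so 'head -1' selects the oldest segment
s=$(ls -d /tmp/segdemo/* | head -1)
echo "$s"    # prints: /tmp/segdemo/20101101010101
```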