Install Nutch 0.9 on Windows [zz]

Source: Internet
Author: User

I. Environment:
1. Operating system: Windows XP, or Windows 2000 and later
2. Java 1.6, with the JAVA_HOME environment variable set
3. Cygwin. It is not strictly required, but the scripts that ship with Nutch only run in a shell environment, so Cygwin is used to provide the shell commands.
4. Nutch version: 0.9
5. Tomcat: 6.0

II. Installation and configuration of Nutch:

1. Install Cygwin 1.5.5 (I installed it to F:\cygsys), then decompress Nutch and place it under your user's home directory inside the Cygwin installation, that is, F:\cygsys\home\<user name> (I put it in F:\cygsys\home\DYK\nutch).

2. In the Cygwin environment, change to the nutch-0.9 directory and run bin/nutch as a test. If everything is working, the command prints its usage message.

3. Perform a website crawl test. The following example uses http://www.163.com/.

1) Create a new file named myurl, enter http://www.163.com/ in it, and save it. The file can be stored anywhere (mine is F:\cygsys\home\DYK\nutch\myurl). Also create a directory named logs for the crawler logs (mine is F:\cygsys\home\DYK\nutch\logs).
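As a rough sketch, the same seed file and log directory can be created from the Cygwin shell. The BASE variable is a hypothetical name introduced here for illustration; it falls back to a temporary directory so the snippet can be tried without touching a real installation.

```shell
#!/bin/sh
# BASE is a hypothetical variable for this sketch: the directory that holds
# nutch-0.9, myurl and logs (e.g. /home/DYK/nutch inside Cygwin).
BASE=${BASE:-$(mktemp -d)}

# Seed file: one URL per line.
printf '%s\n' 'http://www.163.com/' > "$BASE/myurl"

# Directory for crawler logs; -p makes this safe to re-run.
mkdir -p "$BASE/logs"

echo "seed file: $BASE/myurl"
echo "log dir:   $BASE/logs"
```

Setting BASE before running the script (for example BASE=/home/DYK/nutch) would place the files where the article keeps them.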

2) Open the nutch-0.9\conf\nutch-site.xml file and insert the following content between <configuration> and </configuration>:

<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot - this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

Fill in the empty <value></value> elements with your own values; leave the <name> elements unchanged. These settings exist because the crawler follows the robots protocol: when it requests pages, it sends this information to the crawled website to identify itself. Note that http.agent.name in particular must not be left empty (see note 3 at the end of this article).

3) Open the nutch-0.9\conf\crawl-urlfilter.txt file and replace MY.DOMAIN.NAME with the domain name used in myurl. (For example, I changed the line so that it begins "+^http://([a-z0-9]*\.)*", which allows all HTTP sites to be crawled.)
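To see which URLs a filter line would accept, the rule can be tried out in the shell before crawling. This is only a sketch: Nutch evaluates the rule as a Java regular expression, and grep -E is used here as a stand-in that behaves the same for this simple pattern. The rule restricted to 163.com is an assumed example based on the domain in myurl, not the line the author used.

```shell
#!/bin/sh
# An accept rule as it could appear in crawl-urlfilter.txt.
# The leading "+" means "accept"; the rest is the regular expression.
rule='+^http://([a-z0-9]*\.)*163.com/'
pattern=${rule#+}   # strip the "+" to get the bare regex

# Succeeds when the URL given as $1 matches the pattern.
matches() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

for url in http://www.163.com/ http://news.163.com/index.html http://www.example.com/; do
  if matches "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

The first two URLs are accepted and the third is rejected, because the optional subdomain group must be followed by 163.com/.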

4) Run the crawler with the following command in Cygwin:

bin/nutch crawl ../myurl -dir ../mydir -depth 2 >& ../logs/crawl1.log

Here, -dir specifies the directory where the crawl results are stored, -depth specifies the link depth of the crawl, and the final redirection sends the output to the log file.

After the run finishes, open the log file to see the crawl process in detail.
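The crawl step can also be wrapped in a small script. The following is a minimal sketch, assuming it is started from inside the nutch-0.9 directory; the helper name build_crawl_cmd and the variable names are made up for this example. It only launches the crawl when bin/nutch and the log directory are actually present.

```shell
#!/bin/sh
# Print the crawl command line for a seed file, output directory and depth.
build_crawl_cmd() {
  printf 'bin/nutch crawl %s -dir %s -depth %s\n' "$1" "$2" "$3"
}

SEEDS=../myurl            # seed file created earlier
OUT=../mydir              # crawl output directory
LOG=../logs/crawl1.log    # crawl log file

cmd=$(build_crawl_cmd "$SEEDS" "$OUT" 2)
echo "would run: $cmd > $LOG 2>&1"

# Only run when invoked from inside the nutch-0.9 directory.
if [ -x bin/nutch ] && [ -d "$(dirname "$LOG")" ]; then
  # "> file 2>&1" is the portable spelling of bash's ">& file".
  bin/nutch crawl "$SEEDS" -dir "$OUT" -depth 2 > "$LOG" 2>&1
else
  echo "bin/nutch or the log directory was not found; nothing was run"
fi
```

Writing the redirection as "> file 2>&1" keeps the script usable under plain sh as well as bash.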

4. Run Nutch on Tomcat

Copy nutch-0.9.war to the Tomcat\webapps directory.

Enter http://localhost:8080/nutch-0.9/ in the browser so that Tomcat expands nutch-0.9.war, then edit the webapps/nutch-0.9/WEB-INF/classes/nutch-site.xml file so that searcher.dir points to the crawl output directory:

<configuration>
  <property>
    <name>searcher.dir</name>
    <value>F:\\cygsys\\home\\DYK\\nutch\\mydir4</value>
  </property>
</configuration>

To support Chinese search, edit Tomcat\conf\server.xml. Find the HTTP connector and change it to:

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443" URIEncoding="UTF-8" useBodyEncodingForURI="true"/>
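The reason for URIEncoding="UTF-8" is that Tomcat 6 otherwise decodes the query string as ISO-8859-1, while a search term typed into a UTF-8 page arrives as percent-encoded UTF-8 bytes. The sketch below shows what those bytes look like for one Chinese character; the function name urlencode_bytes is invented for this example, and it assumes the search page submits its form in UTF-8.

```shell
#!/bin/sh
# Percent-encode every byte of the argument (valid, if verbose, URL encoding).
urlencode_bytes() {
  printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n' | sed 's/\(..\)/%\1/g' | tr 'a-f' 'A-F'
}

# U+4E2D written as its three UTF-8 bytes in octal, so the script
# does not depend on the encoding of this file.
query=$(printf '\344\270\255')

echo "query=$(urlencode_bytes "$query")"   # prints query=%E4%B8%AD
```

If the connector decoded %E4%B8%AD one byte at a time as ISO-8859-1, the search would receive three unrelated characters instead of the single character that was typed.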

Enter http://localhost:8080/nutch-0.9 in the browser and search for "NBA" to see the search results.

PS: This article on installing Nutch is well written; by following it step by step you can basically get everything set up.
But there are a few notes:
1. Before executing the crawl command, run export NUTCH_JAVA_HOME=/cygdrive/c/"Program Files"/Java/jdk-xxxx
2. Many people may be unfamiliar with Cygwin, but changing to an ordinary directory works much the same as in DOS. Windows drives appear under /cygdrive/<drive letter>.
3. The first nutch-site.xml file configured above requires a value for the agent name; otherwise jobs may fail.
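Note 1 depends on writing a Windows path in Cygwin's /cygdrive form. On Cygwin itself, the cygpath -u utility does this conversion; the sketch below does the same thing with plain shell tools so the mapping is visible. The function name win_to_cygdrive is made up for this example, and jdk-xxxx remains the placeholder from note 1.

```shell
#!/bin/sh
# Convert a Windows path with a drive letter to its /cygdrive form.
win_to_cygdrive() {
  # Drive letter, lower-cased: "C" -> "c"
  drive=$(printf '%s' "$1" | cut -c1 | tr 'A-Z' 'a-z')
  # Everything after "C:", with backslashes turned into forward slashes
  rest=$(printf '%s' "$1" | cut -c3- | tr '\\' '/')
  printf '/cygdrive/%s%s\n' "$drive" "$rest"
}

# jdk-xxxx is the article's placeholder; substitute the real JDK directory name.
NUTCH_JAVA_HOME=$(win_to_cygdrive 'C:\Program Files\Java\jdk-xxxx')
export NUTCH_JAVA_HOME
echo "NUTCH_JAVA_HOME=$NUTCH_JAVA_HOME"
```

Because the value is assigned from a command substitution, the space in "Program Files" needs no extra quoting here.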
