Using Nutch: A First Trial

Source: Internet
Author: User
Tags: command line, xsl, tomcat

As the saying goes, "To do a good job, one must first sharpen one's tools." The previous "fine solution" (detailed installation) article walked through installing Nutch on Windows. Now let us put it to a first trial and see for ourselves what Nutch can do.

Nutch can crawl Web pages in two ways. Intranet crawling targets a corporate intranet or a small number of Web sites and uses the single crawl command. Whole-web crawling targets the entire Internet and uses the lower-level inject, generate, fetch, and updatedb commands. This article builds a search function over the articles in the author's personal CSDN column (http://blog.csdn.net/zjzcl) as an example to explain the basic use of intranet crawling. It assumes that the JDK, Tomcat, and Resin are already installed on the reader's computer and that the corresponding environment configuration has been done.

1. Set the Nutch environment variable

In the Windows system environment variable settings, add a NUTCH_JAVA_HOME variable and set its value to the JDK installation directory. For example, the JDK on the author's computer is installed in D:\j2sdk1.4.2_09, so NUTCH_JAVA_HOME is set to D:\j2sdk1.4.2_09.
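A quick way to confirm the variable is visible from the shell is sketched below. This is a minimal sketch, not part of Nutch: the helper name check_java_home is hypothetical, and it assumes the JDK directory contains bin/java (or bin/java.exe on Windows).

```shell
# Hypothetical helper: verify that a directory looks like a JDK home.
check_java_home() {
  if [ -z "$1" ]; then
    echo "NUTCH_JAVA_HOME is not set"
    return 1
  fi
  # Assumption: a JDK home has an executable bin/java (bin/java.exe on Windows).
  if [ -x "$1/bin/java" ] || [ -x "$1/bin/java.exe" ]; then
    echo "found java under $1"
    return 0
  fi
  echo "no bin/java under $1"
  return 1
}

# Report on the current setting without aborting the shell session.
check_java_home "${NUTCH_JAVA_HOME:-}" || true
```

If the check reports that the variable is not set, reopening the Cygwin window after changing the Windows settings is usually needed, since a running shell keeps the environment it started with.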

2. Preparation before crawling Web pages

(1) Create a text file named url.txt in the Nutch installation directory and write into it the top-level URL of the Web site to crawl, that is, the page where the crawl starts. The author writes the following in this file:

http://blog.csdn.net/zjzcl

(2) Edit the conf/crawl-urlfilter.txt file and modify the MY.DOMAIN.NAME section:

# accept hosts in MY.DOMAIN.NAME

+^http://blog.csdn.net/zjzcl
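The two preparation steps interact: the seed URL must itself pass the filter, otherwise the crawl has nothing to fetch. The sketch below writes the seed file and then emulates the filter with grep. It is only an approximation under stated assumptions: the RULES list is a simplified stand-in for crawl-urlfilter.txt, the function filter_url is hypothetical, rules are applied in order with the first match deciding (+ accepts, - rejects), and grep -E stands in for the Java regular expressions Nutch actually uses.

```shell
# Seed file from step (1): one start URL per line.
printf '%s\n' 'http://blog.csdn.net/zjzcl' > url.txt

# Simplified stand-in for the rules in conf/crawl-urlfilter.txt (assumption).
RULES='# skip common non-page suffixes
-\.(gif|jpg|png|css|zip)$
# accept hosts in MY.DOMAIN.NAME
+^http://blog.csdn.net/zjzcl
# skip everything else
-.'

# Hypothetical emulation: first matching rule wins; no match means reject.
filter_url() {
  url=$1
  while IFS= read -r rule; do
    case $rule in
      ''|'#'*) continue ;;          # ignore blank lines and comments
    esac
    sign=${rule%"${rule#?}"}        # first character: + or -
    regex=${rule#?}                 # remainder: the pattern
    if printf '%s\n' "$url" | grep -Eq -- "$regex"; then
      if [ "$sign" = "+" ]; then return 0; else return 1; fi
    fi
  done <<EOF
$RULES
EOF
  return 1
}

# Check every seed against the filter before starting a crawl.
while IFS= read -r seed; do
  if filter_url "$seed"; then
    echo "accepted: $seed"
  else
    echo "rejected: $seed"
  fi
done < url.txt
```

Because the pattern is anchored with ^ and ends at /zjzcl, pages elsewhere on blog.csdn.net are rejected by the final catch-all rule in this sketch.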

3. Run the crawl command to crawl the site content

Double-click the Cygwin icon on your computer's desktop and, in the command-line window that opens, enter:

cd /cygdrive/i/nutch-0.7.1

Readers who are unsure what this command means should refer to the previous "fine solution" (detailed installation) article. Then enter:

bin/nutch crawl url.txt -dir crawled -depth 3 -threads 4 >& crawl.log

The program finishes after about 2 minutes. A folder named crawled now exists in the nutch-0.7.1 directory, along with a log file named crawl.log, which we can use to analyze any errors that were encountered. In the parameters of the command above, -dir specifies the directory where the crawled content is stored, -depth is the crawl depth counted from the site's top-level URL, and -threads is the number of concurrent fetch threads.
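The redirection at the end of the crawl command sends both standard output and standard error to the log file, which is why errors end up there. The sketch below illustrates this with a stand-in command and a simple scan of the log. The function names fake_crawl and count_problems and the sample messages are hypothetical and are not real Nutch output; the portable form "> file 2>&1" is used here, which in bash is equivalent to ">& file".

```shell
# Stand-in for a long-running command that writes to both streams.
# The messages are invented for this demo, not copied from Nutch.
fake_crawl() {
  echo "fetching page"
  echo "java.io.IOException: example failure" >&2
}

# Capture stdout and stderr together, as the crawl command does.
fake_crawl > demo.log 2>&1

# Hypothetical helper: count log lines that mention an error or exception.
count_problems() {
  grep -ciE 'error|exception' "$1"
}

count_problems demo.log
```

Running the same helper against crawl.log gives a quick first indication of whether the crawl ran cleanly; the matching lines themselves can be listed by replacing -c with -n.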

4. Use Tomcat to test searching

(1) Rename the ROOT folder under Tomcat\webapps to ROOT1;

(2) Copy nutch-0.7.1.war from the nutch-0.7.1 directory to Tomcat\webapps and rename it to ROOT (that is, ROOT.war, so that Tomcat deploys it as the root Web application);

(3) Open the nutch-site.xml file under ROOT\WEB-INF\classes and change it to the following:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<nutch-conf>
  <property>
    <name>searcher.dir</name>
    <value>I:/nutch-0.7.1/crawled</value>
  </property>
</nutch-conf>

Readers should change the "<value>I:/nutch-0.7.1/crawled</value>" line to match the crawl directory on their own system.
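Since a wrong searcher.dir path is an easy mistake to make, it can help to read the value back from the file and compare it with the directory given to -dir. The following is a minimal sketch: the function searcher_dir is hypothetical, and it assumes the <value> element sits on the line directly after <name>searcher.dir</name>, as in the file shown above.

```shell
# Hypothetical helper: print the value configured for searcher.dir.
# Assumes <value> is on the line immediately after the matching <name>.
searcher_dir() {
  sed -n '/<name>searcher\.dir<\/name>/{n;s/.*<value>\(.*\)<\/value>.*/\1/p;}' "$1"
}

# Example call (the path is an assumption; adjust it to your Tomcat layout):
# searcher_dir "ROOT/WEB-INF/classes/nutch-site.xml"
```

This is a line-oriented check rather than a real XML parse, so it only works while the file keeps the one-element-per-line layout.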
