Getting Started with Nutch (1) -- Preparation and Intranet Crawling

(Note: I am also a beginner. If you find any errors, please give me your advice. Thank you!)

 

Environment requirements:

1. JDK 1.5 or higher.

2. Tomcat 5.x or higher.

3. On Windows, the Linux emulation environment Cygwin is required to provide shell support.

 

Preparations:

1. Download Nutch from http://lucene.apache.org/nutch/release/ and unzip it. In this example it is unpacked to D:\nutch-1.0.

2. In the Cygwin command window, use the shell command "cd /cygdrive/d/nutch-1.0" to switch the current working directory to the Nutch installation directory. /cygdrive is the virtual directory through which Cygwin exposes the local drives, d is the drive letter, and nutch-1.0 is the Nutch installation directory. Enter the "bin/nutch" command to test whether the command is available; when it executes correctly, it prints the list of Nutch subcommands, as shown below.
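A minimal sketch of this check, assuming Nutch was unpacked to D:\nutch-1.0 as above (the exact command list printed varies by version):

Shell code
cd /cygdrive/d/nutch-1.0
bin/nutch
# prints a usage message listing subcommands such as
# crawl, inject, generate, fetch, updatedb, readdb, index ...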


3. You need to give your web spider a name; this is required. Find the conf/nutch-default.xml file in the Nutch directory, search for http.agent.name, and set the value of this property. This value is carried in the HTTP request header when a web page is fetched, to identify your web spider.
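For example, the property looks like this (the value MyNutchSpider below is just a placeholder; pick your own name):

Xml code
<property>
  <name>http.agent.name</name>
  <value>MyNutchSpider</value>
</property>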

 

Now we are ready to crawl web pages.

We can use either of the following methods.

1. Use the one-step crawl command, which is usually used for intranet crawling. The operation is simple, but there are many restrictions.

2. The more flexible and convenient Internet crawling mode, which uses lower-level commands such as inject, generate, fetch, and updatedb (see the sketch after this list).
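As a preview, here is a minimal sketch of that step-by-step mode, assuming a crawl directory named crawl; the next article covers it in detail:

Shell code
bin/nutch inject crawl/crawldb urls               # seed the crawl db with the start URLs
bin/nutch generate crawl/crawldb crawl/segments   # generate a fetch list in a new segment
s=`ls -d crawl/segments/* | tail -1`              # pick the newest segment
bin/nutch fetch $s                                # fetch the pages in that segment
bin/nutch updatedb crawl/crawldb $s               # fold the fetch results back into the db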

 

First, we will implement the first method step by step.

1. Create a new urls directory under the Nutch installation directory and add a text file to it. The file's content is the URL(s) of the website you want to crawl; my example address is www.sina.com.cn.
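For example (the file name seed.txt below is arbitrary):

Shell code
mkdir urls
echo "http://www.sina.com.cn/" > urls/seed.txt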

2. Modify conf/crawl-urlfilter.txt: URLs matching a rule prefixed with "+" are allowed to be downloaded. The default rule is as follows:

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

To allow only pages from sina.com.cn to be downloaded, change it to:

+^http://([a-z0-9]*\.)*sina.com.cn/

3. Run the crawl command. A typical command is as follows:

Shell code
bin/nutch crawl urls -dir javaeye -depth 3 -topN 100 -threads 3

-dir: the directory in which to store the crawl results

-depth: the link depth to crawl, starting from the seed pages

-topN: at each depth level, fetch only the top N URLs

-threads: the number of download threads
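After the crawl completes, you can sanity-check the result with the readdb command; a quick check, assuming the javaeye output directory from the command above:

Shell code
bin/nutch readdb javaeye/crawldb -stats
# prints crawl db statistics, e.g. the total number of URLs known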

 

After this step is completed, you can search.

4. Search.

Deploy the nutch-1.0.war found under the Nutch directory to Tomcat's webapps directory and start Tomcat so that the war is unpacked. Then find the nutch-site.xml file under the unpacked nutch-1.0 webapp directory (WEB-INF/classes) and modify its content as follows:

Xml code
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>yourAgentName</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <value>D:/nutch-1.0/javaeye</value>
  </property>
</configuration>
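Putting the deployment together, a minimal sketch under Cygwin, assuming Tomcat is installed at D:/tomcat (under plain Windows, use catalina.bat instead of catalina.sh):

Shell code
cp /cygdrive/d/nutch-1.0/nutch-1.0.war /cygdrive/d/tomcat/webapps/   # deploy the war
/cygdrive/d/tomcat/bin/catalina.sh start   # Tomcat unpacks it to webapps/nutch-1.0
# then edit webapps/nutch-1.0/WEB-INF/classes/nutch-site.xml as shown above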

Restart Tomcat (note that searcher.dir must point to the directory you passed to -dir in the crawl step). Visit http://localhost:8080/nutch-1.0 in a browser and you will see the Nutch search page.

Enter the keyword you want to search for and start the search experience!

 

 

There are still some problems to solve:

 

1. Searching for Chinese characters returns garbled text, but this is not a problem with Nutch. Modify the Tomcat configuration file tomcat6\conf\server.xml, adding the URIEncoding and useBodyEncodingForURI attributes to the HTTP Connector:

Xml code
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"
           useBodyEncodingForURI="true"/>

2. The web snapshot (cached page) is garbled. Modify webapps\nutch-1.0\cached.jsp, changing content = new String(bean.getContent(details)) to content = new String(bean.getContent(details), "UTF-8").
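That is, in cached.jsp (shown here as a before/after; the line's position in the JSP may vary):

Java code
// before: decodes the raw page bytes with the platform default charset
content = new String(bean.getContent(details));
// after: decode explicitly as UTF-8 so Chinese text renders correctly
content = new String(bean.getContent(details), "UTF-8");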

 

3. Garbled Chinese characters appear on the search result page. It seems to be caused by <jsp:include>, but I have not found a solution yet. If anyone knows how to solve this problem, please leave a message and tell me. Thank you!

 

 

The next article will introduce the Internet crawling mode. Please stay tuned!
