Using Nutch (1)

Source: Internet
Author: User

Today I checked the site logs and found several referrals from searches for "nutch". In fact, I have only mentioned the word once, in the article on Java coding specifications, so the results must have disappointed the visitors who arrived that way.
I am therefore publishing some of my experiments with nutch for reference. Note that nutch has no stable release yet and is being revised continually based on feedback. It does not currently support Chinese retrieval, so the current version is not practical for Chinese users. I suspect this is also why chedong, who has been studying and following nutch, has not written up his notes.
A few days ago I talked with chedong on MSN. I think the best solution for searching this site is WebLucene, his open source package built on Lucene. Nutch, for the moment at least, seems better suited to building vertical search engine websites.

1. Download and install

For some reason, the nutch website cannot be accessed directly from here. I used the 2003-09-18 package (if you are interested, you can download it from here) and tried it on Red Hat Linux 4.1 + JRE 1.4.1 + Tomcat.

tar zxvf nutch-2003-09-18.tar.gz
cd nutch-2003-09-18    <---- this directory, where the commands are run, is called $nutch_home below (for description only)
ant
ant package
bin/nutch    <---- if everything is normal, the words "Usage: nutch command" and so on should appear
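
Before building, it can help to confirm that the tools the steps above rely on are on the PATH. The following is only a minimal sketch for a POSIX shell; the function names check_tool and report_tools are hypothetical helpers, not part of nutch.

```shell
# check_tool NAME: succeed if NAME is a command that the shell can find.
check_tool() {
    command -v "$1" >/dev/null 2>&1
}

# report_tools NAME...: print one status line per tool and
# return non-zero if at least one of them is missing.
report_tools() {
    missing=0
    for tool in "$@"; do
        if check_tool "$tool"; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For this article's setup, a call such as `report_tools tar java ant` before running ant would be a reasonable check.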

2. Test run: script walkthrough

The script below is excerpted from the tutorial. Its commands must be run in sequence, as a script. $S1, $S2, and $S3 are all assigned by the same command, but their values differ because the result depends on the context in which the command runs. The first time, I made the mistake of splitting the script up and running the pieces separately, and the results were wrong. :)
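
Why the same command yields different values can be shown with a small sketch. It assumes, as the script's `segments/2*` pattern suggests, that segment directories have timestamp-like names beginning with 2. The function name latest_segment and the directory names are made up for illustration.

```shell
# latest_segment DIR: print the lexically last entry of DIR whose name starts with 2.
# With timestamp-style names, the lexically last entry is the newest one.
latest_segment() {
    ls -d "$1"/2* 2>/dev/null | tail -1
}

demo_dir=$(mktemp -d)
mkdir "$demo_dir/20030918100000"    # stands in for the first "generate"
S1=$(latest_segment "$demo_dir")
mkdir "$demo_dir/20030918110000"    # stands in for the second "generate"
S2=$(latest_segment "$demo_dir")
echo "S1=${S1##*/} S2=${S2##*/}"    # prints: S1=20030918100000 S2=20030918110000
rm -rf "$demo_dir"
```

Running the assignment again after another directory has appeared picks up the new directory, which is why running the pieces out of order gives wrong results.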

Initial preparation
mkdir db    Create a directory to store the web database
mkdir segments
bin/nutch admin db -create    Create a new, empty database

First fetch
bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000    Take URLs from the dmoz list and add them to the database
bin/nutch generate db segments    Generate a fetchlist from the database contents
S1=`ls -d segments/2* | tail -1`    The fetchlist just generated is in the last directory; store its name in S1
bin/nutch fetch $S1    Fetch the pages with the robot
bin/nutch updatedb db $S1    Update the database with the fetch results

Second fetch
bin/nutch analyze db 5    Analyze the page links, iterating 5 times
bin/nutch generate db segments -topN 1000    Generate a new fetchlist from the top 1000 URLs
S2=`ls -d segments/2* | tail -1`    Then fetch, update, and run 2 iterations of link analysis
bin/nutch fetch $S2
bin/nutch updatedb db $S2

Third fetch
bin/nutch analyze db 2
bin/nutch generate db segments -topN 1000
S3=`ls -d segments/2* | tail -1`
bin/nutch fetch $S3
bin/nutch updatedb db $S3
bin/nutch analyze db 2    (In preparation for the next round?)

Index and deduplication
bin/nutch index $S1
bin/nutch index $S2
bin/nutch index $S3
bin/nutch dedup segments dedup.tmp

Restart Tomcat
catalina.sh start    Run this from the directory that contains ./segments
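
The fetch rounds above repeat the same generate, fetch, updatedb pattern, so one round can be sketched as a function. This is an illustration under assumptions, not the tutorial's script: the NUTCH variable and the crawl_round function are hypothetical, and the function only calls the subcommands listed above, passing any extra options through unchanged.

```shell
# Command used to invoke nutch (hypothetical variable; override it for testing).
NUTCH=${NUTCH:-bin/nutch}

# crawl_round DB SEGMENTS [generate options...]
# Runs one generate/fetch/updatedb cycle and prints the segment it worked on.
# Output of the nutch commands is sent to stderr so that stdout carries
# only the segment name.
crawl_round() {
    db=$1; segs=$2; shift 2
    $NUTCH generate "$db" "$segs" "$@" >&2 || return 1
    # The newest segment directory holds the fetchlist just generated.
    seg=$(ls -d "$segs"/2* 2>/dev/null | tail -1)
    if [ -z "$seg" ]; then
        echo "no segment found in $segs" >&2
        return 1
    fi
    $NUTCH fetch "$seg" >&2 || return 1
    $NUTCH updatedb "$db" "$seg" >&2 || return 1
    echo "$seg"
}
```

With this, the first round would be `S1=$(crawl_round db segments)` and a later round `S2=$(crawl_round db segments -topN 1000)`, with the analyze calls kept between rounds as in the listing. The -topN spelling is written from memory of the tutorial of that period; check it against the usage message printed by `bin/nutch generate`. Stopping at the first failed step is a design choice of this sketch; the original script simply runs every command.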

3. Script Modification and download

The dmoz file is too large to download, and for a simple experiment there is no need to select URLs from it; the URLs of this site are enough. The modified script reads them from urls.txt.
You can download the script for reference. Change to the $nutch_home directory and run:

sh all.sh
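
The modified script takes its URLs from a file named urls.txt. A sketch of preparing such a file follows, assuming it is a plain list with one URL per line (this format is an assumption, not something stated above). The function name write_seed_file and the example URLs are placeholders.

```shell
# write_seed_file OUT URL...: write the given URLs to OUT, one per line,
# skipping anything that does not start with http:// or https://.
write_seed_file() {
    out=$1; shift
    : > "$out"                      # create or truncate the output file
    for url in "$@"; do
        case $url in
            http://*|https://*) echo "$url" >> "$out" ;;
            *) echo "skipped: $url" >&2 ;;
        esac
    done
}
```

For example, `write_seed_file urls.txt http://www.example.com/ http://www.example.com/docs/` would produce a two-line file in the current directory.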

4. Web Search

After all that work, we have only fetched the web pages, parsed them, and indexed them. The following shows how to use the JSP application that comes with nutch to provide the retrieval service.

cd $tomcat_home/webapps
mv ROOT ROOTold
mkdir ROOT
cd ROOT
cp $nutch_home/nutch-2003-09-18.war ./ROOT.war
jar xvf ROOT.war
cd $nutch_home
$tomcat_home/bin/shutdown.sh
$tomcat_home/bin/catalina.sh start
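
The backup and copy steps above can be wrapped in a function that refuses to overwrite an earlier backup. This is a sketch only: install_root_war is a hypothetical helper, it assumes Tomcat's default web application directory is named ROOT (upper case), and it leaves unpacking and restarting Tomcat to the commands shown above.

```shell
# install_root_war WAR WEBAPPS: move WEBAPPS/ROOT (if present) to WEBAPPS/ROOTold,
# then copy WAR into a fresh WEBAPPS/ROOT directory as ROOT.war.
install_root_war() {
    war=$1; webapps=$2
    if [ ! -f "$war" ]; then
        echo "war file not found: $war" >&2
        return 1
    fi
    if [ -e "$webapps/ROOTold" ]; then
        echo "backup already exists: $webapps/ROOTold" >&2
        return 1
    fi
    if [ -d "$webapps/ROOT" ]; then
        mv "$webapps/ROOT" "$webapps/ROOTold" || return 1
    fi
    mkdir "$webapps/ROOT" && cp "$war" "$webapps/ROOT/ROOT.war"
}
```

A call might look like `install_root_war "$nutch_home/nutch-2003-09-18.war" "$tomcat_home/webapps"`, followed by `jar xvf ROOT.war` inside the new ROOT directory.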

If nothing went wrong, the search page should now be accessible.
My trial installation is at http://cdls.nstl.gov.cn/se/ (I modified it so that it is not deployed under the root directory), for reference. Do not search for Chinese text for now; only English words, such as hedong or lucene, can be searched.

This was a hurried first trial. Comments and discussion are welcome.

References:
On-site full-text retrieval solution based on Lucene/XML
http://www.chedong.com/tech/weblucene.html

Lucene Study Notes (2)
http://hedong.3322.org/archives/000208.html

The sh script that was run
---------------------------------------
#!/bin/bash
# Run from the $nutch_home directory; urls.txt lists the URLs to inject.

mkdir db
mkdir segments
bin/nutch admin db -create
bin/nutch inject db -urlfile urls.txt
bin/nutch generate db segments
# The newest segment directory holds the fetchlist just generated
S1=`ls -d segments/2* | tail -1`
echo $S1
bin/nutch fetch $S1
bin/nutch updatedb db $S1
bin/nutch analyze db 5

# Second round: top 100 URLs
bin/nutch generate db segments -topN 100
S2=`ls -d segments/2* | tail -1`
echo $S2
bin/nutch fetch $S2
bin/nutch updatedb db $S2
bin/nutch analyze db 2

# Third round: top 100 URLs
bin/nutch generate db segments -topN 100
S3=`ls -d segments/2* | tail -1`
echo $S3
bin/nutch fetch $S3
bin/nutch updatedb db $S3
bin/nutch analyze db 2

# Index each segment, then remove duplicates
bin/nutch index $S1
bin/nutch index $S2
bin/nutch index $S3

bin/nutch dedup segments dedup.tmp
------------------------------
