Using Nutch
Today I checked the site logs and found several referrer links from searches for "nutch". In fact I had only mentioned the word in an article on Java coding conventions, so that result must have disappointed the visitors who landed here.
Below are some notes from my experiments with Nutch, for your reference. Note that Nutch has no stable release yet; it is being modified continuously in response to feedback, and at present it does not support Chinese retrieval. All in all, the current version is not very practical for Chinese users, which I suspect is also why chedong, who has been studying and following Nutch for some time, has not written up any notes on it.
A few days ago I chatted with chedong on MSN, and we agreed that the best way to add search to this site is his open-source WebLucene package, which is built on Lucene. Nutch, however, seems better suited to building vertical search engine sites, at least that is my impression for now.
1. Download and install
For some reason the official site cannot be reached directly from here. I used the 2003-09-18 nightly package (if you are interested, you can download it from here) and tried it under Red Hat Linux 8.0 + JRE 1.4.1 + Tomcat 4.1.
tar zxvf nutch-2003-09-18.tar.gz
cd nutch-2003-09-18    <---- the directory the commands are run from is called $NUTCH_HOME below, for description only
ant
ant package
bin/nutch    <---- if everything is normal, usage information starting with "Usage: nutch COMMAND" should appear
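Before the build it is worth a quick sanity check that the JRE and Ant mentioned above are actually on the PATH (plain version checks, nothing Nutch-specific):
---------------------------------------
java -version      # should report something in the 1.4.x line
ant -version       # Ant is needed for the "ant" and "ant package" steps above
---------------------------------------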
2. The test-run script explained
The commands below are taken from the tutorial. They should be run in sequence as a single script: $s1, $s2, and $s3 are assigned with the same expression, but their values differ because they depend on the point in the run at which each is evaluated. The first time through I was careless, split the script up and ran the pieces separately, and got wrong results. :)
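To see why the same expression yields different values: each generate step creates a new directory under segments/ whose name appears to begin with the year (hence the 2* pattern), so `tail -1` always picks up the newest one. For example:
---------------------------------------
# each "bin/nutch generate db segments" adds a new segment directory,
# so this same line returns a different (the newest) name each time it runs
s1=`ls -d segments/2* | tail -1`    # e.g. something like segments/2003xxxxxxxx (name differs per run)
---------------------------------------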
Initial preparation
mkdir db                                    <---- create a directory to hold the web database
mkdir segments                              <---- create a directory to hold the fetched segments
bin/nutch admin db -create                  <---- create a new, empty database

First fetch
bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000    <---- pick URLs out of the DMOZ listing and add them to the database
bin/nutch generate db segments              <---- generate a fetchlist from the database contents
s1=`ls -d segments/2* | tail -1`            <---- record the name of the fetchlist directory just generated (the last one)
bin/nutch fetch $s1                         <---- fetch the pages with the robot
bin/nutch updatedb db $s1                   <---- update the database with the fetch results

Second fetch
bin/nutch analyze db 5                      <---- analyze page links, 5 iterations
bin/nutch generate db segments -topN 1000   <---- generate a new fetchlist for the top-scoring 1000 URLs
s2=`ls -d segments/2* | tail -1`            <---- then fetch, update, and run two more iterations of link analysis
bin/nutch fetch $s2
bin/nutch updatedb db $s2

Third fetch
bin/nutch analyze db 2
bin/nutch generate db segments -topN 1000
s3=`ls -d segments/2* | tail -1`
bin/nutch fetch $s3
bin/nutch updatedb db $s3
bin/nutch analyze db 2                      <---- (preparation for the next round?)

Indexing and deduplication
bin/nutch index $s1
bin/nutch index $s2
bin/nutch index $s3
bin/nutch dedup segments dedup.tmp

Restart Tomcat
catalina.sh start                           <---- start it from the directory that contains ./segments
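The three fetch rounds above all repeat the same generate / fetch / updatedb pattern, with a link-analysis pass between rounds. Purely as a sketch (it assumes you are in $NUTCH_HOME with db and segments already created, and it uses only the commands listed above), the cycle could be factored like this; the expanded, un-factored version is the all.sh script given at the end of this article:
---------------------------------------
#!/bin/bash
# Sketch only: one generate -> fetch -> updatedb round, built from the commands above.
crawl_round() {
    if [ -z "$1" ]; then
        bin/nutch generate db segments              # first round: no -topN limit
    else
        bin/nutch generate db segments -topN $1     # later rounds: only the top-N URLs
    fi
    seg=`ls -d segments/2* | tail -1`               # the segment directory just generated
    bin/nutch fetch $seg                            # fetch the pages in that segment
    bin/nutch updatedb db $seg                      # fold the results back into the database
}

crawl_round;      s1=$seg
bin/nutch analyze db 5
crawl_round 1000; s2=$seg
bin/nutch analyze db 2
crawl_round 1000; s3=$seg
bin/nutch analyze db 2

bin/nutch index $s1
bin/nutch index $s2
bin/nutch index $s3
bin/nutch dedup segments dedup.tmp
---------------------------------------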
3. Script modification and download
The DMOZ file is far too large to download, and for a simple experiment there is no real need to pick URLs out of it; it is enough to inject this site's own URLs instead. You can download my modified script for reference; put it in the $NUTCH_HOME directory and run it with:
sh all.sh
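The modified script feeds the seeds in with -urlfile urls.txt, so a plain-text list of start URLs (one per line) has to exist in $NUTCH_HOME before it is run. A minimal example, with a placeholder address that you should replace with your own site's pages:
---------------------------------------
# create a one-URL-per-line seed file; the address below is only an example
cat > urls.txt <<EOF
http://hedong.3322.org/
EOF
---------------------------------------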
4. Web Search
After all that work we have only fetched the pages, parsed them, and built the index. The steps below use the JSP application shipped with Nutch to provide the actual search service:
cd $TOMCAT_HOME/webapps
mv ROOT ROOT.old
mkdir ROOT
cd ROOT
cp $NUTCH_HOME/nutch-2003-09-18.war ./ROOT.war
jar xvf ROOT.war
cd $NUTCH_HOME
$TOMCAT_HOME/bin/shutdown.sh
$TOMCAT_HOME/bin/catalina.sh start
If nothing has gone wrong, the search page should now be accessible.
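A quick way to confirm the webapp actually came up (this assumes Tomcat is listening on its default port 8080; adjust if your installation differs):
---------------------------------------
curl -I http://localhost:8080/    # a 200 response means the Nutch search page is being served
---------------------------------------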
My trial installation is at http://cdls.nstl.gov.cn/se/ (I changed things a little there and did not deploy it under the ROOT directory), for your reference. For now do not search for Chinese; only English terms can be retrieved, for example hedong or lucene.
This was a hurried first try; you are welcome to get in touch.
References:
On-site full-text retrieval solution based on Lucene/XML
http://www.chedong.com/tech/weblucene.html
Lucene Study Notes (2)
http://hedong.3322.org/archives/000208.html
The all.sh script I ran:
---------------------------------------
#!/bin/bash
mkdir db
mkdir segments
bin/nutch admin db -create
bin/nutch inject db -urlfile urls.txt
bin/nutch generate db segments
s1=`ls -d segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch updatedb db $s1
bin/nutch analyze db 5
bin/nutch generate db segments -topN 100
s2=`ls -d segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch updatedb db $s2
bin/nutch analyze db 2
bin/nutch generate db segments -topN 100
s3=`ls -d segments/2* | tail -1`
echo $s3
bin/nutch fetch $s3
bin/nutch updatedb db $s3
bin/nutch analyze db 2
bin/nutch index $s1
bin/nutch index $s2
bin/nutch index $s3
bin/nutch dedup segments dedup.tmp
------------------------------