Using Nutch
Today I checked the site logs and found several referrer links from searches for "nutch". In fact I had only mentioned the word in an article on Java coding conventions, so that result must have disappointed the visitors who landed here.
Below are some notes from my experiments with Nutch, for your reference. Note that Nutch has no stable release yet; it is being modified continuously in response to feedback, and at present it does not support Chinese retrieval. All in all, the current version is not very practical for Chinese users, which I suspect is also why chedong, who has been studying and following Nutch for some time, has not written up any notes on it.
A few days ago I chatted with chedong on MSN, and we agreed that the best way to add search to this site is his open-source WebLucene package, which is built on Lucene. Nutch, however, seems better suited to building vertical search engine sites, at least that is my impression for now.
1. Download and install
For some reason the official site cannot be reached directly from here. I used the 2003-09-18 nightly package (if you are interested, you can download it from here) and tried it under Red Hat Linux 8.0 + JRE 1.4.1 + Tomcat 4.1.
tar zxvf nutch-2003-09-18.tar.gz
cd nutch-2003-09-18    <---- the directory the commands are run from is called $NUTCH_HOME below, for description only
ant
ant package
bin/nutch    <---- if everything is normal, usage information starting with "Usage: nutch COMMAND" should appear
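Before the build it is worth a quick sanity check that the JRE and Ant mentioned above are actually on the PATH (plain version checks, nothing Nutch-specific):
---------------------------------------
java -version      # should report something in the 1.4.x line
ant -version       # Ant is needed for the "ant" and "ant package" steps above
---------------------------------------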
2. The test-run script explained
The commands below are taken from the tutorial. They should be run in sequence as a single script: $s1, $s2, and $s3 are assigned with the same expression, but their values differ because they depend on the point in the run at which each is evaluated. The first time through I was careless, split the script up and ran the pieces separately, and got wrong results. :)
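To see why the same expression yields different values: each generate step creates a new directory under segments/ whose name appears to begin with the year (hence the 2* pattern), so `tail -1` always picks up the newest one. For example:
---------------------------------------
# each "bin/nutch generate db segments" adds a new segment directory,
# so this same line returns a different (the newest) name each time it runs
s1=`ls -d segments/2* | tail -1`    # e.g. something like segments/2003xxxxxxxx (name differs per run)
---------------------------------------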
Initial preparation
mkdir db                                    <---- create a directory to hold the web database
mkdir segments                              <---- create a directory to hold the fetched segments
bin/nutch admin db -create                  <---- create a new, empty database

First fetch
bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000    <---- pick URLs out of the DMOZ listing and add them to the database
bin/nutch generate db segments              <---- generate a fetchlist from the database contents
s1=`ls -d segments/2* | tail -1`            <---- record the name of the fetchlist directory just generated (the last one)
bin/nutch fetch $s1                         <---- fetch the pages with the robot
bin/nutch updatedb db $s1                   <---- update the database with the fetch results

Second fetch
bin/nutch analyze db 5                      <---- analyze page links, 5 iterations
bin/nutch generate db segments -topN 1000   <---- generate a new fetchlist for the top-scoring 1000 URLs
s2=`ls -d segments/2* | tail -1`            <---- then fetch, update, and run two more iterations of link analysis
bin/nutch fetch $s2
bin/nutch updatedb db $s2

Third fetch
bin/nutch analyze db 2
bin/nutch generate db segments -topN 1000
s3=`ls -d segments/2* | tail -1`
bin/nutch fetch $s3
bin/nutch updatedb db $s3
bin/nutch analyze db 2                      <---- (preparation for the next round?)

Indexing and deduplication
bin/nutch index $s1
bin/nutch index $s2
bin/nutch index $s3
bin/nutch dedup segments dedup.tmp

Restart Tomcat
catalina.sh start                           <---- start it from the directory that contains ./segments
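The three fetch rounds above all repeat the same generate / fetch / updatedb pattern, with a link-analysis pass between rounds. Purely as a sketch (it assumes you are in $NUTCH_HOME with db and segments already created, and it uses only the commands listed above), the cycle could be factored like this; the expanded, un-factored version is the all.sh script given at the end of this article:
---------------------------------------
#!/bin/bash
# Sketch only: one generate -> fetch -> updatedb round, built from the commands above.
crawl_round() {
    if [ -z "$1" ]; then
        bin/nutch generate db segments              # first round: no -topN limit
    else
        bin/nutch generate db segments -topN $1     # later rounds: only the top-N URLs
    fi
    seg=`ls -d segments/2* | tail -1`               # the segment directory just generated
    bin/nutch fetch $seg                            # fetch the pages in that segment
    bin/nutch updatedb db $seg                      # fold the results back into the database
}

crawl_round;      s1=$seg
bin/nutch analyze db 5
crawl_round 1000; s2=$seg
bin/nutch analyze db 2
crawl_round 1000; s3=$seg
bin/nutch analyze db 2

bin/nutch index $s1
bin/nutch index $s2
bin/nutch index $s3
bin/nutch dedup segments dedup.tmp
---------------------------------------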
3. Script modification and download
The DMOZ file is far too large to download, and for a simple experiment there is no real need to pick URLs out of it; it is enough to inject this site's own URLs instead. You can download my modified script for reference; put it in the $NUTCH_HOME directory and run it with:
sh all.sh
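The modified script feeds the seeds in with -urlfile urls.txt, so a plain-text list of start URLs (one per line) has to exist in $NUTCH_HOME before it is run. A minimal example, with a placeholder address that you should replace with your own site's pages:
---------------------------------------
# create a one-URL-per-line seed file; the address below is only an example
cat > urls.txt <<EOF
http://hedong.3322.org/
EOF
---------------------------------------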
4. Web Search
After all that work we have only fetched the pages, parsed them, and built the index. The steps below use the JSP application shipped with Nutch to provide the actual search service:
cd $TOMCAT_HOME/webapps
mv ROOT ROOT.old
mkdir ROOT
cd ROOT
cp $NUTCH_HOME/nutch-2003-09-18.war ./ROOT.war
jar xvf ROOT.war
cd $NUTCH_HOME
$TOMCAT_HOME/bin/shutdown.sh
$TOMCAT_HOME/bin/catalina.sh start
If nothing has gone wrong, the search page should now be accessible.
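A quick way to confirm the webapp actually came up (this assumes Tomcat is listening on its default port 8080; adjust if your installation differs):
---------------------------------------
curl -I http://localhost:8080/    # a 200 response means the Nutch search page is being served
---------------------------------------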
My trial installation is at http://cdls.nstl.gov.cn/se/ (I changed things a little there and did not deploy it under the ROOT directory), for your reference. For now do not search for Chinese; only English terms can be retrieved, for example hedong or lucene.
This was a hurried first try; you are welcome to get in touch.
References:
On-site full-text retrieval solution based on Lucene/XML
http://www.chedong.com/tech/weblucene.html
Lucene Study Notes (2)
http://hedong.3322.org/archives/000208.html
The all.sh script I ran:
---------------------------------------
#!/bin/bash
mkdir db
mkdir segments
bin/nutch admin db -create
bin/nutch inject db -urlfile urls.txt
bin/nutch generate db segments
s1=`ls -d segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch updatedb db $s1
bin/nutch analyze db 5
bin/nutch generate db segments -topN 100
s2=`ls -d segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch updatedb db $s2
bin/nutch analyze db 2
bin/nutch generate db segments -topN 100
s3=`ls -d segments/2* | tail -1`
echo $s3
bin/nutch fetch $s3
bin/nutch updatedb db $s3
bin/nutch analyze db 2
bin/nutch index $s1
bin/nutch index $s2
bin/nutch index $s3
bin/nutch dedup segments dedup.tmp
------------------------------