I once had a chance to work on a search engine project, but in the end the two sides could not agree on a price and the deal fell through. I deeply regret losing such a good opportunity to practice. Still, I do not want to give up on learning and practicing search engine development, and I have seen many people online recommend Nutch. So I am going to learn Nutch, starting with installing and using it. The following is a record of how I installed Nutch on Windows XP SP2.
Environment required by Nutch:
JDK 1.4.x or JDK 1.5
Tomcat 4.x or later
Cygwin
Software download addresses:
J2SE 5.0 http://java.sun.com/javase/downloads/index.html
Tomcat 5.5 http://tomcat.apache.org/download-55.cgi
Cygwin http://www.cygwin.com/
Nutch-0.7.2 http://lucene.apache.org/nutch/
Installation steps (the installation directories can be chosen freely):
1. Install the JDK. From what I read online, Nutch supports JDK 1.4, but I installed JDK 1.5 so that I could install Tomcat 5.5.
My installation path: F:\project\java\jdk5
2. Install Cygwin. There are many guides for this online; I recommend installing from a local directory (download the packages first, then install from the local copy).
My installation path: E:\Program files\cygwin\
3. Install Tomcat. The Nutch instructions say Tomcat 4.x is supported; I installed Tomcat 5.5.
My installation path: F:\project\Tomcat 5.5
4. Install Nutch-0.7.2.zip
Unzip the downloaded compressed package to: F:\project\nutch-0.7.2\
Configuration steps:
1. Configure the environment in Cygwin
Edit E:\Program Files\cygwin\etc\profile and set:
PATH="/usr/local/bin:/usr/bin:/bin:$PATH:/cygdrive/f/project/java/jdk5"
export NUTCH_JAVA_HOME=/cygdrive/f/project/java/jdk5
export JAVA_HOME=/cygdrive/f/project/java/jdk5
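After editing the profile and opening a new Cygwin window, it can be worth checking that the variables really point at a JDK. The following is only a rough sketch: check_jdk is a hypothetical helper of my own, not something shipped with Nutch or Cygwin, and it assumes a JDK directory contains bin/java (or bin/java.exe on Windows).

```shell
#!/bin/sh
# check_jdk DIR: hypothetical helper (not part of Nutch or Cygwin).
# Prints "ok: DIR" if DIR looks like a JDK install, i.e. it contains an
# executable bin/java or bin/java.exe, and "missing: DIR" otherwise.
check_jdk() {
    if [ -x "$1/bin/java" ] || [ -x "$1/bin/java.exe" ]; then
        echo "ok: $1"
    else
        echo "missing: $1"
    fi
}

# Check the two variables set in /etc/profile. If a variable is not set
# at all, the placeholder "unset" is checked and reported as missing.
check_jdk "${NUTCH_JAVA_HOME:-unset}"
check_jdk "${JAVA_HOME:-unset}"
```

If either line prints "missing", the path in /etc/profile probably does not match the real JDK directory; remember that drive F: appears as /cygdrive/f inside Cygwin.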
2. Configure Nutch
1) Configure the crawl filter to specify which site addresses to crawl
Open F:\project\nutch-0.7.2\conf\crawl-urlfilter.txt
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*gucas.ac.cn/
Replace gucas.ac.cn above with the domain name you want to crawl.
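To get a feeling for which URLs the filter line lets through, the pattern can be tried out in the Cygwin shell with grep. This is only an approximation under my own assumptions: Nutch evaluates the expression itself in Java, grep -E is just a convenient stand-in, url_accepted is a hypothetical helper name, and the example URLs are made up for illustration. The leading "+" in the file means "accept" and is not part of the regular expression.

```shell
#!/bin/sh
# The accept pattern from crawl-urlfilter.txt, without the leading "+".
PATTERN='^http://([a-z0-9]*\.)*gucas.ac.cn/'

# url_accepted URL: hypothetical helper that prints "accept" if the URL
# matches PATTERN and "reject" otherwise.
url_accepted() {
    if printf '%s\n' "$1" | grep -Eq "$PATTERN"; then
        echo accept
    else
        echo reject
    fi
}

url_accepted "http://www.gucas.ac.cn/index.html"   # subdomain of the site
url_accepted "http://gucas.ac.cn/"                 # bare domain
url_accepted "http://www.example.com/"             # a different site
```

The first two URLs match and the third does not. Note that the dots in gucas.ac.cn are not escaped, so each of them matches any single character; writing gucas\.ac\.cn is stricter.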