I have recently been using Nutch to crawl the content of a few target sites and then analyze it.
Nutch Notes is a series of summaries of my experience using Nutch. I am writing down what I learn to share with you, and I hope to get everyone's advice.
Okay, enough chatter, let's get to the point. The first article is Quick Start: our goal is to get Nutch running quickly and retrieve the results we want.
The first thing to understand is: what is Nutch?
Nutch is an open source search engine built on Lucene. It includes everything you need and is a complete solution.
I: Install the JDK
If you already have the JDK installed and JAVA_HOME set, skip this step.
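To decide whether this step can be skipped, a small check along these lines may help. It is only a sketch: `jdk_ready` is a hypothetical helper name, and it assumes that a JDK (as opposed to a plain JRE) can be recognized by an executable `javac` under `$JAVA_HOME/bin`.

```shell
# jdk_ready: hypothetical helper, not part of Nutch or the JDK.
# Succeeds (exit status 0) when JAVA_HOME is set and contains an executable javac,
# fails (exit status 1) otherwise.
jdk_ready() {
    [ -n "${JAVA_HOME:-}" ] && [ -x "$JAVA_HOME/bin/javac" ]
}

if jdk_ready; then
    echo "JDK found at $JAVA_HOME, this step can be skipped"
else
    echo "JDK not found or JAVA_HOME not set, continue with the installation"
fi
```

The check looks for `javac` rather than `java` because a runtime-only install also ships `java`.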
Install the JDK:
sudo apt-get install sun-java5-jdk
Alternatively, download the .bin installer from Sun's website and run it.
Set JAVA_HOME
vi ~/.bashrc
Append the following at the end of the file:
export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun
export PATH=$PATH:$JAVA_HOME/bin
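If you would rather not edit the file by hand, the same two lines can be appended from a script. The following is a minimal sketch, not the only way to do it: `append_once` and `add_java_env` are hypothetical helper names, and the JDK path is simply the one used above, so adjust it if your JDK lives elsewhere.

```shell
# append_once FILE LINE: append LINE to FILE unless an identical line is already present.
# grep -x matches whole lines only, -F treats the pattern as a fixed string.
append_once() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# add_java_env FILE: add the two export lines from this article to FILE.
# Single quotes keep $PATH and $JAVA_HOME unexpanded, so they are written literally.
add_java_env() {
    append_once "$1" 'export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun'
    append_once "$1" 'export PATH=$PATH:$JAVA_HOME/bin'
}
```

Running `add_java_env ~/.bashrc` more than once is safe because existing lines are not duplicated. Open a new terminal (or run `. ~/.bashrc`) afterwards so the variables take effect.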
II: Download the latest version of Nutch (nutch-0.8.1)
wget http://apache.justdn.org/lucene/nutch/nutch-0.8.1.tar.gz
Then extract the archive:
tar zxvf nutch-0.8.1.tar.gz
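A quick sanity check after extraction can catch a truncated download early. This sketch assumes the archive unpacks into a `nutch-0.8.1` directory that contains the `bin/nutch` launcher script; `nutch_unpacked` is a hypothetical helper name.

```shell
# nutch_unpacked DIR: succeeds when DIR looks like an extracted Nutch release,
# i.e. it contains the bin/nutch launcher script (an assumption about the archive layout).
nutch_unpacked() {
    [ -f "$1/bin/nutch" ]
}

if nutch_unpacked nutch-0.8.1; then
    echo "Nutch extracted to nutch-0.8.1"
else
    echo "nutch-0.8.1/bin/nutch not found, check the download and extract again"
fi
```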