To set up a Nutch environment on Windows, Cygwin must be installed. Cygwin is a Unix-like environment that runs on Windows.
1. Install the JDK
JDK 1.6, download from:
http://www.sun.com/download/
Installation path: C:\Program Files\Java\jdk1.6.0_23
Configure the PATH environment variable by appending: ;%JAVA_HOME%\bin;%TOMCAT_HOME%\bin
Configure the JAVA_HOME environment variable: C:\Program Files\Java\jdk1.6.0_23
Configure the JAVA_BIN environment variable: C:\Program Files\Java\jdk1.6.0_23\bin
Configure the CLASSPATH environment variable: %JAVA_HOME%/lib/dt.jar;%JAVA_HOME%/lib/tools.jar
2. Install Tomcat
The version is 5.0. (You must use Tomcat 5.0. With Tomcat 6.0, exceptions occur at runtime, for example: "Attribute value details.getValue("url") is quoted with " which must be escaped when used within the value".)
Download from:
http://tomcat.apache.org/
Set the TOMCAT_HOME environment variable: c:\tomcat
3. Install Cygwin (a simulated Linux environment on Windows)
Download Cygwin
I want to say more about Cygwin here, because most of the time I spent configuring the Nutch environment went into Cygwin. First, unlike most software, it cannot be downloaded directly as a single package. Instead, a small downloader-like program fetches the packages from a Cygwin mirror on the Internet. I am not sure why it works this way; perhaps the packages are updated often and this makes maintenance easier.
Step 1: download http://www.cygwin.com/setup.exe. The file itself is only a small downloader; running it starts the actual download.
It offers three modes:
- Install from the Internet. (This is said to be time-consuming.)
- Download without installing. (This method is recommended.)
- Install from a local directory.
After the download is complete, run setup.exe again.
Select "Install from Local Directory" and click Next.
You do not need to change anything on the next page; just click Next (this is the directory where Cygwin will be installed).
Select the directory that holds the Cygwin packages you downloaded and click Next to start the installation.
After the installation is complete, double-click the Cygwin shortcut on the desktop to open the Cygwin terminal window.
After the above process, Cygwin is installed and ready for use.
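Once the Cygwin terminal opens, it can be worth confirming that the variables from steps 1 and 2 are visible inside it. The following is only a minimal sketch: the function name check_env is made up for illustration and is not part of Cygwin or Nutch.

```shell
#!/bin/sh
# check_env NAME...: print each environment variable's value, or a warning
# if it is unset or empty. Returns non-zero when at least one is missing.
check_env() {
  missing=0
  for name in "$@"; do
    # printenv exits non-zero when the variable is not in the environment.
    if value=$(printenv "$name") && [ -n "$value" ]; then
      echo "$name=$value"
    else
      echo "$name is not set"
      missing=1
    fi
  done
  return "$missing"
}

# The variables configured in steps 1 and 2.
check_env JAVA_HOME TOMCAT_HOME || echo "Fix the Windows environment variables, then reopen Cygwin."
```

If a variable is reported as not set, reopening the Cygwin window after changing the Windows settings is usually enough, since a running terminal keeps the environment it was started with.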
4. Download and configure Nutch
Nutch download:
http://apache.etoak.com//nutch/
http://apache.etoak.com//nutch/apache-nutch-1.2-bin.zip (this guide uses 1.2; the latest version is 1.3)
Configuring Nutch:
- Extract Nutch to d:\nutch\nutch-1.2
- Create a folder named urls under the d:\nutch\nutch-1.2 directory, create the file urls\nutch.txt in it, and write the address of the website to be crawled in nutch.txt, for example http://www.my400800.cn/ (note that the trailing / is required).
- Open the conf\crawl-urlfilter.txt file and change
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
to
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*my400800.cn/
(the trailing / is required here too).
- Open the nutch/conf/nutch-site.xml file and fill in the <configuration></configuration> element as follows:
<configuration>
<property>
<name>http.agent.name</name>
<value>HD nutch agent</value>
</property>
<property>
<name>http.agent.version</name>
<value>1.2</value>
</property>
</configuration>
- Start crawling (run the following command in the Cygwin window opened above):
/cygdrive/d/nutch/nutch-1.2/bin/nutch crawl -dir localdownweb -depth 1 -threads 1 -topN 10 urls >& /cygdrive/d/nutch/nutch-1.2/logs/log1.log
crawl: tells nutch.jar to run the main method of the Crawl class.
urls: the directory that holds the text file of URLs to crawl.
-dir localdownweb: the location where the crawl results are saved.
-depth 1: the crawl depth, that is, the number of fetch rounds (I think "number of rounds" describes it better). A value of 1 is recommended during testing.
-threads 1: the number of concurrent fetch threads.
-topN 10: the maximum number of pages fetched at each depth level.
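The invocation above can be wrapped in a small script so that the paths and options are set in one place. This is only a sketch: the variable names and the crawl_cmd helper are invented for illustration, the install path is the one used in this guide, and running from the Nutch directory is an assumption made so that the relative urls path resolves.

```shell
#!/bin/sh
# Paths and options from this guide; adjust them if Nutch lives elsewhere.
NUTCH_HOME=/cygdrive/d/nutch/nutch-1.2
SEED_DIR=urls          # directory that holds nutch.txt
OUT_DIR=localdownweb   # -dir: where crawl data is written
DEPTH=1                # -depth: number of fetch rounds
THREADS=1              # -threads: concurrent fetch threads
TOPN=10                # -topN: maximum pages fetched per round
LOG="$NUTCH_HOME/logs/log1.log"

# crawl_cmd (hypothetical helper): print the full command line.
crawl_cmd() {
  echo "$NUTCH_HOME/bin/nutch crawl -dir $OUT_DIR -depth $DEPTH -threads $THREADS -topN $TOPN $SEED_DIR"
}

if [ -x "$NUTCH_HOME/bin/nutch" ]; then
  # Assumption: run from the Nutch directory so "urls" is found.
  # "> file 2>&1" sends both normal output and errors to the log file.
  (cd "$NUTCH_HOME" && $(crawl_cmd) > "$LOG" 2>&1)
else
  echo "Nutch not found; would run: $(crawl_cmd)"
fi
```

Because the command is built from variables, raising the depth or switching the output directory for a later run only means editing one line at the top.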
If the command fails with an error at this point, check the seed file: create the urls directory under the directory above, create the file nutch.txt in it, and write the URL to crawl in it, in the form http://www.my400800.cn/ (do not forget the trailing slash).
Then run the preceding command again; this time the crawl starts successfully.
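A missing or malformed seed file can be caught before a crawl is started. The sketch below uses a made-up function, check_seed, and tests each URL with grep -E, which only approximates the Java regular expressions that Nutch applies to crawl-urlfilter.txt. The demo runs in a scratch directory rather than in the real Nutch folder.

```shell
#!/bin/sh
# check_seed FILE PATTERN: verify that the seed file exists and that every
# URL in it ends with "/" and matches PATTERN (an extended regex).
check_seed() {
  seed_file=$1
  pattern=$2
  if [ ! -f "$seed_file" ]; then
    echo "missing seed file: $seed_file"
    return 1
  fi
  rc=0
  while IFS= read -r url || [ -n "$url" ]; do
    # Skip blank lines.
    if [ -z "$url" ]; then
      continue
    fi
    case $url in
      */) ;;
      *) echo "no trailing slash: $url"; rc=1 ;;
    esac
    if ! printf '%s\n' "$url" | grep -Eq "$pattern"; then
      echo "rejected by filter: $url"
      rc=1
    fi
  done < "$seed_file"
  return "$rc"
}

# Demo in a temporary directory so nothing under d:\nutch is touched.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/urls"
echo 'http://www.my400800.cn/' > "$demo_dir/urls/nutch.txt"
check_seed "$demo_dir/urls/nutch.txt" '^http://([a-z0-9]*\.)*my400800.cn/' && echo "seed file looks fine"
```

The pattern passed to check_seed is the filter line from crawl-urlfilter.txt without its leading +, which in that file only marks the rule as an accept rule.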