Configuring the Nutch environment on Windows (with Cygwin)

Source: Internet
Author: User

To configure the Nutch environment on Windows, Cygwin must be installed. Cygwin is a Unix emulation environment that runs on Windows.

1. Install the JDK

JDK 1.6 download address:

http://www.sun.com/download/

Installation path: C:\Program Files\Java\jdk1.6.0_23

Add to the PATH environment variable: %JAVA_HOME%\bin;%TOMCAT_HOME%\bin

Configure the JAVA_HOME environment variable: C:\Program Files\Java\jdk1.6.0_23

Configure the JAVA_BIN environment variable: C:\Program Files\Java\jdk1.6.0_23\bin

Configure the CLASSPATH environment variable: %JAVA_HOME%/lib/dt.jar;%JAVA_HOME%/lib/tools.jar
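Once the variables are set, a quick check from a shell can confirm that JAVA_HOME really points at a JDK. The following is only a minimal sketch and not part of the original setup: the function name check_java_home is hypothetical, and the only assumption it makes is that a JDK directory contains an executable bin/java (bin/java.exe on Windows).

```shell
#!/bin/sh
# Hypothetical helper: report whether a directory looks like a JDK install.
# Assumption: a JDK has an executable bin/java (bin/java.exe on Windows).
check_java_home() {
    dir="$1"
    if [ -z "$dir" ]; then
        echo "JAVA_HOME is not set"
        return 1
    fi
    if [ -x "$dir/bin/java" ] || [ -x "$dir/bin/java.exe" ]; then
        echo "JDK found in $dir"
        return 0
    fi
    echo "no java executable under $dir/bin"
    return 1
}

# Check the value currently in the environment (empty if the variable is unset).
check_java_home "${JAVA_HOME:-}" || true
```

If the check fails, it is usually quicker to fix the variable now than to debug a failing Nutch start later.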

 

2. Install Tomcat

The version is 5.0. You must use Tomcat 5.0: with Tomcat 6.0, some unexplained exceptions occur at runtime, for example: "Attribute value details.getValue("url") is quoted with " which must be escaped when used within the value".

Download address:

http://tomcat.apache.org/

Set the TOMCAT_HOME environment variable: C:\tomcat

3. Install Cygwin (to simulate a Linux environment on Windows)

Download Cygwin


I want to say more about Cygwin here, because most of the time I spent configuring the Nutch environment went into Cygwin. First, unlike most software, it cannot be downloaded directly as a single package. Instead, a small downloader-like program fetches the packages from a Cygwin mirror on the Internet. I am not sure why it works this way; perhaps the packages are updated often and this makes maintenance easier.

Step 1: download http://www.cygwin.com/setup.exe. The file itself is small; running it is what starts the actual download.

It offers three installation modes:

  1. Install from Internet. This is said to be time-consuming.
  2. Download Without Installing. (This method is recommended.)
  3. Install from Local Directory.

After the download is complete, run setup.exe again.

Select Install from Local Directory and click Next.

You do not need to change the root directory (this is where Cygwin will be installed). Just click Next.

Select the directory that holds the Cygwin packages you downloaded and click Next to start the installation.

After the installation is complete, double-click the Cygwin shortcut on the desktop to start Cygwin.

 

After the above process, Cygwin is installed and ready for use.
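One thing that helps when working in the Cygwin shell is knowing how Windows paths look there: a drive such as D: appears under /cygdrive/d. The sketch below illustrates that mapping in plain shell. The function name win_to_cygdrive is hypothetical and only handles simple drive-letter paths; inside Cygwin itself, the bundled cygpath -u command is the proper tool.

```shell
#!/bin/sh
# Hypothetical helper: map a Windows path such as D:\nutch\nutch-1.2
# to the /cygdrive form that Cygwin uses for Windows drives.
# Inside Cygwin, `cygpath -u` does this conversion properly.
win_to_cygdrive() {
    # The first character is the drive letter, lowercased for /cygdrive/<letter>.
    drive=$(printf '%s' "$1" | cut -c1 | tr 'A-Z' 'a-z')
    # Drop the leading "D:" and turn every backslash into a forward slash.
    rest=$(printf '%s' "$1" | cut -c3- | tr '\\' '/')
    printf '/cygdrive/%s%s\n' "$drive" "$rest"
}

win_to_cygdrive 'D:\nutch\nutch-1.2'   # prints /cygdrive/d/nutch/nutch-1.2
```

Single quotes around the Windows path keep the shell from treating the backslashes as escape characters.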

4. Download and configure Nutch

Nutch download address:

http://apache.etoak.com//nutch/

http://apache.etoak.com//nutch/apache-nutch-1.2-bin.zip (version 1.2 is configured here; the latest version is 1.3)

Configuration of Nutch:

  1. Extract Nutch to D:\nutch\nutch-1.2
  2. Create a folder named urls under the D:\nutch\nutch-1.2 directory, create the file urls\nutch.txt in it, and write the address of the website to be crawled in nutch.txt, for example http://www.my400800.cn/ (note that the trailing / is required)
  3. Open the conf\crawl-urlfilter.txt file and change
    # accept hosts in MY.DOMAIN.NAME
    +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

    to
    # accept hosts in MY.DOMAIN.NAME
    +^http://([a-z0-9]*\.)*my400800.cn/
    (the trailing / is required here as well)

  4. Open the nutch/conf/nutch-site.xml file and edit the <configuration></configuration> element as follows:
    <configuration>
    <property>
    <name>http.agent.name</name>
    <value>HD nutch agent</value>
    </property>
    <property>
    <name>http.agent.version</name>
    <value>1.2</value>
    </property>
    </configuration>
  5. Start crawling (run the following command in the Cygwin terminal window started above)
    /cygdrive/d/nutch/nutch-1.2/bin/nutch crawl -dir localdownweb -depth 1 -threads 1 -topN 10 urls >& /cygdrive/d/nutch/nutch-1.2/logs/log1.log
    crawl: tells nutch.jar to run the main method of the crawl class.
    urls: the directory that holds the text file with the URLs to crawl.
    -dir localdownweb: the location where the crawled data is saved.
    -depth 1: the crawl depth, although I think "number of crawl rounds" describes it better. A value of 1 is recommended for testing.
    -threads 1: the number of concurrent fetch threads.
    -topN 10: the maximum number of pages fetched in each round.
    If an error occurs here, create the urls directory under the directory above and create the file nutch.txt in it. Its content is the URL to crawl, in the format http://www.my400800.cn/ (do not forget the trailing slash).
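Steps 2 and 3 are where the trailing slash tends to go wrong, so it can be worth testing the seed URL against the filter pattern before crawling. The following is a rough sketch under a few assumptions: WORK_DIR is a placeholder for the Nutch directory, url_accepted is a hypothetical helper, and grep -E is used only as a stand-in for the Java regular expression engine that Nutch uses (the two should agree for a pattern this simple).

```shell
#!/bin/sh
# Sketch of steps 2 and 3: write the seed file, then test the filter pattern.
# WORK_DIR is a placeholder; in the setup above it would be the Nutch directory.
WORK_DIR="${WORK_DIR:-$(mktemp -d)}"
SEED_URL="http://www.my400800.cn/"

mkdir -p "$WORK_DIR/urls"
printf '%s\n' "$SEED_URL" > "$WORK_DIR/urls/nutch.txt"

# The filter line from crawl-urlfilter.txt without its leading "+".
FILTER='^http://([a-z0-9]*\.)*my400800.cn/'

# Hypothetical helper: succeeds when the URL matches the filter pattern.
url_accepted() {
    printf '%s\n' "$1" | grep -Eq "$FILTER"
}

# Report every seed URL as accepted or rejected.
while read -r url; do
    if url_accepted "$url"; then
        echo "accepted: $url"
    else
        echo "rejected: $url"
    fi
done < "$WORK_DIR/urls/nutch.txt"
```

With this pattern, http://www.my400800.cn/ is accepted while http://www.my400800.cn (no trailing slash) is rejected, which is why both the seed file and the filter line need the final /.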

Run the preceding command again, and the crawl starts successfully.
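To avoid repeating the failed first run, the crawl command from step 5 can be wrapped so that it refuses to start when the seed file is missing. This is only a sketch: run_crawl is a hypothetical function, NUTCH_HOME defaults to the Cygwin form of the directory used in this walkthrough, and the flags are the same ones shown in step 5.

```shell
#!/bin/sh
# Hypothetical wrapper around the crawl command from step 5.
# NUTCH_HOME defaults to the install directory used in this walkthrough.
NUTCH_HOME="${NUTCH_HOME:-/cygdrive/d/nutch/nutch-1.2}"

# run_crawl OUTPUT_DIR DEPTH THREADS TOPN
run_crawl() {
    seeds="$NUTCH_HOME/urls/nutch.txt"
    # Refuse to start without a non-empty seed file.
    if [ ! -s "$seeds" ]; then
        echo "missing or empty seed file: $seeds"
        return 1
    fi
    mkdir -p "$NUTCH_HOME/logs"
    # Run from NUTCH_HOME so the relative "urls" and output paths resolve there.
    # stdout and stderr both go to the log file, as in the original command.
    ( cd "$NUTCH_HOME" &&
      bin/nutch crawl -dir "$1" -depth "$2" -threads "$3" -topN "$4" urls
    ) > "$NUTCH_HOME/logs/log1.log" 2>&1
}

# Example call with the values from step 5:
#   run_crawl localdownweb 1 1 10
```

After a run, logs/log1.log under the Nutch directory holds everything the crawl printed.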

 

 
