Detailed introduction to the secondary development of Nutch 1.2 (2) [images and text]: setting up Nutch 1.2 on the Windows platform

Source: Internet
Author: User

While setting up and developing with Nutch, the author (Watcher Ms) read many Chinese-language documents, but found them lacking in detail and containing errors. This series therefore records the author's own hands-on process, corrects some errors found in those articles, and walks through a simple secondary development process in detail to lower the threshold for beginners. It cannot be guaranteed to be entirely free of errors, so corrections are welcome if you find any problems.

Contents:

1. Detailed introduction to the secondary development of Nutch 1.2 (1) [images and text] ------ setting up the Cygwin environment on the Windows platform

2. Detailed introduction to the secondary development of Nutch 1.2 (2) [images and text] ------ setting up Nutch 1.2 on the Windows platform

3. Detailed introduction to the secondary development of Nutch 1.2 (3) [images and text] ------ secondary development of Nutch 1.2 (interface modification)

4. Detailed introduction to the secondary development of Nutch 1.2 (4) [images and text] ------ secondary development of Nutch 1.2 (Chinese word segmentation)

This article is from the "Watcher Ms" blog. Please do not reprint it.

I. Development environment (using the author's own setup as an example):

Development machine: Windows Server 2003 + Cygwin + Eclipse 3.2

II. Steps:

<1>. Download Nutch 1.2 (http://labs.renren.com/apache-mirror//nutch)

After the download is complete, decompress the package to a folder of your choice.

Before testing whether the setup works, make sure that the JDK is installed on the local machine and that the JAVA_HOME environment variable is set correctly. Note: in the environment variable settings, set JAVA_HOME to the JDK installation root directory, and then define PATH and CLASSPATH relative to it, that is, %JAVA_HOME%/bin and %JAVA_HOME%/lib, rather than as absolute directories. Otherwise, an error will occur when you execute the nutch command.
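The JAVA_HOME requirement above can be checked from the Cygwin prompt before running any nutch command. The following is only a minimal sketch: check_java_home is a hypothetical helper name (it is not part of Nutch), and it assumes that a JDK root can be recognized by an executable bin/java or bin/java.exe.

```shell
#!/bin/sh
# Sketch: verify that a directory looks like a JDK root.
# check_java_home is a hypothetical helper, not a Nutch command.
check_java_home() {
    if [ -z "$1" ]; then
        echo "JAVA_HOME is not set"
        return 1
    fi
    # Assumption: a JDK root contains bin/java (bin/java.exe on Windows).
    if [ -x "$1/bin/java" ] || [ -x "$1/bin/java.exe" ]; then
        echo "JAVA_HOME looks valid: $1"
        return 0
    fi
    echo "no java executable under $1/bin"
    return 1
}

# Check the current environment; "|| true" keeps the script going on failure.
check_java_home "${JAVA_HOME:-}" || true
```

If the check reports a problem, fix the environment variables first, since the nutch script depends on JAVA_HOME.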

<2>. Configure Nutch:

First, modify two files in the conf subdirectory of the Nutch directory:

Add an http.agent.name property inside the configuration element of nutch-site.xml (crawling does not work if it is not set):

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>HD nutch agent</value>
  </property>
  <property>
    <name>http.agent.version</name>
    <value>1.2</value>
  </property>
</configuration>

In crawl-urlfilter.txt, change the following rule to the form you need:

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*com.cn/
+^http://([a-z0-9]*\.)*cn/
+^http://([a-z0-9]*\.)*com/

Note: there must be no spaces before the "+".
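To preview which URLs the three "+" rules above would let through, the patterns can be tried with grep -E in Cygwin. This is only a rough sketch: url_accepted is a hypothetical helper, Nutch itself evaluates the rules with Java regular expressions, and the sketch ignores any other rules in the real crawl-urlfilter.txt.

```shell
#!/bin/sh
# Sketch: emulate the three "+" rules with grep -E.
# url_accepted is a hypothetical helper, not part of Nutch.
url_accepted() {
    for pattern in \
        '^http://([a-z0-9]*\.)*com.cn/' \
        '^http://([a-z0-9]*\.)*cn/' \
        '^http://([a-z0-9]*\.)*com/'
    do
        # The first rule that matches decides the outcome.
        if printf '%s\n' "$1" | grep -Eq "$pattern"; then
            echo accept
            return 0
        fi
    done
    # No "+" rule matched, so the URL would not be crawled.
    echo ignore
    return 1
}

url_accepted 'http://www.qq.com/'               # prints: accept
url_accepted 'http://www.sina.com.cn/'          # prints: accept
url_accepted 'http://www.example.org/' || true  # prints: ignore
```

Hosts under other top-level domains need their own "+" line in the same style.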

Second, run the crawl.

(1) Create a new url.txt file in the Nutch root directory and enter the domain names you want to crawl, one per line.

For example:

http://www.qq.com/

http://www.sina.com.cn/

Note: enter one domain name per line. Follow the format of the preceding example, including the trailing "/".
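The format rule above (one URL per line, starting with http:// and ending with "/") can be checked mechanically before crawling. The following is a minimal sketch: check_seeds is a hypothetical helper name, and the pattern deliberately covers only the bare-domain form used in the example.

```shell
#!/bin/sh
# Sketch: list seed lines that do not look like http://host/ .
# check_seeds is a hypothetical helper, not part of Nutch.
check_seeds() {
    # Strip Windows carriage returns, skip blank lines, then print every
    # line that does NOT match the expected form. The inner grep exits
    # with 0 when it printed at least one malformed line.
    if tr -d '\r' < "$1" | grep -v '^$' | grep -Ev '^http://[^/ ]+/$'; then
        return 1
    fi
    return 0
}

# Usage (assumes url.txt is in the current directory):
#   check_seeds url.txt && echo "seed list looks fine"
```

Any line that is printed is missing the http:// prefix or the trailing "/", or contains a path or a space.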

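(2) Open Cygwin and run the following commands:

Note: the author's Nutch is located in G:/nutch

Command line: cd g:

Command line: cd nutch

Command line: bin/nutch crawl url.txt -dir localweb -depth 3 -threads 4

Note: url.txt is the seed list, -dir names the directory that receives the crawl output, -depth is the number of link levels to follow from the seeds, and -threads is the number of fetcher threads. Refer to the Nutch documentation for more information about the parameters.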
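When experimenting with different depths or output directories, it can help to assemble the crawl command from named values. This is only a sketch: crawl_command is a hypothetical helper that prints the command instead of running it, and it assumes the Nutch 1.2 form bin/nutch crawl with the -dir, -depth, and -threads options.

```shell
#!/bin/sh
# Sketch: build the crawl command line from its four inputs.
# crawl_command is a hypothetical helper; it only prints the command.
crawl_command() {
    seeds=$1      # seed list, e.g. url.txt
    out_dir=$2    # directory that receives the crawl output
    depth=$3      # number of link levels to follow
    threads=$4    # number of fetcher threads
    echo "bin/nutch crawl $seeds -dir $out_dir -depth $depth -threads $threads"
}

crawl_command url.txt localweb 3 4
# prints: bin/nutch crawl url.txt -dir localweb -depth 3 -threads 4
```

The printed line can be copied into the Cygwin prompt from the Nutch root directory.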

The crawl now starts, which shows that the configuration is successful.

After these steps, the back-end work is basically complete. You can now go to the Nutch root directory in Cygwin and run the following command to perform a simple query test:

bin/nutch org.apache.nutch.searcher.NutchBean keyword

<3>. Tomcat configuration

(1) Delete the ROOT directory under \webapps in the Tomcat installation directory;

(2) Copy nutch-1.2.war from the Nutch directory to Tomcat\webapps and rename it ROOT.war;

If Tomcat is running, ROOT.war is automatically unpacked into a ROOT folder. If Tomcat is not running, the ROOT folder is generated after Tomcat starts.

(3) Open nutch-site.xml under ROOT\WEB-INF\classes and modify it to the following form:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<nutch-conf>
  <property>
    <name>searcher.dir</name>
    <value>G:/nutch/localweb</value>
  </property>
</nutch-conf>

Adjust the <value>G:/nutch/localweb</value> entry to match the crawl output directory on your own machine.
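Before pointing searcher.dir at a directory, it may be worth confirming that the crawl actually wrote its output there. The sketch below is based on an assumption about the Nutch 1.2 output layout, namely that a finished crawl leaves crawldb, linkdb, segments, and index subdirectories; check_crawl_dir is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: report expected crawl subdirectories that are missing.
# check_crawl_dir is a hypothetical helper; the list of subdirectories
# is an assumption about what a finished Nutch 1.2 crawl produces.
check_crawl_dir() {
    missing=0
    for sub in crawldb linkdb segments index; do
        if [ ! -d "$1/$sub" ]; then
            echo "missing: $1/$sub"
            missing=1
        fi
    done
    return $missing
}

# Usage (the path is the author's example location):
#   check_crawl_dir G:/nutch/localweb && echo "crawl output looks complete"
```

If anything is reported missing, the search page will most likely return no results until the crawl is rerun.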

Start Tomcat, open a browser, and enter http://localhost:8080 in the address bar. The Nutch search page is displayed.

This completes the basic configuration of Nutch. The next article covers how to import and debug it in Eclipse.

