Detailed introduction to secondary development of Nutch 1.2 (2) [illustrated]: setting up Nutch 1.2 on the Windows platform

Last Update:2018-12-07 Source: Internet

Author: User


The author (Watcher Ms) read many Chinese-language documents while setting up and developing with Nutch, but found them short on detail and sometimes wrong. This series therefore records the author's own hands-on process, corrects errors found in those articles, and walks through a simple secondary development process in detail, to lower the barrier for beginners. Errors may still remain; if you find any, corrections are welcome.

Contents:

1. Detailed introduction to secondary development of Nutch 1.2 (1) [illustrated]: setting up the Cygwin environment on the Windows platform

2. Detailed introduction to secondary development of Nutch 1.2 (2) [illustrated]: setting up Nutch 1.2 on the Windows platform

3. Detailed introduction to secondary development of Nutch 1.2 (3) [illustrated]: secondary development of Nutch 1.2 (interface modification)

4. Detailed introduction to secondary development of Nutch 1.2 (4) [illustrated]: secondary development of Nutch 1.2 (Chinese word segmentation)

This article is from the "Watcher Ms" blog. Reprinting is not permitted.

1. Development environment (the author's own setup, as an example):

Development machine: Windows Server 2003 + Cygwin + Eclipse 3.2

2. Steps:

<1>. Download Nutch 1.2 (http://labs.renren.com/apache-mirror//nutch)

After the download is complete, decompress the package to a folder of your choice.

Before testing whether the setup works, make sure that a JDK is installed on the local machine and that the JAVA_HOME environment variable is set correctly. Note: set JAVA_HOME to the JDK installation root directory, then define CLASSPATH and PATH relative to it, that is, as %JAVA_HOME%/bin and %JAVA_HOME%/lib, rather than as absolute directories. Otherwise, an error will occur when you run the nutch command.
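A quick way to catch a wrong JAVA_HOME before running Nutch is to test it from the Cygwin shell. The sketch below is only an illustration: the function name check_java_home is made up for this example, and it assumes that a JDK root can be recognized by a bin/java or bin/java.exe file inside it.

```shell
# check_java_home DIR: succeed only if DIR looks like a JDK root,
# i.e. it contains bin/java (or bin/java.exe on Windows).
# The function name and the test itself are assumptions of this sketch.
check_java_home() {
    dir="$1"
    if [ -z "$dir" ]; then
        echo "JAVA_HOME is not set"
        return 1
    fi
    if [ -e "$dir/bin/java" ] || [ -e "$dir/bin/java.exe" ]; then
        echo "JAVA_HOME looks valid: $dir"
        return 0
    fi
    echo "no bin/java under: $dir"
    return 1
}

# Report on the current shell's JAVA_HOME without aborting the session.
check_java_home "${JAVA_HOME:-}" || true
```

If the function reports a problem, fix the environment variable in Windows and open a new Cygwin window so that the change is picked up.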


<2>. Configure Nutch:

First, modify two files in the conf subdirectory of the Nutch directory:

Add an http.agent.name property inside the configuration element of nutch-site.xml (Nutch will not crawl without it):

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>HD nutch agent</value>
  </property>
  <property>
    <name>http.agent.version</name>
    <value>1.2</value>
  </property>
</configuration>

In crawl-urlfilter.txt, change the following statement to match the sites you want to crawl:

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*com.cn/
+^http://([a-z0-9]*\.)*cn/
+^http://([a-z0-9]*\.)*com/

Note: there must be no space before the "+".
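To get a feel for which URLs the three accept patterns above let through, you can try them with grep from Cygwin. This is a rough sketch, not how Nutch itself evaluates the file: it assumes grep -E treats these particular patterns the same way as the Java regular expressions Nutch uses, it ignores every other rule in the filter file, and url_allowed is a name invented for the example.

```shell
# url_allowed URL: succeed if URL matches any of the three accept patterns.
# The leading "+" in the filter file is the rule marker (accept), so it is
# not part of the regular expression and is left out here.
url_allowed() {
    printf '%s\n' "$1" | grep -Eq \
        -e '^http://([a-z0-9]*\.)*com.cn/' \
        -e '^http://([a-z0-9]*\.)*cn/' \
        -e '^http://([a-z0-9]*\.)*com/'
}

# Try the two seed sites from this article plus one URL outside the patterns.
for u in http://www.qq.com/ http://www.sina.com.cn/ http://example.org/; do
    if url_allowed "$u"; then
        echo "accept $u"
    else
        echo "reject $u"
    fi
done
```

Because the patterns are anchored with "^http://", anything that does not begin with exactly that prefix is rejected.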

Second, run the crawl.

(1). Create a new url.txt file in the Nutch root directory and enter the URLs you want to crawl, one per line.

For example:

http://www.qq.com/

http://www.sina.com.cn/

Note: enter one URL per line. Follow the format of the example above, including the trailing "/".
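The seed file can also be written from the Cygwin shell. The following is a minimal sketch under two assumptions: every argument is a bare site root (so appending "/" is safe), and write_seeds is a helper name made up for this example.

```shell
# write_seeds FILE URL...: write each URL on its own line,
# appending a trailing "/" when one is missing.
# Assumes every URL is a site root such as http://www.qq.com
write_seeds() {
    out="$1"
    shift
    : > "$out"                      # create or truncate the seed file
    for url in "$@"; do
        case "$url" in
            */) printf '%s\n' "$url" >> "$out" ;;
            *)  printf '%s/\n' "$url" >> "$out" ;;
        esac
    done
}

# Example (run from the Nutch root directory):
# write_seeds url.txt http://www.qq.com http://www.sina.com.cn/
```

The example call is left commented out so that loading the function never overwrites an existing url.txt by accident.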

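(2). Open Cygwin and run the following commands:

Note: the author's Nutch is installed in G:/nutch

Command line: cd g:

Command line: cd nutch

Command line: bin/nutch crawl url.txt -dir localweb -depth 3 -threads 4

Note: in short, -dir names the output directory, -depth sets how many rounds of links to follow from the seed pages, and -threads sets the number of fetcher threads. See the Nutch documentation for the full list of parameters.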

At this point the crawl starts, which shows that the configuration works.
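Once the crawl has finished, a simple sanity check is to look at what ended up in the output directory. The sketch below assumes that a one-step crawl with Nutch 1.2 leaves the subdirectories crawldb, linkdb, segments, indexes and index there; treat that list, and the helper name check_crawl_dir, as assumptions of this example.

```shell
# check_crawl_dir DIR: report which of the expected subdirectories exist.
# The list of names is an assumption about the Nutch 1.2 output layout.
check_crawl_dir() {
    missing=0
    for sub in crawldb linkdb segments indexes index; do
        if [ -d "$1/$sub" ]; then
            echo "found   $sub"
        else
            echo "missing $sub"
            missing=1
        fi
    done
    return "$missing"
}

# Example (run from the Nutch root directory after the crawl finishes):
# check_crawl_dir localweb
```

A missing entry usually means the crawl stopped early, so the console output or the log files are the next place to look.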

After these steps, the back-end work is essentially done. You can now go to the Nutch root directory in Cygwin and run the following command to perform a simple query test:

bin/nutch org.apache.nutch.searcher.NutchBean keyword

<3>. Configure Tomcat:

(1). Delete the ROOT folder under \webapps in the Tomcat installation directory;

(2). Copy nutch-1.2.war from the Nutch directory to Tomcat's \webapps folder and rename it ROOT.war;

If Tomcat is running, ROOT.war is unpacked into a ROOT folder automatically. If Tomcat is not running, the ROOT folder is created the next time Tomcat starts.
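Steps (1) and (2) can also be done from the Cygwin shell. The helper below is a sketch only: deploy_war is a made-up name, the paths in the example call are hypothetical, and CATALINA_HOME is assumed to point at your Tomcat installation.

```shell
# deploy_war WAR WEBAPPS: replace Tomcat's default ROOT application with WAR.
deploy_war() {
    war="$1"
    webapps="$2"
    [ -f "$war" ] || { echo "war file not found: $war"; return 1; }
    [ -d "$webapps" ] || { echo "webapps folder not found: $webapps"; return 1; }
    rm -rf "$webapps/ROOT"          # step (1): remove the default ROOT app
    cp "$war" "$webapps/ROOT.war"   # step (2): copy and rename in one go
}

# Example, with hypothetical locations:
# deploy_war /cygdrive/g/nutch/nutch-1.2.war "$CATALINA_HOME/webapps"
```

Note that the function deletes the existing ROOT folder, so run it before editing any files inside ROOT, not after.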

(3). Open nutch-site.xml under ROOT\WEB-INF\classes and change it to the following:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<nutch-conf>
  <property>
    <name>searcher.dir</name>
    <value>G:/nutch/localweb</value>
  </property>
</nutch-conf>

Change the <value>G:/nutch/localweb</value> entry to the crawl output directory on your own machine.
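A typo in this path is easy to make, so it can be worth reading the value back and checking that the directory exists. The sketch below is line-based rather than a real XML parser: it assumes lowercase tags with <name> and <value> on separate lines, and both searcher_dir and the example path are illustrative.

```shell
# searcher_dir FILE: print the value of the searcher.dir property.
# Line-based sketch: expects <name> and <value> on separate lines.
searcher_dir() {
    awk '
        /<name>[ \t]*searcher\.dir[ \t]*<\/name>/ { found = 1; next }
        found && /<value>/ {
            sub(/.*<value>[ \t]*/, "")     # drop everything up to the value
            sub(/[ \t]*<\/value>.*/, "")   # drop the closing tag and the rest
            print
            exit
        }
    ' "$1"
}

# Example, assuming CATALINA_HOME points at the Tomcat installation:
# dir="$(searcher_dir "$CATALINA_HOME/webapps/ROOT/WEB-INF/classes/nutch-site.xml")"
# [ -d "$dir" ] && echo "searcher.dir exists: $dir"
```

If the directory test fails, compare the printed value with the -dir argument used for the crawl.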

Start Tomcat, open a browser, and enter http://localhost:8080 in the address bar. You should then see the Nutch search page.

That completes the basic Nutch configuration. Next comes importing and debugging it in Eclipse.



This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
