While setting up and developing with Nutch, the author (Watcher Ms) read many Chinese-language documents, but found them lacking in detail and containing errors. This series therefore records the author's own hands-on process, corrects some errors found in those articles, and walks through a simple secondary development process in detail to lower the threshold for beginners. It cannot be guaranteed to be free of errors; if you find any problems, corrections are welcome.
Contents:
1. Detailed Guide to Secondary Development of Nutch 1.2 (1) [Illustrated] ------ Setting up the Cygwin environment on the Windows platform
2. Detailed Guide to Secondary Development of Nutch 1.2 (2) [Illustrated] ------ Setting up Nutch 1.2 on the Windows platform
3. Detailed Guide to Secondary Development of Nutch 1.2 (3) [Illustrated] ------ Secondary development of Nutch 1.2 (interface modification)
4. Detailed Guide to Secondary Development of Nutch 1.2 (4) [Illustrated] ------ Secondary development of Nutch 1.2 (Chinese word segmentation)
This article is from the "Watcher Ms" blog. Reprinting is declined!
I. Development environment (using my own setup as an example):
Development machine: Windows Server 2003 + Cygwin + Eclipse 3.2
II. Steps:
<1>. Download Nutch 1.2 (http://labs.renren.com/apache-mirror//nutch)
After the download is complete, decompress the package to a folder of your choice.
Before testing whether the setup works, make sure that a JDK is installed on the local machine and that the JAVA_HOME environment variable is set correctly. Note: in the environment variable settings, set JAVA_HOME to the JDK installation root directory, then define CLASSPATH and PATH relative to it, that is, %JAVA_HOME%/bin and %JAVA_HOME%/lib. Do not write these as absolute directories; otherwise an error occurs when you execute the nutch command.
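The JAVA_HOME requirement above can be checked from the Cygwin shell before running any nutch command. The following is only a minimal sketch: the function name check_java_home is hypothetical and not part of Nutch, and it assumes that a JDK root is recognizable by a java (or java.exe) launcher under its bin directory.

```shell
# Hypothetical helper (not part of Nutch): verify that JAVA_HOME is set
# and points at a directory that contains bin/java or bin/java.exe.
check_java_home() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    # On Windows the launcher is java.exe; elsewhere it is plain java.
    if [ -e "$JAVA_HOME/bin/java" ] || [ -e "$JAVA_HOME/bin/java.exe" ]; then
        echo "JAVA_HOME looks valid: $JAVA_HOME"
        return 0
    fi
    echo "No java launcher found under $JAVA_HOME/bin" >&2
    return 1
}

# Usage: check_java_home && echo "ready to run bin/nutch"
```

If the check fails, correct the environment variable in the Windows settings and reopen Cygwin so that the new value is picked up.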
<2>. Configure Nutch:
First, modify two files in the conf subdirectory of the Nutch directory:
In nutch-site.xml, add an http.agent.name property under the configuration element (crawling does not work without it):
<configuration>
<property>
<name>http.agent.name</name>
<value>HD nutch agent</value>
</property>
<property>
<name>http.agent.version</name>
<value>1.2</value>
</property>
</configuration>
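Because crawling fails when http.agent.name is missing, it can help to confirm the value from the shell after editing the file. This is a rough sketch, not a real XML parser: the function name agent_name is hypothetical, and it assumes the name and value elements sit on consecutive lines with no spaces inside the tags.

```shell
# Hypothetical helper (not part of Nutch): print the <value> on the line
# that follows <name>http.agent.name</name> in the given file.
agent_name() {
    sed -n '/<name>http\.agent\.name<\/name>/{
        n
        s/.*<value>\(.*\)<\/value>.*/\1/p
    }' "$1"
}

# Usage: agent_name conf/nutch-site.xml
```

An empty result suggests that the property is missing or that the file is laid out differently from what the sketch assumes.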
In crawl-urlfilter.txt, change the following rule to the form you need:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*com.cn/
+^http://([a-z0-9]*\.)*cn/
+^http://([a-z0-9]*\.)*com/
Note: there must be no space before the "+".
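To see which URLs the three accept rules let through, the patterns can be tried out in the shell. The sketch below uses grep -E as a stand-in for the Java regular expressions that Nutch itself evaluates; for patterns this simple the two should behave the same, but that is an assumption. The function name url_accepted is hypothetical, and the sketch ignores any other rules in the file.

```shell
# Hypothetical helper (not part of Nutch): succeed if the URL matches one
# of the three accept rules, written here without the leading "+".
url_accepted() {
    url=$1
    for pattern in \
        '^http://([a-z0-9]*\.)*com.cn/' \
        '^http://([a-z0-9]*\.)*cn/' \
        '^http://([a-z0-9]*\.)*com/'
    do
        if printf '%s\n' "$url" | grep -Eq "$pattern"; then
            return 0
        fi
    done
    return 1
}

# Usage: url_accepted "http://www.sina.com.cn/" && echo accepted
```

Note that the patterns only match lowercase http:// URLs that include the "/" after the host name.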
Second, run the crawl.
(1) Create a new url.txt file in the Nutch root directory and enter the domain names you want to crawl, one per line.
For example:
http://www.qq.com/
http://www.sina.com.cn/
Note: enter one domain name per line. Follow the format of the examples above, including the trailing "/".
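The seed file can also be created from the Cygwin shell instead of a text editor. This is a small sketch with a hypothetical helper name, write_seed_urls; it simply writes each argument on its own line.

```shell
# Hypothetical helper (not part of Nutch): write one seed URL per line
# into the target file, replacing any previous content.
write_seed_urls() {
    target=$1
    shift
    printf '%s\n' "$@" > "$target"
}

# Usage, run from the Nutch root directory:
#   write_seed_urls url.txt http://www.qq.com/ http://www.sina.com.cn/
```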
(2) Open Cygwin and execute the following commands:
Note: the author's Nutch is placed in G:/nutch
Command line: cd g:
Command line: cd nutch
Command line: bin/nutch crawl url.txt -dir localweb -depth 3 -threads 4
Note: for the meaning of the command-line parameters, consult the relevant documentation.
If the crawl starts at this point, the configuration is successful.
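Once the crawl has finished, one quick sanity check is to look at what it wrote to the output directory (localweb in the command above). The sketch below assumes the usual Nutch 1.x layout, in which the crawl directory contains crawldb, linkdb and segments subdirectories; the function name crawl_output_ok is hypothetical.

```shell
# Hypothetical helper (not part of Nutch): succeed only if the crawl
# directory contains the subdirectories a finished crawl is assumed to
# produce (crawldb, linkdb, segments).
crawl_output_ok() {
    crawl_dir=$1
    for sub in crawldb linkdb segments; do
        if [ ! -d "$crawl_dir/$sub" ]; then
            echo "missing: $crawl_dir/$sub" >&2
            return 1
        fi
    done
    return 0
}

# Usage, run from the Nutch root directory: crawl_output_ok localweb
```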
After the preceding steps, the back-end work is basically complete. You can now go to the Nutch root directory in Cygwin
and run the following command to perform a simple query test:
bin/nutch org.apache.nutch.searcher.NutchBean keyword
<3>. Configure Tomcat:
(1) Delete the ROOT folder under \webapps in the Tomcat installation directory;
(2) Copy nutch-1.2.war from the Nutch directory to Tomcat\webapps and rename it ROOT.war;
If Tomcat is running, ROOT.war is unpacked into a ROOT folder automatically. If Tomcat is not running, the ROOT folder is generated once Tomcat is started.
(3) Open nutch-site.xml under ROOT\WEB-INF\classes and modify it to the following form:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<nutch-conf>
<property>
<name>searcher.dir</name>
<value>G:/nutch/localweb</value>
</property>
</nutch-conf>
The <value>G:/nutch/localweb</value> part must be changed to match the location of your own crawl directory.
Start Tomcat, open a browser, and enter http://localhost:8080 in the address bar. You should then see the Nutch search page.
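Steps (1) and (2) of the Tomcat configuration can also be carried out from the Cygwin shell. The following is a sketch under stated assumptions: the function name deploy_nutch_war is hypothetical, Tomcat should be stopped while it runs, and the directories passed in are whatever applies to your own installation.

```shell
# Hypothetical helper (not part of Nutch or Tomcat): replace Tomcat's
# default ROOT application with the Nutch web application.
#   $1 = Tomcat installation directory
#   $2 = Nutch directory containing nutch-1.2.war
deploy_nutch_war() {
    tomcat_home=$1
    nutch_home=$2
    war="$nutch_home/nutch-1.2.war"
    if [ ! -f "$war" ]; then
        echo "WAR not found: $war" >&2
        return 1
    fi
    # Step (1): remove the default ROOT application and any stale ROOT.war.
    rm -rf "$tomcat_home/webapps/ROOT" "$tomcat_home/webapps/ROOT.war"
    # Step (2): copy the Nutch WAR into webapps under the name ROOT.war.
    cp "$war" "$tomcat_home/webapps/ROOT.war"
}

# Usage (the Tomcat path is only an example):
#   deploy_nutch_war /cygdrive/c/tomcat /cygdrive/g/nutch
```

After Tomcat has unpacked ROOT.war, the nutch-site.xml edit from step (3) still has to be made by hand.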
The basic configuration of Nutch is now complete. The next article covers how to import Nutch into Eclipse and debug it.