Installing Nutch 1.2 in MyEclipse on Windows (the final setup runs without errors)
1. Download and install Cygwin. Its installation and environment configuration are not covered here. Add %CYGWIN_HOME%\bin to PATH.
2. Import into Eclipse
① In Eclipse, choose File > New > Project > Java Project.
Enter any project name, select "Create project from existing source", and browse to the directory where Nutch was extracted, such as D:\nutch-1.2.
② Click "Add Class Folder" and select the conf folder.
③ Set the "Default output folder" to any name other than bin. If bin is chosen as the default output folder, it is cleared on every build and the other files under bin are deleted, which causes further problems.
④ Click Finish.
3. Modify the Nutch configuration files. Here we take crawling www.163.com as an example.
① Edit nutch-site.xml under D:\nutch-1.2\conf:
<?xml version="1.0"?>
<?xml-stylesheet href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<configuration>

  <property>
    <name>http.agent.name</name>
    <value>nutch-1.2</value>
    <description>HTTP 'User-Agent'</description>
  </property>

  <property>
    <name>searcher.dir</name>
    <value>D:\nutch-1.2\crawl</value>
    <description>Path to root of crawl.</description>
  </property>

</configuration>
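A quick way to confirm that the edited file is well-formed and holds the intended values is to parse it and print each property. The sketch below is only an illustration: the class SiteConfigCheck and its method names are hypothetical and not part of Nutch, and the file path is passed in as an argument.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of Nutch): lists the name/value pairs
// found in a Hadoop-style configuration file such as nutch-site.xml.
final class SiteConfigCheck {

    // Parses the XML and returns the properties in document order.
    static Map<String, String> readProperties(InputStream xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        Map<String, String> props = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element property = (Element) nodes.item(i);
            String name = childText(property, "name");
            if (name != null) {
                props.put(name, childText(property, "value"));
            }
        }
        return props;
    }

    // Returns the trimmed text of the first child element with the given tag, or null.
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be the path to nutch-site.xml,
        // for example D:\nutch-1.2\conf\nutch-site.xml
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            readProperties(in).forEach((k, v) -> System.out.println(k + " = " + v));
        }
    }
}
```

If the file contains mismatched tags such as <Value> closed by </value>, the parser throws an exception instead of printing the properties, which makes that kind of typo easy to spot.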
② Edit crawl-urlfilter.txt under D:\nutch-1.2\conf:
# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*163.com/
# skip everything else
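To see which URLs a filter file of this kind lets through, the rules can be tried out in isolation. The sketch below approximates how such a file is commonly evaluated: lines starting with + accept, lines starting with - reject, rules are tried in order, and the first matching rule decides. The class UrlFilterSketch is hypothetical and is not Nutch's own filter implementation, the rule shown targets the www.163.com example, and the closing "-." catch-all rule is an assumption added for the demonstration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a regex URL filter; not Nutch's own class.
final class UrlFilterSketch {

    // One rule: a sign (+ accepts, - rejects) and a regular expression.
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Builds the rule list from filter-file lines, skipping blanks and # comments.
    UrlFilterSketch(List<String> lines) {
        for (String line : lines) {
            String s = line.trim();
            if (s.isEmpty() || s.startsWith("#")) {
                continue;
            }
            char sign = s.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(s.substring(1))));
        }
    }

    // The first rule whose pattern is found in the URL decides; no match means reject.
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch(Arrays.asList(
                "# accept hosts in the 163.com domain",
                "+^http://([a-z0-9]*\\.)*163.com/",
                "# skip everything else (assumed catch-all rule)",
                "-."));
        System.out.println(filter.accepts("http://www.163.com/"));            // true
        System.out.println(filter.accepts("http://news.163.com/index.html")); // true
        System.out.println(filter.accepts("http://www.example.org/"));        // false
    }
}
```

Note that the host in the rule has to match the seed URL: a rule written for a different domain suffix would reject http://www.163.com/ and the crawl would fetch nothing.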
③ Edit nutch-default.xml under D:\nutch-1.2\conf:
<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located. Each
  element may be a relative or absolute path. If absolute, it is used
  as is. If relative, it is searched for on the classpath.</description>
</property>
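The rule in the description (absolute paths are used as is, relative paths are looked up on the classpath) can be illustrated with a few lines of Java. This is only a sketch of the stated rule under that reading: the class PluginFolderSketch is hypothetical, and the actual lookup inside Nutch may differ in detail.

```java
import java.io.File;
import java.net.URL;

// Hypothetical illustration of the plugin.folders lookup rule; not Nutch code.
final class PluginFolderSketch {

    // Returns the resolved location, or null if a relative name
    // cannot be found on the classpath.
    static String resolve(String name) {
        File dir = new File(name);
        if (dir.isAbsolute()) {
            return dir.getPath(); // absolute: used as is
        }
        // relative: searched for on the classpath
        URL url = PluginFolderSketch.class.getClassLoader().getResource(name);
        return url == null ? null : url.toString();
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "./src/plugin";
        String resolved = resolve(name);
        System.out.println(resolved == null
                ? name + " was not found on the classpath"
                : name + " resolves to " + resolved);
    }
}
```

If the value cannot be resolved this way in your project, an absolute path such as D:\nutch-1.2\src\plugin avoids the classpath lookup altogether.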
④ In D:\nutch-1.2\, create a folder named urls, create a text file url.txt in that folder, and write the following line in it:
http://www.163.com/
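The seed folder and file can also be created programmatically, which is convenient when the setup has to be repeated. The following is a minimal sketch, assuming the folder is called urls (the name passed as the first crawl argument) and the file url.txt; the class SeedFileSketch is hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;

// Hypothetical helper: creates <nutchHome>/urls/url.txt with a single seed URL.
final class SeedFileSketch {

    static Path writeSeed(Path nutchHome, String url) throws IOException {
        // Create the urls folder if it does not exist yet.
        Path dir = Files.createDirectories(nutchHome.resolve("urls"));
        Path seed = dir.resolve("url.txt");
        // One URL per line; an existing file is overwritten.
        Files.write(seed, Collections.singletonList(url), StandardCharsets.UTF_8);
        return seed;
    }

    public static void main(String[] args) throws IOException {
        // D:\nutch-1.2 is the example location used in this article.
        Path home = Paths.get(args.length > 0 ? args[0] : "D:\\nutch-1.2");
        System.out.println("Wrote " + writeSeed(home, "http://www.163.com/"));
    }
}
```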
4. Run Nutch in Eclipse
① Choose Run > Open Run Dialog.
② Enter any name for the run configuration.
③ Fill in the main class:
org.apache.nutch.crawl.Crawl
④ Fill in arguments
urls -dir crawl -depth 3 -topN 50
⑤ Fill in VM arguments
-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
Click OK and run the job, then check whether the crawl is running.
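The three fields of the run dialog correspond roughly to one java command line: VM arguments come before the main class and program arguments come after it. In the program arguments, urls is the seed folder, -dir names the output folder, -depth is the number of link levels to follow, and -topN caps the number of pages fetched per level. The sketch below only assembles and prints that command; the class CrawlCommandSketch is hypothetical, and the classpath string is a placeholder that depends on how the project was built.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: shows how the Eclipse run configuration maps to a command line.
final class CrawlCommandSketch {

    static List<String> command(String classpath) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        // VM arguments (step 5)
        cmd.addAll(Arrays.asList("-Dhadoop.log.dir=logs", "-Dhadoop.log.file=hadoop.log"));
        cmd.addAll(Arrays.asList("-cp", classpath));
        // Main class (step 3)
        cmd.add("org.apache.nutch.crawl.Crawl");
        // Program arguments (step 4)
        cmd.addAll(Arrays.asList("urls", "-dir", "crawl", "-depth", "3", "-topN", "50"));
        return cmd;
    }

    public static void main(String[] args) {
        // Placeholder classpath; the real entries depend on your project layout.
        String classpath = args.length > 0 ? args[0] : "<project-classpath>";
        System.out.println(String.join(" ", command(classpath)));
    }
}
```

Eclipse builds the classpath from the project settings, so this command is only needed when the crawl is started outside the IDE.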