Install Nutch 1.2 in MyEclipse on Windows (verified to run without errors)

Source: Internet
Author: User

 

1. Download and install Cygwin. Installation and environment setup are not covered in detail here. Add %CYGWIN_HOME%\bin to the PATH environment variable.
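Before importing the project, it can help to confirm that the Cygwin bin directory really is visible on PATH. The following is a minimal sketch, assuming Java 8 or later and that the Cygwin install provides bash.exe; the class name CygwinPathCheck and the choice of bash.exe as the file to look for are illustrative assumptions, not part of Nutch or Cygwin.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper for illustration; not part of Nutch or Cygwin.
final class CygwinPathCheck {

    /** Returns the first directory listed in pathValue that contains fileName, if any. */
    static Optional<Path> findOnPath(String pathValue, String fileName) {
        if (pathValue == null) {
            return Optional.empty();
        }
        for (String entry : pathValue.split(File.pathSeparator)) {
            String dir = entry.trim();
            if (dir.isEmpty()) {
                continue; // skip empty PATH entries
            }
            try {
                Path candidate = Paths.get(dir, fileName);
                if (Files.isRegularFile(candidate)) {
                    return Optional.of(candidate.getParent());
                }
            } catch (InvalidPathException e) {
                // ignore malformed PATH entries and keep searching
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumption: bash.exe lives in %CYGWIN_HOME%\bin.
        Optional<Path> dir = findOnPath(System.getenv("PATH"), "bash.exe");
        System.out.println(dir.map(d -> "Found bash.exe in " + d)
                .orElse("bash.exe not found on PATH; check %CYGWIN_HOME%\\bin"));
    }
}
```

If the check reports that bash.exe is missing, restart Eclipse after correcting PATH so that the IDE picks up the new environment.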


2. Import into Eclipse


① In Eclipse, choose File > New > Project > Java Project.

Enter any project name, select "Create project from existing source", and browse to the directory where Nutch was extracted, for example D:\nutch-1.2

② Click "Add Class Folder" and select the conf folder.


③ Set the "Default output folder" to any name other than bin. If bin is chosen as the default output folder, it is cleared on every build and the other files under bin are deleted, which leads to further problems.


④ Click Finish.


3. Modify the Nutch configuration files. The example here crawls www.163.com.


① Edit nutch-site.xml under D:\nutch-1.2\conf

    <?xml version="1.0"?>
    <?xml-stylesheet href="configuration.xsl"?>
    <!-- Put site-specific property overrides in this file. -->
    <configuration>

      <property>
        <name>http.agent.name</name>
        <value>nutch-1.2</value>
        <description>HTTP 'User-Agent'</description>
      </property>

      <property>
        <name>searcher.dir</name>
        <value>D:\nutch-1.2\crawl</value>
        <description>Path to root of crawl.</description>
      </property>

    </configuration>
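A stray space inside a tag or a mismatched closing tag in nutch-site.xml is easy to miss, so it can be worth parsing the file once before running a crawl. The sketch below, assuming Java 8 or later, reads a Hadoop-style configuration document with the JDK's DOM parser and returns its name/value pairs. The class name SiteConfigCheck is hypothetical and this is not how Nutch itself loads its configuration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of Nutch.
final class SiteConfigCheck {

    /** Parses a <configuration> document and returns its properties in file order. */
    static Map<String, String> readProperties(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = childText(property, "name");
                if (name != null) {
                    props.put(name, childText(property, "value"));
                }
            }
            return props;
        } catch (Exception e) {
            // Malformed XML (for example <Value> closed by </value>) ends up here.
            throw new IllegalStateException("Configuration is not well-formed XML", e);
        }
    }

    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
                + "<configuration>"
                + "<property><name>http.agent.name</name><value>nutch-1.2</value></property>"
                + "<property><name>searcher.dir</name><value>D:\\nutch-1.2\\crawl</value></property>"
                + "</configuration>";
        readProperties(xml).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

XML element names are case-sensitive, so the tags must be written in lowercase (configuration, property, name, value) for a lookup like this to find them.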


② Edit crawl-urlfilter.txt under D:\nutch-1.2\conf

    # accept hosts in MY.DOMAIN.NAME
    +^http://([a-z0-9]*\.)*163.com/
    # skip everything else
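Each rule in crawl-urlfilter.txt is a sign followed by a regular expression: + accepts a URL and - rejects it. The sketch below is a simplified model of that behavior, assuming that rules are tried in order, that the first matching rule decides, and that a URL matching no rule is rejected. The class UrlFilterSketch is hypothetical, and the trailing "-." rule is an assumption taken from the stock file's "skip everything else" section, not from the listing above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Simplified, hypothetical model of a regex URL filter; not Nutch's own class.
final class UrlFilterSketch {

    /** Returns true if the first rule whose regex is found in the URL starts with '+'. */
    static boolean accepts(List<String> rules, String url) {
        for (String rule : rules) {
            String line = rule.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank lines and comments are ignored
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                continue; // not a rule line
            }
            if (Pattern.compile(line.substring(1)).matcher(url).find()) {
                return sign == '+';
            }
        }
        return false; // assumption: no matching rule means the URL is dropped
    }

    public static void main(String[] args) {
        List<String> rules = Arrays.asList(
                "# accept hosts in MY.DOMAIN.NAME",
                "+^http://([a-z0-9]*\\.)*163.com/",
                "# skip everything else",
                "-.");
        for (String url : Arrays.asList(
                "http://www.163.com/",
                "http://news.163.com/index.html",
                "http://www.example.com/")) {
            System.out.println(url + " accepted: " + accepts(rules, url));
        }
    }
}
```

The point to check is that the seed URL itself passes the filter; if the host in the + rule does not match the seed, the crawl has nothing to fetch.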


③ Edit nutch-default.xml under D:\nutch-1.2\conf

    <property>
      <name>plugin.folders</name>
      <value>./src/plugin</value>
      <description>Directories where nutch plugins are located. Each
      element may be a relative or absolute path. If absolute, it is used
      as is. If relative, it is searched for on the classpath.</description>
    </property>


④ In D:\nutch-1.2\, create a folder named urls, create a text file named url.txt in that folder, and write the following in it:

    http://www.163.com/
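The seed folder and file can also be created from code, which avoids editors that silently append a .txt.txt extension or save the file in an unexpected encoding. This is a minimal sketch, assuming Java 8 or later; the class SeedListWriter is hypothetical, and the folder and file names (urls, url.txt) follow the step above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Nutch.
final class SeedListWriter {

    /** Creates <nutchHome>/urls/url.txt with one seed URL per line and returns its path. */
    static Path writeSeeds(Path nutchHome, String... urls) {
        try {
            Path dir = Files.createDirectories(nutchHome.resolve("urls"));
            Path file = dir.resolve("url.txt");
            Files.write(file, Arrays.asList(urls), StandardCharsets.UTF_8);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: Nutch was extracted to D:\nutch-1.2 unless a path is passed in.
        Path home = Paths.get(args.length > 0 ? args[0] : "D:\\nutch-1.2");
        System.out.println("Wrote " + writeSeeds(home, "http://www.163.com/"));
    }
}
```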


4. Run Nutch in Eclipse


① Choose Run > Open Run Dialog.


② Enter any name for the run configuration.


③ Fill in the main class:

    org.apache.nutch.crawl.Crawl


④ Fill in the program arguments:

    urls -dir crawl -depth 3 -topN 50


⑤ Fill in the VM arguments:

    -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
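The run configuration can also be expressed as a small launcher class, which makes the meaning of each argument explicit: the first argument is the seed folder, -dir is the output directory, -depth is the number of link levels to follow, and -topN caps the pages fetched per level. The sketch below is an assumption-laden equivalent rather than a Nutch-provided entry point: CrawlLauncher is a hypothetical name, and the Crawl class is loaded by reflection only so that the snippet compiles without the Nutch jars; at run time they must be on the classpath.

```java
import java.lang.reflect.Method;

// Hypothetical launcher for illustration; not part of Nutch.
final class CrawlLauncher {

    /** Builds the same argument list as the Eclipse run configuration. */
    static String[] crawlArgs(String seedDir, String outputDir, int depth, int topN) {
        return new String[] {
            seedDir,                          // folder containing url.txt
            "-dir", outputDir,                // where crawl data is written
            "-depth", String.valueOf(depth),  // link levels to follow from the seeds
            "-topN", String.valueOf(topN)     // maximum pages fetched per level
        };
    }

    public static void main(String[] args) throws Exception {
        // Same effect as the VM arguments; set before any Nutch or Hadoop class loads.
        System.setProperty("hadoop.log.dir", "logs");
        System.setProperty("hadoop.log.file", "hadoop.log");

        Class<?> crawl = Class.forName("org.apache.nutch.crawl.Crawl");
        Method main = crawl.getMethod("main", String[].class);
        main.invoke(null, (Object) crawlArgs("urls", "crawl", 3, 50));
    }
}
```

Relative paths such as urls, crawl, and logs are resolved against the working directory of the run configuration, which Eclipse sets to the project root by default.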


Click OK and run the configuration, then check the console output to confirm that the crawl is in progress.
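Besides watching the console, the crawl output directory gives a quick indication of progress. The sketch below reports which expected subdirectories are still missing. The names crawldb, segments, and linkdb are what a stock Nutch 1.x crawl is assumed to produce here, linkdb typically appears only near the end of the run, and CrawlOutputCheck is a hypothetical helper.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Nutch.
final class CrawlOutputCheck {

    // Assumption: subdirectories written by a stock Nutch 1.x crawl.
    private static final String[] EXPECTED = {"crawldb", "segments", "linkdb"};

    /** Returns the expected subdirectories that do not exist yet under crawlDir. */
    static List<String> missingOutputs(Path crawlDir) {
        List<String> missing = new ArrayList<>();
        for (String name : EXPECTED) {
            if (!Files.isDirectory(crawlDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: the crawl was started with "-dir crawl" from D:\nutch-1.2.
        Path crawlDir = Paths.get(args.length > 0 ? args[0] : "D:\\nutch-1.2\\crawl");
        List<String> missing = missingOutputs(crawlDir);
        System.out.println(missing.isEmpty()
                ? "All expected crawl directories exist under " + crawlDir
                : "Not created yet under " + crawlDir + ": " + missing);
    }
}
```

If crawldb never appears, the logs folder named in the VM arguments is the first place to look for the cause.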

 

 
