Deploying Nutch 1.0 in Eclipse (repost)



The solution in the Nutch section of the wiki (http://wiki.apache.org/nutch/RunNutchInEclipse1.0) is not difficult to follow, but configuring it according to the articles floating around online did not work for me: if you import Nutch 1.0 without modifying its code, the project shows two errors, and those articles do not mention them at all. I was really speechless. Below is the method that worked for me.

1. This step is very important: configure the Cygwin environment variables. Without this, you will get a "failed to get the current user's information" or 'Login failed: Cannot run program "bash"' error.

2. Create a project, give it a name, select "Create project from existing source", and point it at your nutch-1.0 directory.

3. Click Next, switch to "Libraries", select "Add Class Folder...", and choose "conf" from the list. A note here: I have read many posts, and this step is done differently in each of them.

4. Switch to "Order and Export", find "conf", and move it to the top.

5. Switch to "Source" and set the Output Folder to nutch/bin/tmp_build (this depends on your setup). Click Finish to complete the import.

6. Modify nutch-default.xml, nutch-site.xml, and crawl-urlfilter.txt.
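As a sketch of what step 6 might involve (the property names are standard Nutch 1.0 configuration keys, but the path and agent name below are placeholders you must adapt to your own checkout), a minimal nutch-site.xml for running inside Eclipse could look like this:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Absolute path to the plugin sources inside your checkout;
       an Eclipse launch often cannot resolve the default relative path. -->
  <property>
    <name>plugin.folders</name>
    <value>/path/to/nutch-1.0/src/plugin</value>
  </property>
  <!-- Nutch 1.0 refuses to fetch unless an agent name is set. -->
  <property>
    <name>http.agent.name</name>
    <value>my-test-crawler</value>
  </property>
</configuration>
```

crawl-urlfilter.txt then needs a regex line permitting the site(s) you actually want to crawl.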

7. Obtain the RTF parser library JAR and place it in the src/plugin/parse-RTF/lib folder.

8. Right-click the project folder and choose Build Path > Configure Build Path.... In the window that appears, switch to Libraries, select Add JARs..., and add the downloaded JAR file to the project.

9. At this point the project will generally show two errors. They are not fixed in the official Nutch 1.0 release because of licensing issues. The next step is the most important part.

Modify RTFParseFactory.java in src\plugin\parse-RTF\src\java\org\apache\nutch\parse\rtf:

Add:

    import org.apache.nutch.parse.ParseResult;

Change:

    public Parse getParse(Content content) {

to:

    public ParseResult getParse(Content content) {

Change:

    return new ParseStatus(ParseStatus.FAILED,
                           ParseStatus.FAILED_EXCEPTION,
                           e.toString()).getEmptyParse(conf);

to:

    return new ParseStatus(ParseStatus.FAILED,
                           ParseStatus.FAILED_EXCEPTION,
                           e.toString()).getEmptyParseResult(content.getUrl(), getConf());

Change:

    return new ParseImpl(text,
                         new ParseData(ParseStatus.STATUS_SUCCESS,
                                       title,
                                       OutlinkExtractor.getOutlinks(text, this.conf),
                                       content.getMetadata(),
                                       metadata));

to:

    return ParseResult.createParseResult(content.getUrl(),
        new ParseImpl(text,
                      new ParseData(ParseStatus.STATUS_SUCCESS,
                                    title,
                                    OutlinkExtractor.getOutlinks(text, this.conf),
                                    content.getMetadata(),
                                    metadata)));

Modify TestRTFParser.java under src\plugin\parse-RTF\src\test\org\apache\nutch\parse\rtf:

Change:

    parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content);

to:

    parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content).get(urlString);

After these changes, the Eclipse project will show no more errors.

10. Choose Run > Run As > Java Application. In the selection dialog that pops up, choose Crawl - org.apache.nutch.crawl. Since no parameters are set yet, the first run does nothing. Next, choose Run > Run Configurations.... Under Java Application on the left there will be a Crawl entry; select it and switch to Arguments. Program arguments holds the parameters to pass: fill in urls -dir crawl -depth 3 -topN 50 (urls is the directory holding your seed-URL file; adjust it to your situation). Under VM arguments, enter -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log.
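To make those program arguments concrete, here is a small self-contained sketch (not Nutch's actual parsing code; the flag handling is simplified for illustration) of how a Crawl-style argument list is interpreted:

```java
import java.util.HashMap;
import java.util.Map;

public class ArgsSketch {
    public static void main(String[] args) {
        // Same arguments as in the run configuration above:
        // the seed-URL directory comes first, then flag/value pairs.
        String[] a = {"urls", "-dir", "crawl", "-depth", "3", "-topN", "50"};
        String seedDir = a[0];
        Map<String, String> opts = new HashMap<>();
        for (int i = 1; i + 1 < a.length; i += 2) {
            opts.put(a[i], a[i + 1]);  // e.g. "-depth" -> "3"
        }
        // -dir: output crawl directory; -depth: link depth; -topN: pages per round
        System.out.println(seedDir + " " + opts.get("-dir") + " "
                + opts.get("-depth") + " " + opts.get("-topN"));
    }
}
```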

Click Run. If all is well, you can watch the page-fetching process. I ran into a problem here, though: Java heap size. Check the logs/hadoop.log file; if it contains a java.lang.OutOfMemoryError: Java heap space message, that is usually the cause. Go to Eclipse > Window > Preferences > Java > Installed JREs > Edit... > Default VM arguments

and set it to -Xms5m -Xmx250m, where -Xms is the minimum heap size and -Xmx is the maximum.
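To confirm the -Xmx setting actually took effect for your launch, a generic JVM check (not Nutch-specific) can print the heap ceiling the process sees:

```java
public class HeapCheck {
    public static void main(String[] args) {
        // maxMemory() reports the -Xmx ceiling, approximately, in bytes.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("max heap MB: " + (maxBytes / (1024 * 1024)));
    }
}
```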

Below are some common errors and their fixes.

Eclipse: cannot create project content in Workspace

The Nutch source code must be outside the workspace folder. My first attempt was to download the code with Eclipse (SVN) into my workspace; when I then tried to create the project from that existing code, Eclipse would not let me do it with the source inside the workspace. Using source code outside the workspace worked fine.

Plugin dir not found

Make sure you set the plugin.folders property correctly. Instead of a relative path you can also use an absolute one, in nutch-default.xml or, perhaps better, in nutch-site.xml:

    <property>
      <name>plugin.folders</name>
      <value>/home/.../nutch-0.9/src/plugin</value>
    </property>

No plugins loaded during unit tests in Eclipse

During unit testing, Eclipse ignores conf/nutch-site.xml in favor of src/test/nutch-site.xml, so you might need to add the plugin-directory configuration to that file as well.

Unit tests work in Eclipse but fail when running ant on the command line

Suppose your unit tests work perfectly in Eclipse, but every one of them fails when you run ant test on the command line, including the ones you haven't modified. Check whether you defined the plugin.folders property in hadoop-site.xml; if so, try removing it from that file and adding it directly to nutch-site.xml.

Run ant test again. That should have solved the problem.

If it didn't, are you testing a plugin? If so, did you add the plugin to the list of packages in the test target of plugin\build.xml?
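For illustration only (the real plugin/build.xml differs between Nutch versions, so treat this as a hypothetical excerpt rather than the actual file), the test target enumerates the plugins whose tests should run; a plugin missing from that list is silently skipped:

```xml
<target name="test">
  <!-- One entry per plugin with unit tests; add yours here. -->
  <ant dir="parse-rtf" target="test"/>
</target>
```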


• Open the class itself and right-click it

• Refresh the build directory

Debugging Hadoop classes

• Sometimes it makes sense to also have the Hadoop classes available during debugging. You can check out the Hadoop sources on your machine and attach them as the source of the hadoop-xxx.jar. Alternatively, you can:

o Remove the hadoop-xxx.jar from your classpath libraries

o Check out the Hadoop branch that is used within Nutch

o Configure a Hadoop project, similar to the Nutch project, within your Eclipse

o Add the Hadoop project as a dependent project of the Nutch project

o You can now also set breakpoints within Hadoop classes, like InputFormat implementations, etc.

Failed to get the current user's information

On Windows, if the crawler throws an exception complaining that it "failed to get the current user's information" or 'Login failed: Cannot run program "bash"', you likely forgot to add Cygwin to your PATH. Open a new command-line window (All Programs > Accessories > Command Prompt) and type bash; this should start Cygwin. If it doesn't, type path to inspect your PATH; you should see the Cygwin bin directory in it (e.g., C:\cygwin\bin). See the steps for adding it to your PATH at the top of the article, under "For Windows users". After setting the PATH, you will likely need to restart Eclipse so that it picks up the new value.

From: http://yang7229693.javaeye.com/blog/436611
