Running Nutch in Eclipse

Last Update:2018-08-21 Source: Internet

Author: User

Running Nutch in Eclipse

Here are instructions for setting up a development environment for Nutch in the Eclipse IDE. They are intended to provide a comprehensive starting resource for configuring, building, crawling with and debugging Nutch trunk in that context.

Contents

  Running Nutch in Eclipse
  Before you start
  Prerequisites
  Steps
    Checkout and build Nutch
    Load project in Eclipse
    Create Eclipse launcher
  Debug Nutch in Eclipse
  Remote debugging in Eclipse (not verified)
  Debugging and timeouts
  Troubleshooting
    Eclipse: Cannot create project content in workspace
    Plugin directory not found
    No plugins loaded during unit tests in Eclipse
  Debugging Hadoop classes
  Non-ported plugins to 2.x

Before you start

Setting up Nutch to run in Eclipse can be tricky, and most of the time it is much faster to edit Nutch in Eclipse but run the scripts from the command line. However, it is very useful to be able to debug Nutch in Eclipse, and it is also extremely useful when applying and testing patches, as it enables you to see them working in a larger context. That being said, you will still benefit greatly from looking at the hadoop.log output. This tutorial covers a fully internal Eclipse/Nutch setup, using only Eclipse tools and associated plugins.

Prerequisites

You need to have Apache Ant installed and configured on your system.

Grab the newest available version of Eclipse.

All of the following should be available from the Eclipse Marketplace. If not, you can download them through Eclipse as follows.

Once you have set up Eclipse, install Subclipse. N.B. If you experience an error with the 1.8.x release, try 1.6.x. This tends to solve compatibility problems.

Grab the IvyDE plugin for Eclipse.

Grab the m2e plugin for Eclipse.
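
Before continuing, it can help to confirm that the command-line tools used in the following steps are on the PATH. The sketch below is only an illustration: have_tool is a hypothetical helper, and the tool list (ant, svn, java) is an assumption based on the commands used later in this tutorial.

```shell
# Hypothetical helper: succeeds if the named command is on the PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Report which of the tools used in this tutorial are available.
for tool in ant svn java; do
  if have_tool "$tool"; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
  fi
done
```

The Eclipse plugins themselves cannot be checked this way; verify those from within Eclipse.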

Steps

Checkout and build Nutch

Get the latest source code from SVN using a terminal. For Nutch 1.x (i.e. trunk), run this:

  svn co https://svn.apache.org/repos/asf/nutch/trunk
  cd trunk

For Nutch 2.x, run this:

  svn co https://svn.apache.org/repos/asf/nutch/branches/2.x
  cd 2.x

For Nutch 1.x (i.e. trunk), skip the data store steps below, which apply only to 2.x, and continue with the step that adds "http.agent.name" to "conf/nutch-site.xml".

At this point you should have decided which data store you want to use. See the Apache Gora documentation for more information about it. Here are a few of the available storage classes:

  org.apache.gora.hbase.store.HBaseStore
  org.apache.gora.cassandra.store.CassandraStore
  org.apache.gora.accumulo.store.AccumuloStore
  org.apache.gora.avro.store.AvroStore
  org.apache.gora.avro.store.DataFileAvroStore

In "conf/nutch-site.xml", add the storage class name. E.g., if you pick HBase as the data store, add this to "conf/nutch-site.xml":

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>

In ivy/ivy.xml, uncomment the dependency for the data store that you selected. E.g., if you use HBase, uncomment this line:

  <dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default"/>

Set the default data store in conf/gora.properties. E.g., for HBase as the data store, put this in conf/gora.properties:

  gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

Add "http.agent.name" and "http.robots.agents" with appropriate values in "conf/nutch-site.xml". See conf/nutch-default.xml for the description of these properties. Also, add "plugin.folders" and set it to {path_to_nutch_checkout}/build/plugins. E.g., if Nutch is present at "/home/tejas/desktop/2.x", set the property to:

<property>
  <name>plugin.folders</name>
  <value>/home/tejas/desktop/2.x/build/plugins</value>
</property>

Run this command:

  ant eclipse

Load project in Eclipse

In Eclipse, click "File" -> "Import...". Select "Existing Projects into Workspace".

In the next window, set the root directory to the location where you checked out Nutch 2.x (or trunk). Click "Finish". You will now have a new project named 2.x (or trunk) added to the workspace. Wait a moment until Eclipse refreshes the SVN cache and builds its workspace. You can see the status at the bottom right corner of Eclipse.

In Package Explorer, right click on the project "2.x" (or trunk) and select "Build Path" -> "Configure Build Path".

In the "Order and Export" tab, scroll down and select "2.x/conf" (or trunk/conf). Click on the "Top" button. Sadly, Eclipse will build the workspace again, but this time it won't take long.

Create Eclipse Launcher

Now, let's get geared up to run something. Let's start off with the inject operation. Right click on the project in "Package Explorer" -> select "Run As" -> select "Run Configurations". Create a new configuration and name it "inject". For 1.x (i.e. trunk), set the main class to: org.apache.nutch.crawl.Injector

For 2.x, set the main class to: org.apache.nutch.crawl.InjectorJob

In the Arguments tab, for program arguments, provide the path of the input directory which holds the seed URLs. Set VM arguments to "-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log"

Click "Apply" and then click "Run". If everything is set up correctly, you should see the inject operation progressing on the console.
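
The input directory passed as the program argument has to exist and contain the seed URLs. Here is a minimal sketch of preparing one, assuming a plain text file with one URL per line. The names "urls" and "seed.txt" and the example URL are illustrative assumptions, not requirements stated in this tutorial.

```shell
# Create a seed directory (the name "urls" is an arbitrary choice).
mkdir -p urls

# Write one seed URL per line; replace the example URL with your own.
printf '%s\n' 'http://nutch.apache.org/' > urls/seed.txt

# Show what will be injected.
cat urls/seed.txt
```

The path of the resulting directory is what goes into the program arguments field of the run configuration.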

If you want to find out the Java class corresponding to any command, just peek inside the "src/bin/nutch" script; at the bottom you will find a switch-like block with a case corresponding to each command. Here are the important classes corresponding to the crawl cycle:

Operation	Class in Nutch 1.x (i.e. trunk)	Class in Nutch 2.x
Inject	org.apache.nutch.crawl.Injector	org.apache.nutch.crawl.InjectorJob
Generate	org.apache.nutch.crawl.Generator	org.apache.nutch.crawl.GeneratorJob
Fetch	org.apache.nutch.fetcher.Fetcher	org.apache.nutch.fetcher.FetcherJob
Parse	org.apache.nutch.parse.ParseSegment	org.apache.nutch.parse.ParserJob
UpdateDB	org.apache.nutch.crawl.CrawlDb	org.apache.nutch.crawl.DbUpdaterJob
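
To look up command classes without opening the script in an editor, the class assignments can be listed with grep. This is a sketch that assumes the script stores the class name in a variable called CLASS; check your own checkout, since the exact layout of the script may differ. list_nutch_classes is a hypothetical helper name.

```shell
# Hypothetical helper: print every line of the given script that assigns
# a class name. Assumes assignments look like "CLASS=org.apache.nutch...".
list_nutch_classes() {
  grep -n 'CLASS=org\.apache\.nutch' "$1"
}

# Run it against the launcher script if it is present in this directory.
if [ -f src/bin/nutch ]; then
  list_nutch_classes src/bin/nutch
fi
```

Each matching line is printed with its line number, so you can jump to the surrounding condition to see which command it belongs to.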

Debug Nutch in Eclipse

Set breakpoints and debug a crawl. It can be tricky to find out where to set the breakpoint, because of the Hadoop jobs. Here are a few good places to set breakpoints in the 1.x codebase:

Fetcher [line: 1115] - run
Fetcher [line: 530] - fetch
Fetcher$FetcherThread [line: 560] - run()
Generator [line: 443] - generate
Generator$Selector [line: 108] - map
OutlinkExtractor [line: 71 &] - getOutlinks

Here are a few good places to set breakpoints in the 2.x codebase:

FetcherReducer$FetcherThread run(): line 487: LOG.info("fetching " + fit.url ...
                                  : line 519: final ProtocolStatus status = output.getStatus();

GeneratorMapper: map(): line
GeneratorReducer: reduce(): line
OutlinkExtractor: getOutlinks(): line 84

Remote Debugging in Eclipse (not verified)

Create a new Debug Configuration of type "Remote Java Application" and remember the port (here: 37649).

Launch Nutch from the command line, but add options to use the Java Debugger JDWP agent library, e.g. from bash:

% export NUTCH_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=localhost:37649"
% $NUTCH_HOME/bin/nutch parsechecker http://myurl.com/

The application is suspended just after launch. Now go to Eclipse, set appropriate breakpoints, and run the previously created Debug Configuration.

Instead of creating an extra launch configuration for every tool you want to debug, a single configuration is enough to debug any tool (parsechecker, indexchecker, URL filter, etc.), and even remotely (crawler or tool running on a server, Eclipse debugger running locally).
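
Since only the port changes between debugging sessions, the option string can be built by a small shell function. This is a sketch: jdwp_opts is a hypothetical helper name, and 37649 is just the example port used in this section.

```shell
# Hypothetical helper: build the JDWP agent options for a given port.
jdwp_opts() {
  printf '%s' "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=localhost:$1"
}

# Export the options under the NUTCH_OPTS variable name used in this section.
NUTCH_OPTS="$(jdwp_opts 37649)"
export NUTCH_OPTS
echo "$NUTCH_OPTS"
```

With suspend=y the JVM waits until the debugger attaches. Changing it to suspend=n lets the tool start immediately, which is only useful if you attach while it is still running.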

Debugging and Timeouts

Debugging takes time, especially when inspecting variables, stack traces, etc. It usually takes so long that some timeout applies and stops the application. In the nutch-site.xml used for debugging, set timeouts to a rather high value (or -1 for unlimited), e.g., when debugging the parser:

<property>
  <name>parser.timeout</name>
  <value>-1</value>
</property>

Troubleshooting

Eclipse: Cannot create project content in workspace

The Nutch source code must be outside of the workspace folder. Alternatively, you can download the code with Eclipse (SVN) under your workspace rather than trying to create the project from existing code; Eclipse sometimes doesn't let you create a project from source code that is already inside the workspace.

Plugin directory not found

Make sure you set your plugin.folders property correctly. Instead of using a relative path, you can use an absolute one as well, in nutch-default.xml or, even better, in nutch-site.xml. Ideally, all efforts should be made to keep nutch-default.xml completely intact.

<property>
  <name>plugin.folders</name>
  <value>/home/....../trunk/src/plugin</value>
</property>
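
A quick way to rule out a wrong path is to test whether the configured directory exists before launching Nutch. The sketch below uses a hypothetical helper, check_plugin_dir; the build/plugins path in the example is an assumption, so pass the value of your own plugin.folders property instead.

```shell
# Hypothetical helper: report whether a plugin directory exists.
check_plugin_dir() {
  if [ -d "$1" ]; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Example: check a path relative to the current checkout (assumed location).
check_plugin_dir build/plugins || echo "check the plugin.folders value"
```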

No plugins loaded during unit tests in Eclipse

During unit testing, Eclipse ignores conf/nutch-site.xml in favor of src/test/nutch-site.xml, so you might need to add the plugin directory configuration to that file as well.

Debugging Hadoop Classes

Sometimes (fairly often) it makes sense to also have the Hadoop classes available during debugging. This should really be second nature, as Nutch relies heavily upon the underlying Hadoop infrastructure. You can therefore check out the Hadoop sources into your Eclipse IDE and debug both together. You can:

Check out the Hadoop version that should be used within Nutch trunk.

Configure a Hadoop project similar to the Nutch project within your Eclipse IDE.

Add the Hadoop project as a dependent project of the Nutch project.

You can now also set breakpoints within Hadoop classes, such as InputFormat implementations.

Non-ported plugins to 2.x

A few plugins have not been ported to the Nutch 2.x series yet. If you are following the above tutorial for building Nutch 2.x, please check Nutch2Plugins for more information.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
