Use Ant to compile Nutch 2.x and configure Nutch 2.x in an Ubuntu environment

Source: Internet
Author: User
Tags: solr

Using Ant to compile Nutch 2.x

See:

(1) http://blog.javachen.com/2014/05/20/nutch-intro/

(2) wiki.apache.org/nutch/nutch2tutorial

Prerequisites: Configure Ant (http://www.cnblogs.com/xxx0624/p/4172277.html)

1. Download Nutch (for example, I used apache-nutch-2.2.1-src.tar.gz)

Extract the archive, rename the extracted folder to nutch, and then move that folder into /home.
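Step 1 can be sketched as a small shell function. This is only an illustration under stated assumptions: the helper name unpack_nutch and its two parameters are hypothetical, the archive is assumed to contain exactly one top-level folder, and the target folder is named nutch to match the /home/nutch path used later in this article.

```shell
# Sketch: unpack a Nutch source archive, rename the extracted folder to
# "nutch", and move it under a destination directory.
# unpack_nutch is a hypothetical helper name, not part of Nutch.
unpack_nutch() {
    archive=$1      # e.g. apache-nutch-2.2.1-src.tar.gz
    dest=$2         # e.g. /home (writing there usually needs root rights)

    # Refuse to overwrite an existing installation.
    if [ -e "$dest/nutch" ]; then
        echo "$dest/nutch already exists" >&2
        return 1
    fi

    # Assumption: the archive contains exactly one top-level folder.
    top=$(tar -tzf "$archive" | head -n 1 | cut -d/ -f1) || return 1
    workdir=$(mktemp -d) || return 1

    # Extract into a scratch directory, then move the folder into place
    # under its new name.
    tar -xzf "$archive" -C "$workdir" || return 1
    mkdir -p "$dest" || return 1
    mv "$workdir/$top" "$dest/nutch" || return 1
    rmdir "$workdir"
}

# Example call (not run here):
#   unpack_nutch apache-nutch-2.2.1-src.tar.gz /home
```

The function has to run as a user who may write to the destination directory. For /home that usually means a root shell.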

2. Compiling Nutch

cd nutch
ant

2.1 You may encounter this error:

Trying to override old definition of task javac
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-probe-antlib:

ivy-download:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

Cause: The appropriate jar file is missing

Workaround:

(1) Download sonar-ant-task-2.1.jar and put it in the root of the nutch folder

(2) Modify the build.xml file so that it picks up this new jar

<!-- Define the Sonar task if this hasn't been done in a common script -->
<taskdef uri="antlib:org.sonar.ant" resource="org/sonar/ant/antlib.xml">
    <classpath path="${ant.library.dir}"/>
    <classpath path="${mysql.library.dir}"/>
    <classpath><fileset dir="." includes="sonar*.jar"/></classpath>
</taskdef>

Find the corresponding taskdef in build.xml and add the parts of this block that are missing there.

2.2 Compilation takes too long

Nutch is built with Ivy, which downloads the dependencies during the build, so compiling can take a long time. If it takes too long, switch to a Maven mirror.

Modify the file ivy/ivysettings.xml: replace

http://repo1.maven.org/maven2/

with

http://mirrors.ibiblio.org/maven2/
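The URL swap can also be done from the command line. The following is a minimal sketch, assuming the repo1.maven.org URL appears literally in ivy/ivysettings.xml. The helper name use_maven_mirror is hypothetical.

```shell
# Sketch: point Ivy at a Maven mirror by rewriting the repository URL.
# use_maven_mirror is a hypothetical helper name, not part of Nutch or Ivy.
use_maven_mirror() {
    settings=$1     # e.g. ivy/ivysettings.xml

    # Keep the original as <file>.bak, then replace every occurrence of
    # the default repository URL with the mirror URL.
    sed -i.bak \
        's#http://repo1\.maven\.org/maven2/#http://mirrors.ibiblio.org/maven2/#g' \
        "$settings"
}

# Example call from the nutch folder (not run here):
#   use_maven_mirror ivy/ivysettings.xml
```

The backup file makes it easy to restore the original setting if the mirror turns out to be slower or unavailable.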

2.3 Directory layout after compiling:

.
├── build
├── build.xml
├── build.xml~
├── CHANGES.txt
├── conf
├── default.properties
├── docs
├── ivy
├── lib
├── LICENSE.txt
├── NOTICE.txt
├── README.txt
├── runtime
├── sonar-ant-task-2.1.jar
└── src

7 directories, 8 files

3. Modify the Nutch configuration file

Nutch 2.x uses Gora as its storage layer to access Cassandra, HBase, Accumulo, Avro, and so on, so the Gora settings must be specified in the configuration files. The steps below select HBase.

3.1 Modify conf/nutch-site.xml

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>

3.2 Modify ivy/ivy.xml

Uncomment the gora-hbase dependency:

<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

3.3 Modify conf/gora.properties

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
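Because the HBase backend has to be selected in three separate files, a quick text check can catch a missed edit before rebuilding. This is a rough sketch only. The helper name check_hbase_backend is hypothetical, the class name is spelled org.apache.gora.hbase.store.HBaseStore, and the check is a plain string search, so it cannot tell whether the ivy.xml dependency is still inside an XML comment.

```shell
# Sketch: check that the three files edited in step 3 all mention HBase.
# check_hbase_backend is a hypothetical helper name, not part of Nutch.
check_hbase_backend() {
    root=$1     # Nutch source folder, e.g. /home/nutch
    status=0

    grep -qF 'org.apache.gora.hbase.store.HBaseStore' "$root/conf/nutch-site.xml" \
        || { echo "conf/nutch-site.xml: HBaseStore is not set"; status=1; }

    # Limitation: this also matches a dependency that is still commented out.
    grep -qF 'name="gora-hbase"' "$root/ivy/ivy.xml" \
        || { echo "ivy/ivy.xml: gora-hbase dependency not found"; status=1; }

    grep -qF 'gora.datastore.default=org.apache.gora.hbase.store.HBaseStore' \
        "$root/conf/gora.properties" \
        || { echo "conf/gora.properties: default datastore is not HBaseStore"; status=1; }

    return $status
}

# Example call (not run here):
#   check_hbase_backend /home/nutch
```

The function prints one line per file that fails the check and returns a non-zero status if any of them does.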

/*****************************************************************************/

Configure Nutch

(The nutch folder is already in the /home directory)

1. Modify the system environment variables

sudo gedit /etc/profile

Add the following lines:

# set nutch
export PATH=/home/nutch/runtime/local/bin:$PATH
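Every time /etc/profile is sourced again, the export line above prepends the same directory once more. A slightly more defensive variant is sketched below. The helper name prepend_path is hypothetical, and the directory is the one used in this article.

```shell
# Sketch: prepend a directory to PATH only if it is not already listed.
# prepend_path is a hypothetical helper name.
prepend_path() {
    dir=$1
    case ":$PATH:" in
        *":$dir:"*) ;;              # already on PATH, nothing to do
        *) PATH="$dir:$PATH" ;;     # otherwise put it in front
    esac
    export PATH
}

# set nutch
prepend_path /home/nutch/runtime/local/bin
```

To pick up the change in the current terminal without logging out, run `. /etc/profile`. Afterwards `command -v nutch` should print the path of the nutch script.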

2. Test (the nutch and crawl scripts are in nutch/runtime/local/bin)

nutch

The results are as follows:

Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

crawl

The results are as follows:

Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
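The usage line above shows that crawl expects a seed directory as its first argument. Below is a minimal sketch of preparing one, assuming the seed directory holds a plain text file with one URL per line. The helper name make_seed_dir, the file name seed.txt, and all values in the example call are placeholders, not taken from this article.

```shell
# Sketch: create a seed directory for the crawl script.
# make_seed_dir is a hypothetical helper name, not part of Nutch.
make_seed_dir() {
    seed_dir=$1
    shift                           # remaining arguments are the seed URLs

    mkdir -p "$seed_dir" || return 1
    # Write one URL per line into a text file inside the seed directory.
    printf '%s\n' "$@" > "$seed_dir/seed.txt"
}

# Example (placeholder values, not run here); the crawl itself assumes
# that HBase and a Solr instance are running and reachable:
#   make_seed_dir urls http://nutch.apache.org/
#   crawl urls testCrawl http://localhost:8983/solr/ 2
```

The four arguments in the example map, in order, to seedDir, crawlID, solrURL, and numberOfRounds from the usage message.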
