Using Ant to compile Nutch 2.x and configure it on Ubuntu
See:
1. http://blog.javachen.com/2014/05/20/nutch-intro/
2. wiki.apache.org/nutch/nutch2tutorial
Prerequisites: Configure Ant (http://www.cnblogs.com/xxx0624/p/4172277.html)
1. Download Nutch (for example, I used apache-nutch-2.2.1-src.tar.gz)
Unpack the archive, rename the extracted folder to nutch, and move it into the /home folder.
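The unpack, rename and move steps can be sketched as a small shell function. The helper name unpack_nutch is made up for illustration, and the function assumes the archive contains a single top-level folder, so check what your download actually unpacks to.

```shell
#!/bin/sh
# Sketch only: unpack_nutch is a hypothetical helper, not part of Nutch.
# Assumption: the tarball contains exactly one top-level folder.
unpack_nutch() {
    tarball=$1   # e.g. apache-nutch-2.2.1-src.tar.gz
    dest=$2      # e.g. /home/nutch
    # Refuse to overwrite an existing folder.
    if [ -e "$dest" ]; then
        echo "unpack_nutch: $dest already exists" >&2
        return 1
    fi
    # Extract into a scratch directory so the original folder name does not matter.
    scratch=$(mktemp -d) || return 1
    tar -xzf "$tarball" -C "$scratch" || return 1
    # Take the single top-level folder and move it to its final name.
    top=$(ls "$scratch" | head -n 1)
    mv "$scratch/$top" "$dest" && rmdir "$scratch"
}

# Example (moving into /home usually needs root privileges):
#   unpack_nutch apache-nutch-2.2.1-src.tar.gz /home/nutch
```

Extracting into a scratch directory first means the final folder is always called nutch, whatever name the archive uses internally.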
2. Compiling Nutch
cd nutch
ant
2.1 You may encounter this error:
Trying to override old definition of task javac
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-probe-antlib:
ivy-download:
[taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
Cause: The appropriate jar file is missing
Workaround:
(1) Download sonar-ant-task-2.1.jar and put it in the nutch folder
(2) Edit build.xml so that it picks up the new jar
<!-- Define the Sonar task if this hasn't been done in a common script -->
<taskdef uri="antlib:org.sonar.ant" resource="org/sonar/ant/antlib.xml">
  <classpath path="${ant.library.dir}"/>
  <classpath path="${mysql.library.dir}"/>
  <classpath><fileset dir="." includes="sonar*.jar"/></classpath>
</taskdef>
Find the corresponding taskdef in build.xml and add the missing content.
2.2 Compilation takes too long
Nutch resolves its dependencies with Ivy, so the build can take a long time while the jars are downloaded. If it takes too long, try a different Maven repository.
Edit the file ivy/ivysettings.xml and replace
http://repo1.maven.org/maven2/
with
http://mirrors.ibiblio.org/maven2/
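The same substitution can be scripted rather than done by hand. The following is a minimal sketch, assuming the repository URL appears literally in ivy/ivysettings.xml; switch_maven_repo is a hypothetical helper name, and a .bak copy of the file is kept so the change is easy to undo.

```shell
#!/bin/sh
# Sketch only: switch_maven_repo is a hypothetical helper, not part of Nutch.
switch_maven_repo() {
    settings=$1   # e.g. ivy/ivysettings.xml
    # Dots are escaped so sed matches them literally.
    old='http://repo1\.maven\.org/maven2/'
    new='http://mirrors.ibiblio.org/maven2/'
    # Keep a backup, then rewrite the file from the backup.
    cp "$settings" "$settings.bak" || return 1
    sed "s|$old|$new|g" "$settings.bak" > "$settings"
}

# Example, run from the nutch folder:
#   switch_maven_repo ivy/ivysettings.xml
```

To go back to the original repository, copy ivysettings.xml.bak over ivysettings.xml.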
2.3 Directory layout after compiling:
.
├── build
├── build.xml
├── build.xml~
├── CHANGES.txt
├── conf
├── default.properties
├── docs
├── ivy
├── lib
├── LICENSE.txt
├── NOTICE.txt
├── README.txt
├── runtime
├── sonar-ant-task-2.1.jar
└── src

7 directories, 8 files
3. Modify the Nutch configuration files
Nutch 2.x uses Gora as its storage layer to access Cassandra, HBase, Accumulo, Avro, and so on, so the Gora properties must be specified in the configuration files.
3.1 Modify conf/nutch-site.xml
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
3.2 Modify ivy/ivy.xml
<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default"/>
3.3 Modify conf/gora.properties
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
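Since the HBase backend has to be named consistently in three files, a quick check before rebuilding may save a failed build. This is a rough sketch: check_hbase_backend is a hypothetical helper, and it only searches for the expected strings, so it cannot tell whether the gora-hbase dependency in ivy.xml is still inside an XML comment.

```shell
#!/bin/sh
# Sketch only: check_hbase_backend is a hypothetical helper, not part of Nutch.
# It greps for the expected strings and does not parse XML, so a dependency
# that is still commented out in ivy.xml will not be detected.
check_hbase_backend() {
    root=$1   # Nutch source folder, e.g. /home/nutch
    store='org.apache.gora.hbase.store.HBaseStore'
    status=0
    grep -q "$store" "$root/conf/nutch-site.xml" 2>/dev/null ||
        { echo "conf/nutch-site.xml: $store not found"; status=1; }
    grep -q 'name="gora-hbase"' "$root/ivy/ivy.xml" 2>/dev/null ||
        { echo "ivy/ivy.xml: gora-hbase dependency not found"; status=1; }
    grep -q "^gora.datastore.default=$store" "$root/conf/gora.properties" 2>/dev/null ||
        { echo "conf/gora.properties: default datastore is not $store"; status=1; }
    # 0 when all three strings were found, 1 otherwise.
    return $status
}

# Example:
#   check_hbase_backend /home/nutch && echo "HBase backend settings found"
```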
/************************************************************/
Configure Nutch
(The nutch folder is already in the /home directory)
1. Modify the system environment variables
sudo gedit /etc/profile
Add:
#set nutch
export PATH=/home/nutch/runtime/local/bin:$PATH
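If /etc/profile is read more than once in a session, a plain export line prepends the same folder again each time. A guarded variant avoids that. This is only a sketch, and add_to_path is a hypothetical helper name rather than a standard command.

```shell
#!/bin/sh
# Sketch only: add_to_path is a hypothetical helper name.
# Prepend a directory to PATH unless it is already listed.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                        # already present, do nothing
        *) PATH="$1:$PATH"; export PATH ;;  # prepend and export
    esac
}

add_to_path /home/nutch/runtime/local/bin
```

After editing /etc/profile, log in again or run `. /etc/profile` in the current shell so the new PATH takes effect.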
2. Test (run nutch and crawl from nutch/runtime/local/bin)
nutch
The results are as follows:
Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
crawl
The results are as follows:
Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
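The usage line lists four arguments, so a real invocation might look like the sketch below. All four values are assumptions chosen for illustration (the seed folder name, the crawl ID, a Solr instance on localhost, and two rounds), and with the HBase backend configured earlier the crawl presumably also needs HBase to be running.

```shell
#!/bin/sh
# Sketch only: every value below is an assumption made for illustration.
SEED_DIR=urls                          # hypothetical seed folder
CRAWL_ID=testCrawl                     # hypothetical crawl ID
SOLR_URL=http://localhost:8983/solr    # assumed local Solr URL
ROUNDS=2                               # number of crawl rounds

# The seed folder holds a text file with one URL per line.
mkdir -p "$SEED_DIR"
echo 'http://nutch.apache.org/' > "$SEED_DIR/seed.txt"

# Run the crawl only if the script is on PATH.
if command -v crawl >/dev/null 2>&1; then
    crawl "$SEED_DIR" "$CRAWL_ID" "$SOLR_URL" "$ROUNDS"
else
    echo "crawl not found on PATH" >&2
fi
```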