Reference: http://wiki.apache.org/nutch/RunNutchInEclipse
First, prepare the environment
1. Download the Nutch 2.3 source code
wget http://mirror.bit.edu.cn/apache/nutch/2.3/apache-nutch-2.3-src.tar.gz
Or check out the latest development version:
svn co https://svn.apache.org/repos/asf/nutch/branches/2.x
2. Choose the data store to use. HBase is used as the example here.
In conf/nutch-site.xml, add the following property:
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
3. Enable the HBase dependency in ivy/ivy.xml. The entry already exists but is commented out, so just remove the comment markers around it:
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
Note that rev=0.5 corresponds to HBase 0.94, and rev=0.3 to HBase 0.90.4.
4. Add the following three properties to conf/nutch-site.xml:
<property>
  <name>http.agent.name</name>
  <value>my Nutch spider</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>none</value>
</property>
<property>
  <name>plugin.folders</name>
  <value>/users/liaoliuqing/0_search/1_nutch/1_official/apache-nutch-2.3/build/plugins</value>
</property>
Here the value of plugin.folders is $NUTCH_HOME/build/plugins.
5. Run ant eclipse in the Nutch source directory to generate the Eclipse project files.
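The property names from steps 2 and 4 are easy to mistype. As a rough sanity check before importing the project, a small shell function along the following lines could list any that are missing from the configuration file. This is only a sketch: the function name missing_nutch_props is made up for illustration, and it does a plain text match on <name>...</name>, so it will not notice properties that sit inside an XML comment or have whitespace inside the name tag.

```shell
# Print each required property name that does NOT appear in the given
# Nutch configuration file. Prints nothing when all four are present.
# (missing_nutch_props is a hypothetical helper, not part of Nutch.)
missing_nutch_props() (
  conf_file=$1
  for name in storage.data.store.class http.agent.name \
              http.robots.agents plugin.folders; do
    # -q: no output, only exit status; -F: match the pattern as a fixed string
    grep -qF "<name>$name</name>" "$conf_file" || echo "$name"
  done
)

# Example call (path is an assumption, adjust to your checkout):
#   missing_nutch_props apache-nutch-2.3/conf/nutch-site.xml
```

An empty result means all four names were found; anything printed is a property still to be added.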
Second, import the project
1. Import the project into Eclipse
2. In the Build Path settings, move apache-nutch-2.3/conf to the top of the list by selecting it and clicking the Top button.
Third, run the program
1. Open Run As -> Run Configurations, then select the Project and the Main class.
2. Fill in the arguments
Program arguments: /users/liaoliuqing/downloads/seed.txt
VM arguments: -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
3. Click Run. The output looks like this:
InjectorJob: starting at 2015-01-28 16:27:43
InjectorJob: Injecting urlDir: /users/liaoliuqing/downloads/seed.txt
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of URLs rejected by filters: 0
InjectorJob: total number of URLs injected after normalization and filtering: 1
Injector: finished at 2015-01-28 16:27:47, elapsed: 00:00:04
Note that HBase must already be running on the machine before you start the program.
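The HBase check and the run configuration from steps 1 and 2 can also be approximated outside Eclipse. The sketch below rests on several assumptions: that HBase runs locally, so the JDK's jps tool lists an HMaster process; that the main class is org.apache.nutch.crawl.InjectorJob (inferred from the InjectorJob lines in the output); and that you supply a suitable classpath yourself. Both function names are made up for illustration.

```shell
# Succeeds if the process list read from stdin (the output of `jps`)
# contains an HBase master process.
hbase_master_running() {
  grep -q ' HMaster$'
}

# Print (but do not run) a java command line that mirrors the Eclipse run
# configuration: the -D options are VM arguments, the seed path is the
# program argument.
#   $1: classpath (assumption: conf directory plus Nutch classes and jars)
#   $2: seed file or directory
injector_command() {
  printf 'java -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log -cp %s org.apache.nutch.crawl.InjectorJob %s\n' "$1" "$2"
}

# Example usage (NUTCH_CLASSPATH is a hypothetical variable):
#   jps | hbase_master_running || echo "HBase is not running, start it first" >&2
#   injector_command "$NUTCH_CLASSPATH" /users/liaoliuqing/downloads/seed.txt
```

Printing the command rather than executing it keeps the sketch harmless; once the classpath is known to be right, the printed line can be pasted into a terminal.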
4. View the data in HBase
hbase(main):003:0> scan 'webpage'
ROW                  COLUMN+CELL
 com.163.www:http/   column=f:fi, timestamp=1422433667377, value=\x00'\x8D\x00
 com.163.www:http/   column=f:ts, timestamp=1422433667377, value=\x00\x00\x01K/\xA7:\x14
 com.163.www:http/   column=mk:_injmrk_, timestamp=1422433667377, value=y
 com.163.www:http/   column=mk:dist, timestamp=1422433667377, value=0
 com.163.www:http/   column=mtdt:_csh_, timestamp=1422433667377, value=?\x80\x00\x00
 com.163.www:http/   column=s:s, timestamp=1422433667377, value=?\x80\x00\x00
1 row(s) in 0.2970 seconds
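The row key com.163.www:http/ in the scan output is the injected URL with the labels of its host name in reverse order, followed by the protocol and the path. Assuming the seed file contained http://www.163.com/ (the file's contents are not shown above), the transformation can be illustrated with the shell sketch below. It is not Nutch's own implementation: reverse_url is a hypothetical helper, and it only handles URLs of the form scheme://host/path without a port.

```shell
# Turn scheme://host/path into reversed.host:scheme/path.
# Runs in a subshell so the IFS change does not leak to the caller.
reverse_url() (
  url=$1
  scheme=${url%%://*}        # text before "://", e.g. "http"
  rest=${url#*://}           # text after "://", e.g. "www.163.com/"
  host=${rest%%/*}           # text up to the first "/", e.g. "www.163.com"
  case $rest in
    */*) path=/${rest#*/} ;; # "/" plus everything after the first "/"
    *)   path=/ ;;           # no path given, use the root
  esac
  reversed=
  IFS=.
  for label in $host; do     # split the host on dots
    reversed=$label${reversed:+.$reversed}   # prepend each label
  done
  printf '%s:%s%s\n' "$reversed" "$scheme" "$path"
)

reverse_url http://www.163.com/   # prints com.163.www:http/
```

Reversing the host labels puts every page of a domain under a common key prefix, so pages from the same site end up next to each other in the table.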
Running Nutch 2.3 in Eclipse