After running ant runtime on the Nutch source code, a runtime directory is created that contains two subdirectories: deploy and local.
$ ls
deploy  local
These two directories correspond to Nutch's two modes of operation: deploy mode and local mode.
The inject command is used below as an example to demonstrate both modes.
I. Local mode
1. Basic usage:
$ bin/nutch inject
Usage: InjectorJob <url_dir> [-crawlId <id>]
Usage 1: no ID specified
liaoliuqingdeMacBook-Air:local liaoliuqing$ bin/nutch inject urls
InjectorJob: starting at 2014-12-20 22:32:01
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2014-12-20 22:32:15, elapsed: 00:00:14
Usage 2: ID specified
$ bin/nutch inject urls -crawlId 2
InjectorJob: starting at 2014-12-20 22:34:01
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2014-12-20 22:34:15, elapsed: 00:00:14
2. Data changes in the database
The commands above create a new table in HBase named ${id}_webpage; if no ID is specified, the table is named webpage.
The contents of the files in the urls directory are then written to that table as crawl seeds.
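The naming rule just described can be written down as a tiny helper. This is only an illustrative sketch of the rule as stated in this article; the function name webpage_table_name is made up for the example and is not part of Nutch.

```python
def webpage_table_name(crawl_id=None):
    """Return the HBase table name for a crawl, per the rule described above.

    With an ID the table is "<id>_webpage"; without one it is "webpage".
    (Hypothetical helper for illustration, not a Nutch API.)
    """
    if crawl_id is None or str(crawl_id) == "":
        return "webpage"
    return f"{crawl_id}_webpage"


if __name__ == "__main__":
    print(webpage_table_name())    # corresponds to: bin/nutch inject urls
    print(webpage_table_name(2))   # corresponds to: bin/nutch inject urls -crawlId 2
```

This is handy when you need to know which table to scan in the HBase shell after an inject.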
hbase(main):003:0> scan 'webpage'
ROW                  COLUMN+CELL
 com.163.www:http/   column=f:fi, timestamp=1419085934952, value=\x00'\x8D\x00
 com.163.www:http/   column=f:ts, timestamp=1419085934952, value=\x00\x00\x01Jh\x1C\xBC7
 com.163.www:http/   column=mk:_injmrk_, timestamp=1419085934952, value=y
 com.163.www:http/   column=mk:dist, timestamp=1419085934952, value=0
 com.163.www:http/   column=mtdt:_csh_, timestamp=1419085934952, value=?\x80\x00\x00
 com.163.www:http/   column=s:s, timestamp=1419085934952, value=?\x80\x00\x00
1 row(s) in 0.6140 seconds
When the inject command is run again, any new URLs are added to the table.
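The scan output above is easier to read once the row key and the binary values are decoded. The sketch below rests on two assumptions: that the row key is the URL with its host name reversed (which matches com.163.www:http/ for http://www.163.com/), and that f:fi and s:s hold a big-endian 4-byte integer (fetch interval in seconds) and a big-endian 4-byte float (score). The function names are invented for this example.

```python
import struct
from urllib.parse import urlsplit


def reverse_url(url):
    """Build a row key shaped like the one in the scan: reversed host, scheme, path."""
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    key = f"{host}:{parts.scheme}"
    if parts.port is not None:
        key += f":{parts.port}"
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return key + path


def decode_int(raw):
    """Decode a 4-byte big-endian signed integer (assumed encoding of f:fi)."""
    return struct.unpack(">i", raw)[0]


def decode_float(raw):
    """Decode a 4-byte big-endian float (assumed encoding of s:s)."""
    return struct.unpack(">f", raw)[0]


if __name__ == "__main__":
    print(reverse_url("http://www.163.com/"))   # com.163.www:http/
    print(decode_int(b"\x00'\x8d\x00"))         # 2592000 seconds, i.e. 30 days
    print(decode_float(b"?\x80\x00\x00"))       # 1.0
```

Under these assumptions, the injected seed has a 30-day fetch interval and an initial score of 1.0.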
3. Other commands
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
You can run these commands one at a time to step through a complete crawl.
When the crawl command runs a crawl task, its basic steps are as follows:
(1) InjectorJob
Start the first iteration
(2) GeneratorJob
(3) FetcherJob
(4) ParserJob
(5) DbUpdaterJob
(6) SolrIndexerJob
Start the second iteration
(2) GeneratorJob
(3) FetcherJob
(4) ParserJob
(5) DbUpdaterJob
(6) SolrIndexerJob
Start the third iteration
For details on how each step runs, see http://blog.csdn.net/jediael_lu/article/details/38591067
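The structure of the list above (one injection, then the same five jobs repeated per iteration) can be sketched as a small function that produces the job order for a given number of rounds. It only models the ordering described in this article; it does not run Nutch, and the name crawl_plan is made up for the example.

```python
# Jobs repeated in every iteration, in the order listed above.
ROUND_JOBS = [
    "GeneratorJob",
    "FetcherJob",
    "ParserJob",
    "DbUpdaterJob",
    "SolrIndexerJob",
]


def crawl_plan(rounds):
    """Return the ordered list of jobs for a crawl with the given number of rounds."""
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    plan = ["InjectorJob"]          # seeds are injected once, before the first round
    for _ in range(rounds):
        plan.extend(ROUND_JOBS)     # each round repeats steps (2) to (6)
    return plan


if __name__ == "__main__":
    for step, job in enumerate(crawl_plan(2), start=1):
        print(step, job)
```

A two-round crawl therefore consists of 11 job runs: one InjectorJob followed by two passes over the five per-round jobs.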
4. Nutch provides a crawl script that wraps the key steps, so there is no need to run the crawl process step by step.
$ bin/crawl
Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
For example:
# ./crawl seed.txt testcrawl http://localhost:8983/solr 2
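If you launch the crawl script from another program, it helps to build the argument list in the order the usage message gives: seedDir, crawlID, solrURL, numberOfRounds. The helper below is a hypothetical sketch that only assembles and prints the command; the default script path bin/crawl is an assumption about where you run it from, and nothing is executed.

```python
import shlex


def build_crawl_command(seed_dir, crawl_id, solr_url, rounds, script="bin/crawl"):
    """Assemble the crawl command as an argument list (hypothetical helper)."""
    rounds = int(rounds)
    if rounds < 1:
        raise ValueError("numberOfRounds must be at least 1")
    return [script, seed_dir, crawl_id, solr_url, str(rounds)]


if __name__ == "__main__":
    cmd = build_crawl_command("seed.txt", "testcrawl", "http://localhost:8983/solr", 2)
    # Print a shell-safe version of the command instead of running it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to subprocess.run once the working directory is the local or deploy directory.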
II. Deploy mode
1. Run with the hadoop command
Note: Hadoop and HBase must be started first.
$ hadoop jar apache-nutch-2.2.1.job org.apache.nutch.crawl.InjectorJob file:///opt/jediael/apache-nutch-2.2.1/runtime/deploy/urls/
14/12/20 23:26:50 INFO crawl.InjectorJob: InjectorJob: starting at 2014-12-20 23:26:50
14/12/20 23:26:50 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: file:/opt/jediael/apache-nutch-2.2.1/runtime/deploy/urls
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.3.2-1031432, built on 11/05/2010 05:32 GMT
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:host.name=jediael
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.version=1.7.0_51
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.7.0_51/jre
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/opt/jediael/hadoop-1.2.1/libexec/../conf:/usr/java/jdk1.7.0_51/lib/tools.jar:/opt/jediael/hadoop-1.2.1/libexec/..:/opt/jediael/hadoop-1.2.1/libexec/../hadoop-core-1.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/asm-3.2.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/aspectjrt-1.6.11.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/aspectjtools-1.6.11.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-beanutils-1.7.0.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-beanutils-core-1.8.0.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-cli-1.2.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-codec-1.4.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-collections-3.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-configuration-1.6.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-daemon-1.0.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-digester-1.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-el-1.0.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-httpclient-3.0.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-io-2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-lang-2.4.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-logging-1.1.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-logging-api-1.0.4.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-math-2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-net-3.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/core-3.1.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/hadoop-capacity-scheduler-1.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/hadoop-fairscheduler-1.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/hadoop-thriftfs-1.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/hsqldb-1.8.0.10.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jackson-core-asl-1.8.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jasper-compiler-5.5.12.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jasper-runtime-5.5.12.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jdeb-0.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jersey-core-1.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jersey-json-1.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jersey-server-1.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jets3t-0.6.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jetty-6.1.26.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jetty-util-6.1.26.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jsch-0.1.42.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/junit-4.5.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/kfs-0.2.2.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/log4j-1.2.15.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/mockito-all-1.8.5.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/oro-2.0.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/servlet-api-2.5-20081211.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/slf4j-api-1.4.3.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/slf4j-log4j12-1.4.3.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/xmlenc-0.52.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jsp-2.1/jsp-2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jsp-2.1/jsp-api-2.1.jar
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/opt/jediael/hadoop-1.2.1/libexec/../lib/native/Linux-amd64-64
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:os.version=2.6.32-431.17.1.el6.x86_64
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:user.name=jediael
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:user.home=/home/jediael
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:user.dir=/opt/jediael/apache-nutch-2.2.1/runtime/deploy
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
14/12/20 23:26:52 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181
14/12/20 23:26:52 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
14/12/20 23:26:52 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x14a5c24c9cf0657, negotiated timeout = 40000
14/12/20 23:26:52 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
14/12/20 23:26:55 INFO input.FileInputFormat: Total input paths to process : 1
14/12/20 23:26:55 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/12/20 23:26:55 WARN snappy.LoadSnappy: Snappy native library not loaded
14/12/20 23:26:56 INFO mapred.JobClient: Running job: job_201412202325_0002
14/12/20 23:26:57 INFO mapred.JobClient:  map 0% reduce 0%
14/12/20 23:27:15 INFO mapred.JobClient:  map 100% reduce 0%
14/12/20 23:27:17 INFO mapred.JobClient: Job complete: job_201412202325_0002
14/12/20 23:27:18 INFO mapred.JobClient: Counters: 20
14/12/20 23:27:18 INFO mapred.JobClient:   Job Counters
14/12/20 23:27:18 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=14058
14/12/20 23:27:18 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/12/20 23:27:18 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/12/20 23:27:18 INFO mapred.JobClient:     Rack-local map tasks=1
14/12/20 23:27:18 INFO mapred.JobClient:     Launched map tasks=1
14/12/20 23:27:18 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/12/20 23:27:18 INFO mapred.JobClient:   File Output Format Counters
14/12/20 23:27:18 INFO mapred.JobClient:     Bytes Written=0
14/12/20 23:27:18 INFO mapred.JobClient:   injector
14/12/20 23:27:18 INFO mapred.JobClient:     urls_injected=3
14/12/20 23:27:18 INFO mapred.JobClient:   FileSystemCounters
14/12/20 23:27:18 INFO mapred.JobClient:     FILE_BYTES_READ=149
14/12/20 23:27:18 INFO mapred.JobClient:     HDFS_BYTES_READ=130
14/12/20 23:27:18 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=78488
14/12/20 23:27:18 INFO mapred.JobClient:   File Input Format Counters
14/12/20 23:27:18 INFO mapred.JobClient:     Bytes Read=149
14/12/20 23:27:18 INFO mapred.JobClient:   Map-Reduce Framework
14/12/20 23:27:18 INFO mapred.JobClient:     Map input records=6
14/12/20 23:27:18 INFO mapred.JobClient:     Physical memory (bytes) snapshot=106311680
14/12/20 23:27:18 INFO mapred.JobClient:     Spilled Records=0
14/12/20 23:27:18 INFO mapred.JobClient:     CPU time spent (ms)=2420
14/12/20 23:27:18 INFO mapred.JobClient:     Total committed heap usage (bytes)=29753344
14/12/20 23:27:18 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=736796672
14/12/20 23:27:18 INFO mapred.JobClient:     Map output records=3
14/12/20 23:27:18 INFO mapred.JobClient:     SPLIT_RAW_BYTES=130
14/12/20 23:27:18 INFO crawl.InjectorJob: InjectorJob: total number of urls rejected by filters: 0
14/12/20 23:27:18 INFO crawl.InjectorJob: InjectorJob: total number of urls injected after normalization and filtering: 3
14/12/20 23:27:18 INFO crawl.InjectorJob: Injector: finished at 2014-12-20 23:27:18, elapsed: 00:00:27
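Most of this log is environment noise; the lines that matter are the three InjectorJob summary lines at the end. As a rough sketch, the script below pulls the rejected count, the injected count, and the elapsed time out of such a log. The regular expressions are written against the wording of the log shown in this article and are matched case-insensitively; the function name and the sample text are illustrative only.

```python
import re

# Patterns based on the summary lines printed at the end of the inject log.
REJECTED_RE = re.compile(r"rejected by filters:\s*(\d+)", re.IGNORECASE)
INJECTED_RE = re.compile(r"injected after normalization and filtering:\s*(\d+)", re.IGNORECASE)
ELAPSED_RE = re.compile(r"elapsed:\s*(\d+):(\d+):(\d+)", re.IGNORECASE)


def summarize_inject_log(text):
    """Extract rejected/injected URL counts and elapsed seconds from an inject log."""
    rejected = REJECTED_RE.search(text)
    injected = INJECTED_RE.search(text)
    elapsed = ELAPSED_RE.search(text)
    seconds = None
    if elapsed:
        hours, minutes, secs = (int(g) for g in elapsed.groups())
        seconds = hours * 3600 + minutes * 60 + secs
    return {
        "rejected": int(rejected.group(1)) if rejected else None,
        "injected": int(injected.group(1)) if injected else None,
        "elapsed_seconds": seconds,
    }


SAMPLE_LOG = """\
14/12/20 23:27:18 INFO crawl.InjectorJob: InjectorJob: total number of urls rejected by filters: 0
14/12/20 23:27:18 INFO crawl.InjectorJob: InjectorJob: total number of urls injected after normalization and filtering: 3
14/12/20 23:27:18 INFO crawl.InjectorJob: Injector: finished at 2014-12-20 23:27:18, elapsed: 00:00:27
"""

if __name__ == "__main__":
    print(summarize_inject_log(SAMPLE_LOG))
```

For the run above this reports 0 rejected URLs, 3 injected URLs, and 27 seconds elapsed, compared with 14 seconds for the single-URL inject in local mode.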
III. Running Nutch from Eclipse
This approach is essentially the same as deploy mode.
Run InjectorJob from Eclipse
Eclipse console output:
InjectorJob: starting at 2014-12-20 23:13:24
InjectorJob: Injecting urlDir: /Users/liaoliuqing/99_project/2.x/urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2014-12-20 23:13:27, elapsed: 00:00:02
"Nutch Basic Tutorial Seven" Nutch 2 modes of operation: local and deploy