"Nutch Basic Tutorial Seven" Nutch 2 modes of operation: local and deploy

Last Update:2014-12-21 Source: Internet

Author: User

Tags solr


After you run ant runtime on the Nutch source code, a runtime directory is created that contains two subdirectories: deploy and local.

$ ls
deploy  local

These two directories correspond to the two modes of operation of Nutch: deploy mode and local mode.

The following uses the inject command as an example to demonstrate both modes.

I. Local mode

1. Basic usage:

$ bin/nutch inject
Usage: InjectorJob <url_dir> [-crawlId <id>]

Usage 1: no crawl ID specified

liaoliuqingdeMacBook-Air:local liaoliuqing$ bin/nutch inject urls
InjectorJob: starting at 2014-12-20 22:32:01
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2014-12-20 22:32:15, elapsed: 00:00:14

Usage 2: crawl ID specified

$ bin/nutch inject urls -crawlId 2
InjectorJob: starting at 2014-12-20 22:34:01
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2014-12-20 22:34:15, elapsed: 00:00:14

2. Data changes in the database

The inject command creates a table in HBase (if it does not already exist) named ${id}_webpage, where ${id} is the crawl ID. If no ID is specified, the table is named webpage.

The URLs listed in the files under the urls directory are then written to this table as the crawl seeds.
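The naming rule can be written down in a few lines. This is only an illustrative sketch of the rule as the article states it, not code taken from Nutch; the function name webpage_table_name is hypothetical.

```python
def webpage_table_name(crawl_id=None):
    """Return the HBase table name the article describes for a crawl ID."""
    # No ID given (None or empty string): the plain table name is used.
    if not crawl_id:
        return "webpage"
    # With an ID, the article says the table is named <id>_webpage.
    return f"{crawl_id}_webpage"


if __name__ == "__main__":
    print(webpage_table_name())     # no -crawlId argument
    print(webpage_table_name("2"))  # bin/nutch inject urls -crawlId 2
```

So the second inject run above, with crawl ID 2, would write to a table named 2_webpage, while the first run writes to webpage.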

hbase(main):003:0> scan 'webpage'
ROW                   COLUMN+CELL
 com.163.www:http/    column=f:fi, timestamp=1419085934952, value=\x00'\x8D\x00
 com.163.www:http/    column=f:ts, timestamp=1419085934952, value=\x00\x00\x01Jh\x1C\xBC7
 com.163.www:http/    column=mk:_injmrk_, timestamp=1419085934952, value=y
 com.163.www:http/    column=mk:dist, timestamp=1419085934952, value=0
 com.163.www:http/    column=mtdt:_csh_, timestamp=1419085934952, value=?\x80\x00\x00
 com.163.www:http/    column=s:s, timestamp=1419085934952, value=?\x80\x00\x00
1 row(s) in 0.6140 seconds
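The row key in the scan, com.163.www:http/, looks like the seed URL with its host name reversed. The article does not show the seed file, so the assumption here is that the seed was http://www.163.com/. The Python sketch below illustrates that key layout for simple URLs; it is not Nutch's own (Java) implementation, and the function name reverse_url is chosen only for this example.

```python
from urllib.parse import urlsplit


def reverse_url(url):
    """Build a row key of the form <reversed host>:<scheme>[:<port>]<path>."""
    parts = urlsplit(url)
    # www.163.com -> com.163.www
    host = ".".join(reversed(parts.hostname.split(".")))
    key = f"{host}:{parts.scheme}"
    # Only an explicitly given port becomes part of the key.
    if parts.port is not None:
        key += f":{parts.port}"
    # An empty path is treated as "/".
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return key + path


if __name__ == "__main__":
    print(reverse_url("http://www.163.com/"))
```

Keys built this way place pages from the same domain next to each other in the table, because they share a common prefix.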

When the inject command is run again, any new URLs are added to the same table.

3. Other commands

where COMMAND is one of:
 inject         inject new urls to the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

You can run these commands one at a time to step through a complete crawl.

When a crawl is run with the crawl command, it goes through the following basic steps:

(1) InjectorJob

Start the first iteration

(2) GeneratorJob

(3) FetcherJob

(4) ParserJob

(5) DbUpdaterJob

(6) SolrIndexerJob

Start the second iteration

(2) GeneratorJob

(3) FetcherJob

(4) ParserJob

(5) DbUpdaterJob

(6) SolrIndexerJob

Start the third iteration

For details on how each step is executed, see http://blog.csdn.net/jediael_lu/article/details/38591067
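The order of the steps listed above can be sketched as a small Python function: InjectorJob runs once, and steps (2) to (6) repeat in every iteration. This only models the sequence described in the article and does not start any Nutch job; the names ROUND_JOBS and crawl_plan are hypothetical.

```python
# Jobs that repeat in every iteration, in the order listed in the article.
ROUND_JOBS = [
    "GeneratorJob",
    "FetcherJob",
    "ParserJob",
    "DbUpdaterJob",
    "SolrIndexerJob",
]


def crawl_plan(rounds):
    """Return the ordered list of job names for a crawl with `rounds` iterations."""
    if rounds < 0:
        raise ValueError("rounds must not be negative")
    # Seeds are injected once, before the first iteration.
    plan = ["InjectorJob"]
    for _ in range(rounds):
        plan.extend(ROUND_JOBS)
    return plan


if __name__ == "__main__":
    for step, job in enumerate(crawl_plan(2), start=1):
        print(step, job)
```

For two iterations this yields eleven steps: one injection followed by two passes through the five per-iteration jobs.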

4. Nutch also provides a crawl script that wraps these key steps, so you do not have to run each step of the crawl by hand.

$ bin/crawl
Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>

For example, from the bin directory:

# ./crawl seed.txt testcrawl http://localhost:8983/solr 2
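If you want to start the crawl script from another program, the four positional arguments in the usage message can be assembled as in the sketch below. It only builds the argument list and does not run anything; the helper name crawl_command and the check on the number of rounds are assumptions made for this example, not part of Nutch.

```python
def crawl_command(seed_dir, crawl_id, solr_url, rounds, script="bin/crawl"):
    """Build the argument list: crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>."""
    rounds = int(rounds)
    # Assumption for this sketch: a crawl with fewer than one round makes no sense.
    if rounds < 1:
        raise ValueError("numberOfRounds must be at least 1")
    return [script, seed_dir, crawl_id, solr_url, str(rounds)]


if __name__ == "__main__":
    cmd = crawl_command("seed.txt", "testcrawl", "http://localhost:8983/solr", 2)
    print(" ".join(cmd))
    # To actually run it from the Nutch local runtime directory, something like
    # subprocess.run(cmd, check=True) could be used.
```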

II. Deploy mode

1. Run with the hadoop command

Note: Hadoop and HBase must be started first.

$ hadoop jar apache-nutch-2.2.1.job org.apache.nutch.crawl.InjectorJob file:///opt/jediael/apache-nutch-2.2.1/runtime/deploy/urls/
14/12/20 23:26:50 INFO crawl.InjectorJob: InjectorJob: starting at 2014-12-20 23:26:50
14/12/20 23:26:50 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: file:/opt/jediael/apache-nutch-2.2.1/runtime/deploy/urls
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.3.2-1031432, built on 11/05/2010 05:32 GMT
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:host.name=jediael
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.version=1.7.0_51
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.7.0_51/jre
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/opt/jediael/hadoop-1.2.1/libexec/../conf:/usr/java/jdk1.7.0_51/lib/tools.jar:/opt/jediael/hadoop-1.2.1/libexec/..:/opt/jediael/hadoop-1.2.1/libexec/../hadoop-core-1.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/asm-3.2.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/aspectjrt-1.6.11.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/aspectjtools-1.6.11.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-beanutils-1.7.0.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-beanutils-core-1.8.0.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-cli-1.2.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-codec-1.4.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-collections-3.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-configuration-1.6.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-daemon-1.0.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-digester-1.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-el-1.0.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-httpclient-3.0.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-io-2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-lang-2.4.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-logging-1.1.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-logging-api-1.0.4.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-math-2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-net-3.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/core-3.1.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/hadoop-capacity-scheduler-1.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/hadoop-fairscheduler-1.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/hadoop-thriftfs-1.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/hsqldb-1.8.0.10.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jackson-core-asl-1.8.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jasper-compiler-5.5.12.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jasper-runtime-5.5.12.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jdeb-0.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jersey-core-1.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jersey-json-1.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jersey-server-1.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jets3t-0.6.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jetty-6.1.26.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jetty-util-6.1.26.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jsch-0.1.42.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/junit-4.5.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/kfs-0.2.2.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/log4j-1.2.15.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/mockito-all-1.8.5.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/oro-2.0.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/servlet-api-2.5-20081211.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/slf4j-api-1.4.3.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/slf4j-log4j12-1.4.3.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/xmlenc-0.52.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jsp-2.1/jsp-2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jsp-2.1/jsp-api-2.1.jar
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/opt/jediael/hadoop-1.2.1/libexec/../lib/native/Linux-amd64-64
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:os.version=2.6.32-431.17.1.el6.x86_64
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:user.name=jediael
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:user.home=/home/jediael
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:user.dir=/opt/jediael/apache-nutch-2.2.1/runtime/deploy
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
14/12/20 23:26:52 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181
14/12/20 23:26:52 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
14/12/20 23:26:52 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x14a5c24c9cf0657, negotiated timeout = 40000
14/12/20 23:26:52 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
14/12/20 23:26:55 INFO input.FileInputFormat: Total input paths to process : 1
14/12/20 23:26:55 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/12/20 23:26:55 WARN snappy.LoadSnappy: Snappy native library not loaded
14/12/20 23:26:56 INFO mapred.JobClient: Running job: job_201412202325_0002
14/12/20 23:26:57 INFO mapred.JobClient:  map 0% reduce 0%
14/12/20 23:27:15 INFO mapred.JobClient:  map 100% reduce 0%
14/12/20 23:27:17 INFO mapred.JobClient: Job complete: job_201412202325_0002
14/12/20 23:27:18 INFO mapred.JobClient: Counters: 20
14/12/20 23:27:18 INFO mapred.JobClient:   Job Counters
14/12/20 23:27:18 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=14058
14/12/20 23:27:18 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/12/20 23:27:18 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/12/20 23:27:18 INFO mapred.JobClient:     Rack-local map tasks=1
14/12/20 23:27:18 INFO mapred.JobClient:     Launched map tasks=1
14/12/20 23:27:18 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/12/20 23:27:18 INFO mapred.JobClient:   File Output Format Counters
14/12/20 23:27:18 INFO mapred.JobClient:     Bytes Written=0
14/12/20 23:27:18 INFO mapred.JobClient:   injector
14/12/20 23:27:18 INFO mapred.JobClient:     urls_injected=3
14/12/20 23:27:18 INFO mapred.JobClient:   FileSystemCounters
14/12/20 23:27:18 INFO mapred.JobClient:     FILE_BYTES_READ=149
14/12/20 23:27:18 INFO mapred.JobClient:     HDFS_BYTES_READ=130
14/12/20 23:27:18 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=78488
14/12/20 23:27:18 INFO mapred.JobClient:   File Input Format Counters
14/12/20 23:27:18 INFO mapred.JobClient:     Bytes Read=149
14/12/20 23:27:18 INFO mapred.JobClient:   Map-Reduce Framework
14/12/20 23:27:18 INFO mapred.JobClient:     Map input records=6
14/12/20 23:27:18 INFO mapred.JobClient:     Physical memory (bytes) snapshot=106311680
14/12/20 23:27:18 INFO mapred.JobClient:     Spilled Records=0
14/12/20 23:27:18 INFO mapred.JobClient:     CPU time spent (ms)=2420
14/12/20 23:27:18 INFO mapred.JobClient:     Total committed heap usage (bytes)=29753344
14/12/20 23:27:18 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=736796672
14/12/20 23:27:18 INFO mapred.JobClient:     Map output records=3
14/12/20 23:27:18 INFO mapred.JobClient:     SPLIT_RAW_BYTES=130
14/12/20 23:27:18 INFO crawl.InjectorJob: InjectorJob: total number of urls rejected by filters: 0
14/12/20 23:27:18 INFO crawl.InjectorJob: InjectorJob: total number of urls injected after normalization and filtering: 3
14/12/20 23:27:18 INFO crawl.InjectorJob: Injector: finished at 2014-12-20 23:27:18, elapsed: 00:00:27
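In deploy mode the interesting numbers are buried in the MapReduce counters near the end of the job log. As a rough illustration, the Python sketch below pulls name=value counters out of JobClient lines. The sample lines are a short excerpt reconstructed from the log above, the log layout is assumed to be one counter per line, and the function name parse_counters is hypothetical.

```python
import re

# Short excerpt of JobClient counter lines from the inject job log.
SAMPLE_LOG = """\
14/12/20 23:27:18 INFO mapred.JobClient:   injector
14/12/20 23:27:18 INFO mapred.JobClient:     urls_injected=3
14/12/20 23:27:18 INFO mapred.JobClient:   Map-Reduce Framework
14/12/20 23:27:18 INFO mapred.JobClient:     Map input records=6
14/12/20 23:27:18 INFO mapred.JobClient:     Map output records=3
"""

# A counter line ends with "<name>=<integer>"; group headers have no "=".
COUNTER_LINE = re.compile(r"mapred\.JobClient:\s+(.+?)=(\d+)\s*$")


def parse_counters(log_text):
    """Return a dict mapping counter names to integer values."""
    counters = {}
    for line in log_text.splitlines():
        match = COUNTER_LINE.search(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


if __name__ == "__main__":
    for name, value in parse_counters(SAMPLE_LOG).items():
        print(name, value)
```

For this run, the urls_injected counter matches the count in the final InjectorJob summary line, which reports 3 URLs injected after normalization and filtering.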

III. Running Nutch in Eclipse

This approach is essentially the same as deploy mode.

Run InjectorJob in Eclipse.

Eclipse output:

InjectorJob: starting at 2014-12-20 23:13:24
InjectorJob: Injecting urlDir: /Users/liaoliuqing/99_project/2.x/urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2014-12-20 23:13:27, elapsed: 00:00:02