"Nutch Basic Tutorial Seven" Nutch 2 modes of operation: local and deploy

Source: Internet
Author: User
Tags solr

After running ant runtime on the Nutch source code, a runtime directory is created containing two subdirectories: deploy and local.

[[email protected] runtime]$ ls
deploy  local

These two directories correspond to Nutch's two modes of operation: deploy (deployment) mode and local mode.



The following demonstrates both modes using the inject command as an example.

I. Local mode

1. Basic usage:

$ bin/nutch inject
Usage: InjectorJob <url_dir> [-crawlId <id>]

Usage one: without an ID

liaoliuqingdeMacBook-Air:local liaoliuqing$ bin/nutch inject urls
InjectorJob: starting at 2014-12-20 22:32:01
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2014-12-20 22:32:15, elapsed: 00:00:14

Usage two: with an ID

$ bin/nutch inject urls -crawlId 2
InjectorJob: starting at 2014-12-20 22:34:01
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2014-12-20 22:34:15, elapsed: 00:00:14

2. Data changes in the database

The above command creates a new table in the HBase database named ${id}_webpage; if no ID is specified, the table is simply named webpage.

The contents of the files in the urls directory are then written into the table as crawl seeds.
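For reference, a seed directory is just plain-text files with one URL per line. A minimal sketch of creating one, assuming the directory name urls used in the inject commands above; the file name seed.txt and the URL itself are illustrative (the URL matches the www.163.com row in the scan output further down):

```shell
# Create a seed directory with one URL per line.
# "urls" matches the inject examples; seed.txt and the URL are illustrative.
mkdir -p urls
echo "http://www.163.com/" > urls/seed.txt
cat urls/seed.txt
```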

hbase(main):003:0> scan 'webpage'
ROW                  COLUMN+CELL
 com.163.www:http/   column=f:fi, timestamp=1419085934952, value=\x00'\x8d\x00
 com.163.www:http/   column=f:ts, timestamp=1419085934952, value=\x00\x00\x01Jh\x1c\xbc7
 com.163.www:http/   column=mk:_injmrk_, timestamp=1419085934952, value=y
 com.163.www:http/   column=mk:dist, timestamp=1419085934952, value=0
 com.163.www:http/   column=mtdt:_csh_, timestamp=1419085934952, value=?\x80\x00\x00
 com.163.www:http/   column=s:s, timestamp=1419085934952, value=?\x80\x00\x00
1 row(s) in 0.6140 seconds

When the inject command is executed again, any new URLs are added to the table.


3. Other commands

Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

You can run each step of a complete crawl individually, stepping through the whole process yourself.

When the crawl command runs a crawl task, its basic steps are as follows:

(1) InjectorJob

Start the first iteration:

(2) GeneratorJob

(3) FetcherJob

(4) ParserJob

(5) DbUpdaterJob

(6) SolrIndexerJob

Start the second iteration:

(2) GeneratorJob

(3) FetcherJob

(4) ParserJob

(5) DbUpdaterJob

(6) SolrIndexerJob

Start the third iteration, and so on.

For each step of the execution, see http://blog.csdn.net/jediael_lu/article/details/38591067
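The per-iteration sequence above can be sketched as a shell dry run. The commands are echoed rather than executed, so the step order is visible without a Nutch install; the exact flags (-crawlId, -batchId, -topN, -threads) are assumptions modeled on the Nutch 2.x crawl script, not verified against any particular release:

```shell
#!/bin/sh
# Dry-run sketch of one crawl iteration: each bin/nutch command is echoed
# instead of executed. Flag names are assumptions based on the 2.x crawl
# script; check bin/nutch COMMAND output for your version before running.
CRAWL_ID=testcrawl
SOLR_URL=http://localhost:8983/solr
BATCH_ID=$(date +%s)

run() { echo "bin/nutch $*"; }   # swap echo for the real binary to execute

run inject urls -crawlId "$CRAWL_ID"                            # (1) InjectorJob (first iteration only)
run generate -topN 50 -crawlId "$CRAWL_ID" -batchId "$BATCH_ID" # (2) GeneratorJob
run fetch "$BATCH_ID" -crawlId "$CRAWL_ID" -threads 10          # (3) FetcherJob
run parse "$BATCH_ID" -crawlId "$CRAWL_ID"                      # (4) ParserJob
run updatedb -crawlId "$CRAWL_ID"                               # (5) DbUpdaterJob
run solrindex "$SOLR_URL" -all -crawlId "$CRAWL_ID"             # (6) SolrIndexerJob
```

Wrapping steps (2) through (6) in a loop over batch IDs gives the same shape as the bundled crawl script described next.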


4. Nutch also provides a crawl script that encapsulates these key steps, so you do not need to run each stage of the crawl by hand.

[email protected] local]$ bin/crawl
Missing seedDir: crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>

For example:

[email protected] bin]# ./crawl seed.txt testcrawl http://localhost:8983/solr 2

II. Deployment mode

1. Run with hadoop command


Note: Hadoop and HBase must be started first.
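Note that the run below reads its seed directory through a local file:/// URI; on a real cluster the seeds would normally be uploaded to HDFS first. A dry-run sketch of that preparation (commands echoed, not executed; the HDFS paths are illustrative, not taken from the original run):

```shell
# Dry-run sketch: upload the seed directory to HDFS, then pass the HDFS
# path to InjectorJob. Commands are echoed only; paths are illustrative.
run() { echo "$*"; }

run hadoop fs -mkdir -p /user/jediael/urls
run hadoop fs -put urls/seed.txt /user/jediael/urls/
run hadoop jar apache-nutch-2.2.1.job org.apache.nutch.crawl.InjectorJob /user/jediael/urls
```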

[[email protected] deploy]$ hadoop jar apache-nutch-2.2.1.job org.apache.nutch.crawl.InjectorJob file:///opt/jediael/apache-nutch-2.2.1/runtime/deploy/urls/
14/12/20 23:26:50 INFO crawl.InjectorJob: InjectorJob: starting at 2014-12-20 23:26:50
14/12/20 23:26:50 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: file:/opt/jediael/apache-nutch-2.2.1/runtime/deploy/urls
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.3.2-1031432, built on 11/05/2010 05:32 GMT
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:host.name=jediael
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.version=1.7.0_51
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.7.0_51/jre
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/opt/jediael/hadoop-1.2.1/libexec/../conf:/usr/java/jdk1.7.0_51/lib/tools.jar:/opt/jediael/hadoop-1.2.1/libexec/../hadoop-core-1.2.1.jar:... (remaining Hadoop lib jars elided)
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/opt/jediael/hadoop-1.2.1/libexec/../lib/native/Linux-amd64-64
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:os.version=2.6.32-431.17.1.el6.x86_64
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:user.name=jediael
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:user.home=/home/jediael
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:user.dir=/opt/jediael/apache-nutch-2.2.1/runtime/deploy
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
14/12/20 23:26:52 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181
14/12/20 23:26:52 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
14/12/20 23:26:52 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x14a5c24c9cf0657, negotiated timeout = 40000
14/12/20 23:26:52 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
14/12/20 23:26:55 INFO input.FileInputFormat: Total input paths to process : 1
14/12/20 23:26:55 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/12/20 23:26:55 WARN snappy.LoadSnappy: Snappy native library not loaded
14/12/20 23:26:56 INFO mapred.JobClient: Running job: job_201412202325_0002
14/12/20 23:26:57 INFO mapred.JobClient:  map 0% reduce 0%
14/12/20 23:27:15 INFO mapred.JobClient:  map 100% reduce 0%
14/12/20 23:27:17 INFO mapred.JobClient: Job complete: job_201412202325_0002
14/12/20 23:27:18 INFO mapred.JobClient: Counters: 20
14/12/20 23:27:18 INFO mapred.JobClient:   Job Counters
14/12/20 23:27:18 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=14058
14/12/20 23:27:18 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/12/20 23:27:18 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/12/20 23:27:18 INFO mapred.JobClient:     Rack-local map tasks=1
14/12/20 23:27:18 INFO mapred.JobClient:     Launched map tasks=1
14/12/20 23:27:18 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/12/20 23:27:18 INFO mapred.JobClient:   File Output Format Counters
14/12/20 23:27:18 INFO mapred.JobClient:     Bytes Written=0
14/12/20 23:27:18 INFO mapred.JobClient:   injector
14/12/20 23:27:18 INFO mapred.JobClient:     urls_injected=3
14/12/20 23:27:18 INFO mapred.JobClient:   FileSystemCounters
14/12/20 23:27:18 INFO mapred.JobClient:     FILE_BYTES_READ=149
14/12/20 23:27:18 INFO mapred.JobClient:     HDFS_BYTES_READ=130
14/12/20 23:27:18 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=78488
14/12/20 23:27:18 INFO mapred.JobClient:   File Input Format Counters
14/12/20 23:27:18 INFO mapred.JobClient:     Bytes Read=149
14/12/20 23:27:18 INFO mapred.JobClient:   Map-Reduce Framework
14/12/20 23:27:18 INFO mapred.JobClient:     Map input records=6
14/12/20 23:27:18 INFO mapred.JobClient:     Physical memory (bytes) snapshot=106311680
14/12/20 23:27:18 INFO mapred.JobClient:     Spilled Records=0
14/12/20 23:27:18 INFO mapred.JobClient:     CPU time spent (ms)=2420
14/12/20 23:27:18 INFO mapred.JobClient:     Total committed heap usage (bytes)=29753344
14/12/20 23:27:18 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=736796672
14/12/20 23:27:18 INFO mapred.JobClient:     Map output records=3
14/12/20 23:27:18 INFO mapred.JobClient:     SPLIT_RAW_BYTES=130
14/12/20 23:27:18 INFO crawl.InjectorJob: InjectorJob: total number of urls rejected by filters: 0
14/12/20 23:27:18 INFO crawl.InjectorJob: InjectorJob: total number of urls injected after normalization and filtering: 3
14/12/20 23:27:18 INFO crawl.InjectorJob: Injector: finished at 2014-12-20 23:27:18, elapsed: 00:00:27


III. How to run Nutch with Eclipse

This method is essentially the same as the deployment mode.


Run InjectorJob with Eclipse



Eclipse Output Content:

InjectorJob: starting at 2014-12-20 23:13:24
InjectorJob: Injecting urlDir: /users/liaoliuqing/99_project/2.x/urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2014-12-20 23:13:27, elapsed: 00:00:02



"Nutch Basic Tutorial Seven" Nutch 2 modes of operation: local and deploy
