List of commands that can be executed in Nutch
```
# nutch
Usage: nutch COMMAND
where COMMAND is one of:
  inject         inject new URLs into the database
  hostinject     creates or updates an existing host table from a text file
  generate       generate new batches to fetch from crawl db
  fetch          fetch URLs marked during generate
  parse          parse URLs marked during fetch
  updatedb       update web table after parsing
  updatehostdb   update host table after parsing
  readdb         read/dump records from page database
  readhostdb     display entries from the hostdb
  index          run the plugin-based indexer on parsed batches
  elasticindex   run the Elasticsearch indexer - DEPRECATED use the index command instead
  solrindex      run the Solr indexer on parsed batches - DEPRECATED use the index command instead
  solrdedup      remove duplicates from Solr
  solrclean      remove HTTP 301 and 404 documents from Solr - DEPRECATED use the clean command instead
  clean          remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
  parsechecker   check the parser for a given URL
  indexchecker   check the indexing filters for a given URL
  plugin         load a plugin and run one of its classes main()
  nutchserver    run a (local) Nutch server on a user defined port
  webapp         run a local Nutch Web Application
  junit          runs the given JUnit test
 or
  CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
```
Crawl
Usage: crawl <seedDir> <crawlID> [<solrURL>] <numberOfRounds>
Parameter description:
<seedDir>: an existing directory containing one or more text files that list the seed URLs, one URL per line.
<crawlID>: identifier of the crawl; used to prefix the storage tables/schemas it operates on.
[<solrURL>]: optional URL of the Solr server to index into.
<numberOfRounds>: number of crawl rounds (generate/fetch/parse/updatedb iterations) to run.
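A typical invocation might look like the following; the seed directory, crawl ID, and Solr address are illustrative values, not taken from the original text:

```shell
# Create a seed directory with one URL per line (illustrative values).
mkdir -p urls
echo "http://nutch.apache.org/" > urls/seed.txt

# Run 3 rounds of generate/fetch/parse/updatedb, indexing into a local Solr.
bin/crawl urls mycrawl http://localhost:8983/solr 3
```

The crawl script simply chains the individual commands described below, so each step can also be run by hand.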
Nutch inject
Usage: InjectorJob <url_dir> [-crawlId <id>]
Parameter description:
<url_dir>: an existing directory containing one or more text files that list the URLs to inject, one URL per line.
[-crawlId <id>]: the crawl ID to inject into (default: storage.crawl.id).
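For example, to seed a crawl manually (directory name and crawl ID are illustrative):

```shell
# Inject the seed URLs into the storage backend under crawl ID "mycrawl".
bin/nutch inject urls -crawlId mycrawl
```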
Nutch Generate
Usage: GeneratorJob [-topN N] [-crawlId <id>] [-noFilter] [-noNorm] [-addDays numDays]
Parameter description:
[-topN N]: select at most the top N links to fetch; the default is Long.MAX_VALUE.
[-crawlId <id>]: the crawl ID to generate batches for (default: storage.crawl.id).
[-noFilter]: do not activate the filter plugin to filter URLs; filtering is enabled by default.
[-noNorm]: do not activate the normalizer plugin to normalize URLs; normalization is enabled by default.
[-addDays numDays]: add <numDays> to the current time when selecting URLs, so that pages whose next fetch (per db.fetch.interval.default) is not yet due can be generated early; the default is 0, i.e. only URLs whose scheduled fetch time is before the current time are selected.
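For instance, to generate one batch limited to the highest-scoring links (the topN value and crawl ID are illustrative):

```shell
# Select at most 1000 top-scoring URLs for the next fetch batch,
# skipping URL filtering for speed.
bin/nutch generate -topN 1000 -crawlId mycrawl -noFilter
```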
Nutch Fetch
Usage: FetcherJob (<batchId> | -all) [-crawlId <id>] [-threads N] [-resume] [-numTasks N]
Parameter description:
<batchId>: the batch to fetch, as returned by generate, or -all to fetch all generated batches.
[-crawlId <id>]: the crawl ID to fetch (default: storage.crawl.id).
[-threads N]: number of fetcher threads; the default is 10, configuration key fetcher.threads.fetch.
[-resume]: resume a previously interrupted fetch.
[-numTasks N]: if N > 0, use N reduce tasks for fetching (default: mapred.map.tasks).
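A fetch of every generated batch might look like this (thread count and crawl ID are illustrative):

```shell
# Fetch all generated batches with 20 fetcher threads.
bin/nutch fetch -all -crawlId mycrawl -threads 20
```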
Nutch Parse
Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
Parameter description:
<batchId>: the batch to parse, as returned by generate, or -all to parse all fetched batches.
[-crawlId <id>]: the crawl ID to parse (default: storage.crawl.id).
[-resume]: resume a previously interrupted parse.
[-force]: force pages to be re-parsed, even if they have already been parsed.
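For example, to re-run parsing over everything that has been fetched (crawl ID is illustrative):

```shell
# Parse all fetched batches, re-parsing pages even if already parsed.
bin/nutch parse -all -crawlId mycrawl -force
```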
Nutch UpdateDB
Usage: DbUpdaterJob (<batchId> | -all) [-crawlId <id>]
Parameter description:
<batchId>: crawl identifier returned by generate, or -all for all generated batch IDs.
[-crawlId <id>]: the ID to prefix the schemas to operate on (default: storage.crawl.id).
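Closing a crawl round, the web table can be updated from all parsed batches (crawl ID is illustrative):

```shell
# Update the web table with the outlinks and scores from all parsed batches.
bin/nutch updatedb -all -crawlId mycrawl
```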
Nutch Index
Usage: IndexingJob (<batchId> | -all | -reindex) [-crawlId <id>]
Parameter description:
<batchId>: the batch to index, as returned by generate; -all indexes all parsed batches, and -reindex re-indexes every page regardless of batch.
[-crawlId <id>]: the crawl ID to index (default: storage.crawl.id).
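Finally, parsed content can be pushed to whatever indexing backends are configured via plugins, e.g. Solr (crawl ID is illustrative):

```shell
# Index all parsed batches through the plugin-based indexer.
bin/nutch index -all -crawlId mycrawl
```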