Detailed instructions for the Nutch commands


 

Nutch is driven entirely from the command line. Its commands range from the single command used for intranet crawling to the step-by-step commands used to crawl the whole web. The main commands are as follows:

1. Crawl
Crawl is an alias for org.apache.nutch.crawl.Crawl. It runs a complete crawl-and-index cycle in one command.
Usage:
Shell code
$ bin/nutch crawl <urldir> [-dir <d>] [-threads <n>] [-depth <i>] [-topN <num>]

Parameter description:
<urldir>: an existing directory containing text file(s) that list the seed URLs.
[-dir <d>]: the working directory in which the crawl data is stored. The default is ./crawl-[date], where [date] is the current date.
[-threads <n>]: the number of fetcher threads; overrides fetcher.threads.fetch from the default configuration (10 by default).
[-depth <i>]: the crawl iteration depth. The default is 5.
[-topN <num>]: limits each iteration to the top <num> URLs. The default is Integer.MAX_VALUE.
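Instance (a hypothetical run; the directory names urls and crawl are placeholders, not fixed names):
Shell code

$ # urls/ holds the seed-URL file(s); crawl/ is the output directory (both placeholders)
$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50 -threads 4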

Configuration file:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
crawl-tool.xml

Other files:
crawl-urlfilter.txt

2. readdb
The readdb command is an alias for org.apache.nutch.crawl.CrawlDbReader. It prints or exports the information held in the crawl database (crawldb).
Usage:
Shell code

$ bin/nutch readdb <crawldb> (-stats | -dump <out_dir> | -url <url>)

Parameter description:
<crawldb>: crawldb directory.
[-stats]: print overall statistics to the console.
[-dump <out_dir>]: dump the crawldb contents into the specified output directory.
[-url <url>]: print the information stored for the specified URL.
Instance:
Shell code

$ bin/nutch readdb fullindex/crawldb -stats

CrawlDb statistics start: fullindex/crawldb
Statistics for CrawlDb: fullindex/crawldb
TOTAL urls: 468030
retry 0: 467361
retry 1: 622
retry 2: 32
retry 3: 15
min score: 0.0
avg score: 0.0034686408
max score: 61.401
status 1 (db_unfetched): 312748
status 2 (db_fetched): 80671
status 3 (db_gone): 69927
status 4 (db_redir_temp): 1497
status 5 (db_redir_perm): 3187
CrawlDb statistics: done
Configuration file:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml

Note:
The -stats option is a useful way to get a quick overview of a crawl. Its output fields mean:
db_unfetched: the number of pages whose links have been discovered but which have not yet been fetched (for example because they were rejected by the URL filters, or fell outside the topN cut-off).
db_gone: pages that returned a 404 or some other permanent error; this status prevents them from being fetched again.
db_fetched: pages that have been fetched and can be indexed. If this value is 0, something has gone wrong.

3. readlinkdb
It is an alias for org.apache.nutch.crawl.LinkDbReader. It exports information from the link database (linkdb), or prints the information stored for a single URL.
Usage:
Shell code

$ bin/nutch readlinkdb <linkdb> (-dump <out_dir> | -url <url>)

Parameter description:
<linkdb>: linkdb directory.
[-dump <out_dir>]: dump the linkdb contents into the specified output directory.
[-url <url>]: print the inlink information stored for the given URL.
Instance:
Shell code

$ bin/nutch readlinkdb fullindex/linkdb -url www.ccnu.edu.cn
- no link information

Configuration file:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml

4. Inject
It is an alias for org.apache.nutch.crawl.Injector. It injects new URLs into the crawl database.
Usage:
Shell code

$ bin/nutch inject <crawldb> <urldir>

Parameter description:
<crawldb>: crawldb directory.
<urldir>: directory containing text file(s) that list the URLs to inject.
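Instance (a hypothetical run; crawl/crawldb and urls are placeholder paths):
Shell code

$ # crawl/crawldb and urls/ are placeholder paths
$ bin/nutch inject crawl/crawldb urls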
Configuration file:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml

The following configuration parameters affect how URLs are injected:
db.default.fetch.interval -- the default fetch interval, in days. The default value is 30.
db.score.injected -- the default score assigned to injected URLs. The default value is 1.0f.
urlnormalizer.class -- the class used to normalize URLs. The default is org.apache.nutch.net.BasicUrlNormalizer.

5. Generate
It is an alias for org.apache.nutch.crawl.Generator. It generates a new fetch list (segment) from the crawldb.
Usage:
Shell code

$ bin/nutch generate <crawldb> <segments_dir> [-topN <num>] [-numFetchers <fetchers>] [-adddays <days>]

Parameter description:
<crawldb>: crawldb directory.
<segments_dir>: directory in which the newly generated segment is created.
[-topN <num>]: select the top-scoring <num> links. The default is Long.MAX_VALUE.
[-numFetchers <fetchers>]: the number of fetch partitions. Default: the value of mapred.map.tasks, or 1.
[-adddays <days>]: adds <days> to the current time when selecting URLs, so that URLs whose db.default.fetch.interval has not yet expired can be fetched sooner. The default is 0, i.e. only URLs whose fetch time lies before the current time are selected.
Example:
Shell code

$ bin/nutch generate /my/crawldb /my/segments -topN 100 -adddays 20

Configuration file:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
Note:
generate.max.per.host -- sets the maximum number of URLs generated per host. The default is unlimited.

6. Fetch
It is an alias for org.apache.nutch.fetcher.Fetcher and is responsible for fetching the pages in a segment.
Usage:
Shell code

$ bin/nutch fetch <segment> [-threads <n>] [-noParsing]

Parameter description:
<segment>: segment directory.
[-threads <n>]: the number of fetcher threads. Default: the value of fetcher.threads.fetch, or 10.
[-noParsing]: do not automatically parse the fetched segment data.
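Instance (a hypothetical run; the segment path is a placeholder and should be the segment directory created by a previous generate step):
Shell code

$ # the segment name 20240101000000 is a placeholder produced by generate
$ bin/nutch fetch crawl/segments/20240101000000 -threads 10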
Configuration file:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
Note:
The fetcher relies on plug-ins to fetch different protocols. The supported protocols and their plug-ins are:
http:
protocol-http
protocol-httpclient
https:
protocol-httpclient
ftp:
protocol-ftp
file:
protocol-file
When crawling web documents, do not use protocol-file, which is intended for crawling local files. To fetch both http and https, use protocol-httpclient.
7. parse
It is an alias for org.apache.nutch.parse.ParseSegment. It runs ParseSegment over a given segment.
Usage:
Shell code

$ bin/nutch parse <segment>

Parameter description:
<segment>: segment directory
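Instance (a hypothetical run on a segment that was fetched with -noParsing; the path is a placeholder):
Shell code

$ # parse the content of a previously fetched, unparsed segment (placeholder path)
$ bin/nutch parse crawl/segments/20240101000000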
Configuration file:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
Note:
It relies on plug-ins to parse content in different formats. The supported formats and plug-ins are (content type: plug-in -- remarks):
text/html: parse-html -- parses HTML using NekoHTML or TagSoup
application/x-javascript: parse-js -- parses JavaScript documents (.js)
audio/mpeg: parse-mp3 -- parses MP3 audio documents (.mp3)
application/vnd.ms-excel: parse-msexcel -- parses MS Excel documents (.xls)
application/vnd.ms-powerpoint: parse-mspowerpoint -- parses MS PowerPoint documents
application/msword: parse-msword -- parses MS Word documents
application/rss+xml: parse-rss -- parses RSS documents (.rss)
application/rtf: parse-rtf -- parses RTF documents (.rtf)
application/pdf: parse-pdf -- parses PDF documents
application/x-shockwave-flash: parse-swf -- parses Flash documents (.swf)
text/plain: parse-text -- parses plain text files (.txt)
application/zip: parse-zip -- parses ZIP archives (.zip)
other types: parse-ext -- parses documents through an external command chosen by content type or path suffix
By default only the plug-ins for the txt, HTML and JS formats are enabled; the others must be enabled in nutch-site.xml (via the plugin.includes property) before they can be used.
8. segread
segread is an alias for org.apache.nutch.segment.SegmentReader. It reads and dumps segment data.
Usage:
Shell code

$ bin/nutch segread <segment>

Parameter description:
<segment>: segment directory
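Instance (a hypothetical run using the older segread form documented above; the path is a placeholder. On versions after 0.9 use readseg instead, see the note below):
Shell code

$ # dump the data of one segment (placeholder path)
$ bin/nutch segread crawl/segments/20240101000000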
Configuration file:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
Note:
In versions after Nutch 0.9 this command has been renamed readseg.
9. updatedb
It is an alias for org.apache.nutch.crawl.CrawlDb. It updates the crawldb with the information obtained during the fetch.
Usage:
Shell code

$ bin/nutch updatedb <crawldb> <segment> [-noAdditions]

Parameter description:
<crawldb>: crawldb directory.
<segment>: the fetched segment directory.
[-noAdditions]: do not add newly discovered links to the crawldb.
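Instance (a hypothetical run; both paths are placeholders):
Shell code

$ # merge the fetch results of one segment back into the crawldb (placeholder paths)
$ bin/nutch updatedb crawl/crawldb crawl/segments/20240101000000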
Configuration file:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml

10. invertlinks
It is an alias for org.apache.nutch.crawl.LinkDb. It updates the linkdb with the link information extracted from the segments.
Usage:
Shell code

$ bin/nutch invertlinks <linkdb> (-dir <segments_dir> | segment1 segment2 ...)

Parameter description:
<linkdb>: linkdb directory.
<segment>: segment directory; at least one must be specified, or use -dir to give a parent directory containing segments.
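Instance (a hypothetical run; crawl/linkdb and crawl/segments are placeholder paths):
Shell code

$ # build or update the linkdb from every segment under crawl/segments (placeholder paths)
$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments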

Configuration file:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
11. Index
It is an alias for org.apache.nutch.indexer.Indexer. It builds an index over the given segments, using the data in the crawldb and linkdb to score the indexed pages.
Usage:
Shell code

$ bin/nutch index <index> <crawldb> <linkdb> <segment> ...

Parameter description:
<index>: directory in which the new index is stored.
<crawldb>: crawldb directory.
<linkdb>: linkdb directory.
<segment>: segment directory; more than one may be specified.
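Instance (a hypothetical run; all paths are placeholders):
Shell code

$ # index one segment, scoring pages with crawldb and linkdb data (placeholder paths)
$ bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/20240101000000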
Configuration file:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml

12. Merge
Merge is an alias for org.apache.nutch.indexer.IndexMerger, which merges several segment indexes into one.
Usage:

$ bin/nutch merge [-workingdir <workingdir>] <outputindex> <indexesdir> ...

Parameter description:
[-workingdir <workingdir>]: specifies a working directory.
<outputindex>: directory in which the merged index is stored.
<indexesdir>: directory containing the indexes to merge; more than one may be specified.
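Instance (a hypothetical run; crawl/index and crawl/indexes are placeholder paths):
Shell code

$ # merge the part indexes under crawl/indexes into a single index crawl/index (placeholder paths)
$ bin/nutch merge crawl/index crawl/indexes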

Configuration file:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
13. mergedb
It is an alias for org.apache.nutch.crawl.CrawlDbMerger. It merges several crawldbs, and can optionally apply the URL filters to remove unwanted content.
You can merge several databases into one, which is useful when crawls have been run separately and their databases should be combined. Optionally, the current URLFilters can be applied to the URLs in the databases to remove the ones you do not want. This is also useful when only one crawldb is given, because it lets you filter unwanted URLs out of that database.
The tool can therefore be used purely for filtering; in that case only one crawldb is specified.
If the same URL appears in several crawldbs, only the most recent version is kept, as determined by the value of org.apache.nutch.crawl.CrawlDatum.getFetchTime(). The metadata of all versions is aggregated, however, with newer values replacing older ones.
Usage:

$ bin/nutch mergedb output_crawldb crawldb1 [crawldb2 crawldb3 ...] [-filter]

Parameter description:
output_crawldb: output crawldb directory.
crawldb1 [crawldb2 crawldb3 ...]: one or more input crawldb(s).
-filter: apply the configured URLFilters.
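Instance (a hypothetical run; all crawldb paths are placeholders):
Shell code

$ # merge two separately built crawldbs into one, applying the configured URLFilters (placeholder paths)
$ bin/nutch mergedb crawl/crawldb_merged crawl1/crawldb crawl2/crawldb -filter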
Configuration file:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml

14. mergelinkdb
It is an alias for org.apache.nutch.crawl.LinkDbMerger. It merges several linkdbs and can optionally filter out unwanted content.
It is useful when linkdbs have been created in a distributed fashion from several segment groups and need to be combined into one. Alternatively, a single linkdb can be specified just to filter its URLs.
The tool can therefore be used purely for filtering; in that case only one linkdb is specified.
If a URL appears in several linkdbs, all of its inlinks are aggregated, but at most the number of inlinks given by db.max.inlinks is kept. If filtering is enabled, the URLFilters are applied to both the target URLs and their inlinks: if a target URL is prohibited, all of its inlinks are removed together with it; if only some inlinks are prohibited, just those are removed, and they do not count towards the maximum-inlinks limit mentioned above.
Usage:

$ bin/nutch mergelinkdb output_linkdb linkdb1 [linkdb2 linkdb3 ...] [-filter]

Parameter description:
output_linkdb: output linkdb directory.
linkdb1 [linkdb2 linkdb3 ...]: one or more input linkdb(s).
-filter: apply the configured URLFilters to the URLs and links in the linkdb(s).
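Instance (a hypothetical run; all linkdb paths are placeholders):
Shell code

$ # merge two linkdbs into one, filtering URLs through the configured URLFilters (placeholder paths)
$ bin/nutch mergelinkdb crawl/linkdb_merged crawl1/linkdb crawl2/linkdb -filter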
Configuration file:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
15. mergesegs
It is an alias for org.apache.nutch.segment.SegmentMerger. It merges several segments and can optionally split the output into one or more fixed-size segments.
Usage:
Shell code

$ bin/nutch mergesegs output_dir (-dir segments | seg1 seg2 ...) [-filter] [-slice NNNN]

Parameter description:
output_dir: name of the resulting segment, or the parent directory of the resulting segment slices.
-dir segments: parent directory containing several segments.
seg1 seg2 ...: list of segment directories.
-filter: filter the URLs through the configured URLFilters.
-slice NNNN: create several output segments, each containing NNNN URLs.
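Instance (a hypothetical run; the paths are placeholders):
Shell code

$ # merge all segments under crawl/segments into output segments of at most 50000 URLs each (placeholder paths)
$ bin/nutch mergesegs crawl/merged_segments -dir crawl/segments -slice 50000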

Configuration file:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
16. dedup
dedup is an alias for org.apache.nutch.indexer.DeleteDuplicates. It removes duplicate pages from a set of segment indexes.
Usage:
Shell code

$ bin/nutch dedup <indexes> ...

Parameter description:
<indexes>: one or more index directories.
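Instance (a hypothetical run; crawl/indexes is a placeholder path):
Shell code

$ # remove duplicate pages from the part indexes under crawl/indexes (placeholder path)
$ bin/nutch dedup crawl/indexes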
Configuration file:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
17. plugin
It is an alias for org.apache.nutch.plugin.PluginRepository. It loads a plug-in from the plug-in repository and executes the main method of the given class.
Usage:
Shell code

$ bin/nutch plugin <pluginid> <classname> [args ...]

Parameter description:
<pluginid>: id of the plug-in to execute.
<classname>: name of the class containing the main method.
[args]: arguments passed on to the plug-in.
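Instance (a hypothetical run; it assumes the protocol-http plug-in's Http class exposes a main method that fetches a URL, which may vary by Nutch version, and the URL is a placeholder):
Shell code

$ # run the main method of a class from the protocol-http plug-in (the class and its main method are assumed to exist in this version)
$ bin/nutch plugin protocol-http org.apache.nutch.protocol.http.Http http://www.example.com/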
Configuration file:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml

 
