Detailed description of common Nutch commands


Nutch is operated entirely through commands. A single command can crawl an intranet in one step, or a sequence of step-by-step commands can be used to crawl the whole web. The main commands are as follows:
 
1. crawl
crawl is an alias for org.apache.nutch.crawl.Crawl. It runs the complete crawling and indexing process as a single command.
Usage:
Shell code
$ bin/nutch crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]
 
Parameter description:
<urlDir>: an existing directory containing text files that list the seed URLs.
[-dir <d>]: working directory in which Nutch stores the crawl data. The default is ./crawl-[date], where [date] is the current date.
[-threads <n>]: number of fetcher threads; overrides fetcher.threads.fetch from the default configuration (10 by default).
[-depth <i>]: depth of the crawl iterations. The default is 5.
[-topN <N>]: limit each iteration to the top N URLs. The default is Integer.MAX_VALUE.
 
Configuration files:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
crawl-tool.xml

Other files:
crawl-urlfilter.txt
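Example (a minimal sketch; the seed directory urls and the output directory crawl are hypothetical):
Shell code

$ bin/nutch crawl urls -dir crawl -threads 10 -depth 3 -topN 1000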
 
2. readdb
readdb is an alias for org.apache.nutch.crawl.CrawlDbReader. It prints or exports the information stored in the crawl database (crawldb).
Usage:
Shell code

$ bin/nutch readdb <crawldb> (-stats | -dump <out_dir> | -url <url>)
 
Parameter description:
<crawldb>: crawldb directory.
[-stats]: print overall statistics to the console.
[-dump <out_dir>]: dump the crawldb entries to files in the specified directory.
[-url <url>]: print the statistics for the specified URL.
Instance:
Shell code

$ bin/nutch readdb fullindex/crawldb -stats

CrawlDb statistics start: fullindex/crawldb
Statistics for CrawlDb: fullindex/crawldb
TOTAL urls: 468030
retry 0: 467361
retry 1: 622
retry 2: 32
retry 3: 15
min score: 0.0
avg score: 0.0034686408
max score: 61.401
status 1 (db_unfetched): 312748
status 2 (db_fetched): 80671
status 3 (db_gone): 69927
status 4 (db_redir_temp): 1497
status 5 (db_redir_perm): 3187
CrawlDb statistics: done
Configuration files:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
 
Note:
The -stats option is a useful way to get a quick overview of the crawl. Its output fields mean:
db_unfetched: the number of pages whose links have been discovered but which have not yet been fetched (either because they were rejected by the URL filters or because they fell outside the topN limit and were dropped by Nutch).
db_gone: pages that returned a 404 or some other permanent error; this status prevents any further attempt to fetch them.
db_fetched: pages that have been fetched and indexed. If this value is 0, something has gone wrong.
 
3. readlinkdb
It is an alias for "org. Apache. nutch. Crawl. linkdbreader". It exports information from the link library or returns a URL.
Usage:
Shell code
 
$ Bin/nutchreadlinkdb<Linkdb>(-Dump<Out_dir>|-URL<URL>)
 
Parameter description:
<linkdb>: linkdb working directory.
[-dump <out_dir>]: dump the linkdb to files in the specified directory.
[-url <url>]: print the inlink information for the given URL.
Instance:
Shell code

$ bin/nutch readlinkdb fullindex/linkdb -url www.ccnu.edu.cn
- no link information
 
 
Configuration files:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
 
4. inject
inject is an alias for org.apache.nutch.crawl.Injector. It injects new URLs into the crawldb.
Usage:
Shell code

$ bin/nutch inject <crawldb> <urlDir>
 
Parameter description:
<crawldb>: crawldb directory.
<urlDir>: directory containing text files that list the URLs to inject.
Configuration files:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
 
The following configuration parameters affect how injection works:
db.default.fetch.interval -- sets the default re-fetch interval in days. The default is 30.0f.
db.score.injected -- sets the default score for injected URLs. The default is 1.0f.
urlnormalizer.class -- the class used to normalize URLs. The default is org.apache.nutch.net.BasicUrlNormalizer.
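Example (a minimal sketch; the paths crawl/crawldb and urls are hypothetical):
Shell code

$ bin/nutch inject crawl/crawldb urls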
 
5. generate
generate is an alias for org.apache.nutch.crawl.Generator. It generates a new fetch list (segment) from the crawldb.
Usage:
Shell code

$ bin/nutch generate <crawldb> <segments_dir> [-topN <num>] [-numFetchers <fetchers>] [-addDays <days>]
 
Parameter description:
<crawldb>: crawldb directory.
<segments_dir>: directory in which the new fetch segment is created.
[-topN <num>]: select the top <num> links. The default is Long.MAX_VALUE.
[-numFetchers <fetchers>]: number of fetch partitions. Default: the configuration key mapred.map.tasks, falling back to 1.
[-addDays <days>]: add <days> to the current time when deciding, based on db.default.fetch.interval, which URLs are due for fetching. The default is 0, meaning only URLs whose scheduled fetch time is already in the past are selected.
Example:
Shell code

$ bin/nutch generate /my/crawldb /my/segments -topN 100 -addDays 20
 
Configuration files:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
Note:
generate.max.per.host -- sets the maximum number of URLs selected per host. The default is unlimited.
 
6. fetch
fetch is an alias for org.apache.nutch.fetcher.Fetcher. It fetches the pages listed in a segment.
Usage:
Shell code

$ bin/nutch fetch <segment> [-threads <n>] [-noParsing]
 
Parameter description:
<segment>: segment directory.
[-threads <n>]: number of fetcher threads to run. Default: the value of the configuration key fetcher.threads.fetch, which is 10.
[-noParsing]: disable automatic parsing of the fetched segment data.
Configuration files:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
Note:
Fetcher relies on several plug-ins to fetch different protocols. The currently supported protocols and their plug-ins are:
http:
protocol-http
protocol-httpclient
https:
protocol-httpclient
ftp:
protocol-ftp
file:
protocol-file
When crawling documents on the web, do not use protocol-file, which is meant for crawling local files. To crawl both http and https, use protocol-httpclient.
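Example (a minimal sketch; the segment path is hypothetical and would normally be the newest directory created by generate):
Shell code

$ bin/nutch fetch crawl/segments/20240101123456 -threads 10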
7. parse
parse is an alias for org.apache.nutch.parse.ParseSegment. It runs ParseSegment over the given segment.
Usage:
Shell code

$ bin/nutch parse <segment>
 
Parameter description:
<segment>: segment directory.
Configuration files:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
Note:
It relies on plug-ins to parse content in different formats. The supported formats and their plug-ins are:

Content type                      Plug-in              Remarks
text/html                         parse-html           Parses HTML using NekoHTML or TagSoup
application/x-javascript          parse-js             Parses JavaScript documents (.js)
audio/mpeg                        parse-mp3            Parses MP3 audio files (.mp3)
application/vnd.ms-excel          parse-msexcel        Parses MS Excel documents (.xls)
application/vnd.ms-powerpoint     parse-mspowerpoint   Parses MS PowerPoint documents
application/msword                parse-msword         Parses MS Word documents
application/rss+xml               parse-rss            Parses RSS documents (.rss)
application/rtf                   parse-rtf            Parses RTF documents (.rtf)
application/pdf                   parse-pdf            Parses PDF documents
application/x-shockwave-flash     parse-swf            Parses Flash documents (.swf)
text/plain                        parse-text           Parses plain-text documents (.txt)
application/zip                   parse-zip            Parses zip archives (.zip)
other types                       parse-ext            Parses documents with an external command chosen by content type or path suffix

By default only the plug-ins for txt, HTML and JS are enabled; the others must be enabled in nutch-site.xml before they can be used.
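Example (a minimal sketch; the segment path is hypothetical):
Shell code

$ bin/nutch parse crawl/segments/20240101123456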
8. segread
segread is an alias for org.apache.nutch.segment.SegmentReader. It reads and dumps segment data.
Usage:
Shell code

$ bin/nutch segread <segment>
 
Parameter description:
<segment>: segment directory.
Configuration files:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
Note:
In versions after Nutch 0.9, this command was renamed readseg.
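Example (a minimal sketch following the usage shown above; the segment path is hypothetical, and on newer releases the command is readseg):
Shell code

$ bin/nutch segread crawl/segments/20240101123456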
9. updatedb
updatedb is an alias for org.apache.nutch.crawl.CrawlDb. It updates the crawldb with the information gathered during a fetch.
Usage:
Shell code

$ bin/nutch updatedb <crawldb> <segment> [-noAdditions]
 
Parameter description:
<crawldb>: crawldb directory.
<segment>: the fetched segment directory.
[-noAdditions]: do not add newly discovered links to the crawldb.
Configuration files:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
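Example (a minimal sketch; both paths are hypothetical):
Shell code

$ bin/nutch updatedb crawl/crawldb crawl/segments/20240101123456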
 
10. invertlinks
invertlinks is an alias for org.apache.nutch.crawl.LinkDb. It updates the linkdb with link information extracted from the segments.
Usage:
Shell code

$ bin/nutch invertlinks <linkdb> (-dir segmentsDir | segment1 segment2 ...)

Parameter description:
<linkdb>: linkdb directory.
<segment>: segment directories; either give -dir with a parent directory or list at least one segment.
 
Configuration files:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
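Example (a minimal sketch; the paths are hypothetical):
Shell code

$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments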
11. index
index is an alias for org.apache.nutch.indexer.Indexer. It builds an index over the given segments, using the data in the crawldb and linkdb to score the pages in the index.
Usage:
Shell code

$ bin/nutch index <index> <crawldb> <linkdb> <segment> ...
 
Parameter description:
<index>: directory in which the index is created.
<crawldb>: crawldb directory.
<linkdb>: linkdb directory.
<segment>: segment directories; more than one may be specified.
Configuration files:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
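Example (a minimal sketch; the paths are hypothetical):
Shell code

$ bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/20240101123456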
 
12. merge
merge is an alias for org.apache.nutch.indexer.IndexMerger. It merges several segment indexes into one.
Usage:

$ bin/nutch merge [-workingdir <workingdir>] <outputIndex> <indexesDir> ...

Parameter description:
[-workingdir <workingdir>]: working directory to use during the merge.
<outputIndex>: directory in which the merged index is stored.
<indexesDir>: directories containing the indexes to merge; more than one may be specified.
 
Configuration files:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
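Example (a minimal sketch; the paths are hypothetical):
Shell code

$ bin/nutch merge crawl/index crawl/indexes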
13. mergedb
mergedb is an alias for org.apache.nutch.crawl.CrawlDbMerger. It merges several crawldbs, and the URLFilters can optionally be applied to filter out unwanted content.
It lets you merge several databases into one, which is useful when crawls have been run separately and their databases need to be combined. Optionally, the current URLFilters can be run over the URLs in the databases to drop the ones you do not want. This is also useful when there is only a single db, since the job can then be used simply to filter unwanted URLs out of that db.
Using the tool purely for filtering is therefore possible; in that case only one crawldb is specified.
If the same URL appears in several crawldbs, only the most recent version is kept, as determined by the value of org.apache.nutch.crawl.CrawlDatum.getFetchTime(). The metadata of all versions is aggregated, however, with newer values replacing older ones.
Usage:

$ bin/nutch mergedb output_crawldb crawldb1 [crawldb2 crawldb3 ...] [-filter]

Parameter description:
output_crawldb: output crawldb directory.
crawldb1 [crawldb2 crawldb3 ...]: one or more input crawldbs.
-filter: apply the URLFilters to the URLs.
Configuration files:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
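Example (a minimal sketch; the paths are hypothetical):
Shell code

$ bin/nutch mergedb crawl/crawldb_merged crawl1/crawldb crawl2/crawldb -filter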
 
14. mergelinkdb
mergelinkdb is an alias for org.apache.nutch.crawl.LinkDbMerger. It merges several linkdbs into one, and the URLFilters can optionally be applied to filter out unwanted content.
It is useful when linkdbs have been created separately from several groups of segments and need to be combined into one, or when a single linkdb should simply be filtered with the URL filters.
Using the tool purely for filtering is therefore possible; in that case only one linkdb is specified.
If a URL appears in more than one linkdb, all of its inlinks are aggregated, but at most the number of inlinks given by db.max.inlinks is kept. If filtering is enabled, the URLFilters are applied to both the target URLs and their inlinks. If a target URL is rejected, it is removed together with all of its inlinks; if only some inlinks are rejected, just those are removed, and they are not counted toward the maximum mentioned above.
Usage:

$ bin/nutch mergelinkdb output_linkdb linkdb1 [linkdb2 linkdb3 ...] [-filter]

Parameter description:
output_linkdb: output linkdb directory.
linkdb1 [linkdb2 linkdb3 ...]: one or more input linkdbs.
-filter: apply the URLFilters to the URLs and links in the linkdb(s).
Configuration files:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
15. mergesegs
mergesegs is an alias for org.apache.nutch.segment.SegmentMerger. It merges several segments, optionally filtering the URLs and optionally slicing the output into one or more fixed-size segments.
Usage:
Shell code

$ bin/nutch mergesegs output_dir (-dir segments | seg1 seg2 ...) [-filter] [-slice NNNN]

Parameter description:
output_dir: name of the resulting segment, or the parent directory of the resulting segment slices.
-dir segments: parent directory containing several segments.
seg1 seg2 ...: list of segment directories.
-filter: filter the URLs through the URLFilters.
-slice NNNN: create several output segments, each containing NNNN URLs.
 
Configuration files:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
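Example (a minimal sketch; the paths and slice size are hypothetical):
Shell code

$ bin/nutch mergesegs crawl/segments_merged -dir crawl/segments -filter -slice 50000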
16. dedup
dedup is an alias for org.apache.nutch.indexer.DeleteDuplicates. It removes duplicate pages from a set of segment indexes.
Usage:
Shell code

$ bin/nutch dedup <indexes> ...

Parameter description:
<indexes>: one or more index directories.
Configuration files:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
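Example (a minimal sketch; the indexes path is hypothetical):
Shell code

$ bin/nutch dedup crawl/indexes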
17. plugin
plugin is an alias for org.apache.nutch.plugin.PluginRepository. It loads a plug-in from the plug-in repository and executes its main method.
Usage:
Shell code

$ bin/nutch plugin <pluginId> <className> [args ...]

Parameter description:
<pluginId>: id of the plug-in to execute.
<className>: name of the class containing the main method to run.
[args]: arguments passed to the plug-in.
Configuration files:
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
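Example (a minimal sketch; the plug-in id myplugin and the class org.example.MyPlugin are purely hypothetical placeholders):
Shell code

$ bin/nutch plugin myplugin org.example.MyPlugin arg1 arg2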
