NUTCH2 crawl command decomposition: the detailed process of crawling a web page

Source: Internet
Author: User

First of all, crawl is an integration of inject, generate, fetch, parse, and updatedb (the specific meaning and function of each command will be described in subsequent articles). Open nutch_home/runtime/local/bin/crawl to see this.

I have pasted the main code below:

    # initial injection
    echo "Injecting seed urls"
    __bin_nutch inject "$SEEDDIR" -crawlId "$CRAWL_ID"

    # main loop : rounds of generate - fetch - parse - update
    for ((a=1; a <= LIMIT; a++))
    do
      ...
      echo "Generating a new fetchlist"
      generate_args=($commonOptions -topN $sizeFetchlist -noNorm -noFilter -adddays $addDays -crawlId "$CRAWL_ID" -batchId $batchId)
      $bin/nutch generate "${generate_args[@]}"
      ...
      echo "Fetching : "
      __bin_nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $batchId -crawlId "$CRAWL_ID" -threads 50
      ...
      __bin_nutch parse $commonOptions $skipRecordsOptions $batchId -crawlId "$CRAWL_ID"
      ...
      __bin_nutch updatedb $commonOptions $batchId -crawlId "$CRAWL_ID"
      ...
      echo "Indexing $CRAWL_ID on solr index -> $SOLRURL"
      __bin_nutch index $commonOptions -D solr.server.url=$SOLRURL -all -crawlId "$CRAWL_ID"
      ...
      echo "SOLR dedup -> $SOLRURL"
      __bin_nutch solrdedup $commonOptions $SOLRURL
    done
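The skeleton of that loop can be run on its own to see the round structure. This is a minimal sketch; LIMIT=2 and the echo stand-ins are illustrative, not the real nutch invocations:

```shell
#!/bin/sh
# Stand-in for bin/crawl's main loop: each round runs the same four phases.
# In the real script each echo would be a __bin_nutch call.
run_rounds() {
  LIMIT=2
  round=1
  while [ "$round" -le "$LIMIT" ]; do
    for phase in generate fetch parse updatedb; do
      echo "round $round: $phase"
    done
    round=$((round + 1))
  done
}
run_rounds
```

Each round feeds the next: updatedb writes the newly discovered URLs back, and the following generate picks them up.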

Now let's perform these steps manually.

All of the commands below are run from the runtime/local/ directory.

1. inject

First, write the seed file: the urls/url file should contain the sites you want to crawl. I will take http://www.6vhao.com as an example.

During the crawl I do not want it to fetch any site other than 6vhao.com, which can be arranged by editing the conf/regex-urlfilter.txt file:


# Accept anything else

+^http://www.6vhao.com/
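Rules in regex-urlfilter.txt are tried top to bottom: a `+` pattern accepts a URL, a `-` pattern rejects it, and a URL matching no rule is dropped. The effect of the single rule above can be simulated in plain shell with grep (an illustration of the rule's semantics, not Nutch's actual URLFilter code):

```shell
#!/bin/sh
# Simulate the single regex-urlfilter rule: +^http://www.6vhao.com/
# A URL matching the pattern is accepted; anything else falls through
# and is rejected (no catch-all '+.' rule remains).
filter_url() {
  if printf '%s\n' "$1" | grep -qE '^http://www\.6vhao\.com/'; then
    echo accepted
  else
    echo rejected
  fi
}
filter_url "http://www.6vhao.com/dy/"   # accepted
filter_url "http://www.example.com/"    # rejected
```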

Use the following command to start the crawl:

./bin/nutch inject urls/url -crawlId 6vhao

Using the list command in the HBase shell, you can see that a new table, 6vhao_webpage, has been created.

scan '6vhao_webpage' shows its contents:

    ROW                     COLUMN+CELL
    com.6vhao.www:http/     column=f:fi, timestamp=1446135434505, value=\x00'\x8D\x00
    com.6vhao.www:http/     column=f:ts, timestamp=1446135434505, value=\x00\x00\x01P\xB4c\x86\xAA
    com.6vhao.www:http/     column=mk:_injmrk_, timestamp=1446135434505, value=y
    com.6vhao.www:http/     column=mk:dist, timestamp=1446135434505, value=0
    com.6vhao.www:http/     column=mtdt:_csh_, timestamp=1446135434505, value=?\x80\x00\x00
    com.6vhao.www:http/     column=s:s, timestamp=1446135434505, value=?\x80\x00\x00


You can see that one row of HBase data has been generated, with columns in four column families (f, mk, mtdt, s); their specific meaning will be covered later.
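Note the row key com.6vhao.www:http/: Nutch stores URLs with the host reversed, so pages from the same domain sort next to each other in HBase. A quick sketch of that transformation (illustrative shell, not Nutch's actual Java row-key code):

```shell
#!/bin/sh
# Reverse a dot-separated host the way Nutch builds row keys:
# www.6vhao.com -> com.6vhao.www
reverse_host() {
  printf '%s' "$1" |
    awk -F. '{ for (i = NF; i >= 1; i--) printf "%s%s", $i, (i > 1 ? "." : "") }'
}
echo "$(reverse_host www.6vhao.com):http/"
```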

2. generate

Running ./bin/nutch generate without arguments prints the usage:

    -topN <N>      - number of top URLs to be selected, default is Long.MAX_VALUE
    -crawlId <id>  - the id to prefix the schemas to operate on,
                     (default: storage.crawl.id)
    -noFilter      - do not activate the filter plugin to filter the url, default is true
    -noNorm        - do not activate the normalizer plugin to normalize the url, default is true
    -adddays       - adds numDays to the current time to facilitate crawling urls already
                     fetched sooner then db.fetch.interval.default. default value is 0.
    -batchId       - the batch id

We specify -crawlId as 6vhao:

./bin/nutch generate -crawlId 6vhao

    com.6vhao.www:http/     column=f:bid, timestamp=1446135900858, value=1446135898-215760616
    com.6vhao.www:http/     column=f:fi, timestamp=1446135434505, value=\x00'\x8D\x00
    com.6vhao.www:http/     column=f:ts, timestamp=1446135434505, value=\x00\x00\x01P\xB4c\x86\xAA
    com.6vhao.www:http/     column=mk:_gnmrk_, timestamp=1446135900858, value=1446135898-215760616
    com.6vhao.www:http/     column=mk:_injmrk_, timestamp=1446135900858, value=y
    com.6vhao.www:http/     column=mk:dist, timestamp=1446135900858, value=0
    com.6vhao.www:http/     column=mtdt:_csh_, timestamp=1446135434505, value=?\x80\x00\x00
    com.6vhao.www:http/     column=s:s, timestamp=1446135434505, value=?\x80\x00\x00

Compared with the previous scan, two new columns have appeared: f:bid (the batch id) and mk:_gnmrk_ (the generate mark).

3. fetch

    Usage: FetcherJob (<batchId> | -all) [-crawlId <id>] [-threads N]
                      [-resume] [-numTasks N]
        <batchId>     - crawl identifier returned by generator, or -all for all
                        generated batchId-s
        -crawlId <id> - the id to prefix the schemas to operate on,
                        (default: storage.crawl.id)
        -threads N    - number of fetching threads per task
        -resume       - resume interrupted job
        -numTasks N   - if N > 0 then use this many reduce tasks for fetching
                        (default: mapred.map.tasks)

./bin/nutch fetch -all -crawlId 6vhao -threads 8

There is much more data now; the fetched page content is all in. You can inspect it yourself in HBase.

4. parse

    Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
        <batchId>     - symbolic batch id created by generator
        -crawlId <id> - the id to prefix the schemas to operate on,
                        (default: storage.crawl.id)
        -all          - consider pages from all crawl jobs
        -resume       - resume a previous incomplete job
        -force        - force re-parsing even if a page is already parsed

./bin/nutch parse -crawlId 6vhao -all

The parse results can be viewed in HBase.

5. updatedb

    Usage: DbUpdaterJob (<batchId> | -all) [-crawlId <id>]
        <batchId>     - crawl identifier returned by generator,
                        or -all for all generated batchId-s
        -crawlId <id> - the id to prefix the schemas to operate on,
                        (default: storage.crawl.id)

./bin/nutch updatedb -all -crawlId 6vhao

The results can again be viewed in HBase.
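Conceptually, updatedb folds the outlinks discovered during parse back into the webpage table, deduplicating against URLs that are already known, so that the next generate round can select them. A toy model of that merge in plain shell (nothing Nutch-specific; the URLs are illustrative, and sort -u plays the role of the database merge):

```shell
#!/bin/sh
# Toy model of updatedb: merge newly parsed outlinks into the known-URL
# set, keeping each URL only once.
merge_urls() {
  known="http://www.6vhao.com/"
  outlinks="http://www.6vhao.com/dy/
http://www.6vhao.com/"
  printf '%s\n%s\n' "$known" "$outlinks" | sort -u
}
merge_urls
```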

6. Repeat steps 2-5; running them once more crawls the site to a depth of 2.

solrindex will be discussed in the next article.




