NUTCH2 crawl command decomposition: the detailed process of crawling a web page

Source: Internet
Author: User

First of all, crawl is an integration of inject, generate, fetch, parse, and updatedb (the specific meaning and function of each command will be described in subsequent articles). Open nutch_home/runtime/local/bin/crawl to see this.

I have pasted the main code below:

    # initial injection
    echo "Injecting seed urls"
    __bin_nutch inject "$SEEDDIR" -crawlId "$CRAWL_ID"

    # main loop : rounds of generate - fetch - parse - update
    for ((a=1; a <= LIMIT; a++))
    do
      ...
      echo "Generating a new fetchlist"
      generate_args=($commonOptions -topN $sizeFetchlist -noNorm -noFilter -adddays $addDays -crawlId "$CRAWL_ID" -batchId $batchId)
      $bin/nutch generate "${generate_args[@]}"
      ...
      echo "Fetching : "
      __bin_nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $batchId -crawlId "$CRAWL_ID" -threads 50
      ...
      __bin_nutch parse $commonOptions $skipRecordsOptions $batchId -crawlId "$CRAWL_ID"
      ...
      __bin_nutch updatedb $commonOptions $batchId -crawlId "$CRAWL_ID"
      ...
      echo "Indexing $CRAWL_ID on solr index -> $SOLRURL"
      __bin_nutch index $commonOptions -D solr.server.url=$SOLRURL -all -crawlId "$CRAWL_ID"
      ...
      echo "SOLR dedup -> $SOLRURL"
      __bin_nutch solrdedup $commonOptions $SOLRURL
    done
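The skeleton of that loop can be run on its own to see the round structure. This is a minimal sketch; LIMIT=2 and the echo stand-ins are illustrative, not the real nutch invocations:

```shell
#!/bin/sh
# Stand-in for bin/crawl's main loop: each round runs the same four phases.
# In the real script each echo would be a __bin_nutch call.
run_rounds() {
  LIMIT=2
  round=1
  while [ "$round" -le "$LIMIT" ]; do
    for phase in generate fetch parse updatedb; do
      echo "round $round: $phase"
    done
    round=$((round + 1))
  done
}
run_rounds
```

Each round feeds the next: updatedb writes the newly discovered URLs back, and the following generate picks them up.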

Now let's perform these steps manually.

All of the commands below are run from the runtime/local/ directory.

1. inject

First, write the seed file: the urls/url file should contain the sites you want to crawl. I will take http://www.6vhao.com as an example.

During the crawl I do not want it to fetch any site other than 6vhao.com, which can be arranged by editing the conf/regex-urlfilter.txt file:


# Accept anything else

+^http://www.6vhao.com/
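Rules in regex-urlfilter.txt are tried top to bottom: a `+` pattern accepts a URL, a `-` pattern rejects it, and a URL matching no rule is dropped. The effect of the single rule above can be simulated in plain shell with grep (an illustration of the rule's semantics, not Nutch's actual URLFilter code):

```shell
#!/bin/sh
# Simulate the single regex-urlfilter rule: +^http://www.6vhao.com/
# A URL matching the pattern is accepted; anything else falls through
# and is rejected (no catch-all '+.' rule remains).
filter_url() {
  if printf '%s\n' "$1" | grep -qE '^http://www\.6vhao\.com/'; then
    echo accepted
  else
    echo rejected
  fi
}
filter_url "http://www.6vhao.com/dy/"   # accepted
filter_url "http://www.example.com/"    # rejected
```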

Use the following command to start the crawl:

./bin/nutch inject urls/url -crawlId 6vhao

Using the list command in the HBase shell, you can see that a new table, 6vhao_webpage, has been created.

scan '6vhao_webpage' shows its contents:

    ROW                     COLUMN+CELL
    com.6vhao.www:http/     column=f:fi, timestamp=1446135434505, value=\x00'\x8D\x00
    com.6vhao.www:http/     column=f:ts, timestamp=1446135434505, value=\x00\x00\x01P\xB4c\x86\xAA
    com.6vhao.www:http/     column=mk:_injmrk_, timestamp=1446135434505, value=y
    com.6vhao.www:http/     column=mk:dist, timestamp=1446135434505, value=0
    com.6vhao.www:http/     column=mtdt:_csh_, timestamp=1446135434505, value=?\x80\x00\x00
    com.6vhao.www:http/     column=s:s, timestamp=1446135434505, value=?\x80\x00\x00


You can see that one row of HBase data has been generated, with columns in four column families (f, mk, mtdt, s); their specific meaning will be covered later.
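Note the row key com.6vhao.www:http/: Nutch stores URLs with the host reversed, so pages from the same domain sort next to each other in HBase. A quick sketch of that transformation (illustrative shell, not Nutch's actual Java row-key code):

```shell
#!/bin/sh
# Reverse a dot-separated host the way Nutch builds row keys:
# www.6vhao.com -> com.6vhao.www
reverse_host() {
  printf '%s' "$1" |
    awk -F. '{ for (i = NF; i >= 1; i--) printf "%s%s", $i, (i > 1 ? "." : "") }'
}
echo "$(reverse_host www.6vhao.com):http/"
```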

2. generate

Running ./bin/nutch generate without arguments prints the usage:

    -topN <N>      - number of top URLs to be selected, default is Long.MAX_VALUE
    -crawlId <id>  - the id to prefix the schemas to operate on,
                     (default: storage.crawl.id)
    -noFilter      - do not activate the filter plugin to filter the url, default is true
    -noNorm        - do not activate the normalizer plugin to normalize the url, default is true
    -adddays       - adds numDays to the current time to facilitate crawling urls already
                     fetched sooner then db.fetch.interval.default. default value is 0.
    -batchId       - the batch id

We specify -crawlId as 6vhao:

./bin/nutch generate -crawlId 6vhao

    com.6vhao.www:http/     column=f:bid, timestamp=1446135900858, value=1446135898-215760616
    com.6vhao.www:http/     column=f:fi, timestamp=1446135434505, value=\x00'\x8D\x00
    com.6vhao.www:http/     column=f:ts, timestamp=1446135434505, value=\x00\x00\x01P\xB4c\x86\xAA
    com.6vhao.www:http/     column=mk:_gnmrk_, timestamp=1446135900858, value=1446135898-215760616
    com.6vhao.www:http/     column=mk:_injmrk_, timestamp=1446135900858, value=y
    com.6vhao.www:http/     column=mk:dist, timestamp=1446135900858, value=0
    com.6vhao.www:http/     column=mtdt:_csh_, timestamp=1446135434505, value=?\x80\x00\x00
    com.6vhao.www:http/     column=s:s, timestamp=1446135434505, value=?\x80\x00\x00

Compared with the previous scan, two new columns have appeared: f:bid (the batch id) and mk:_gnmrk_ (the generate mark).

3. fetch

    Usage: FetcherJob (<batchId> | -all) [-crawlId <id>] [-threads N]
                      [-resume] [-numTasks N]
        <batchId>     - crawl identifier returned by generator, or -all for all
                        generated batchId-s
        -crawlId <id> - the id to prefix the schemas to operate on,
                        (default: storage.crawl.id)
        -threads N    - number of fetching threads per task
        -resume       - resume interrupted job
        -numTasks N   - if N > 0 then use this many reduce tasks for fetching
                        (default: mapred.map.tasks)

./bin/nutch fetch -all -crawlId 6vhao -threads 8

There is much more data now; the fetched page content is all in. You can inspect it yourself in HBase.

4. parse

    Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
        <batchId>     - symbolic batch id created by generator
        -crawlId <id> - the id to prefix the schemas to operate on,
                        (default: storage.crawl.id)
        -all          - consider pages from all crawl jobs
        -resume       - resume a previous incomplete job
        -force        - force re-parsing even if a page is already parsed

./bin/nutch parse -crawlId 6vhao -all

The parse results can be viewed in HBase.

5. updatedb

    Usage: DbUpdaterJob (<batchId> | -all) [-crawlId <id>]
        <batchId>     - crawl identifier returned by generator,
                        or -all for all generated batchId-s
        -crawlId <id> - the id to prefix the schemas to operate on,
                        (default: storage.crawl.id)

./bin/nutch updatedb -all -crawlId 6vhao

The results can again be viewed in HBase.
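Conceptually, updatedb folds the outlinks discovered during parse back into the webpage table, deduplicating against URLs that are already known, so that the next generate round can select them. A toy model of that merge in plain shell (nothing Nutch-specific; the URLs are illustrative, and sort -u plays the role of the database merge):

```shell
#!/bin/sh
# Toy model of updatedb: merge newly parsed outlinks into the known-URL
# set, keeping each URL only once.
merge_urls() {
  known="http://www.6vhao.com/"
  outlinks="http://www.6vhao.com/dy/
http://www.6vhao.com/"
  printf '%s\n%s\n' "$known" "$outlinks" | sort -u
}
merge_urls
```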

6. Repeat steps 2-5; running them once more crawls the site to a depth of 2.

solrindex will be discussed in the next article.




