First of all: the crawl command is just an integration of inject, generate, fetch, parse and updatedb (the specific meaning and function of each command will be described in subsequent articles). Open nutch_home/runtime/local/bin/crawl; the main part of its code is:
```shell
# initial injection
echo "Injecting seed URLs"
__bin_nutch inject "$SEEDDIR" -crawlId "$CRAWL_ID"

# main loop : rounds of generate - fetch - parse - update
for ((a=1; a <= LIMIT ; a++))
do
  ...
  echo "Generating a new fetchlist"
  generate_args=($commonOptions -topN $sizeFetchlist -noNorm -noFilter -adddays $addDays -crawlId "$CRAWL_ID" -batchId $batchId)
  $bin/nutch generate "${generate_args[@]}"
  ...
  echo "Fetching : "
  __bin_nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $batchId -crawlId "$CRAWL_ID" -threads 50
  ...
  __bin_nutch parse $commonOptions $skipRecordsOptions $batchId -crawlId "$CRAWL_ID"
  ...
  __bin_nutch updatedb $commonOptions $batchId -crawlId "$CRAWL_ID"
  ...
  echo "Indexing $CRAWL_ID on SOLR index -> $SOLRURL"
  __bin_nutch index $commonOptions -D solr.server.url=$SOLRURL -all -crawlId "$CRAWL_ID"
  ...
  echo "SOLR dedup -> $SOLRURL"
  __bin_nutch solrdedup $commonOptions $SOLRURL
done
```
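For reference, the whole pipeline can also be run in one go with the crawl script itself instead of step by step. In Nutch 2.3 its arguments are the seed directory, the crawl id, the Solr URL, and the number of rounds; the Solr URL below is a placeholder assumption for a local install:

```shell
./bin/crawl urls/ 6vhao http://localhost:8983/solr/ 2
```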
Now let's perform those steps manually. All commands below are run from the runtime/local/ directory.
1. inject
First, write the seed file: put the site you want to crawl into urls/url. I'll use http://www.6vhao.com as an example.
During the crawl, I don't want it to fetch anything outside 6vhao.com, so the following rule is added to the conf/regex-urlfilter.txt file:
```
# accept anything else
+^http://www.6vhao.com/
```
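Two details about regex-urlfilter.txt worth knowing: rules are evaluated top-down and the first match wins, and the file as shipped ends with a `+.` catch-all that must be removed or replaced, otherwise every URL is still accepted. A complete minimal filter for this site could therefore look like the sketch below (escaping the dots is my addition; the unescaped form above also matches, just more loosely):

```
# accept pages on www.6vhao.com
+^http://www\.6vhao\.com/
# reject everything else (replaces the default "+." catch-all)
-.
```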
Use the following command to start the injection:

```shell
./bin/nutch inject urls/url -crawlId 6vhao
```
Using the list command in the HBase shell, we can see that a new table, 6vhao_webpage, has been created. scan '6vhao_webpage' shows its contents:
```
ROW                   COLUMN+CELL
com.6vhao.www:http/   column=f:fi, timestamp=1446135434505, value=\x00'\x8D\x00
com.6vhao.www:http/   column=f:ts, timestamp=1446135434505, value=\x00\x00\x01P\xB4c\x86\xAA
com.6vhao.www:http/   column=mk:_injmrk_, timestamp=1446135434505, value=y
com.6vhao.www:http/   column=mk:dist, timestamp=1446135434505, value=0
com.6vhao.www:http/   column=mtdt:_csh_, timestamp=1446135434505, value=?\x80\x00\x00
com.6vhao.www:http/   column=s:s, timestamp=1446135434505, value=?\x80\x00\x00
```
You can see that inject generated one row of HBase data, with cells in four column families (f, mk, mtdt, s); their specific meaning will be covered later.
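Only four families are populated so far, but the table schema created from the default gora-hbase-mapping defines more (for example p for parse results and ol/il for out/inlinks) that later steps will fill in. You can list them from the HBase shell:

```
hbase shell
describe '6vhao_webpage'
```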
2. generate
The command is ./bin/nutch generate, with the following options:

```
-topN <N>      - number of top URLs to be selected, default is Long.MAX_VALUE
-crawlId <id>  - the id to prefix the schemas to operate on, (default: storage.crawl.id)
-noFilter      - do not activate the filter plugin to filter the url, default is true
-noNorm        - do not activate the normalizer plugin to normalize the url, default is true
-adddays       - adds numDays to the current time to facilitate crawling urls already
                 fetched sooner than db.fetch.interval.default. default value is 0
-batchId       - the batch id
```
We specify -crawlId as 6vhao:

```shell
./bin/nutch generate -crawlId 6vhao
```
```
com.6vhao.www:http/   column=f:bid, timestamp=1446135900858, value=1446135898-215760616
com.6vhao.www:http/   column=f:fi, timestamp=1446135434505, value=\x00'\x8d\x00
com.6vhao.www:http/   column=f:ts, timestamp=1446135434505, value=\x00\x00\x01p\xb4c\x86\xaa
com.6vhao.www:http/   column=mk:_gnmrk_, timestamp=1446135900858, value=1446135898-215760616
com.6vhao.www:http/   column=mk:_injmrk_, timestamp=1446135900858, value=y
com.6vhao.www:http/   column=mk:dist, timestamp=1446135900858, value=0
com.6vhao.www:http/   column=mtdt:_csh_, timestamp=1446135434505, value=?\x80\x00\x00
com.6vhao.www:http/   column=s:s, timestamp=1446135434505, value=?\x80\x00\x00
```
Compared with the result after inject, two new columns have appeared: f:bid and mk:_gnmrk_, both holding the generated batch id (1446135898-215760616).
3. fetch
```
Usage: FetcherJob (<batchId> | -all) [-crawlId <id>] [-threads N] [-resume] [-numTasks N]
  <batchId>     - crawl identifier returned by generator, or -all for all generated batchId-s
  -crawlId <id> - the id to prefix the schemas to operate on, (default: storage.crawl.id)
  -threads N    - number of fetching threads per task
  -resume       - resume interrupted job
  -numTasks N   - if N > 0 then use this many reduce tasks for fetching (default: mapred.map.tasks)
```
```shell
./bin/nutch fetch -all -crawlId 6vhao -threads 8
```
After this step there is much more data: the raw content of the fetched pages is now in HBase, which you can inspect yourself.
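Instead of -all, you can also fetch a single batch by passing the batchId that generate printed — the same value we saw stored in f:bid and mk:_gnmrk_ above:

```shell
./bin/nutch fetch 1446135898-215760616 -crawlId 6vhao -threads 8
```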
4. parse
```
Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
  <batchId>     - symbolic batch id created by generator
  -crawlId <id> - the id to prefix the schemas to operate on, (default: storage.crawl.id)
  -all          - consider pages from all crawl jobs
  -resume       - resume a previous incomplete job
  -force        - force re-parsing even if a page is already parsed
```
```shell
./bin/nutch parse -crawlId 6vhao -all
```
The parse results can be viewed in HBase.
5. updatedb
```
Usage: DbUpdaterJob (<batchId> | -all) [-crawlId <id>]
  <batchId>     - crawl identifier returned by generator, or -all for all generated batchId-s
  -crawlId <id> - the id to prefix the schemas to operate on, (default: storage.crawl.id)
```
```shell
./bin/nutch updatedb -all -crawlId 6vhao
```

The results can be viewed in HBase.
6. Repeat steps 2-5 once more; each generate-fetch-parse-updatedb cycle crawls one more layer, so two rounds crawl the site to a depth of 2.
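Steps 2-5 can be wrapped in a small loop — essentially what bin/crawl automates. The sketch below is untested against a live install and makes a few assumptions: it is run from runtime/local with HBase up, and -topN 1000 is an arbitrary per-round limit. DRY_RUN defaults to 1 so the script only records the commands it would run; set DRY_RUN= to actually execute them.

```shell
#!/bin/sh
# Manual generate-fetch-parse-updatedb loop, one cycle per crawl layer.
DRY_RUN=${DRY_RUN:-1}
CRAWL_ID=6vhao
ROUNDS=2                 # crawl depth
PLANNED=""

run() {
  if [ -n "$DRY_RUN" ]; then
    # record the command instead of running it
    PLANNED="$PLANNED./bin/nutch $*
"
  else
    ./bin/nutch "$@" || exit 1   # stop the loop on the first failing step
  fi
}

round=1
while [ "$round" -le "$ROUNDS" ]; do
  run generate -topN 1000 -crawlId "$CRAWL_ID"
  run fetch -all -crawlId "$CRAWL_ID" -threads 8
  run parse -all -crawlId "$CRAWL_ID"
  run updatedb -all -crawlId "$CRAWL_ID"
  round=$((round + 1))
done

printf '%s' "$PLANNED"
```

Error handling is deliberately blunt (exit on the first failure); the real bin/crawl script checks each job's exit code the same way before moving on.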
solrindex will be covered in the next section.
NUTCH2 crawl command decomposition: the detailed process of crawling web pages