Nutch stores the information it crawls from web pages in an HBase database; by default the table is named $crawlid_webpage. The table contents are stored as binary data and displayed in hexadecimal, so scanning the table directly or reading it through the Java API returns only hexadecimal-encoded binary values.
Nutch's readdb command converts the table into readable text. Its usage is:
$ bin/nutch readdb
Usage: WebTableReader (-stats | -url [url] | -dump <out_dir> [-regex regex])
                      [-crawlId <id>] [-content] [-headers] [-links] [-text]
    -crawlId <id>  - the id to prefix the schemas to operate on,
                     (default: storage.crawl.id)
    -stats [-sort] - print overall statistics to System.out
    [-sort]        - list status sorted by host
    -url <url>     - print information on <url> to System.out
    -dump <out_dir> [-regex regex] - dump the webtable to a text file in
                     <out_dir>
    -content       - dump also raw content
    -headers       - dump protocol headers
    -links         - dump links
    -text          - dump extracted text
    [-regex]       - filter on the URL of the webtable entry
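The usage text allows exactly one of -stats, -url or -dump, plus optional modifiers. As a rough sketch of how those options combine, the hypothetical helper below (build_readdb_args is not part of Nutch) assembles an argument list that could be passed to subprocess.run. It assumes the flag is spelled -crawlId, as in the Nutch 2.x sources, and that bin/nutch is invoked from the Nutch runtime directory.

```python
def build_readdb_args(mode, target=None, crawl_id=None, regex=None, sort=False,
                      content=False, headers=False, links=False, text=False):
    """Assemble an argv list for `bin/nutch readdb` (hypothetical helper)."""
    if mode == "stats":
        args = ["-stats"] + (["-sort"] if sort else [])
    elif mode == "url":
        if not target:
            raise ValueError("-url needs a URL")
        args = ["-url", target]
    elif mode == "dump":
        if not target:
            raise ValueError("-dump needs an output directory")
        args = ["-dump", target]
        if regex:
            # -regex only applies to -dump according to the usage text
            args += ["-regex", regex]
    else:
        raise ValueError("mode must be 'stats', 'url' or 'dump'")

    if crawl_id:
        args += ["-crawlId", crawl_id]
    # Optional switches that add more fields to the output
    for flag, enabled in (("-content", content), ("-headers", headers),
                          ("-links", links), ("-text", text)):
        if enabled:
            args.append(flag)
    return ["bin/nutch", "readdb"] + args


if __name__ == "__main__":
    print(" ".join(build_readdb_args("dump", "/mnt/jediael/2",
                                     crawl_id="test001", text=True)))
```

Keeping the mode separate from the modifiers mirrors the grouping in the usage line, so an invalid combination such as -regex with -stats is never produced.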
Example:
(1) The contents of seed.txt are as follows:
http://www.163.com
(2) Run the inject operation with the following command:
$ bin/nutch inject seed.txt -crawlId test001
(3) Scan the table; the contents are not human-readable:
hbase(main):002:0> scan 'test001_webpage'
ROW                   COLUMN+CELL
 com.163.money:http/  column=f:fi, timestamp=1423550107073, value=\x00'\x8D\x00
 com.163.money:http/  column=f:ts, timestamp=1423550107073, value=\x00\x00\x01Kr2\xC7\xD6
 com.163.money:http/  column=mk:_injmrk_, timestamp=1423550107073, value=y
 com.163.money:http/  column=mk:dist, timestamp=1423550107073, value=0
 com.163.money:http/  column=mtdt:_csh_, timestamp=1423550107073, value=?\x80\x00\x00
 com.163.money:http/  column=s:s, timestamp=1423550107073, value=?\x80\x00\x00
1 row(s) in 0.4090 seconds
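The cell values in the scan are not random: they look like big-endian Java primitives. The sketch below decodes three of them with Python's struct module. It assumes that f:fi holds the fetch interval as a 4-byte int, f:ts the fetch time as an 8-byte long (read here as \x00\x00\x01Kr2\xC7\xD6), and s:s the score as a 4-byte float. That mapping is inferred from the values matching the readdb dump in step (5), not taken from the Nutch schema.

```python
import struct

# Cell values copied from the scan of test001_webpage.
# Printable bytes appear as characters in the HBase shell (' is 0x27, K is 0x4B, ? is 0x3F).
FETCH_INTERVAL = b"\x00'\x8d\x00"          # column f:fi
FETCH_TIME = b"\x00\x00\x01Kr2\xc7\xd6"    # column f:ts
SCORE = b"?\x80\x00\x00"                   # column s:s


def decode_int(raw):
    """Interpret 4 bytes as a big-endian signed int."""
    return struct.unpack(">i", raw)[0]


def decode_long(raw):
    """Interpret 8 bytes as a big-endian signed long."""
    return struct.unpack(">q", raw)[0]


def decode_float(raw):
    """Interpret 4 bytes as a big-endian IEEE 754 float."""
    return struct.unpack(">f", raw)[0]


if __name__ == "__main__":
    print("fetchInterval:", decode_int(FETCH_INTERVAL))  # 2592000
    print("fetchTime:", decode_long(FETCH_TIME))         # 1423550105558
    print("score:", decode_float(SCORE))                 # 1.0
```

The decoded numbers agree with the fetchInterval, fetchTime and score fields that readdb prints, which suggests readdb performs this kind of conversion for every column.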
(4) Dump the contents of the table to /mnt/jediael/2:
$ bin/nutch readdb -dump /mnt/jediael/2
(5) Check what is in /mnt/jediael/2:
$ ll
total 4
-rwxrwxrwx. 1 jediael jediael 344 Feb 14:41 part-r-00000
-rwxrwxrwx. 1 jediael jediael   0 Feb 14:41 _SUCCESS
$ cat part-r-00000
http://money.163.com/ key: com.163.money:http/
baseUrl: null
status: 0 (null)
fetchTime: 1423550105558
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 1.0
marker _injmrk_ : y
marker dist : 0
reprUrl: null
metadata _csh_ : ? 锟
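The first line of the dump pairs the URL http://money.163.com/ with the row key com.163.money:http/, which is the URL with its host name reversed and the protocol moved after it. The function below is a minimal sketch of that transformation, useful for working out which row key to look up in the HBase shell for a given URL. It reproduces the format observed here and is not Nutch's own implementation; the name reverse_url and the handling of ports and query strings are assumptions.

```python
from urllib.parse import urlsplit


def reverse_url(url):
    """Build a row key of the form reversed.host:scheme[:port]/path (sketch)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # money.163.com -> com.163.money
    reversed_host = ".".join(reversed(host.split(".")))
    key = f"{reversed_host}:{parts.scheme}"
    if parts.port is not None:
        # Assumption: an explicit port is appended after the scheme
        key += f":{parts.port}"
    path = parts.path
    if parts.query:
        path += "?" + parts.query
    if path and not path.startswith("/"):
        path = "/" + path
    return key + path


if __name__ == "__main__":
    print(reverse_url("http://money.163.com/"))  # com.163.money:http/
```

Reversing the host keeps pages from the same domain next to each other in the table, since HBase sorts rows by key.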
Read the contents of the webpage table