Read the contents of the webpage table

Source: Internet
Author: User







Nutch stores the information it crawls from web pages in an HBase table, named $crawlId_webpage by default. The table contents are stored as binary data, so a direct scan in the HBase shell or a read through the Java API returns only raw bytes, displayed as hexadecimal escapes.

The bin/nutch readdb command (WebTableReader) dumps the table as readable text. Its usage is:


$ bin/nutch readdb
Usage: WebTableReader (-stats | -url [url] | -dump <out_dir> [-regex regex])
                      [-crawlId <id>] [-content] [-headers] [-links] [-text]
    -crawlId <id>  - the id to prefix the schemas to operate on,
                     (default: storage.crawl.id)
    -stats [-sort] - print overall statistics to System.out
    [-sort]        - list status sorted by host
    -url <url>     - print information on <url> to System.out
    -dump <out_dir> [-regex regex] - dump the webtable to a text file in
                     <out_dir>
    -content       - dump also raw content
    -headers       - dump protocol headers
    -links         - dump links
    -text          - dump extracted text
    [-regex]       - filter on the URL of the webtable entry
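
The usage message above can be turned into a small command builder, which may help when readdb is called from a script. This is only a sketch: the function name, its parameters, and the output directory /tmp/webpage_dump are hypothetical, and only the flags themselves are taken from the usage text.

```python
import shlex


def build_readdb_command(mode, target=None, crawl_id=None, regex=None,
                         content=False, headers=False, links=False,
                         text=False, nutch_bin="bin/nutch"):
    """Assemble an argument list for `bin/nutch readdb`.

    Hypothetical helper: the flag names come from the usage message,
    everything else is an illustration.
    """
    if mode not in ("stats", "url", "dump"):
        raise ValueError("mode must be 'stats', 'url' or 'dump'")
    if mode != "stats" and not target:
        raise ValueError(f"-{mode} needs a URL or an output directory")
    if regex and mode != "dump":
        raise ValueError("-regex only applies to -dump")

    args = [nutch_bin, "readdb", f"-{mode}"]
    if mode != "stats":
        args.append(target)          # URL for -url, directory for -dump
    if regex:
        args += ["-regex", regex]    # filter webtable entries by URL
    if crawl_id:
        args += ["-crawlId", crawl_id]
    for flag, enabled in (("-content", content), ("-headers", headers),
                          ("-links", links), ("-text", text)):
        if enabled:
            args.append(flag)
    return args


# Hypothetical output directory; the crawl id matches the example in this article.
cmd = build_readdb_command("dump", "/tmp/webpage_dump",
                           crawl_id="test001", text=True)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run from the Nutch installation directory. Returning a list rather than a single string avoids quoting problems when the regex contains shell metacharacters.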

Example:
(1) The contents of seed.txt are as follows:
http://www.163.com

(2) Run the following command to inject the seed URL:
$ bin/nutch inject seed.txt -crawlId test001

(3) Scan the table; the values are not human-readable:

hbase(main):002:0> scan 'test001_webpage'
ROW                    COLUMN+CELL
 com.163.money:http/   column=f:fi, timestamp=1423550107073, value=\x00'\x8D\x00
 com.163.money:http/   column=f:ts, timestamp=1423550107073, value=\x00\x00\x01Kr2\xC7\xD6
 com.163.money:http/   column=mk:_injmrk_, timestamp=1423550107073, value=y
 com.163.money:http/   column=mk:dist, timestamp=1423550107073, value=0
 com.163.money:http/   column=mtdt:_csh_, timestamp=1423550107073, value=?\x80\x00\x00
 com.163.money:http/   column=s:s, timestamp=1423550107073, value=?\x80\x00\x00
1 row(s) in 0.4090 seconds
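
The escaped values in the scan are not random: they look like fixed-width big-endian numbers. The following sketch decodes three of them by hand. It assumes the HBase shell convention of printing printable ASCII bytes as-is and all other bytes as \xHH, and it assumes that f:fi, f:ts and s:s hold a 4-byte int, an 8-byte long and a 4-byte float; that mapping is inferred from the matching fetchInterval, fetchTime and score values in the readdb dump in step (5), not from the table schema. The function name is hypothetical.

```python
import struct


def unescape_hbase(value: str) -> bytes:
    """Turn an HBase shell value such as "\\x00'\\x8D\\x00" back into bytes."""
    out = bytearray()
    i = 0
    while i < len(value):
        if value.startswith("\\x", i) and i + 4 <= len(value):
            out.append(int(value[i + 2:i + 4], 16))  # \xHH escape
            i += 4
        else:
            out.append(ord(value[i]))                # printable ASCII byte
            i += 1
    return bytes(out)


# Values copied from the scan output (with the hex escapes restored).
fetch_interval = struct.unpack(">i", unescape_hbase(r"\x00'\x8D\x00"))[0]
fetch_time = struct.unpack(">q", unescape_hbase(r"\x00\x00\x01Kr2\xC7\xD6"))[0]
score = struct.unpack(">f", unescape_hbase(r"?\x80\x00\x00"))[0]

print(fetch_interval)  # 2592000 seconds, i.e. 30 days
print(fetch_time)      # 1423550105558 (milliseconds since the epoch)
print(score)           # 1.0
```

Decoding by hand like this is useful for spot checks only; for a whole table, the readdb dump in the next step is the more practical route.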


(4) Dump the table contents to /mnt/jediael/2 (also pass -crawlId test001 unless storage.crawl.id is already set to that value):
$ bin/nutch readdb -dump /mnt/jediael/2

(5) Check what is in /mnt/jediael/2:
$ ll
total 4
-rwxrwxrwx. 1 jediael jediael 344 14:41 part-r-00000
-rwxrwxrwx. 1 jediael jediael   0 Feb 14:41 _SUCCESS

$ cat part-r-00000
http://money.163.com/   key:    com.163.money:http/
baseUrl:        null
status: 0 (null)
fetchTime:      1423550105558
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:      0
modifiedTime:   0
prevModifiedTime:       0
protocolStatus: (null)
parseStatus:    (null)
title:  null
score:  1.0
marker _injmrk_ :       y
marker dist :   0
reprUrl:        null
metadata _csh_ :        ? (followed by unprintable bytes)
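
The dump pairs the page URL http://money.163.com/ with the row key com.163.money:http/, which suggests that the key is the URL with its host name reversed and the scheme moved behind it. The sketch below reproduces that mapping in both directions. It is inferred from this single example: the function names are hypothetical, the port and query handling are assumptions, and IPv6 hosts are not covered.

```python
from urllib.parse import urlsplit


def reverse_url(url: str) -> str:
    """Build a webpage-table style row key from a URL (inferred format)."""
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    key = f"{host}:{parts.scheme}"
    if parts.port is not None:       # assumption: explicit ports follow the scheme
        key += f":{parts.port}"
    path = parts.path
    if parts.query:
        path += "?" + parts.query
    return key + path


def unreverse_key(key: str) -> str:
    """Recover the URL from a row key produced by reverse_url."""
    location, slash, rest = key.partition("/")
    pieces = location.split(":")
    host = ".".join(reversed(pieces[0].split(".")))
    url = f"{pieces[1]}://{host}"
    if len(pieces) > 2:              # optional port
        url += f":{pieces[2]}"
    return url + slash + rest


print(reverse_url("http://money.163.com/"))
print(unreverse_key("com.163.money:http/"))
```

Keys built this way keep all pages of one domain adjacent in the table, so a scan with a row prefix such as com.163. would cover every 163.com host.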










