Storage Structure of nutch-hbase

Source: Internet
Author: User

Webpage field description

Field value changes at each stage of webpage processing

Row: com.2345.www:http/
Col f:fi // fetchInterval
Col f:ts // fetchTime
The id is the primary key and is generated from the webpage URL (format: reversed domain name:protocol:port and path). Because each URL maps to exactly one row, nutch2 saves only the current status of a webpage and does not save historical information. (Strictly speaking, HBase itself can retain historical versions: each cell has a timestamp and a certain number of versions is kept, and versions beyond that number may be deleted.)
The rowkey format is reversed domain name:protocol:port and path. Example: com.2345.www:http/
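The rowkey format described above can be illustrated with a minimal Python sketch. The function name `reverse_url` and the handling of an empty path and of query strings are assumptions made for illustration; this is not Nutch's own code, only an approximation of the format "reversed domain name:protocol:port and path".

```python
from urllib.parse import urlsplit


def reverse_url(url: str) -> str:
    """Build a rowkey of the form reversed-host:protocol[:port]path.

    Hypothetical helper: it follows the format described in the text,
    not the actual Nutch implementation.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # "www.2345.com" -> "com.2345.www"
    reversed_host = ".".join(reversed(host.split(".")))
    key = f"{reversed_host}:{parts.scheme}"
    # The port is appended only when the URL states one explicitly.
    if parts.port is not None:
        key += f":{parts.port}"
    # Assumption: an empty path is stored as "/".
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return key + path


print(reverse_url("http://www.2345.com/"))
```

Reversing the domain name keeps pages from the same domain next to each other in the sorted HBase key space, which is presumably why the rowkey starts with it.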

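In essence, the URL serves as the rowkey, which deduplicates URLs: the same URL always maps to the same row. The fetchTime field can then be used to check whether a page has reached its generate date, that is, whether it is due to be fetched again.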
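The two ideas in this paragraph, deduplication by rowkey and selection by fetchTime, can be sketched as follows. A plain dictionary stands in for the HBase table; the helper names `put_page` and `due_for_generate`, the millisecond and second units, and the rule "due when fetchTime is not later than the current time" are all assumptions for illustration, not Nutch's actual generator logic.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for the table: rowkey -> {column: value}.
Table = Dict[str, Dict[str, int]]


def put_page(table: Table, rowkey: str, fetch_time_ms: int, fetch_interval_s: int) -> None:
    """Write a row; writing the same rowkey again overwrites it (deduplication)."""
    table[rowkey] = {"f:ts": fetch_time_ms, "f:fi": fetch_interval_s}


def due_for_generate(table: Table, now_ms: int) -> List[str]:
    """Return rowkeys whose fetchTime (f:ts) is not later than now_ms."""
    return sorted(key for key, cols in table.items() if cols["f:ts"] <= now_ms)


table: Table = {}
put_page(table, "com.2345.www:http/", 1000, 2592000)
put_page(table, "com.2345.www:http/", 2000, 2592000)    # same URL, same row
put_page(table, "org.example.www:http/", 5000, 2592000)

print(len(table))                     # two rows, not three
print(due_for_generate(table, 3000))  # only the first page is due
```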


