Hadoop HBase Case Studies: Hadoop Learning Notes (2)

Source: Internet
Author: User
Tags: opentsdb

I was fortunate enough to take the Hadoop trial class at MOOC College.
These are my notes on chapter 8 of the Little Elephant College Hadoop 2.X overview course.
The chapter mainly introduces application case studies of HBase, a distributed database.

Case Overview:

1) Time series database (OpenTSDB)
Uses HBase to store time series data, so that data at every point in time can be resolved; the database is open source
2) HBase crawler scheduling library
Vertical search crawlers
Mass crawlers (whole-web crawlers)
This case defines a scheduler for the URLs to be crawled
3) HBase document library
A document database, with the emphasis on storage
4) Bank RMB inquiry system

If you are not reading this on cnblogs (Blog Park), note that this post belongs to http://www.cnblogs.com/weibaar

Only the cnblogs version is guaranteed to have a clean layout with code blocks and pictures displayed correctly. Other sites reposting it, please keep the author information and respect copyright.

The application of HBase in practical problems:

HBase suits applications that need random reads and writes, or high concurrency (many operations on big data), or data whose structure is simple but whose volume is large (non-relational data that does not depend on many join operations)
Relational queries such as joins are difficult to do in HBase
The key is to design the rowkey so that queries are fast
The usual client language is Java; other languages can operate on HBase through Thrift

In rowkey design, avoid rowkey hot spots, make full use of the rowkey's sorted order, and combine the fields that queries need into the rowkey

Time Series Database

OpenTSDB is a distributed, scalable time series database
It captures data at second-level granularity, supports permanent storage and capacity planning, and stores and indexes data from different metrics
Ordinary MySQL does not have enough capacity or enough support for dimensions
Lessons learned from this database (there are probably omissions):
1) Put more columns and more data in each row so that scans are faster (scanning across columns is faster than scanning across rows)
2) Keep each row of data relatively independent. Split rows according to a fixed rule (for example, one row per 10 seconds of data, keyed by timestamp)
3) Store more data in each KeyValue
4) Do not use synchronous clients on the server (such as HTable/HTablePool); prefer asynchbase for high-concurrency database access
5) Keep keys as short as possible
6) Do not store too much in one region (?)

Methods for storing time series

Each row is keyed by a metric and a time, and its values can be stored for different dimensions
Put the metric ID in front of the time when composing the key: the corresponding metric can then be scanned faster, and storage space is saved (the metric is stored as a numeric ID rather than directly by its name)
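
The key layout described above can be sketched in a few lines of Python. This is only an illustration, not OpenTSDB's actual code: the METRIC_IDS table, the make_rowkey helper, and the 3-byte ID plus 4-byte timestamp widths are assumptions chosen for the example.

```python
import struct

# Hypothetical metric-name -> numeric-ID table (the IDs are made up).
METRIC_IDS = {"cpu.load": 1, "mem.free": 2}

def make_rowkey(metric: str, timestamp: int) -> bytes:
    """Build a rowkey as <3-byte metric ID><4-byte timestamp>, big-endian."""
    metric_id = METRIC_IDS[metric]
    # Big-endian encoding makes byte order match numeric order.
    return metric_id.to_bytes(3, "big") + struct.pack(">I", timestamp)

# Sorting the keys mimics how HBase orders rows by rowkey bytes.
keys = sorted(
    make_rowkey(m, t)
    for m in ("mem.free", "cpu.load")
    for t in (1400000020, 1400000000, 1400000010)
)
# All cpu.load rows now sit together, ordered by time, before any mem.free row.
```

Because the metric ID comes first, one metric's rows form a contiguous range, and the 7-byte key is shorter than a key that spells out the metric name.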

You can also widen the row so that each row stores more data (columns at offsets +0, +1, +2), but this does not save any space; it only changes how the data is laid out.
A row cannot be widened indefinitely, though.
In addition, so that a network interruption does not put data in the wrong row, it is recommended to split rows by timestamp rather than by counting time +1, +2, +3
There is a corresponding PDF, which can be found by searching online.
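
A minimal sketch of the wide-row idea, assuming a 10-second row span (the example span used earlier in these notes) and using a plain dict in place of an HBase table. The names ROW_SPAN, split_timestamp, and store are hypothetical.

```python
ROW_SPAN = 10  # seconds covered by one row (assumed for this sketch)

def split_timestamp(timestamp: int):
    """Return (base_time, offset): base_time goes in the rowkey, offset names the column."""
    base_time = timestamp - timestamp % ROW_SPAN
    return base_time, timestamp - base_time

def store(table: dict, timestamp: int, value: float) -> None:
    """Write one data point into the row and column derived from its timestamp."""
    base_time, offset = split_timestamp(timestamp)
    table.setdefault(base_time, {})[offset] = value

table = {}
for ts, v in [(1400000003, 0.5), (1400000007, 0.7), (1400000012, 0.9)]:
    store(table, ts, v)
# Retrying a write after a failure lands in the same cell, not in a new one.
store(table, 1400000007, 0.7)
```

The row and column are computed from the timestamp itself rather than from a running counter, so a repeated or late write cannot drift into the wrong row.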

Summary

Widening rows and composing the rowkey well can increase scan speed, but neither saves space
Only merging columns and shortening the column family name reduce the space used to some extent

Vertical Crawler Scheduling Library


Multiple groups (newsgroups and so on) are processed at the same time and stored in the dispatch library; HBase only needs to be read periodically

Characteristics

Crawler software needs to schedule the URLs it crawls based on real-time requirements, priority, and other stored scheduling information.
The crawler also needs to maintain URL lists for different groups.
The library basically behaves like a queue: the URL inserted first is crawled first. Priorities can also be customized. And because data volumes differ greatly (large pictures, for example), resources need to be allocated sensibly.
Examples include scheduling by vertical business, rate limiting for site crawls, and scheduling by timestamp.

Dispatch Library

Stores host attributes and the per-host URL lists for different channels.
URLs are sorted by host ID and priority.
This matches the OpenTSDB approach above: the rowkey does not use the name but sorts by an ID (taken from the host name table)
This lets each scan thread process the URLs in its own rowkey interval

Summary:

Make full use of the rowkey's sorted order
Combine the useful fields into the rowkey: hostid+pid+urlid
Do not use strings directly as the rowkey; encode them as integers for scanning, which saves space (because the rowkey is stored with every column)
Integer IDs also give the key a normalized form
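
The points above can be illustrated with a small Python sketch. It assumes that pid is a priority value where a lower number is crawled first, and the field widths (4-byte host ID, 1-byte priority, 8-byte URL ID) are made up for the example; url_rowkey and scan_host are hypothetical helpers, with a sorted list standing in for the HBase table.

```python
import struct
from bisect import bisect_left

def url_rowkey(host_id: int, priority: int, url_id: int) -> bytes:
    """Fixed-width big-endian key: <4-byte host ID><1-byte priority><8-byte URL ID>."""
    return struct.pack(">IBQ", host_id, priority, url_id)

def scan_host(sorted_keys, host_id):
    """Return the keys of one host, i.e. the key range [host_id, host_id + 1)."""
    start = struct.pack(">I", host_id)
    stop = struct.pack(">I", host_id + 1)  # assumes host_id is below 2**32 - 1
    return sorted_keys[bisect_left(sorted_keys, start):bisect_left(sorted_keys, stop)]

# Sorting mimics how HBase orders rows by rowkey bytes.
keys = sorted([
    url_rowkey(2, 0, 11),
    url_rowkey(1, 5, 12),
    url_rowkey(1, 0, 13),
    url_rowkey(1, 0, 10),
])
host1 = scan_host(keys, 1)  # host 1's URLs, highest priority (lowest number) first
```

Each host occupies one contiguous key range, so separate scan threads can each take a range, and within a range the URLs already come back in priority order.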

Document Library

The document library works on a similar principle to the dispatch library
It stores web pages so that more refined data can be extracted from them later

Characteristics:

Data formats vary, real-time reads and writes (and updates) are required, and stored items are related to each other (for example, blog comments are related to the body text)


Technical features

Split basic data and dynamic data (two column families)
Basic data does not change (page title, content, creation time)
Dynamic data can change in real time (view counts and so on)
Here it is no longer one server per group; instead, multiple servers serve multiple groups, to meet each group's different data refinement requirements
Association
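
As a rough illustration of the two-column-family split, the sketch below models one document row as nested dicts. The family names "base" and "dyn", the field values, and the record_view helper are all made up for the example; a real table would be accessed through an HBase client.

```python
# One document row with two column families, modelled as nested dicts:
# "base" holds fields that rarely change, "dyn" holds frequently updated counters.
row = {
    "base": {"title": "HBase notes", "content": "body text", "created": "2015-01-01"},
    "dyn": {"views": 0},
}

def record_view(doc_row: dict) -> None:
    """Update only the dynamic family; the base family is never rewritten."""
    doc_row["dyn"]["views"] += 1

base_before = dict(row["base"])
record_view(row)
record_view(row)
```

Frequent updates touch only the small dynamic family, while the larger basic data stays as written.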


Characteristics of Bank RMB Inquiry system:

Large scale and dispersed equipment (ATMs, cash counters, and so on); the collection system must be timely and must not miss any records
The rowkey can be built from the banknote serial number (crown number) after hashing or reversing it. Serial numbers can be consecutive, so the records of some banknotes would be stored together; the data then cannot be split effectively and access hot spots sometimes result, which is why the serial number has to be transformed before it is used as the rowkey
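
Both transformations mentioned above can be sketched as follows. The serial numbers are invented, and reversed_key and hashed_key (with its 2-byte MD5 prefix) are hypothetical helpers, not the scheme used by any real system.

```python
import hashlib

def reversed_key(serial: str) -> bytes:
    """Reverse the serial so the fast-changing last characters come first."""
    return serial[::-1].encode("ascii")

def hashed_key(serial: str) -> bytes:
    """Prefix the serial with 2 bytes of its MD5 digest to scatter neighbours."""
    prefix = hashlib.md5(serial.encode("ascii")).digest()[:2]
    return prefix + serial.encode("ascii")

serials = ["AB12345670", "AB12345671", "AB12345672"]  # made-up consecutive serials
reversed_keys = [reversed_key(s) for s in serials]
hashed_keys = [hashed_key(s) for s in serials]
```

A lookup by serial number recomputes the same key, so point queries still work; what is given up is the ability to scan consecutive serial numbers as one range.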
Requirements
Timely and reliable, fast retrieval and storage, and better scalability

Because input is collected from many devices, Flume + HBase can solve the problem
HBase is chosen because the application is very simple: it only needs simple queries, and HBase is enough for that
Cloudera's open source log collection system can serve as a reference

Summary

HBase often needs to be used together with other systems
Avoid access hot spots (in particular, avoid using time directly as the rowkey) by breaking up sequential numbers
