Rowkey design of HBase for big data performance tuning

Last Update:2015-07-20 Source: Internet

Author: User


1 Overview

HBase is a distributed, column-oriented database. Its biggest difference from a typical relational database is that HBase is well suited to storing unstructured data and organizes storage by column rather than by row.

Because HBase is a key-value column store, the Rowkey is the key of each KeyValue and uniquely identifies a row. A Rowkey is a binary byte stream with a maximum length of 64KB, and its content is defined entirely by the user. Data is stored sorted by the binary byte order of the Rowkey, from smallest to largest.

HBase retrieves data by Rowkey: the system locates the region that holds a given Rowkey (or Rowkey range) and routes the query to that region. HBase supports 3 retrieval methods:

(1) Access by a single Rowkey, that is, a get operation on one Rowkey value, which returns a single record;

(2) Scan over a Rowkey range, that is, set a start Rowkey and an end Rowkey and scan within that range, which returns a batch of records matching the specified conditions;

(3) Full table scan, that is, scan every row in the entire table.
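The three access patterns can be illustrated with a toy in-memory model. This is only a sketch: `SortedTable` and its methods are hypothetical names, not the HBase client API, and the model assumes nothing more than rows kept sorted by Rowkey bytes.

```python
import bisect


class SortedTable:
    """Toy stand-in for an HBase table: rows kept sorted by Rowkey bytes."""

    def __init__(self):
        self._keys = []   # sorted list of Rowkeys (bytes)
        self._rows = {}   # Rowkey -> row value

    def put(self, rowkey: bytes, value):
        if rowkey not in self._rows:
            bisect.insort(self._keys, rowkey)
        self._rows[rowkey] = value

    def get(self, rowkey: bytes):
        # (1) point lookup by a single Rowkey
        return self._rows.get(rowkey)

    def scan(self, start_row: bytes = b"", stop_row: bytes = None):
        # (2) range scan: start row inclusive, stop row exclusive
        # (3) full table scan when no bounds are given
        lo = bisect.bisect_left(self._keys, start_row)
        if stop_row is None:
            hi = len(self._keys)
        else:
            hi = bisect.bisect_left(self._keys, stop_row)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


table = SortedTable()
for key in (b"row3", b"row1", b"row2", b"row4"):
    table.put(key, key.decode().upper())

single = table.get(b"row2")             # one record
ranged = table.scan(b"row2", b"row4")   # rows row2 and row3
full = table.scan()                     # every row, in Rowkey order
```

Because the keys are sorted, a range scan only touches the slice between the two bounds, while a full scan has to walk everything.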

Retrieval by a single Rowkey is very efficient: a lookup takes less than 1 millisecond, and 1000~2000 records can be fetched per second. Queries on non-key columns, however, are slow.

2 HBase Rowkey Design

2.1 Design Principles

2.1.1 Rowkey length principle

A Rowkey is a binary byte stream. Many developers suggest a length of 10~100 bytes, but the recommendation here is the shorter the better, and no more than 16 bytes.

The reasons are as follows:

(1) The persistence file, HFile, stores data as KeyValues. If the Rowkey is too long, say 100 bytes, then for 10 million rows the Rowkeys alone occupy 100 * 10,000,000 = 1 billion bytes, nearly 1GB, which greatly reduces HFile storage efficiency;

(2) MemStore caches part of the data in memory. If the Rowkey is too long, the effective utilization of memory drops and the system can cache less data, which lowers retrieval efficiency. So the shorter the Rowkey, the better;

(3) Current operating systems are generally 64-bit, with memory aligned on 8 bytes. Keeping the Rowkey at 16 bytes, an integer multiple of 8 bytes, makes the best use of this characteristic.
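The arithmetic in point (1) can be checked with a few lines. This is a rough estimate only: the helper name and the `cells_per_row` parameter are illustrative assumptions, and the figure ignores compression and block encoding.

```python
def rowkey_bytes(rowkey_len: int, rows: int, cells_per_row: int = 1) -> int:
    """Bytes spent on Rowkeys alone; every stored KeyValue repeats the Rowkey."""
    return rowkey_len * rows * cells_per_row


ROWS = 10_000_000  # 10 million rows, as in the text

long_key = rowkey_bytes(100, ROWS)   # 100-byte Rowkey
short_key = rowkey_bytes(16, ROWS)   # 16-byte Rowkey

print(long_key)           # 1000000000
print(short_key)          # 160000000
print(long_key / 2**30)   # roughly 0.93 (GiB)
```

Since the Rowkey is repeated in each KeyValue, a row with several columns multiplies the cost again, which `cells_per_row` is meant to approximate.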

2.1.2 Rowkey hash principle

If the Rowkey increases with a timestamp, do not put the time at the front of the binary key. Instead, use the high-order bytes of the Rowkey as a hash field, generated cyclically by the program, and put the time field in the low-order bytes. This raises the probability that data is distributed evenly across RegionServers and that load is balanced. Without a hash field, the leading field is the time itself, so all new data piles up on one RegionServer and creates a hotspot; retrieval load then concentrates on individual RegionServers and query efficiency drops.
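A minimal sketch of this idea follows, assuming a 2-byte hash field produced round-robin in front of an 8-byte millisecond timestamp. The bucket count, field widths, and function name are assumptions made for illustration, not values from the article.

```python
import itertools
import struct
from collections import Counter

NUM_BUCKETS = 8  # assumption: number of distinct hash values in use

_hash_cycle = itertools.cycle(range(NUM_BUCKETS))


def hashed_rowkey(timestamp_ms: int) -> bytes:
    """High-order 2 bytes: cyclic hash field; low-order 8 bytes: timestamp."""
    hash_field = next(_hash_cycle)
    return struct.pack(">HQ", hash_field, timestamp_ms)


base = 1_437_350_400_000  # an arbitrary millisecond timestamp
keys = [hashed_rowkey(base + i) for i in range(1000)]

# Count how many keys fall under each leading hash value.
per_bucket = Counter(struct.unpack(">H", k[:2])[0] for k in keys)
print(per_bucket)  # 8 buckets with 125 keys each
```

Consecutive timestamps now land under different key prefixes, so writes are spread instead of piling up at the end of the key space. The cost is on the read side: a time-range query has to scan once per hash value.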

2.1.3 Rowkey uniqueness principle

The design must guarantee that every Rowkey is unique.

2.2 Application Scenarios

Based on the 3 Rowkey principles above, different application scenarios call for different Rowkey designs.

Transactional data carries a time attribute, and it is recommended to store the time information in the Rowkey, which helps improve query speed. For transactional data, the recommended default is to create one table per day, a design with many benefits. With per-day tables, the date can be dropped from the time information, leaving only the hour, minute, second, and millisecond, which fits in 4 bytes. Adding a 2-byte hash field gives a unique 6-byte Rowkey, as shown below:

Transactional data Rowkey design
Byte 0	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5	...
Hash field		Time field (msec)				Extended field

This design saves nothing at the operating system's memory management level, because 64-bit operating systems align memory on 8 bytes. In the persistent store, however, the Rowkey portion shrinks from 8 bytes to 6, a 25% saving. One might ask why the time field is not stored in host byte order, so that it could double as the hash field. The reason is that data from the same time range should stay as contiguous as possible: lookups over one time range are very likely, and contiguity helps query retrieval, so a separate hash field works better. Some applications can use all or part of the hash field to store other data fields, as long as the hash value is unique within the same time (millisecond).
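The 6-byte transactional Rowkey can be sketched as follows, assuming the 2-byte hash field comes first and the 4-byte time field holds milliseconds since midnight. The function names are hypothetical.

```python
import struct

MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000, which fits in 4 unsigned bytes


def transactional_rowkey(hash_field: int, ms_of_day: int) -> bytes:
    if not 0 <= hash_field <= 0xFFFF:
        raise ValueError("hash field must fit in 2 bytes")
    if not 0 <= ms_of_day < MS_PER_DAY:
        raise ValueError("time must be a millisecond offset within one day")
    # Big-endian packing keeps byte order equal to numeric order, so rows
    # with the same hash value stay sorted by time.
    return struct.pack(">HI", hash_field, ms_of_day)


def parse_transactional_rowkey(rowkey: bytes):
    return struct.unpack(">HI", rowkey)


# 13:45:30.250 expressed as milliseconds since midnight
ms = ((13 * 60 + 45) * 60 + 30) * 1000 + 250
key = transactional_rowkey(7, ms)

print(len(key))                          # 6
print(parse_transactional_rowkey(key))   # (7, 49530250)
print(1 - len(key) / 8)                  # 0.25
```

Big-endian order is what keeps a time range contiguous under one hash value. A little-endian (host order) time field would scatter neighbouring timestamps, which is the trade-off the paragraph above describes.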

Statistical data also carries a time attribute, and its smallest time unit is the minute (pre-aggregating down to the second is meaningless). Statistical data also uses per-day tables by default, for the same reasons as above. With per-day tables, only the hour and minute need to be kept, and a minute-of-day value of 0~1439 takes only two bytes. Because some statistical dimensions are very large, 4 bytes are needed for a sequence field, so the hash field doubles as the sequence field, and together the two fields form a unique 6-byte Rowkey, as shown below:

Statistics Rowkey design
Byte 0	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5
Hash field (sequence field)				Time field (minutes)	

Similarly, this design saves nothing at the operating system's memory management level, because 64-bit operating systems align memory on 8 bytes. In the persistent store, however, the Rowkey portion saves 25% of the overhead. Pre-computed statistics may need to be recalculated repeatedly, so special handling is required to ensure that obsolete data can be deleted effectively without disturbing the balance of the hash distribution.

General data uses an auto-increment sequence as the unique primary key, and the user can choose either per-day tables or a single table. This mode requires that the hash field (sequence field) stays unique when multiple load modules run at the same time. Consider assigning a different unique factor to each load module. The design structure is shown below.

General data Rowkey design
Byte 0	Byte 1	Byte 2	Byte 3	...
Hash field (sequence field)				Extended field (kept within 12 bytes)
0x00000000~0xFFFFFFFF				Can be made up of multiple user fields
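One possible way to give each load module its own unique factor is to reserve a few bits of the 4-byte sequence field for a module ID. The bit split, the class name, and the method name below are all assumptions for illustration; the article does not prescribe a scheme.

```python
import struct

MODULE_BITS = 8                   # assumption: at most 256 load modules
COUNTER_BITS = 32 - MODULE_BITS   # remaining bits of the 4-byte field


class SequenceAllocator:
    """Hypothetical per-module generator for the 4-byte sequence field."""

    def __init__(self, module_id: int):
        if not 0 <= module_id < (1 << MODULE_BITS):
            raise ValueError("module_id out of range")
        self._module_id = module_id
        self._counter = 0

    def next_rowkey(self, extended: bytes = b"") -> bytes:
        if len(extended) > 12:
            raise ValueError("extended field should stay within 12 bytes")
        if self._counter >= (1 << COUNTER_BITS):
            raise OverflowError("sequence space exhausted for this module")
        # The module ID occupies the top bits, so two modules can never
        # produce the same 4-byte value.
        seq = (self._module_id << COUNTER_BITS) | self._counter
        self._counter += 1
        return struct.pack(">I", seq) + extended


loader_a = SequenceAllocator(1)
loader_b = SequenceAllocator(2)
keys_a = [loader_a.next_rowkey() for _ in range(100)]
keys_b = [loader_b.next_rowkey() for _ in range(100)]
```

Each module only needs its own counter, so no coordination between loaders is required at run time. The sketch addresses uniqueness only and makes no claim about how evenly the resulting keys spread across regions.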

HBase uses the scan method to fetch a batch of records that match specified conditions. The scan method has the following characteristics:

(1) scan can improve speed through the setCaching and setBatch methods (trading space for time);

(2) scan can be bounded with setStartRow and setStopRow. The smaller the range, the higher the performance. A well-designed Rowkey lets us store the records of one collection together (they should sit in the same region), so that traversing the results performs well;

(3) scan can add filters through the setFilter method, which is the basis of paging and multi-condition queries.

Once the length, hash, and uniqueness principles are satisfied, we need to consider how to design the Rowkey to exploit the range feature of scan and speed up queries that fetch a batch of records. The following example shows how to combine multiple columns into one Rowkey and use a scan range to query faster.

Example:

We store file information in a table. Each file has 5 attributes: file ID (long, globally unique), creation time (long), file name (String), category name (String), and owner (User).

The query criteria we can enter are: a file creation time interval (such as files created from 20120901 to 20120914), file name ("China Good Sound"), category ("Entertainment"), and owner ("Zhejiang satellite TV").

Suppose we currently have the following files:

ID	CreateTime	Name	Category	UserID
1	20120902	China Good Sound 1st issue	Entertainment	1
2	20120904	China Good Sound 2nd issue	Entertainment	1
3	20120906	China Good Sound wildcard round	Entertainment	1
4	20120908	China Good Sound 3rd issue	Entertainment	1
5	20120910	China Good Sound 4th issue	Entertainment	1
6	20120912	China Good Sound contestant interview	Entertainment Highlights	2
7	20120914	China Good Sound 5th issue	Entertainment	1
8	20120916	China Good Sound recording trailer	Entertainment Highlights	2
9	20120918	Zhang Wei exclusive interview	Tidbits	3
10	20120920	JDB Herbal Tea Advertisement	Entertainment Advertising	4

Here the UserID should correspond to another user table, which is not listed here. We only need to know what each UserID means:

1 stands for Zhejiang satellite TV, 2 for the China Good Sound production crew, 3 for the XX Weibo account, and 4 for the sponsor. When the query interface is called, the 5 conditions above are passed in together: find(20120901, 20120914, "China Good Sound", "Entertainment", "Zhejiang satellite TV"). We should then get records 1, 2, 3, 4, 5, and 7. Record 6 does not belong to "Zhejiang satellite TV", so it should not be selected. We can design the Rowkey as UserID + CreateTime + FileID, which supports multi-condition queries and keeps queries fast.

The following points need to be noted:

(1) In the Rowkey of every record, each field must be padded to a fixed length. If we expect at most on the order of 100,000 users, the UserID should be zero-padded to 6 digits, such as 000001, 000002, ...

(2) The globally unique FileID is appended at the end so that the record for each file is globally unique. This avoids two different file records overwriting each other when they share the same UserID and CreateTime.

Stored under this Rowkey, the file records above have the following structure in the HBase table:

RowKey (UserID 6 + time 8 + FileID 6) name category ....

00000120120902000001

00000120120904000002

00000120120906000003

00000120120908000004

00000120120910000005

00000120120914000007

00000220120912000006

00000220120916000008

00000320120918000009

00000420120920000010
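The Rowkeys listed above can be reproduced with a small helper. This is a sketch under the padding widths given in the text (6 + 8 + 6 digits); the function name `file_rowkey` is hypothetical.

```python
def file_rowkey(user_id: int, create_date: str, file_id: int) -> str:
    """UserID (6 digits) + CreateTime (8 digits, yyyymmdd) + FileID (6 digits)."""
    if not 0 <= user_id <= 999_999 or not 0 <= file_id <= 999_999:
        raise ValueError("user_id and file_id must fit in 6 digits")
    if len(create_date) != 8 or not create_date.isdigit():
        raise ValueError("create_date must look like 20120902")
    return f"{user_id:06d}{create_date}{file_id:06d}"


# (FileID, CreateTime, UserID) taken from the example table
files = [
    (1, "20120902", 1), (2, "20120904", 1), (3, "20120906", 1),
    (4, "20120908", 1), (5, "20120910", 1), (6, "20120912", 2),
    (7, "20120914", 1), (8, "20120916", 2), (9, "20120918", 3),
    (10, "20120920", 4),
]

rowkeys = sorted(file_rowkey(uid, date, fid) for fid, date, uid in files)
for rk in rowkeys:
    print(rk)
```

Sorting the keys groups all files of one user together and orders them by creation date, which is why file 7 (user 1) appears before file 6 (user 2).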

How do we use this table?

After creating a Scan object, we call setStartRow(00000120120901) and setStopRow(00000120120915). The stop row is exclusive, so the day after the end date is used to keep records created on 20120914 in range.

This way, the scan reads only the data with UserID=1, limited to the specified time range, which filters the results by user and by time range. Because these records are stored contiguously, performance is good.

Then use SingleColumnValueFilter (org.apache.hadoop.hbase.filter.SingleColumnValueFilter), 4 in total, to constrain the upper and lower bounds of name and the upper and lower bounds of category. Together they implement prefix matching on both the file name and the category name.

(Note: using SingleColumnValueFilter affects query performance; on truly massive data sets it consumes a lot of resources and takes a long time.)

If paging is needed, a PageFilter can also be added to limit the number of records returned.
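The whole query can be simulated in memory to see how the pieces fit together. This sketch does not call the HBase client: a sorted key list stands in for the table, four bound comparisons stand in for the four SingleColumnValueFilter instances, and a simple cap stands in for PageFilter. The names `ROWS`, `prefix_bounds`, and `query` are hypothetical.

```python
import bisect
from datetime import datetime, timedelta

# Rowkey -> (name, category), following the example table
ROWS = {
    "00000120120902000001": ("China Good Sound 1st issue", "Entertainment"),
    "00000120120904000002": ("China Good Sound 2nd issue", "Entertainment"),
    "00000120120906000003": ("China Good Sound wildcard round", "Entertainment"),
    "00000120120908000004": ("China Good Sound 3rd issue", "Entertainment"),
    "00000120120910000005": ("China Good Sound 4th issue", "Entertainment"),
    "00000120120914000007": ("China Good Sound 5th issue", "Entertainment"),
    "00000220120912000006": ("China Good Sound contestant interview", "Entertainment Highlights"),
    "00000220120916000008": ("China Good Sound recording trailer", "Entertainment Highlights"),
    "00000320120918000009": ("Zhang Wei exclusive interview", "Tidbits"),
    "00000420120920000010": ("JDB Herbal Tea Advertisement", "Entertainment Advertising"),
}
SORTED_KEYS = sorted(ROWS)


def prefix_bounds(prefix):
    """Return (lower, upper) so that lower <= s < upper exactly when s starts with prefix."""
    return prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)


def query(user_id, start_date, end_date, name_prefix, category_prefix, page_size):
    # Range scan: start row inclusive, stop row exclusive (hence the next day).
    next_day = datetime.strptime(end_date, "%Y%m%d") + timedelta(days=1)
    start_row = f"{user_id:06d}{start_date}"
    stop_row = f"{user_id:06d}{next_day:%Y%m%d}"
    lo = bisect.bisect_left(SORTED_KEYS, start_row)
    hi = bisect.bisect_left(SORTED_KEYS, stop_row)

    # Two bounds per column, four comparisons in total.
    name_lo, name_hi = prefix_bounds(name_prefix)
    cat_lo, cat_hi = prefix_bounds(category_prefix)

    page = []
    for key in SORTED_KEYS[lo:hi]:
        name, category = ROWS[key]
        if name_lo <= name < name_hi and cat_lo <= category < cat_hi:
            page.append(int(key[-6:]))   # trailing 6 digits are the FileID
            if len(page) == page_size:   # cap on the number of returned rows
                break
    return page


all_ids = query(1, "20120901", "20120914", "China Good Sound", "Entertainment", 100)
first_page = query(1, "20120901", "20120914", "China Good Sound", "Entertainment", 3)
print(all_ids)      # [1, 2, 3, 4, 5, 7]
print(first_page)   # [1, 2, 3]
```

The range bounds do most of the work: only the six rows of user 1 are ever examined, and the column checks run on that small slice rather than on the whole table.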

With that, we have completed an HBase table design that supports multi-condition queries with high performance.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
