Rowkey design of HBase for big data performance tuning

Source: Internet
Author: User

1 Overview

HBase is a distributed, column-oriented database. Its biggest difference from a typical relational database is that it is well suited to storing unstructured data and that it organizes data by column rather than by row.

Because HBase is a key-value column store, the Rowkey is the key of each KeyValue and uniquely identifies a row. A Rowkey is a byte array with a maximum length of 64 KB, and its content is defined by the user. Data is stored sorted by Rowkey in ascending byte order.

HBase retrieves data by Rowkey: the system locates the region that holds a Rowkey (or a Rowkey range) and routes the query to that region. HBase supports three retrieval methods:

(1) Access by a single Rowkey: a get operation on one Rowkey value, which returns exactly one record;

(2) Scan by Rowkey range: set a start Rowkey and an end Rowkey and scan within that range, which returns a batch of records that satisfy the specified conditions;

(3) Full table scan: scan every row in the entire table directly.
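
The three access paths can be illustrated with a toy in-memory model. This is only a sketch: it is not the HBase client API, the sample keys and values are made up, and the stop key is treated as exclusive, which is how an HBase scan behaves by default.

```python
import bisect

# Toy stand-in for a table kept sorted by Rowkey (hypothetical sample data).
rows = sorted({
    b"row-001": "a",
    b"row-002": "b",
    b"row-010": "c",
    b"row-100": "d",
}.items())
keys = [k for k, _ in rows]

def get(rowkey):
    """Point lookup by a single Rowkey; returns None when the row is absent."""
    i = bisect.bisect_left(keys, rowkey)
    if i < len(keys) and keys[i] == rowkey:
        return rows[i][1]
    return None

def scan(start=None, stop=None):
    """Range scan over [start, stop); with no bounds it is a full table scan."""
    lo = 0 if start is None else bisect.bisect_left(keys, start)
    hi = len(keys) if stop is None else bisect.bisect_left(keys, stop)
    return rows[lo:hi]
```

Because the keys are sorted, both the point lookup and the range scan start with a binary search, while the full scan has to touch every row.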

Retrieval by a single Rowkey is very efficient: a get takes less than 1 millisecond, and 1,000~2,000 records can be fetched per second. Queries on non-key columns, however, are slow.

2 HBase Rowkey Design

2.1 Design Principles

2.1.1 Rowkey length principle

A Rowkey is a byte array. Many developers suggest a length of 10~100 bytes, but the recommendation here is the shorter the better, and no more than 16 bytes.

The reasons are as follows:

(1) The persistent data file, the HFile, is stored as KeyValues. If the Rowkey is too long, say 100 bytes, then for 10 million KeyValues the Rowkeys alone occupy 100 * 10 million = 1 billion bytes, nearly 1 GB of data, which greatly reduces HFile storage efficiency;

(2) The MemStore caches part of the data in memory. If the Rowkey is too long, the effective utilization of memory drops, the system can cache less data, and retrieval efficiency falls. So the shorter the Rowkey, the better;

(3) Current operating systems are mostly 64-bit, with memory aligned to 8 bytes. Keeping the Rowkey at 16 bytes, an integer multiple of 8 bytes, makes the best use of this characteristic of the operating system.
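
The arithmetic in reason (1) can be checked with a few lines. This is a back-of-the-envelope sketch that counts only the Rowkey bytes repeated in each KeyValue; the function name is hypothetical and other per-KeyValue overhead is ignored.

```python
def rowkey_bytes(rowkey_len, num_keyvalues):
    """Bytes spent on Rowkeys alone when every KeyValue repeats its Rowkey."""
    return rowkey_len * num_keyvalues

long_keys = rowkey_bytes(100, 10_000_000)   # 100-byte Rowkey
short_keys = rowkey_bytes(16, 10_000_000)   # 16-byte Rowkey
saved_fraction = 1 - short_keys / long_keys
```

With these numbers the 100-byte design spends 1,000,000,000 bytes on Rowkeys, while the 16-byte design spends 160,000,000 bytes.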

2.1.2 Rowkey hashing principle

If the Rowkey increases with a timestamp, do not put the time at the front of the key. Instead, use the high-order bytes of the Rowkey as a hash field generated cyclically by the program, and put the time field in the low-order bytes. This raises the probability that data is spread evenly across the RegionServers and that load is balanced. Without a hash field, a key that starts directly with the time information makes all new data pile up on one RegionServer and creates a hotspot; during retrieval the load then concentrates on individual RegionServers and query efficiency drops.
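
A minimal sketch of such a key, assuming a 2-byte hash field that cycles through a fixed number of buckets followed by an 8-byte big-endian timestamp. The bucket count, field widths, and function name are illustrative assumptions, not values taken from the article.

```python
import itertools
import struct
from collections import Counter

NUM_BUCKETS = 8  # assumed bucket count, e.g. one per pre-split region

_bucket_cycle = itertools.cycle(range(NUM_BUCKETS))

def salted_rowkey(timestamp_ms):
    """Prefix a cyclic 2-byte hash field to an 8-byte big-endian timestamp."""
    bucket = next(_bucket_cycle)
    return struct.pack(">HQ", bucket, timestamp_ms)

# 1000 consecutive timestamps end up spread evenly over the buckets.
keys = [salted_rowkey(1_346_544_000_000 + i) for i in range(1000)]
per_bucket = Counter(struct.unpack(">H", k[:2])[0] for k in keys)
```

The trade-off is on the read side: a query for one time range has to issue one range scan per bucket, because the rows of that range are no longer adjacent.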

2.1.3 Rowkey uniqueness principle

The design must guarantee that every Rowkey is unique.

2.2 Application Scenarios

Based on the three Rowkey principles above, different application scenarios call for different Rowkey designs.

2.2.1 Rowkey design for transactional data

Transactional data carries a time attribute, and it is recommended to store the time information in the Rowkey, which helps improve query speed. For transactional data the recommended default is to create one table per day, a design with many benefits. With one table per day, the date part can be dropped from the time information, keeping only the hour, minute, second and millisecond, which fits in 4 bytes. Adding a 2-byte hash field gives a unique Rowkey of 6 bytes in total, laid out as follows:

Transactional data Rowkey design

Bytes 0~1          Hash field
Bytes 2~5          Time field (msec)
Remaining bytes    Extended field

Such a design cannot save overhead at the level of operating system memory management, because 64-bit operating systems require 8-byte alignment. In persistent storage, however, the Rowkey portion shrinks by 25% (6 bytes instead of 8).

One might ask why the time field is not stored in host byte order so that it could double as the hash field. The reason is that data within a time range should stay as contiguous as possible: lookups for data in the same time range are very likely, and contiguity helps query retrieval, so a separate hash field works better. For some applications, all or part of the hash field can be used to store information from a data field, as long as the same hash value is guaranteed to be unique within the same time (millisecond).
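
A sketch of how such a 6-byte key could be packed. The field order (hash field first, then time) follows the hashing principle in 2.1.2 and is an assumption here, as are the function names; big-endian packing is used so that byte order matches numeric order within one hash value.

```python
import struct

MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000, which fits in 4 bytes

def millis_since_midnight(hour, minute, second, millis):
    """Time of day as a millisecond offset, valid for a one-table-per-day layout."""
    return ((hour * 60 + minute) * 60 + second) * 1000 + millis

def transaction_rowkey(hash_field, ms_of_day):
    """Pack a 2-byte hash field and a 4-byte millisecond-of-day into 6 bytes."""
    if not 0 <= hash_field <= 0xFFFF:
        raise ValueError("hash field must fit in 2 bytes")
    if not 0 <= ms_of_day < MS_PER_DAY:
        raise ValueError("time must be a millisecond offset within one day")
    return struct.pack(">HI", hash_field, ms_of_day)

key = transaction_rowkey(7, millis_since_midnight(13, 45, 30, 250))
```

Within one hash value, keys built this way sort by time, so rows of the same time range stay next to each other.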

2.2.2 Rowkey design for statistical data

Statistical data also carries a time attribute, and its smallest unit is the minute (pre-aggregating down to the second is meaningless). Statistical data is likewise stored in one table per day by default, and the benefits of this design need no repeating. With one table per day, the time information only needs the hour and minute, so a value in 0~1439 takes just two bytes. Because some dimensions of the statistics are very large, 4 bytes are needed for a sequence field, so the hash field doubles as the sequence field, and together they form a unique Rowkey of 6 bytes. As shown in the following:

Statistical data Rowkey design

Bytes 0~3          Hash field (sequence field)
Bytes 4~5          Time field (minutes)

Similarly, this design cannot save overhead at the level of operating system memory management, because 64-bit operating systems require 8-byte alignment, but the Rowkey portion in persistent storage shrinks by 25%. Pre-computed statistics may need to be recalculated repeatedly, so obsolete data must be deletable effectively without upsetting the balance of the hash; this requires special handling.
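
The statistical layout can be sketched the same way, assuming the 4-byte sequence (hash) field comes first and the 2-byte minute-of-day follows. The order and the function name are assumptions made for illustration.

```python
import struct

MINUTES_PER_DAY = 24 * 60  # valid minute values are 0~1439

def stats_rowkey(sequence, minute_of_day):
    """Pack a 4-byte sequence field and a 2-byte minute-of-day into 6 bytes."""
    if not 0 <= sequence <= 0xFFFFFFFF:
        raise ValueError("sequence field must fit in 4 bytes")
    if not 0 <= minute_of_day < MINUTES_PER_DAY:
        raise ValueError("minute of day must be in 0~1439")
    return struct.pack(">IH", sequence, minute_of_day)

key = stats_rowkey(123456, 23 * 60 + 59)  # last minute of the day
```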

2.2.3 Rowkey design for general data

General data uses an auto-increment sequence as the unique primary key, and the user can choose either one table per day or a single table. This mode requires the hash field (sequence field) to remain unique while multiple loading modules run at the same time. One option is to assign a different unique factor to each loading module. The design structure is as follows.

General data Rowkey design

Bytes 0~3          Hash field (sequence field), 0x00000000~0xFFFFFFFF
Remaining bytes    Extended field (kept within 12 bytes), can be made up of multiple user fields
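
One possible reading of "a different unique factor per loading module" is to reserve the top bits of the 4-byte sequence field for a loader ID and let each loader count in the remaining bits. The sketch below assumes an 8/24 bit split and a hypothetical class name; neither comes from the article.

```python
import itertools
import struct

LOADER_BITS = 8                   # assumed: up to 256 loading modules
COUNTER_BITS = 32 - LOADER_BITS   # remaining bits of the 4-byte sequence field

class LoaderSequence:
    """Hands out 4-byte sequence values that cannot collide across loaders."""

    def __init__(self, loader_id):
        if not 0 <= loader_id < (1 << LOADER_BITS):
            raise ValueError("loader_id out of range")
        self._prefix = loader_id << COUNTER_BITS
        self._counter = itertools.count()

    def next_rowkey(self, extended=b""):
        n = next(self._counter)
        if n >= (1 << COUNTER_BITS):
            raise OverflowError("per-loader counter exhausted")
        if len(extended) > 12:
            raise ValueError("extended field should stay within 12 bytes")
        return struct.pack(">I", self._prefix | n) + extended

loader_a, loader_b = LoaderSequence(1), LoaderSequence(2)
keys_a = [loader_a.next_rowkey() for _ in range(3)]
keys_b = [loader_b.next_rowkey(b"user-42") for _ in range(3)]
```

Each loader can then generate keys without coordinating with the others, at the cost of a smaller counter range per loader.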

2.2.4 Rowkey design to support multi-condition queries

HBase uses the scan method to obtain a batch of records that match specified conditions. The scan method has the following characteristics:

(1) A scan can be sped up with the setCaching and setBatch methods (trading space for time);

(2) A scan can be bounded with setStartRow and setStopRow. The smaller the range, the higher the performance.

A well-designed Rowkey lets us fetch the records of one collection together in bulk (they should sit in the same region) and get good performance when traversing the results.

(3) A scan can add filters through the setFilter method, which is the basis of paging and multi-condition queries.

Once the three principles of length, hashing and uniqueness are satisfied, we need to consider how to design the Rowkey to exploit the range feature of scan, so that fetching a batch of records becomes faster. The following example shows how to combine multiple columns into a single Rowkey and use the scan range to achieve faster queries.

Example:

We store file information in the table. Each file has 5 attributes: file ID (long, globally unique), creation time (long), file name (string), category name (string), and owner (user).

The query conditions we can enter are: a file creation time interval (such as files created from 20120901 to 20120914), file name ("China Good Sound"), category ("Entertainment"), and owner ("Zhejiang satellite TV").

Suppose we currently have the following files:

Id  CreateTime  Name                                   Category                   UserID
1   20120902    China Good Sound episode 1             Entertainment              1
2   20120904    China Good Sound episode 2             Entertainment              1
3   20120906    China Good Sound wildcard round        Entertainment              1
4   20120908    China Good Sound episode 3             Entertainment              1
5   20120910    China Good Sound episode 4             Entertainment              1
6   20120912    China Good Sound contestant interview  Entertainment Highlights   2
7   20120914    China Good Sound episode 5             Entertainment              1
8   20120916    China Good Sound recording highlights  Entertainment Highlights   2
9   20120918    Zhang Wei exclusive interview          Highlights                 3
10  20120920    JDB Herbal Tea advertisement           Entertainment Advertising  4

Here the UserID refers to a separate user table, which is not listed. We only need to know what each UserID means: 1 stands for Zhejiang satellite TV, 2 for the show's production crew, 3 for XX Weibo, and 4 for the sponsor.

When the query interface is invoked, the 5 conditions above are passed in together: find(20120901, 20120914, "China Good Sound", "Entertainment", "Zhejiang satellite TV"). We should then get records 1, 2, 3, 4, 5 and 7. Record 6 should not be selected because it does not belong to "Zhejiang satellite TV".

We can design the Rowkey as UserID + CreateTime + FileID, which satisfies the multi-condition query and keeps the query fast.

The following points need to be noted:

(1) In the Rowkey of every record, each field must be padded to the same fixed length. If we expect at most on the order of 100,000 users, the UserID should be uniformly padded to 6 digits, such as 000001, 000002 ...

(2) The globally unique FileID is appended at the end so that the Rowkey of each file is globally unique. This avoids two different file records overwriting each other when they share the same UserID and CreateTime.

Stored under this Rowkey, the file records above have the following structure in the HBase table:

RowKey (UserID 6 + CreateTime 8 + FileID 6) name category ....

00000120120902000001

00000120120904000002

00000120120906000003

00000120120908000004

00000120120910000005

00000120120914000007

00000220120912000006

00000220120916000008

00000320120918000009

00000420120920000010
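
The listing above can be reproduced with a short sketch. The function name is hypothetical; the field widths (6 + 8 + 6 digits) and the sample values are the ones given in the text.

```python
def file_rowkey(user_id, create_date, file_id):
    """Build UserID(6) + CreateTime(8, yyyymmdd) + FileID(6) as a fixed-width string."""
    key = f"{user_id:06d}{create_date:08d}{file_id:06d}"
    if len(key) != 20:
        raise ValueError("a field overflowed its fixed width")
    return key

# (file_id, create_date, user_id) taken from the sample table
files = [(1, 20120902, 1), (2, 20120904, 1), (3, 20120906, 1), (4, 20120908, 1),
         (5, 20120910, 1), (6, 20120912, 2), (7, 20120914, 1), (8, 20120916, 2),
         (9, 20120918, 3), (10, 20120920, 4)]

# Sorting the strings mimics the order in which the table stores the rows.
rowkeys = sorted(file_rowkey(u, t, f) for f, t, u in files)
```

The zero padding is what makes string order agree with numeric order: without it, a UserID of 10 would sort before a UserID of 2.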

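How do we use this table?

After creating a Scan object, we call setStartRow(00000120120901) and setStopRow(00000120120915). The stop row is exclusive, so the day after the last requested date is used; a stop row of 00000120120914 would leave out record 7, whose Rowkey 00000120120914000007 sorts after it.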

In this way, the scan reads only the data with userid=1 inside the specified time period, which filters the results by user and by time range. And because these records are stored contiguously, the performance is good.
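
The effect of the start and stop rows can be simulated on the sorted Rowkey listing. This is a toy model with hypothetical helper names, not the HBase client; it treats the stop row as exclusive, as an HBase scan does by default, and therefore derives the stop row from the day after the last requested date.

```python
import bisect
from datetime import datetime, timedelta

# Rowkeys from the listing above, already in sorted order.
rowkeys = [
    "00000120120902000001", "00000120120904000002", "00000120120906000003",
    "00000120120908000004", "00000120120910000005", "00000120120914000007",
    "00000220120912000006", "00000220120916000008", "00000320120918000009",
    "00000420120920000010",
]

def scan(start_row, stop_row):
    """Return the Rowkeys in [start_row, stop_row)."""
    lo = bisect.bisect_left(rowkeys, start_row)
    hi = bisect.bisect_left(rowkeys, stop_row)
    return rowkeys[lo:hi]

def bounds(user_id, first_day, last_day):
    """Start and stop rows for one user and an inclusive yyyymmdd date range."""
    day_after = datetime.strptime(last_day, "%Y%m%d") + timedelta(days=1)
    prefix = f"{user_id:06d}"
    return prefix + first_day, prefix + day_after.strftime("%Y%m%d")

start_row, stop_row = bounds(1, "20120901", "20120914")
hits = scan(start_row, stop_row)
```

With these bounds the scan returns the six rows of user 1 (files 1 to 5 and 7) and never touches the rows of the other users.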

Then use SingleColumnValueFilter (org.apache.hadoop.hbase.filter.SingleColumnValueFilter), four filters in total, to constrain the lower and upper bounds of name and the lower and upper bounds of category. Together they implement prefix matching on both the file name and the category name.

(Note: using SingleColumnValueFilter can hurt query performance; on truly massive data it consumes a lot of resources and takes a long time.)

If paging is needed, a PageFilter can also be added to limit the number of records returned.
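
The idea of expressing a prefix match as a lower and an upper bound can be sketched as follows. This is plain Python rather than the filter API; the helper names are hypothetical, the rows are abbreviated from the sample table, and the upper bound is built by incrementing the last character of the prefix, which assumes that character is not already the largest possible one.

```python
def prefix_bounds(prefix):
    """Lower (inclusive) and upper (exclusive) bounds bracketing all strings with this prefix."""
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, upper

def matches(value, prefix):
    """Equivalent to value.startswith(prefix), written as two comparisons."""
    lower, upper = prefix_bounds(prefix)
    return lower <= value < upper

rows = [
    {"id": 1, "name": "China Good Sound episode 1", "category": "Entertainment"},
    {"id": 6, "name": "China Good Sound contestant interview", "category": "Entertainment Highlights"},
    {"id": 9, "name": "Zhang Wei exclusive interview", "category": "Highlights"},
    {"id": 10, "name": "JDB Herbal Tea advertisement", "category": "Entertainment Advertising"},
]

# Two comparisons per column, four in total, mirroring the four filters.
selected = [r["id"] for r in rows
            if matches(r["name"], "China Good Sound")
            and matches(r["category"], "Entertainment")]
```

Record 6 passes both prefix checks here; in the full query it is excluded earlier, because its Rowkey lies outside the scan range of user 1.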

With this, we have completed the design of an HBase table structure that supports multi-condition queries with high performance.
