HBase Application Development Review and Summary Series, Part 2: Rowkey Design Specification

Source: Internet
Author: User

2. Rowkey Design Specification

2.1. Four Characteristics of a Rowkey

2.1.1 String type

Although row keys are stored in HBase as byte[] arrays, it is recommended to treat them as strings during system development to ensure versatility. If the Rowkey is defined as another type, such as long, its length may be limited by the data-type size defined by the compilation environment.

Common row key string forms are:

    • A pure numeric string, such as 9559820140512;
    • Numbers + special separator, such as 95598-20140512;
    • Numbers + English letters, such as city20140512;
    • Numbers + English letters + special separator, such as city_20140512.

2.1.2 Definite meaning

The main function of a Rowkey is to uniquely identify a data record, but uniqueness is not everything: a row key with a definite meaning is especially valuable for application development and data retrieval. For example, the numeric string 9559820140512 above actually means 95598 (the power grid customer service hotline) + 20140512 (the date).

Row keys are often composed of multiple values, and the order of those values affects the efficiency of data storage and retrieval. Designing a row key therefore requires a thorough understanding of, and some foresight about, future business application needs, so that the key can be retrieved as efficiently as possible.

2.1.3 Ordered

Rowkeys are stored in dictionary (lexicographic) order, so the design should make full use of this sorting: store data that is often read together in adjacent rows, and keep data that is likely to be accessed soon together.

For example, if the data most recently written to an HBase table is the most likely to be accessed, consider using the timestamp as part of the Rowkey. Because ordering is by dictionary order, you can use Long.MAX_VALUE - timestamp as the Rowkey, which guarantees that newly written data sorts first and can be hit quickly when read.
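A minimal sketch of the reverse-timestamp idea, assuming string row keys as recommended in 2.1.1. The class name ReverseTimestampKey and the 19-digit zero padding are illustrative assumptions, not part of the original design; the padding keeps the key fixed-length so that dictionary order matches numeric order.

```java
class ReverseTimestampKey {
    // Long.MAX_VALUE (9223372036854775807) has 19 decimal digits.
    static final int WIDTH = 19;

    /** Returns Long.MAX_VALUE - timestamp as a zero-padded, fixed-width decimal string. */
    public static String reverseKey(long timestampMillis) {
        if (timestampMillis < 0) {
            throw new IllegalArgumentException("timestamp must be non-negative");
        }
        long reversed = Long.MAX_VALUE - timestampMillis;
        return String.format("%0" + WIDTH + "d", reversed);
    }

    public static void main(String[] args) {
        // Two example epoch-millisecond timestamps one hour apart.
        String older = reverseKey(1399881600000L);
        String newer = reverseKey(1399885200000L);
        // The newer record has the smaller key, so it sorts first.
        System.out.println(newer.compareTo(older) < 0);
    }
}
```

With this layout, a scan from the start of the table meets the most recent records first.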

2.1.4 Fixed length

Row key ordering depends on fixed length. For example, the date-time strings 20140512080500 and 20140512083000 are in increasing order when both are written in 14-digit form, whatever the seconds value. If the trailing zeros are removed, 201405120805 is numerically greater than 20140512083 although it is the earlier time, so the ordering is no longer reliable. We therefore suggest that row keys always be designed with a fixed length.
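A small sketch of why fixed length matters, using the two date-time strings from the text. Once the trailing zeros are dropped, the numeric order and the dictionary order of the two strings no longer agree, whereas the 14-digit forms agree. The class and method names are hypothetical helpers introduced only for illustration.

```java
class FixedLengthKeys {
    /** Right-pads a truncated yyyyMMddHHmmss digit string with zeros to 14 digits. */
    public static String toFixedLength(String dateTime) {
        StringBuilder sb = new StringBuilder(dateTime);
        while (sb.length() < 14) {
            sb.append('0');
        }
        return sb.toString();
    }

    /** True when numeric order and dictionary order of two digit strings agree. */
    public static boolean ordersAgree(String a, String b) {
        int numeric = Long.compare(Long.parseLong(a), Long.parseLong(b));
        int lexical = a.compareTo(b);
        return Integer.signum(numeric) == Integer.signum(lexical);
    }

    public static void main(String[] args) {
        // Variable-length forms: the two orderings disagree.
        System.out.println(ordersAgree("201405120805", "20140512083"));
        // Fixed-length forms: the two orderings agree.
        System.out.println(ordersAgree(toFixedLength("201405120805"), toFixedLength("20140512083")));
    }
}
```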

2.2. Rowkey Design Principles

2.2.1 Rowkey length principle

A Rowkey is a binary stream. Many developers suggest a Rowkey length of 10~100 bytes, but our suggestion is that shorter is better, and no more than 16 bytes.

The reasons are as follows:

(1) The persistence file HFile stores data as KeyValue entries. If the Rowkey is too long, say 100 bytes, then for 10 million KeyValue entries the rowkeys alone occupy 100 × 10,000,000 = 1 billion bytes, nearly 1 GB, which greatly reduces HFile storage efficiency;

(2) The MemStore caches part of the data in memory. If the Rowkey field is too long, the effective utilization of memory drops and the system cannot cache as much data, which reduces retrieval efficiency. So the shorter the Rowkey, the better;

(3) Current operating systems are mostly 64-bit, with 8-byte memory alignment. Keeping the Rowkey at 16 bytes, an integer multiple of 8 bytes, makes the best use of this property of the operating system.
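The arithmetic in reason (1) can be checked with a short sketch. The class and method names are hypothetical; the calculation counts only the bytes taken by the rowkey itself, once per KeyValue entry, and ignores every other part of the stored entry.

```java
class RowkeyOverhead {
    /** Bytes used by rowkeys alone when each KeyValue entry repeats the rowkey. */
    public static long rowkeyBytes(int rowkeyLength, long keyValueCount) {
        return rowkeyLength * keyValueCount;
    }

    public static void main(String[] args) {
        long entries = 10_000_000L;
        // A 100-byte rowkey: 1,000,000,000 bytes, close to 1 GB.
        System.out.println(rowkeyBytes(100, entries));
        // A 16-byte rowkey: 160,000,000 bytes for the same number of entries.
        System.out.println(rowkeyBytes(16, entries));
    }
}
```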

2.2.2 Rowkey hash principle

If the Rowkey increases with the timestamp, do not put the time at the front of the binary key. Instead, use the high-order bytes of the Rowkey as a hash field, generated by the program in a loop, and put the time field in the low-order bytes. This raises the probability that data is evenly distributed across the RegionServers and that load is balanced. Without a hash field, a leading time field causes all new data to pile up on a single RegionServer as a hotspot, so that retrieval load concentrates on a few RegionServers and query efficiency drops.
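A minimal sketch of a hash field generated by a program loop, assuming a 2-byte hash field in the high-order bytes followed by an 8-byte big-endian timestamp. The class name SaltedKeyBuilder, the bucket count, and the 10-byte layout are illustrative assumptions rather than a layout prescribed by the text.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

class SaltedKeyBuilder {
    private final int buckets;
    private final AtomicInteger counter = new AtomicInteger();

    public SaltedKeyBuilder(int buckets) {
        if (buckets < 1 || buckets > 65536) {
            throw new IllegalArgumentException("buckets must fit in a 2-byte field");
        }
        this.buckets = buckets;
    }

    /** Returns [2-byte round-robin hash field][8-byte big-endian timestamp]. */
    public byte[] nextKey(long timestampMillis) {
        // floorMod keeps the value non-negative even after the counter wraps around.
        int hash = Math.floorMod(counter.getAndIncrement(), buckets);
        return ByteBuffer.allocate(10)
                .putShort((short) hash)
                .putLong(timestampMillis)
                .array();
    }
}
```

Consecutive writes then land in different key ranges, while rows that share a hash value stay ordered by time.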

2.2.3 Rowkey uniqueness principle

The design must guarantee that every Rowkey is unique.

2.3. Rowkey Application Scenarios

Based on the three Rowkey principles above, different application scenarios call for different Rowkey design recommendations.

2.3.1 Rowkey design for transactional data

Transactional data carries a time attribute, and it is recommended to store the time information in the Rowkey, which helps improve query and retrieval speed. For transactional data, the recommended default is to create one table per day, a design with many benefits. With tables split by day, the date part of the time information can be dropped, keeping only the hours, minutes, seconds, and milliseconds, which fits in 4 bytes. Adding a 2-byte hash field gives 6 bytes in total to form a unique Rowkey, laid out as follows:

Transactional data Rowkey design:

    • Bytes 0-1: hash field, 0~65535 (0x0000~0xFFFF);
    • Bytes 2-5: time field (ms), 0~86399999 (0x00000000~0x05265BFF);
    • Bytes 6 onward (...): extended field.

This design saves no overhead at the operating system memory management level, because 64-bit operating systems require 8-byte alignment. The Rowkey portion of the persistent store, however, saves 25% of the overhead (6 bytes instead of 8). One might ask why the time field is not stored in host byte order, so that it could double as the hash field. The reason is that data in a time range should stay as contiguous as possible: lookups within the same time range are very likely, and contiguity helps query retrieval, so a separate hash field works better. For some applications, all or part of the hash field can be used to store some data field information, as long as the same hash value is guaranteed to be unique within the same time value (millisecond).
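The 6-byte transactional layout described above (2-byte hash field, then 4-byte millisecond-of-day time field) can be sketched as follows. The class and method names are hypothetical, and big-endian byte order is assumed so that the time field sorts in increasing order within one hash value.

```java
import java.nio.ByteBuffer;

class TransactionRowkey {
    static final int MILLIS_PER_DAY = 86_400_000;

    /** Builds [2-byte hash field][4-byte millisecond of day], big-endian. */
    public static byte[] build(int hash, int millisOfDay) {
        if (hash < 0 || hash > 0xFFFF) {
            throw new IllegalArgumentException("hash must be in 0~65535");
        }
        if (millisOfDay < 0 || millisOfDay >= MILLIS_PER_DAY) {
            throw new IllegalArgumentException("time must be in 0~86399999");
        }
        return ByteBuffer.allocate(6)
                .putShort((short) hash)
                .putInt(millisOfDay)
                .array();
    }
}
```

For the largest values of both fields, the key bytes are FF FF 05 26 5B FF, matching the ranges given for the layout.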

2.3.2 Rowkey design for statistical data

Statistical data also carries a time attribute, and the smallest statistical unit is the minute (pre-computing statistics down to the second is meaningless). Statistical data is likewise split into one table per day by default, with the same benefits as above. With tables split by day, only the hour and minute need to be kept, so the range 0~1439 needs just two bytes for the time information. Because some statistical dimensions are very large, 4 bytes are needed for a sequence field, so the hash field doubles as the sequence field, again giving 6 bytes in total to form a unique Rowkey, laid out as follows:

Statistical data Rowkey design:

    • Bytes 0-3: hash field (sequence field), 0x00000000~0xFFFFFFFF;
    • Bytes 4-5: time field (minutes), 0~1439 (0x0000~0x059F);
    • Bytes 6 onward (...): extended field.

Similarly, this design saves no overhead at the operating system memory management level, because 64-bit operating systems require 8-byte alignment, but the Rowkey portion of the persistent store saves 25% of the overhead. Pre-computed statistics may need to be recalculated repeatedly, so special handling is required to ensure that obsolete data can be deleted effectively without upsetting the balance of the hash distribution.
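A matching sketch for the statistical layout (4-byte sequence field that doubles as the hash field, then a 2-byte minute-of-day time field). The names are hypothetical and big-endian byte order is assumed.

```java
import java.nio.ByteBuffer;

class StatisticsRowkey {
    /** Converts an hour and minute into a minute of day in 0~1439. */
    public static int minuteOfDay(int hour, int minute) {
        return hour * 60 + minute;
    }

    /** Builds [4-byte sequence field][2-byte minute of day], big-endian. */
    public static byte[] build(long sequence, int minuteOfDay) {
        if (sequence < 0 || sequence > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("sequence must fit in 4 bytes");
        }
        if (minuteOfDay < 0 || minuteOfDay > 1439) {
            throw new IllegalArgumentException("minute must be in 0~1439");
        }
        return ByteBuffer.allocate(6)
                .putInt((int) sequence)          // keeps the low 32 bits unchanged
                .putShort((short) minuteOfDay)
                .array();
    }
}
```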

2.3.3 Rowkey design for general data

General data uses an auto-increment sequence as its unique primary key, and the user can choose between per-day tables and a single table. This mode must guarantee that the hash field (sequence field) stays unique while multiple loading modules run at the same time; one option is to assign a distinct factor to each loading module. The design structure is as follows.

General data Rowkey design:

    • Bytes 0-3: hash field (sequence field), 0x00000000~0xFFFFFFFF;
    • Bytes 4 onward (...): extended field, kept within 12 bytes, can be made up of multiple user fields.
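One possible reading of "assigning a distinct factor to each loading module" is to give each module its own offset and let all modules step by the module count, so that no two modules ever issue the same sequence value. The sketch below illustrates that reading only; the class LoaderSequence and its parameters are assumptions, not the author's stated method.

```java
class LoaderSequence {
    private final int loaderId;
    private final int loaderCount;
    private long issued = 0;

    /** loaderId must be unique per loading module and lie in 0..loaderCount-1. */
    public LoaderSequence(int loaderId, int loaderCount) {
        if (loaderCount <= 0 || loaderId < 0 || loaderId >= loaderCount) {
            throw new IllegalArgumentException("invalid loader id or count");
        }
        this.loaderId = loaderId;
        this.loaderCount = loaderCount;
    }

    /** Returns loaderId, loaderId + loaderCount, loaderId + 2 * loaderCount, ... */
    public synchronized long next() {
        long value = loaderId + issued * loaderCount;
        if (value > 0xFFFFFFFFL) {
            throw new IllegalStateException("4-byte sequence field exhausted");
        }
        issued++;
        return value;
    }
}
```

With three modules, module 0 issues 0, 3, 6, ... and module 1 issues 1, 4, 7, ..., so the values never collide and writes are still spread across the key space.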

2.3.4 Rowkey design to support multi-condition queries

HBase uses the scan method to fetch a batch of records matching a specified condition. The scan method has the following characteristics:

(1) Scan can improve speed through the setCaching and setBatch methods (trading space for time);

(2) Scan can be bounded with setStartRow and setStopRow. The smaller the range, the higher the performance. A well-designed Rowkey lets us fetch the elements of a record collection together in bulk (they should be in the same region) and get good performance when traversing the results;

(3) Scan can add filters through the setFilter method, which is the basis of paging and multi-condition queries.

After satisfying the length, hash, and uniqueness principles, we need to consider how to design the Rowkey to take advantage of the range function of the scan method, so that fetching a batch of records becomes faster.
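To illustrate how a Rowkey layout maps onto a scan range, the sketch below builds start and stop keys for one hash value and one time window, assuming the 6-byte transactional layout from 2.3.1. It uses plain Java only; in an HBase client the two arrays would be handed to the scan's start-row and stop-row setters. The class ScanRange and its helpers are hypothetical, and the comparison mirrors the unsigned byte-by-byte order HBase uses for row keys.

```java
import java.nio.ByteBuffer;

class ScanRange {
    /** Builds [2-byte hash field][4-byte millisecond of day], big-endian. */
    public static byte[] key(int hash, int millisOfDay) {
        return ByteBuffer.allocate(6).putShort((short) hash).putInt(millisOfDay).array();
    }

    /** Unsigned lexicographic comparison of two row keys. */
    public static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** True when row lies in [start, stop): start inclusive, stop exclusive. */
    public static boolean inRange(byte[] row, byte[] start, byte[] stop) {
        return compare(row, start) >= 0 && compare(row, stop) < 0;
    }

    public static void main(String[] args) {
        byte[] start = key(7, 8 * 3_600_000);   // hash 7, 08:00:00.000
        byte[] stop = key(7, 9 * 3_600_000);    // hash 7, 09:00:00.000
        System.out.println(inRange(key(7, 8 * 3_600_000 + 1), start, stop));
    }
}
```

Because the hash field leads the key, a time-window query needs one such range per hash value in use; that is the cost of spreading writes across RegionServers.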

Shangbing

Unit: Henan Electric Power Research Institute, Intelligent Grid

qq:52190634

Home page: http://www.cnblogs.com/shangbingbing

Space: http://shangbingbing.qzone.qq.com
