Analysis of HBase data model and basic table design


While studying HBase recently, I carefully read a blog post recommended by the official documentation. This article translates and summarizes that post as a way to work through the HBase data model and basic table design ideas together.

Original address of the officially recommended post: http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf


HBase is an open-source, scalable, distributed NoSQL database for massive data storage. It is modeled on Google's Bigtable data model and built on top of HDFS, the storage system of Hadoop. Unlike relational databases such as MySQL and Oracle, the HBase data model gives up some relational features in exchange for great scalability and a flexible table structure.

To a certain extent, HBase can also be viewed as a sorted map keyed by row key, column qualifier, and timestamp; this map is sparse, distributed, persistent, and multidimensional.

Introduction to the HBase data model
The HBase data model is also made up of tables, and each table has rows and columns, but rows and columns in HBase differ somewhat from those in a relational database. The main terms of the HBase data model are introduced below:

Table: HBase organizes data into tables. Note that the table name must be valid for use in a file path, because an HBase table is mapped to files on HDFS.

Row: In a table, each row represents one data object and is uniquely identified by a row key. The row key has no specific data type and is stored as binary bytes.

Column family: Column families must be defined in advance when the HBase table is created, and every column in the table belongs to a column family. Once a column family is defined, it cannot be changed easily, because it affects the physical storage layout of HBase. The column qualifiers within a column family and their values, however, can be added and removed dynamically. Every row in a table has the same column families, but rows do not need to have the same column qualifiers and values within a column family, so the table is sparse, which avoids some data redundancy. For example, in {row1, userinfo:telephone -> 137xxxxx869} and {row2, userinfo:fax -> 0898-66xxxx}, row 1 and row 2 share the column family userinfo, but in row 1 the column family contains only the column qualifier for the mobile number, while in row 2 it contains only the column qualifier for the fax number.

Column qualifier: Data within a column family is addressed by column qualifier. There is no need to hold rigidly to the notion of a "column"; the data can also be understood as key-value pairs in which the column qualifier is the key. The column qualifier has no specific data type and is stored as binary bytes.

Cell: A row key, a column family, and a column qualifier together identify a cell. The data stored in a cell is called the cell value; it also has no specific data type and is stored as binary bytes.

Timestamp: By default, each cell value is written with a timestamp that serves as its version identifier. When a cell is read without specifying a timestamp, the most recent version is returned; when a new cell value is written without a timestamp, the current time is used. The number of versions kept is maintained separately for each column family. The source article gives the default as 3 versions; since HBase 0.96 the default is 1.


Image source: http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf

Sometimes it also helps to think of HBase as a multidimensional map. As shown in the following figure, a row key maps to a set of column families, each column family maps to a set of column qualifiers, and each column qualifier maps to a set of timestamps, each of which maps to one version of the value. Because the most recent version is returned by default, the innermost level can also be viewed simply as a mapping from column qualifier to value. Users can retrieve multiple versions of a cell through the HBase API. The row key plays the role that the primary key plays in a relational database; it is a built-in part of every table, and the user cannot designate a column as the row key.
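The nested-map view can be sketched with ordinary Python dictionaries. This is only an in-memory illustration, not the HBase client API; the sample rows, the timestamps, and the helper names read_cell and read_versions are assumptions made for the example.

```python
# Hypothetical sample data laid out as nested maps:
# row key -> column family -> column qualifier -> {timestamp: value}
table = {
    b"row1": {
        b"userinfo": {
            b"telephone": {1002: b"137-new", 1001: b"137-old"},
        },
    },
    b"row2": {
        b"userinfo": {
            b"fax": {1001: b"0898-66"},
        },
    },
}


def read_cell(table, row, family, qualifier, timestamp=None):
    """Return one version of a cell; the newest one when no timestamp is given."""
    versions = table[row][family][qualifier]
    if timestamp is None:
        timestamp = max(versions)  # the largest timestamp is the newest version
    return versions[timestamp]


def read_versions(table, row, family, qualifier, limit):
    """Return up to `limit` versions of a cell, newest first."""
    versions = table[row][family][qualifier]
    newest_first = sorted(versions, reverse=True)
    return [versions[ts] for ts in newest_first[:limit]]


print(read_cell(table, b"row1", b"userinfo", b"telephone"))
print(read_versions(table, b"row1", b"userinfo", b"telephone", limit=2))
```

Both rows share the column family userinfo but hold different qualifiers, which is the sparseness described earlier.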



Image source: http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf


Sometimes you can also think of HBase as a key-value store similar to Redis. As shown in the following figure, when you query all the data in a row, the row key is the key and the value is everything stored in that row (the column families, the column qualifiers within them, and the timestamped versions of each value). When you query a single cell in a given row, the key narrows to the row key, column family, column qualifier, and timestamp. At the level of the underlying storage, HBase reads a whole block of data that may contain other cells in addition to the requested one, because the block also holds other columns stored under the same row key, each of which is a separate cell. This is how the HBase API actually works underneath.
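The key-value view can be sketched by flattening a nested table into a sorted list of cells, each keyed by (row key, column family, column qualifier, timestamp). The function name flatten and the sample data are assumptions for illustration; the sort order (ascending by row, family, and qualifier, newest timestamp first) mirrors how HBase orders cells, which is why neighbouring cells of the same row end up next to each other.

```python
def flatten(table):
    """Turn nested maps into a sorted list of ((row, family, qualifier, ts), value)."""
    cells = []
    for row, families in table.items():
        for family, qualifiers in families.items():
            for qualifier, versions in qualifiers.items():
                for ts, value in versions.items():
                    cells.append(((row, family, qualifier, ts), value))
    # Ascending by row, family, qualifier; newest timestamp first within a cell.
    cells.sort(key=lambda cell: (cell[0][0], cell[0][1], cell[0][2], -cell[0][3]))
    return cells


# Hypothetical sample row with two columns and two versions of one of them.
sample = {
    b"row1": {
        b"f": {
            b"telephone": {1001: b"old", 1002: b"new"},
            b"fax": {1001: b"0898"},
        },
    },
}

for key, value in flatten(sample):
    print(key, value)
```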


Image source: http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf

HBase provides a rich API for manipulating data. The three main operations are Put, Get, and Scan. Put and Get operate on a single specified row, so a row key must be supplied. Scan operates on a range of rows, defined by a start row key and an end row key; if neither is specified, all rows are returned.
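The semantics of the three operations can be imitated with a small in-memory class. This is a sketch only: TinyTable and its methods are hypothetical names, not the HBase client API. It assumes, as in HBase, that rows are kept sorted by row key and that a scan includes the start row and excludes the stop row.

```python
import bisect


class TinyTable:
    """In-memory stand-in for one table: rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []  # sorted list of row keys (bytes)
        self._rows = {}  # row key -> {column: value}

    def put(self, row_key, column, value):
        """Write one column of one row, creating the row if needed."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Return a copy of one row, or an empty dict if it does not exist."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start_row=None, stop_row=None):
        """Return rows with start_row <= key < stop_row; all rows if both are None."""
        lo = 0 if start_row is None else bisect.bisect_left(self._keys, start_row)
        hi = len(self._keys) if stop_row is None else bisect.bisect_left(self._keys, stop_row)
        return [(key, dict(self._rows[key])) for key in self._keys[lo:hi]]


users = TinyTable()
for name in (b"row3", b"row1", b"row2"):
    users.put(name, b"f:name", name.upper())

print(users.get(b"row2"))
print([key for key, _ in users.scan(b"row1", b"row3")])
```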


Points to consider when designing HBase tables
The following questions need to be considered when you start designing tables in HBase:
1. How should the row key be structured, and what information should it contain? (This is important, as the example below illustrates.)
2. How many column families should the table have?
3. What data should go into each column family?
4. How many columns should each column family hold?
5. What should the columns be named? These names are needed when calling the API.
6. What information should be stored in a cell?
7. How many versions should be kept for each cell?
The most important part of HBase table design is defining the row key structure, and to do that we must consider the table's access patterns, that is, the read and write scenarios that will occur in the real application. We should also keep some properties of HBase in mind when designing tables:
1. Indexing in HBase is done on the row key only.
2. Rows in a table are sorted lexicographically by row key, and each region of the table is delimited by a start row key and an end row key.
3. Everything stored in an HBase table is binary bytes with no data type.
4. Atomicity is guaranteed only within a single row; HBase has no multi-row transactions.
5. Column families must be defined up front, when the table is created.
6. Column qualifiers within a column family can be added dynamically after the table is created.
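Lexicographic ordering of byte strings has a consequence that is easy to overlook: numeric suffixes do not sort numerically. The short sketch below shows this with Python bytes, which compare byte by byte in the same way. The row keys are made-up examples, and zero-padding is shown only as one common workaround, not something the source article prescribes.

```python
# Made-up row keys with numeric suffixes of different lengths.
row_keys = [b"user2", b"user10", b"user1"]

# Byte-wise comparison puts b"user10" before b"user2" because b"1" < b"2".
print(sorted(row_keys))

# Zero-padding the number to a fixed width restores numeric order.
padded = [b"user%05d" % n for n in (2, 10, 1)]
print(sorted(padded))
```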

Next, consider a scenario: we want to design a table that stores the follow relationships between users on a microblogging service. Before designing the table, we need to consider the read and write scenarios of the business.

Read scenarios to consider:
1. Whom does each user follow?
2. Does user A follow user B?
3. Who follows user A?

Write scenarios to consider:
1. A user follows another user.
2. A user unfollows another user.

Let's look at the design of several table structures:

In the first design, each row represents a user together with all the users he or she follows. The user ID is the row key, each column qualifier is the sequence number of a followed user within the column family, and the cell value is the ID of that followed user. With this design, "whom does each user follow" is easy to answer, but "does user A follow user B" often requires traversing all the cell values in the row to look for user B, which is expensive. Adding a newly followed user is also costly: because you do not know which sequence number to assign, you must iterate over all the columns in the row to find the last one and then use its number plus 1 as the qualifier for the new entry.
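The costs of this first design can be seen in a small sketch that models one row as a dictionary from sequence number to followed user ID. The data and the function names are hypothetical; the point is only that both the membership check and the insert have to walk the whole row.

```python
# Design 1 sketch: row key = follower ID,
# column qualifier = sequence number, cell value = followed user ID.
follows = {b"jame": {1: b"emma", 2: b"bob"}}


def is_following(table, follower, followed):
    """Check every cell value in the row: linear in the number of followed users."""
    return followed in table.get(follower, {}).values()


def add_follow(table, follower, followed):
    """Find the last sequence number by walking all columns, then append."""
    row = table.setdefault(follower, {})
    next_number = max(row, default=0) + 1
    row[next_number] = followed
    return next_number


print(is_following(follows, b"jame", b"bob"))
print(add_follow(follows, b"jame", b"carl"))
```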


Image source: http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf

This leads to the second design, shown in the following figure, which adds a counter that records the total number of columns in the column family. When a user follows someone new, the new entry is numbered counter+1. However, unfollowing a user still requires traversing all the columns, and the biggest problem is that HBase does not support transactions, so the logic that updates the counter and adds the followed user has to be implemented in the client.


Image source: http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf

Recall that column qualifiers are stored as binary bytes, so a qualifier can hold any data, and qualifiers can be added dynamically. Based on this, we improve the design as shown in the following figure: the ID of the followed user is used as the column qualifier, and the cell value can be anything, for example a uniform 1. With this design, following and unfollowing are both easy. But for the read scenario "who follows user A", because HBase indexes only the row key, the whole table has to be scanned to find all of user A's followers, so this design still has a performance problem. This suggests that followed users also need to be indexed in some way.
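A sketch of this third design, again with plain dictionaries and hypothetical function names, shows why the writes and the "does A follow B" check become direct lookups while "who follows A" still has to visit every row.

```python
# Design 3 sketch: row key = follower ID,
# column qualifier = followed user ID, cell value = 1 (a placeholder).
def follow(table, follower, followed):
    table.setdefault(follower, {})[followed] = 1


def unfollow(table, follower, followed):
    table.get(follower, {}).pop(followed, None)


def is_following(table, follower, followed):
    """Direct qualifier lookup; no traversal of the row is needed."""
    return followed in table.get(follower, {})


def followers_of(table, user):
    """Only the row key is indexed, so this walks every row in the table."""
    return sorted(follower for follower, row in table.items() if user in row)


graph = {}
follow(graph, b"jame", b"emma")
follow(graph, b"bob", b"emma")
print(is_following(graph, b"jame", b"emma"))
print(followers_of(graph, b"emma"))
```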


Image source: http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf

There are three ways to optimize the design above. The first is to create a second table that stores each user together with all the users who follow him or her. The second is to keep both kinds of information in the same table and distinguish them by row key: for example, the row with row key Jame_001_following stores everyone Jame follows, and the row with row key Jame_001_followed stores everyone who follows Jame. The last option, shown in the following figure, is to design the row key in the form "followerID+followedID", for example "Jame+Emma", which means that Jame follows Emma. (Strictly this should be Jame's ID plus Emma's ID; the names are used directly only for ease of explanation.) Such a row key carries both the follower and the followed user.

Note also that the column family name is just the single letter f. A short name reduces the I/O load on HBase and the number of bytes returned to the client, which improves response time, because every KeyValue object returned to the client includes the column family name. In addition, the name of the followed user is stored in the table as the column qualifier, which saves a lookup in the user table to find the user name. With this design, operations such as "user A follows user B" and "does user A follow user B" become simple and efficient.
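The composite row key design can be sketched as a sorted list of keys of the form follower+followed. The class FollowTable and its methods are hypothetical names for an in-memory illustration, not HBase code. Following, unfollowing, and checking a relationship each touch exactly one row key, and listing whom a user follows becomes a short scan over keys that share the follower prefix.

```python
import bisect


class FollowTable:
    """Tall-table sketch: one row per (follower, followed) pair, sorted by row key."""

    def __init__(self):
        self._keys = []  # sorted row keys of the form follower + b"+" + followed

    @staticmethod
    def row_key(follower, followed):
        return follower + b"+" + followed

    def _find(self, key):
        """Return (index, found) for a row key using binary search."""
        i = bisect.bisect_left(self._keys, key)
        return i, i < len(self._keys) and self._keys[i] == key

    def follow(self, follower, followed):
        key = self.row_key(follower, followed)
        i, found = self._find(key)
        if not found:
            self._keys.insert(i, key)

    def unfollow(self, follower, followed):
        i, found = self._find(self.row_key(follower, followed))
        if found:
            del self._keys[i]

    def is_following(self, follower, followed):
        return self._find(self.row_key(follower, followed))[1]

    def following(self, follower):
        """Prefix scan: start at the first key >= prefix, stop when the prefix ends."""
        prefix = follower + b"+"
        start = bisect.bisect_left(self._keys, prefix)
        result = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            result.append(key[len(prefix):])
        return result


table = FollowTable()
table.follow(b"Jame", b"Emma")
table.follow(b"Jame", b"Bob")
print(table.is_following(b"Jame", b"Emma"))
print(table.following(b"Jame"))
```

The separator keeps the scan for one follower from picking up rows of another user whose ID merely starts with the same characters.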



Image source: http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf

One more point: in a real production environment, row keys are often also hashed with MD5 (MD5 is a hash function rather than encryption). This makes every row key the same length, which can improve data access performance. This optimization is not covered in detail in this article.
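As a rough sketch of what such hashing could look like for the "followerID+followedID" row key, the example below hashes each ID separately with Python's hashlib and concatenates the two digests. Hashing the two parts separately is an assumption made for this illustration (it keeps all rows of one follower under a common 16-byte prefix); the function name and the sample IDs are hypothetical.

```python
import hashlib


def hashed_row_key(follower_id: bytes, followed_id: bytes) -> bytes:
    """Concatenate the MD5 digests of both IDs: always 16 + 16 = 32 bytes."""
    return hashlib.md5(follower_id).digest() + hashlib.md5(followed_id).digest()


short_key = hashed_row_key(b"Jame", b"Emma")
long_key = hashed_row_key(b"a-much-longer-user-id", b"Emma")

# Both keys have the same length regardless of the length of the original IDs.
print(len(short_key), len(long_key))

# All rows of one follower still share that follower's 16-byte digest as a prefix.
print(short_key.startswith(hashlib.md5(b"Jame").digest()))
```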


Image source: http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf

Summary:
This article has outlined the HBase data model and basic table design ideas. Some key points about HBase:
1. The row key is the most important part of HBase table design; it directly affects how efficiently the application interacts with HBase and how well data storage performs.
2. HBase table structure is more flexible than that of a traditional relational database; you can store any binary data in a table, regardless of data type.
3. All data in the same column family shares the same access pattern.
4. Indexing is done primarily through the row key.
5. A tall table design (many rows, as in the last design above) makes reads fast and simple but gives up some atomicity, while a wide table design (many columns per column family, as in the first design above) retains row-level atomicity.
6. HBase does not support transactions, so try to get the result in a single API request.
7. Hashing the row key yields fixed-length row keys and distributes data more evenly instead of concentrating it on one server, at the cost of some of the benefits that sorted data brings to reads.
8. You can store data in the column qualifier itself.
9. The lengths of column qualifier names and column family names affect both read/write I/O performance and the amount of data sent to the client, so keep these names short.

References
[1] Amandeep Khurana, Introduction to HBase Schema Design: http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf
