Characteristics and Table Design of the HBase Data Model

HBase is an open-source, scalable, distributed NoSQL database for massive data storage. It is modeled on Google's Bigtable data model and built on HDFS, the storage layer of Hadoop. It differs significantly from relational databases such as MySQL and Oracle: the HBase data model gives up some relational features in exchange for great scalability and a flexible table structure.

To a certain extent, HBase can also be regarded as a database built on an ordered map data structure keyed by row key, column qualifier, and timestamp. This map is sparse, distributed, persistent, and multidimensional.

Introduction to the HBase Data Model
HBase's data model is also made up of tables, each with rows and columns, but rows and columns in HBase differ somewhat from their relational counterparts. The main terms of the HBase data model are:
Table: HBase organizes data into tables. The table name must be valid as part of a file path, because an HBase table is mapped to files on HDFS.
Row: Within a table, each row represents one data object and is uniquely identified by its row key. The row key has no specific data type; it is stored as binary bytes.
Column Family: Column families must be defined in advance when the table is defined, and every column in the table belongs to a column family. Once set, a column family cannot easily be changed, because it affects HBase's physical storage layout. The column qualifiers within a column family, and their values, can however be added and deleted dynamically. Every row in a table has the same column families, but rows need not have the same column qualifiers or values within a family, so the table is sparse, which avoids data redundancy to some extent. For example, in {row1, userinfo:telephone -> 137xxxxx869} and {row2, userinfo:fax -> 0898-66xxxx}, rows 1 and 2 share the column family userinfo, but row 1 has only the qualifier for the mobile number and row 2 has only the qualifier for the fax number.
Column Qualifier: Data within a column family is addressed by column qualifier. You need not cling to the notion of a "column"; it can equally be understood as a key-value pair in which the column qualifier is the key. Like the row key, the column qualifier has no specific data type and is stored as binary bytes.
Cell: A row key, column family, and column qualifier together identify a cell. The data stored in a cell is called the cell value; it too has no specific data type and is stored as binary bytes.
Timestamp: By default, each cell value is written with a timestamp that identifies its version. On read, if no timestamp is specified, the most recent version is returned; on write, if no timestamp is set, the current time is used. The number of versions retained is maintained separately for each column family. Older HBase releases kept 3 versions by default; since HBase 0.96 the default is 1.

It can help to think of HBase as a multidimensional map. As the accompanying illustration shows, a row key maps to a set of column families, each column family maps to a set of column qualifiers, and each column qualifier maps to a set of timestamps, each of which maps to a different version of the value. By default only the value with the most recent timestamp is returned, so the structure can also be viewed simply as a mapping from column qualifier to value. Users can also fetch multiple versions of a cell at once through the HBase API. The row key plays the role of a relational primary key in HBase. Its structure is decided when the table is designed, and the user cannot designate an ordinary column as the row key.
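The nested-map view can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions, not the HBase client API: the names table, put, and get are hypothetical, and the sample values are made up.

```python
import time

# table[row_key][column_family][qualifier][timestamp] -> value (all bytes)
table = {}

def put(row, family, qualifier, value, ts=None):
    """Write one cell version; use the current time if no timestamp is given."""
    if ts is None:
        ts = int(time.time() * 1000)
    versions = table.setdefault(row, {}).setdefault(family, {}).setdefault(qualifier, {})
    versions[ts] = value

def get(row, family, qualifier, max_versions=1):
    """Return up to max_versions (timestamp, value) pairs, newest first."""
    versions = table.get(row, {}).get(family, {}).get(qualifier, {})
    newest_first = sorted(versions.items(), reverse=True)
    return newest_first[:max_versions]

put(b"row1", b"userinfo", b"telephone", b"137-old", ts=100)
put(b"row1", b"userinfo", b"telephone", b"137-new", ts=200)

latest = get(b"row1", b"userinfo", b"telephone")                  # newest version only
history = get(b"row1", b"userinfo", b"telephone", max_versions=3)  # several versions
```

Asking for one version returns only the newest value, while a larger max_versions exposes the timestamp dimension of the map.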

HBase can also be viewed as a key-value database, much like Redis. As the accompanying diagram shows, when you query all the data for a row, the row key acts as the key and the value is everything stored under it: the column families, the column qualifiers within them, and the timestamped versions of each cell. At the storage level, when a user queries a specific row, HBase reads a whole block of data. Besides the requested cell, that block may contain other cells for the same row key, from other column families or other columns, so those cells may be fetched as well. This view also matches how the HBase API behaves in practice.


HBase provides a rich API for manipulating this data. The three main operations are Put, Get, and Scan. Put and Get operate on a single specified row, so they require a row key. Scan operates on a range of rows, bounded by a start row key and an end row key; if neither is specified, it returns all rows in the table.
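To make the three operations concrete, here is a minimal in-memory sketch. It is not the HBase client API: the class TinyTable and its method signatures are hypothetical, and it assumes an inclusive start row and an exclusive stop row, which mirrors HBase's default Scan behavior.

```python
import bisect

class TinyTable:
    """In-memory stand-in for one table: rows kept sorted by row key."""

    def __init__(self):
        self._keys = []   # sorted list of row keys (bytes)
        self._rows = {}   # row key -> {b"family:qualifier": value}

    def put(self, row, column, value):
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        self._rows[row][column] = value

    def get(self, row):
        return dict(self._rows.get(row, {}))

    def scan(self, start_row=None, stop_row=None):
        """Yield (row, cells) in key order; start inclusive, stop exclusive.
        With no bounds, every row is returned."""
        lo = 0 if start_row is None else bisect.bisect_left(self._keys, start_row)
        hi = len(self._keys) if stop_row is None else bisect.bisect_left(self._keys, stop_row)
        for key in self._keys[lo:hi]:
            yield key, dict(self._rows[key])

t = TinyTable()
t.put(b"row3", b"userinfo:telephone", b"137xxxxx869")
t.put(b"row1", b"userinfo:telephone", b"138xxxxx000")
t.put(b"row2", b"userinfo:fax", b"0898-66xxxx")

all_rows = [key for key, _ in t.scan()]                  # unbounded scan
some_rows = [key for key, _ in t.scan(b"row2", b"row3")]  # bounded scan
```

Because the rows are kept in key order, a bounded scan only touches the slice of keys between the two bounds.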

Considerations in HBase Table Design
Here are a few questions to consider when you start designing a table in HBase:
1. How should the row key be structured, and what information should it contain? (This is important; the example below explains why.)
2. How many column families should the table have?
3. What data should be stored in each column family?
4. How many columns should each column family hold?
5. What should the columns be named? These names are needed when calling the API.
6. What information should be stored in the cells?
7. How many versions should be stored for each cell?
In HBase table design, the most important task is defining the structure of the row key. To do that, you have to consider the table's access patterns, that is, the read and write scenarios the table will see in the real application. You should also keep the following characteristics of HBase in mind when designing a table:
1. An HBase table is indexed only by row key.
2. Rows in a table are sorted by row key in lexicographic (dictionary) order, and each region of the table is delimited by a start row key and an end row key.
3. All data stored in an HBase table is binary bytes, with no data type.
4. Atomicity is guaranteed only within a single row; HBase does not provide multi-row transactions.
5. Column families must be defined up front, when the table is created.
6. Column qualifiers within a column family can be added dynamically as data is inserted, after the table has been created.
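Because row keys are plain bytes sorted in dictionary order, numeric parts of a key do not sort numerically. The short sketch below illustrates this with Python's byte-string ordering; the key format and the padded_key helper are hypothetical examples, not something prescribed by HBase.

```python
# Row keys are compared byte by byte, not numerically.
raw_keys = [b"user2", b"user10", b"user1"]
sorted_raw = sorted(raw_keys)

# Hypothetical fix: zero-pad the numeric part to a fixed width.
def padded_key(user_number, width=6):
    return b"user" + str(user_number).zfill(width).encode("ascii")

sorted_padded = sorted(padded_key(n) for n in (2, 10, 1))
```

In the unpadded case user10 sorts before user2, which matters whenever a Scan is expected to return rows in numeric order.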

Next, consider a scenario: we want to design a table that stores follow relationships between users on Weibo. Before designing the table, we need to consider the read and write scenarios in the business.

For reads, we need to answer:
1. Whom does each user follow?
2. Does user A follow user B?
3. Who follows user A?

For writes, we need to handle:
1. A user follows another user.
2. A user unfollows a user.

Let's look at several table designs:

In the first design, each row represents a user together with all the users he or she follows. The user ID is the row key. Each column qualifier is the ordinal number of a followed user within the column family, and the cell value is that followed user's ID. With this design, the question "whom does each user follow?" is easy to answer. But answering "does user A follow user B?" when the row has many columns means traversing all the cell values to look for user B, which is expensive. Adding a new followed user is also expensive: because you do not know which ordinal to assign, you must traverse every column in the column family to find the last one and then use its ordinal + 1 as the new user's qualifier.
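The cost of this first design can be seen in a small in-memory sketch. The dictionary layout and the function names are hypothetical stand-ins for one HBase row per user, and the user names other than Jame and Emma are invented for illustration.

```python
# One row per user; qualifiers are ordinals, values are followed user IDs.
# follows[row_key] -> {ordinal: followed_user_id}
follows = {"jame": {1: "emma", 2: "bob"}}

def is_following(user, other):
    """Must inspect every cell value in the row: O(number of columns)."""
    return other in follows.get(user, {}).values()

def follow(user, other):
    """Must find the largest ordinal in use before it can add a column."""
    row = follows.setdefault(user, {})
    next_ordinal = max(row, default=0) + 1
    row[next_ordinal] = other
    return next_ordinal

new_ordinal = follow("jame", "carol")
```

Both functions walk the whole row, which is exactly the traversal cost described above.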


This leads to the second design, shown in the accompanying figure, which adds a counter column that records the total number of columns in the column family. When a user follows someone new, the new followed user gets ordinal counter + 1. Unfollowing, however, still requires iterating over all the column data. The biggest problem is that HBase does not support transactions, so the logic of reading the counter, adding the followed user, and updating the counter has to be implemented in the client.
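One way such client-side logic can go wrong is a lost update when two clients interleave. The sketch below simulates that interleaving on a plain dictionary; the row layout, the "count" key, and the helper names are assumptions made for illustration, not HBase calls.

```python
row = {"count": 2, 1: "emma", 2: "bob"}   # counter column plus ordinal columns

def read_counter(r):
    return r["count"]

def write_follow(r, seen_count, followed):
    """Client-side steps 2 and 3: add the column, then bump the counter."""
    r[seen_count + 1] = followed
    r["count"] = seen_count + 1

# Two clients interleave: both read the counter before either writes.
seen_by_a = read_counter(row)
seen_by_b = read_counter(row)
write_follow(row, seen_by_a, "carol")
write_follow(row, seen_by_b, "dave")

followed_users = sorted(v for k, v in row.items() if k != "count")
```

Both clients computed ordinal 3, so the second write replaces the first and one follow silently disappears. This is the kind of problem the client code would have to guard against in this design.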



Recall that the column qualifier is stored as binary bytes, so it can hold any data, and that column qualifiers can be added dynamically. Based on these two properties we can improve the table design, as shown in the accompanying figure. This time the ID of the followed user is used as the column qualifier, and the cell value can be anything, for example a uniform 1. With this design it is easy both to follow a new user and to unfollow one. The read scenario "who follows user A?" remains a problem, though: because HBase indexes only the row key, you have to scan the whole table to collect all the users who follow user A. So this design still has a performance problem, which suggests that followers need to be indexed in some way.
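A rough in-memory sketch of this third design follows. The dictionary stands in for the table, the function names are hypothetical, and the sample users are illustrative.

```python
# follows[row_key] -> {followed_user_id: 1}
follows = {
    "jame": {"emma": 1, "bob": 1},
    "emma": {"jame": 1},
    "bob": {"emma": 1},
}

def follow(user, other):
    follows.setdefault(user, {})[other] = 1      # no ordinal to look up

def unfollow(user, other):
    follows.get(user, {}).pop(other, None)       # delete one column

def is_following(user, other):
    return other in follows.get(user, {})        # single column lookup

def followers_of(target):
    """Still a full-table scan: every row must be checked."""
    return sorted(u for u, cols in follows.items() if target in cols)

emma_followers = followers_of("emma")
```

Writes and the "does A follow B?" check now touch a single column, while followers_of still has to visit every row.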


There are three ways to optimize the table design above. The first is to create a second table that stores, for each user, all the users who follow him or her. The second is to keep both kinds of information in the same table and distinguish them by row key: for example, the row with key jame_001_following holds all the users Jame follows, and the row with key jame_001_followed holds all the users who follow Jame. The third is to design the row key as "followerid+followedid", as in the accompanying diagram. For example, the row key "jame+emma" means that Jame follows Emma (in practice it would be Jame's ID + Emma's ID; names are used here only for ease of explanation), so the row key itself carries both the follower and the followed user.
Note also that the column family name is a single letter, f. Because every KeyValue object returned to the client contains the column family name, a short name reduces HBase's I/O load and the number of bytes returned to the client, which improves response time. The name of the followed user is also stored in the table, as the column qualifier, which saves a lookup in the user table to find that name. With this design, operations such as "user A unfollows user B" and "does user A follow user B?" become simple and efficient.
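The composite row key design can be sketched as follows. This is an in-memory approximation under stated assumptions: the sorted list stands in for HBase's key ordering, the "+" separator is taken from the "jame+emma" example, and all function names are hypothetical.

```python
import bisect

SEP = "+"   # separator taken from the article's "jame+emma" example

# Sorted row keys of the form follower + SEP + followed.
# Each row has one column in family "f": qualifier = followed user's name, value = 1.
keys = []
rows = {}

def follow(follower, followed, followed_name):
    key = follower + SEP + followed
    if key not in rows:
        bisect.insort(keys, key)
    rows[key] = {"f:" + followed_name: 1}

def unfollow(follower, followed):
    key = follower + SEP + followed
    if rows.pop(key, None) is not None:
        keys.remove(key)

def is_following(follower, followed):
    return (follower + SEP + followed) in rows   # a single lookup by row key

def following(follower):
    """Prefix scan: all keys that start with follower + SEP."""
    prefix = follower + SEP
    start = bisect.bisect_left(keys, prefix)
    result = []
    for key in keys[start:]:
        if not key.startswith(prefix):
            break
        result.append(key[len(prefix):])
    return result

follow("jame", "emma", "Emma")
follow("jame", "bob", "Bob")
follow("emma", "jame", "Jame")
jame_follows = following("jame")
```

Follow, unfollow, and the "does A follow B?" check each address exactly one row key, and listing whom a user follows becomes a short range scan over adjacent keys.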


Note also that in a real production environment you would additionally hash the row key with MD5 (MD5 is a hash function rather than encryption). One benefit is that all row keys then have the same length, which can improve data access performance. Optimization in this area is beyond the scope of this article.
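As a rough illustration of fixed-length hashed keys, the sketch below uses Python's hashlib. Hashing each ID separately and concatenating the digests is one possible scheme, assumed here for illustration; the function name hashed_row_key is hypothetical.

```python
import hashlib

def hashed_row_key(follower_id, followed_id):
    """Concatenate the 16-byte MD5 digests of both IDs into a 32-byte key."""
    follower_part = hashlib.md5(follower_id.encode("utf-8")).digest()
    followed_part = hashlib.md5(followed_id.encode("utf-8")).digest()
    return follower_part + followed_part

short_key = hashed_row_key("jame", "emma")
long_key = hashed_row_key("a_much_longer_user_id_0001", "emma")
jame_prefix = hashlib.md5(b"jame").digest()
```

Both keys are 32 bytes long regardless of the length of the original IDs, and under this scheme all rows for one follower still share a common 16-byte prefix.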


Summary:
This article has outlined the HBase data model and the basic ideas of table design. Here is a summary of some key points about HBase:
   1. The row key is the most important part of HBase table design; it directly affects how efficiently the application interacts with HBase and how well data storage performs.
   2. HBase's table structure is more flexible than that of a traditional relational database: you can store any binary data in a table, regardless of data type.
   3. All data in the same column family shares the same access pattern.
   4. Indexing is done mainly through the row key.
   5. A tall table design (extended vertically, with many rows) makes it quick and simple to Get data, but gives up some atomicity; the last design above is an example. A wide table design (extended horizontally, with many columns in a column family), such as the first design above, retains atomicity within a row.
   6. HBase does not support transactions, so try to obtain the result in a single API call.
   7. Hashing the row key yields fixed-length row keys and distributes the data more evenly instead of concentrating it on a single server, but it sacrifices some ordering and some read performance.
   8. Column qualifiers can themselves be used to store data.
   9. The lengths of column qualifier names and column family names both affect I/O read/write performance and the amount of data sent to the client, so names should be concise.

References
[1] Amandeep Khurana, Introduction to HBase Schema Design: http://0b4af6cdc2f0c 5998459-c024 gin1210_khurana.pdf

Source: CSDN
Author: ymh198816
