Analysis of HBase data model and basic table design

Last Update:2018-07-26 Source: Internet

Author: User

I have recently been learning to use HBase and carefully read an officially recommended blog post. This article translates and summarizes that post as a way to walk through the HBase data model and basic table design ideas together.

Original address of the officially recommended post: http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_Khurana.pdf

HBase is an open source, scalable, distributed NoSQL database for massive data storage. It is modeled on Google's Bigtable data model and built on top of HDFS, the storage system of Hadoop. Unlike relational databases such as MySQL and Oracle, the HBase data model gives up some relational features in exchange for great scalability and flexible manipulation of the table structure.

To a certain extent, HBase can also be viewed as a sorted map data structure keyed by row key, column qualifier, and timestamp, one that is sparse, distributed, persistent, and multidimensional.
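The sorted-map view can be sketched with plain Python dictionaries. This is only an illustrative model, not the HBase client API: the table contents, the row key row10, and the timestamp value are hypothetical, and real HBase keeps rows sorted on disk rather than sorting at read time. The point it shows is that row keys are compared byte by byte, so row10 sorts before row2.

```python
# Illustrative model only: nested dicts standing in for an HBase table.
# Layout: row key -> column family -> column qualifier -> timestamp -> value.
# Keys and values are bytes because HBase stores them as raw bytes.
TS = 1532563200000  # hypothetical timestamp in milliseconds

table = {
    b"row2": {b"userinfo": {b"fax": {TS: b"0898-66xxxx"}}},
    b"row1": {b"userinfo": {b"telephone": {TS: b"137xxxxx869"}}},
    b"row10": {b"userinfo": {b"telephone": {TS: b"placeholder"}}},
}


def scan(tbl):
    """Yield (row_key, families) in byte-wise lexicographic row-key order."""
    for row_key in sorted(tbl):
        yield row_key, tbl[row_key]


ordered_keys = [row_key for row_key, _ in scan(table)]
print(ordered_keys)  # [b'row1', b'row10', b'row2']
```

Because ordering is byte-wise rather than numeric, row keys that embed numbers are often zero-padded when numeric order matters.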

Introduction to the data model of HBase
The HBase data model is also made up of tables, and each table has rows and columns, but rows and columns in HBase differ somewhat from those in a relational database. The terms used in the HBase data model are introduced below:

Table: HBase organizes data into tables. Note that a table name must be valid for use in a file path, because HBase tables are mapped to files on HDFS.

Row: In a table, each row represents one data object and is uniquely identified by a row key. Row keys have no specific data type and are stored as binary bytes.

Column family (Column Family): Column families must be defined in advance when the HBase table is created, and every column in the table must belong to a column family. Once a column family is set it cannot easily be modified, because it affects the actual physical storage layout of HBase, but the column qualifiers within a column family and their corresponding values can be added and deleted dynamically. Every row in a table has the same column families, but rows are not required to hold the same column qualifiers and values within a column family, so the table is sparse, which avoids data redundancy to some extent. For example: {row1, userinfo:telephone -> 137xxxxx869} {row2, userinfo:fax -> 0898-66xxxx}. Row 1 and row 2 share the column family userinfo, but in row 1 the column family contains only the column qualifier for the mobile number, while in row 2 it contains only the column qualifier for the fax number.
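The userinfo example can be modeled in a few lines of Python. This is a minimal sketch under stated assumptions, not HBase code: COLUMN_FAMILIES, rows, and put are hypothetical names, and values are kept as strings for readability. It shows the two rules described above: the column family must already exist, while column qualifiers are free-form per row.

```python
# Column families are fixed when the table is defined (hypothetical schema).
COLUMN_FAMILIES = {"userinfo"}

# row key -> column family -> column qualifier -> value
rows = {}


def put(row_key, family, qualifier, value):
    """Store a value; the family must be predefined, the qualifier may be new."""
    if family not in COLUMN_FAMILIES:
        raise KeyError(f"unknown column family: {family}")
    rows.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value


put("row1", "userinfo", "telephone", "137xxxxx869")
put("row2", "userinfo", "fax", "0898-66xxxx")

# Each row stores only the qualifiers it actually uses, so nothing is
# allocated for the fax number in row1 or the telephone number in row2.
print(sorted(rows["row1"]["userinfo"]))  # ['telephone']
print(sorted(rows["row2"]["userinfo"]))  # ['fax']
```

Writing to a family that was never declared, such as put("row3", "address", "city", "x"), raises KeyError in this sketch, mirroring the rule that column families cannot be introduced on the fly.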

Column qualifier (Column Qualifier): Data within a column family is addressed by column qualifier. There is no need to cling to the notion of a "column"; it can also be understood as a key-value pair in which the column qualifier is the key. Column qualifiers have no specific data type and are stored as binary bytes.

Cell: A row key, a column family, and a column qualifier together identify a cell. The data stored in a cell is called the cell value. Cell values likewise have no specific data type and are stored as binary bytes.

Timestamp: By default, each value written to a cell is stamped with a timestamp that serves as its version identifier. When a cell is read without a timestamp, the most recent version is returned by default; when a new cell value is written without a timestamp, the current time is used by default. HBase maintains the number of retained cell versions separately for each column family and retains 3 versions by default (this was the default in older releases; since HBase 0.96 the default is 1).
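The versioning behaviour can be sketched as a small Python class. This is a simplified model, not HBase's implementation: VersionedCell and MAX_VERSIONS are hypothetical names, the limit of 3 is the default quoted in the text, and old versions are dropped immediately on write here, whereas a real store cleans them up later.

```python
import time

MAX_VERSIONS = 3  # default quoted in the text; set per column family in HBase


class VersionedCell:
    """Toy model of one cell that keeps its newest max_versions values."""

    def __init__(self, max_versions=MAX_VERSIONS):
        self.max_versions = max_versions
        self.versions = {}  # timestamp in milliseconds -> value

    def put(self, value, timestamp=None):
        # With no explicit timestamp, fall back to the current time.
        if timestamp is None:
            timestamp = int(time.time() * 1000)
        self.versions[timestamp] = value
        # Keep only the newest max_versions timestamps.
        for old in sorted(self.versions, reverse=True)[self.max_versions:]:
            del self.versions[old]

    def get(self, timestamp=None):
        # With no timestamp, return the most recent version.
        if not self.versions:
            return None
        if timestamp is None:
            return self.versions[max(self.versions)]
        return self.versions.get(timestamp)


cell = VersionedCell()
for ts, value in [(1, b"v1"), (2, b"v2"), (3, b"v3"), (4, b"v4")]:
    cell.put(value, timestamp=ts)

print(cell.get())             # b'v4'
print(cell.get(2))            # b'v2'
print(cell.get(1))            # None, the oldest version was dropped
print(sorted(cell.versions))  # [2, 3, 4]
```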