HBase system architecture and data structure

Source: Internet
Author: User

Tables in HBase generally have the following characteristics:

1 Large: a table can have billions of rows and millions of columns.

2 Column-oriented: storage and access control are organized by column (family), and each column (family) can be retrieved independently.

3 Sparse: a column that is empty (null) takes up no storage space, so tables can be designed to be very sparse.
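As a rough illustration of the sparse property, the sketch below models a table as a nested dictionary in which only cells that actually hold a value are stored. The names `table` and `put` are hypothetical and are not HBase APIs; this is only a toy in-memory model.

```python
# Hypothetical in-memory model: only cells that hold a value are stored.
table = {}

def put(row, column, value):
    """Store a value; nothing is recorded for columns a row does not use."""
    table.setdefault(row, {})[column] = value

put("row1", "info:name", b"alice")
put("row2", "info:email", b"bob@example.com")

# row1 has no info:email cell, so it takes no space and reads back as None.
missing = table["row1"].get("info:email")
stored_cells = sum(len(cols) for cols in table.values())
```

Two rows and two possible columns give four potential cells, but only the two that were written occupy any space.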

The following figure shows where HBase sits in the Hadoop ecosystem.

Second, logical view

HBase stores data in the form of tables. A table is made up of rows and columns, and the columns are grouped into a number of column families.

Row Key

As in other NoSQL databases, the row key is the primary key used to retrieve records. There are only three ways to access rows in an HBase table:

1 Access by a single row key

2 Scan over a range of row keys

3 Full table scan

A row key can be any string (the maximum length is 64 KB; in practice it is usually 10-100 bytes). Inside HBase, the row key is stored as a byte array.

Data is stored sorted in lexicographic (byte) order of the row key. When designing row keys, take full advantage of this sorted storage: give rows that are often read together keys that sort next to each other, so that they are stored together (locality).
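The three access paths and the effect of byte-ordered keys can be sketched with a small in-memory model. Everything here (`rows`, `get`, `scan_range`, `scan_all`, and the half-open start/stop range) is a hypothetical illustration, not the HBase client API.

```python
import bisect

# Hypothetical sorted store: row keys are bytes kept in byte order.
rows = {b"user#001": "a", b"user#002": "b", b"user#010": "c", b"video#001": "d"}
keys = sorted(rows)

def get(row_key):
    """1: look up a single row key."""
    return rows.get(row_key)

def scan_range(start, stop):
    """2: return rows with start <= key < stop, using the sort order."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, stop)
    return [(k, rows[k]) for k in keys[lo:hi]]

def scan_all():
    """3: full table scan in key order."""
    return [(k, rows[k]) for k in keys]

# All "user#" rows sort next to each other, so one range scan covers them.
user_rows = scan_range(b"user#", b"user$")
```

Because rows sharing the `user#` prefix are adjacent in the sort order, the range scan touches only that contiguous slice instead of the whole table.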


The lexicographic ordering of integers written as strings is 1,10,100,11,12,13,14,15,16,17,18,19,2,20,21,...,9,90,91,92,93,94,95,96,97,98,99. To preserve the natural order of integers, row keys must be left-padded with zeros.
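A minimal sketch of the padding trick, assuming integer IDs are encoded as decimal strings; the width of 3 digits is an arbitrary choice for this example.

```python
ids = [1, 2, 9, 10, 11, 100]

# Plain decimal strings sort in byte order, not numeric order.
plain = sorted(str(i).encode() for i in ids)

# Left-padding to a fixed width (3 digits here) restores numeric order.
padded = sorted(str(i).zfill(3).encode() for i in ids)
```

The fixed width has to be chosen large enough for the largest ID that will ever be stored, since keys of different widths would again sort incorrectly.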

Each read or write of a single row is an atomic operation, no matter how many columns are read or written. This design decision makes it easier for users to reason about program behavior when concurrent updates are made to the same row.

Column Family

Every column in an HBase table belongs to a column family. Column families are part of the table schema (individual columns are not) and must be defined before the table is used. Column names are prefixed with the column family name; for example, courses:history and courses:math both belong to the courses family.
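The naming rule can be sketched as follows. The set `schema_families` and the helpers `split_column` and `check_column` are hypothetical names for this illustration; the point is only that families are fixed in the schema while qualifiers are free-form.

```python
# Hypothetical schema: only the column families are declared up front.
schema_families = {"courses", "info"}

def split_column(name):
    """Split 'family:qualifier' into its two parts."""
    family, _, qualifier = name.partition(":")
    return family, qualifier

def check_column(name):
    """Accept any qualifier, but only inside a declared family."""
    family, qualifier = split_column(name)
    if family not in schema_families:
        raise KeyError(f"unknown column family: {family}")
    return family, qualifier
```

With this model, `check_column("courses:math")` succeeds even though `math` was never declared, while a column in an undeclared family is rejected.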

Access control and disk and memory usage accounting are performed at the column family level. In practice, column-family-level permissions help manage different types of applications: some applications are allowed to add new base data, some can read base data and create derived column families, and some are only allowed to view existing data (and possibly not even all of it, for privacy reasons).

Timestamp

The storage unit identified by a row and a column in HBase is called a cell. Each cell holds multiple versions of the same piece of data, and the versions are indexed by timestamp. A timestamp is a 64-bit integer. It can be assigned by HBase automatically when the data is written, in which case it is the current system time with millisecond precision. It can also be assigned explicitly by the client. An application that needs to avoid data version conflicts must generate unique timestamps itself. Within each cell, versions are sorted in reverse chronological order, so the most recent data comes first.
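A minimal sketch of versioned cells, assuming a cell is modeled as a list of (timestamp, value) pairs. The helper `put_version` is hypothetical; it falls back to the current time in milliseconds when no timestamp is supplied, mirroring the automatic assignment described above.

```python
import time

def put_version(cell, value, timestamp_ms=None):
    """Add a version to a cell, keeping versions newest first."""
    if timestamp_ms is None:
        # Assumed default: current system time in milliseconds.
        timestamp_ms = int(time.time() * 1000)
    cell.append((timestamp_ms, value))
    cell.sort(key=lambda tv: tv[0], reverse=True)

cell = []
put_version(cell, b"v1", timestamp_ms=1000)
put_version(cell, b"v3", timestamp_ms=3000)
put_version(cell, b"v2", timestamp_ms=2000)

# The most recent version is always at the front.
latest = cell[0]
```

Reading the newest value is then just a matter of taking the first entry, regardless of the order in which versions were written.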

To avoid the management burden (in both storage and indexing) caused by too many versions of data, HBase provides two ways to reclaim old versions. The first keeps only the last n versions of the data; the second keeps only the versions from a recent period (for example, the last seven days). Users can configure these per column family.
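The two retention policies can be sketched as a single pruning function. The name `prune` and its parameters `max_versions` and `ttl_ms` are hypothetical and chosen for this illustration; they are not HBase configuration keys.

```python
def prune(versions, now_ms, max_versions=None, ttl_ms=None):
    """versions: list of (timestamp_ms, value), newest first."""
    kept = versions
    if ttl_ms is not None:
        # Keep only versions written within the last ttl_ms milliseconds.
        kept = [(ts, v) for ts, v in kept if now_ms - ts <= ttl_ms]
    if max_versions is not None:
        # Keep only the newest max_versions entries.
        kept = kept[:max_versions]
    return kept

versions = [(9000, b"d"), (7000, b"c"), (4000, b"b"), (1000, b"a")]
last_two = prune(versions, now_ms=10000, max_versions=2)
recent = prune(versions, now_ms=10000, ttl_ms=6000)
```

Here `last_two` keeps the two newest versions, while `recent` keeps every version that is at most 6000 ms old.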


A cell is the unit uniquely determined by {row key, column (= <family> + <label>), version}. The data in a cell has no type and is always stored as raw bytes.

Third, physical storage

1 As already mentioned, all rows in a table are sorted in lexicographic order of row key.

2 A table is split into multiple HRegions along the row direction.

3 Regions are split by size. Each table starts with only one region. As data is inserted, the region keeps growing, and when it reaches a threshold the HRegion is split into two new HRegions. As the number of rows in the table grows, there are more and more HRegions.

4 The HRegion is the smallest unit of distributed storage and load balancing in HBase. Smallest unit means that different HRegions can be placed on different HRegion servers, but a single HRegion is never split across multiple servers.

5 Although the HRegion is the smallest unit of distributed storage, it is not the smallest unit of storage.
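The growth and splitting described in item 3 can be sketched with a toy model in which each region is a sorted list of row keys. The threshold here counts rows purely for simplicity (the text describes a size-based threshold), and `SPLIT_THRESHOLD` and `insert` are hypothetical names.

```python
import bisect

SPLIT_THRESHOLD = 4  # assumed limit, counted in rows for this toy model

def insert(regions, row_key):
    """regions: list of sorted key lists, ordered by key range."""
    # Pick the last region whose first key is <= row_key (else the first one).
    idx = 0
    for i, region in enumerate(regions):
        if region and region[0] <= row_key:
            idx = i
    target = regions[idx]
    bisect.insort(target, row_key)
    # When a region grows past the threshold, split it into two halves.
    if len(target) > SPLIT_THRESHOLD:
        mid = len(target) // 2
        regions[idx:idx + 1] = [target[:mid], target[mid:]]

regions = [[]]  # every table starts with a single region
for i in range(10):
    insert(regions, f"row{i:02d}".encode())
```

Each region always covers a contiguous key range, so reading the regions in order still yields all rows in sorted order.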

An HRegion consists of one or more Stores, and each Store holds one column family.

Each Store is made up of one MemStore and zero or more StoreFiles, as shown in the figure:

StoreFiles are saved on HDFS in HFile format.
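The relationship between a Store, its MemStore, and its StoreFiles can be sketched as below. This is a hypothetical in-memory model: the `Store` class, the entry-count flush trigger `flush_size`, and the read path are assumptions made for illustration, and nothing is actually written to HDFS.

```python
class Store:
    """Hypothetical model of one Store (one column family of one region)."""

    def __init__(self, flush_size=2):
        self.memstore = {}            # in-memory writes
        self.storefiles = []          # immutable sorted snapshots, oldest first
        self.flush_size = flush_size  # assumed flush trigger, in entries

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_size:
            self.flush()

    def flush(self):
        """Write the MemStore out as a new sorted StoreFile and clear it."""
        if self.memstore:
            self.storefiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        """Check the MemStore first, then StoreFiles from newest to oldest."""
        if key in self.memstore:
            return self.memstore[key]
        for storefile in reversed(self.storefiles):
            for k, v in storefile:
                if k == key:
                    return v
        return None

store = Store()
store.put(b"r1", b"a")
store.put(b"r2", b"b")  # reaches flush_size, so a StoreFile is written
store.put(b"r1", b"c")  # the newer value stays in the MemStore
```

A read consults the MemStore before any StoreFile, so the newest write for `r1` wins even though an older value still sits in a flushed file.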

The HFile format is as follows:

An HFile is divided into six parts:

Data Block segment – stores the table data; can be compressed.

Meta Block segment (optional) – stores user-defined key-value pairs; can be compressed.

File Info segment – meta-information about the HFile; not compressed. Users can also add their own meta-information in this section.

Data Block Index segment – the index of the data blocks. The key of each index entry is the key of the first record in the block being indexed.
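How an index keyed by each block's first record can locate a key is sketched below. The block contents and the names `blocks`, `block_index`, and `lookup` are hypothetical; real HFile blocks are byte ranges on disk rather than Python lists.

```python
import bisect

# Hypothetical data blocks: each is a sorted list of (key, value) records.
blocks = [
    [(b"a", 1), (b"c", 2)],
    [(b"f", 3), (b"h", 4)],
    [(b"m", 5), (b"q", 6)],
]
# One index entry per block: the key of the block's first record.
block_index = [block[0][0] for block in blocks]

def lookup(key):
    """Use the index to pick one block, then search only that block."""
    # Find the last block whose first key is <= key.
    i = bisect.bisect_right(block_index, key) - 1
    if i < 0:
        return None  # key sorts before the first block
    for k, v in blocks[i]:
        if k == key:
            return v
    return None
```

Only the small index needs to be searched to decide which single data block could contain the key, so the other blocks are never read.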
