Column-Oriented Storage: Learning the HBase System Architecture

Source: Internet
Author: User

I. Introduction to HBase

HBase is an open source, non-relational, distributed database (NoSQL) modeled on Google's Bigtable and implemented in Java. It is part of the Apache Software Foundation's Hadoop project and runs on top of the HDFS file system, providing Bigtable-like capabilities for Hadoop. As a result, it can store large amounts of sparse data in a fault-tolerant manner.

HBase implements the compression algorithms, in-memory operations, and Bloom filters described in the Bigtable paper on a per-column basis. HBase tables can be used as input and output for MapReduce jobs and can be accessed through the Java API or through the REST, Avro, or Thrift APIs. (Source: Wikipedia)

HBase is an open source counterpart of Bigtable. Built on HDFS, it provides a highly reliable, high-performance, column-oriented, scalable database system with real-time reads and writes. It sits between NoSQL and RDBMS: data can only be retrieved by row key (the primary key) or by a row key range, and only single-row transactions are supported (complex operations such as multi-table joins can be implemented through Hive). It is primarily used to store unstructured and semi-structured loose data. Like Hadoop, HBase relies mainly on scaling out, adding inexpensive commodity servers to increase compute and storage capacity.

Tables in HBase generally have these characteristics:

Big: A table can have billions of rows and millions of columns.

Column-oriented: Storage and permission control are organized by column (family), and each column (family) is retrieved independently.

Sparse: Columns that are empty (null) do not occupy storage space, so a table can be designed to be very sparse.

[Figure: the position of HBase in the Hadoop ecosystem]

II. Logical View

HBase stores data in the form of tables. A table is made up of rows and columns, and the columns are grouped into a number of column families.

Row Key:

As in other NoSQL databases, the row key is the primary key used to retrieve records. There are only three ways to access rows in an HBase table:

Access via a single row key

Access via a row key range

Full table scan

A row key can be any string (the maximum length is 64 KB; in practice it is usually 10 to 100 bytes). Inside HBase, the row key is stored as a byte array. Data is stored sorted by the lexicographic (byte) order of the row key. When designing keys, take advantage of this sorted storage and place rows that are often read together next to each other (locality).

Note that the lexicographic ordering of integers is 1,10,100,11,12,13,14,15,16,17,18,19,2,20,21,..., 9,91,92,93,94,95,96,97,98,99. To preserve the natural order of integers, row keys must be left-padded with 0. A single read or write of a row is an atomic operation, no matter how many columns are read or written. This design decision makes it easy to reason about program behavior when concurrent updates are made to the same row.
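The effect of left-padding can be checked with a few lines of Python. This is only an illustration of byte-order sorting, not HBase code; the sample IDs and the padding width of 3 are assumptions chosen for the example.

```python
# Row keys are compared as bytes, so numeric IDs sort lexicographically.
ids = [1, 2, 10, 11, 100]

# Unpadded keys: "10" and "100" sort before "2".
plain_keys = sorted(str(i).encode() for i in ids)

# Left-padded keys (width 3 is an assumption for this example):
# byte order now matches numeric order.
padded_keys = sorted(str(i).zfill(3).encode() for i in ids)

print(plain_keys)
print(padded_keys)
```

The padding width has to be chosen up front to fit the largest ID that will ever be stored, because changing it later changes the sort position of every existing key.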

Column families:

Each column in an HBase table belongs to a column family. The column family is part of the table's schema (individual columns are not) and must be defined before the table is used. Column names are prefixed with the column family. For example, courses:history and courses:math both belong to the courses family. Access control and disk and memory usage accounting are performed at the column family level. In practice, controlling permissions per column family helps manage different types of applications: some applications are allowed to add new base data, some can read base data and create derived column families, and some are only allowed to view data (and possibly not all of it, for privacy reasons).

Timestamp:

The storage unit identified by a row and a column in HBase is called a cell. Each cell holds multiple versions of the same piece of data, indexed by timestamp. The timestamp is a 64-bit integer. It can be assigned by HBase automatically when the data is written, in which case it is the current system time with millisecond precision. It can also be assigned explicitly by the client. An application that needs to avoid version conflicts must generate unique timestamps itself. Within each cell, the versions are sorted in reverse chronological order, so the most recent data comes first.

To avoid the management burden (storage and indexing) caused by too many versions, HBase provides two ways to reclaim old versions. The first is to keep only the last n versions of the data; the second is to keep only versions from a recent period (for example, the last seven days). Users can configure this for each column family.
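The two retention rules can be sketched as follows. This is a toy model of the idea, not HBase's implementation; the function name prune_versions and its parameters are hypothetical.

```python
def prune_versions(versions, max_versions=None, ttl_ms=None, now_ms=0):
    """Return the retained (timestamp, value) pairs, newest first.

    versions: dict mapping timestamp (ms) -> value for one cell.
    max_versions and ttl_ms are hypothetical names for the two rules
    described in the text (keep last n / keep a recent period).
    """
    kept = sorted(versions.items(), reverse=True)  # newest first
    if ttl_ms is not None:
        kept = [(ts, v) for ts, v in kept if now_ms - ts <= ttl_ms]
    if max_versions is not None:
        kept = kept[:max_versions]
    return kept


cell = {100: b"a", 200: b"b", 300: b"c", 400: b"d"}
latest_two = prune_versions(cell, max_versions=2)
recent = prune_versions(cell, ttl_ms=250, now_ms=400)
print(latest_two)
print(recent)
```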


Cell:

A cell is the unit uniquely identified by {row key, column (= family + qualifier), version}. The data in a cell is untyped and is stored as raw bytes.

III. Physical Storage

1. As mentioned above, all rows in a table are sorted in lexicographic order of row key.

2. A table is divided into multiple HRegions along the row dimension.

3. Regions are split by size. Each table starts with a single region. As data is inserted, the region grows, and when it reaches a threshold, the HRegion is split evenly into two new HRegions. As the number of rows in the table grows, there are more and more HRegions.

4. The HRegion is the smallest unit of distributed storage and load balancing in HBase. This means that different HRegions can be placed on different HRegionServers, but a single HRegion is never split across multiple servers.

5. Although the HRegion is the smallest unit of distribution, it is not the smallest unit of storage. An HRegion consists of one or more Stores, each of which holds one column family. Each Store in turn consists of one MemStore and zero or more StoreFiles. StoreFiles are saved on HDFS in HFile format.
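Point 3 above, the size-based split, can be sketched with a small simulation. This is a simplified model under stated assumptions: the Region class, the insert helper, and a threshold counted in rows (rather than bytes) are all hypothetical and chosen to keep the example short.

```python
import bisect


class Region:
    """Toy region: a start key plus the sorted row keys it holds."""

    def __init__(self, start_key, rows=None):
        self.start_key = start_key
        self.rows = rows or []


def split_if_needed(region, max_rows):
    """Split a region evenly in two once it exceeds max_rows."""
    if len(region.rows) <= max_rows:
        return [region]
    mid = len(region.rows) // 2
    split_key = region.rows[mid]
    return [
        Region(region.start_key, region.rows[:mid]),
        Region(split_key, region.rows[mid:]),
    ]


def insert(regions, row_key, max_rows=4):
    """Route the row to the region covering it, then split if too large."""
    start_keys = [r.start_key for r in regions]
    idx = bisect.bisect_right(start_keys, row_key) - 1
    region = regions[idx]
    bisect.insort(region.rows, row_key)
    regions[idx:idx + 1] = split_if_needed(region, max_rows)


regions = [Region(b"")]  # every table starts with a single region
for i in range(10):
    insert(regions, f"row{i:02d}".encode())

print([r.start_key for r in regions])
```

Each region stays whole (it is never spread over several servers); only the list of regions grows as rows are added.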

[Figure: HFile format]

An HFile is divided into six parts:

Data Block segment – Stores the table data. This part can be compressed.

Meta Block segment (optional) – Stores user-defined key-value pairs. This part can be compressed.

File Info segment – Meta-information about the HFile. It is not compressed, and users can add their own meta-information in this section.

Data Block Index segment – The index of the data blocks. The key of each index entry is the key of the first record in the indexed block.

Meta Block Index segment (optional) – The index of the meta blocks.

Trailer – This segment is fixed-length and stores the offset of each segment. When an HFile is read, the trailer is read first; it gives the starting position of each segment (the magic number of each segment is used as a sanity check). The data block index is then read into memory. As a result, retrieving a key does not require scanning the entire HFile: the block that contains the key is found in the in-memory index, the whole block is read into memory with a single disk I/O, and the key is then located within the block. Data block index entries are evicted using an LRU mechanism.
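The lookup enabled by the data block index can be sketched as a binary search over the first key of each block. This is a simplified in-memory model, not the real HFile reader; the block contents and the get function are hypothetical.

```python
import bisect

# Each block is a sorted run of (key, value) pairs; in a real HFile a
# block lives on disk and is read with one I/O. Contents are made up.
blocks = [
    [(b"apple", b"1"), (b"banana", b"2")],
    [(b"cherry", b"3"), (b"grape", b"4")],
    [(b"melon", b"5"), (b"peach", b"6")],
]

# The index keeps only the first key of every block, as described above.
block_index = [block[0][0] for block in blocks]


def get(key):
    """Find the one block that may hold the key, then search inside it."""
    i = bisect.bisect_right(block_index, key) - 1
    if i < 0:
        return None  # key sorts before the first block
    for k, v in blocks[i]:  # stands in for one block read plus a scan
        if k == key:
            return v
    return None


print(get(b"grape"))
print(get(b"date"))
```

Only the small index has to stay in memory; the size of a block bounds how much data a single point lookup reads.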

The data blocks and meta blocks of an HFile are usually stored compressed, which significantly reduces network and disk I/O, at the cost of CPU time for compression and decompression. HFile compression currently supports two algorithms: Gzip and LZO.

HFiles are variable-length; only two of their blocks have a fixed length: the Trailer and File Info. The Trailer contains pointers to the starting points of the other blocks. File Info records meta-information about the file, such as AVG_KEY_LEN, AVG_VALUE_LEN, LAST_KEY, COMPARATOR, MAX_SEQ_ID_KEY, and so on. The Data Index and Meta Index blocks record the starting point of each data block and meta block.

The data block is the basic unit of HBase I/O. To improve efficiency, HRegionServer has an LRU-based block cache. The size of each data block can be specified by a parameter when a table is created: large blocks favor sequential scans, and small blocks favor random queries. Apart from the magic at the beginning, each data block is a concatenation of KeyValue pairs. The magic content is a set of random numbers whose purpose is to detect data corruption. The internal structure of each KeyValue pair is described below.

Each KeyValue pair inside an HFile is a simple byte array. However, this byte array contains many fields and has a fixed structure. Let's take a look at the concrete structure inside:

It starts with two fixed-length numbers that give the length of the key and the length of the value. Next comes the key: a fixed-length number giving the length of the row key, then the row key, then a fixed-length number giving the length of the family, then the family, then the qualifier, and finally two fixed-length values for the timestamp and the key type (Put/Delete). The value part has no such structure; it is pure binary data.
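The layout described above can be made concrete with a small encoder and decoder. The text only says the length fields are fixed-length; the specific widths used here (4-byte key and value lengths, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type) and the PUT and DELETE codes are assumptions for illustration, as are the function names.

```python
import struct

PUT, DELETE = 4, 8  # illustrative key type codes (assumed)


def encode_keyvalue(row, family, qualifier, timestamp, key_type, value):
    """Serialize one KeyValue using the field order described in the text."""
    key = (
        struct.pack(">H", len(row)) + row          # row length, row key
        + struct.pack(">B", len(family)) + family  # family length, family
        + qualifier                                # qualifier (no length field)
        + struct.pack(">qB", timestamp, key_type)  # timestamp, key type
    )
    return struct.pack(">ii", len(key), len(value)) + key + value


def decode_keyvalue(buf):
    """Parse the byte array back into its fields."""
    key_len, value_len = struct.unpack_from(">ii", buf, 0)
    pos = 8
    key_end = pos + key_len
    (row_len,) = struct.unpack_from(">H", buf, pos)
    pos += 2
    row = buf[pos:pos + row_len]
    pos += row_len
    (family_len,) = struct.unpack_from(">B", buf, pos)
    pos += 1
    family = buf[pos:pos + family_len]
    pos += family_len
    # The qualifier fills whatever remains before the 9 trailing bytes.
    qualifier = buf[pos:key_end - 9]
    timestamp, key_type = struct.unpack_from(">qB", buf, key_end - 9)
    value = buf[key_end:key_end + value_len]
    return row, family, qualifier, timestamp, key_type, value


buf = encode_keyvalue(b"row1", b"courses", b"math", 1700000000000, PUT, b"95")
decoded = decode_keyvalue(buf)
print(decoded)
```

Because the qualifier has no length field of its own, its length is derived from the total key length, which is why the key length must come first.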

HLog (WAL log):

WAL stands for write-ahead log.

Similar to the binlog in MySQL, it is used for disaster recovery. The HLog records all changes to the data, so once data has been modified, the change can be recovered from the log.

Each region server maintains one HLog, not one per region. As a result, log entries from different regions (of different tables) are mixed together. The purpose is that continuously appending to a single file requires fewer disk seeks than writing to many files at once, which improves write performance. The drawback is that if a region server goes offline, the log on that region server must be split and distributed to other region servers before its regions can be recovered.

The HLog file is an ordinary Hadoop SequenceFile. The key of the SequenceFile is an HLogKey object, which records where the written data belongs: in addition to the table and region names, it includes a sequence number and a timestamp. The timestamp is the write time. The sequence number starts at 0, or at the last sequence number saved to the file system. The value of the HLog SequenceFile is an HBase KeyValue object, which corresponds to the KeyValue in the HFile described above.
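The log splitting needed for recovery can be sketched as grouping the mixed entries by region while preserving their order. This is a toy model; the entry layout (sequence number, region, row key, value), the region names, and the split_log function are hypothetical.

```python
from collections import defaultdict

# One shared log per region server: entries from different regions are
# interleaved in write order.
hlog = [
    (1, "regionA", b"row1", b"v1"),
    (2, "regionB", b"row9", b"v2"),
    (3, "regionA", b"row2", b"v3"),
    (4, "regionC", b"row5", b"v4"),
    (5, "regionB", b"row8", b"v5"),
]


def split_log(entries):
    """Group log entries by region, keeping sequence order within each."""
    per_region = defaultdict(list)
    for entry in entries:
        region = entry[1]
        per_region[region].append(entry)
    return dict(per_region)


per_region_logs = split_log(hlog)
for region, entries in sorted(per_region_logs.items()):
    print(region, [seq for seq, *_ in entries])
```

Each per-region list can then be handed to whichever server takes over that region and replayed in sequence order.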

IV. System Architecture


Client:

Contains the interface for accessing HBase. The client maintains some caches to speed up access to HBase, such as region location information.


ZooKeeper:

Ensures that there is only one master in the cluster at any time.

Stores the addressing entry point for all regions.

Monitors the status of region servers in real time and notifies the master when a region server comes online or goes offline.

Stores the HBase schema, including which tables exist and which column families each table has.


Master:

Assigns regions to region servers and is responsible for load balancing across region servers.

Manages users' create, delete, update, and query operations on tables.

Detects failed region servers and reassigns their regions.

Garbage-collects files on HDFS.

Is responsible for migrating the regions of a failed HRegionServer after it shuts down.

As you can see, a client accessing data in HBase does not need the master: addressing goes through ZooKeeper and the region servers, and data reads and writes go to the region servers. The master only maintains metadata about tables and regions, so its load is low.


Region Server:

The region server maintains the regions that the master assigns to it and handles I/O requests to those regions.

The region server is responsible for splitting regions that have grown too large during operation.

HRegionServer is the core module of HBase. It is primarily responsible for responding to user I/O requests and for reading and writing data on HDFS.

HRegionServer internally manages a series of HRegion objects, each of which corresponds to one region of a table. An HRegion consists of multiple HStores. Each HStore corresponds to the storage of one column family in the table, so each column family is in effect a unit of centralized storage. It is therefore most efficient to put columns with similar I/O characteristics in the same column family.

The HStore is the core of HBase storage. It consists of two parts: the MemStore and the StoreFiles. The MemStore is a sorted memory buffer. Data written by the user is first placed in the MemStore, and when the MemStore is full it is flushed into a StoreFile (implemented as an HFile). When the number of StoreFiles grows past a certain threshold, a compaction is triggered that merges multiple StoreFiles into one. During the merge, versions are consolidated and deleted data is removed. HBase therefore only ever appends data; all updates and deletions are actually applied later, during compaction. This lets a user's write return as soon as it reaches memory, which is what keeps HBase write I/O fast.
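The merge step of a compaction can be sketched as follows. This is a deliberately simplified model: it keeps only the newest version of each row and represents a delete as a tombstone value of None. The compact function and the tuple layout (row, timestamp, value) are hypothetical.

```python
TOMBSTONE = None  # marker written in place of a value when a row is deleted


def compact(storefiles):
    """Merge several StoreFiles into one.

    For every row, only the newest version survives; rows whose newest
    version is a tombstone are dropped entirely.
    """
    newest = {}
    for storefile in storefiles:
        for row, ts, value in storefile:
            if row not in newest or ts > newest[row][0]:
                newest[row] = (ts, value)
    return [
        (row, ts, value)
        for row, (ts, value) in sorted(newest.items())
        if value is not TOMBSTONE
    ]


older = [(b"row1", 10, b"a"), (b"row2", 10, b"b"), (b"row3", 10, b"c")]
newer = [(b"row1", 20, b"a2"), (b"row2", 20, TOMBSTONE)]

merged = compact([older, newer])
print(merged)
```

Neither input file is modified: the update to row1 and the delete of row2 were both plain appends, and they only take effect physically when the merged file replaces the inputs.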

As StoreFiles are compacted, they gradually form larger and larger StoreFiles. When the size of a single StoreFile exceeds a certain threshold, a split is triggered: the current region is split into 2 regions and the parent region goes offline. The 2 new child regions are assigned by the HMaster to suitable HRegionServers, so that the load of the original region is spread over 2 regions.

[Figure: the compaction and split process]

With the basic principles of the HStore in place, it is also necessary to understand what the HLog does. The HStore as described works fine as long as the system runs normally, but in a distributed environment errors and downtime cannot be avoided. If an HRegionServer exits unexpectedly, the in-memory data in the MemStore is lost, and this is why the HLog is needed. Each HRegionServer has one HLog object. HLog is a class that implements a write-ahead log: each time a user operation writes to the MemStore, a copy of the data is also written to the HLog file (the HLog file format is described in the physical storage section). The HLog file is rolled periodically, and old files (whose data has already been persisted to StoreFiles) are deleted.

When an HRegionServer terminates unexpectedly, the HMaster learns of it through ZooKeeper. The HMaster first processes the leftover HLog files, splitting the log data by region and placing each part in the directory of the corresponding region. It then reassigns the failed regions. The HRegionServers that pick up these regions find, while loading a region, that there is a historical HLog to process, so they replay the HLog data into the MemStore and then flush it to StoreFiles, completing the data recovery.

MapReduce on HBase

The most convenient and practical model for running batch operations on HBase is still MapReduce.

V. Key algorithms/Processes

5.1 Region location

How does the system find the region that contains a given row key (or row key range)? Following Bigtable, HBase uses a three-level structure similar to a B+ tree to store region locations:

The first level is a file stored in ZooKeeper, which holds the location of the root region.

The second level, the root region, is the first region of the .META. table. It stores the locations of the other regions of the .META. table. Through the root region, we can access the data of the .META. table.

The third level is .META., a special table that holds the region location information for all data tables in HBase.


The root region is never split, which guarantees that at most three hops are needed to locate any region.

Each row of the .META. table holds the location information of one region. The row key is encoded from the table name plus the last row of the region.

To speed up access, all regions of the .META. table are kept in memory.

The client caches the location information it has looked up, and the cache is not proactively invalidated. So if the client's cache is entirely stale, it takes 6 network round trips to locate the correct region (three to discover that the cache is stale and three to fetch the location information).

Assume that a row of the .META. table takes about 1 KB in memory and that each region is limited to 128 MB. Then the number of regions that the three-level structure above can address is:

(128 MB / 1 KB) * (128 MB / 1 KB) = 2^17 * 2^17 = 2^34 regions
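The lookup and client-side caching described in this section can be sketched in a few lines. This is a simplified, single-table model: the META list, the server names, and the Client class are hypothetical, and each entry stores explicit start and end keys instead of the encoded row key described above.

```python
import bisect

# Hypothetical .META. content for one table, sorted by start key:
# (region start key, region end key or None for the last region, server)
META = [(b"", b"g", "rs1"), (b"g", b"p", "rs2"), (b"p", None, "rs3")]


def meta_lookup(row_key):
    """Binary-search the sorted region list for the covering region."""
    starts = [start for start, _, _ in META]
    return META[bisect.bisect_right(starts, row_key) - 1]


class Client:
    def __init__(self):
        self.cache = []        # region entries already looked up
        self.meta_lookups = 0  # how often we had to go to .META.

    def locate(self, row_key):
        for start, end, server in self.cache:
            if start <= row_key and (end is None or row_key < end):
                return server  # cache hit: no network access needed
        self.meta_lookups += 1
        entry = meta_lookup(row_key)
        self.cache.append(entry)
        return entry[2]


client = Client()
print(client.locate(b"hello"), client.meta_lookups)
print(client.locate(b"kiwi"), client.meta_lookups)
print(client.locate(b"zebra"), client.meta_lookups)
```

The second call is served from the cache because both keys fall in the same region. The sketch does not model stale cache entries, which is the case that costs the extra round trips mentioned above.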

5.2 Introduction to the reading and writing process

As mentioned earlier, HBase uses the MemStore and StoreFiles to store updates to tables. On an update, data is first written to the log (WAL) and to memory (MemStore). Data in the MemStore is kept sorted. When the MemStore grows to a certain threshold, a new MemStore is created and the old one is added to the flush queue, where a separate thread flushes it to disk as a StoreFile. At the same time, the system records a redo point in ZooKeeper, indicating that changes before this point have been persisted (minor compaction).

When the system fails unexpectedly, data in memory (MemStore) may be lost, and the log (WAL) is used to recover the data written after the checkpoint. As mentioned earlier, a StoreFile is read-only and cannot be modified once it has been created, so an update in HBase is really a series of appends. When the number of StoreFiles in a Store reaches a certain threshold, a merge (major compaction) is performed: the changes to the same key are combined to form one large StoreFile. When the size of that StoreFile reaches a certain threshold, it is split evenly into two StoreFiles.

Since updates to a table are constantly appended, a read request must access all StoreFiles and the MemStore in the Store and merge them by row key. Because the StoreFiles and the MemStore are all sorted, and each StoreFile has an in-memory index, the merge is still relatively fast.
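The read-time merge can be sketched with a k-way merge over the sorted sources. This is a toy model that returns only the newest version of each row; the sample data and the scan function are hypothetical.

```python
import heapq

# Every source is sorted by row key; entries are (row, timestamp, value).
memstore = [(b"row2", 30, b"m2")]
storefile_1 = [(b"row1", 10, b"a1"), (b"row2", 10, b"a2")]
storefile_2 = [(b"row2", 20, b"b2"), (b"row3", 20, b"b3")]


def scan(sources):
    """Yield (row, newest value) by merging all sorted sources."""
    # Order by row ascending, then timestamp descending, so the newest
    # version of a row is the first one the merge produces.
    merged = heapq.merge(*sources, key=lambda kv: (kv[0], -kv[1]))
    last_row = None
    for row, ts, value in merged:
        if row != last_row:
            yield row, value
            last_row = row


result = list(scan([memstore, storefile_1, storefile_2]))
print(result)
```

Because every input is already sorted, the merge is a single pass over the data, which is why reads remain reasonably fast even with several StoreFiles.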

Write request processing:

1. The client submits the write request to the region server.

2. The region server locates the target region.

3. The region checks whether the data is consistent with the schema.

4. If the client did not specify a version, the current system time is used as the data version.

5. The update is written to the WAL.

6. The update is written to the MemStore.

7. The region determines whether the MemStore needs to be flushed to a StoreFile.
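The steps above, from the schema check onward, can be sketched as a single put method. This is a minimal in-memory model, not HBase's write path; the WriteRegion class, its fields, and a flush threshold counted in cells are assumptions for illustration.

```python
import time


class WriteRegion:
    """Toy region that follows the write steps in order."""

    def __init__(self, families, flush_threshold=3):
        self.families = set(families)  # the schema: known column families
        self.wal = []                  # append-only log
        self.memstore = {}             # (row, column, timestamp) -> value
        self.storefiles = []           # flushed, sorted, immutable snapshots
        self.flush_threshold = flush_threshold

    def put(self, row, column, value, timestamp=None):
        # Check the data against the schema.
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise ValueError(f"unknown column family: {family}")
        # Default the version to the current time in milliseconds.
        if timestamp is None:
            timestamp = int(time.time() * 1000)
        # Write the log first, then the in-memory store.
        self.wal.append((row, column, timestamp, value))
        self.memstore[(row, column, timestamp)] = value
        # Flush if the MemStore has grown too large.
        if len(self.memstore) >= self.flush_threshold:
            self.storefiles.append(sorted(self.memstore.items()))
            self.memstore = {}


region = WriteRegion(["courses"])
region.put("row1", "courses:math", b"95", timestamp=1)
region.put("row1", "courses:history", b"88", timestamp=2)
region.put("row2", "courses:math", b"70", timestamp=3)
print(len(region.wal), len(region.memstore), len(region.storefiles))
```

Writing the log before the MemStore is what makes recovery possible: anything visible in memory is guaranteed to be in the log as well.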

Region assignment

At any time, a region can be assigned to only one region server. The master keeps track of which region servers are currently available, which regions are assigned to which region server, and which regions are not yet assigned. When there is an unassigned region and a region server has capacity available, the master sends a load request to that region server, assigning the region to it. Once the region server receives the request, it begins to serve the region.

Region Server Online

The master uses ZooKeeper to track region server state. When a region server starts, it first creates a file representing itself in the server directory on ZooKeeper and obtains an exclusive lock on that file. Because the master subscribes to change notifications on the server directory, it is notified by ZooKeeper in real time whenever a file is added to or deleted from the directory. So as soon as a region server comes online, the master knows about it.

Region Server Offline

When a region server goes offline, its session with ZooKeeper is disconnected, and ZooKeeper automatically releases the exclusive lock on the file that represents the server. The master constantly polls the lock status of the files in the server directory. If the master discovers that a region server has lost its exclusive lock (or if the master fails to communicate with the region server several times in a row), the master tries to acquire the read-write lock on the file representing that region server. Once it succeeds, one of the following must be true:

The network between the region server and ZooKeeper is disconnected.

The region server is down.

In either case, the region server can no longer serve its regions. The master deletes the file representing the region server from the server directory and assigns that server's regions to other region servers that are still alive.

If a transient network problem causes a region server to lose its lock, then after the region server reconnects to ZooKeeper, as long as the file representing it still exists, it keeps trying to acquire the lock on the file, and once it succeeds it can continue to serve.

Master Online

On startup, the master performs the following steps:

1. Acquires from ZooKeeper the unique lock that represents the master, to prevent any other master from becoming the active master.

2. Scans the server directory on ZooKeeper to obtain the list of currently available region servers.

3. Communicates with each region server from step 2 to obtain the current mapping between assigned regions and region servers.

4. Scans the set of .META. regions to compute the regions that are currently unassigned, and places them in the list of regions to be assigned.
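The last two steps amount to a set difference between the regions listed in .META. and the regions the live servers report. The sketch below shows only that calculation; the function name, the region names, and the shape of the server reports are hypothetical.

```python
def find_unassigned(meta_regions, server_reports):
    """Return the regions known to .META. that no live server reports.

    meta_regions: iterable of all region names found by scanning .META.
    server_reports: dict mapping region server -> regions it is serving
    """
    assigned = set()
    for regions in server_reports.values():
        assigned.update(regions)
    return sorted(set(meta_regions) - assigned)


meta_regions = ["t1,a", "t1,m", "t2,a", "t2,k"]
server_reports = {"rs1": ["t1,a"], "rs2": ["t2,a", "t2,k"]}

to_assign = find_unassigned(meta_regions, server_reports)
print(to_assign)
```

Everything the function needs comes from other parts of the system, which is why a newly started master can rebuild its state without any saved state of its own.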

Master Offline

Because the master only maintains metadata about tables and regions and does not take part in table data I/O, a master going offline only freezes metadata changes: tables cannot be created or deleted, table schemas cannot be modified, regions cannot be load-balanced, regions going online or offline cannot be handled, and regions cannot be merged. The only exception is that region splits can proceed normally, because only region servers take part in them. Reads and writes of table data continue normally.

Therefore, a master going offline has no effect on the HBase cluster as a whole for a short period. As the startup process shows, all the information the master holds is redundant (it can be collected or computed from other parts of the system). For this reason, an HBase cluster usually has one master providing service and one or more standby masters waiting for the chance to take its place.

VI. Access Interfaces

Native Java API: the most conventional and efficient access method, well suited to Hadoop MapReduce jobs that process HBase table data in parallel batches.

HBase Shell: HBase's command-line tool and the simplest interface, suited to HBase administration.

Thrift Gateway: uses Thrift serialization to support multiple languages such as C++, PHP, and Python, and is suited to online access to HBase table data from other heterogeneous systems.

REST Gateway: supports a REST-style HTTP API for accessing HBase, removing language restrictions.

Pig: the Pig Latin data-flow language can be used to manipulate data in HBase. As with Hive, scripts are ultimately compiled into MapReduce jobs that process HBase table data, which makes it suitable for data statistics.

Hive: at the time of writing, the current Hive release does not yet support HBase, but support is planned for the next version, Hive 0.7.0, which will allow HBase to be accessed using an SQL-like language.

VII. Building and Using an HBase Cluster

The article "Distributed real-time log system (IV): environment setup, building a 1.0.1 distributed cluster on CentOS 6.4" introduces the cluster setup process and provides a one-click installation script.

The article "Using Phoenix to update HBase data with SQL statements" describes how to install Phoenix and use it to perform update operations on HBase.
