HBase Learning (16) system architecture diagram

Source: Internet
Author: User
Tags: compact, file, info

Reposted from: http://www.cnblogs.com/cenyuhai/p/3708135.html

HBase System Architecture Diagram


Component Descriptions

Client
The client communicates with HMaster and HRegionServer through the HBase RPC mechanism: with HMaster for management operations, and with HRegionServer for data read and write operations.

Zookeeper
The Zookeeper quorum stores the address of the -ROOT- table and the address of HMaster. Each HRegionServer registers itself in Zookeeper as an ephemeral node, so HMaster can sense the health of every HRegionServer at any time. Zookeeper also avoids making HMaster a single point of failure.

HMaster
There is no single point of failure for HMaster: HBase can start multiple HMasters, and the Zookeeper master-election mechanism ensures that exactly one of them is the active master at any time. HMaster is mainly responsible for managing tables and regions:
1 Handle user operations that create, modify, and delete tables
2 Manage HRegionServer load balancing and adjust region distribution
3 After a region splits, assign the new regions
4 After an HRegionServer goes down, migrate the regions it hosted

HRegionServer
The most central module in HBase, mainly responsible for responding to user I/O requests and reading and writing data on the HDFS file system.


HRegionServer manages a series of HRegion objects.
Each HRegion corresponds to one region of a table, and an HRegion consists of multiple HStores.
Each HStore corresponds to the storage of one column family in the table.
A column family is a centralized storage unit, so it is most efficient to put columns with the same I/O characteristics into the same column family.

HStore is the core of HBase storage. It is made up of a MemStore and StoreFiles.
The MemStore is a sorted memory buffer. The process by which user-written data flows through the system is as follows:


The client writes into the MemStore until the MemStore is full, at which point it is flushed into a StoreFile. When the number of StoreFiles grows to a certain threshold, a compact operation is triggered: multiple StoreFiles are merged into one, and version merging and data deletion happen during this merge. As compactions gradually form larger and larger StoreFiles, once a single StoreFile's size exceeds a certain threshold, a split operation is triggered: the current region is split into two regions, the parent region goes offline, and the two new child regions are assigned by HMaster to the appropriate HRegionServers, so that the load of the original region is spread across two.
As a result of this process, HBase only ever appends data; updates and deletes are all handled in the compact phase, so a user write only needs to reach memory before returning, which keeps I/O performance high.
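The flush/compact cycle above can be illustrated with a toy model. Everything here (the `MiniStore` class and its thresholds) is invented for illustration, not an HBase class: writes land in a sorted in-memory buffer, a full buffer is flushed to an immutable file, and accumulated files are compacted into one, keeping only the newest value per key and dropping deleted keys.

```python
class MiniStore:
    """Toy sketch of the MemStore -> StoreFile -> compact write path."""

    def __init__(self, memstore_limit=3, compact_threshold=3):
        self.memstore = {}            # key -> value (None marks a delete)
        self.storefiles = []          # oldest first; each is an immutable dict
        self.memstore_limit = memstore_limit
        self.compact_threshold = compact_threshold

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.memstore_limit:
            self._flush()

    def delete(self, key):
        self.put(key, None)           # deletes are just tombstone writes

    def _flush(self):
        # The MemStore becomes an immutable, key-sorted "StoreFile".
        self.storefiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}
        if len(self.storefiles) >= self.compact_threshold:
            self._compact()

    def _compact(self):
        # Merge all StoreFiles; newer files win, tombstones are dropped.
        merged = {}
        for sf in self.storefiles:
            merged.update(sf)
        self.storefiles = [{k: v for k, v in sorted(merged.items())
                            if v is not None}]

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for sf in reversed(self.storefiles):   # newest StoreFile first
            if key in sf:
                return sf[key]
        return None
```

Note that a read must consult the MemStore first and then the StoreFiles from newest to oldest; compaction is what keeps the number of files (and thus read cost) bounded.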

Why HLog is needed:
In a distributed environment, system errors and downtime cannot be avoided. If an HRegionServer exits unexpectedly, the in-memory data in its MemStores is lost. The HLog was introduced to prevent exactly this.
Working mechanism:
Each HRegionServer has one HLog object. HLog is a class that implements a write-ahead log: every time a user operation is written to the MemStore, a copy of the data is also appended to the HLog file. The HLog file rolls periodically: a new file is started, and old files whose data has all been persisted to StoreFiles are deleted. When an HRegionServer terminates unexpectedly, HMaster learns of it through Zookeeper. HMaster first processes the leftover HLog files, splitting their log entries by region and placing them in the corresponding region directories, and then reassigns the failed regions. When the HRegionServers that pick up these regions load them, they discover that there is historical HLog data to process, so they replay the HLog entries into the MemStore and then flush to StoreFiles, completing the data recovery.

HBase Storage Formats
All HBase data files are stored on the Hadoop HDFS file system, mainly in two formats:
1 HFile: the storage format for HBase KeyValue data. HFile is a Hadoop binary-format file; in fact, a StoreFile is just a lightweight wrapper around an HFile, i.e. the bottom layer of a StoreFile is an HFile.
2 HLog File: the storage format for HBase's WAL (write-ahead log), which is physically a Hadoop SequenceFile.



Explanation of the figure:
The HFile is variable-length; only two of its blocks have fixed length: Trailer and FileInfo.
The Trailer holds pointers to the starting points of the other blocks.
FileInfo records some meta information about the file, for example: AVG_KEY_LEN, AVG_VALUE_LEN, LAST_KEY, COMPARATOR, MAX_SEQ_ID_KEY, and so on.
The Data Block Index and Meta Block Index record the starting point of each Data Block and Meta Block.
The Data Block is the basic unit of HBase I/O; to improve efficiency, HRegionServer has an LRU-based block cache mechanism.
The size of each Data Block can be specified by a parameter when a table is created: large blocks favor sequential scans, small blocks favor random queries.
Apart from the Magic at its beginning, each Data Block is a concatenation of KeyValue pairs. The Magic content is a random number whose purpose is to detect data corruption.

Each KeyValue pair inside the HFile is a simple byte array. This byte array contains several parts and has a fixed structure.


KeyLength and ValueLength: two fixed-length integers giving the length of the key and of the value, respectively.
Key part: Row Length is a fixed-length value giving the length of the row key, and Row is the row key itself.
Column Family Length is a fixed-length value giving the length of the column family name.
Then comes the column family, then the qualifier, and then two fixed-length values giving the timestamp and the key type (Put/Delete).
The value part has no such complex structure; it is pure binary data.
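The layout just described can be made concrete with Python's `struct` module. The field widths below (2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type) and the key-type codes are illustrative assumptions for this sketch, not the exact on-disk encoding HBase guarantees.

```python
import struct

PUT, DELETE = 4, 8     # example key-type codes (assumed values)

def encode_keyvalue(row, family, qualifier, timestamp, key_type, value):
    """Pack one KeyValue: lengths, then key part, then raw value bytes."""
    key = (struct.pack(">H", len(row)) + row +           # row length + row
           struct.pack(">B", len(family)) + family +     # family length + family
           qualifier +                                   # qualifier
           struct.pack(">Q", timestamp) +                # 8-byte timestamp
           struct.pack(">B", key_type))                  # 1-byte key type
    return struct.pack(">II", len(key), len(value)) + key + value

def decode_keyvalue(buf):
    """Unpack a KeyValue produced by encode_keyvalue."""
    key_len, value_len = struct.unpack_from(">II", buf, 0)
    off = 8
    row_len, = struct.unpack_from(">H", buf, off); off += 2
    row = buf[off:off + row_len]; off += row_len
    fam_len, = struct.unpack_from(">B", buf, off); off += 1
    family = buf[off:off + fam_len]; off += fam_len
    qual_end = 8 + key_len - 9          # key ends with 8-byte ts + 1-byte type
    qualifier = buf[off:qual_end]; off = qual_end
    timestamp, = struct.unpack_from(">Q", buf, off); off += 8
    key_type, = struct.unpack_from(">B", buf, off); off += 1
    value = buf[off:off + value_len]
    return row, family, qualifier, timestamp, key_type, value
```

Note that the qualifier carries no explicit length field: its end is inferred from the total key length minus the fixed-width trailer of the key, which is exactly why the timestamp and key type must be fixed-length.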

HLog File


The HLog file is an ordinary Hadoop SequenceFile. The SequenceFile key is an HLogKey object, which records the origin of the written data: besides the table and region names, it includes a sequence number and a timestamp. The timestamp is the write time; the sequence number starts at 0, or at the last sequence number persisted to the file system.
The value of the HLog SequenceFile is an HBase KeyValue object, corresponding to the KeyValue in the HFile.


Closing note: I collected this article from the internet; it is about as definitive a piece on this part of HBase as you will find. My server source-code series will also follow this order.


I. Introduction of HBASE

HBase is a distributed, column-oriented, open-source database that grew out of the Google paper by Fay Chang, "Bigtable: A Distributed Storage System for Structured Data". Just as BigTable leverages the distributed data storage provided by the Google File System, HBase provides BigTable-like capabilities on top of Hadoop. HBase is a subproject of the Apache Hadoop project. HBase differs from a typical relational database: it is a database suited to storing unstructured data, and it uses a column-based rather than a row-based model.

HBase is an open-source clone of BigTable. It is built on HDFS and provides a highly reliable, high-performance, column-oriented, scalable, real-time read/write database system.

It sits between NoSQL and RDBMS: it can retrieve data only by the primary key (row key) and by ranges of the primary key, and it supports only single-row transactions (complex operations such as multi-table joins can be implemented with the help of Hive). It is mainly used to store unstructured and semi-structured loose data.

Like Hadoop, HBase relies primarily on scaling out, increasing compute and storage capacity by adding inexpensive commodity servers.

Tables in HBase generally have these characteristics:

1 Large: a table can have billions of rows and millions of columns.

2 Column-oriented: storage and permission control are organized by column (family), and column families are retrieved independently.

3 Sparse: columns that are empty (null) take up no storage space, so tables can be designed to be very sparse.

II. Logical View

HBase stores data in the form of tables. A table is made up of rows and columns, and the columns are divided into a number of column families.

Row Key   Column-family1            Column-family2                    Column-family3
          Column1                   Column1   Column2   Column3       Column1
Key1      T1:abc
Key2      T3:abc
Key3      T2:dfadfasd

(T1, T2, T3 denote the timestamps of the cell versions.)


Row Key

As in other NoSQL databases, the row key is the primary key used to retrieve a record. There are only three ways to access a row in an HBase table:

1 Access via a single row key

2 Access via a range of row keys

3 Full table scan

The row key can be any string (the maximum length is 64KB; in practice, lengths are typically 10-100 bytes). Inside HBase, the row key is stored as a byte array.

When stored, data is sorted by the byte (lexicographic) order of the row key. When designing keys, take full advantage of this sorted-storage property and place rows that are often read together next to each other. (Positional locality)


Note: the result of lexicographically sorting the integers 1-100 is 1,10,100,11,12,13,14,15,16,17,18,19,2,20,21,..., 9,91,92,93,94,95,96,97,98,99. To preserve the natural ordering of integers, row keys must be left-padded with zeros.
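The ordering problem above, and the zero-padding fix, can be seen directly:

```python
# Lexicographic (byte) ordering of integer row keys, as HBase would sort them.
nums = [1, 2, 9, 10, 11, 20, 100]

plain = sorted(str(n) for n in nums)          # '10' sorts before '2'
padded = sorted(str(n).zfill(3) for n in nums)  # left-pad with zeros to fix it

print(plain)   # lexicographic order, not numeric order
print(padded)  # padded keys sort in natural numeric order
```

The same principle applies to any fixed-width encoding of numbers in row keys: as long as every key has the same width, byte order and numeric order coincide.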

A single read or write of a row is an atomic operation (no matter how many columns are read or written). This design decision makes it easy for the user to reason about program behavior when concurrent updates are performed on the same row.

Column Family

Every column in an HBase table belongs to some column family. Column families are part of the table's schema (individual columns are not) and must be defined before the table is used. Column names are prefixed with the column family name: for example, courses:history and courses:math both belong to the courses family.

Access control and disk and memory usage statistics are all performed at the column family level. In practice, permission control at the column family level helps us manage different kinds of applications: some applications are allowed to add new basic data, some can read basic data and create derived column families, and some are only allowed to browse data (and perhaps not even all of it, for privacy reasons).

Timestamp

A storage unit identified by a row and a column in HBase is called a cell. Each cell holds multiple versions of the same piece of data, indexed by timestamp. The timestamp type is a 64-bit integer. A timestamp can be assigned by HBase automatically when the data is written, in which case it is the current system time accurate to milliseconds, or it can be assigned explicitly by the client. To avoid version conflicts, an application that assigns its own timestamps must generate values that are unique. Within a cell, the different versions of the data are sorted in reverse chronological order, so the most recent data comes first.

To avoid the management burden (both storage and indexing) of keeping too many data versions, HBase provides two version-recycling policies: keep only the last n versions of the data, or keep only the versions from some recent period (for example, the last seven days). Users can configure these per column family.
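The two recycling policies just described can be sketched as pure functions over a cell's version list. These helpers are hypothetical illustrations, not the HBase API; a version is modeled as a (timestamp, value) pair, and both policies return versions newest-first, matching the reverse-chronological order inside a cell.

```python
def keep_last_n(versions, n):
    """Keep the newest n versions. versions: list of (timestamp, value)."""
    return sorted(versions, key=lambda v: v[0], reverse=True)[:n]

def keep_within_ttl(versions, now, ttl):
    """Keep only versions written within the last `ttl` time units."""
    return [v for v in sorted(versions, key=lambda v: v[0], reverse=True)
            if now - v[0] <= ttl]
```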


A cell is the unit uniquely determined by {row key, column (= <family> + <qualifier>), version}. The data in a cell has no type and is stored entirely as raw bytes.

III. Physical Storage

1 As already mentioned, all rows in a table are arranged in the lexicographic order of their row keys.

2 A table is split into multiple HRegions along the row direction.

3 Regions are split by size. Each table starts with only one region; as data is continually inserted, the region grows, and once it passes a threshold the HRegion splits into two new HRegions. As the rows in a table accumulate, there are more and more HRegions.

4 The HRegion is the smallest unit of distributed storage and load balancing in HBase. Smallest unit means that different HRegions can be distributed across different HRegionServers, but a single HRegion is never split across multiple servers.

5 Although the HRegion is the smallest unit of distribution, it is not the smallest unit of storage.

In fact, an HRegion consists of one or more Stores, one per column family.

Each Store is in turn made up of one MemStore and zero or more StoreFiles.

StoreFiles are saved on HDFS in HFile format.

An HFile is divided into six parts:

Data Block segment – holds the table's data; can be compressed.

Meta Block segment (optional) – holds user-defined key-value pairs; can be compressed.

File Info segment – the HFile's meta information; not compressed. Users can also add their own meta information in this section.

Data Block Index segment – the index of the Data Blocks. The key of each index entry is the key of the first record in the block being indexed.

Meta Block Index segment (optional) – the index of the Meta Blocks.

Trailer – this segment is fixed-length. It saves the offset of every other segment. When an HFile is read, the Trailer is read first (its magic number serves as a safety check); the Trailer stores the starting position of each segment, and the Data Block Index is then loaded into memory. This way, looking up a key does not require scanning the entire HFile: simply find, in memory, the block where the key may reside, read that whole block into memory with a single disk I/O, and then find the key inside it. Data Block Index entries are evicted by an LRU mechanism.
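The index lookup described above amounts to a binary search over the first keys of the blocks: since each index entry holds the first key of its block, the block that may contain a search key is the last one whose first key is not greater than it. A minimal sketch (illustrative, not HBase code):

```python
import bisect

def find_block(first_keys, key):
    """first_keys: the sorted first key of each data block, as stored in
    the Data Block Index. Returns the index of the block whose key range
    could contain `key`, or None if the key sorts before every block."""
    i = bisect.bisect_right(first_keys, key) - 1
    return i if i >= 0 else None
```

One binary search in memory picks the block; one disk read then fetches that whole block, which is why block size trades off sequential-scan throughput against random-read latency.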

The HFile's Data Blocks and Meta Blocks are usually stored compressed, which significantly reduces network and disk I/O at the cost of CPU time for compression and decompression.

HFile currently supports two compression codecs: Gzip and LZO.

HLog (WAL log)

WAL means write-ahead log (http://en.wikipedia.org/wiki/Write-ahead_logging). Similar to the binlog in MySQL, it exists purely for disaster recovery: the HLog records all changes to the data, so if in-memory data is lost, it can be recovered from the log.

Each region server maintains a single HLog, not one per region. Logs for different regions (of different tables) are therefore mixed together. The benefit is that continually appending to a single file, rather than writing many files simultaneously, reduces disk seeks and so improves write performance for tables. The drawback is that if a region server goes offline, recovering the regions it hosted requires splitting its log and distributing the pieces to other region servers for replay.

The HLog file is an ordinary Hadoop SequenceFile. The SequenceFile key is an HLogKey object, which records the origin of the written data: besides the table and region names, it includes a sequence number and a timestamp. The timestamp is the write time; the sequence number starts at 0, or at the last sequence number persisted to the file system. The value of the HLog SequenceFile is an HBase KeyValue object, corresponding to the KeyValue in the HFile, as described above.

IV. System Architecture


Client

1 Contains interfaces for accessing HBase. The client maintains some caches, for example the region location information, to speed up access to HBase.


Zookeeper

1 Guarantees that at any moment there is only one active master in the cluster

2 Stores the addressing entry point of every region

3 Monitors the status of the region servers in real time and notifies the master of region server online and offline events

4 Stores the HBase schema, including which tables exist and which column families each table has


Master

1 Assigns regions to region servers

2 Balances the load across region servers

3 Discovers failed region servers and reassigns their regions

4 Collects garbage files on HDFS

5 Handles schema update requests

Region Server

1 The region server maintains the regions the master assigns to it and handles I/O requests for those regions

2 The region server is responsible for splitting regions that have grown too large during operation

As you can see, the client does not need the master in order to access data in HBase (addressing goes through Zookeeper and the region servers; data reads and writes go to the region servers). The master maintains only the metadata of tables and regions, so its load is light.

V. Key Algorithms and Processes

Region Positioning

How does the system find the region holding a given row key (or row key range)?

BigTable uses a three-level structure, similar to a B+ tree, to store region locations.

The first level is a file stored in Zookeeper that holds the location of the root region.

The second level is the root region: the first region of the .META. table, which stores the locations of the other regions of the .META. table. Through the root region, we can reach all of the .META. table's data.

The third level is .META. itself: a special table that holds the region location information of every user table in HBase.

1 The root region is never split, which guarantees that at most three hops are needed to locate any region.

2 Each row of the .META. table stores the location information of one region; the row key is encoded from the table name plus the region's last row.

3 To speed up access, every region of the .META. table is kept entirely in memory.

Assume that a row in the .META. table occupies about 1KB of memory, and that each region is limited to 128MB.

The number of regions that the three-level structure can address is then:

(128MB/1KB) × (128MB/1KB) = 2^34 regions

4 The client caches the location information it has looked up, and the cache is not invalidated proactively. So if the client's cache becomes stale, it takes up to six network round trips to locate the correct region (three to discover that the cache is invalid and three to fetch the new location information).
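The addressing-capacity arithmetic above can be checked directly:

```python
# One .META. row (~1KB) describes one region; one 128MB region of .META.
# therefore describes 128MB/1KB regions, and the three-level scheme
# squares that.
region_size = 128 * 2**20                     # 128 MB per region
row_size = 2**10                              # ~1 KB per .META. row
rows_per_region = region_size // row_size     # regions described per .META. region
addressable_regions = rows_per_region * rows_per_region
```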

Read and write process

As mentioned earlier, HBase uses the MemStore and StoreFiles to store updates to a table.

When data is updated, it is first written to the log (WAL log) and to memory (MemStore); the data in the MemStore is kept sorted. When the MemStore accumulates to a certain threshold, a new MemStore is created and the old one is added to a flush queue, where a separate thread flushes it to disk, turning it into a StoreFile. At the same time, the system records a redo point in Zookeeper indicating that the changes before this moment have been persisted. (Minor compact)

When the system fails unexpectedly, the data in memory (MemStore) may be lost; the log (WAL log) is then used to recover the data written after the last checkpoint.

As mentioned earlier, a StoreFile is read-only and can never be modified once created. So updating in HBase is really a matter of continual appending. When the StoreFiles in a Store reach a certain threshold count, they are merged (major compact): the changes to each key are combined into one large StoreFile. When that StoreFile's size passes a certain threshold, the StoreFile is split into two StoreFiles.

Since updates to a table are continually appended, processing a read request requires accessing all of the StoreFiles and the MemStore of a Store and merging them by row key. Because the StoreFiles and the MemStore are each sorted, and StoreFiles carry in-memory indexes, this merge is still reasonably fast.
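The merging read just described can be sketched with `heapq.merge`: each source (the MemStore and every StoreFile) is already sorted by key, so a read is a k-way merge that keeps the newest edit per row key. The `(key, seq, value)` record shape, with a higher sequence number meaning newer, is an illustrative assumption, not the HBase data structure.

```python
import heapq
from itertools import groupby

def merged_read(sources):
    """sources: list of key-sorted lists of (row_key, seq, value), one per
    MemStore/StoreFile; higher seq means a newer edit. Returns the newest
    value per row key."""
    merged = heapq.merge(*sources)                 # single sorted stream
    result = {}
    for key, group in groupby(merged, key=lambda kv: kv[0]):
        # All edits for one row key are adjacent; keep the newest one.
        result[key] = max(group, key=lambda kv: kv[1])[2]
    return result
```

Because every source is pre-sorted, the merge is linear in the total number of entries touched; this is why keeping each StoreFile and the MemStore sorted matters so much to read performance.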

Write Request Processing Flow

1 The client submits a write request to the region server

2 The region server locates the target region

3 The region checks whether the data is consistent with the schema

4 If the client did not specify a version, the current system time is used as the data version

5 The update is written to the WAL log

6 The update is written to the MemStore

7 The server determines whether the MemStore needs to be flushed to a StoreFile

Region distribution

At any one time, a region can be assigned to only one region server. The master keeps track of which region servers are currently available, which regions are assigned to which region servers, and which regions are still unassigned. When there is an unassigned region and a region server with space available, the master sends a load request to that region server, assigning the region to it. Once the region server has accepted the request, it begins serving that region.

Region Server Online

The master uses Zookeeper to track region server state. When a region server starts, it first creates a file representing itself in the server directory on Zookeeper and obtains an exclusive lock on that file. Because the master subscribes to change notifications on the server directory, it is notified by Zookeeper in real time whenever a file in that directory is added or deleted. So as soon as a region server comes online, the master learns of it immediately.

Region Server Offline

When a region server goes offline, it disconnects its session with Zookeeper, and Zookeeper automatically releases the exclusive lock on the file that represents that server. The master continually polls the lock status of the files in the server directory. If the master finds that a region server has lost its exclusive lock (or if the master fails several consecutive attempts to communicate with the region server), the master tries to acquire the read-write lock representing that region server; once it succeeds, it can conclude that either:

1 The network between the region server and Zookeeper is down, or

2 The region server itself is down.

In either case, the region server can no longer serve its regions, so the master deletes the file representing that region server from the server directory and assigns that server's regions to other region servers that are still alive.

If a transient network problem caused the region server to lose its lock, then after the region server reconnects to Zookeeper, as long as the file representing it still exists, it keeps trying to reacquire the lock on that file; once it succeeds, it can continue serving.

Master Online

After starting, the master performs the following steps:

1 Acquires the unique master lock from Zookeeper, preventing any other master instance from becoming the active master.

2 Scans the server directory on Zookeeper to obtain the list of currently available region servers.

3 Communicates with each region server found in step 2 to learn the current assignment of regions to region servers.

4 Scans the set of .META. regions, computes which regions are currently unassigned, and adds them to the list of regions to be assigned.

Master Offline

Because the master maintains only the metadata of tables and regions and does not participate in the table data I/O path, a master going offline merely freezes all metadata modifications (tables cannot be created or deleted, table schemas cannot be modified, region load cannot be balanced, regions cannot be brought online or offline, and regions cannot be merged; the one exception is that region splits still proceed normally, since only the region server participates in them), while data reads and writes to tables continue normally. Therefore, a master outage has no effect on the HBase cluster for a short period of time. As the online process shows, the information the master saves is all redundant (it can be collected from, or computed out of, other parts of the system). So an HBase cluster always has one master providing service, and several other 'masters' waiting for the chance to take its place.

