NetEase Video Cloud: HBase principles and design

Source: Internet
Author: User
Tags: compact, md5, hash, mongodb, sharding

NetEase Video Cloud is NetEase's video cloud service platform. It provides customers with easy-to-use video cloud services, a comprehensive end-to-end solution, and full guidance from technical experts. Below, a technical expert from NetEase Video Cloud shares an overview of HBase principles and design.

Brief introduction

HBase, short for "Hadoop Database", is an open-source implementation of Google's Bigtable. From the outset it was designed to provide reliable distributed data storage for high-speed access to massive data sets on clusters of inexpensive machines. In terms of functionality, HBase is a database that provides data storage and read services, just like the familiar Oracle, MySQL, and MSSQL. From the application point of view, however, HBase differs from a conventional database: its access interface is quite simple, it does not support complex data access, and it does not support SQL or other structured query languages. HBase has no index other than the rowkey; all data distribution and all queries depend on the rowkey, so HBase places very strict requirements on table schema design. As a distributed database, HBase most closely resembles MongoDB's sharding mode: data is distributed to different storage nodes according to key ranges. Where MongoDB relies on config servers to determine which shard holds a piece of data, HBase asks ZooKeeper for the address of the -ROOT- table, obtains the .META. table from -ROOT-, and then uses .META. to find the region in which the data is stored.
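
To make the rowkey-only access model concrete, here is a minimal sketch of a write and a read through the HBase Java client. It assumes the standard HBase client API; the ZooKeeper hosts, the table name user_profile, and the column family info are illustrative placeholders, not anything from the original article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class RowkeyAccessExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The client only needs the ZK quorum; region locations are resolved from there.
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");   // hypothetical hosts

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_profile"))) { // hypothetical table

            // Write: everything is addressed by rowkey; there is no secondary index.
            Put put = new Put(Bytes.toBytes("u0001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Read: again strictly by rowkey.
            Get get = new Get(Bytes.toBytes("u0001"));
            Result r = table.get(get);
            System.out.println(Bytes.toString(
                    r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```
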

Architecture

As mentioned above, HBase has a distributed architecture. Apart from the underlying HDFS storage, HBase itself can be divided into three functional groups: the ZooKeeper group, the Master group, and the RegionServer group.

    • ZooKeeper group: an indispensable part of an HBase cluster. It is mainly used to store the Master address, coordinate events such as Master and RegionServer startup and shutdown, hold temporary data, and so on.
    • Master group: the Master mainly performs management operations, such as region assignment and the execution of administrative commands. Ordinary data reads and writes do not go through the Master, so the Master generally does not need a powerful machine.
    • RegionServer group: the RegionServers are where the data actually lives. Each RegionServer hosts a number of regions, and each region holds the data for a contiguous range of rowkeys. The overall structure is shown below:

HBase structure diagram

ZooKeeper (ZK) is itself a cluster, usually consisting of an odd number of ZK servers. Running multiple Masters is also recommended for service availability: the Master initiates every management operation, so if the only Master suffers an unplanned outage the cluster can no longer be managed. With multiple Masters there is naturally an active/standby distinction. How is the active one chosen? Whichever Master manages to acquire the lock on the master znode in ZK becomes the active Master; the other, standby Masters keep contending for that lock. If the active Master goes down unexpectedly, a standby Master quickly acquires the lock on the master znode and takes over the service.
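
To illustrate the lock-based election described above, the following is a minimal sketch of ephemeral-node election using the plain ZooKeeper Java API. It is not HBase's actual Master implementation; the znode path /hbase/master and the simplified retry handling are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class MasterElectionSketch {
    // Hypothetical znode path used as the "lock"; purely illustrative.
    private static final String MASTER_ZNODE = "/hbase/master";

    public static boolean tryBecomeActiveMaster(ZooKeeper zk, String serverName) throws Exception {
        try {
            // An ephemeral node acts as the lock: it disappears automatically
            // when the session of the master that created it dies.
            zk.create(MASTER_ZNODE, serverName.getBytes(StandardCharsets.UTF_8),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;            // we won the election
        } catch (KeeperException.NodeExistsException e) {
            // Somebody else is already the active master; watch the znode so we
            // are notified when it disappears and can retry the election.
            Stat stat = zk.exists(MASTER_ZNODE, event -> {
                if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                    // a standby would re-run tryBecomeActiveMaster() here
                }
            });
            return stat == null;    // raced: the node vanished between create and exists
        }
    }
}
```
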
In non-replication mode each RegionServer (RS) is unique within the cluster, that is, every RS stores different data, so compared with ZK and the Master the RS is not highly available in this mode and is at least exposed to single points of failure. However, because HBase stores data in regions and regions can be migrated between servers, an ordinary RS failure can be recovered from quickly and cheaply: only the regions that were on that RS are briefly inaccessible, until their migration completes. The consequences are more serious if the failed RS was hosting the region of the -ROOT- or .META. table, because every new request must be routed through the .META. table to find the address of the RS serving the target region; until that region is back online, new lookups across the whole cluster are affected.

Data organization

In the overall architecture, ZK is used for service coordination, for holding some state needed while the cluster is running, and for locating the -ROOT- table; the Master handles internal cluster management; that leaves the RS, which does the actual data processing.
The RS is where data is processed, so how is data organized inside an RS? The RS itself is just a container that runs a number of worker threads, such as the data-merge (compaction) threads and the StoreFile split threads. The main objects inside this container are regions. A region is a slice of a table delimited by a rowkey range: a table can be divided into several regions, and those regions can be distributed across different RSs according to their rowkey ranges (they can also sit on the same RS, but that is not recommended). One RS can host regions belonging to several tables, or only some of the regions of a single table; RS and table are two different concepts.
There is one more concept: the column family. People who know a little about HBase have usually heard that it is a column-oriented store, but the "column" involved in storage is not the same as a column in a conventional database. The unit that matters here is the column family which, as the name implies, is a collection of columns. In storage, data belonging to different column families is always kept separately: even within the same region, different column families are stored in different directories. The benefit is that when we define column families we usually put similar data into the same family; storing families separately helps data compression, and HBase itself supports several compression algorithms.
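
As a sketch of how column families and their per-family compression are declared, the following uses the HBase 2.x Admin API (TableDescriptorBuilder / ColumnFamilyDescriptorBuilder). The table name, family names, and compression codecs are illustrative choices, and Snappy/GZ must of course be available on the cluster; older HBase versions use HTableDescriptor/HColumnDescriptor instead.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: two column families, each stored (and compressed) separately on disk.
public class CreateTableSketch {
    public static void createTable(Connection conn) throws Exception {
        try (Admin admin = conn.getAdmin()) {
            ColumnFamilyDescriptor cfInfo = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("info"))
                    .setCompressionType(Compression.Algorithm.SNAPPY)  // per-family compression
                    .build();
            ColumnFamilyDescriptor cfLogs = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("logs"))
                    .setCompressionType(Compression.Algorithm.GZ)
                    .build();
            TableDescriptor desc = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("user_profile"))     // hypothetical table
                    .setColumnFamily(cfInfo)
                    .setColumnFamily(cfLogs)
                    .build();
            admin.createTable(desc);
        }
    }
}
```
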

Principle

The previous section described the overall architecture of HBase: it is composed of ZK, the Master, and the RSs. This section introduces how HBase works, from data access and RS routing to the RS-internal cache, data storage and flushing, and region compaction and splitting.

Regionserver positioning

Applications access HBase through the HBase client (or API). The address an HBase cluster exposes to the outside is in fact the ZK entry point. As introduced earlier, ZK stores the location of the RS that hosts -ROOT-; from the -ROOT- table the client obtains the .META. table, and from .META. it learns how regions are distributed across the RSs. The whole region addressing process works as follows:

RS Positioning process

    1. First, the client asks ZK for the location of the target data.
    2. ZK stores the address of the -ROOT- table, so the client next queries the -ROOT- table.
    3. Similarly, -ROOT- holds the location of .META.; by reading the .META. table the client can find the specific RS.
    4. The .META. table is queried for the RS that serves the target region, and that RS address is returned to the client.
    5. Once the client has the target address, it sends the data request directly to that RS.

The process above is really a three-level index: get the -ROOT- location from ZK, get the .META. location from -ROOT-, and finally look up the RS address in .META. and cache it. Two questions arise:

    • Since ZK can store the -ROOT- information, why not store the .META. information directly in ZK instead of locating it through the -ROOT- table?
    • After the client has found the target address, does every subsequent request have to repeat the ZK -> -ROOT- -> .META. lookup?

Let's answer the first question: why not save the .META. table information directly in ZK? Mainly because of the volume of data involved. ZK is not suited to storing large amounts of data, while .META. stores the mapping between regions and RSs, and there is no hard limit on the number of regions; as long as memory allows there can be a great many of them. Storing all of that in ZK would put it under considerable pressure. Off-loading the data to an RS via the -ROOT- table is therefore a reasonable solution, and compared with storing it directly in ZK, the extra -ROOT- lookup costs very little in practice.
The second question: does every access have to go through ZK -> -ROOT- -> .META.? Of course not. The client keeps a cache: the first time it looks up the RS for a region, it caches that information, and subsequent accesses take the RS address straight from the cache. There is one exception: if the region has moved to another RS, for example because the balancer relocated it, a request sent to the cached address fails; on that exception the client goes through the lookup process again to obtain the new RS address. Region moves happen only rarely and usually affect few regions, so over the life of the cluster their impact on access latency can be ignored.
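
For reference, the Java client exposes this region lookup (and its cache) through the RegionLocator interface. The sketch below assumes the standard client API and a hypothetical table user_profile; it asks which RS currently serves a rowkey, and the reload flag forces a fresh lookup, which is effectively what the client does after a stale-cache exception.

```java
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: asking the client which RegionServer currently serves a given rowkey.
public class LocateRegionSketch {
    public static void printLocation(Connection conn, String row) throws Exception {
        try (RegionLocator locator = conn.getRegionLocator(TableName.valueOf("user_profile"))) {
            // 'false' returns the cached location if present; 'true' would force a fresh lookup.
            HRegionLocation loc = locator.getRegionLocation(Bytes.toBytes(row), false);
            System.out.println(row + " -> " + loc.getHostname() + ":" + loc.getPort());
        }
    }
}
```
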

Region Data Write

Once HBase has obtained the RS address via ZK -> -ROOT- -> .META., data is written directly to that RS. The whole process looks like this:

Regionserver Data manipulation Process

After obtaining the RS address through the three-level index, the client writes the data to the corresponding region on that RS. HBase writes data in WAL (write-ahead log) fashion: the log is written first, then the data. HBase is an append-only database without the complicated operations of a relational database, so what the HLog records are simple put operations (delete/update operations are also converted to puts).

HLog write

The HLog is the log HBase produces as its WAL. It is a simple sequential log: all regions on an RS share one HLog, and every write to any region on that RS is recorded in it. The main purpose of the HLog is to recover as much data as possible when the RS crashes. "As much as possible" because, to improve performance, clients often turn off automatic HLog flushing, leaving the log entirely to the operating system; if the RS crashes unexpectedly, log records that have not yet been fsynced in that short window are lost.
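
The durability trade-off described above can be chosen per write in the Java client via the Durability setting on a Put. The sketch below is illustrative only; the rowkey and column values are placeholders.

```java
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: choosing how strictly the WAL (HLog) is persisted for a single write.
public class WalDurabilitySketch {
    public static void write(Table table) throws Exception {
        Put put = new Put(Bytes.toBytes("u0001"));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));

        // SYNC_WAL: the edit is synced to the WAL before the write returns (safer, slower).
        // ASYNC_WAL: the edit may sit in an unsynced buffer for a short window, so a
        // crash can lose those edits -- exactly the trade-off described above.
        put.setDurability(Durability.ASYNC_WAL);
        table.put(put);
    }
}
```
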

HLog expiration

Heavy write traffic makes the HLog occupy more and more storage space, so HBase cleans it up through HLog expiration: a cleaner thread checks the HLogs periodically, with a cycle that can be configured via hbase.master.cleaner.interval.
Once an HLog's data has been flushed from the memstores to the underlying storage, that HLog is no longer needed and is moved to the .oldlogs directory. The cleaner thread watches the HLogs in that directory and deletes any HLog that has passed the expiration threshold set by hbase.master.logcleaner.ttl.

Memstore Data storage

The memstore is the region's internal write cache; its size is configured by the HBase parameter hbase.hregion.memstore.flush.size. After the RS has written the HLog, the next destination of the write is the region's memstore. Inside HBase, memstores are organized as an LSM-tree structure, so a large number of updates to the same rowkey can be merged.
It is thanks to the memstore that HBase writes are effectively asynchronous and perform very well: once the data is in the memstore, the write request can return and HBase considers the write successful. One thing to note is that data in the memstore is kept sorted by rowkey, which makes subsequent lookups easier.

Data flush

Under certain conditions the data in the memstore is flushed to disk, persisting it to the underlying storage device. A memstore flush can be triggered in several different ways, as shown below:

MemStore flush process

Any of the following can trigger a memstore flush, though each trigger works differently:

  • 1. Global memory control triggers a memstore flush. The total memory the memstores may consume is capped by hbase.regionserver.global.memstore.upperLimit. Reaching that cap does not mean flushing continues indefinitely: memstores are flushed until memory drops back to the value configured by hbase.regionserver.global.memstore.lowerLimit, and then flushing stops. This is done mainly to prevent long flush phases from hurting overall performance.
  • In this case none of the individual region memstores on the RS has reached its own flush condition, but the overall memory consumption has reached a dangerous level; if writes continued, the RS would likely hit an OOM, so memstores are flushed to free memory.
  • 2. Manually triggered memstore flush.
  • HBase provides an API so that a memstore flush can be requested by an external caller (see the sketch after this list).
  • 3. Memstore size threshold triggers a flush.
  • As mentioned earlier, the memstore size is set by hbase.hregion.memstore.flush.size; when the amount of data in a region's memstore reaches that value, a flush is triggered automatically.
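
Putting the triggers together, the sketch below shows the configuration knobs named above alongside a manual flush through the Admin API. The concrete values (128 MB, 0.4/0.35) and the table name are example assumptions, and the global-limit property names follow the older upperLimit/lowerLimit style used in this article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

// Sketch: flush-related knobs plus a manual flush via the Admin API.
public class FlushSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Per-region flush threshold (trigger 3); 128 MB is only an example value.
        conf.set("hbase.hregion.memstore.flush.size", String.valueOf(128L * 1024 * 1024));
        // Global memstore limits (trigger 1), as named in the text above.
        conf.set("hbase.regionserver.global.memstore.upperLimit", "0.4");
        conf.set("hbase.regionserver.global.memstore.lowerLimit", "0.35");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Manual flush (trigger 2): flush every memstore of the given table.
            admin.flush(TableName.valueOf("user_profile"));
        }
    }
}
```
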
Effect of a flush

Memstores are flushed under the various conditions above, so what effect does a flush have on the region and on writes to it while it is in progress? The direct effect is that from the start of the flush to its end, access to the region is blocked, mainly because at the end of the flush the RS takes a snapshot of the region and performs a checkpoint on the HLog, informing ZK which HLogs can be moved to .oldlogs. As the diagram above shows, the region takes an update lock at the start of the memstore flush and releases it when the flush finishes.

StoreFile

When a flush is triggered, the memstore contents are written to the underlying storage, and each flush produces a storage file, an HFile; the StoreFile is simply HBase's lightweight wrapper around an HFile. As data keeps arriving, memstores flush frequently and every flush produces another HFile, so the number of HFiles on the underlying storage keeps growing. Whether on HDFS or on an ordinary Linux file system such as ext4 or XFS, many small files are managed less efficiently than a few large ones: opening many small files consumes more file handles, and looking up the data for a given rowkey across a large number of small files is slower than searching a small number of large files, and so on.

Compact

A large number of HFiles consumes more file handles and greatly reduces query efficiency on the RS. To solve this problem, HBase introduces the compact operation: the RS merges many small HFiles into large ones.
Compactions on an RS come in two kinds, depending on what they do: the minor compact and the major compact.

    • Minor Compact

The minor compact, also called the small compact, runs frequently while the RS is operating. It is controlled mainly by the parameter hbase.hstore.compactionThreshold, which sets how many HFiles must accumulate before a minor compact is triggered. A minor compact only selects a subset of the smaller HFiles to merge, and the selected HFiles may not exceed the size set by hbase.hregion.max.filesize.

    • Major Compact

The major compact, also called the large compact, by contrast compacts all the HFiles of the same column family within a region: after a major compact completes, the HFiles of a column family have been merged into one. A major compact is a long-running process and puts considerable pressure on the underlying I/O.
Besides merging HFiles, another important job of the major compact is to clean up expired or deleted data. As mentioned earlier, deletes in HBase are also written as appends: once data is deleted, it is merely marked as deleted internally, and nothing is cleaned up at the storage level. Only when the HFiles are rewritten during a major compact is data that is marked as deleted actually removed.
Compaction runs in dedicated threads and normally does not affect write performance on the RS, with one exception: when compaction cannot keep up with the rate at which HFiles are created in a region, the RS, for safety, blocks writes once the HFile count reaches a certain limit, until compaction has brought the number of HFiles back down.
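
Both kinds of compaction can also be requested by hand through the Admin API, which is handy for forcing the cleanup described above during quiet hours. The sketch below assumes the standard Java client and a hypothetical table name.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;

// Sketch: triggering compactions by hand through the Admin API.
public class CompactionSketch {
    public static void compact(Connection conn) throws Exception {
        try (Admin admin = conn.getAdmin()) {
            TableName table = TableName.valueOf("user_profile");  // hypothetical table
            admin.compact(table);       // request a (minor) compaction
            admin.majorCompact(table);  // request a major compaction: merges all HFiles
                                        // per column family and drops deleted/expired cells
        }
    }
}
```
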

Split

Compaction merges multiple HFiles into a single HFile, and as data keeps arriving that single HFile keeps growing. Many small HFiles hurt query performance, but so do overly large ones: the bigger an HFile is, the longer it takes to find the data for a given rowkey inside it. HBase therefore also provides region splitting to address the long query times caused by oversized HFiles.
A split turns a large region into two smaller regions, called daughters; the data is generally divided at the midpoint of the region's rowkey range. The split process looks roughly like this:

Region Split Process

  1. The region first marks itself as splitting in ZK.
  2. The Master detects the change in the region's state.
  3. The region creates a new .split folder under its storage directory to hold the information of the daughter regions produced by the split.
  4. The parent region stops accepting writes and triggers a flush so that all data already written to it is persisted.
  5. Two new regions, daughter A and daughter B, are created under the .split folder.
  6. Daughter A and daughter B are copied to the HBase root directory, forming two new regions.
  7. The parent region updates the .META. table, goes offline, and no longer serves requests.
  8. Daughter A and daughter B come online and start serving requests.
  9. If balance_switch is enabled, the regions produced by the split may be redistributed.

Steps 1 to 9 above are the whole region split process. A split is very fast, basically finishing within seconds; how can the region's data be reorganized in such a short time?
In fact the split only divides the region in two logically and does not reorganize the underlying data at all. After the split, the parent region is not destroyed but taken offline, no longer serving requests. The newly created daughter A and daughter B initially hold only references to the parent region's data; the parent's data is actually cleaned up later, during a major compact of daughter A and daughter B, and the parent region itself is removed only once no references to its data remain.
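
A split can also be requested explicitly instead of waiting for the size-based trigger. The following sketch uses the Admin API's split call; the table name and the split point "m" are hypothetical and simply divide the key space in two.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: requesting a split by hand instead of waiting for the size-based trigger.
public class SplitSketch {
    public static void split(Connection conn) throws Exception {
        try (Admin admin = conn.getAdmin()) {
            // Split the region containing the given rowkey at that explicit point.
            admin.split(TableName.valueOf("user_profile"), Bytes.toBytes("m"));
        }
    }
}
```
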

HBase Design

HBase is a distributed database whose performance depends mainly on sensible table design and on allocating resources appropriately.

Rowkey Design

The rowkey is the basis of HBase's distribution: HBase divides regions by rowkey ranges. A basic requirement of a distributed system is that access should not show obvious hot spots at any time, so rowkey design is very important. The usual advice is to hash the leading part of the rowkey (for example with MD5) so that the heads of the rowkeys are as evenly distributed as possible. Values with an obvious ordering, such as timestamps or user IDs, should never be used directly as the leading part of a rowkey.
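
A common way to apply this advice is to salt the key with a short hash prefix. The sketch below shows one possible scheme, not a prescribed format: the two-byte MD5 prefix, the "|" separator, and the userId/timestamp fields are all illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch: prefixing a naturally ordered key (userId + timestamp) with a short hash
// so that writes spread evenly across regions instead of piling up on one.
public class RowkeySketch {
    public static byte[] saltedRowkey(String userId, long timestamp) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(userId.getBytes(StandardCharsets.UTF_8));
        StringBuilder prefix = new StringBuilder();
        for (int i = 0; i < 2; i++) {                              // keep 2 bytes = 4 hex chars
            prefix.append(String.format("%02x", digest[i] & 0xff));
        }
        // e.g. "a3f1|user42|1700000000000": the head is uniformly distributed,
        // while the tail keeps the original key readable for per-user scans.
        String key = prefix + "|" + userId + "|" + timestamp;
        return key.getBytes(StandardCharsets.UTF_8);
    }
}
```
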

Column family design

When designing an HBase table there are different choices for different needs. For tables that serve online queries, try not to define many column families: as described earlier, different column families are stored separately, so a multi-family design means more files must be read per query, consuming more I/O.

TTL design

Choosing an appropriate data expiration time is another point to consider in table design. HBase allows an expiration time (TTL) to be defined on a column family; once data has expired, it can be removed by a major compact. Leaving large amounts of useless historical data around makes regions grow and hurts query efficiency.
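
With the HBase 2.x API, the TTL is declared on the column family descriptor. The sketch below assumes a hypothetical family named logs and a 30-day TTL.

```java
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: a per-column-family TTL; expired cells are dropped at major compaction.
public class TtlSketch {
    public static ColumnFamilyDescriptor familyWithTtl() {
        return ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("logs"))   // hypothetical family
                .setTimeToLive(30 * 24 * 3600)       // 30 days, in seconds
                .build();
    }
}
```
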

Region design

Region size is a trade-off: a large region splits less often but takes longer to major-compact; conversely, a small region means major compacts run relatively often, but because the region is small each major compact finishes relatively quickly, which speeds up the cleanup of expired data.
Of course, small regions also mean a higher risk of splits: once a region's capacity is too small and its data volume reaches the limit, the region has to split. In practice, splits during normal operation are something we would rather avoid, because once a split happens it involves reorganizing data, redistributing regions, and a series of related issues. These problems should therefore be taken into account at design time, and splits should be avoided as far as possible while the cluster is running.
HBase can avoid splits at runtime by pre-allocating regions when the table is created: allocate enough regions up front, and at least some of the data will expire and be cleaned up by major compacts before any region reaches its size limit, so the data volume per region stays roughly balanced.
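
Pre-allocation is done by passing split keys when the table is created. The sketch below assumes the hashed-prefix rowkeys from the earlier example and creates 16 regions spread evenly over the two-hex-digit key space; the table and family names are placeholders.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: pre-splitting at creation time so the table starts with enough regions
// and (ideally) never has to split at runtime.
public class PreSplitSketch {
    public static void create(Connection conn) throws Exception {
        try (Admin admin = conn.getAdmin()) {
            TableDescriptor desc = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("user_profile"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("info"))
                    .build();
            // 15 split keys ("10", "20", ..., "f0") -> 16 regions over the hashed key space.
            byte[][] splitKeys = new byte[15][];
            for (int i = 1; i <= 15; i++) {
                splitKeys[i - 1] = Bytes.toBytes(String.format("%02x", i * 16));
            }
            admin.createTable(desc, splitKeys);
        }
    }
}
```
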
The number of regions also has to respect memory limits. As introduced earlier, each region contains one memstore per column family, so the number of memstores on an RS depends on the number of regions it hosts and on the number of column families in each region. The memstore memory consumed on one RS is:

Memory = memstore flush size * number of regions * number of column families per region

If you do not estimate the data volume in advance and pre-allocate regions accordingly, the continual creation of new regions through splitting can easily exhaust memory and lead to an OOM. For example, with a 128 MB flush size, 100 regions, and two column families per region, the memstores alone can claim up to 128 MB * 100 * 2, roughly 25 GB of heap, if they all fill up.

