The location of HBase in the ecosystem
The logical view of HBase storage
Storage format for HBase
HBase write data flow
HBase quick response data
The location of HBase in the ecosystem
HBase sits in the structured-storage layer: Hadoop HDFS provides highly reliable underlying storage for HBase, Hadoop MapReduce provides HBase with high-performance computing power, and ZooKeeper provides a stable coordination service and failover mechanism for HBase.
The logical view of HBase storage
1) Row key (RowKey)
--The row key is a byte array; any string can be used as a row key;
--Rows in a table are sorted by row key, in byte order;
--All access to a table goes through the row key (single-RowKey access, RowKey range access, or full table scan); secondary indexes are not natively supported;
2) Column family (ColumnFamily)
--Column families (CFs) must be declared when the table is defined;
--Each CF can have one or more column members (column qualifiers); qualifiers do not need to be declared when the table is defined, and new ones can be added dynamically, on demand;
--Data is stored separately per CF; HBase's so-called column-oriented storage is really separate storage per CF (one Store per CF), a design well suited to data-analysis scenarios;
3) Timestamp (TimeStamp)
--Each cell may hold multiple versions of a value, distinguished by timestamp;
4) Cell
--A cell is uniquely determined by (row key, column family:qualifier, timestamp); all cell data is stored as raw bytes;
5) Region
--HBase automatically partitions a table horizontally (by row) into multiple regions; each region holds a contiguous range of rows of the table;
--Each table starts with a single region; as data is inserted the region grows, and once it exceeds a threshold it splits into two new regions;
--As the rows in a table keep increasing, there are more and more regions, so a complete table ends up stored across multiple regions.
--The HRegion is the smallest unit of distributed storage and load balancing in HBase (default 256 MB). "Smallest unit" means that different HRegions can be distributed across different HRegionServers, but a single HRegion is never split across multiple servers.
Characteristics:
Schema-less: each row has a sortable primary key and an arbitrary number of columns; columns can be added dynamically as needed, and different rows in the same table can have entirely different columns;
Column-oriented: storage and access control are per column (family), and column (families) can be retrieved independently;
Sparse: empty (NULL) columns take up no storage space, so tables can be designed to be very sparse;
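To make the logical model above concrete, here is a minimal sketch using the HBase Java client (HBase 1.x-style calls; the table name "user_info", the column family "cf", and the values are hypothetical). It writes one cell addressed by (row key, column family:qualifier, timestamp) and reads it back:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class LogicalViewExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_info"))) {  // hypothetical table

            // A cell is addressed by (row key, column family:qualifier, timestamp).
            Put put = new Put(Bytes.toBytes("row-0001"));           // row key (byte array)
            put.addColumn(Bytes.toBytes("cf"),                      // column family, declared at table creation
                          Bytes.toBytes("name"),                    // qualifier, added on demand
                          System.currentTimeMillis(),               // explicit timestamp (version)
                          Bytes.toBytes("alice"));                  // value, stored as raw bytes
            table.put(put);

            // Read the cell back, asking for up to 3 versions of the column.
            Get get = new Get(Bytes.toBytes("row-0001"));
            get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"));
            get.setMaxVersions(3);
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"))));
        }
    }
}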
Storage format for HBase
Each table in HBase is split by row-key range into multiple sub-tables (HRegions); by default, an HRegion that grows beyond 256 MB is split in two. HRegions are managed by HRegionServers, and which HRegions are assigned to which servers is managed by the HMaster.
When an HRegionServer opens a sub-table, it creates an HRegion object and then a Store instance for each column family of that table. Each Store has zero or more StoreFiles associated with it, and each StoreFile corresponds to an HFile, which is the actual storage file. As a result, an HRegion has as many Stores as the table has column families. In addition, each Store has a MemStore instance; the MemStore is held in memory, while StoreFiles are stored on HDFS.
Although the region is the smallest unit of distributed storage, it is not the smallest unit of storage. A region consists of one or more Stores, one Store per column family; each Store consists of a MemStore and zero or more StoreFiles, and a StoreFile is backed by an HFile. The MemStore lives in memory and StoreFiles live on HDFS.
HBase is a BigTable-style, column-oriented, distributed storage system. Its storage design follows the memtable/SSTable model and is split into two parts: one in memory, the MemStore (the memtable), and one on disk (that is, on HDFS), the HFile (the SSTable). There is also the storage of the WAL log, implemented mainly by the HLog class.
Essentially, the MemStore is an in-memory map holding key/value pairs; when a MemStore is full (default 64 MB), it starts a flush-to-disk operation.
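The flush threshold is governed by the hbase.hregion.memstore.flush.size property; a minimal client-side sketch of setting it (the 64 MB value simply mirrors the default quoted above; newer releases ship a larger default):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class MemStoreConfig {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Region-level MemStore flush threshold (bytes); when a MemStore reaches
        // this size it is flushed to a new StoreFile/HFile on HDFS.
        conf.setLong("hbase.hregion.memstore.flush.size", 64L * 1024 * 1024);
        System.out.println(conf.get("hbase.hregion.memstore.flush.size"));
    }
}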
HBase stores two main file types on HDFS:
1. HFile, the storage format for HBase KeyValue data. HFile is a Hadoop binary file format; a StoreFile is a lightweight wrapper around an HFile, i.e. the bottom layer of a StoreFile is an HFile.
2. HLog file, the storage format of HBase's WAL (Write Ahead Log), which is physically a Hadoop SequenceFile.
HFile structure:
Data Block: stores the table data; this section can be compressed.
Meta Block (optional): stores user-defined key/value pairs; can be compressed.
File Info: the HFile's meta-information; not compressed; fixed length.
Data Block Index: the index of the Data Blocks, recording the starting offset of each Data Block.
Meta Block Index (optional): the index of the Meta Blocks, recording the starting offset of each Meta Block.
Trailer: fixed length. It stores the offset of each section. When an HFile is read, the Trailer is read first; the Trailer holds pointers to the starting position of every other section (each section's magic number is used as a sanity check). The Data Block Index is then loaded into memory, so that when a key is looked up the whole HFile does not have to be scanned: the block containing the key is located in memory, that whole block is read into memory with a single disk I/O, and the required key is then found inside it. The Data Block Index is evicted using an LRU mechanism.
The Data Blocks and Meta Blocks of an HFile are typically stored compressed. The Data Block is the basic unit of HBase I/O, and for efficiency the HRegionServer uses an LRU-based block-cache mechanism. The size of each Data Block can be specified when the table is created: large blocks favor sequential scans, while small blocks favor random queries. Apart from the Magic number at its start, each Data Block is a concatenation of KeyValues; the Magic content is a random number whose purpose is to guard against data corruption.
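Since block size and compression are chosen per column family at table-creation time, here is a minimal sketch using the HBase 1.x-style admin API (the table name "events", the family name "d", and the chosen values are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {

            HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf("events"));
            HColumnDescriptor cf = new HColumnDescriptor("d");
            cf.setBlocksize(64 * 1024);                      // larger blocks favor sequential scans,
                                                             // smaller blocks favor random reads
            cf.setCompressionType(Compression.Algorithm.GZ); // Data/Meta blocks stored compressed
            tableDesc.addFamily(cf);
            admin.createTable(tableDesc);
        }
    }
}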
The KeyValue structure in HFile
Each key-value pair in an HFile is a simple byte array, but this byte array carries a lot of information and follows a fixed structure (somewhat like a serialized data stream).
It starts with two fixed-length numbers giving the length of the key and the length of the value, respectively. Next comes the key, which begins with a fixed-length value giving the length of the RowKey, followed by the RowKey itself, then a fixed-length value giving the length of the column family, followed by the column family, followed by the qualifier (the column), and finally two fixed-length values representing the timestamp and the key type (Put/Delete). The value part is comparatively simple: it is pure binary data.
HBase thus maintains a multilevel index for each value, namely <key, column family, column qualifier, timestamp>.
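These are exactly the fields the client API exposes on every Cell. A minimal, cluster-free sketch that builds a KeyValue directly and reads its components back (all names and values are illustrative):

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyValueFields {
    public static void main(String[] args) {
        // Build a KeyValue directly from (row, family, qualifier, timestamp, type, value),
        // i.e. the same fields laid out in the HFile key/value byte array.
        Cell cell = new KeyValue(Bytes.toBytes("row-0001"),
                                 Bytes.toBytes("cf"),
                                 Bytes.toBytes("name"),
                                 1700000000000L,
                                 KeyValue.Type.Put,
                                 Bytes.toBytes("alice"));

        System.out.println("row       = " + Bytes.toString(CellUtil.cloneRow(cell)));
        System.out.println("family    = " + Bytes.toString(CellUtil.cloneFamily(cell)));
        System.out.println("qualifier = " + Bytes.toString(CellUtil.cloneQualifier(cell)));
        System.out.println("timestamp = " + cell.getTimestamp());
        System.out.println("type      = " + KeyValue.Type.codeToType(cell.getTypeByte()));
        System.out.println("value     = " + Bytes.toString(CellUtil.cloneValue(cell)));
    }
}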
HBase write data flow
a) The client sends an HTable.put(put) request to the HRegionServer;
b) The HRegionServer routes the request to the corresponding HRegion;
c) Decide whether to write the WAL log (see the sketch after this list). The WAL file is a standard Hadoop SequenceFile that stores HLogKey records; these contain the sequence number corresponding to the actual data and are used primarily for crash recovery;
d) The put data is written to the MemStore, and the MemStore's status is checked; if it is full, a flush-to-disk request is triggered;
e) The HRegionServer handles the flush-to-disk request by writing the data into an HFile on HDFS and recording the last written data sequence number, so that it knows which data has already been persisted permanently to HDFS.
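As a client-side illustration of step c), the WAL behavior can be controlled per mutation through the Durability setting; a minimal sketch (table, family, and row names are hypothetical; skipping the WAL trades crash safety for write speed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteFlowExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) {

            Put durable = new Put(Bytes.toBytes("row-0001"));
            durable.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes("1"));
            durable.setDurability(Durability.SYNC_WAL);  // write the WAL entry (normal durable behavior)
            table.put(durable);

            Put fast = new Put(Bytes.toBytes("row-0002"));
            fast.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes("2"));
            fast.setDurability(Durability.SKIP_WAL);     // skip the WAL: faster, but the edit is lost
                                                         // if the RegionServer crashes before the flush
            table.put(fast);
        }
    }
}

Durability.USE_DEFAULT (the default) defers to the table's own setting; SYNC_WAL corresponds to the standard write path described in the steps above.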
Because the column families of a table share regions, one column family may hold 10 million rows while another holds only 100. When the region has to split, those 100 rows end up spread evenly across multiple regions. For this reason it is generally not recommended to define many column families.
HBase quick response data
Data in HBase is stored in HDFS blocks as StoreFiles (HFiles), i.e. as binary streams. HDFS does not know what HBase stores; it simply stores the files as binary files. In other words, HBase's stored data is opaque to the HDFS file system.
When an HBase HRegionServer starts, it opens the data of all the regions it serves and initializes the corresponding MemStores, which to some extent speeds up the system's response; by contrast, data files (blocks) in Hadoop are closed by default, opened only when needed, and closed again once the data has been processed, which adds to response time to a certain extent.
Fundamentally, HBase can provide real-time services mainly because of its architecture and underlying data structures, namely LSM-tree + HTable (region partitioning) + cache: the client can navigate directly to the HRegionServer that holds the data to be looked up, then locate the matching data within a region on that server, and part of the data is served from the cache.
Regions are assigned by the master to the appropriate RegionServers. HBase has two special tables, -ROOT- and .META.: the .META. table records the region information of user tables together with the RegionServer addresses, and .META. itself can have multiple regions; the -ROOT- table records the region information of the .META. table, and -ROOT- has only one region. The location of the -ROOT- table is recorded in ZooKeeper.
Before accessing user data, the client first accesses ZooKeeper, then the -ROOT- table, then the .META. table, and finally finds the location of the user data to access. Several network round trips are needed in between, but the client caches what it learns.
1. Using the -ROOT- and .META. information in its internal cache, the client connects directly to the HRegionServer that holds the requested data;
2. Then, within the region on that server that corresponds to the request, the query first checks the region's in-memory cache, the MemStore (a key-sorted, tree-shaped write buffer);
3. If the result is found in the MemStore, it is returned to the client directly;
4. If no matching data is found in the MemStore, the persisted StoreFiles are read. A StoreFile is also a key-sorted, tree-structured file optimized specifically for range or block queries, and HBase reads disk files in units of its basic I/O unit (the HBase block). Specifically, the process is:
If the required data is found in the BlockCache, it is returned immediately; otherwise the corresponding StoreFile is read one block of data at a time. If a block does not contain the requested data, the block is placed into the HRegionServer's BlockCache and the next block is read; this loop continues until the requested data is found and the result returned. If the data cannot be found in that region, null is returned, indicating that no matching data was found. Of course, once the BlockCache grows beyond a threshold (heapsize * hfile.block.cache.size * 0.85), it starts an LRU-based eviction mechanism, removing the oldest and least-used blocks.
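A minimal client-side sketch of this read path (table and row names are hypothetical): a point Get served from MemStore/BlockCache/StoreFiles as described above, and a row-key range Scan with per-request control over whether fetched blocks are added to the BlockCache:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadPathExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) {

            // Point read: the server checks the MemStore, then the BlockCache,
            // then reads StoreFile blocks from HDFS as needed.
            Get get = new Get(Bytes.toBytes("row-0001"));
            Result r = table.get(get);
            System.out.println("found: " + !r.isEmpty());

            // Range scan over a contiguous row-key interval; StoreFiles are key-sorted,
            // so this maps to sequential block reads.
            Scan scan = new Scan(Bytes.toBytes("row-0001"), Bytes.toBytes("row-0100"));
            scan.setCaching(100);        // rows fetched per RPC
            scan.setCacheBlocks(false);  // keep one-off scan blocks out of the BlockCache
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}

Setting setCacheBlocks(false) for large scans ties in with the LRU eviction note above: one-off scan blocks would otherwise push hot blocks out of the cache.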
HBase fault tolerance and recovery
The HLog file is an ordinary Hadoop SequenceFile. The key of the SequenceFile is an HLogKey object, which records the attribution information of the written data: in addition to the table and region names, it contains a sequence number and a timestamp. The timestamp is the "write time", and the sequence number starts at 0 or at the last sequence number persisted to the file system. The value of the HLog SequenceFile is the HBase KeyValue object, corresponding to the KeyValue in the HFile.
This mechanism is used for fault tolerance and recovery of data:
Each HRegionServer holds one HLog object. HLog is a class that implements the Write Ahead Log: every time a user operation writes data to the MemStore, a copy is also written to the HLog file (the HLog file format is shown later). The HLog file periodically rolls over to a new file and deletes old files (those whose data has already been persisted to StoreFiles). When an HRegionServer terminates unexpectedly, the HMaster learns of it through ZooKeeper. The HMaster first processes the remaining HLog files, splitting the log data by region and placing each part under the corresponding region's directory, and then reassigns the failed regions. While loading a reassigned region, the new HRegionServer finds that there is historical HLog data to process, so it replays the HLog data into the MemStore and then flushes to StoreFiles to complete the data recovery.
HBase Fault Tolerance
Master fault tolerance: ZooKeeper re-elects a new master
* Without a master process, data reads still proceed normally;
* Without a master process, region splitting, load balancing, etc. cannot be carried out;
RegionServer fault tolerance: each RegionServer reports a heartbeat to ZooKeeper periodically; if the heartbeat does not arrive in time, the master reassigns the regions on that RegionServer to other RegionServers, and the write-ahead logs on the failed server are split by the master and distributed to the new RegionServers
ZooKeeper fault tolerance: ZooKeeper itself is a highly reliable service, typically configured with 3 or 5 ZooKeeper instances
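Because the client first consults ZooKeeper (as described in the read path above), the only addresses a client needs are those of the ZooKeeper quorum. A minimal connection sketch (hostnames are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ZkConnectExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Typically 3 or 5 ZooKeeper instances; the client finds the -ROOT-/.META.
        // location through this quorum and caches region information locally.
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            System.out.println("connected: " + !conn.isClosed());
        }
    }
}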
RowKey design for HBase storage