Evolution of HBase file formats


Apache HBase is a distributed, open-source storage layer on top of Hadoop. It is well suited to random, real-time read/write workloads.

We know that Hadoop's SequenceFile is designed for sequential read/write and batch processing. So how does HBase achieve random, real-time I/O on top of it?

Hadoop stores data on disk in the SequenceFile format, which only allows key-value (K-V) pairs to be appended. Because HDFS itself is append-only, a SequenceFile cannot modify or delete existing records; the only way to find a key is to traverse the whole file.

So how, starting from this file format, can HBase provide random read/write with low access latency?

Before HBase 0.20: MapFile

MapFile is a file format built on SequenceFile. A MapFile is actually a directory containing two SequenceFiles: /data, which stores the data, and /index, which stores the index.

MapFile provides ordered storage. After every N records are written (N is configurable), the current file offset is recorded in the index file. This makes lookups fast: instead of traversing the whole data file, you search the much smaller index, jump directly to the matching block, and scan only within that block.
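The lookup scheme above can be sketched as follows. This is a simplified illustration (not Hadoop's actual MapFile code): records are a sorted in-memory list, and the index keeps every Nth key plus its position.

```python
import bisect

def build_sparse_index(records, every_n=128):
    """Index the key of every Nth record, as MapFile does (interval configurable)."""
    # records: list of (key, value) pairs, sorted by key
    return [(records[i][0], i) for i in range(0, len(records), every_n)]

def lookup(records, index, key, every_n=128):
    """Binary-search the sparse index, then scan forward within one block."""
    keys = [k for k, _ in index]
    pos = bisect.bisect_right(keys, key) - 1
    if pos < 0:
        return None                      # key sorts before the first indexed key
    start = index[pos][1]
    for k, v in records[start:start + every_n]:
        if k == key:
            return v
        if k > key:                      # records are sorted, so we can stop early
            return None
    return None
```

Because the index holds one entry per N records, it is small enough to binary-search cheaply, and at most N records are scanned per lookup.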

However, there are two other problems:

1. How do you delete or update a K-V record?

2. How do you use a MapFile when the inserted data is not ordered?

The MapFile layout is as follows (figure not reproduced):


An HBase key consists of the row key, column family, column qualifier, timestamp, and key type.

To solve the deletion problem, the key type field in the key marks whether a record is a delete (a tombstone).

To solve the update problem, a read simply takes the record with the latest timestamp; the correct data is always closer to the end of the file.

To solve the problem of unordered input, HBase buffers inserted data in memory until a threshold is reached, then writes it out as a MapFile. In memory, HBase keeps the data sorted in a ConcurrentSkipListMap. Whenever the buffered data reaches the flush threshold (hbase.hregion.memstore.flush.size) or the memory upper limit (hbase.regionserver.global.memstore.upperlimit), the in-memory data is written to a new MapFile. Since each flush produces a new MapFile, a read may have to search multiple files, which costs more time and resources.
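The buffer-then-flush pattern can be sketched like this. This is a toy model, not HBase's MemStore: HBase uses a ConcurrentSkipListMap and flushes to files on HDFS, while here a sorted in-memory list stands in for the skip list and flushed runs are just immutable lists, with a hypothetical tiny threshold.

```python
import bisect

class MemStore:
    """Toy memstore: keep writes sorted in memory, emit an immutable
    sorted run whenever the size threshold is reached."""

    def __init__(self, flush_threshold=4):
        self.keys, self.values = [], []
        self.flush_threshold = flush_threshold
        self.flushed_runs = []           # each run models one flushed MapFile

    def put(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value       # newer write wins while still in memory
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)
        if len(self.keys) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the sorted in-memory data out as a new immutable run."""
        self.flushed_runs.append(list(zip(self.keys, self.values)))
        self.keys, self.values = [], []
```

Note that every flush adds one more run to search at read time, which is exactly why the compaction described next is needed.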

To avoid traversing too many files during get and scan operations, HBase runs a background thread that merges files. When it finds that the number of files has reached a threshold (hbase.hstore.compaction.max), it runs a compaction that merges small files into one large file.

HBase has two compaction modes: minor and major. A minor compaction merges two or more smaller files into one larger file. A major compaction merges all files into a single file and also performs cleanup: deleted data is not rewritten and duplicate versions are dropped, leaving only the latest valid data. HBase versions before 0.20 used the storage scheme described above; from 0.20 on, HFile v1 was introduced to replace MapFile.
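A major compaction is essentially a k-way merge over sorted runs. The sketch below assumes a simplified record shape of (key, sequence number, type, value), where a higher sequence number means a newer write; real HBase versioning (timestamps, multiple retained versions, tombstone types) is richer than this.

```python
import heapq

def compact(runs):
    """Major-compaction sketch: merge sorted runs, keep only the newest
    version of each key, and drop keys whose newest record is a delete."""
    # Sort by key, and within a key newest (highest seq) first.
    merged = heapq.merge(*runs, key=lambda r: (r[0], -r[1]))
    out, last_key = [], None
    for key, seq, rtype, value in merged:
        if key == last_key:
            continue                     # an older duplicate version: drop it
        last_key = key
        if rtype != "delete":            # tombstones are not written to the new file
            out.append((key, value))
    return out
```

Each input run must itself be sorted by key, which holds because every run came from a sorted memstore flush.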

HBase 0.20 to 0.92: HFile v1

Starting with version 0.20, HBase introduced the HFile v1 format, laid out as follows (figure not reproduced):


The KeyValue format is as follows (figure not reproduced):


There are four key types: Put, Delete, DeleteColumn, and DeleteFamily. The row length field is 2 bytes (the row itself is variable-length), the column family length is 1 byte (the family is variable-length), the column qualifier is variable-length, the timestamp is 8 bytes, and the key type is 1 byte. The qualifier length is not recorded because it can be derived from the other fields.
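The layout just described can be made concrete with a small serializer. This is an illustrative sketch of the field arrangement, not HBase's actual KeyValue code; the second function shows why the qualifier length need not be stored.

```python
import struct

def encode_key(row, family, qualifier, ts, key_type):
    """Serialize the key part of a KeyValue per the layout above:
    row length (2B) | row | family length (1B) | family |
    qualifier | timestamp (8B) | key type (1B)."""
    return (struct.pack(">H", len(row)) + row +
            struct.pack(">B", len(family)) + family +
            qualifier +
            struct.pack(">Q", ts) +
            struct.pack(">B", key_type))

def qualifier_length(key_bytes):
    """The qualifier length is whatever remains after subtracting
    every other (known-size or length-prefixed) field."""
    row_len = struct.unpack_from(">H", key_bytes, 0)[0]
    fam_len = key_bytes[2 + row_len]
    return len(key_bytes) - 2 - row_len - 1 - fam_len - 8 - 1
```

Omitting the qualifier length saves a couple of bytes per cell, which matters because the key is repeated for every single cell.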

An HFile is variable-length; the only fixed-size structures are the file info and the trailer. The trailer stores pointers to the other blocks and is written last, when the data is persisted to the end of the file; after that the file becomes an immutable data store. The key-values in the data blocks can be viewed as a MapFile: when a block is closed, its first key is appended to the index, and the index is written into the HFile when the HFile itself is closed. HFile v1 also adds two extra metadata block types, Meta and FileInfo, which are likewise written when the HFile is closed.

The block size is set through HColumnDescriptor and can be specified at table-creation time; the default is 64 KB. For mostly sequential access, a larger block size is appropriate; for mostly random access, a smaller one. However, smaller blocks also mean more block-index entries, and file creation may become slower (the compression stream must be flushed at the end of each block, causing an FS I/O flush). Block sizes between 8 KB and 1 MB are generally suitable.

The HBase region server uses Meta blocks to store the Bloom filter, and FileInfo to store the largest sequence ID, the major-compaction key, and the time range. This information is used to decide whether a key could exist in an old file or only in a very recent one.

A Bloom filter is a space-efficient probabilistic data structure. It uses a bit array to represent a set compactly and can test whether an element might belong to the set: false positives are possible, but false negatives are not, so it can safely rule files out of a lookup.
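A minimal Bloom filter can be sketched in a few lines. This is a generic textbook implementation (hash count and bit-array size are arbitrary example values), not HBase's Bloom filter code, which is tuned and chunked differently.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions each set one bit per element.
    Membership tests may give false positives but never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k positions by salting one cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(item + bytes([i])).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

On a get, the region server consults each file's Bloom filter first and skips reading any file whose filter says the row is definitely absent.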

HBase 0.92 to 0.98: HFile v2

In HBase 0.92, HFile was changed to store large data sets more efficiently. The main problem with HFile v1 is that the entire monolithic index and the whole Bloom filter must be loaded into memory. To solve this, v2 introduces multi-level indexes and a chunked Bloom filter. HFile v2 improves speed, memory usage, and cache utilization.

The HFile v2 layout is as follows (figure not reproduced):


It contains four parts: the scanned-block section, the non-scanned-block section, the load-on-open section (the index, Bloom filter metadata, and file info that must be loaded into memory when the HFile is opened), and the trailer (the end of the file).

In v1, when the data block index is very large, it is hard to load it all into memory. Suppose each data block uses the default 64 KB size and each index entry is 64 bytes: for 60 TB of data there are about one billion blocks, so the index alone is roughly 60 GB, which is far too much memory. In v2, however, the indexes are organized as a tree, so only the top-level index needs to stay in memory; the other index blocks are read on demand and kept in the LRU cache rather than all being loaded at once.

The key feature of v2 is inline blocks. The idea is to break the index and the Bloom filter into pieces stored alongside the data blocks, so that the whole index and Bloom filter no longer have to be loaded into memory at once.

Because the index is split per data block, each data block has its own leaf index. The last key of each data block is used as a separator node, forming a multi-level index structure similar to a B+ tree.


The block magic in the data block header is replaced by a block type field, which identifies the block's content: data, leaf index, Bloom, metadata, root index, and so on.

Concretely, while an HFile is being written, the current inline index block is kept in memory; once it reaches a configurable size threshold, it is flushed straight to disk rather than waiting for the final flush, so the full index never has to be held in memory. After all inline index blocks have been written, the HFile writer builds the next level of the index, whose entries are the offsets of those inline index blocks. This repeats recursively, each level recording the offsets of the level below it, until the topmost level is smaller than the threshold. The whole index is thus built bottom-up, from the leaf index blocks to the root.
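The bottom-up construction can be sketched as follows. This is an illustration of the idea, not HBase's writer: entries stand in for leaf index blocks as (first key, offset) pairs, and a hypothetical small fanout plays the role of the size threshold.

```python
def build_index_levels(entries, fanout=4):
    """Bottom-up multi-level index sketch. `entries` are (first_key, offset)
    pairs for the leaf index blocks. Each higher level indexes groups of
    `fanout` entries from the level below, until one level (the root)
    is small enough to keep in memory."""
    levels = [entries]
    while len(levels[-1]) > fanout:
        lower = levels[-1]
        # Each upper entry: first key of a group, position of its first child.
        upper = [(lower[i][0], i) for i in range(0, len(lower), fanout)]
        levels.append(upper)
    return levels                        # levels[-1] is the root index
```

Only the small root level must live in memory; every other level can be read on demand and cached.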

Three more fields (the compressed size, the uncompressed size, and the offset of the previous block) are also added to the block header for fast seeking.

In HFile v2, the lookup of a key proceeds as follows:

1) Binary-search the HFile's root index in memory. With a multi-level index this locates a leaf index block; with a single-level index it locates the data block directly;

2) With a multi-level index, read the leaf index block from the cache or HDFS, then binary-search it to find the data block;

3) Read the data block from the cache or HDFS;

4) Traverse the data block to find the record.
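The index-walking part of these steps can be sketched like this. It is a simplified model of the read path, not HBase code: the index is a list of levels (leaves first, root last), each entry in an upper level is (first key, index of its first child), and a fixed hypothetical fanout bounds each group of children.

```python
import bisect

def find_block(levels, key, fanout=4):
    """Walk a multi-level index from the root down. At each level,
    binary-search for the last entry whose first key is <= the search key,
    then descend into that entry's children. Returns the leaf block
    offset, or None if the key sorts before every block."""
    lo, hi = 0, len(levels[-1])          # search window, initially the whole root
    for depth in range(len(levels) - 1, -1, -1):
        level = levels[depth]
        keys = [k for k, _ in level[lo:hi]]
        i = bisect.bisect_right(keys, key) - 1
        if i < 0:
            return None
        entry = level[lo + i]
        if depth == 0:
            return entry[1]              # leaf entry: the data block offset
        lo, hi = entry[1], entry[1] + fanout   # restrict to this entry's children
    return None
```

Only the root level is searched in full; each descent narrows the search to one group of child entries, which in v2 corresponds to reading a single intermediate or leaf index block.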

HBase 0.98 to the present: HFile v3

HBase 0.98 added support for cell tags, so the HFile structure changed again. The HFile v3 format only adds a tags section after the v2 fields; everything else is unchanged, so it is compatible with v2 and you can switch from v2 to v3 directly.


Reprinted; please indicate the source: http://blog.csdn.net/iAm333

