LevelDB SSTable: static layout structure

Source: Internet
Author: User

The SSTable is a critical piece of Bigtable, and understanding how LevelDB implements its SSTable also helps in understanding some of the implementation details of Bigtable.
This section focuses on the static layout of the SSTable. SSTable files form a hierarchy of levels; how that hierarchy is formed is covered in detail in the later section on compaction. Here we look at the physical and logical layout of a single SSTable file, which is useful for understanding how LevelDB runs.
LevelDB keeps one or more SSTable files at each level (they are recognizable by the .sst suffix), and all .sst files share the same internal layout. The previous section described how a log file is split into physical blocks. An SSTable file is likewise split into physical storage blocks of a fixed target size, but the logical layouts of the two are very different. The root cause is that keys in a log file are unordered, that is, there is no defined ordering between the keys of successive records, whereas the records inside an .sst file are arranged by key from smallest to largest. As the layout described below will show, key ordering is the reason the .sst file structure is designed the way it is.

Figure 1: The block structure of the SST file

Figure 1 shows the physical partitioning of an .sst file. Like a log file, it is divided into storage blocks of a fixed target size (the stored size of a block can differ, for example after compression). Each block consists of three parts: the block data, a type field, and a CRC. The block data holds the stored content; the type field identifies whether the block data is compressed (there are two options, Snappy compression or no compression); and the CRC is a checksum of the block data, used to detect errors introduced while the data was generated or transferred.
The above is the physical layout of the .sst file. Next comes its logical layout: although every unit is physically a block, the logical layout describes what each block stores and how it is organized internally. Figure 2 shows the logical interpretation of the .sst file.
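
The three-part block described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names seal_block and open_block are hypothetical, the 1-byte type followed by a 4-byte checksum reflects my understanding of LevelDB's block trailer rather than anything stated in this article, and zlib.crc32 is a stand-in for the checksum LevelDB really uses, so the bytes produced here would not match a real .sst file.

```python
import struct
import zlib

TRAILER_SIZE = 5      # assumed: 1-byte type + 4-byte checksum
NO_COMPRESSION = 0    # assumed type code for uncompressed data
SNAPPY = 1            # assumed type code for Snappy-compressed data


def _checksum(contents: bytes, type_byte: bytes) -> int:
    # Stand-in checksum: plain CRC-32 from zlib, NOT LevelDB's own CRC variant.
    return zlib.crc32(contents + type_byte) & 0xFFFFFFFF


def seal_block(contents: bytes, block_type: int = NO_COMPRESSION) -> bytes:
    """Append the type byte and checksum to the block data."""
    type_byte = bytes([block_type])
    return contents + type_byte + struct.pack("<I", _checksum(contents, type_byte))


def open_block(raw: bytes):
    """Split a stored block into (contents, type) after verifying the checksum."""
    if len(raw) < TRAILER_SIZE:
        raise ValueError("block too short")
    contents, type_byte, crc_bytes = raw[:-5], raw[-5:-4], raw[-4:]
    (stored,) = struct.unpack("<I", crc_bytes)
    if stored != _checksum(contents, type_byte):
        raise ValueError("checksum mismatch")
    return contents, type_byte[0]
```

A reader would call open_block on the bytes fetched from the file and decompress the contents only if the returned type says so.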

Figure 2 Logical layout

As Figure 2 shows, at a high level the .sst file can be divided into a data storage area and a data management area. The data storage area holds the actual key:value data, and the data management area provides index pointers and other administrative data so that records can be found more quickly and conveniently. Both areas are built on the blocks described above: the blocks at the front of the file store the KV data, and the data management area that follows stores the administrative data. The management data comes in four types: the purple meta block, the red metaindex block, the blue index block, and the footer at the end of the file.
LevelDB version 1.2 does not actually use the meta block; it only reserves the interface, presumably so that content can be added in later versions. Let us look at the internal structure of the index block and of the footer.


Figure 3 Index block structure

Figure 3 shows the internal structure of the index block. Again, the KV records in a data block are sorted by key from smallest to largest. Each record in the index block is the index entry for one data block, and each entry contains three pieces of information: an upper bound on the keys in the data block (not necessarily its largest key), the offset of the data block in the .sst file, and the size of the data block. In Figure 3, for the index entry of data block i, the first field (in red) records a key that is greater than or equal to the largest key in block i, the second field gives the starting position of block i in the .sst file, and the third field gives the size of block i (which may be the compressed size).
The last two fields are easy to understand: they locate the data block within the file. The first field needs more explanation, because the key stored in the index is not necessarily the key of any actual record. In the example of Figure 3, assume that data block i has the smallest key "samecity" and the largest key "the best", and that data block i+1 has the smallest key "the fox" and the largest key "zoo". For the index entry of block i, the first field must be greater than or equal to the largest key of block i ("the best") and smaller than the smallest key of block i+1 ("the fox"). In the example the first field of index entry i is therefore "the c", which meets both requirements. The first field of index entry i+1 is "zoo", the largest key of block i+1.
The internal structure of the footer at the end of the file is shown in Figure 4. metaindex_handle gives the starting position and size of the metaindex block, and index_handle gives the starting position and size of the index block. These two fields can be thought of as an index of the indexes, set up so that the index data can be read correctly. They are followed by a padding area and the magic number (0xdb4775248b80fb57).
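
The choice of "the c" as the first field, and the way index entries are then used to pick a data block, can be sketched as follows. This is a minimal illustration, not LevelDB's code: shortest_separator and find_block are hypothetical names, the comparison is plain bytewise ordering, and the offsets and sizes in the sample index are made up.

```python
import bisect


def shortest_separator(start: bytes, limit: bytes) -> bytes:
    """Return a short key k with start <= k < limit, assuming start < limit."""
    n = min(len(start), len(limit))
    i = 0
    while i < n and start[i] == limit[i]:
        i += 1
    if i >= n:
        # One key is a prefix of the other: do not shorten.
        return start
    b = start[i]
    if b < 0xFF and b + 1 < limit[i]:
        # Bump the first differing byte and drop everything after it.
        return start[:i] + bytes([b + 1])
    return start


def find_block(index, key: bytes):
    """Return the first index entry whose separator key is >= key, or None."""
    separators = [entry[0] for entry in index]
    pos = bisect.bisect_left(separators, key)
    return index[pos] if pos < len(index) else None


# Index entries are (separator_key, offset, size); offsets and sizes are illustrative.
index = [
    (shortest_separator(b"the best", b"the fox"), 0, 4096),  # block i
    (b"zoo", 4101, 3000),                                     # block i+1
]
```

With this index, a lookup for "samecity" or "the best" is directed to block i, a lookup for "the fox" goes to block i+1, and a key larger than "zoo" matches no block at all. Because the entries are sorted, a binary search over the first fields is enough.
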


Figure 4 Footer
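
The footer layout (two handles, padding, magic number) can be sketched as an encode/decode pair. The details below go beyond what this article states and should be read as assumptions based on my understanding of LevelDB's table format: each handle is an (offset, size) pair stored as two varints, the handles are padded to 40 bytes, and the magic number is stored as 8 little-endian bytes, for a 48-byte footer. The function names are hypothetical.

```python
import struct

MAGIC = 0xDB4775248B80FB57
HANDLES_SIZE = 40              # assumed: two handles padded to 40 bytes
FOOTER_SIZE = HANDLES_SIZE + 8  # plus the 8-byte magic number


def put_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def get_varint(buf: bytes, pos: int):
    """Decode one varint starting at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if b < 0x80:
            return result, pos
        shift += 7


def encode_footer(metaindex_handle, index_handle) -> bytes:
    body = b"".join(put_varint(v) for v in (*metaindex_handle, *index_handle))
    return body.ljust(HANDLES_SIZE, b"\x00") + struct.pack("<Q", MAGIC)


def decode_footer(footer: bytes):
    """Return (metaindex_handle, index_handle), each an (offset, size) pair."""
    if len(footer) != FOOTER_SIZE:
        raise ValueError("footer has the wrong size")
    (magic,) = struct.unpack("<Q", footer[HANDLES_SIZE:])
    if magic != MAGIC:
        raise ValueError("not an sstable: bad magic number")
    values, pos = [], 0
    for _ in range(4):
        value, pos = get_varint(footer, pos)
        values.append(value)
    return (values[0], values[1]), (values[2], values[3])
```

Because the footer sits at a known distance from the end of the file, a reader can fetch it first, check the magic number, and then use index_handle to load the index block.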

The above covered the internal structure of the data management area. Now let us look at the internal layout of a block in the data storage area, shown in Figure 5.


Figure 5 Internal structure of Data block

As the figure shows, a data block is also divided into two parts. The front part holds the KV records, arranged by key from smallest to largest. At the tail of the block are a number of restart points, which are in effect pointers to the positions of certain records within the block content.
What is a restart point? As emphasized repeatedly, the KV records in the block content are ordered by key, so the keys of two adjacent records are likely to share a prefix. For example, key i = "the car" and key i+1 = "the color" share the prefix "the c". To reduce the space taken by keys, key i+1 can store only the part that differs from the previous key, "olor"; the shared part can be recovered from key i. Keys in the block content are stored this way mainly to reduce storage overhead. A restart point means that, starting from this record, the key is stored in full again rather than as a difference. If key i+1 is a restart point, its key is stored as the complete "the color" instead of the short "olor". Restart points are needed because, if every record depended on its predecessor, random access to a record would require decoding from the beginning of the block, which is expensive when there are many records. So a number of restart points are set, and the block tail indicates which records they are.
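
The interplay between prefix sharing and restart points can be sketched as below. This is a simplified model with hypothetical names: restart points are kept as record indexes rather than as byte positions inside the block, and the restart interval of 2 is chosen only to keep the example small.

```python
def shared_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def build_entries(sorted_keys, restart_interval: int = 2):
    """Prefix-compress sorted keys; every restart_interval-th key is stored in full."""
    entries = []    # (shared_len, non_shared_suffix)
    restarts = []   # indexes of records whose key is stored in full
    prev = b""
    for i, key in enumerate(sorted_keys):
        if i % restart_interval == 0:
            shared = 0
            restarts.append(i)
        else:
            shared = shared_prefix_len(prev, key)
        entries.append((shared, key[shared:]))
        prev = key
    return entries, restarts


def key_at(entries, restarts, target: int) -> bytes:
    """Rebuild the key of record `target`, decoding only from the nearest restart point."""
    start = max(r for r in restarts if r <= target)
    key = b""
    for shared, suffix in entries[start:target + 1]:
        key = key[:shared] + suffix
    return key
```

For the keys "the car", "the color", "the fox" with an interval of 2, the second record stores only "olor" and the third is a restart point stored in full, so rebuilding the third key touches one record instead of three.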


Figure 6 Recording format

What is the internal structure of each KV record in the block content area? Figure 6 shows the detailed structure. Each record contains five fields: key shared length, key non-shared length, value length, key non-shared content, and value content. For example, for the "the car" and "the color" records above, the "the color" record has a key shared length of 5 and a key non-shared length of 4, and the key non-shared content actually stored is "olor". The value length and value content fields hold the length of the value in the key:value pair and the value itself.
That covers the internal structure of the .sst file.
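
The five-field record can be sketched at the byte level as follows. The function names are hypothetical, and storing the three lengths as varints is an assumption based on my understanding of LevelDB's block format; this article only names the fields.

```python
def put_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def get_varint(buf: bytes, pos: int):
    """Decode one varint starting at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if b < 0x80:
            return result, pos
        shift += 7


def encode_record(prev_key: bytes, key: bytes, value: bytes) -> bytes:
    """Lay out: shared length, non-shared length, value length, key suffix, value."""
    shared = 0
    while shared < min(len(prev_key), len(key)) and prev_key[shared] == key[shared]:
        shared += 1
    suffix = key[shared:]
    header = put_varint(shared) + put_varint(len(suffix)) + put_varint(len(value))
    return header + suffix + value


def decode_record(buf: bytes, pos: int, prev_key: bytes):
    """Return (key, value, next_pos) for the record starting at pos."""
    shared, pos = get_varint(buf, pos)
    non_shared, pos = get_varint(buf, pos)
    value_len, pos = get_varint(buf, pos)
    key = prev_key[:shared] + buf[pos:pos + non_shared]
    pos += non_shared
    value = buf[pos:pos + value_len]
    return key, value, pos + value_len
```

Encoding "the color" after "the car" yields the lengths 5, 4, and the value length, followed by "olor" and the value; decoding needs the previous key to restore the shared prefix.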

For the block format and related operations, see "LevelDB source code analysis - sstable: block".

For more on building and reading SSTables, see "LevelDB source code analysis - sstable: building and reading the .sst file".
