LSM Tree Analysis

Source: Internet
Author: User

Introduction

It is well known that traditional disk I/O is relatively expensive, so optimizing system performance often means dealing with disk I/O. Disk I/O latency is determined by the following three factors:

    • Seek time (the time to move the disk arm to the right cylinder: about 1 ms to move to an adjacent cylinder, and 5~10 ms for a random move)
    • Rotational delay (the time spent waiting for the right sector to rotate under the head)
    • Actual data transfer time (a low-end hard drive transfers around 5 MB/ms, a high-speed drive around 10 MB/ms)

Over the last 20 years, average seek time has improved by about 7×, transfer rate by about 1,300×, and capacity by as much as 50,000×. The main reason is that the moving parts of a disk improve slowly and incrementally, while the recording surface has reached a very high density. The time to access a block is determined almost entirely by seek time and rotational delay, so every block access costs roughly the same; the more data retrieved per access, the better.
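
To make the last point concrete, here is a back-of-the-envelope sketch in Java (the constants are the figures quoted above plus an assumed rotational delay; purely illustrative):

    // Back-of-the-envelope disk model using the figures quoted above
    // (assumed values, purely illustrative).
    public class DiskCost {
        static final double SEEK_MS = 7.5;            // random seek: midpoint of 5~10 ms
        static final double ROTATE_MS = 4.0;          // assumed average rotational delay
        static final double TRANSFER_MB_PER_MS = 5.0; // low-end figure from the text

        // Time to read `blocks` blocks of `blockMb` MB each, one seek per block.
        static double randomReadMs(int blocks, double blockMb) {
            return blocks * (SEEK_MS + ROTATE_MS + blockMb / TRANSFER_MB_PER_MS);
        }

        // Same amount of data: one seek, then a single sequential transfer.
        static double sequentialReadMs(int blocks, double blockMb) {
            return SEEK_MS + ROTATE_MS + blocks * blockMb / TRANSFER_MB_PER_MS;
        }

        public static void main(String[] args) {
            // 1000 blocks of 64 KB: random access is dominated by seeks.
            System.out.printf("random:     %.1f ms%n", randomReadMs(1000, 0.0625));
            System.out.printf("sequential: %.1f ms%n", sequentialReadMs(1000, 0.0625));
        }
    }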

The Problem This Creates

Because disks operate far more slowly than CPU and memory, disk I/O has become a bottleneck for many systems. At the same time, disk caches keep growing, so a large portion of read requests are served straight from the file system cache without any disk access. As a result, I/O optimization largely focuses on optimizing write operations.

Approaches to the Problem: I/O Types

Disk I/O bottlenecks can appear in either of two places: seeks or data transfer.

Storage structures are chosen to match the dominant I/O type: B-trees and B+ trees are widely used in relational storage engines, while BigTable's storage architecture is based on the Log-Structured Merge tree (LSM-tree).

B-Tree

A B-tree is a balanced search tree similar to a red-black tree, but better at reducing disk I/O: a B-tree is shallower, so finding an element requires loading only a few nodes from disk into memory before the sought data can be accessed.

    1. Each node x has the following fields:
      • x.n, the number of keys stored in node x.
      • The x.n keys themselves, stored in nondecreasing order, so that x.key1 <= x.key2 <= ... <= x.key(x.n).
      • x.leaf, a Boolean value that is True if x is a leaf node and False if it is an internal node.
    2. Each internal node x also contains x.n+1 pointers to its children. Leaf nodes have no children, so their child pointer fields are undefined.
    3. The keys x.key(i) separate the key ranges stored in the subtrees: if ki is any key stored in the subtree rooted at the i-th child of x, then k1 <= x.key1 <= k2 <= x.key2 <= ... <= x.key(x.n) <= k(x.n+1).
    4. All leaf nodes have the same depth, namely the tree's height h.
    5. Nodes have a lower and an upper bound on the number of keys x.n they may contain, expressed in terms of a fixed integer t >= 2 called the minimum degree (see the sketch after this list):
      • Every non-root node contains at least t-1 keys, so every non-root internal node has at least t children; if the tree is nonempty, the root contains at least one key.
      • Every node contains at most 2t-1 keys, so an internal node has at most 2t children; a node is full if it contains exactly 2t-1 keys.

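A minimal sketch of these fields and bounds in Java, following the CLRS notation above (illustrative only, not a working B-tree):

    // Sketch of a B-tree node with the fields described above.
    // Minimum degree t >= 2; illustrative, not a full implementation.
    class BTreeNode {
        final int t;            // minimum degree
        int n;                  // number of keys currently stored (x.n)
        int[] keys;             // x.n keys kept in nondecreasing order
        BTreeNode[] children;   // x.n + 1 child pointers (unused if leaf)
        boolean leaf;           // true if this node is a leaf

        BTreeNode(int t, boolean leaf) {
            this.t = t;
            this.leaf = leaf;
            this.keys = new int[2 * t - 1];        // at most 2t - 1 keys
            this.children = new BTreeNode[2 * t];  // at most 2t children
        }

        boolean isFull() { return n == 2 * t - 1; }

        // Invariant: non-root nodes hold at least t - 1 keys;
        // a nonempty tree's root holds at least one key.
        boolean satisfiesMinKeys(boolean isRoot) {
            return isRoot ? n >= 1 : n >= t - 1;
        }
    }
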
Figure: a B-tree of height 3 holding the minimum possible number of keys; the value shown inside each node x is n[x].

The B+ tree is a variant of the B-tree that is better suited as an external-storage index structure, and MySQL storage engines generally use B+ trees to implement their indexes. Internal nodes contain only key values and pointers to child nodes; the data itself is stored in the leaf nodes. All record-bearing leaf nodes sit at the same level, ordered by key value, and adjacent leaf nodes are linked to each other (as a doubly linked list).

Figure: a B+ tree of height 2.

All records are in the leaf nodes and stored sequentially. If we traverse the leaf nodes in order starting from the leftmost one, we obtain all key values in sorted order: 5, 10, 15, 20, 25, 30, 50, 55, 60, 65, 75, 80, 85, 90.

One notable feature of B+ tree indexes in databases is their high fan-out, so the height of a database B+ tree is generally only two or three levels; finding the row record for a given key therefore takes at most 2 to 3 I/Os.
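
As a rough illustration of why two or three levels suffice, the snippet below computes an approximate capacity for an assumed fan-out of 500 and 100 rows per leaf page (both invented, illustrative values):

    // Rough B+ tree capacity by height, for an assumed fan-out of 500
    // and 100 rows per leaf page (illustrative values only).
    public class BPlusCapacity {
        public static void main(String[] args) {
            int fanout = 500, rowsPerLeaf = 100;
            for (int height = 1; height <= 3; height++) {
                long leaves = (long) Math.pow(fanout, height - 1);
                System.out.printf("height %d -> ~%,d rows%n",
                        height, leaves * rowsPerLeaf);
            }
        }
    }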

Performance analysis

When there are not too many write operations, the B+ tree works well; it is heavily optimized to keep access times low. Writes, however, are often random: they land at scattered locations on disk, so updates and deletes proceed at disk-seek rates. An RDBMS is thus typically seek-bound, mainly because the B-tree or B+ tree structures it stores data in perform their operations at disk-seek rates: typically log(N) seek operations per access.

The LSM-tree, by contrast, works at the disk transfer rate and scales better to large data sets while sustaining a consistent insertion rate, because (1) it uses a log file plus an in-memory store to turn random writes into sequential writes, and (2) reads and writes are handled independently, so the two kinds of operation do not contend.

When transferring the same amount of data, most of the latency of random write I/O goes into seeks: a database writing randomly to disk incurs many seek operations, whereas sequential access needs only one seek to transfer a large amount of data. For workloads that write large volumes of data in bulk, sequential writes therefore hold a clear advantage over random writes.

A key idea of the Log-Structured Merge-tree (LSM-tree) is to use an algorithm that defers and batches index changes, migrating updates to disk efficiently in a manner reminiscent of merge sort. For bulk writes it exploits the fact that sequential disk writes far outperform random ones: random writes are converted into sequential writes, so disk access stays sequential and write performance improves, while indexing preserves fast read performance, striking a balance between reads and writes.
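
A minimal Java sketch of this idea, assuming invented names (MiniLsm, wal.log, segment-N.sst) and omitting compaction, recovery, and error handling:

    import java.io.*;
    import java.util.*;

    // Minimal two-component LSM sketch: write-ahead log + sorted memtable (C0),
    // flushed to a sorted on-disk file (a C1 segment) when it grows too large.
    public class MiniLsm {
        private final TreeMap<String, String> memtable = new TreeMap<>();
        private final Writer wal;
        private final int flushThreshold;
        private int segmentId = 0;

        public MiniLsm(int flushThreshold) throws IOException {
            this.flushThreshold = flushThreshold;
            this.wal = new BufferedWriter(new FileWriter("wal.log", true));
        }

        public void put(String key, String value) throws IOException {
            wal.write(key + "\t" + value + "\n");  // 1. append to the log (sequential)
            wal.flush();
            memtable.put(key, value);              // 2. insert into the sorted in-memory C0
            if (memtable.size() >= flushThreshold) flush();
        }

        // Flush C0 to disk as one sequential, key-ordered write.
        private void flush() throws IOException {
            try (Writer out = new BufferedWriter(
                    new FileWriter("segment-" + (segmentId++) + ".sst"))) {
                for (Map.Entry<String, String> e : memtable.entrySet())
                    out.write(e.getKey() + "\t" + e.getValue() + "\n");
            }
            memtable.clear();
        }
    }

Note that every disk write here, the log append and the segment flush, is sequential; the only randomly accessed structure is the in-memory TreeMap.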

The LSM-tree suits scenarios where writes far outnumber reads, and many NoSQL systems, such as HBase and LevelDB, borrow from the LSM-tree idea.

LSM-tree Algorithm Introduction

An LSM-tree can be made up of multiple components; the two-component case is briefly illustrated below:

Figure: schematic picture of a two-component LSM-tree.

As shown, a two-component LSM-tree consists of an in-memory component C0 and a larger component C1 persisted on disk. When a record is written or updated, a log entry is written first (ahead of the data) so that the write can be recovered if it fails. The record is then inserted into the memory-resident C0 tree; when a certain condition is met, records migrate from the C0 tree into the C1 tree on disk.

Logging each insert is necessary because a record may sit in C0 for some time before migrating to C1; during that window the update has not been persisted, so if memory were lost unexpectedly, the data would be gone without the log.

The C0 tree need not have a B-tree-like structure. First, its nodes can be of any size: there is no reason to match the disk page size, because the C0 tree never lives on disk, so there is no need to sacrifice CPU efficiency to minimize tree depth. A 2-3 tree or an AVL tree can serve as C0's data structure. (Note: HBase uses the thread-safe ConcurrentSkipListMap.)

Inserting an entry into the in-memory C0 tree is fast because it incurs no disk I/O. However, memory costs far more per byte than disk, so C0's size is usually bounded, and an efficient way is needed to migrate records into the C1 tree residing on the cheaper storage device. To achieve this, whenever inserts push the C0 tree close to its size threshold, a rolling merge process removes a contiguous segment of records from the C0 tree and merges it into the C1 tree on disk.

Figure: conceptual picture of rolling-merge steps, with the result written back to disk.

The merge proceeds as follows. A rolling merge actually consists of a series of merge steps. First a multi-page block of the C1 tree is read, bringing a run of C1 records into the cache. Each merge step then reads a disk-page-sized C1 leaf node from that cache, merges its records with the leaf-level records taken from the C0 tree (thereby shrinking C0), and creates a newly merged leaf node in the C1 tree.
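
At its core, each merge step is a two-way merge of key-ordered runs, as in merge sort. A hedged sketch (SortedMap stands in for the C0 and C1 leaf runs; the newer C0 entry wins on equal keys):

    import java.util.*;

    // Sketch of one merge step: combine a key-ordered run from C0 with a
    // key-ordered run from C1; on equal keys the newer C0 entry wins.
    public class MergeStep {
        static List<Map.Entry<String, String>> merge(
                SortedMap<String, String> c0, SortedMap<String, String> c1) {
            List<Map.Entry<String, String>> out = new ArrayList<>();
            Iterator<Map.Entry<String, String>> a = c0.entrySet().iterator();
            Iterator<Map.Entry<String, String>> b = c1.entrySet().iterator();
            Map.Entry<String, String> x = a.hasNext() ? a.next() : null;
            Map.Entry<String, String> y = b.hasNext() ? b.next() : null;
            while (x != null || y != null) {
                int cmp = (x == null) ? 1 : (y == null) ? -1
                        : x.getKey().compareTo(y.getKey());
                if (cmp < 0)      { out.add(x); x = a.hasNext() ? a.next() : null; }
                else if (cmp > 0) { out.add(y); y = b.hasNext() ? b.next() : null; }
                else {            // same key: keep the newer C0 version, drop C1's
                    out.add(x);
                    x = a.hasNext() ? a.next() : null;
                    y = b.hasNext() ? b.next() : null;
                }
            }
            return out;   // would be written back as new, full C1 leaf nodes
        }
    }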

The C1 tree on disk has a B-tree-like structure, but it is optimized for sequential disk access: all nodes are kept full, and to use the disk efficiently, all single-page nodes below the root are packed into contiguous multi-page disk blocks, so the seek cost of reaching a page is amortized across many pages.

After a merge, the new blocks are written to fresh locations on disk, so the old blocks are not overwritten and remain available for recovery after a crash. The parent directory nodes of C1 are also cached in memory and updated to reflect the leaf changes; they stay in memory for a while to minimize I/O. When a merge step completes, the old C1 leaf nodes become invalid and are removed from the C1 directory structure. In general, the leftmost leaf records produced by the merge in C1 are not written out immediately: if the old leaf nodes turn out to be empty, the merge step generates no new nodes, so there is no need to. Instead, these leftmost records, along with the updated directory node information, are cached in memory for a period before being written to disk, both to provide concurrent access during the merge phase and to allow recovery from memory loss after a crash. To shorten rebuild time during recovery, the merge process takes periodic checkpoints that force the cached information out to disk.

As time passes there are more and more flush operations, producing many storage files, so a background process aggregates these files into larger ones, keeping disk seeks bounded by a limited number of storage files. The tree structure stored on disk can also be split across multiple storage files. And because all stored data is sorted by key, inserting new keys into an existing node requires no re-sorting.

Lookups work by merging: the in-memory store is searched first, then the on-disk storage files. The client thus sees one consistent view of all stored data, regardless of whether a given piece currently resides in memory. A delete is a special kind of update: it stores a delete marker, which lookups use to skip keys that have been deleted. When the data is later rewritten during merging, the delete marker and the keys it shadows are discarded.
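
A hedged sketch of that read path, with a null value standing in for a delete marker (names invented):

    import java.util.*;

    // Read path sketch: check the memtable first, then each on-disk segment
    // from newest to oldest; a null value represents a delete marker, so a
    // deleted key reads as absent without consulting older segments.
    public class LookupPath {
        static Optional<String> get(String key,
                                    NavigableMap<String, String> memtable,
                                    List<Map<String, String>> segmentsNewestFirst) {
            if (memtable.containsKey(key)) {
                return Optional.ofNullable(memtable.get(key)); // null => deleted
            }
            for (Map<String, String> segment : segmentsNewestFirst) {
                if (segment.containsKey(key)) {
                    return Optional.ofNullable(segment.get(key));
                }
            }
            return Optional.empty(); // key not present anywhere
        }
    }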

The background process that manages the data has one extra capability: predicate-based deletion. A delete can be triggered by setting a TTL (time-to-live) on the records to be discarded; for example, with a TTL of 20 days, a record becomes invalid 20 days later. The merge process evaluates the predicate and, whenever it is true, omits the record from the blocks it writes back.
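
A small sketch of such a TTL predicate applied during a merge (the Cell record and its timestamp layout are invented for illustration):

    import java.time.Duration;
    import java.time.Instant;
    import java.util.*;

    // TTL sketch: during a merge, records older than the TTL are simply
    // not written back. The Cell layout here is invented for illustration.
    public class TtlFilter {
        record Cell(String key, String value, Instant writtenAt) {}

        static List<Cell> applyTtl(List<Cell> cells, Duration ttl, Instant now) {
            List<Cell> kept = new ArrayList<>();
            for (Cell c : cells) {
                // predicate: keep only cells that have not outlived their TTL
                if (c.writtenAt().plus(ttl).isAfter(now)) kept.add(c);
            }
            return kept; // expired cells are dropped from the written-back blocks
        }
    }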

Shortcomings

There is a write-amplification problem: as merging proceeds, the same data may be written to disk multiple times.

Implementation in HBase

In HBase, HFile requires that data be written in KeyValue-sorted order, while HDFS itself is designed for sequential reads/writes and does not allow a file to be modified once written. To solve this, HBase buffers recently received data in memory (in the MemStore), completes the sort before persisting to HDFS, and then writes to HDFS in a single fast sequential batch.

MemStore

MemStore is HBase's implementation of C0. When data is written to HBase, it first goes to the in-memory MemStore. When a certain threshold is reached, that MemStore is frozen and stops serving write requests, a new MemStore is created to take the incoming writes, and the frozen one is flushed (written sequentially) to disk, forming a new StoreFile; StoreFiles are in turn merged in the background.
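
A hedged sketch of this freeze-and-swap behaviour (class name invented; real HBase is considerably more involved):

    import java.util.NavigableMap;
    import java.util.concurrent.ConcurrentSkipListMap;
    import java.util.concurrent.atomic.AtomicReference;

    // Freeze-and-swap sketch: when the active memstore fills up, swap in a
    // fresh one so writes continue, then flush the frozen snapshot in key order.
    public class RollingMemStore {
        private final AtomicReference<ConcurrentSkipListMap<String, String>> active =
                new AtomicReference<>(new ConcurrentSkipListMap<>());
        private final int threshold;

        public RollingMemStore(int threshold) { this.threshold = threshold; }

        public void put(String key, String value) {
            ConcurrentSkipListMap<String, String> map = active.get();
            map.put(key, value);
            if (map.size() >= threshold && active.compareAndSet(
                    map, new ConcurrentSkipListMap<>())) {
                flush(map);  // the frozen snapshot; new writes go to the fresh map
            }
        }

        private void flush(NavigableMap<String, String> frozen) {
            // In HBase this would write a new StoreFile to HDFS; iterating in
            // key order is what makes the disk write sequential.
            frozen.forEach((k, v) -> System.out.println(k + "=" + v));
        }
    }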

Figure: HBase MemStore and StoreFile.

The main reason for using a MemStore is, as above, that HFile requires KeyValue-sorted data and HDFS files cannot be modified once written. By sorting recently received data in memory and persisting it to HDFS in one fast sequential batch, HBase in effect replaces the many random writes of a traditional database with a single bulk write to one location, speeding up data persistence.

Internally, MemStore maintains the data structure ConcurrentSkipListMap<KeyValue, KeyValue>, which implements the SortedMap interface and stores data as a skip list sorted by key. As a data structure, the skip list can replace a balanced tree in many applications: its algorithms have the same asymptotic expected time bounds as a balanced tree's, while being simpler, faster, and less space-hungry.
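
For example, using the ordinary JDK API with plain String keys standing in for HBase's KeyValue:

    import java.util.Map;
    import java.util.concurrent.ConcurrentSkipListMap;

    // Demonstrates the SortedMap behaviour described above: O(log n)
    // lookups plus key-ordered iteration.
    public class SkipListDemo {
        public static void main(String[] args) {
            ConcurrentSkipListMap<String, String> map = new ConcurrentSkipListMap<>();
            map.put("row2/info:city", "beijing");
            map.put("row1/info:city", "hangzhou");
            map.put("row3/info:city", "shanghai");

            // O(log n): smallest entry whose key is >= the probe key.
            Map.Entry<String, String> e = map.ceilingEntry("row2");
            System.out.println(e.getKey() + " -> " + e.getValue());

            // Iteration is in ascending key order: row1, row2, row3 -
            // exactly what a sequential, sorted flush to disk needs.
            map.forEach((k, v) -> System.out.println(k + " -> " + v));
        }
    }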

Skip List

Properties of a skip list:

    • It consists of many layers; a node's level is generated randomly with a certain probability.
    • Each layer is an ordered linked list, ascending by default; alternatively the order can follow a comparator supplied when the map is created, depending on the constructor used.
    • The bottom list (level 1) contains all the elements.
    • If an element appears in the list at level i, it also appears in every list below level i.
    • Each node carries two pointers: one to the next element in the same list, and one to the corresponding element one level down.

Besides completing reads and writes in O(log N), a skip list supports ordered traversal according to the supplied comparator, which makes it possible to write to disk sequentially and generate a StoreFile.

HFile

HFile is the implementation of C1 in the LSM tree. For ease of analysis, an HFile can be divided logically into index blocks and data blocks.

Index records and data records are sorted in dictionary order by key (row, family, column qualifier, timestamp).
Each record in an index block points to a different block and contains the following information:

    • If it is the lowest-level index, the first key of the data block it points to; otherwise, the first key of the index block it points to
    • The offset of the block within the HFile (a pointer)
    • The on-disk size of the block (the compressed size, if compression is used)

The following is an example of index records:

Key=aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaazi/info:city/latest_timestamp/put offset=2155038, dataSize=38135
Key=aaaaaaaaaaaaaaaaabbbabbaaaabbbbatu/info:city/latest_timestamp/put offset=4343199, dataSize=38358
Key=aaaaaaaaaaaaaaaabbbababbbbababaaxe1t70l5r6f2/info:city/latest_timestamp/put offset=6539008, dataSize=38081

Logically, the HFile above is a complete B+ tree; when the size of the root index grows past a certain threshold (128 KB), another index level is added.

Tracing each index record yields, broadly, the following logical diagram:

Figure: HFile organizational structure.

When the amount of data is small, the index is only one level deep; as the table grows and the root index reaches the threshold (128 KB), another index level is added.

As for lookups: the KeyValues in an HFile are ordered, meaning the data is stored in ascending dictionary order of the key strings, and the keys are indexed. On retrieval, the index blocks are loaded into memory, and a binary search over the data index finds the data block that may contain the queried key; that data block is then loaded into memory (a cache policy is applied here to improve performance) and scanned sequentially to locate the key and its value.
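
A sketch of that two-step lookup, binary-searching the first keys of the blocks (illustrative; HFile's real index format is more elaborate):

    import java.util.*;

    // Lookup sketch: binary-search the index (first key of each data block)
    // to pick the block that may contain the key, then scan that block.
    public class HFileLookup {
        // firstKeys[i] is the first key of block i, in ascending order.
        // Returns -1 if the key precedes every block.
        static int findBlock(String[] firstKeys, String key) {
            int pos = Arrays.binarySearch(firstKeys, key);
            if (pos >= 0) return pos;   // exact match on a block's first key
            int insertion = -pos - 1;   // key sorts before firstKeys[insertion]
            return insertion - 1;       // so it can only be in the previous block
        }

        public static void main(String[] args) {
            String[] firstKeys = {"aaa", "mmm", "ttt"};
            System.out.println(findBlock(firstKeys, "pqr")); // 1 -> second block
        }
    }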

Searching a StoreFile

By organizing these indexes into a tree structure, only the top-level index needs to stay resident in memory; the other indexes are read on demand and kept in an LRU cache, so the whole index never has to be loaded into memory at once.
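
A minimal LRU block cache can be sketched with LinkedHashMap's access-order mode (capacity is an assumed value; HBase's actual block cache implementations are more sophisticated):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Minimal LRU cache for on-demand index/data blocks, built on
    // LinkedHashMap's access-order mode.
    public class BlockCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        public BlockCache(int capacity) {
            super(16, 0.75f, true);   // accessOrder = true -> LRU behaviour
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity; // evict the least-recently-used block
        }
    }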

Performance considerations
    1. If a row maps to many columns or many versions (as configured when the table is built), each HFile block holds fewer KeyValues, and a lookup, especially one for the latest data, may have to span multiple blocks.
    2. Scans keep adding blocks to the cache; if several scans are caching blocks and only one process is evicting them, memory can overflow. For online computation involving heavy scanning, it is best to disable the block cache.
    3. Too many StoreFiles hurt read performance, because a single read may need to open many HFiles, and the extra I/O degrades performance.
    4. Larger HFile blocks suit sequential scans better (though large blocks are slower to decompress), at the cost of random read/write performance; smaller blocks suit random reads/writes but need more memory to hold the index (default: 128 KB).
    5. HFileV2's index is split into multiple levels; this reduces memory use and speeds startup, but costs 1-2 extra I/Os. Especially when the HFile is large, a random read may take 2 more I/Os than in V1 and therefore run slower.
Merge operations

References:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.44.2782&rep=rep1&type=pdf

http://duanple.blog.163.com/blog/static/7097176720120391321283/

