The index structure of TokuDB: the implementation of the fractal tree


A brief introduction to the fractal tree

Original: http://www.bitstech.net/2015/12/15/tokudb-index-introduction/

A fractal tree is a write-optimized on-disk index data structure. In general, a fractal tree's write operations (insert/update/delete) perform better than a B+ tree's, while its read performance stays close to that of a B+ tree. According to Percona's test results, TokuDB's fractal tree clearly outperforms InnoDB's B+ tree on writes, and is slightly slower on reads. The LSM-tree is a similar index structure, but the LSM-tree's superior write performance comes at a much larger cost to read performance.

The most important industrial implementation of the fractal tree is the ft-index (Fractal Tree Index) key-value storage engine developed by Tokutek. The project started in 2007 and was open-sourced in 2013; the code is currently hosted on GitHub under the GNU General Public License. To fully exploit the power of the ft-index storage engine, Tokutek built a MySQL storage engine plug-in on top of this key-value engine, implementing the full MySQL storage engine API; that project is called TokuDB. Tokutek also implemented the MongoDB storage engine API on top of ft-index, in a project called TokuMX. On April 14, 2015, Percona announced its acquisition of Tokutek, bringing the ft-index/TokuDB/TokuMX product line under Percona. Since then, Percona has claimed to be the first technology vendor to offer software and solutions for both MySQL and MongoDB.

This article focuses on the ft-index behind TokuDB. Compared with the B+ tree, ft-index has several important features:

    • In terms of both theoretical complexity and measured performance, the insert/delete/update performance of ft-index is better than that of the B+ tree, while its read performance is lower.
    • Ft-index uses larger index and data pages (ft-index defaults to 4 MB, while InnoDB defaults to 16 KB), which gives ft-index a higher compression ratio on data and index pages. In other words, with page compression enabled, inserting the same amount of data consumes less storage space in ft-index.
    • Ft-index supports online DDL (hot schema change). Simply put, users can still perform write operations while a DDL operation (for example, adding an index) is in progress; this capability falls naturally out of the ft-index tree structure. Due to space limitations, this article does not describe the implementation of hot schema change in detail.

In addition, ft-index supports transactions (ACID), transactional MVCC (multi-version concurrency control), and crash recovery.

Precisely because of these characteristics, Percona claims that TokuDB both delivers a large performance improvement and reduces customers' storage costs.

Ft-index disk storage structure

The index structure of ft-index is shown in the figure below (for ease of description and understanding, I have simplified and abstracted the ft-index binary storage format to some degree):

In the figure, a gray area represents a page of the ft-index fractal tree, a green area represents a key, and the space between two green areas represents a child pointer. Blocknum is the offset of the page that a child pointer points to. Fanout is the fan-out of the fractal tree, that is, the number of child pointers per node. Nodesize is the number of bytes a page occupies. NonLeafNode marks the current page as a non-leaf node, and LeafNode marks it as a leaf node; leaf nodes are the bottom-level nodes that store key-value pairs, while non-leaf nodes store no values. Height is the height of a node: the leaf nodes at the bottom have height 1, the nodes one level above them have height 2, and in this figure the root node has height 3. Depth is the depth of a node: the root node has depth 0, and the nodes one level below the root have depth 1.

The tree structure of a fractal tree is very similar to that of a B+ tree: it consists of several nodes (called nodes or blocks here; InnoDB calls them pages). Each node contains an ordered set of pivot keys. Suppose a node's pivot keys are [3, 8]; these keys divide the whole interval (-∞, +∞) into three intervals, (-∞, 3), [3, 8), and [8, +∞), and each interval corresponds to one child pointer. In a B+ tree, a child pointer simply points to a page; in a fractal tree, each child pointer carries, in addition to the child node's address (blocknum), a message buffer (msg_buffer), which is a FIFO (first-in, first-out) queue that holds update operations such as insert/delete/update/hot-schema-change.

According to the ft-index source code, the structure of the fractal tree is defined more rigorously:

    • A node (block; InnoDB calls it a page) consists of an ordered set of pivot keys, and the first pivot key is a null key that represents negative infinity.
    • Nodes are divided into two types: leaf nodes and non-leaf nodes. The child pointers of a leaf node point to basement nodes, while those of a non-leaf node point to ordinary nodes. A basement node stores multiple key-value pairs, which means every lookup must ultimately reach a basement node to obtain the data (value). This is similar to the leaf pages of a B+ tree: data (values) are stored in leaf nodes, while non-leaf nodes store keys for indexing. When a leaf node is loaded into memory, ft-index converts the key-value pairs of the whole basement node into a weakly balanced binary search tree so that values can be found quickly; this balanced binary tree has a rather amusing name, the scapegoat tree, which we will not expand on here.
    • Each key interval of a node corresponds to one child pointer. The child pointers of a non-leaf node each carry a message buffer, which is a FIFO queue used to store update operations such as insert/delete/update/hot-schema-change. Both the child pointers and the message buffers are serialized in the node's disk file.
    • The number of child pointers of each non-leaf node must lie in the range [fanout/4, fanout]. Here fanout is a parameter of the fractal tree (the B+ tree has the same concept), mainly used to keep the height of the tree under control. When the number of children of a non-leaf node drops below fanout/4, the node is considered too empty and needs to be merged with another node (node merge), which can reduce the height of the whole tree; when the number of child pointers exceeds fanout, the node is considered too full and needs to be split into two (node split). With this constraint, the on-disk data can theoretically be kept in a reasonably balanced tree structure, which bounds the complexity of inserts and queries.

Note: in the actual ft-index implementation, the conditions that keep the tree balanced are more complex. For example, in addition to fanout, the total size of a node must be kept within the interval [nodesize/4, nodesize], where nodesize is generally 4 MB; when a node falls outside this interval, the corresponding merge or split operation is performed.
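The node layout described above can be sketched in a few lines of Python. This is a simplified model, not the real ft-index C structures; the names (ChildPointer, child_index) and the constants are illustrative:

```python
from collections import deque
from dataclasses import dataclass, field

FANOUT = 16          # illustrative; the real fanout is a tunable parameter
NODESIZE = 4 << 20   # the 4 MB default node size mentioned above

@dataclass
class ChildPointer:
    blocknum: int                                     # disk offset of the child
    msg_buffer: deque = field(default_factory=deque)  # FIFO queue of pending ops

@dataclass
class Node:
    height: int      # 0 = leaf, > 0 = non-leaf
    pivots: list     # sorted pivot keys, e.g. [3, 8]
    children: list   # one ChildPointer per key interval (len(pivots) + 1)
    basement: dict = field(default_factory=dict)      # key -> value, leaf only

    def child_index(self, key):
        """Pick the key interval (and thus the child pointer) containing key."""
        i = 0
        while i < len(self.pivots) and key >= self.pivots[i]:
            i += 1
        return i
```

With pivots [3, 8], keys in (-∞, 3) map to child 0, [3, 8) to child 1, and [8, +∞) to child 2, matching the interval layout described above.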

The insert/delete/update implementation of the fractal tree

Earlier we mentioned that the fractal tree is a write-optimized data structure whose write performance is better than that of the B+ tree. So how exactly does it achieve better write performance?

First of all, the write performance referred to here is random write performance. As a simple example, suppose we keep executing this SQL statement against an InnoDB table in MySQL: insert into sbtest set x = uuid(), where the sbtest table has a unique index on column x. The randomness of uuid() causes the inserted rows to be scattered across different leaf nodes. In a B+ tree, a large number of such random writes causes the hot pages in the LRU cache to concentrate in the upper levels of the tree (as the figure shows), which reduces the probability that the bottom-level leaf nodes hit the cache, triggers a large number of disk I/O operations, and thus creates a random-write performance bottleneck for the B+ tree. Sequential writes to a B+ tree, however, are fast, because they take full advantage of locality in the hot data and the disk I/O frequency is greatly reduced.

The flow of the fractal tree's insert operation is explained below. To simplify the description, we adopt the following conventions:

a. We take the insert operation as an example and assume the inserted datum is (key, value);
b. "Loading a node (page)" below means first checking whether the node hits the LRU cache; only on a cache miss does ft-index seek to the node's offset and read the page into memory;
c. To focus on the core flow, we ignore crash-recovery logs and transactions for now.

The detailed process is as follows:

    1. Load the root node;
    2. Determine whether the root node needs to split (or merge); if the split (or merge) condition is met, split (or merge) the root node. Interested readers can work out for themselves what splitting the root node looks like in detail.
    3. If the root node has height > 0, i.e. root is a non-leaf node, binary-search the pivot keys to find the key interval containing key, wrap (key, value) as a message (insert, key, value), and put it into the message buffer of the child pointer corresponding to that interval.
    4. If the root node has height = 0, i.e. root is a leaf node, apply the message (insert, key, value) to the basement node, that is, insert (key, value) into the basement node.
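Steps 3 and 4 above can be sketched as follows. This is an assumed simplification (nodes are plain dicts, binary search is replaced by a linear scan, and root splits are omitted), meant only to show that an insert never descends below the root:

```python
from collections import deque

def make_leaf():
    return {"height": 0, "basement": {}}

def make_root(pivots, children):
    return {"height": 1, "pivots": pivots, "children": children,
            "buffers": [deque() for _ in children]}  # one FIFO per child pointer

def insert(root, key, value):
    """An insert returns at the root; it never searches down to a basement node."""
    if root["height"] == 0:
        # Step 4: the root is a leaf, apply the message to the basement node.
        root["basement"][key] = value
        return
    # Step 3: find the key interval (linear scan stands in for binary search).
    i = 0
    while i < len(root["pivots"]) and key >= root["pivots"][i]:
        i += 1
    # Wrap (key, value) as a message and enqueue it in that child's buffer.
    root["buffers"][i].append(("insert", key, value))
```

For example, with pivots [3, 8], inserting key 5 enqueues a message in the buffer of the second child pointer and returns immediately, without touching any leaf.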

There is a seemingly strange consequence here: under a large number of insertions (random or sequential), the root node keeps filling up, which causes it to split repeatedly. These root splits produce many height = 1 nodes; when the height = 1 nodes fill up in turn, they produce many height = 2 nodes, and the tree eventually grows taller. This strange behavior hides the secret of why fractal-tree writes outperform B+-tree writes: every insertion returns at the root, and no write operation needs to search down the tree for a basement node. As a result, a large amount of hot data stays near the top of the tree around the root node (the hot-data distribution at this point resembles the figure), which takes full advantage of the locality of hot data and greatly reduces disk I/O.

The update/delete operations are similar to insert, but note that the random-read performance of the fractal tree is not as good as that of InnoDB's B+ tree (described in detail later). Update/delete therefore needs to be divided into two cases, whose measured performance can differ significantly:

    • Overwrite-style update/delete. When the key exists, the update/delete is executed; when the key does not exist, nothing happens and no error needs to be reported.
    • Strictly matching update/delete. When the key exists, the update/delete is executed; when the key does not exist, an error must be reported to the upper-level application. In this case we first have to query whether the key exists in an ft-index basement node, so the point-query quietly drags down the performance of the update/delete.
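The difference between the two cases can be sketched like this (an illustrative model, not the ft-index API: the `basement` dict stands in for a full point-query against on-disk data):

```python
from collections import deque

msg_buffer = deque()     # root-level message buffer (sketch)
basement = {1: "a"}      # key-value pairs already stored in a basement node

def blind_update(key, value):
    """Overwrite-style: just enqueue a message at the root; no read required."""
    msg_buffer.append(("update", key, value))

def strict_update(key, value):
    """Strictly matching: a point-query must first confirm the key exists."""
    if key not in basement:            # stands in for a full point-query
        raise KeyError(key)            # report the error to the caller
    msg_buffer.append(("update", key, value))
```

The blind variant keeps the cheap write path of the fractal tree, while the strict variant pays the cost of a random read before it can enqueue anything.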

In addition, ft-index includes some optimizations for sequential inserts, such as sequential-write acceleration, to improve sequential write performance; we will not expand on these here.

The point-query implementation of the fractal tree

In ft-index, a query like select * from table where id = ? (where id is an index) is called a point-query; a query like select * from table where id >= ? and id <= ? (where id is an index) is called a range-query. As mentioned above, the read performance of a point-query is not as good as that of InnoDB's B+ tree; here we describe the point-query flow in detail. (Assume the key to be queried is key.)

    1. Load the root node and binary-search its pivot keys to determine which key interval key falls into, finding the child pointer of that interval.
    2. Load the node the child pointer points to. If it is a non-leaf node, continue searching down the fractal tree; if it is a leaf node, stop the search.

Having found the leaf node, we still cannot directly return the value stored in its basement node to the user, because fractal-tree insertions are performed by way of messages: all the messages along the path from the root node to this leaf node must first be applied, in order, to the leaf's basement node. Only after all those messages have been applied can we look up the key's value in the basement node, which is the value the user wants.

The lookup process of the fractal tree is thus basically similar to that of InnoDB's B+ tree, except that the fractal tree needs to push the message buffers down from the root node to the leaf node (refer to the code for the details of the push-down, which we will not expand on) and apply the messages to the basement node. Note that although the push-down during a lookup may cause some nodes on the path to become overfull, ft-index does not split or merge nodes during a query, because a design principle of ft-index is: insert/delete/update operations are responsible for node splits and merges, while select operations are responsible only for lazily pushing messages down (lazy push). In this way the fractal tree lets a later select operation apply buffered insert/delete/update operations to the actual data nodes, completing the update.
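The root-to-leaf lookup with message apply-down can be sketched as follows. One assumed simplification: instead of physically pushing whole buffers down (as the real ft-index does), we merely replay the messages relevant to the searched key, oldest first, on top of the basement value:

```python
from collections import deque

def point_query(node, key):
    """Walk root to leaf, collecting buffered messages for key along the path."""
    pending = []
    while node["height"] > 0:
        i = 0
        while i < len(node["pivots"]) and key >= node["pivots"][i]:
            i += 1
        # Each buffer is FIFO; keep only the messages destined for this key.
        pending += [m for m in node["buffers"][i] if m[1] == key]
        node = node["children"][i]
    value = node["basement"].get(key)   # read the basement node first...
    for op, _, v in pending:            # ...then apply the buffered messages
        value = v if op == "insert" else None   # a "delete" clears the value
    return value
```

A freshly buffered insert is therefore visible to a query even though the basement node on disk has not yet been touched, which is exactly why the messages must be applied before returning a value.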

The range-query implementation of the fractal tree

Next we introduce the range-query implementation. Simply put, a fractal-tree range-query is basically equivalent to performing N point-queries, and its cost is basically the cost of N point-queries. Because the fractal tree stores the basement nodes' pending updates in the msg_buffers of the non-leaf nodes, for each key we look up we must descend from the root node to the leaf node and apply the messages on that path to the basement node's value before reading it. The figure illustrates this process.

In a B+ tree, by contrast, the bottom-level leaf nodes are linked into a doubly linked list by pointers, as the figure shows. We therefore only need to descend from the root node to the leaf node containing the first key that satisfies the condition, and then iterate along the leaf nodes' next pointers to obtain all the key-value pairs in the range. For a B+-tree range-query, then, apart from the initial random read of the descent from root to leaf, the subsequent data reads are essentially sequential I/O.

Comparing the range-query implementations of the fractal tree and the B+ tree makes it obvious that the fractal tree's range-query cost is significantly higher than the B+ tree's: the fractal tree must traverse the entire subtree of the root node that covers the range, while the B+ tree needs only one seek to the starting key of the range, after which the iteration is essentially sequential I/O.
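The cost argument above can be put into a back-of-the-envelope model. The numbers are illustrative only: the model ignores caching and the fact that adjacent point-queries share upper-level pages, so it overstates the fractal tree's cost, but it captures the N-descents-versus-one-descent asymmetry:

```python
import math

def ft_range_io(n_keys, fanout, total_keys):
    """Fractal tree: roughly one root-to-leaf descent per key in the range."""
    height = math.ceil(math.log(total_keys, fanout))   # tree height in levels
    return n_keys * height

def bpt_range_io(n_keys, fanout, total_keys, keys_per_leaf):
    """B+ tree: one descent, then a sequential scan along the leaf chain."""
    height = math.ceil(math.log(total_keys, fanout))   # single descent
    return height + math.ceil(n_keys / keys_per_leaf)  # plus leaf-chain reads
```

For a table of a million keys with fanout 16, scanning a 100-key range costs on the order of hundreds of page reads in the fractal-tree model but only a handful in the B+-tree model, which is the gap the article describes.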

Summary

Starting from the tree structure of the fractal tree, this article has introduced its insert/delete/update and query operations in detail. In general, the fractal tree is a write-optimized data structure: its core idea is to use the nodes' message buffers to cache update operations, fully exploiting data locality and converting random writes into sequential writes, which greatly improves random-write efficiency. The iiBench results from the Tokutek development team show that TokuDB's insert operation (random write) is far faster than InnoDB's, while its select operation (random read) is somewhat slower than InnoDB's, though the gap is small. At the same time, because TokuDB uses large 4 MB pages, its compression ratio is relatively high. This is why Percona claims that TokuDB gives higher performance at lower cost.

In addition, online table-structure modification (hot schema change) is also implemented on top of the message buffers. The difference from insert/delete/update is the push-down mode: a hot-schema-change message is pushed down by broadcast (a message travels from a parent node to all of its children), whereas an insert/delete/update message is pushed down by unicast (a message travels from a parent node only to the child of the corresponding key interval). Because the implementation otherwise resembles the insert operation, we will not expand on it here.

Finally, readers interested in ft-index are welcome to get in touch and discuss.

