For data products such as HBase, the underlying storage architecture directly determines the database's characteristics and usage scenarios. An RDBMS (relational database) typically uses B-trees and B+ trees as its data storage structure, whereas HBase uses the LSM tree.
Binary tree: every node has at most two child nodes. A B-tree search starts from the root node: if the query key equals a key in the current node, it is a hit; otherwise the search descends into the child node whose key range covers the query key.
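The search rule above can be sketched in a few lines. This is only an illustration: the `BTreeNode` class, the `btree_search` function, and the sample tree are hypothetical names made up for this note, not taken from any real database.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class BTreeNode:
    keys: list                                     # sorted keys stored in this node
    children: list = field(default_factory=list)   # empty list means a leaf


def btree_search(node, key):
    """Return True if key is stored in the subtree rooted at node."""
    while True:
        i = bisect_left(node.keys, key)            # first position with keys[i] >= key
        if i < len(node.keys) and node.keys[i] == key:
            return True                            # hit, possibly in a non-leaf node
        if not node.children:
            return False                           # reached a leaf without a hit
        node = node.children[i]                    # descend into the covering child


# Hand-built two-level tree (assumed sample data).
root = BTreeNode(
    keys=[10, 20],
    children=[BTreeNode([3, 5]), BTreeNode([12, 17]), BTreeNode([25, 30])],
)
print(btree_search(root, 17), btree_search(root, 11))
```

Note that in a plain B-tree the search can stop at a non-leaf node, for example when looking up 20 in the sample tree.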
B+ Tree
Factors affecting data read speed
Because traditional mechanical disks are fast at sequential reads and writes but slow at random access, this characteristic strongly influences the choice of on-disk storage structures and algorithms.
To improve access performance, file systems and database systems usually store data in sorted order so that retrieval is faster. This requires the data to stay ordered while it is continually updated, inserted, and deleted; the traditional relational database approach is to use a B+ tree.
When inserting into a B-tree, if the key lands in the last node, the insert is very fast because it is a sequential write.
However, under a mixed write workload of updates, inserts, and deletes, there is much more random I/O because disk blocks have to be reclaimed and reused, and a lot of time is spent on disk seeks.
--------------------------------------------------------------------------------------------------------------- --------------------
PS: a B+ tree is a B-tree plus two rules: 1. Non-leaf nodes hold only pointers, and the data is stored in the leaf nodes. 2. All leaf nodes are chained from left to right with a doubly linked list.
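The second rule is what makes range scans cheap: once the first leaf is found, the scan just follows the links. A minimal sketch, assuming hypothetical `Leaf`, `link`, and `range_scan` helpers and hand-built sample leaves (the index levels above the leaves are left out):

```python
class Leaf:
    def __init__(self, keys):
        self.keys = keys      # sorted keys; in a B+ tree the data lives only in leaves
        self.prev = None      # left neighbour
        self.next = None      # right neighbour


def link(leaves):
    """Chain the leaves left to right as a doubly linked list."""
    for left, right in zip(leaves, leaves[1:]):
        left.next, right.prev = right, left
    return leaves


def range_scan(start_leaf, lo, hi):
    """Collect all keys in [lo, hi], walking right from start_leaf."""
    out = []
    leaf = start_leaf
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out    # keys are sorted, so nothing further can match
            if k >= lo:
                out.append(k)
        leaf = leaf.next
    return out


leaves = link([Leaf([1, 3]), Leaf([5, 8]), Leaf([10, 12])])
print(range_scan(leaves[0], 3, 10))
```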
B+ tree principle: queries on a B+ tree are not slow. But if data is inserted out of order, for example 5, then 10000, then 3, then 800, with large gaps between keys, each insert must first locate the position where the data belongs and only then insert it. If those positions are widely scattered, the nodes needed for each lookup are often not in memory, so the lookup costs disk seeks. Updates behave essentially the same as inserts.
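The effect of scattered inserts can be illustrated with a toy model. This is not a real B+ tree: it assumes each leaf page covers a fixed key range (`keys_per_page`), that only a few pages fit in memory (`cached_pages`), and that every page miss costs one disk seek. All names and numbers here are assumptions chosen for illustration.

```python
import random
from collections import OrderedDict


def count_page_misses(keys, keys_per_page=100, cached_pages=4):
    """Count how often an insert touches a leaf page that is not in a small LRU cache."""
    cache = OrderedDict()                    # page id -> True, in least-recently-used order
    misses = 0
    for key in keys:
        page = key // keys_per_page          # toy rule: fixed key range per leaf page
        if page in cache:
            cache.move_to_end(page)          # mark the page as recently used
        else:
            misses += 1                      # the page must be fetched: one "seek"
            cache[page] = True
            if len(cache) > cached_pages:
                cache.popitem(last=False)    # evict the least recently used page
    return misses


ordered = list(range(10_000))
scattered = ordered[:]
random.Random(0).shuffle(scattered)          # same keys, scattered insert order

print("ordered inserts  :", count_page_misses(ordered))
print("scattered inserts:", count_page_misses(scattered))
```

With ordered keys, each page is fetched once and then reused for the next 99 inserts; with scattered keys, almost every insert lands on a page that has already been evicted.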
LSM Tree
Simply put, the LSM tree gives up some disk read performance in exchange for sequential writes. At first glance, reads seem to be the feature most systems need to guarantee, so trading reads for writes does not look like a good deal. But bear with me; here is an analysis of LSM tree performance.
1. Memory is more than 1000 times faster than disk, and read performance depends mostly on the memory hit rate rather than on the number of disk reads.
2. Writes do not occupy disk I/O, so reads get a larger share of disk I/O time, which can also improve read efficiency.
As a result, although SSTables reduce read performance, as long as the read hit ratio is maintained, read performance is essentially not reduced and may even improve, because reads get more of the disk's I/O capacity. Write performance improves considerably, roughly 5 to 10 times.
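The hit-rate argument can be checked with a little arithmetic. The sketch below assumes a deliberately simple cost model: a read served from memory costs 1 unit and a read served from disk costs 1000 units (taken from the "1000 times" figure above). The function name `avg_read_cost` and the sample hit rates are assumptions for illustration only.

```python
MEMORY_COST = 1       # relative cost of a read served from memory (assumed)
DISK_COST = 1000      # relative cost of a read served from disk (assumed)


def avg_read_cost(hit_rate):
    """Expected cost of one read, given the fraction of reads served from memory."""
    return hit_rate * MEMORY_COST + (1 - hit_rate) * DISK_COST


for hit_rate in (0.5, 0.9, 0.99):
    print(f"hit rate {hit_rate:.2f} -> average cost {avg_read_cost(hit_rate):.2f}")
```

Under this model, raising the hit rate from 0.5 to 0.99 cuts the average read cost from about 500 units to about 11, which is why the memory hit rate dominates read performance far more than the on-disk layout does.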
Inserting data: an LSM tree can be viewed as an N-order merge tree. Data write operations (inserts, modifications, and deletes are all writes) are performed in memory: the data is first inserted into the in-memory tree. When the amount of data in the in-memory tree exceeds a configured threshold, a merge operation is performed. The merge traverses the leaf nodes of the in-memory tree and the leaf nodes of the on-disk tree from left to right and merges them, overwriting old data with the most recent update (or recording it as a separate version). When the merged data reaches the size of a disk storage page, it is persisted to disk and the parent node's pointers to its child nodes are updated.
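A minimal sketch of this write path, under heavy simplifications: the in-memory tree is a plain dict, the on-disk tree is a single sorted Python list standing in for the leaf level, and page-sized persistence and parent pointers are ignored. `TinyLSM` and its `threshold` parameter are hypothetical names for this note, not HBase APIs.

```python
class TinyLSM:
    def __init__(self, threshold=4):
        self.threshold = threshold   # merge once the memtable holds this many keys
        self.memtable = {}           # stand-in for the in-memory tree
        self.disk = []               # sorted (key, value) pairs, stand-in for on-disk leaves

    def put(self, key, value):
        """Inserts and updates only touch memory until the threshold is reached."""
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            self._merge()

    def _merge(self):
        """Merge memory and disk left to right; the newer (in-memory) value wins."""
        mem = sorted(self.memtable.items())
        merged, i, j = [], 0, 0
        while i < len(mem) and j < len(self.disk):
            if mem[i][0] < self.disk[j][0]:
                merged.append(mem[i])
                i += 1
            elif mem[i][0] > self.disk[j][0]:
                merged.append(self.disk[j])
                j += 1
            else:
                merged.append(mem[i])    # same key: overwrite old data with the update
                i += 1
                j += 1
        merged.extend(mem[i:])
        merged.extend(self.disk[j:])
        self.disk = merged               # one sequential write of sorted data
        self.memtable = {}


db = TinyLSM(threshold=2)
db.put("b", 1)
db.put("a", 2)    # threshold reached: first merge
db.put("b", 3)
db.put("c", 4)    # second merge overwrites the old value of "b"
print(db.disk)
```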
Reading data: the non-leaf node data of the on-disk tree is also cached in memory. A read always starts by searching the in-memory sorted tree; if the key is not found there, the search continues, in order, through the sorted trees on disk.
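The read order can be sketched as follows. The sketch assumes the in-memory tree is a dict and each on-disk tree is a pair of parallel sorted lists `(keys, values)`, listed from newest to oldest; `lsm_get` and the sample data are hypothetical.

```python
from bisect import bisect_left


def lsm_get(key, memtable, runs):
    """Look in memory first, then in each on-disk sorted run, newest first."""
    if key in memtable:
        return memtable[key]                 # freshest copy, no disk access needed
    for keys, values in runs:                # runs are ordered newest to oldest
        i = bisect_left(keys, key)           # binary search inside the sorted run
        if i < len(keys) and keys[i] == key:
            return values[i]                 # first match is the most recent version
    return None                              # key not present anywhere


memtable = {"k3": "new"}
runs = [
    (["k1", "k3"], ["a", "old"]),            # newer run
    (["k1", "k2"], ["stale", "b"]),          # older run
]
print(lsm_get("k3", memtable, runs), lsm_get("k1", memtable, runs), lsm_get("k2", memtable, runs))
```

A key that was written recently is answered from memory, while an old or missing key may have to visit every run, which is the read cost the LSM tree accepts in exchange for fast writes.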
A data update on an LSM tree does not require disk access and can be completed in memory, which is much faster than on a B+ tree. When the workload is write-heavy and reads focus on recently written data, an LSM tree can greatly reduce the number of disk accesses and speed up access.
Deleting data: as described above, all LSM tree operations are performed in memory, so a deletion is not a physical deletion. Instead, a tombstone marker is placed on the deleted data, and when the in-memory data reaches the threshold, the marker is written to disk sequentially along with the rest of the in-memory data. This takes up some extra space, but the LSM tree provides a mechanism to reclaim it.
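A small sketch of tombstone-based deletion, assuming both the in-memory data and the on-disk data are plain dicts and that space is reclaimed by a merge step, called `compact` here. `TOMBSTONE`, `delete`, `get`, and `compact` are hypothetical names for illustration.

```python
TOMBSTONE = object()     # unique marker meaning "this key has been deleted"


def delete(memtable, key):
    """A delete is just another in-memory write: it records a tombstone."""
    memtable[key] = TOMBSTONE


def get(key, memtable, disk):
    """Memory shadows disk, so a tombstone in memory hides the older on-disk value."""
    for layer in (memtable, disk):
        if key in layer:
            value = layer[key]
            return None if value is TOMBSTONE else value
    return None


def compact(memtable, disk):
    """Merge memory into disk and drop tombstoned keys, reclaiming their space."""
    merged = {**disk, **memtable}            # in-memory entries override on-disk ones
    return {k: merged[k] for k in sorted(merged) if merged[k] is not TOMBSTONE}


disk = {"a": 1, "b": 2}
memtable = {}
delete(memtable, "a")
print(get("a", memtable, disk), get("b", memtable, disk))
print(compact(memtable, disk))
```

Until the merge runs, the old value of "a" is still physically on disk; it is only hidden by the tombstone.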
As a storage structure, the B+ tree is not unique to relational databases; NoSQL databases can use B+ trees too. Likewise, a relational database can use an LSM tree. As SSDs mature and large-capacity persistent memory arrives, I believe the "old" B+ tree storage structure will be rejuvenated.
Summary
Binary tree: each node stores a single key; equal is a hit, smaller goes to the left child, larger goes to the right child;
B-tree: a multiway search tree; each node stores up to M keys, and non-leaf nodes store pointers to the child nodes covering each key range; every key appears in the tree exactly once, and a non-leaf node can be a hit;
B+ tree: on top of the B-tree, linked-list pointers are added to the leaf nodes; all keys appear in the leaf nodes, and the non-leaf nodes serve as an index over the leaf nodes; a B+ tree lookup always hits at a leaf node;
B* tree: (seek) on top of the B+ tree, linked-list pointers are also added to the non-leaf nodes, and the minimum node utilization is raised from 1/2 to 2/3;
LSM tree: (transfer) on top of the B+ tree, reads and writes are separated; reads check memory first and then disk, and write operations (inserts, modifications, and deletes are all writes) are all performed in memory.
(In terms of disk usage, there are two different database paradigms: seek and transfer.) An RDBMS is usually seek-bound, mainly because of the B-tree or B+ tree structure it uses to store data. Operations proceed at the disk seek rate, and each access typically requires log(N) seeks.
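To get a feel for the log(N) figure, the number of tree levels can be computed directly. The sketch assumes the base of the logarithm is the node fanout and that, in the worst case, each level costs one seek; the fanout of 100 and the row counts are made-up sample values, and `btree_levels` is a hypothetical helper.

```python
def btree_levels(num_keys, fanout):
    """Smallest number of levels whose capacity (fanout ** levels) covers num_keys."""
    levels, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout       # each extra level multiplies the capacity by the fanout
        levels += 1
    return levels


for num_keys in (10**6, 10**9):
    print(f"{num_keys} keys, fanout 100 -> {btree_levels(num_keys, 100)} levels")
```

Integer multiplication is used instead of `math.log` to avoid floating-point rounding at exact powers of the fanout. In practice the upper levels are usually cached in memory, so the number of real seeks per access is lower than the level count.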
Binary tree, B-tree, B+ tree, B* tree, LSM tree