About the B+ tree (with Python simulation code)

Source: Internet
Author: User

A few days ago I wrote an article about the B-tree (http://thuhak.blog.51cto.com/2891595/1261783). Today I continue along the same line and write about the B+ tree.

The B+ tree was my real goal all along. Working through it has given me a better understanding of the basic principles behind file and database indexes.


Previously, I treated the B+ tree as just a variant of the B-tree: an optimization for some circumstances, with the B-tree possibly being the better choice in others. After implementing it, I found that the B+ tree can replace the B-tree in essentially every situation and gives better index performance, because the core of its design is to make up for the biggest defect of the B-tree.


What is the biggest defect of the B-tree?

First, for multi-way search trees such as the B-tree and the B+ tree, a very important property is that the degree (fan-out) of the tree is very large. Only a large degree keeps the tree shallow and the number of disk reads low. The higher the degree, the larger the share of leaf nodes in the tree. If the degree is 1000, there are on the order of 1000 times as many leaf nodes as internal nodes in the layer above, and the layers above that are even more negligible. In other words, about 99.9% of the nodes in the tree are leaves. In a B-tree, however, all nodes share the same structure: each holds a number of data items and pointers to child nodes, and these two kinds of items occupy almost all of the node's space. The number of data items in a node is one less than the number of child (disk) pointers, so the two counts are nearly equal. You do not notice this in a dynamically typed language like Python, but in a statically typed language like C the space for the children array is reserved even when the array is empty. As a result, the disk space taken by the child-pointer arrays of the leaves, which are the vast majority of nodes, is completely wasted.
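As a rough check of the leaf-share claim, the sketch below counts nodes in an idealized tree in which every internal node has exactly `degree` children. The function name `leaf_fraction` and the completely-full-tree assumption are mine, for illustration only; real trees are only partially full.

```python
def leaf_fraction(degree, height):
    """Share of leaf nodes in a full tree with `height` levels (root is level 1)."""
    # Level k (0-based) of a completely full tree holds degree**k nodes.
    nodes_per_level = [degree ** k for k in range(height)]
    return nodes_per_level[-1] / sum(nodes_per_level)


if __name__ == '__main__':
    for d in (4, 100, 1000):
        print(d, round(leaf_fraction(d, 3), 6))
```

With a degree of 1000 and three levels, the leaf share comes out at roughly 99.9%, while a small degree such as 4 leaves a much larger share to internal nodes.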

How much is wasted depends on the ratio between the size of a data item and the size of a disk pointer, which in turn depends on the size of the keys and values being stored. Assume the ratio is 2 to 1. Since a node holds almost as many pointers as data items, a B-tree then wastes almost 1/3 of its space.

To address this problem, the B+ tree uses separate data structures for leaf nodes and internal nodes, so that leaf nodes store no child pointers. For leaf nodes of the same size, a B+ tree leaf therefore holds more data than a B-tree leaf; under the assumption above, about 1/2 more. The tree is likely to be shallower than the corresponding B-tree, and searches and large range traversals need fewer disk loads.
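The 1/3 and 1/2 figures follow from simple arithmetic. A minimal sketch, assuming a hypothetical 4096-byte node, 16-byte data items, and 8-byte child pointers (sizes chosen only to give the 2:1 ratio used above), and ignoring node headers:

```python
NODE_BYTES = 4096   # assumed node (page) size
DATA_BYTES = 16     # assumed size of one key-value item
PTR_BYTES = 8       # assumed size of one child pointer (2:1 ratio)


def btree_entries(node_bytes=NODE_BYTES):
    # A B-tree node with n items reserves n + 1 child pointers, even in a leaf.
    return (node_bytes - PTR_BYTES) // (DATA_BYTES + PTR_BYTES)


def bplus_leaf_entries(node_bytes=NODE_BYTES):
    # A B+ tree leaf stores items only (the single sibling pointer is ignored).
    return node_bytes // DATA_BYTES


b = btree_entries()
bp = bplus_leaf_entries()
wasted = (b + 1) * PTR_BYTES / NODE_BYTES   # share of a B-tree leaf spent on unused pointers
gain = bp / b                               # B+ leaf capacity relative to a B-tree leaf

if __name__ == '__main__':
    print(b, bp, round(wasted, 3), round(gain, 3))
```

Under these assumed sizes the unused pointer array takes about a third of a B-tree leaf, and the B+ tree leaf holds about one and a half times as many items.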


In addition, the B+ tree stores all data in the leaf nodes. The leaves can also be chained into a linked list whose head is kept separately, which gives direct access to the data. Some articles claim that this is a huge optimization for range searches. In my opinion, however, the main benefit of this feature is simpler code. In terms of performance it can only be worse than traversing the tree, not better. Whether you follow the pointers between leaf nodes or traverse the tree, the number of nodes visited is almost the same, so the performance of a range search of a given size depends only on how sequential the accesses are. Traversing down from the root, you learn a large range of child nodes at once and can sort the accesses to them for better continuity. If you follow sibling pointers instead, first, the sibling may have been inserted later and is not necessarily stored contiguously with the current node; second, you must load each node from disk into memory before you know where its sibling is on disk. The scan becomes a series of random, synchronous disk operations, and the performance drop is easy to imagine.

So the claim that the B+ tree has sibling pointers in order to speed up database scans is, in my view, incorrect.
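The "same number of nodes" argument can be illustrated on a toy two-level tree. This is only a sketch: the names `Leaf`, `build`, `scan_chain`, and `scan_root` are hypothetical, keys are assumed unique, and the code counts leaf reads without modelling disk layout or seek order.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys):
        self.keys = keys      # sorted keys stored in this leaf
        self.next = None      # sibling pointer to the next leaf


def build(leaf_keys):
    """Build a toy tree: one root (a list of separators) over a chain of leaves."""
    leaves = [Leaf(k) for k in leaf_keys]
    for a, b in zip(leaves, leaves[1:]):
        a.next = b
    separators = [leaf.keys[0] for leaf in leaves[1:]]
    return separators, leaves


def scan_chain(separators, leaves, lo, hi):
    """Find the first leaf via the root, then follow sibling pointers."""
    leaf = leaves[bisect_right(separators, lo)]
    out, reads = [], 0
    while leaf is not None:
        reads += 1                        # each hop needs the previous leaf loaded first
        out += [k for k in leaf.keys if lo <= k <= hi]
        if leaf.keys[-1] >= hi:           # later leaves hold only larger keys
            break
        leaf = leaf.next
    return out, reads


def scan_root(separators, leaves, lo, hi):
    """Pick the whole child range at the root, then read those leaves."""
    first = bisect_right(separators, lo)
    last = bisect_right(separators, hi)
    out = []
    for leaf in leaves[first:last + 1]:   # the full set of leaves is known up front
        out += [k for k in leaf.keys if lo <= k <= hi]
    return out, last - first + 1


if __name__ == '__main__':
    seps, leaves = build([[1, 2], [3, 4], [5, 6], [7, 8, 9]])
    print(scan_chain(seps, leaves, 2, 6))
    print(scan_root(seps, leaves, 2, 6))
```

Both functions return the same keys and touch the same number of leaves here. The difference the text points at is that `scan_root` knows every leaf it needs before reading any of them, while `scan_chain` discovers them one at a time.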


Now for the code. As before, it only simulates the insert, delete, and search operations of the data structure in memory.



#!/usr/bin/env python3
from bisect import bisect_right, bisect_left
from collections import deque
from functools import total_ordering


class InitError(Exception):
    pass


class ParaError(Exception):
    pass


@total_ordering
class KeyValue(object):
    """A key-value pair that compares by key, against pairs or bare keys."""
    __slots__ = ('key', 'value')

    def __init__(self, key, value):
        self.key = key
        self.value = value

    def __str__(self):
        return str((self.key, self.value))

    @staticmethod
    def _key_of(other):
        return other.key if isinstance(other, KeyValue) else other

    def __eq__(self, other):
        return self.key == self._key_of(other)

    def __lt__(self, other):
        return self.key < self._key_of(other)


class Bptree_InterNode(object):
    """Internal node: ilist holds index keys, clist holds child nodes."""

    def __init__(self, M):
        if not isinstance(M, int):
            raise InitError('M must be int')
        if M <= 3:
            raise InitError('M must be greater than 3')
        self.__M = M
        self.clist = []
        self.ilist = []
        self.par = None

    def isleaf(self):
        return False

    def isfull(self):
        return len(self.ilist) >= self.M - 1

    def isempty(self):
        # "empty" means at minimum fill: the node cannot give up a key
        return len(self.ilist) <= (self.M + 1) // 2 - 1

    @property
    def M(self):
        return self.__M


class Bptree_Leaf(object):
    """Leaf node: vlist holds KeyValue items, bro points to the next leaf."""

    def __init__(self, L):
        if not isinstance(L, int):
            raise InitError('L must be int')
        self.__L = L
        self.vlist = []
        self.bro = None
        self.par = None

    def isleaf(self):
        return True

    def isfull(self):
        # a leaf is split after insertion, so "full" means it has overflowed
        return len(self.vlist) > self.L

    def isempty(self):
        return len(self.vlist) <= (self.L + 1) // 2

    @property
    def L(self):
        return self.__L


class Bptree(object):
    def __init__(self, M, L):
        if L > M:
            raise InitError('L must be less than or equal to M')
        self.__M = M
        self.__L = L
        self.__root = Bptree_Leaf(L)
        self.__leaf = self.__root      # head of the leaf linked list

    @property
    def M(self):
        return self.__M

    @property
    def L(self):
        return self.__L

    def insert(self, key_value):
        node = self.__root

        def split_node(n1):
            # split a full internal node; the middle key moves up to the parent
            mid = self.M // 2
            newnode = Bptree_InterNode(self.M)
            newnode.ilist = n1.ilist[mid:]
            newnode.clist = n1.clist[mid:]
            newnode.par = n1.par
            for c in newnode.clist:
                c.par = newnode
            if n1.par is None:
                newroot = Bptree_InterNode(self.M)
                newroot.ilist = [n1.ilist[mid - 1]]
                newroot.clist = [n1, newnode]
                n1.par = newnode.par = newroot
                self.__root = newroot
            else:
                i = n1.par.clist.index(n1)
                n1.par.ilist.insert(i, n1.ilist[mid - 1])
                n1.par.clist.insert(i + 1, newnode)
            n1.ilist = n1.ilist[:mid - 1]
            n1.clist = n1.clist[:mid]
            return n1.par

        def split_leaf(n2):
            # split an overflowing leaf; the first key of the new leaf is
            # copied up to the parent as the index key
            mid = (self.L + 1) // 2
            newleaf = Bptree_Leaf(self.L)
            newleaf.vlist = n2.vlist[mid:]
            if n2.par is None:
                newroot = Bptree_InterNode(self.M)
                newroot.ilist = [n2.vlist[mid].key]
                newroot.clist = [n2, newleaf]
                n2.par = newleaf.par = newroot
                self.__root = newroot
            else:
                i = n2.par.clist.index(n2)
                n2.par.ilist.insert(i, n2.vlist[mid].key)
                n2.par.clist.insert(i + 1, newleaf)
                newleaf.par = n2.par
            n2.vlist = n2.vlist[:mid]
            # keep the leaf chain intact when a non-rightmost leaf splits
            newleaf.bro = n2.bro
            n2.bro = newleaf

        def insert_node(n):
            if not n.isleaf():
                if n.isfull():
                    # split full internal nodes on the way down
                    insert_node(split_node(n))
                else:
                    p = bisect_right(n.ilist, key_value)
                    insert_node(n.clist[p])
            else:
                p = bisect_right(n.vlist, key_value)
                n.vlist.insert(p, key_value)
                if n.isfull():
                    split_leaf(n)

        insert_node(node)

    def search(self, mi=None, ma=None):
        """Return all items with mi <= key <= ma; either bound may be None."""
        result = []
        node = self.__root
        leaf = self.__leaf
        if mi is None and ma is None:
            raise ParaError('you need to set up a search range')
        elif mi is not None and ma is not None and mi > ma:
            raise ParaError('upper bound must be greater than or equal to lower bound')

        def search_key(n, k):
            if n.isleaf():
                p = bisect_left(n.vlist, k)
                return (p, n)
            else:
                p = bisect_right(n.ilist, k)
                return search_key(n.clist[p], k)

        if mi is None:
            # no lower bound: walk the leaf chain from its head
            while True:
                for kv in leaf.vlist:
                    if kv <= ma:
                        result.append(kv)
                    else:
                        return result
                if leaf.bro is None:
                    return result
                else:
                    leaf = leaf.bro
        elif ma is None:
            # no upper bound: find the start, then walk to the end of the chain
            index, leaf = search_key(node, mi)
            result.extend(leaf.vlist[index:])
            while True:
                if leaf.bro is None:
                    return result
                else:
                    leaf = leaf.bro
                    result.extend(leaf.vlist)
        else:
            if mi == ma:
                i, l = search_key(node, mi)
                try:
                    if l.vlist[i] == mi:
                        result.append(l.vlist[i])
                    return result
                except IndexError:
                    return result
            else:
                i1, l1 = search_key(node, mi)
                _, l2 = search_key(node, ma)
                # one past the last item <= ma, so the upper bound is inclusive
                i2 = bisect_right(l2.vlist, ma)
                if l1 is l2:
                    result.extend(l1.vlist[i1:i2])
                    return result
                else:
                    result.extend(l1.vlist[i1:])
                    l = l1
                    while True:
                        if l.bro is l2:
                            result.extend(l2.vlist[:i2])
                            return result
                        else:
                            result.extend(l.bro.vlist)
                            l = l.bro

    def traversal(self):
        result = []
        l = self.__leaf
        while True:
            result.extend(l.vlist)
            if l.bro is None:
                return result
            else:
                l = l.bro

    def show(self):
        # breadth-first print of the tree, one node per line
        print('this b+tree is:\n')
        q = deque()
        h = 0
        q.append([self.__root, h])
        while True:
            try:
                w, hei = q.popleft()
            except IndexError:
                return
            else:
                if not w.isleaf():
                    print(w.ilist, 'the height is', hei)
                    if hei == h:
                        h += 1
                    q.extend([[i, h] for i in w.clist])
                else:
                    print([v.key for v in w.vlist], 'the leaf is,', hei)

    def delete(self, key_value):
        def merge(n, i):
            # merge child i+1 of n into child i
            left, right = n.clist[i], n.clist[i + 1]
            if left.isleaf():
                left.vlist = left.vlist + right.vlist
                left.bro = right.bro
            else:
                left.ilist = left.ilist + [n.ilist[i]] + right.ilist
                left.clist = left.clist + right.clist
                # the moved children now belong to the left node
                for c in right.clist:
                    c.par = left
            del n.clist[i + 1]
            del n.ilist[i]
            if n.ilist == []:
                # the root lost its last key: the merged child becomes the root
                n.clist[0].par = None
                self.__root = n.clist[0]
                return self.__root
            else:
                return n

        def tran_l2r(n, i):
            # move one entry from child i to its right sibling i+1
            left, right = n.clist[i], n.clist[i + 1]
            if not left.isleaf():
                right.clist.insert(0, left.clist[-1])
                left.clist[-1].par = right
                right.ilist.insert(0, n.ilist[i])
                n.ilist[i] = left.ilist[-1]
                left.clist.pop()
                left.ilist.pop()
            else:
                right.vlist.insert(0, left.vlist[-1])
                left.vlist.pop()
                n.ilist[i] = right.vlist[0].key

        def tran_r2l(n, i):
            # move one entry from child i+1 to its left sibling i
            left, right = n.clist[i], n.clist[i + 1]
            if not left.isleaf():
                left.clist.append(right.clist[0])
                right.clist[0].par = left
                left.ilist.append(n.ilist[i])
                n.ilist[i] = right.ilist[0]
                right.clist.pop(0)
                right.ilist.pop(0)
            else:
                left.vlist.append(right.vlist[0])
                right.vlist.pop(0)
                n.ilist[i] = right.vlist[0].key

        def del_node(n, kv):
            if not n.isleaf():
                p = bisect_right(n.ilist, kv)
                if p == len(n.ilist):
                    # rightmost child: it only has a left sibling
                    if not n.clist[p].isempty():
                        return del_node(n.clist[p], kv)
                    elif not n.clist[p - 1].isempty():
                        tran_l2r(n, p - 1)
                        return del_node(n.clist[p], kv)
                    else:
                        # merge with the left sibling (there is no child p+1)
                        return del_node(merge(n, p - 1), kv)
                else:
                    if not n.clist[p].isempty():
                        return del_node(n.clist[p], kv)
                    elif not n.clist[p + 1].isempty():
                        tran_r2l(n, p)
                        return del_node(n.clist[p], kv)
                    else:
                        return del_node(merge(n, p), kv)
            else:
                p = bisect_left(n.vlist, kv)
                try:
                    pp = n.vlist[p]
                except IndexError:
                    return -1
                else:
                    if pp != kv:
                        return -1
                    else:
                        del n.vlist[p]
                        return 0

        return del_node(self.__root, key_value)


def test():
    mini = 2
    maxi = 60
    testlist = []
    for i in range(1, 10):
        key = i
        value = i
        testlist.append(KeyValue(key, value))
    mybptree = Bptree(4, 4)
    for kv in testlist:
        mybptree.insert(kv)
    mybptree.delete(testlist[0])
    mybptree.show()
    print('\nkey of this b+tree is \n')
    print([kv.key for kv in mybptree.traversal()])
    # print([kv.key for kv in mybptree.search(mini, maxi)])


if __name__ == '__main__':
    test()

The implementation is similar to that of the B-tree, but there are several significant differences.

1. Internal nodes do not store key-value pairs, only keys.

2. When searching an internal node, a key equal to an index key must go to the right subtree. Therefore bisect_right is used for the binary search.

3. A full leaf node is not split before insertion; the item is inserted first and the node is split afterwards. This is because the B+ tree cannot guarantee that the two halves of a split are equal in size: when a node holding an odd number of items is split, the right half is larger than the left. If you split before inserting, you cannot ensure that the new item lands in the smaller half, so the condition that node sizes stay balanced may be violated.

4. When deleting data, borrowing between left and right sibling nodes is simpler and more efficient in the B+ tree than in the B-tree. The subtree of the sibling is simply cut off and moved over, only the index key in the parent changes, and the sibling pointers of the leaf nodes do not need to be touched.
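Points 2 to 4 can be sketched on plain sorted lists. The helper names `route`, `insert_then_split`, and `borrow_from_right` are hypothetical and only mirror what the simulation does at the leaf level; they are not part of the code above.

```python
from bisect import bisect_left, bisect_right, insort


def route(index_keys, key):
    """Index of the child to descend into; a key equal to an index key goes right."""
    return bisect_right(index_keys, key)


def insert_then_split(leaf, key, capacity):
    """Insert into a sorted leaf first; split only if it has now overflowed."""
    insort(leaf, key)
    if len(leaf) <= capacity:
        return leaf, None, None
    mid = (capacity + 1) // 2
    left, right = leaf[:mid], leaf[mid:]
    return left, right, right[0]      # right[0] is copied up as the index key


def borrow_from_right(index_keys, children, i):
    """Move the first key of leaf i+1 into leaf i and fix the parent's index key."""
    children[i].append(children[i + 1].pop(0))
    index_keys[i] = children[i + 1][0]


if __name__ == '__main__':
    # bisect_right sends key 5 to the right of index key 5, bisect_left would not
    print(route([3, 5, 7], 5), bisect_left([3, 5, 7], 5))
    print(insert_then_split([1, 2, 3, 4], 5, 4))
    keys, kids = [3], [[1, 2], [3, 4, 5]]
    borrow_from_right(keys, kids, 0)
    print(keys, kids)
```

Note that the split leaves the right half one item larger than the left when the overflowing leaf holds an odd number of items, which is the asymmetry point 3 describes. Borrowing touches only the two leaves and one index key in the parent.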


This article is from the "Notes" blog. Please be sure to keep this source: http://thuhak.blog.51cto.com/2891595/1269059
