BoltDB Source Analysis - MVCC/Persistence (3)

Boltdb Persistence

Part of the persistence-related content was already covered in the earlier introduction section.

boltdb uses a single file to store its data on disk, and the first four pages of the file are fixed:

    1. The 1st page is a meta page.
    2. The 2nd page is a meta page.
    3. The 3rd page is the freelist page, which stores an array of free page ids (pgid).
    4. The 4th page is a leaf page.

Page

page is the data structure that boltdb persists to disk. The page size is the operating system's memory page size (the return value of the getpagesize system call), usually 4KB.

Each page starts with a few bytes of header metadata, followed by the raw page data:

```go
type page struct {
    id       pgid    // page number
    flags    uint16  // page type: branchPageFlag / leafPageFlag / metaPageFlag / freelistPageFlag
    count    uint16  // for a freelistPageFlag page, the number of elements in the freelist's pgid array;
                     // for other page types, the number of inodes
    overflow uint32  // when a write is larger than one page, records how many extra pages are used;
                     // e.g. writing 12KB of data with a 4KB page size gives page.overflow = 2, and the
                     // two pages following this one also belong to this page
    ptr      uintptr // marks the start of the page's data area
}
```
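Because the header layout is fixed, the first pages of a database file are easy to inspect. Below is a minimal standalone sketch (not boltdb code) that decodes the headers of the first four pages; the file name, the 4KB page size, and the little-endian byte order are assumptions here:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

const pageSize = 4096 // assumption: bolt actually uses the OS page size

// Page type flags as defined in the boltdb source.
const (
	branchPageFlag   = 0x01
	leafPageFlag     = 0x02
	metaPageFlag     = 0x04
	freelistPageFlag = 0x10
)

func main() {
	buf, err := os.ReadFile("my.db") // hypothetical file name
	if err != nil {
		panic(err)
	}
	// The page header is written with the host's native byte order;
	// little-endian is assumed here.
	for i := 0; i < 4 && (i+1)*pageSize <= len(buf); i++ {
		h := buf[i*pageSize:]
		id := binary.LittleEndian.Uint64(h[0:8])     // page.id
		flags := binary.LittleEndian.Uint16(h[8:10]) // page.flags
		count := binary.LittleEndian.Uint16(h[10:12]) // page.count
		fmt.Printf("page %d: id=%d flags=0x%02x count=%d\n", i, id, flags, count)
	}
}
```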

Each page corresponds to a block of data on a disk. The layout of this data block is:

| page struct data | page element items | k-v pairs |

It consists of 3 parts:

    • The first part, page struct data, is the page header, which holds the fields of the page struct above.
    • The second part, page element items, is the persisted form of the node's inode data (described below).
    • The third part, k-v pairs, stores the actual key-value data referenced by those inodes.
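The page element items correspond to the branchPageElement/leafPageElement types referenced later in this article. In the boltdb source they are defined roughly as follows (the comments are added for this write-up):

```go
type pgid uint64

// branchPageElement: one entry ("page element item") of a branch page.
type branchPageElement struct {
	pos   uint32 // offset from this element to its key data
	ksize uint32 // key length
	pgid  pgid   // page id of the child page this key points to
}

// leafPageElement: one entry of a leaf page.
type leafPageElement struct {
	flags uint32 // e.g. bucketLeafFlag marks the value as a nested bucket
	pos   uint32 // offset from this element to its key/value data
	ksize uint32 // key length
	vsize uint32 // value length
}
```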
Meta

The first two pages of the file each hold a meta struct describing the database:

```go
type meta struct {
    magic    uint32 // magic number 0xED0CDAED
    version  uint32 // version of the storage format, currently 2
    pageSize uint32 // size of each page
    flags    uint32 // currently unused
    root     bucket // root bucket
    freelist pgid   // page id where the current freelist data is stored
    pgid     pgid   // current high-water mark of allocated pages
    txid     txid   // transaction id, incremented by each write transaction
    checksum uint64 // checksum of the fields above, used to detect corruption
}
```

Freelist

As described in the earlier introduction, boltdb does not return disk space to the operating system, so it needs a mechanism to reuse freed pages inside the file. That mechanism is the freelist, which implements a cache of free file pages. Its data structure is:

```go
type freelist struct {
    ids     []pgid          // all free and available page ids
    pending map[txid][]pgid // soon-to-be-free page ids, grouped by the tx that released them
    cache   map[pgid]bool   // fast lookup over both free and pending page ids
}
```

It has three parts: ids records the page ids that are currently free, pending records the page ids that will become free once the releasing transaction is no longer referenced, and cache records both sets in a map so that membership can be checked quickly.

When a caller needs pages, freelist.allocate(n int) pgid is called, where n is the number of pages needed. It scans ids for a run of n contiguous free pages, removes the run from ids and from the cache, and returns the starting page id to the caller. If no run satisfies the request it returns 0; since the file always starts with two fixed meta pages, 0 is never a valid page id.
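A simplified sketch of this contiguous-run scan, assuming the freelist and pgid types shown above (the real boltdb allocate is close to this but includes extra sanity checks):

```go
// allocate returns the starting id of a contiguous run of n free pages,
// or 0 if no such run exists.
func (f *freelist) allocate(n int) pgid {
	if len(f.ids) == 0 {
		return 0
	}
	var initial, previd pgid
	for i, id := range f.ids {
		// Start a new candidate run whenever the ids stop being contiguous.
		if previd == 0 || id-previd != 1 {
			initial = id
		}
		// Found n contiguous pages: remove them from ids and cache, return the first id.
		if (id-initial)+1 == pgid(n) {
			f.ids = append(f.ids[:i-n+1], f.ids[i+1:]...)
			for off := pgid(0); off < pgid(n); off++ {
				delete(f.cache, initial+off)
			}
			return initial
		}
		previd = id
	}
	return 0
}
```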

When a write transaction makes a page obsolete, freelist.free(txid txid, p *page) is called, which puts page p into the pending pool and into the cache. When the next write transaction starts, the pending pages whose owning transactions are no longer referenced by any open tx are moved into ids. This is done to support transaction rollback and concurrent read transactions, and is the basis of the MVCC implementation.
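A hedged sketch of the corresponding free/release logic, again assuming the types above (simplified from the boltdb source; it needs the standard "sort" package, and error handling is trimmed):

```go
// free moves the pages released by transaction txid into the pending pool.
func (f *freelist) free(txid txid, p *page) {
	ids := f.pending[txid]
	// The page itself plus any overflow pages become pending.
	for id := p.id; id <= p.id+pgid(p.overflow); id++ {
		if f.cache[id] {
			panic("page already freed")
		}
		ids = append(ids, id)
		f.cache[id] = true
	}
	f.pending[txid] = ids
}

// release moves all pending pages of transactions with id <= txid into ids,
// making them reusable. It is called once no open read transaction can still
// reference those pages.
func (f *freelist) release(txid txid) {
	var m []pgid
	for tid, ids := range f.pending {
		if tid <= txid {
			m = append(m, ids...)
			delete(f.pending, tid)
		}
	}
	f.ids = append(f.ids, m...)
	sort.Slice(f.ids, func(i, j int) bool { return f.ids[i] < f.ids[j] })
}
```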

When a read transaction starts, the Tx makes its own copy of the current meta and uses that private meta as its entry point for reads. Even if a write transaction later modifies related keys, the newly modified data is only written to new pages, and the pages the read transaction still references merely enter the pending pool, so the data visible to the read transaction is never modified. Those pages are moved from the pending pool back into the free cache, and thus reused, only after the read transactions that reference them have ended.

When a write transaction updates data, it does not overwrite the old data in place: it allocates a new page, writes the updated data there, puts the old page into the pending pool, and builds a new index. When the transaction needs to roll back, it only has to release the pages held in the pending pool and discard the new index to complete the rollback. This speeds up rollback, reduces the memory used by the transaction cache, and avoids interfering with concurrent read transactions.

B-tree Index

cursor is an iterator used to traverse a bucket. The relevant declarations are:

```go
type elemRef struct {
    page  *page
    node  *node
    index int
}

type Cursor struct {
    bucket *Bucket   // parent bucket
    stack  []elemRef // path traversed so far, recorded as pages or nodes;
                     // only one of page/node is set in each elemRef
}

type Bucket struct {
    *bucket
    tx          *Tx                // the associated transaction
    buckets     map[string]*Bucket // subbucket cache
    page        *page              // inline page reference
    rootNode    *node              // materialized node for the root page
    nodes       map[pgid]*node     // node cache
    FillPercent float64
}

type node struct {
    bucket     *Bucket
    isLeaf     bool   // marks whether this node is a leaf; determines what the inodes record
    unbalanced bool   // set to true when a delete happens on this node; when the tx commits,
                      // rebalance runs and the inodes are rearranged
    spilled    bool
    key        []byte // when a page is loaded and cached as a node, the node's lower-bound key
                      // (inodes[0].key) is cached here so the parent node can find it
    pgid       pgid
    parent     *node
    children   nodes
    inodes     inodes
}

type inode struct {
    flags uint32 // when the node is a leaf, records the key's flags
    pgid  pgid   // unused when the node is a leaf; when the node is a branch, the page id of the child it points to
    key   []byte // when the node is a leaf, the key itself; when the node is a branch, the lower-bound key
                 // of the child node. For example, if a branch node has 3 children [1,2,3] [5,6,7] [10,11,12],
                 // it has 3 inodes whose keys are [1,5,10]. Looking up 2: 2 < 5, so descend into child 0;
                 // looking up 6: 5 <= 6 < 10, so descend into child 1.
    value []byte // when the node is a leaf, the value
}

type bucket struct {
    root     pgid   // page id of the bucket's root-level page
    sequence uint64 // monotonically incrementing, used by NextSequence()
}
```

boltdb builds its data index as a B-tree whose root is a Bucket. Each node of the tree is represented either by a node and its inodes (in memory) or by branchPageElements and leafPageElements (on disk), and all key-value data in the B-tree is stored on leaf nodes.

While a Bucket's data has not yet been written to disk, it is recorded in memory as node and inode; once the data is written to disk it is stored as branchPageElement and leafPageElement.

Because of this mixed organization, a search through the B-tree may hit data that lives on disk or in memory, which is why an iterator like cursor is used. The cursor's stack []elemRef records the path traversed by the current search; each element of the path may be a page on disk or a node that has not yet been flushed to disk, and elemRef.index records the position hit by the search within that element.
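From the public API side the mixed page/node traversal is transparent: a Cursor is obtained from a Bucket and iterated. A small usage example (the file and bucket names are made up):

```go
package main

import (
	"fmt"
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	db, err := bolt.Open("my.db", 0600, nil) // hypothetical file name
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("widgets")) // hypothetical bucket name
		if b == nil {
			return nil // bucket does not exist yet
		}
		// Internally, the cursor maintains its stack of elemRefs as it descends.
		c := b.Cursor()
		for k, v := c.First(); k != nil; k, v = c.Next() {
			fmt.Printf("%s = %s\n", k, v)
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```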

This big B-tree has two kinds of nodes: Bucket and node. Both are stored in the B-tree as key-value pairs, but with different flags. A Bucket acts as the root node of a tree or subtree, while a node is an ordinary B-tree node attached to some Bucket. Because a Bucket is a whole subtree, it is not rebalanced together with ordinary nodes. The value recorded in the tree for a Bucket is its bucket struct, i.e. the page id of its root node and its sequence.

Buckets

With that in mind, the nesting relationship of buckets is easy to understand: creating a child bucket simply creates a bucket node under its parent bucket. For a description of buckets, refer to the two articles "Boltdb bucket (1)" and "Boltdb bucket (2)". When reading them, ignore the inline bucket part at first, otherwise it is harder to follow. An inline bucket is just an optimization for disk space: the on-disk information of a bucket is usually very small, and spending a whole page on it would be wasteful, so when the amount of data is small the bucket's contents are stored inline in the remaining space of its parent's page, saving a page.
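A short usage example of nested buckets (the db handle is assumed to come from bolt.Open as in the cursor example, and all names are hypothetical); the child bucket is just a key in the parent whose value carries the bucket struct, or the whole inline page when it is small:

```go
// createNested creates a child bucket under a parent bucket and writes a key
// into it.
func createNested(db *bolt.DB) error {
	return db.Update(func(tx *bolt.Tx) error {
		parent, err := tx.CreateBucketIfNotExists([]byte("accounts"))
		if err != nil {
			return err
		}
		// The child bucket is stored as the key "alice" inside "accounts".
		child, err := parent.CreateBucketIfNotExists([]byte("alice"))
		if err != nil {
			return err
		}
		return child.Put([]byte("balance"), []byte("100"))
	})
}
```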

node is not only a node on the B-tree, it is also where the key-value data ultimately lives: the inodes on a node store the actual key-value pairs. Each node corresponds to exactly one page and is that page's cached object in memory. Each Bucket keeps a cache of nodes; when a page needs to be accessed, the node cache is consulted first, and the page is only loaded from disk when no cached node exists.
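A simplified sketch of that cache lookup, abridged from boltdb's Bucket.node method (the inline-page case and the parent/children bookkeeping are omitted):

```go
// node returns the in-memory node for pgid, materializing it from the page on
// first access and caching it on the Bucket afterwards.
func (b *Bucket) node(pgid pgid, parent *node) *node {
	// Return the node if it has already been materialized and cached.
	if n := b.nodes[pgid]; n != nil {
		return n
	}
	// Otherwise create a node, read the on-disk page into it, and cache it.
	n := &node{bucket: b, parent: parent}
	p := b.tx.page(pgid)
	n.read(p)
	b.nodes[pgid] = n
	return n
}
```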

MVCC

bolt's MVCC implementation relies mainly on copy-on-write (COW) and per-transaction copies of the meta information.

Whenever a tx is created, it copies the current meta. Every operation in the tx works on the tx's cached copy of the B-tree buckets, and these changes are visible only to that tx. When there is a delete operation, the page ids released by the tx temporarily go into the freelist's pending pool. When the tx calls Commit, the freelist marks the pending page ids as truly free; if the tx calls Rollback instead, the page ids are removed from the pending pool, so the tx's delete operations are rolled back.
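This rollback behavior can be seen through the public API. A hedged illustration (imports errors, fmt and github.com/boltdb/bolt; the bucket and key names are hypothetical, and the bucket is assumed to already contain the key):

```go
// demoRollback assumes the "widgets" bucket already contains the key "k".
// Returning an error from the Update closure makes bolt roll the transaction
// back, so the delete never becomes visible and its pages leave the pending pool.
func demoRollback(db *bolt.DB) {
	db.Update(func(tx *bolt.Tx) error {
		if err := tx.Bucket([]byte("widgets")).Delete([]byte("k")); err != nil {
			return err
		}
		return errors.New("abort") // force a rollback
	})

	db.View(func(tx *bolt.Tx) error {
		v := tx.Bucket([]byte("widgets")).Get([]byte("k"))
		fmt.Println("still present after rollback:", v != nil) // prints true
		return nil
	})
}
```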

When a Tx calls Commit:

rebalance is triggered on the Bucket first. Subject to threshold conditions, rebalance tries to raise the fill ratio of every modified node as far as possible (that is, of the nodes cached under the Bucket; only pages touched by put/del operations get loaded into the Bucket's node cache): empty branches are cut out and adjacent branches with low fill ratios are merged, so that after the data is compacted more pages can be released.

Bucket.spill is then triggered. Working on the nodes left by rebalance, spill splits nodes according to Bucket.FillPercent so that the data held by each node stays below pageSize * Bucket.FillPercent. It then obtains new pages (to get a new page, the freelist is searched first for a reusable page id; a new page is allocated only if none is found, and if the newly allocated page pushes the file beyond the size of the read-only mmap, the file is mmap'd again) and writes the nodes into those pages.
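A rough sketch of the split threshold involved, using the inode struct shown earlier (this is a simplification of boltdb's node splitting; the real code also enforces a minimum number of keys per page and handles branch elements):

```go
// Assumed sizes on a 64-bit platform; bolt derives these with unsafe.Sizeof/Offsetof.
const (
	pageHeaderSize      = 16
	leafPageElementSize = 16
)

// splitIndex returns the index at which a leaf node's inodes would be split so
// that the first half stays below pageSize * fillPercent.
func splitIndex(inodes []inode, pageSize int, fillPercent float64) int {
	threshold := int(float64(pageSize) * fillPercent)
	sz := pageHeaderSize
	for i, in := range inodes {
		elsize := leafPageElementSize + len(in.key) + len(in.value)
		// Split before this element once the accumulated size would cross the threshold.
		if i > 0 && sz+elsize > threshold {
			return i
		}
		sz += elsize
	}
	return len(inodes) // everything fits: no split needed
}
```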

Then the freelist's persisted record is updated, and the pages written above are flushed to disk. Finally the meta record on disk is updated and the Tx's resources are freed.

Therefore, data written by a transaction is not visible to read transactions until the meta has been updated.

File Map growth strategy

When the boltdb file is smaller than 1GB, the size of the mmap doubles every time it is remapped; once the file is larger than 1GB, it grows by 1GB at a time, aligned to the page size.
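A sketch of that growth calculation, close to boltdb's db.mmapSize but simplified (the check against the platform's maximum map size is omitted):

```go
const maxMmapStep = 1 << 30 // 1GB

// mmapSize returns the size to mmap for a file of the given size: double up to
// 1GB, then grow in 1GB steps, aligned to the page size.
func mmapSize(size, pageSize int) int {
	// Double the size from 32KB until 1GB.
	for i := uint(15); i <= 30; i++ {
		if size <= 1<<i {
			return 1 << i
		}
	}
	// Above 1GB: round up to the next whole 1GB step...
	sz := int64(size)
	if remainder := sz % int64(maxMmapStep); remainder > 0 {
		sz += int64(maxMmapStep) - remainder
	}
	// ...and make sure the result is a multiple of the page size.
	ps := int64(pageSize)
	if sz%ps != 0 {
		sz = ((sz / ps) + 1) * ps
	}
	return int(sz)
}
```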
