Boltdb Persistence
Part of the persistence-related content has already been covered in the previous introduction.

boltdb uses a single file to store its on-disk data, and the first 4 pages of the file are fixed:

- The 1st page is a meta page
- The 2nd page is a meta page
- The 3rd page is the freelist page, which stores an array of page IDs
- The 4th page is a leaf page
Page

page is the on-disk data structure that boltdb persists. The page size is the operating system's memory page size, i.e. the return value of the getpagesize system call, usually 4KB. Each page begins with a header of a few bytes, followed by the raw data:
```go
type page struct {
	id       pgid    // page number
	flags    uint16  // page type: branchPageFlag / leafPageFlag / metaPageFlag / freelistPageFlag
	count    uint16  // for a freelistPageFlag page, the number of elements in the freelist's pgid array;
	                 // for other page types, the number of inodes
	overflow uint32  // when a write is larger than one page, the number of extra pages used; e.g. writing
	                 // 12KB of data with a 4KB page size gives overflow=2, and the 2 page-sized areas
	                 // after this page header also belong to this page
	ptr      uintptr // marks the start of the data area
}
```
Each page corresponds to a block of data on disk. The layout of that block is:

| page struct data | page element items | k-v pairs |

It consists of 3 parts:

- The first part, page struct data, is the page header, which stores the page struct itself.
- The second part, page element items, is the persisted form of the node data (described below), i.e. the inodes.
- The third part, k-v pairs, stores the concrete key-value data the inodes refer to.
```go
type meta struct {
	magic    uint32 // magic number 0xED0CDAED
	version  uint32 // version of the storage format, currently 2
	pageSize uint32 // size of each page
	flags    uint32 // currently unused
	root     bucket // root bucket
	freelist pgid   // page where the current freelist data lives
	pgid     pgid   // high-water mark of allocated pages
	txid     txid   // transaction ID
	checksum uint64 // checksum of the fields above, used to detect corruption
}
```
Freelist

As described in the previous introduction, boltdb never gives disk space back to the filesystem, so a mechanism is needed to reuse disk space: the freelist, which implements a cache of free file pages. Its data structure is:
```go
type freelist struct {
	ids     []pgid          // all free and available page IDs
	pending map[txid][]pgid // soon-to-be-free page IDs, keyed by the tx that released them
	cache   map[pgid]bool   // fast lookup of all free and pending page IDs
}
```
It has three parts: ids records the currently free page IDs; pending records, per transaction, the page IDs that are about to become free; and cache records both kinds in a map so membership can be checked quickly.
When the user needs pages, freelist.allocate(n int) pgid is called, where n is the number of pages needed. It traverses ids, picks out a run of n contiguous free pages, removes them from ids and from the cache, and returns the starting page ID to the caller. When no run satisfies the demand it returns 0; since the file always starts with 2 fixed meta pages, 0 can never be a valid page ID.
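The contiguous-run scan can be sketched as follows. This is a simplified model of freelist.allocate, assuming ids is kept sorted; the real implementation also unlinks the returned IDs from the freelist cache:

```go
package main

import "fmt"

type pgid uint64

// allocate scans ids for a run of n contiguous page IDs, removes the run
// from ids, and returns its first ID; it returns 0 when no run exists.
func allocate(ids *[]pgid, n int) pgid {
	if len(*ids) == 0 {
		return 0
	}
	var initial, previd pgid
	for i, id := range *ids {
		if previd == 0 || id-previd != 1 {
			initial = id // run broken: restart it at the current ID
		}
		previd = id
		// found n contiguous IDs ending at index i
		if (id-initial)+1 == pgid(n) {
			*ids = append((*ids)[:i-n+1], (*ids)[i+1:]...)
			return initial
		}
	}
	return 0
}

func main() {
	ids := []pgid{3, 4, 5, 8, 9, 10, 11}
	fmt.Println(allocate(&ids, 3)) // a run of 3 starts at page 3
	fmt.Println(ids)               // [8 9 10 11]
}
```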
When a write transaction produces a no-longer-needed page, freelist.free(txid txid, p *page) is called, which puts the page into the pending pool and into the cache. When the next write transaction is opened, the pending pages that are no longer referenced by any tx are moved into ids. This deferral exists to support transaction rollback and concurrent read transactions, and is the basis of MVCC.
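The pending-to-free handoff can be sketched like this. The signatures are simplified for illustration (the real freelist also maintains its cache map and takes a *page rather than a bare ID):

```go
package main

import (
	"fmt"
	"sort"
)

type pgid uint64
type txid uint64

type freelist struct {
	ids     []pgid          // free, reusable page IDs
	pending map[txid][]pgid // page IDs released by a tx, not yet reusable
}

// free records p as released by transaction tid; the page is only
// pending, because older read transactions may still reference it.
func (f *freelist) free(tid txid, p pgid) {
	f.pending[tid] = append(f.pending[tid], p)
}

// release moves pending pages of all transactions with id <= mintxid
// (the oldest transaction still open) into the reusable ids list.
func (f *freelist) release(mintxid txid) {
	for tid, ids := range f.pending {
		if tid <= mintxid {
			f.ids = append(f.ids, ids...)
			delete(f.pending, tid)
		}
	}
	sort.Slice(f.ids, func(i, j int) bool { return f.ids[i] < f.ids[j] })
}

func main() {
	f := &freelist{pending: map[txid][]pgid{}}
	f.free(5, 12)
	f.free(7, 9)
	f.release(6) // only tx 5's pages become reusable
	fmt.Println(f.ids, f.pending)
}
```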
When a read transaction starts, the Tx takes its own copy of the meta information. Starting from this private meta as the entry point, it can read out all the data the meta points to. Even if a write transaction modifies related keys, the newly modified data is only written to new pages, and the pages the read transaction holds merely enter the pending pool, so the data visible to the read transaction is never modified. Only when the related read transactions end are those pages moved from the pending pool into the free pool for reuse.
When a write transaction updates data, it does not overwrite the old data in place: it allocates a new page, writes the updated data there, puts the old page into the pending pool, and builds a new index. When the transaction needs to be rolled back, only the pending pages have to be released and the index rolled back to complete the data rollback. This speeds up transaction rollback and reduces the transaction's cache memory usage, while avoiding interference with transactions that are reading.
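As a toy model of this copy-on-write update (all names here are illustrative, not boltdb's API): the committed index is never touched, the tx works on its own copy, the new value lands on a fresh page, and the old page is only remembered as pending:

```go
package main

import "fmt"

type pgid uint64

// cowUpdate copies the committed index for the tx, repoints key at a
// fresh page, and records the old page as pending instead of freeing it,
// so concurrent readers of the committed view are undisturbed.
func cowUpdate(committed map[string]pgid, key string, freshPage pgid) (txIndex map[string]pgid, pending []pgid) {
	txIndex = make(map[string]pgid, len(committed))
	for k, v := range committed {
		txIndex[k] = v
	}
	if old, ok := txIndex[key]; ok {
		pending = append(pending, old) // old page stays intact for readers
	}
	txIndex[key] = freshPage // new data goes to the new page
	return txIndex, pending
}

func main() {
	committed := map[string]pgid{"k": 2} // what concurrent readers see
	txIndex, pending := cowUpdate(committed, "k", 5)
	fmt.Println(committed["k"], txIndex["k"], pending) // 2 5 [2]
	// Rollback is just discarding txIndex and pending; commit publishes
	// txIndex as the new committed index and later frees the pending pages.
}
```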
B-tree Index
cursor is an iterator used to traverse a bucket. Its declaration:
```go
type elemRef struct {
	page  *page // exactly one of page/node is set at a time
	node  *node
	index int
}

type Cursor struct {
	bucket *Bucket   // parent bucket
	stack  []elemRef // path traversed, recorded as pages or nodes
}

type Bucket struct {
	*bucket
	tx          *Tx                // the associated transaction
	buckets     map[string]*Bucket // subbucket cache
	page        *page              // inline page reference
	rootNode    *node              // materialized node for the root page
	nodes       map[pgid]*node     // node cache
	FillPercent float64
}

type node struct {
	bucket     *Bucket
	isLeaf     bool // whether this node is a leaf; determines what the inodes record
	unbalanced bool // set when a delete happened on this node; when the tx commits,
	                // rebalance runs and the inodes are rearranged
	spilled    bool
	key        []byte // when a page is loaded as a node cache, the node's lower-boundary
	                  // key (inodes[0].key) is cached here so the parent can find it
	pgid       pgid
	parent     *node
	children   nodes
	inodes     inodes
}

type inode struct {
	flags uint32 // when the node is a leaf, the key's flag
	pgid  pgid   // unused when the node is a leaf; when it is a branch, the child's page ID
	key   []byte // leaf: the key itself; branch: the lower-boundary key of the child node
	value []byte // leaf: the value; unused on branch nodes
}

type bucket struct {
	root     pgid   // page ID of the bucket's root page
	sequence uint64 // monotonically incrementing, used by NextSequence()
}
```

For example, when the current node is a branch node with 3 children holding [1,2,3] [5,6,7] [10,11,12], the node has 3 inodes and the recorded keys are [1,5,10]. When looking up 2, since 2 < 5, the search continues in the 0th child; when looking up 6, since 5 <= 6 < 10, it continues in the 1st child.
The data index of boltdb is built as a B-tree. The root of the B-tree is a Bucket; the other nodes are either node objects (with their inodes) in memory, or branchPageElements and leafPageElements on disk. All key-value data of the B-tree is recorded on the leaf nodes.
When a Bucket's data has not yet been written to disk, it is recorded in memory with node and inode; once the data is written to disk, it is recorded with branchPageElements and leafPageElements.
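As a sketch of the on-disk side: a leaf element records its key and value indirectly, via an offset and sizes into the packed data area that follows the element headers. The layout below is simplified (real boltdb reads these fields through unsafe pointer casts over the mmapped page):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// A leaf page stores fixed-size element headers first, then all keys and
// values packed behind them; pos is the offset from the element header
// to its key. This sketch packs one k-v pair that way and reads it back.
type leafPageElement struct {
	flags uint32
	pos   uint32 // offset from this header to the key bytes
	ksize uint32
	vsize uint32
}

const elemSize = 16 // four uint32 fields

func writeElement(buf []byte, e leafPageElement, key, value []byte) {
	binary.LittleEndian.PutUint32(buf[0:], e.flags)
	binary.LittleEndian.PutUint32(buf[4:], e.pos)
	binary.LittleEndian.PutUint32(buf[8:], e.ksize)
	binary.LittleEndian.PutUint32(buf[12:], e.vsize)
	copy(buf[e.pos:], key)            // key sits pos bytes after the header
	copy(buf[e.pos+e.ksize:], value)  // value follows the key directly
}

func readElement(buf []byte) (key, value []byte) {
	pos := binary.LittleEndian.Uint32(buf[4:])
	ksize := binary.LittleEndian.Uint32(buf[8:])
	vsize := binary.LittleEndian.Uint32(buf[12:])
	return buf[pos : pos+ksize], buf[pos+ksize : pos+ksize+vsize]
}

func main() {
	buf := make([]byte, 64)
	writeElement(buf, leafPageElement{pos: elemSize, ksize: 3, vsize: 5},
		[]byte("foo"), []byte("hello"))
	k, v := readElement(buf)
	fmt.Printf("%s=%s\n", k, v) // foo=hello
}
```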
Because of this mixed organization, a search over the B-tree may encounter data that lives on disk or in memory, hence the need for an iterator like cursor. The cursor's stack []elemRef records the path traversed by each iteration; each element of the path may be a page on disk or a node that has not yet been flushed to disk. elemRef's index records the position hit during the search.
This big B-tree has 2 kinds of nodes: one is the Bucket, the other is the node. Both store the B-tree's k-v pairs, just with different flags. A Bucket is the root node of a tree or subtree, while a node is an ordinary node of the B-tree and always belongs to some Bucket. Because a Bucket is a subtree, it does not take part in rebalance together with ordinary nodes. The value recorded on the tree for a bucket is its bucket header: the page ID of the bucket's root node and its sequence.
Buckets

The nesting relationship of buckets is therefore easy to understand: a child bucket is simply a new B-tree hung off a node of the parent bucket's tree. For a description of bucket you can refer to the two articles Boltdb bucket (i) and Boltdb bucket (ii); when reading them, ignore the inline bucket content at first, otherwise they are hard to follow. An inline bucket is just an optimization for disk space: the on-disk information of a bucket is usually very small, and dedicating a whole page to it wastes most of that page, so when the amount of data is small the bucket is stored inline in the remainder of its parent's page, saving a page.
A node is a node on the B-tree, but it is not where the k-v pairs are ultimately stored: the inodes on the node hold the final k-v data. Each node corresponds to exactly one page and is that page's cached object in memory. A Bucket keeps a cache of nodes; when some page needs to be accessed, the node cache is consulted first, and the page is loaded only when no node for it exists.
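A minimal sketch of that cache lookup (the real Bucket.node also links the node to its parent and reads the page contents on a miss):

```go
package main

import "fmt"

type pgid uint64

type node struct{ pgid pgid }

type Bucket struct {
	nodes map[pgid]*node // node cache: pgid -> materialized node
}

// node returns the cached in-memory node for id, materializing it from
// the page only on a cache miss (page reading is stubbed out here).
func (b *Bucket) node(id pgid) *node {
	if n, ok := b.nodes[id]; ok {
		return n // cache hit: the page is already materialized
	}
	n := &node{pgid: id} // cache miss: load from the page
	b.nodes[id] = n
	return n
}

func main() {
	b := &Bucket{nodes: map[pgid]*node{}}
	n1 := b.node(7)
	n2 := b.node(7)
	fmt.Println(n1 == n2) // true: the second access hits the cache
}
```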
MVCC
bolt's MVCC implementation relies mainly on copy-on-write and meta replicas. Whenever a tx is created, it copies the current meta. Every modification in the tx is cached in the tx's own copy of the b-tree buckets, so the change is visible only to that tx. When there is a delete operation, the tx puts the released page IDs into the freelist's pending pool for the time being. When the tx calls Commit, the freelist truly marks the pending page IDs as free; if the tx calls Rollback, the page IDs are removed from the pending pool, which rolls back the tx's delete operations.
When a Tx calls Commit, rebalance is first triggered on the Bucket. Depending on threshold conditions, rebalance tries to raise the fill ratio of each modified node (that is, of the nodes cached under the Bucket; only pages that have had a put/del are loaded into the Bucket's node cache). After pruning empty branches and merging adjacent, poorly filled branches, this data compaction releases more pages.
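The rebalance trigger can be sketched as a threshold check. The constants follow boltdb's convention of a quarter of the page size and a minimum key count of 1 for leaves and 2 for branches, but the function name is illustrative:

```go
package main

import "fmt"

const pageSize = 4096

// needsMerge reports whether a node is under-filled and should be merged
// with a sibling during rebalance.
func needsMerge(size, keys int, leaf bool) bool {
	minKeys := 2 // branch nodes need at least 2 keys
	if leaf {
		minKeys = 1
	}
	threshold := pageSize / 4
	return size < threshold || keys < minKeys
}

func main() {
	fmt.Println(needsMerge(512, 3, true))  // under a quarter page: merge
	fmt.Println(needsMerge(2048, 3, true)) // healthy node: leave alone
}
```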
Then the Bucket's spill is triggered. Working on the nodes gathered by rebalance, spill splits nodes according to Bucket.FillPercent so that each node holds no more than pageSize * Bucket.FillPercent of data. It then obtains new pages (to get a new page, the freelist is first searched for a reusable page ID; if none is found, new pages are allocated at the end of the file, and if the newly allocated pages push the file past the size of the read-only mmap, the file is mmapped again) and writes the nodes to the pages.
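The FillPercent-driven split can be sketched by packing element sizes into chunks, each chunk becoming one node (and so one page). This is a simplified model of the splitting step, not boltdb's actual function:

```go
package main

import "fmt"

const pageSize = 4096

// splitSizes packs element sizes into chunks so each chunk stays under
// pageSize*fillPercent; an element larger than the threshold still gets
// its own chunk.
func splitSizes(sizes []int, fillPercent float64) [][]int {
	threshold := int(float64(pageSize) * fillPercent)
	var out [][]int
	var cur []int
	used := 0
	for _, s := range sizes {
		if len(cur) > 0 && used+s > threshold {
			out = append(out, cur) // chunk full: start a new node
			cur, used = nil, 0
		}
		cur = append(cur, s)
		used += s
	}
	if len(cur) > 0 {
		out = append(out, cur)
	}
	return out
}

func main() {
	// with FillPercent = 0.5 (threshold 2048 bytes), six 700-byte
	// elements split into three nodes of two elements each
	fmt.Println(splitSizes([]int{700, 700, 700, 700, 700, 700}, 0.5))
}
```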
Then the persisted freelist record is updated, and the pages above are flushed to disk. Finally the updated meta record is written to disk and the Tx's resources are freed.

Therefore the written data is not visible to read transactions until the meta update completes.
File Map growth strategy
When the boltdb file is smaller than 1GB, the size of the map doubles every time it is remapped; when the file is larger than 1GB, it grows by 1GB at a time and is aligned to the page size.
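This strategy can be sketched as follows, mirroring the shape of bolt's mmapSize (constants assumed: 4KB pages, 32KB minimum map, 1GB step):

```go
package main

import "fmt"

const pageSize = 4096
const maxMapStep = 1 << 30 // 1GB

// mmapSize returns the new map size for a file of the given size:
// double from 32KB up to 1GB, then grow in 1GB steps, page-aligned.
func mmapSize(size int) int {
	// below 1GB: double from 32KB until the size fits
	for i := uint(15); i <= 30; i++ {
		if size <= 1<<i {
			return 1 << i
		}
	}
	// above 1GB: round up to the next 1GB boundary...
	sz := size
	if remainder := sz % maxMapStep; remainder > 0 {
		sz += maxMapStep - remainder
	}
	// ...and keep the result a multiple of the page size
	if sz%pageSize != 0 {
		sz = (sz/pageSize + 1) * pageSize
	}
	return sz
}

func main() {
	fmt.Println(mmapSize(100 << 20))    // 100MB file -> 128MB map
	fmt.Println(mmapSize((1 << 30) + 1)) // just over 1GB -> 2GB map
}
```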