Detailed analysis of B-Tree implementation in SQLite

Source: Internet
Author: User
Tags sqlite format 3

SQLite stores its on-disk database as B-trees. For background on B-trees, see:
**
** Donald E. Knuth, The Art of Computer Programming, Volume 3:
** "Sorting And Searching", pages 473-480. Addison-Wesley
** Publishing Company, Reading, Massachusetts.
**
The basic idea is that each page of the file contains N database entries and N+1 pointers to subpages. The file is split into fixed-size pages, the unit in which data moves between disk and memory, and each page in external storage is one node of the B-tree.
----------------------------------------------------------------
| Ptr (0) | Key (0) | Ptr (1) | Key (1) |... | Key (N-1) | Ptr (N) |
----------------------------------------------------------------
All keys on the page that Ptr(0) points to are smaller than Key(0). All keys on the page and subpages that Ptr(1) points to are greater than Key(0) and smaller than Key(1), and so on. All keys on the page and subpages that Ptr(N) points to are greater than Key(N-1).
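The ordering rule above determines which pointer to follow during a lookup. The following is a minimal sketch in Python, not SQLite's own code; the function name `search_page` and the use of a plain sorted list for the keys are assumptions made for illustration.

```python
from bisect import bisect_left

def search_page(keys, target):
    """Search one page whose sorted keys are Key(0)..Key(N-1).

    Returns ("found", i) if target equals Key(i). Otherwise returns
    ("descend", i), meaning the search continues in the page behind Ptr(i).
    """
    i = bisect_left(keys, target)          # first index with keys[i] >= target
    if i < len(keys) and keys[i] == target:
        return ("found", i)
    return ("descend", i)                  # i ranges from 0 to N inclusive
```

With keys 10, 20 and 30 on a page, a search for 15 follows Ptr(1), and a search for 35 follows Ptr(3), the rightmost pointer.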

Finding a particular key requires reading O(log(M)) pages from disk, where M is the number of entries in the tree. If a page that is needed is not in memory, it has to be read from disk, possibly replacing a page that is already cached.
Because each node is exactly one page, a B-tree is well suited to block storage devices: every step of the search tells us which page to read next.

In the SQLite implementation, a single file can hold one or more separate B-trees. Each B-tree is identified by the index of its root page. The key and data of an entry are combined to form the "payload". A fixed amount of payload can be carried directly on the database page. If the payload is larger than this preset amount, the surplus bytes are stored on overflow pages. The payload of an entry and the pointer that precedes it are combined to form a "cell". Each page has a small header that contains the Ptr(N) pointer and other information, such as the size of the key and data.

Format details
A file is divided into pages. The first page is called page 1, the second page is called page 2, and so on. Page number 0 is used to mean "no such page". The page size is a power of two between 512 and 65536. Each page is either a B-tree page, a freelist page, or an overflow page.
The first page must be a B-tree page. The first 100 bytes of the first page contain a special header (the file header) that describes the file.
The format of the file header is as follows:
**   OFFSET   SIZE    DESCRIPTION
**      0      16     Header string: "SQLite format 3\000"
**     16       2     Page size in bytes
**     18       1     File format write version
**     19       1     File format read version
**     20       1     Bytes of unused space at the end of each page
**     21       1     Max embedded payload fraction
**     22       1     Min embedded payload fraction
**     23       1     Min leaf payload fraction
**     24       4     File change counter
**     28       4     Reserved for future use
**     32       4     First freelist page
**     36       4     Number of freelist pages in the file
**     40      60     15 4-byte meta values passed to higher layers
**
All integers are big-endian.
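As an illustration, the header layout in the table can be decoded with Python's `struct` module. This is a sketch that follows the table in this article only; newer SQLite releases assign meaning to some bytes shown here as reserved. The function name `parse_file_header` and the dictionary keys are assumptions, not SQLite identifiers.

```python
import struct

# Big-endian layout of bytes 0..39, field by field as in the table above.
HEADER_FORMAT = ">16sHBBBBBBIIII"

def parse_file_header(page1):
    """Decode the 100-byte file header at the start of page 1."""
    if len(page1) < 100:
        raise ValueError("need at least 100 bytes")
    (magic, page_size, write_version, read_version, unused_bytes,
     max_embedded, min_embedded, min_leaf, change_counter, _reserved,
     first_freelist, freelist_count) = struct.unpack_from(HEADER_FORMAT, page1, 0)
    if magic != b"SQLite format 3\x00":
        raise ValueError("not an SQLite format 3 file")
    meta = struct.unpack_from(">15I", page1, 40)   # 15 four-byte meta values
    return {
        "page_size": page_size,
        "write_version": write_version,
        "read_version": read_version,
        "unused_bytes_per_page": unused_bytes,
        "max_embedded_fraction": max_embedded,
        "min_embedded_fraction": min_embedded,
        "min_leaf_fraction": min_leaf,
        "change_counter": change_counter,
        "first_freelist_page": first_freelist,
        "freelist_pages": freelist_count,
        "meta": list(meta),
    }
```

A typical use would be to read the first 100 bytes of a database file opened in binary mode and pass them to `parse_file_header`.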

Each time the file is modified, the file change counter is incremented. This counter lets other processes know when the file has changed and whether their caches need to be cleared.

The max embedded payload fraction is the share of the total usable space on a page that a single cell may consume in a standard B-tree (non-LEAFDATA) table. A value of 255 means 100%. The default limits the maximum cell size so that at least four cells fit on one page. The default max embedded payload fraction is therefore 64.
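The fraction is expressed in units of 1/255, so a value of 64 is just over one quarter of the page. The arithmetic can be sketched as follows; the function name `embedded_limit` is hypothetical, and the real implementation also accounts for page and cell header overhead, which this simplified calculation ignores.

```python
def embedded_limit(usable_bytes, fraction):
    """Bytes of a page that one cell may use, with fraction in units of 1/255."""
    return usable_bytes * fraction // 255

quarter = embedded_limit(1024, 64)    # about one quarter of a 1024-byte page
whole = embedded_limit(1024, 255)     # 255 means 100% of the usable space
```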

If the payload of a cell is larger than the maximum embedded payload, the extra payload spills onto overflow pages. Once an overflow page has been allocated, as many bytes as possible are moved to the overflow pages without letting the cell size drop below the min embedded payload fraction.

The min leaf payload fraction is like the min embedded payload fraction, except that it applies to leaf nodes in a LEAFDATA tree. The maximum payload fraction for a LEAFDATA tree is always 100% (a value of 255) and is not specified in the header.

Each B-tree page is divided into three parts: the page header, the cell pointer array, and the cell content area. Page 1 also has a 100-byte file header that comes before the page header.
**
**      |----------------|
**      | File header    |   100 bytes. Page 1 only.
**      |----------------|
**      | Page header    |   8 bytes for leaves. 12 bytes for interior nodes
**      |----------------|
**      | Cell pointer   |   |  2 bytes per cell. Sorted order.
**      | Array          |   |  Grows downward
**      |                |   v
**      |----------------|
**      | Unallocated    |
**      | Space          |
**      |----------------|   ^  Grows upwards
**      | Cell content   |   |  Arbitrary order interspersed with freeblocks
**      | Area           |   |  and free space fragments.
**      |----------------|
**
The page header looks like this:
**
** OFFSET SIZE DESCRIPTION
** 0 1 Flags. 1: intkey, 2: zerodata, 4: leafdata, 8: leaf
** 1 2 byte offset to the first freeblock
** 3 2 number of cells on this page
** 5 2 first byte of the cell content area
** 7 1 number of fragmented free bytes
** 8 4 Right child (the Ptr (N) value). Omitted on leaves.
**
The flags define the format of the B-tree page. The leaf flag means that the page has no children. The zerodata flag means that the page carries only keys and no data. The intkey flag means that the key is an integer stored in the key size field of the cell header rather than in the payload area.
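The page header table and the flag bits can be decoded in a few lines. This is a sketch under the layout given in this article; the constant names and `parse_page_header` are chosen for illustration, and `header_offset` is assumed to be 100 for page 1 and 0 for every other page.

```python
import struct

# Flag bits from the page header table.
FLAG_INTKEY, FLAG_ZERODATA, FLAG_LEAFDATA, FLAG_LEAF = 1, 2, 4, 8

def parse_page_header(page, header_offset=0):
    """Decode the 8-byte (leaf) or 12-byte (interior) page header."""
    flags, first_freeblock, cell_count, content_start, fragment_bytes = \
        struct.unpack_from(">BHHHB", page, header_offset)
    is_leaf = bool(flags & FLAG_LEAF)
    right_child = None
    if not is_leaf:
        # Interior pages carry the Ptr(N) value in four extra bytes.
        (right_child,) = struct.unpack_from(">I", page, header_offset + 8)
    return {
        "intkey": bool(flags & FLAG_INTKEY),
        "zerodata": bool(flags & FLAG_ZERODATA),
        "leafdata": bool(flags & FLAG_LEAFDATA),
        "leaf": is_leaf,
        "first_freeblock": first_freeblock,
        "cell_count": cell_count,
        "content_start": content_start,
        "fragment_bytes": fragment_bytes,
        "right_child": right_child,
    }
```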

The cell pointer array begins on the first byte after the page header. It contains zero or more 2-byte numbers, each of which is the offset from the beginning of the page to a cell in the cell content area. The cell pointers are kept in sorted order. The system tries to keep the free space directly after the last cell pointer, so that new cells can be added quickly without having to defragment the page.
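Reading the cell pointer array is then a matter of skipping the page header and unpacking 2-byte big-endian integers. A minimal sketch follows; the function name and its parameters are assumptions, and the cell count and leaf flag are expected to come from the page header.

```python
import struct

def read_cell_pointers(page, cell_count, is_leaf, header_offset=0):
    """Return the offsets, relative to the start of the page, of each cell."""
    header_size = 8 if is_leaf else 12          # interior pages add Ptr(N)
    start = header_offset + header_size
    return list(struct.unpack_from(">%dH" % cell_count, page, start))
```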

Cell content is stored at the very end of the page and grows toward the beginning of the page.

Unused space within the cell content area is collected into a linked list of freeblocks. Each freeblock is at least four bytes in size. The byte offset of the first freeblock is given in the page header. Freeblocks occur in order of increasing offset. Because a freeblock must be at least four bytes in size, any group of three or fewer unused bytes in the cell content area cannot be on the freeblock list. Such a group of three or fewer free bytes is called a fragment. The total number of bytes in all fragments is recorded at offset 7 of the page header. Each freeblock starts with the following fields:

** SIZE DESCRIPTION
** 2 Byte offset of the next freeblock
** 2 Bytes in this freeblock
**
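The freeblock list can be walked by following the 2-byte "next" offsets. The sketch below assumes that a next offset of 0 marks the end of the list, which this article does not state explicitly; the function name `walk_freeblocks` is hypothetical.

```python
import struct

def walk_freeblocks(page, first_freeblock):
    """Return (offset, size) for every freeblock on the page, in list order."""
    blocks = []
    offset = first_freeblock
    while offset != 0:                       # assumption: 0 ends the list
        next_offset, size = struct.unpack_from(">HH", page, offset)
        if size < 4:
            raise ValueError("freeblock smaller than 4 bytes")
        if next_offset != 0 and next_offset <= offset:
            raise ValueError("freeblock offsets must increase")
        blocks.append((offset, size))
        offset = next_offset
    return blocks
```

The two checks mirror the rules in the text: a freeblock is never smaller than four bytes, and offsets along the list only increase, which also guards the loop against cycles in a corrupt page.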

Cells are of variable length. They are stored in the cell content area at the end of the page. Pointers to the cells are held in the cell pointer array that immediately follows the page header. Cells need not be contiguous or in order, but the cell pointers are contiguous and in order.

Cell content makes heavy use of variable-length integers. A variable-length integer is 1 to 9 bytes long, and the low 7 bits of each byte are used. The integer consists of all bytes that have bit 8 set, followed by the first byte with bit 8 clear. The most significant byte of the integer appears first. A variable-length integer is never more than 9 bytes long. As a special case, all 8 bits of the ninth byte are used as data. This allows a 64-bit integer to be encoded in 9 bytes.
** 0x00 becomes 0x00000000
** 0x7f becomes 0x0000007f
** 0x81 0x00 becomes 0x00000080
** 0x82 0x00 becomes 0x00000100
** 0x80 0x7f becomes 0x0000007f
** 0x81 0x91 0xd1 0xac 0x78 becomes 0x12345678
** 0x81 0x81 0x81 0x81 0x01 becomes 0x10204081
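The decoding rule can be written out as a short function. This is a sketch of the encoding as described above, not SQLite's own routine; the name `decode_varint` and its return convention (value, bytes consumed) are assumptions.

```python
def decode_varint(buf, pos=0):
    """Decode one variable-length integer starting at buf[pos].

    Returns (value, number_of_bytes_consumed).
    """
    value = 0
    for i in range(9):
        byte = buf[pos + i]
        if i == 8:
            # Ninth byte: all 8 bits are data.
            return (value << 8) | byte, 9
        value = (value << 7) | (byte & 0x7F)   # low 7 bits carry data
        if byte < 0x80:                         # high bit clear: last byte
            return value, i + 1
```

For instance, `decode_varint(bytes([0x82, 0x00]))` yields the value 0x100 using two bytes, matching the table above.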
This article is from the Linux community website (www.linuxidc.com). Original link: http://www.linuxidc.com/Linux/2012-11/75009.htm
