Detail analysis of B-tree implementation in SQLite

Detail analysis of B-tree implementation in SQLite _ Database Digest

Last Update:2017-01-18 Source: Internet

Author: User

Tags reserved sqlite sqlite format 3

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

SQLite is organized by B-tree in the database that is stored externally. For details on B-tree, refer to
**
* * Donald E. Knuth, the ART of COMPUTER programming, Volume 3:
* * "Sorting and Searching", pages 473-480. Addison-wesley
* * Publishing Company, Reading, Massachusetts.
**
The basic idea is that every page contained in a file includes n database portals and n+1 pointers to child pages. The file is divided into many page stores. Why do this, because memory paging management mechanism. Each page in the external memory is a node of the B-tree.
----------------------------------------------------------------
| PTR (0) | Key (0) | PTR (1) | Key (1) | ... | Key (N-1) | PTR (N) |
----------------------------------------------------------------
The value of all keys on the page to which Ptr (0) points is less than key (0). All key values for all of the page and sub pages to which PTR (1) point are greater than key (0) and less than key (1). All the values for the key of the page and sub pages pointed to by PTR (N) are greater than key (N-1), and so on.

In order to know a specific key, you need to read from the disk with an O (M), where M is the order of the tree. Unable to find in memory, there will be missing pages.
The main solution is to solve the problem that cannot be found in memory. On the one hand to change out some. On the one hand in some. In exchange for the time to find their hard disk on which page ah.
(The advantage of B-tree is that it is suitable for storage devices that are stored in blocks.) So, you can know which page they are on.

In SQLite implementations, a file can contain 1 or more independent btree. Each btree is identified by the index of its root page. The key and data for all entries comprise a payload (payload). A page of the database has a fixed payload total. If the load is greater than the preset value, the remaining bytes are stored on the overflow page. A payload of a portal plus a forward pointer (the preceding pointer) forms a lattice (cell). Each page has a small head that contains a PTR (N) pointer and some other information, such as key and the size of the data.

Format details
A file is divided into multiple pages. The first page is called Page 1, and the second page is called Page 2, one analogy. The number of pages is 0, which means there are no pages. The size of the page can be from 512 to 65536. Each page is either a btree page, or a freelist page, or an overflow page.
The first page must be a btree page. The front 100 bytes of the first page contain a special header (file header), which is the description of this file.
The number of file headers is as follows:
* * OFFSET SIZE DESCRIPTION
* * 0 Header string (header): "SQLite format 3\000"
* 2 page size in bytes (number of bytes in page).
* * 1 file format write version (versions of file writes)
* * 1 file format read version (file read)
* * 1 Bytes of unused spaces at the end of each page (bytes not used at the ends of each sheet)
* * 1 MAX Embedded payload fraction (maximum embedded payload fragmentation)
* * 1 min embedded payload fraction (minimum embedded payload fragmentation)
* * 1 min leaf payload fraction (min. effective load fragment)
* 4 File Change counter (file changing counter)
* * 4 Reserved for future use (reserved bytes)
* * 4 freelist page (first freelist page)
* * 4 of freelist pages in the file (number of freelist pages in this document)
* 4-byte meta values passed to higher layers ()
**
All integers are big-endian.

The file change counter increases each time the file is modified. This counter allows other processes to know when files have been modified and whether their cache needs to be cleaned.

The maximum embedded payload fragment is a page of all available space, the total amount that can be used by the standard B-tree (non-leaf data) table alone. A value of 255 represents 100%. By default, the maximum number of cells is limited to at least 4 cells to fill a page. Therefore, the default maximum embedded load fragment is 64.

If a page's payload is greater than the maximum payload, the remaining data is stored on the overflow page. Once an overflow page is allocated, a lot of data may be transferred to the overflow page, but the size of the cell is not smaller than the minimum embedded payload fragment.

The minimum effective load fragment is similar to the minimum embedded payload fragment, but it is applied to the leaf nodes in the leafdata tree. The maximum payload fragment of a leafdata is 100% (or a value of 255), and it does not have to be first specified.

Each page of the btree is divided into three parts: the first, the array of cell pointers, and the contents of the cell. Page 1 also has a 100-byte header on the page header.
**
** |----------------|
** | File header | Bytes. Page 1 only.
** |----------------|
** | Page Header | 8 bytes for leaves. Bytes for interior nodes
** |----------------|
** | Cell pointer | | 2 bytes per cell. Sorted order.
** | Array | | Grows downward
** | | V
** |----------------|
** | Unallocated |
** | Space |
** |----------------| ^ Grows upwards
** | Cell Content | | Arbitrary order interspersed with freeblocks.
** | Area | | and free spaces fragments.
** |----------------|
**
The page header is shown in the following illustration:
**
* * OFFSET SIZE DESCRIPTION
* * 0 1 Flags. 1:intkey, 2:zerodata, 4:leafdata, 8:leaf
* * 1 2 byte offset to the Freeblock
* * 3 2 Number of cells on this page
* * 5 2 of the cell content area
* * 7 1 Number of fragmented free bytes
* * 8 4 Right child (the PTR (N) value). omitted on leaves.
**
The flag bit defines the format of this btree page. The leaf flag means that the page has no children children. ZERODATA0 data indicates that this page contains only key and no data; The INTKEY flag means that the key is an integer and is stored in the key size of the cell header, not in the payload area.

The grid cell pointer array starts at the page header. The cell pointer array contains 0 or more 2-byte digits that represent the offset from the start of the file for the cell contents in the cell content area. Lattice cell pointer-type orderly. The system tries to ensure that the free space is located after the last cell pointer, which ensures that the new cells can be added quickly without having to rearrange the page (defragment).

The cell contents are stored at the end of the page and grow to the beginning of the file.

Unused space in the cell content area is collected on the linked list freeblocks. Each freeblock has at least 4 bytes. The offset of the first freeblock is given in the page header. Freeblock is an added order. Because a freeblock has at least 4 bytes, all 3 or 3 unused spaces in the cell content area cannot exist on the Freeblock list. These 3 or less 3 free spaces are called fragments. The total number of all fragments is recorded and stored in the position of the page header offset 7.

* * SIZE DESCRIPTION
* * 2 Byte offset of the next Freeblock
* * 2 Bytes in this Freeblock
**

The lattice cell is variable length. The cell is stored at the end of the page in the cell content area. The cell pointer array that points to the cells immediately follows the page header. The lattice cell does not have to be sequential or orderly, but the cell pointer is sequential and orderly.

The cell contents make full use of variable-length integers. Variable-length integers are from 1 to 9 bytes, and the lower 7 bits per byte are used. The entire integer is composed of 8-bit bytes, where the 8th digit of the first byte is zeroed. The most important byte of an integer appears in the first. Variable-length integers are generally no more than 9 bytes. As a special case, all 8 bytes of the nineth byte will be considered data. This allows the 64-bit integer to be encoded to 9 bytes.
* * 0x00 becomes 0x00000000
* * 0x7f becomes 0x0000007f
* * 0x81 0x00 becomes 0x00000080
* * 0x82 0x00 becomes 0x00000100
* * 0x80 0x7f becomes 0x0000007f
* * 0x8a 0x91 0xd1 0xac 0x78 becomes 0x12345678
* * 0x81 0x81 0x81 0x81 0x01 becomes 0x10204081
This article comes from the Linux commune website (www.linuxidc.com) original link: http://www.linuxidc.com/Linux/2012-11/75009.htm

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More