SQLite Kernel Research

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

The design and function of each module of SQLite kernel is grasped from the global angle. SQLite uses a hierarchical, modular design that makes it highly scalable and portable. Moreover, the architecture of SQLite is not very different from the common DBMS, so it is important to understand the common DBMS. The kernel of SQLite is generally divided into three parts, virtual machine, back-end (backend), and compiler (compiler).

1. Virtual Machine
VDBE is the core of SQLite, and its upper and lower modules are essentially services for it. Its implementation is located in Vbde.c, Vdbe.h, Vdbeapi.c, VdbeInt.h, and vdbemem.c several files. It executes the byte code generated by the compiler (Compiler) through the underlying infrastructure b+tree, which is specifically designed for querying, reading, and modifying the database (bytecode programming lauguage).
The byte code is encapsulated in memory as a Sqlite3_stmt object (internally called VDBE, see VdbeInt.h), and vdbe (or statement) contains everything needed to execute the program:
A) a bytecode program
b) Names and data types for all result columns
c) Values bound to input parameters
d) a program counter
e) An execution stack of operands
f) An arbitrary amount of "numbered" memory cells
g) Other Run-time state information (such as open BTree objects , sorters, lists, sets)

The byte code and assembler are very similar, each instruction is composed of opcode and three operands: <opcode, P1, P2, p3>. OpCode is a function of the operation code, in order to understand, can be regarded as a functional. P1 is a 32-bit signed integer, p2 is a 31-bit unsigned integer, which is usually the destination address (destination) of the instruction that caused the jump (jumping), and of course it has other uses; P3 is a pointer to a null-terminated string or other struct. Unlike the C API, the Vdbe opcode often changes, so you should not write programs with bytecode.
The following several C APIs interact directly with VDBE:
sqlite3_bind_xxx () functions
Sqlite3_step ()
Sqlite3_reset ()
sqlite3_column_xxx () functions
sqlite3_finalize ()

In order to have a sensibility, look at a specific bytecode program below:
Sqlite>. M col
sqlite>. h on
Sqlite>. W 4 15 3 3 15
Sqlite> Explain select * from episodes;
Addr opcode p1 P2 p3
---- --------------- --- --- ---------------
0 Goto 0 12
1 Integer 0 0
2 OpenRead 0 2 # Episodes
3 Setnumcolumns 0 3
4 Rewind 0 10
5 RECNO 0 0
6 Column 0 1
7 Column 0 2
8 Callback 3 0
9 Next 0 5
Ten Close 0 0
Halt 0 0
Transaction 0 0
Verifycookie 0 10
Goto 0 1
Noop 0 0

1.1. Stacks (Stack)
A vdbe program is usually composed of different segments (sections) that accomplish a particular task, and each segment has instructions for the Operation Stack. This is because different instructions have different number of parameters, some directives have only one parameter; some directives have no parameters; some directives have several parameters, in which case three operands are not satisfied.
Given these conditions, the instruction takes a stack to pass the parameters. (Note: From a compilation point of view, there are several ways to pass parameters, such as registers, global variables, and stacks, which are often used in modern languages, with great flexibility). These instructions do not do these things on their own, so before they do, some other instructions are needed to help. Vdbe the intermediate results of the calculation into the memory cells, in fact, the stack and the cells are based on mem (see VDBEINT.H) data structure (note: The stack here, memory units are virtual, remember a computer scientist said: Computer Science 90% The above science is the problem of virtualization. Not fake, OS is essentially a virtual machine, and here SQLite, we also see the virtual figure everywhere, to the back of the OS interface module to discuss this problem carefully.

1.2. Procedure Body (program body)
This is a procedure to open a episodes table.
The first instruction: the integer is prepared for the second instruction, that is, the second instruction executes the required parameters into the stack, openread the parameter values from the stack and executes. SQLite can open multiple database files in a connection through the attach command, whenever SQLite opens a data, it assigns an index number (index), the main database is indexed to 0, the first one is 1, and so on. The value of the integer instruction database index is pressed into the stack, and OpenRead takes the value out of it and decides which data to open, to look at the explanation in the SQLite document:
Open a read-only cursor for the database table whose root page was P2 in a database file.
The database file is determined by a integer from the top of the stack. 0 means the main database and 1 means the database used for temporary tables. Give the new cursor an identifier of P1. The P1 values need not being contiguous but all P1 values should is small integers. It is a error for P1 to being negative.
If P2==0 then take the root page number from off of the stack.
There'll be a read lock on the database whenever there are an open cursor. If the data-
Base was unlocked prior to this instruction then a read lock are acquired as part of the this instruction. A read lock allows other processes to read the database but prohibits any other process from modifying the database. The read lock is released if all cursors is closed. If This instruction attempts to get a read lock but fails, the script terminates with an sqlite_busy error code.
The P3 value is a pointer to a KEYINFO structure that defines the content and collating

Sequence of indices. P3 is NULL for cursors, that is not pointing to indices.

Then look at the setnumcolumns directive to set the column the cursor will point to. P1 is the index of the cursor (here 0, just opened), P2 is the number of columns, and the episodes table has three columns.
Continues the rewind instruction, which sets the cursor back to the beginning of the table, checks whether the table is empty (that is, there is no record), and, if there is no record, causes the instruction pointer to jump to the instruction specified by P2. Here, P2 is 10, which is the close directive. Once rewind sets the cursor, the next step is to execute 5-9 of these instructions, whose primary function is to traverse the result set, recno the keyword of the record specified by the cursor P1 into the stack. The column directive P2 the specified columns from the cursor specified by P1. 5,6,7 Three instructions put the values of the ID (primary key), season, and name fields into the stack respectively. Next, the callback instruction takes three values (P1) from the stack and then forms an array of records, stored in the memory cell. Callback will stop the vdbe operation, give control to Sqlite3_stemp (), and the function returns Sqlite_row.

Once Vdbe has created the record structure, we can remove the value from the record structure's domain by SQLITE3_COLUMN_XXX () functions. The next time you call Sqlite3_step (), the instruction pointer will point to the next command, and the next command will move the cursor down one line, and if there are other records, it will jump to the instruction specified by P2, where for instruction 5, create a new record structure that loops until the end of the result set. The close command closes the cursor and then executes the halt instruction to end the VDBE program.

1.3. Program Start and stop
Now look at the rest of the instructions, the goto instruction is a jump instruction, jump to P2 place, that is, the 12th instruction. Instruction 12 is transaction, which starts a new transaction, and then executes Verifycookie, its primary function vdbe whether the database schema changes after the program is compiled (that is, whether the update operation has been performed). This is a very important concept in SQLite, when SQL is Sqlite3_prepare () compiled into VDBE code to the time the program calls Sqlite3_step () to execute bytecode, another SQL command may change the database schema (such as Alter Table, DROP table, or CREATE table). Once this happens, the previously compiled statement becomes invalid, and the database schema information is recorded in the root page of the database file. Similarly, each statement has a copy of the pattern to compare at compile time, and the function of Verifycookie is to check if they match, and if not, take action.

If they match, the next instruction is executed Goto; it jumps to the main part of the program, which is the first instruction, which opens the table to read the records. Here are two points worth noting:
(1) The Transaction instruction itself does not acquire the lock (the Transaction instruction doesn ' t acquire any locks in itself). Its function is equivalent to the begin, which is actually obtained by the OpenRead command share lock. When the transaction is closed, the lock is released, depending on the halt directive, which will mop up the work.
(2) The storage space required for the statement object (VDBE program) is determined before the program executes. This has two important facts: first, the stack will not have more depth than the number of instructions (usually much less). Second, SQLite can calculate the memory needed to allocate resources before executing the VDBE program.

1.4 Type of instruction (instruction Types)
Each instruction completes a specific task and is usually related to other directives. Broadly speaking, directives can be divided into three categories:
(1) Value manipulation: These directives usually perform arithmetic operations such as add, subtract, divide, logical operations such as and and or, and string manipulation.
(2) Data management: These instructions operate on memory and on disk. Memory instructions are either stack operations or pass data between the cells. Disk operations Directives control B-tree and pager open or manipulate cursors, start or end transactions, and so on.

(3) Control flow: The main command is to move the instruction pointer.

1.5. Execution of programs (program execution)
Finally, let's see how the VM interpreter is implemented and how the byte code is roughly executed. There is a key function in the vdbe.c file:
Execute VDBE Program
int Sqlite3vdbeexec (
VDBE *p/* The VDBE */
)
This function is the entry for executing the VDBE program. Take a look at its internal implementation:

/* Start executing instructions from here
**PC for program counter (int)
*/
for (pc=p->pc; Rc==sqlite_ok; pc++) {
Get Operation code
POp = &p->aOp[pc];
Switch (pop->opcode) {
Case Op_goto: {/* jump */
Check_for_interrupt;
PC = pop->p2-1;
Break
}
... ...
}
}
From this code, we can roughly introduce the principle of VM execution: The VM interpreter is actually a for loop with a large number of switch statements, and each switch statement implements a specific operation instruction.

2, B-tree and pager
B-tree allows vdbe to query, insert and delete data under O (Logn), and O (1) to traverse the result set in two directions. B-tree does not directly read and write to the disk, it only maintains the relationship between pages (pages). When B-tree needs a page or modifies a page, it calls Pager. When the page is modified, pager guarantees that the original page is written to the log file first, and when it finishes the write operation, pager depends on the transaction state. B-tree not directly read and write files, but through the page cache this buffer module read and write files for performance is important (note: This is similar to the operating system read and write files, in Linux, the operating system's upper module does not directly invoke device-driven read and write devices, Instead, a buffer module is invoked to invoke the device driver to read and write files and to store the results in the cache.

2.1. Database file Format
All pages in the database are marked as starting from 1. A database is made up of many b-tree-each table and index has a b-tree (note: The index takes the B-tree, and the table uses the B+tree, which is determined by the different requirements of the table and index and the structure of B-tree and B+tree: B + All the leaf nodes of the tree contain all the keyword information and can be searched in two order--see data structure, Min. And B-tree is more suitable for indexing). The root pages of all tables and indexes are stored in the Sqlite_master table.
The first page in the database (page 1) is a bit special, and the first 100 bytes of page 1 contain a special file header that describes the database file. It includes the version of the library, the version of the schema, the page size, the encoding, and all the parameters that were set when the database was created. The content of this particular header is defined in btree.c, and Page 1 is also the root page of the Sqlite_master table.

2.2, page reuse and recycling (page reuse and Vacuum)
SQLite uses a free list to recycle pages. When all the records of a page are deleted, they are inserted into the list. When the vacuum command is run, the free list is cleared, so the database shrinks, essentially it is re-establishing the database in the new file, and all the pages used are copied past, and the free list does not, and the result is a new, smaller database. When the database's autovacuum is turned on, SQLite does not use the free list, and it automatically compresses the database every time a commit occurs.

2.3. B-tree Record
B-tree pages are made up of B-tree Records, also known as payloads. Each b-tree record, or payload has two domains: the key field and the data field. The key field is the value of the ROWID, or the key of the table in the database. From the b-tree point of view, data field can be anything without structure. The database records are stored in these data fields. B-tree's task is to sort and traverse, and it needs to be the keyword. The size of the payloads is variable, and this is related to the internal keywords and data fields, when a payload is too large to exist in a page to save to multiple pages.

B+tree by keyword, all the keywords must be unique. The table takes B+tree, and the internal page contains no data, as follows:

B+tree (Root page) and internal pages (internal pages) are used for navigation, and the data fields of these pages are pointers to subordinate pages, and contain only keywords. All database records are stored on the leaf page (in leaf pages). At the leaf node level, the records and pages are in the order of the keywords, so the b-tree can traverse horizontally and the time complexity is O (1).

2.4. Record and field (Records and fields)
Records of data fields located on leaf node pages are managed by VDBE, and database records are stored in binary form, but with certain data formats. The recording format consists of a logical header (logical header) and a data segment, header segment includes the header size and an array of data types, the type of data used in data segment, as follows:

2.5. Hierarchical data Organization (hierarchical Organization)

From the top down, data becomes more and more chaotic, and from the bottom up, data is becoming more structured.

2.6. B-tree API
The B-tree module has its own API, which can be used independently of the C API. Another feature is that it supports transactions. Transactions, locks, and logs handled by pager are serviced for B-tree. There are several types of functions that can be categorized as follows:
2.6.1, access, and transaction functions
Sqlite3btreeopen: Opens a new database file. Returns a B-tree object.
Sqlite3btreeclose: Closes a database.
Sqlite3btreebegintrans: Starts a new transaction.
Sqlite3btreecommit: commits the current transaction.
Sqlite3btreerollback: Rolls back, the current transaction.
sqlite3btreebeginstmt: Starts a statement transaction.
sqlite3btreecommitstmt: commits a statement transaction.
sqlite3btreerollbackstmt: Rolls back a statement transaction.

2.6.2, table functions
sqlite3btreecreatetable: Creates a new, empty b-tree in a database file.
sqlite3btreedroptable: Destroys a b-tree in a database file.
sqlite3btreecleartable: Removes all data from a b-tree, but keeps the b-tree intact.
2.6.3, cursor function (cursor Functions)
Sqlite3btreecursor: Creates a new cursor pointing to a particular b-tree.
Sqlite3btreeclosecursor: Closes the b-tree cursor.
Sqlite3btreefirst: Moves the cursor to the first element in a b-tree.
Sqlite3btreelast: Moves the cursor to the last element in a b-tree.
Sqlite3btreenext: Moves the cursor to the next element after the one it is currently
Pointing to.
sqlite3btreeprevious: Moves the cursor to the previous element before the one it is
Currently pointing to.

Sqlite3btreemoveto:moves the cursor to an element, the matches the key value passed in as a parameter.

2.6.4, recording function (record Functions)
Sqlite3btreedelete: Deletes the record, the cursor is pointing to.
Sqlite3btreeinsert: Inserts a new element in the appropriate place of the b-tree.
sqlite3btreekeysize: Returns the number of bytes in the key of the
Cursor is pointing to.
Sqlite3btreekey: Returns The key of the currently pointing to.
sqlite3btreedatasize: Returns the number of bytes in the data record, the the cursor is
Currently pointing to.
Sqlite3btreedata: Returns the data in the record the cursor is currently pointing to.

2.6.5, config function (configuration Functions)
Sqlite3btreesetcachesize:controls the page cache size as well as the synchronous
Writes (as defined in the synchronous pragma).
Sqlite3btreesetsafetylevel:changes the synced to disk in order to increase
or decrease how well the database resists damage due to OS crashes and power failures.
Level 1 is the same as asynchronous (no syncs () occur and there are a high probability of
Damage). The equivalent to pragma synchronous=off. Level 2 is the default. There
is a very low but non-zero probability of damage. The equivalent to pragma
Synchronous=normal. Level 3 reduces the probability of damage to near zero and with a
Write performance reduction. The equivalent to pragma synchronous=full.
Sqlite3btreesetpagesize:sets the database page size.
Sqlite3btreegetpagesize:returns the database page size.
Sqlite3btreesetautovacuum:sets the Autovacuum property of the database.
Sqlite3btreegetautovacuum:returns whether the database uses autovacuum.
Sqlite3btreesetbusyhandler:sets the busy handler
2.7. Example Analysis
Finally, the discussion of this section is concluded with Sqlite3_open implementation (see version 3.6.10 's source code):

As can be known, all of SQLite's IO operations are eventually translated into the system calls of the operating system (one: The DBMS is built on the pain of the OS). At the same time, we can see that the implementation of SQLite is very hierarchical and modular, which makes sqlite more extensible and portable.

3. Compiler (Compiler)
3.1. Word breaker (tokenizer)
The interface passes the SQL statement to be executed to tokenizer,tokenizer, dividing it into one word according to the lexical definition of SQL, and passing it to the parser (Parser) for parsing. The word breaker is written by hand and is mainly implemented in TOKENIZER.C.
3.2. Analysis Device (Parser)
SQLite's parser is a generator with lemon--, an open source LALR (1) parser, and the resulting file is parser.c.
A simple syntax tree:
SELECT rowID, name, season from episodes WHERE rowid=1 LIMIT 1

3.3. Code generator (Generator)
The code generator is the largest and most complex part of SQLite. It is closely related to parser, which generates VDBE programs to execute SQL statements based on the parsing tree. Composed of many files: select.c,update.c,insert.c,delete.c,trigger.c,where.c and other documents. These files generate the corresponding VDBE program directives, such as the SELECT statement generated by SELECT.C. The following is a build implementation of the code that opens the table in a read operation:
/* Generate code that would open a table for reading.
*/
void Sqlite3opentableforreading (
Vdbe *v,/* Generate code into this vdbe */
int Icur,/* The cursor number of the table */
Table *ptab/* The table to be opened */
){
Sqlite3vdbeaddop (V, Op_integer, ptab->idb, 0);
Sqlite3vdbeaddop (V, Op_openread, Icur, ptab->tnum);
Vdbecomment ((V, "#%s", ptab->zname));
Sqlite3vdbeaddop (V, Op_setnumcolumns, Icur, Ptab->ncol);
}
The SQLITE3VDBEADDOP function has three parameters: (1) Vdbe instance (it will add instructions), (2) opcode (one instruction), (3) two operands.

3.4. Query optimization
The code generator is not only responsible for generating code, but also for query optimization. The main implementation is in WHERE.C, where statement blocks generated are usually shared by other modules, such as select.c,update.c and delete.c. These modules call Sqlite3wherebegin () to start the where statement block instruction generation, then add their own vdbe code to return, and finally call the Sqlite3whereend () End command to generate, as follows:

SQLite Kernel Research

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

SQLite Kernel Research

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

SQLite Kernel Research

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support