Introduction
In this article, there is a query operation for a 41G size database that contains millions records, and if you take advantage of the index, you can reduce the operation time from 37s to 0.2s.
So what is an index? What is the rationale for using indexes to speed up database query operations?
Fundamentals of Indexing
Database provides a persistent way of data storage, querying the database from the database is a basic operation, the efficiency of the query operation is very important.
For query operations, the efficiency of query operations can be greatly improved if the data being queried is organized in some way.
In a database, there are many columns for a record. If these records are organized in a data structure according to column Col1, then the column Col2 must be out of order.
Therefore, in addition to the original data, the database maintains a data structure that satisfies a particular lookup algorithm, pointing to the original data, called an index .
For example, in the following figure, the database has two columns Col1, Col2. When stored, the rows are organized by column Col1, such as Col1, which is organized in binary tree mode. If you look for a value in col1, you can use a binary tree for binary lookups, and you do not need to traverse the entire database.
So the column Col2 is disorderly. To solve this problem, an index was established for Col2, which organizes the Col2 according to a data structure (here is a two-fork tree). Look for column Col2 this way you only need to do a two-point lookup.
?
Implementation of the Index
Because the database is stored on disk, the data structures that implement the reference are stored on disk. Disk IO is an issue to be aware of.
- Two-fork Tree
Binary tree is a classic data structure, but it is not suitable for database indexing.
The reason is that the degree of each node in the binary tree is only 2, and the depth of the tree is higher. When storing, a node usually needs one disk IO, the depth of the tree is higher, the higher the number of disk IO to query a data, the longer it takes to find it.
-
B-Tree
B-Tree is a variant of the two-fork tree, the main difference is that each node can be more than 2 degrees, that is, each node can be divided into a lot of forks, greatly reducing the depth of the tree.
?
- Each piece of data is represented as [Key,data]
- Each non-leaf node has (n-1) bar data n pointers composed of
- All leaf nodes have the same depth, equal to tree height h
- Pointer to node key greater than left record less than the right record
These features greatly reduce the depth of the B + tree and enable an orderly organization of the data.
-
B + Tree
The B + tree is an extension of the tree, characterized in that non-leaf nodes do not store data and store only key. If the size of each node is fixed (such as 4k, as in SQLite), you can further increase the degree of internal nodes and reduce the depth of the tree.
?
- Non-leaf nodes store only key, leaf nodes do not store pointers
- Each node has a fixed size and requires a read Disk operation (page)
-
B + tree with sequential access pointers
A little change in the B + tree, each leaf node adds a pointer to the adjacent leaf node, which can improve the performance of the interval access.
?
, access key from 15 to 30 of data.
- If you don't have a horizontal pointer
B + Tree Find the data of key=15 and find the data of key=18 in the same block. Then the second B + lookup, find key=20 data, in the same block to find the key=30 data.
- Have a level of pointers
B + Tree Find the data for the key=15, find the contents of the same block, or walk right down the horizontal pointer.
How data is stored in SQLite
- Tables (table) and index (index) are B + trees with sequential access pointers
- Table corresponding to the B + Tree, key is Rowid,data is the row of other column data (SQLite assigned a rowid for each row)
- Index corresponding to the B + tree, key is the column that needs to be indexed, data is ROWID
Find data by index in two steps
- Find rowID by index (first B + Tree lookup)
- Find data for other columns based on ROWID (second B + Tree lookup)
A full table scan was avoided with two B + tree lookups.
1. Add a PRIMARY KEY or UNIQUE constraint to a row or rows, the database will automatically create an index for these columns
2. Specify a column as INTEGER PRIMARY KEY, then this column and rowid are specified as the same column. That is, it can be obtained by rowid or by column name.
An example
The following is a statistical information for a table in a database, which is obtained through the Sqlite3_analyzer tool.
?
You can see a total of 3,651 records in the table, the depth of the B-tree is only 2, there are 33 leaf nodes, 1 non-leaf nodes. Therefore, a maximum of 2 disk IO is required to find a row of data based on the ROWID.
Improve search efficiency with indexes
For example, we have such a watch.
?
-
-
Benchmark
The query statements are as follows
SELECT price FROM fruitsforsale WHERE fruit=‘Peach’
Because there is no index, you have to do a full table scan. Iterate through each record (record) with sequential access pointers, compare whether the fruit column and ' peatch ' are consistent, and if so, return the value of the price column for that row.
?
-
-
Index the ' Fruit ' column
As below, run the same statement, you can find the target column corresponding to the ROWID of 4 according to the index, and then find the corresponding row according to rowID, so as to select the price. All-table lookups are avoided by two B + tree lookups. This is the simplest case.
?
-
-
Multiple index Hits
Indexing is not required to be uique, that is, the key in the index table can be the same.
For example, there are two records in the Index table, andorangewhen the first record is found, the next index can be easily found by accessing the pointer sequentially, avoiding another B + tree lookup. (Rowid=1 and rowid=23 may be located in two different leaf nodes)
That is, the process of finding the index can be done with a B + tree check and a next operation, and the next operation is fast.
?
-
-
Accelerate search and sequencing with indexes
In most cases, we need to find and sort operations at the same time, and if an appropriate index is established, the search efficiency can be improved.
For example, the fruit and state two columns are indexed in the following table, and when you run the following SQL statement, you do not need to sort operations because the index table is sequential.
SELECT price FROM fruitforsale WHERE fruit=‘Orange‘ ORDER BY state
?
Explain the questions in the introduction
There is a command called in SQLite toexplain query plansee how SQLite performs the find operation. The following database statement is not a query statement in the introduction, the same principle
Note the detail column. Without indexing, the word "scan" is used, which is the full table scan. When using the index, the word "SEARCH" is used.
For a 41G table, the cost of a full table scan is obviously significant.
Reference links
- On the algorithm and data structure: a ten-balanced search tree B-Tree
- The data structure and algorithm principle behind MySQL index
- Query planning (This is SQLite's document on indexing)
- EXPLAIN QUERY PLAN
- MySQL single-table million data record paging performance optimization
The principles of SQLite indexing