Question 1. Why is the database designed to be indexed?
Library to save 1000W books, to find the "architect of the Road", a check, to find out when to go?
The librarian then devised a set of rules:
- On the first floor of the history class, the second floor of literature, third floor put it class ...
- It class, sub-software class, hardware class ...
- Software classes, sorted by book title Sequencer ...
To find a book quickly.
With the analogy, the database stores 1000W data, to find Name= "Shenjian" records, a search, to find out when to go?
Therefore, an index is used to improve the database lookup speed.
Question 2. The hash (hash) is faster than the tree, why is the index structure designed to be tree-shaped?
There are two common types of data structures that speed up the search speed:
- Hash, such as HashMap, the average time complexity of query/insert/modify/delete is O (1);
- Trees, such as the balanced binary search tree, the average time complexity of query/insert/modify/delete is O (LG (n));
You can see that the index of the hash type, whether it's a read request or a write request, is faster than the tree index, so why is the index structure designed to be tree-shaped?
VoiceOver: 80% of the students, the interview can not be answered.
The index is designed to be tree-related and is relevant to the requirements of SQL.
For SQL requirements for such a single-line query:
It is true that the hash index is faster because only one record is queried at a time.
VoiceOver: So, if the business requirements are single-line access, such as passport, you can actually use a hash index.
But for the SQL requirements of the sort query:
- Group: GROUP BY
- Sort: ORDER BY
- Compare:<, >
- ...
A hash index, the time complexity degrades to O (n), and the "ordered" nature of the tree is still able to maintain the efficiency of O (log (n)).
Any design that is out of demand is bullying.
To say the word, InnoDB does not support hash indexes.
Question 3. Why do database indexes use B + trees?
In order to maintain the integrity of the knowledge system, a brief introduction of the next several trees.
1. The first type: two fork search tree
Binary search tree, for example, is the most well-known data structure, do not expand the introduction, why it is not suitable for database index?
- When the amount of data is large, the height of the tree will be higher, the amount of data is large, the query will be relatively slow;
- Each node stores only one record, which may cause a query to have multiple disk IO at a time;
The tree often appears in university textbooks, so it is most well known.
2. Second type: B-Tree
B-Tree, for example, it is characterized by:
- is no longer a two-fork search, but M-fork search;
- Leaf nodes, non-leaf nodes, all store data;
- The middle sequence traversal, can obtain all nodes;
Outside the voice, really do not want to introduce this feature: non-root node contains the number of keywords J satisfied, (┌m/2┐)-1 <= J <= m-1, node splitting to meet this condition.
The B-Tree is created as a data structure for indexing, because it makes perfect use of the "local principle".
(1) What is the principle of locality?
The logic of the local principle is this:
- Memory read-write block, disk read and write slow, and much slower;
- Disk pre-reading: Disk read-write is not read on-demand, but read by the page, read one page of data at a time, load more data, if the future to read the data on this page, you can avoid future disk IO, improve efficiency; (voiceover: Typically, a page of data is 4K.) )
- Local principle: Software design to follow the "data read set" and "use to a data, the approximate rate will use the data near it", so that disk pre-reading can fully improve disk IO;
(2) Why is B-tree suitable for indexing?
- Because of the M fork, the height can be greatly reduced;
- Each node can store J records, if the node size is set to the page size, such as 4 K, can fully utilize the pre-read features, greatly reduce disk IO;
Third type: B + Tree
B + trees, such as the M-fork search tree, have made some improvements on the basis of the tree:
- Non-leaf nodes no longer store data, and the data is stored only on leaf nodes of the same layer; (voiceover: B + Tree has the same path length as the root to each node, and the second tree is not.) )
- Among the leaves, the chain list is added, and all nodes are retrieved, and the sequence traversal is no longer required;
These improvements have made the B + tree more superior than that of the tree:
- Range lookup, positioning min and Max, the middle leaf node, is the result set, not the middle sequence backtracking; (voice-over: Range queries are used much in SQL, which is the biggest advantage of B + trees over the tree. )
- Leaf node stores the actual record line, the record line relatively tight storage, suitable for large data volume disk storage, non-leaf node storage record PK, for query acceleration, suitable for memory storage;
- A non-leaf node, which does not store the actual record, but stores only the key of the record, the B + tree can store more indexes in the same memory;
Finally, to quantify, why is the B + tree of M fork significantly lower than the height of the binary search tree?
Probably calculate:
(1) The principle of locality, the size of a node is set to one page, a page 4K, assuming a key has 8 bytes, a node can store 500 keys, that is j=500
(2) M fork tree, probably m/2<= J <=m, that can be almost 1000 fork tree
So:
- One-layer tree: 1 nodes, 1*500 key, size 4K
- Two-tier tree: 1000 nodes, 1000*500=50w key, size 1000*4k=4m
- Three-layer tree: 1000*1000 nodes, 1000*1000*500=5 billion key, size 1000*1000*4k=4g
Voice: Well, help to see whether it is wrong.
As you can see, storing large amounts of data (500 million) does not require too high tree depth (height 3), and the index is not too much memory (4G).
Summarize
(1) database index for accelerated query
(2) Although the hash index is O (1), the tree index is O (log (n)), but SQL has a lot of "ordered" requirements, so the database uses a tree-type index
(3) InnoDB does not support hash indexes
(4) The idea of data pre-reading is that disk reads and writes are not read on-demand, but read-by-page, read one page at a time, and load more data at a time to reduce disk IO in the future
(5) Local principle: Software design to follow the "data read set" and "use to a data, the approximate rate will use the data near it", so that disk pre-reading can fully improve the disk IO
(6) Index of the database most commonly used B + trees:
- Very suitable for disk storage, can take full advantage of the principle of local, disk pre-reading;
- Very low tree height, capable of storing large amounts of data;
- The index itself occupies very little memory;
- Can be very good support of single-point query, range query, ordered query;
Database index, exactly what it is made of