This article surveys the algorithms commonly used inside databases.
You do not need to implement a database's underlying algorithms from scratch, but it pays to understand roughly how they work.
In fact, many current technologies trace back to classic algorithms; Hadoop, for example, is essentially an evolution of the merge algorithm.
In that sense, algorithms are the foundation: once you truly understand them, learning other techniques becomes much easier.
Before looking at the algorithms themselves, we need to understand algorithmic complexity. Complexity here mainly means time complexity: a measure of how the number of operations grows as the amount of data grows. It gives every algorithm a common ruler, so we can compare which one is better. And once the data reaches a massive scale, we must squeeze out every bit of performance to keep the whole architecture usable.
Algorithmic complexity
The complexity discussed here mainly refers to time complexity: how the number of operations grows as the amount of data grows.
Knowing the complexity of a few common operations gives an intuitive feel for this.
For massive amounts of data:
- Lookup in a hash table: O(1)
- Search in a balanced tree: O(log(n))
- Search in an array: O(n)
- A good sorting algorithm: O(n·log(n))
- A bad sorting algorithm: O(n²)
It is precisely because of the low time complexity of hash tables, balanced trees, and good sorting algorithms that databases use them.
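To make these classes concrete, the sketch below prints a rough operation count for each at n = 1,000,000. It is illustrative only: constant factors and real-world effects such as caches and disk I/O are ignored.

```python
import math

# Rough operation counts for each complexity class at n = 1,000,000.
n = 1_000_000
counts = {
    "hash table, O(1)":        1,
    "balanced tree, O(log n)": int(math.log2(n)),
    "array scan, O(n)":        n,
    "good sort, O(n log n)":   int(n * math.log2(n)),
    "bad sort, O(n^2)":        n * n,
}
for name, ops in counts.items():
    print(f"{name:<26}{ops:,}")
```

At this scale a bad sort needs a trillion operations while a balanced-tree search needs about twenty, which is why the choice of algorithm dominates everything else.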
A common sorting algorithm: merge sort
The sorting algorithm most commonly used in databases is merge sort; it is used mainly for query optimization (sorting results) and for database joins (merge join).
Suppose we are given two sorted sequences of 4 elements each, and we want to combine them into one sequence of 8 elements ordered from small to large. How do we do it?
Each side puts forward one element for comparison, and whichever is smaller goes into the 8-element sequence. For example:
- First round: compare 1 and 2; 1 is smaller, so put 1 into the sequence.
- Second round: the 1 on the left has already been placed, while the 2 on the right has not moved, so now we compare 3 and 2; of course 2 is smaller.
- Third round: compare 3 and 4.
- And so on.
(The original article illustrated the whole process with an animated GIF.)
To summarize:
- Compare the current elements (the "current" element is the first not-yet-consumed element of each sequence).
- Put the smaller element into the 8-element sequence.
- Continue comparing until one sequence runs out, then append the rest of the other.
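The merge step above can be sketched as a short function; this is a minimal illustration, not a database's actual implementation:

```python
def merge(left, right):
    """Merge two already-sorted sequences into one sorted list.

    Repeatedly compare the current (first unconsumed) element of each
    side and move the smaller one into the result.
    """
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    # One side is exhausted; the remainder of the other is already sorted.
    result.extend(left[i:])
    result.extend(right[j:])
    return result

print(merge([1, 3, 5, 7], [2, 4, 6, 8]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Note that each element of both sequences is touched exactly once, so merging N elements costs N operations.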
Look closely at the algorithm above: after the merge, the two sequences have become one ordered sequence.
To run this algorithm, however, there is a requirement: the sequences being merged must already be ordered, so that comparing only the current elements is enough.
But an arbitrary sequence is not, in general, already ordered. At this point we seem to be stuck.
Notice, though, that a single element is always trivially ordered. So if we merge two 1-element sequences, the algorithm definitely applies, and we get an ordered 2-element sequence. With ordered 2-element sequences, we can merge again into an ordered 4-element sequence, and then merge those into an 8-element sequence.
So how do we get those 1-element sequences in the first place? By splitting, of course: 8 elements are split into two 4s, each 4 into two 2s, and each 2 into two 1s.
With that, the algorithm is complete.
So: first split the sequence to be sorted all the way down to 1-element sequences, then merge back up.
- Split phase: divide the sequence into smaller and smaller sequences.
- Merge phase: combine the small sequences together (using the merge algorithm above) into larger and larger ordered sequences.
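The two phases fit naturally into a recursive sketch (again illustrative, assuming in-memory Python lists rather than a database's on-disk implementation):

```python
def merge(left, right):
    # Merge two sorted lists: same procedure as in the previous section.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def merge_sort(seq):
    """Sort by splitting down to 1-element runs, then merging back up."""
    if len(seq) <= 1:                   # a 1-element sequence is trivially ordered
        return list(seq)
    mid = len(seq) // 2
    left = merge_sort(seq[:mid])        # split phase: divide...
    right = merge_sort(seq[mid:])
    return merge(left, right)           # merge phase: ...and conquer

print(merge_sort([8, 3, 5, 1, 9, 2]))  # [1, 2, 3, 5, 8, 9]
```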
Split phase
It takes 3 splitting steps to divide a sequence of 8 into unit sequences. In general the number of steps is log(n): here n=8, so log(n)=3.
Why? Each splitting step halves the length of the sequence, so the number of steps is the number of times the original length can be divided by 2, which is exactly the definition of the (base-2) logarithm.
Merge phase
Similarly, the merge phase has log(N) steps, mirroring the split phase above.
At each step, every element has to be moved into the next, larger sequence, so each step performs on the order of N operations.
That is, **the overall cost is N·log(N) operations.**
(The original article showed the complete process in a diagram.)
Applications of merge sort
If you are familiar with Hadoop, MapReduce is built on the same idea: divide and conquer. A large task is split into many small tasks, which are then solved and merged back together.
You could say MapReduce is an adaptation of the merge algorithm, one that can run on multi-processor, multi-server architectures.
Common data structures: arrays
When it comes to database data structures, the easiest one to picture is a data table, like a spreadsheet in Excel.
For example:
Each row represents one subject, and each column is one of its properties or fields.
The advantage is that it is very intuitive; the disadvantage is that it is too simple: when the data volume grows large, searching becomes hard.
So to optimize search there are two main approaches: build a search tree, or build a hash table. We introduce them separately below.
Tree
If you search an array directly, and it happens to be ordered, you can naturally use binary search, interpolation search, and so on. But in practice most arrays are unlikely to be ordered, so they would first have to be sorted, consuming a lot of resources and time.
Is there a way to get good insert and delete efficiency, and efficient search as well?
Let's start from the simplest case. Suppose we have only {62} and need to insert 88, giving {62,88}. Now, to insert 58 while keeping the order, we would naturally have to move 88 backwards. Can we avoid moving it?
We know that tree structures make insertion and removal convenient, so let's extend this into a binary tree structure.
First, 62 is set as the root node. 88 is larger than 62, so it becomes 62's right subtree; likewise, 58 becomes the left subtree.
This is a binary sort tree (binary search tree): an in-order traversal of it yields an ordered sequence.
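A minimal sketch of this structure, seeded with 62, 88, and 58 as above (the extra keys are made up to fill out the tree):

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert into a binary search tree: smaller keys go left, larger go right."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def in_order(root):
    """In-order traversal of a BST yields the keys in sorted order."""
    if root is None:
        return []
    return in_order(root.left) + [root.key] + in_order(root.right)

root = None
for key in [62, 88, 58, 47, 35, 73, 51, 93]:
    root = insert(root, key)
print(in_order(root))  # [35, 47, 51, 58, 62, 73, 88, 93]
```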
For example, when we want to query 93, the search descends the tree as below.
Once we find 93, we know which row it is in, and we then search only within that row; the scope is naturally much smaller.
So what is the cost of the query? Naturally, the number of levels of the tree, i.e. $log(N)$.
Now imagine an example.
Suppose a table in the database contains a country field, and we want to find everyone who works in China. With a plain array we would need to scan the whole table.
But if we build a binary search tree over all the values in the country field, we can find the node representing China in at most $log(N)$ steps, and through that node we know which rows to consider.
This is an index: using another data structure to represent certain columns, in order to speed up lookups on those columns.
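The idea of an index can be sketched in a few lines. The table rows and names below are made up for illustration, and a plain dict stands in for the tree; the point is that the index maps a column value to the row numbers holding it, so a lookup no longer scans the whole table:

```python
# A toy "index" on the country column of a small in-memory table.
rows = [
    ("alice", "UK"),
    ("bob", "China"),
    ("carol", "USA"),
    ("dave", "China"),
]

index = {}
for row_id, (_, country) in enumerate(rows):
    index.setdefault(country, []).append(row_id)

print(index["China"])  # rows to consider: [1, 3]
```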
But a new problem arises: finding a single value with a binary search tree is fine, but what if you want to find multiple elements between two values? With a plain binary search tree, you would have to check every node to decide whether it lies between the two values.
So an improved tree was introduced: the B+Tree.
Its characteristics are:
- Only the leaf nodes store the row-location information of the associated table.
- The other nodes are used only for routing, that is, to guide the search to the correct leaf.
As you can see, the whole leaf level holds the information in order, so to find the values between 40 and 100,
you only need to find 40 first and then traverse the successor leaf nodes.
For example, finding M successors takes $M + log(N)$ operations.
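A simplified sketch of such a range search: the leaf level is modeled as a Python list of (key, row_id) pairs kept in key order, and `bisect_left` stands in for the $log(N)$ descent through the inner nodes. The keys and row ids are invented for illustration.

```python
import bisect

# The "linked leaf level" of a toy B+Tree, kept in key order.
leaves = [(10, 0), (25, 1), (40, 2), (62, 3), (88, 4), (100, 5), (120, 6)]

def range_search(leaves, low, high):
    keys = [k for k, _ in leaves]
    start = bisect.bisect_left(keys, low)   # stands in for the O(log N) descent
    result = []
    for key, row_id in leaves[start:]:      # walk rightward through the leaves
        if key > high:
            break
        result.append(row_id)
    return result

print(range_search(leaves, 40, 100))  # [2, 3, 4, 5]
```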
But the B+Tree's nodes must be kept in order, so whenever a row is added to or deleted from the database, the B+Tree has to be reorganized; insertion and deletion are also O(log N) operations.
This is why you should not have too many indexes: they slow down INSERT, UPDATE, and DELETE on rows.
Hash
As mentioned earlier, lookup in a hash table has a time complexity of only O(1), which should make it the fastest way to search.
Whether in an ordinary sequence or an ordered one, we have to compare the value we are looking for against elements of the sequence. Could we instead locate the memory position directly from the key alone?
That is, suppose there is a function:
$$ \text{storage location} = f(\text{keyword}) $$
This is called hashing, and $f$ is the hash function.
Records stored using hashing live in a contiguous space called a hash table.
When storing, the hash function computes a record's hash address, and the record is stored at that address.
The same hash function is used to compute the address during lookup.
In other words, you look exactly where you stored: hashing is not only a storage technique but also a search technique.
The best scenario for hashing is finding a record equal to a given value: no comparisons are needed, so efficiency improves greatly.
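This equality lookup is exactly what Python's built-in dict (itself a hash table) provides; the student numbers and names here are made up:

```python
# Equality lookup via a hash table: the key is hashed straight to its
# bucket, so no element-by-element comparison is needed.
students = {"20230101": "alice", "20230102": "bob", "20230103": "carol"}
print(students["20230102"])  # bob
```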
However, if one key corresponds to many records, a hash table is not suitable. For example, using the keyword "male" to look up a particular student in a class is clearly inappropriate.
Likewise, a hash table is not suitable for range search, such as finding the student numbers between 40 and 100.
In addition, we often find that two different keywords produce the same address under the hash function. This is a collision, and such keys are called synonyms of the hash function; so we also need methods to handle collisions.
Constructing a hash function
As noted above, the most important property of a good hash function is that it spreads hash addresses evenly across the storage space, which reduces collisions. Commonly used hash functions transform the original number into another number according to some rule.
- Digit extraction. If the key is a long string of digits and some digit positions are fairly evenly distributed, you can extract those digits, reverse them, rotate them, and so on, to make the result more uniform.
  In other words, the hash function uses part of the keyword.
  For a mobile phone number, for example, you can take a few of its digits.
- Mid-square. If you do not know the distribution of the keys and the number of digits is not large, you can square the key and take the middle 3 digits as the hash address. For example, for 1234 the square is 1522756; take the middle 3 digits, 227, as the hash address.
- The methods above are somewhat special. In practice, the most common method is to take the key modulo some number, possibly after squaring or folding the key first:
$$ f(\text{key}) = \text{key} \bmod p \qquad (p \le m) $$
If $p$ is chosen badly, there will be many more collisions.
Empirically, if the table length is $m$, then $p$ should be the largest prime number less than or equal to $m$, or a composite number with no prime factors smaller than 20.
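The three constructions above can be sketched as small functions. The phone number, the modulus $p = 97$, and the "last 4 digits" choice are example values, not prescriptions:

```python
def digit_extract(phone):
    """Digit extraction: take a few of the more evenly distributed digits."""
    return int(phone[-4:])          # e.g. the last 4 digits of a phone number

def mid_square(key):
    """Mid-square: square the key, take the middle 3 digits."""
    s = str(key * key)
    mid = len(s) // 2
    return int(s[mid - 1:mid + 2])

def modulo(key, p=97):
    """Division method: f(key) = key mod p, with p a prime <= table length."""
    return key % p

print(mid_square(1234))   # 1234^2 = 1522756 -> middle 3 digits: 227
```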
Resolving collisions
Of the collision-resolution methods, we introduce only one: the chain-address method (separate chaining).
We store all the keywords that are synonyms in a single linked list; each such linked list is called a hash bucket.
The hash table then stores only the head pointers of these synonym sub-lists, so no matter how many collisions occur, we simply add nodes to the linked list at that position.
A good hash function should keep the number of elements in each hash bucket very small, so that a lookup effectively goes straight to the table slot, with time complexity O(1).
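A minimal sketch of separate chaining, using Python lists in place of linked lists (the class name and table length 11 are arbitrary choices for illustration):

```python
class ChainedHashTable:
    """Hash table resolving collisions with the chain-address method:
    each slot holds a "hash bucket" of (key, value) pairs."""

    def __init__(self, m=11):          # m: table length, here a small prime
        self.m = m
        self.buckets = [[] for _ in range(m)]

    def _f(self, key):
        return hash(key) % self.m      # f(key) = hash(key) mod m

    def put(self, key, value):
        bucket = self.buckets[self._f(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:               # key already present: overwrite it
                bucket[i] = (key, value)
                return
        bucket.append((key, value))    # empty slot or collision: extend chain

    def get(self, key):
        for k, v in self.buckets[self._f(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.put("alice", 1)
table.put("bob", 2)
print(table.get("bob"))  # 2
```

With a good hash function each bucket stays short, so the chain scan is effectively constant time.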