This article mainly introduces how to implement binary search in MySQL. it comes from the original questions of the computer graduate examination. if you need a friend, you can refer to a given natural number array in ascending order, the array contains repeated numbers, such as [1, 2, 3, 4, 4, 5, 6, 7]. Problem: given any natural number, perform binary search on the array, return the correct position of the array, and provide function implementation. Note: If the number is the same consecutively, the first matching position or the last matching position is returned, which is determined by the input parameter of the function.
Why did I have this question?
Binary search is very important in database kernel implementation
In the database kernel implementation, binary search is a very important logic. more than 99% of SQL statements (range scans on all indexes, equivalent queries, and Unique queries ), binary search is used to locate data.
Consider a database table t1 (a int primary key, B int). the B field in the table has a B + tree index. The value of the B field recorded in the table is [1, 2, sequence 2, 3, 4, 4, 5, 6, 7. Given the following two query statements, different binary search logic is used:
SQL1:
select * from t1 where b > 4;
SQL2:
select * from t1 where b >= 4;
For SQL1, the binary search of the index requires skipping all 4 and returning all records from the last 4. for SQL2, the binary search needs to locate the first 4, then read all records in sequence.
In addition, for other query logics in the database, binary search also requires more features, such:
SQL3:
select * from t1 where b < 2;
SQL4:
select * from t1 where b <= 2;
Because database indexes support reverse scanning at the same time, SQL 3 and SQL 4 statements can all use reverse scanning of indexes. During reverse scanning, SQL3 needs to locate the first 2 in the index, while SQL4 needs to locate the last 2 in the index, and then start to return the index records that meet the query conditions in reverse.
Binary search is a basic and error-prone function in programming.
The first correct binary search algorithm was published 12 years after the first binary search was implemented. Enter Binary Search or Binary Search keywords through Google. There are a large number of related articles or blogs discussing this topic.
Problems to be aware of when implementing binary search
This article does not describe how to implement a correct binary search. after all, there are a large number of correct versions on the Internet. Next, we will make some simple analysis based on some problems found during the exam correction process. we hope to implement an effective binary search algorithm, or even a binary search algorithm available in the database, it is helpful.
Question 1: Check whether the parameter is valid
A large number of papers, when providing an algorithm to solve this problem, directly start the calculation with the low and high parameters, but do not check the low/high parameters. Are records in the array the same as low/high? Is the interval composed of low and high valid? The code is not robust enough.
In the database's binary search implementation, binary search is generally performed on an index page. There may be no user records on the index page (when all the records on the index page are deleted and are not merged with the brother page). at this time, low/high is 0, at this time, if you read records based on the mid calculated by low/high, there will be a logical error.
Problem 2: Calculation of binary search median
This is a classic topic. how to calculate the median value in binary search? In the exam, we generally provide two calculation methods:
Algorithm 1: mid = (low + high)/2
Algorithm 2: mid = low + (high-low)/2
At first glance, the algorithm is concise. After Algorithm 2 is extracted, it is no different from Algorithm 1. But in fact, the difference exists. In extreme cases, there is a risk of overflow in Algorithm 1 (low + high), which leads to incorrect mid results and program errors. Algorithm 2 ensures that the calculated mid must be greater than low, less than high, and there is no overflow problem.
Return to the database binary search. a database index page (8 k or 16 k in size) has limited index records that can be stored, so it will certainly not appear (low + high) the risk of overflow. This is also why the value in InnoDB is implemented by algorithm 1. However, as a rigorous program designer, Algorithm 2 is recommended to eliminate any potential risks.
Problem 3: recursive binary search
For more than half of the exam, binary search is performed using recursive calling. It cannot be said that recursive implementation is wrong, but it is about implementation efficiency. In general, recursive calls have overhead of stack pressure/output, and their efficiency is relatively low. The efficiency is the first in the case of a database system that provides fast query response to optimize code efficiency. We do not recommend that you use recursive binary search, at least in database kernel implementation. As far as I know, all open-source database systems, such as InnoDB and PostgreSQL, do not implement binary search recursively.
Question 4: How to find the first/last equivalence
Return to the question. the first/last equivalent item must be returned based on different input parameters. In the background section of this article, I also explained the corresponding database query (>,>= the query requirements are different ). In the exam, more than 80% of students first perform a binary search. after the same value is to be determined, the results are based on the passed-in flag (user requirements: flag = 1, returns the first equivalent item; flag = 0, returns the last equivalent item), performs sequential traversal until the items that meet the conditions are located.
Likewise, this implementation cannot be said to be wrong, but there are also performance problems. Performance and performance are always one of the key considerations for implementing the database kernel (I believe it is also an indicator of all applications ). In a database, except for primary key indexes and Unique Indexes, many secondary indexes have the same key value, sometimes there are more than items with the same key value (consider a user's order or purchase record ).
Assume that an index page stores 400 records, all of which are the same key value. In this case, the first binary search and then the sequential traversal algorithm are used. The second binary search can only be used once, and the order is traversed for 199 times. Finally, the comparison is made for 200 times. Very low efficiency. Of course, I am also delighted to see the practices of another small number of students (the algorithms I expect to see) and use flag to correct the final results of each comparison. For example, if the comparison is equal (equal to 0, greater than 1, less than-1), but flag = 1, the comparison result after correction is 1, you need to move the high to mid of the binary search to continue the binary search. (if the flag is 0, the corrected result is-1. you need to move the low to mid of the binary search, continue to the second part ). In this way, the equivalence can still be searched in binary mode, and the final comparison only takes 9 times, far less than 200 times.
This problem further raises the next question: how to implement a common and more complex binary search algorithm in the database?
Question 5: Implementation example of binary search in the database
Binary search in databases is more complex. a general binary search algorithm is required to be used in different SQL query scenarios.
InnoDB summarizes four different Search modes for different SQL statements:
# Define PAGE_CUR_G 1> query
# Define PAGE_CUR_GE 2 >=, = query
# Define PAGE_CUR_L 3 <查询< p>
# Define PAGE_CUR_LE 4 <= query
Then, based on the four different Search modes, adjust the Search Mode when the binary Search encounters the same key value. For example, if the Search Mode is PAGE_CUR_G or PAGE_CUR_LE, move low to mid and continue the binary Search. if the Search Mode is PAGE_CUR_GE or PAGE_CUR_L, move high to mid and continue the binary Search.
Our TNT engine adopts different solutions from InnoDB, but also implements the same functions. The adjustment of the TNT engine to the same key value is summarized as that I will not explain it here. you can try your own analysis.
/* Operator includeKey forward compare result: 1 0-1 */
========================================================== ============================================
> = 1 1 | 1-1-1
= 1 1 | 1-1-1
> 0 1 | 1 1-1
<0 0 | 1-1-1
<= 1 0 | 1 1-1
========================================================== ============================================