MySQL Notes: The default use of select is an improper index, resulting in huge performance losses. The database uses one cainiao, which only has the most basic select statement. I recently encountered a performance problem caused by improper indexing of a select statement in mysql. It is interesting to note that indexes are the most relevant factor in database operation performance. However, have you ever wondered whether the database may not have enough "intelligence" to select the most appropriate index for a select statement even if an appropriate index is created? This was the first time I used the magic of force index ~ Let's talk about the Background: The system environment of www.2cto.com: OS: windows 7 home edition 64 bitdb: MySQL 5.5.28 x64 involves a database table, which is called flow and uses the MyISAM engine, there are the following columns: start: intend: inttime: timestampamount: int meaning: each row in the table refers to the data flow of a total of amount from the location start to the location end at the time point. Note: In our application scenarios, start = end is possible, that is, sending and receiving at the same location. Data volume: more than 1000 selectable locations (can be used as start or end), with a time span of about 15 days. A total of more than 20,000,000 pieces of data www.2cto.com have the following indexes on this table: idx_start_time: start, time is keyidx_end_time: end, time is keyidx_time: There are three problems to be solved with time as key: within a time period T, the total data sent from A starting point A is within A time period T, and the total data received from A destination A is within A time period T, the sum of data generated from the starting point or the ending point of A (if both the starting point and the ending point are A, data flow is only calculated once) is A very simple problem, three select statements are released. Set location A to 1, time range T to all day problem 1: select sum (amount) from flow where start = 1 and time between '2017-01-01 00:00:00 'and '2017-01-01 23:59:59' problem 2: select sum (amount) from flow where end = 1 and time between '2017-01-01 00:00:00 'and '2017-01-01 23:59:59' problem 3: select sum (amount) from flow where (start = 1 or end = 1) and time between '2017-01-01 00:00:00 'and '2017-01-01 23:59:59' run the first test, it takes 0.01 s. (Good .) Run the second one, which takes 0.01 s. (Very good. This job is too easy ~) Third, it takes 3 s. (Wait, is this a bad thing? It's too unscientific! How can I merge the data to reduce the time consumption by 300 times ??) No way, it's really not that simple and easy, and it's hard to find a solution. Otherwise, you will not be able to count the data of each day for more than 1000 locations within six months, which is not counted as 1000*180*3 = 540,000 s = 150 h. For 150 hours, let's just make a simple and explosive summary! Fortunately, with the help of the first two questions, and taking advantage of the solid "Set Theory" foundation laid in the primary school, I thought of a roundabout method: question 3 answer = Question 1 answer + Question 2 answer) (A is the data flow in the time period T at the same time as the start point and end point). This is not simple. simply change or in question 3 to and: select sum (amount) from flow where (start = 1 and end = 1) and time between '2017-01-01 00:00:00 'and '2017-01-01 23:59:59' Run this again, it only takes 0.01 s. Combining these three queries is less than 0.05 s, which is much better than the 3 s of the potholes. The entire statistics can always be completed in a few hours ~ However, if the problem ends here, there will be no such article. I can't accept this awkward temporary solution. Besides, write "I don't know why, but this method is faster "is a bit too sb ~ Therefore, I began to think about a more 'elegantly 'solution based on the excellent habit of breaking the sandpot question from an early age. Finally, I remembered in my mind that in my crazy reading age, I saw a command called explain in an SQL tutorial, which can be used to analyze select statements. Okay, fuck this guy. Because of my poor database knowledge, I can only think that this is an index, so I also pay attention to the index part in the explaint result (to be honest, I do not know much about others = ). Question 1 ~ In the explain command Analysis of SQL statement 3, the indexes that are given priority are as follows: Problem 1: idx_start_time problem 2: idx_end_time problem 3: idx_time www.2cto.com. 1st and 2nd use perfect indexes, but 3rd are incorrect. MySQL uses time as the index by default, that is, it uses time to filter all data. In the current issue, the possible causes of efficiency caused by time filtering are ): the time comparison operation uses the between range comparison, while the start and end operations both directly compare the data of about 15 days in a table. Therefore, when querying by day, after time is filtered for the first time, about 1/15 of the remaining data needs to be filtered. On the contrary, if you use start or end for filtering for the first time, because there are about 1000 different locations, only about 1/1000 of the remaining data needs to be filtered by subsequent conditions. So how can I make MySQL correct this index and make a judgment error. One search found a thing called force index and started to try: select sum (amount) from flow force index (idx_start_time, idx_end_time) where (start = 1 or end = 1) and time between '2017-01-01 00:00:00 'and '2017-01-01 23:59:59' result 2012 s. It's faster, but it's not much improved. Then, I thought, how can I run this formula quickly? My preliminary conclusion is: Use start to filter the original data and get a filtering result. r1 uses end to filter the original data again, obtain a filtering result. r2 merges r1 and r2 into r and filters time. Well, isn't the SQL statement written in question 3 that MySQL cannot find this solution? So let's rewrite the method. If this is not the case, MySQL will be able to get started. Expand or: select sum (amount) from flow where (start = 1 and time between '2017-01-01 00:00:00 'and '2017-01-01 23:59:59 ') or (end = 1 and time between '2017-01-01 00:00:00 'and '2017-01-01 23:59:59') does not add force index, but it is still 3 s. Next, add force indexselect sum (amount) from flow force index (idx_start_time, idx_end_time) to www.2cto.com) where (start = 1 and time between '2017-01-01 00:00:00 'and '2017-01-01 23:59:59 ') or (end = 1 and time between '2017-01-01 00:00:00 'and '2017-01-01 23:59:59') It's time to witness the miracle, 0.01s MySQL was finally transferred on this issue! Note: As I mentioned at the beginning, I have no strong database knowledge and experience. Therefore, the solution and viewpoint mentioned above may be inaccurate or even incorrect. Although I finally seem to have come to some conclusions, the root cause of this problem is still not fully understood. Further analysis may require some understanding of the underlying implementation mechanisms of MySQL or other similar relational databases. For me, this is a complete blank. I can only say that for MySQL, changing the literal Writing Method of SQL statements and forcing the specified index in some cases may be miraculous. This is not just a theoretical possibility, but a real problem that may occur during actual work and study.