Inside君 has found that few people fully understand MySQL's join types and algorithms, and the fallacy that raising join_buffer_size improves join performance is everywhere online. There are even some uninformed PGers who attack MySQL for not supporting hash join, concluding that it is unsuitable for analytical workloads. It is true that MySQL supports neither hash join nor sort-merge join, but it has its own optimizations and handling for joins; moreover, the branch version MariaDB already supports hash join, so using MySQL for some "simple" analytical queries is entirely acceptable. Of course, once the data volume is truly large, even a traditional MPP relational database that supports hash join may not be appropriate, and such analytical queries should probably be handed to a more specialized Hadoop cluster.

The cost of a join

Before describing MySQL's join types and algorithms, let's first look at the process of joining two tables:

[Figure: the join process between two tables]
The fetch stage occurs when the join column on the inner table is a secondary index but the query needs columns not covered by that index: the primary key index must then be accessed to get the complete row. This is unavoidable whether the table's storage engine is InnoDB or MyISAM; MyISAM is merely faster at it, because its secondary indexes hold pointers directly to the records, while InnoDB is an index-organized table and needs another lookup on the clustered index to locate the data. The fetch stage is not mandatory: if the join goes through the clustered index, the data is obtained directly, there is no need to go back to the table, and the fetch stage disappears. Also, the process above describes a two-table join; a multi-table join simply repeats this process.

To calculate the cost of a two-table join, define the following quantities:

- Number of outer-table scans, denoted O. Usually the outer (driver) table is scanned once during the join, so O = 1.
- Number of inner-table scans, denoted I. This differs depending on the join algorithm.
- Number of records read, denoted R. This may differ depending on the join algorithm.
- Number of join comparisons, denoted M. This differs depending on the join algorithm.
- Number of back-to-table (fetch) reads, denoted F. If the join goes through a secondary index, getting the final data may require going back to the table.

Judging whether a join algorithm is good is essentially a matter of whether these operations are cheap. Of course, the I/O access pattern, sequential or random, must also be considered; in short, join tuning is an art, not as simple as it looks.

Simple Nested-Loop Join

Most online sources say MySQL only supports nested-loop join, so its performance is poor. But is nested-loop join necessarily bad? Is hash join necessarily stronger than nested-loop join? Inside君 finds this understanding one-sided: hash join can be seen as just a variant of nested-loop join. So Inside君 intends to analyze the joins MySQL supports from an algorithmic perspective, along with the optimizations MySQL applies to join statements. First look at Simple Nested-Loop Join (hereafter SNLJ), the simplest nested-loop join. Its pseudo-code is as follows:

```
for each row r in R do
    for each row s in S do
        if r and s satisfy the join condition then
            output the tuple <r, s>
```

[Figure: the SNLJ process]
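To make the cost bookkeeping above concrete, here is a minimal Python sketch of SNLJ, not MySQL's actual implementation: the tables, column names, and counters are illustrative assumptions, with `reads` and `comparisons` corresponding to the R and M metrics defined above.

```python
# Minimal SNLJ sketch with cost counters (illustrative, not MySQL's code).
# Each "table" is just a list of dicts; the join condition is key equality.

def simple_nested_loop_join(outer, inner, key):
    reads = comparisons = 0          # the R and M metrics from the cost model
    result = []
    for r in outer:                  # the outer (driver) table is scanned once: O = 1
        reads += 1
        for s in inner:              # one full inner-table scan per outer row: I = len(outer)
            reads += 1
            comparisons += 1         # every pair of rows is compared
            if r[key] == s[key]:
                result.append({**r, **s})
    return result, reads, comparisons

R = [{"id": i, "a": i % 3} for i in range(4)]   # outer table, 4 rows
S = [{"id": i, "b": i % 2} for i in range(5)]   # inner table, 5 rows
rows, reads, cmps = simple_nested_loop_join(R, S, "id")
print(reads, cmps)   # 24 reads (4 + 4*5) and 20 comparisons (4*5)
```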
The SNLJ algorithm is simple and direct: each record in the outer table (the driver table) is checked against every record in the inner table. But the algorithm is also quite crude, because its cost is very large. Assume the outer table has R records and the inner table has S records; by the criteria given above, the cost of SNLJ is shown in the following table:

| Metric | Cost |
|---|---|
| Outer-table scans (O) | 1 |
| Inner-table scans (I) | R |
| Records read | R + R*S |
| Join comparisons (M) | R*S |
| Back-to-table reads (F) | 0 |
You can see that both the number of records read and the number of comparisons are on the order of R*S, the Cartesian product. Assuming the outer table has 10,000 records and the inner table a similar number, the records read and join comparisons both reach the hundreds of millions (10,000 × 10,000 = 10^8). Faced with such overhead, Inside君 can only say: hehe.

Index Nested-Loop Join

SNLJ is simple and direct, but quite crude. Join optimization therefore usually recommends building an index on the inner table to reduce the cost of the nested-loop algorithm. This is the algorithm the MySQL database uses most, hereafter called INLJ. Its pseudo-code is as follows:

```
for each row r in R do
    lookup r in index on S
    if found s == r then
        output the tuple <r, s>
```

Because the inner table has an index, the join no longer needs to compare against every inner record; the index reduces the number of comparisons and speeds up the query. The entire process is as follows:

[Figure: the INLJ process]
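Below is a rough Python sketch of the same idea. A sorted list plus binary search stands in for the inner table's index; each probe costs O(log n), loosely analogous to descending IndexHeight levels of a real B+tree. The names and table layout are again illustrative assumptions.

```python
import bisect

# INLJ sketch: instead of scanning the inner table, probe an "index" on its
# join key. bisect on a sorted key list plays the role of a B+tree lookup.

def index_nested_loop_join(outer, inner, key):
    inner_sorted = sorted(inner, key=lambda s: s[key])   # build the index once
    keys = [s[key] for s in inner_sorted]
    lookups = matches = 0
    result = []
    for r in outer:                              # still one pass over the outer table
        lookups += 1                             # one index probe per outer row
        i = bisect.bisect_left(keys, r[key])     # O(log n) search
        while i < len(keys) and keys[i] == r[key]:
            matches += 1                         # each hit is one of the Smatch records
            result.append({**r, **inner_sorted[i]})
            i += 1
    return result, lookups, matches

R = [{"id": i, "a": i} for i in range(4)]        # outer table
S = [{"id": i % 3, "b": i} for i in range(5)]    # inner table, duplicate keys allowed
rows, lookups, smatch = index_nested_loop_join(R, S, "id")
print(lookups, smatch)   # 4 index lookups; 5 matched inner records
```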
You can see that each record in the outer table is looked up through the inner table's index. Because the cost of an index lookup is relatively fixed, the optimizer tends to choose the table with fewer records as the outer table (is there a potential problem here?). The cost of the INLJ algorithm is shown in the following table:

| Metric | Cost |
|---|---|
| Outer-table scans (O) | 1 |
| Inner-table scans (I) | 0 |
| Records read | R + Smatch |
| Join comparisons (M) | R * IndexHeight |
| Back-to-table reads (F) | Smatch (secondary index) / 0 (clustered index) |
In the table above, Smatch denotes the number of matching records found through the index. Notice that the index greatly reduces the inner-table cost of the join: each comparison against one outer record is really an index lookup, and each index lookup costs the height of the tree, IndexHeight. The INLJ algorithm is not complex and is easy to understand. But does its efficiency meet users' expectations? In fact, if the join goes through the inner table's primary key (clustered) index, INLJ is quite efficient even at large data volumes, because the cost of each index lookup is very small and the access pattern is mostly sequential (assuming most clustered-index accesses follow key order). When people accuse MySQL's INLJ of being slow, it is mostly because the index used for the join is not the clustered primary key index but a secondary index; then INLJ needs an extra fetch step, and the cost of that step is considerable:

[Figure: INLJ through a secondary index, with the extra fetch step]
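The sketch below shows why a secondary-index join pays extra (again with made-up, illustrative structures): the secondary index stores only (key, primary key), so every match needs a second lookup into the clustered index to fetch the full row. That second lookup is the fetch, or back-to-table, step, and in InnoDB each one is a potentially random B+tree descent.

```python
# InnoDB stores full rows in the clustered (primary key) index; a secondary
# index leaf holds only (secondary_key, primary_key). Joining through the
# secondary index therefore needs a second lookup to fetch the full row.

clustered = {i: {"id": i, "name": f"row{i}", "grp": i % 3} for i in range(6)}

secondary = {}                                  # grp -> list of primary keys
for pk, row in clustered.items():
    secondary.setdefault(row["grp"], []).append(pk)

fetches = 0
for r in [{"grp": 0}, {"grp": 1}]:              # two outer rows joining on grp
    for pk in secondary.get(r["grp"], []):      # index lookup yields primary keys only
        fetches += 1                            # back to the table: another (random) lookup
        full_row = clustered[pk]                # this second step is the F metric above
        print(r["grp"], "->", full_row["name"])

print("fetch operations:", fetches)             # F equals the number of matched records
```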
Because the access goes through a secondary index, if the query needs columns that live only in the clustered index, it must go back to the table to fetch the data. It looks like just one extra back-to-table operation per matched record, but this is the biggest drawback of the INLJ algorithm. First, an index lookup on a secondary index is itself a fairly random I/O access. Second, the back-to-table read is another random I/O on top of that index lookup. So the biggest problem with INLJ is that it may require a large number of discrete (random) operations, which was the dominant bottleneck before SSDs appeared. And even though SSDs have greatly improved random-access performance, random I/O is still far slower than sequential I/O, not even in the same order of magnitude.
MySQL Join Algorithm and Tuning White Paper (Part I)