Why is the filtered column value always 100% in Mysql explain extended?

1. Running the MySQL `explain extended` command outputs a filtered column (MySQL 5.7 outputs filtered by default). It indicates the percentage of the rows read (the value of the rows column) that will remain in the result. filtered is a very useful value, because for join operations the result-set size of the preceding table directly determines the number of loop iterations. However, in my environment the value of filtered is always 100%; in other words, it is meaningless.
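The relationship between the two columns can be sketched numerically (a toy illustration, not MySQL code): the optimizer expects rows × filtered / 100 rows to survive into the next join step, so a filtered value stuck at 100% means the estimate degenerates to the raw scan size.

```python
def expected_result_rows(rows, filtered_pct):
    """Rows the optimizer expects a table to feed into the next join
    step: 'rows' is EXPLAIN's scan estimate and 'filtered_pct' is the
    percentage expected to survive the WHERE conditions."""
    return rows * filtered_pct / 100.0

# filtered stuck at 100%: every scanned row is assumed to survive
print(expected_result_rows(996355, 100.0))  # 996355.0
# the estimate one would want for "c1 < 100" (about 0.1% selective)
print(expected_result_rows(996355, 0.1))    # 996.355
```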

Referring to the MySQL 5.6 code below, the filtered value is only computed for index scans and full table scans (which is understandable: in the other cases, the rows value is usually already equal to the estimated result-set size).
sql/opt_explain.cc

  bool Explain_join::explain_rows_and_filtered()
  {
    if (table->pos_in_table_list->schema_table)
      return false;

    double examined_rows;
    if (select && select->quick)
      examined_rows= rows2double(select->quick->records);
    else if (tab->type == JT_INDEX_SCAN || tab->type == JT_ALL)
    {
      if (tab->limit)
        examined_rows= rows2double(tab->limit);
      else
      {
        table->pos_in_table_list->fetch_number_of_rows();
        examined_rows= rows2double(table->file->stats.records);
      }
    }
    else
      examined_rows= tab->position->records_read;

    fmt->entry()->col_rows.set(static_cast<ulonglong>(examined_rows));

    /* Add "filtered" field */
    if (describe(DESCRIBE_EXTENDED))
    {
      float f= 0.0;
      if (examined_rows)
        f= 100.0 * tab->position->records_read / examined_rows;
      fmt->entry()->col_filtered.set(f);
    }
    return false;
  }
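The branch logic above can be paraphrased in Python (a sketch for readability, not MySQL's code); it makes visible why a full scan whose records_read estimate equals the table size always yields filtered = 100%.

```python
def explain_rows_and_filtered(quick_records, scan_type, limit,
                              stats_records, records_read):
    """Paraphrase of the C++ branch logic: pick examined_rows per
    access type, then derive filtered as records_read / examined_rows."""
    if quick_records is not None:               # range access ("quick" select)
        examined_rows = quick_records
    elif scan_type in ("JT_INDEX_SCAN", "JT_ALL"):
        examined_rows = limit if limit else stats_records
    else:                                       # ref/eq_ref etc.
        examined_rows = records_read
    filtered = 100.0 * records_read / examined_rows if examined_rows else 0.0
    return examined_rows, filtered

# The article's case: a JT_ALL scan where records_read was (wrongly)
# left at the full table size, so filtered comes out as exactly 100%.
print(explain_rows_and_filtered(None, "JT_ALL", None, 996355, 996355))
# (996355, 100.0)
```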

However, when I construct a full table scan, the filtered result is wrong: it is still 100%, whereas I expect about 0.1%.
    mysql> desc tb2;
    +-------+--------------+------+-----+---------+-------+
    | Field | Type         | Null | Key | Default | Extra |
    +-------+--------------+------+-----+---------+-------+
    | id    | int(11)      | NO   | PRI | 0       |       |
    | c1    | int(11)      | YES  |     | NULL    |       |
    | c2    | varchar(100) | YES  |     | NULL    |       |
    +-------+--------------+------+-----+---------+-------+
    3 rows in set (0.00 sec)

    mysql> explain extended select * from tb2 where c1 < 100;
    +----+-------------+-------+------+---------------+------+---------+------+--------+----------+-------------+
    | id | select_type | table | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra       |
    +----+-------------+-------+------+---------------+------+---------+------+--------+----------+-------------+
    |  1 | SIMPLE      | tb2   | ALL  | NULL          | NULL | NULL    | NULL | 996355 |   100.00 | Using where |
    +----+-------------+-------+------+---------------+------+---------+------+--------+----------+-------------+
    1 row in set, 1 warning (10 min 29.96 sec)

    mysql> select count(*) from tb2 where c1 < 100;
    +----------+
    | count(*) |
    +----------+
    |     1001 |
    +----------+
    1 row in set (1.99 sec)

Tracing with gdb shows that the code takes the correct branch, but the following values are wrong.
  (gdb) p table->file->stats.records
  $18 = 996355
  (gdb) p tab->position->records_read
  $19 = 996355
The tab->position->records_read above should be the estimated number of returned rows; the correct value should be about 1001, not the full table size of 996355.

2. Why does this happen? Later I checked the statistics collected by MySQL and understood.
Like other mainstream databases, MySQL automatically collects statistics in order to generate better execution plans. You can also collect statistics manually with analyze table. The collected statistics are stored in mysql.innodb_table_stats and mysql.innodb_index_stats.
Reference: http://dev.mysql.com/doc/refman/5.6/en/innodb-persistent-stats.html#innodb-persistent-stats-tables

However, that is not the point. The point is that when you look at these two tables, you will find that MySQL collects very few statistics.
    mysql> select * from mysql.innodb_table_stats where table_name='tb2';
    +---------------+------------+-------------+--------+----------------------+--------------------------+
    | database_name | table_name | last_update | n_rows | clustered_index_size | sum_of_other_index_sizes |
    +---------------+------------+-------------+--------+----------------------+--------------------------+
    | test          | tb2        | 06:26:54    | 996355 |                 3877 |                        0 |
    +---------------+------------+-------------+--------+----------------------+--------------------------+
    1 row in set (0.00 sec)

    mysql> select * from mysql.innodb_index_stats where table_name='tb2';
    +---------------+------------+------------+-------------+--------------+------------+-------------+-----------------------------------+
    | database_name | table_name | index_name | last_update | stat_name    | stat_value | sample_size | stat_description                  |
    +---------------+------------+------------+-------------+--------------+------------+-------------+-----------------------------------+
    | test          | tb2        | PRIMARY    | 06:26:54    | n_diff_pfx01 |     996355 |          20 | id                                |
    | test          | tb2        | PRIMARY    | 06:26:54    | n_leaf_pages |       3841 |        NULL | Number of leaf pages in the index |
    | test          | tb2        | PRIMARY    | 06:26:54    | size         |       3877 |        NULL | Number of pages in the index      |
    +---------------+------------+------------+-------------+--------------+------------+-------------+-----------------------------------+
    3 rows in set (0.00 sec)
There are only two pieces of important information here: the total number of records in the table (n_rows) and the number of distinct values of the indexed columns (n_diff_pfx01). In other words, MySQL does not record the value distribution of non-indexed columns. In the previous query example, since c1 is not indexed, MySQL cannot estimate how many records "c1 < 100" will ultimately filter out. This leaves very few situations where filtered is actually meaningful: if you create an index on the column in the where condition, the execution plan will usually use a range or ref scan on that index, where filtered does not apply; filtered is only computed for full table scans and covering index scans.
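The consequence of the missing column statistics can be sketched as follows (an illustration of the principle, not MySQL's actual estimator): with statistics only for indexed columns, a predicate on any other column gets no selectivity estimate, so the optimizer must assume every row matches.

```python
def estimate_matching_rows(n_rows, column_stats, column, true_selectivity):
    """Toy row estimator: apply a selectivity only when statistics
    exist for the column; otherwise assume selectivity 1.0."""
    if column in column_stats:
        # with NDV/histogram data the optimizer could approximate the
        # real selectivity; we idealize that as the true value here
        return round(n_rows * true_selectivity)
    return n_rows  # no statistics for this column: assume all rows match

stats = {"id": {"n_diff_pfx01": 996355}}  # only the PRIMARY column is covered
print(estimate_matching_rows(996355, stats, "c1", 1001 / 996355))  # 996355
print(estimate_matching_rows(996355, stats, "id", 1001 / 996355))  # 1001
```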

3. What are the consequences of MySQL's lack of statistics?
It is not hard to imagine: without indexes, MySQL may generate execution plans with poor performance, such as choosing the wrong join order between a large table and a small table, as shown below.
    mysql> explain extended select count(*) from tb1, tb2 where tb1.c1=tb2.c1 and tb2.c2='xx';
    +----+-------------+-------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
    | id | select_type | table | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra                                              |
    +----+-------------+-------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
    |  1 | SIMPLE      | tb1   | ALL  | NULL          | NULL | NULL    | NULL |   1000 |   100.00 | NULL                                               |
    |  1 | SIMPLE      | tb2   | ALL  | NULL          | NULL | NULL    | NULL | 996355 |   100.00 | Using where; Using join buffer (Block Nested Loop) |
    +----+-------------+-------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
    2 rows in set, 1 warning (0.00 sec)
Although tb1 is a small table and tb2 is a large table, the result set on tb2 is empty once the tb2.c2='xx' condition is applied, so scanning tb2 first is the better choice for performance.
For the same query, PostgreSQL produces a better execution plan: it scans tb2 first, then loops over tb1.

  s=# explain select count(*) from tb1, tb2 where tb1.c1=tb2.c1 and tb2.c2='xx';
                               QUERY PLAN
  -------------------------------------------------------------------
   Aggregate  (cost=20865.50..20865.51 rows=1 width=0)
     ->  Nested Loop  (cost=0.00..20865.50 rows=1 width=0)
           Join Filter: (tb1.c1 = tb2.c1)
           ->  Seq Scan on tb2  (cost=0.00..20834.00 rows=1 width=4)
                 Filter: ((c2)::text = 'xx'::text)
           ->  Seq Scan on tb1  (cost=0.00..19.00 rows=1000 width=4)
  (6 rows)
The following is a comparison of the execution times.

MySQL takes 0.34 s:

    mysql> select count(*) from tb1, tb2 where tb1.c1=tb2.c1 and tb2.c2='xx';
    +----------+
    | count(*) |
    +----------+
    |        0 |
    +----------+
    1 row in set (0.34 sec)

PostgreSQL takes 0.139 s:

    s=# select count(*) from tb1, tb2 where tb1.c1=tb2.c1 and tb2.c2='xx';
     count
    -------
         0
    (1 row)

    Time: 139.600 ms

In the above example the performance difference is not very big; if tb2.c2='xx' is removed, it becomes very large.

MySQL takes 1 minute 8 seconds:

    mysql> explain select count(*) from tb1, tb2 where tb1.c1=tb2.c1;
    +----+-------------+-------+------+---------------+------+---------+------+--------+----------------------------------------------------+
    | id | select_type | table | type | possible_keys | key  | key_len | ref  | rows   | Extra                                              |
    +----+-------------+-------+------+---------------+------+---------+------+--------+----------------------------------------------------+
    |  1 | SIMPLE      | tb1   | ALL  | NULL          | NULL | NULL    | NULL |   1000 | NULL                                               |
    |  1 | SIMPLE      | tb2   | ALL  | NULL          | NULL | NULL    | NULL | 996355 | Using where; Using join buffer (Block Nested Loop) |
    +----+-------------+-------+------+---------------+------+---------+------+--------+----------------------------------------------------+
    2 rows in set (0.00 sec)

    mysql> select count(*) from tb1, tb2 where tb1.c1=tb2.c1;
    +----------+
    | count(*) |
    +----------+
    |     9949 |
    +----------+
    1 row in set (1 min 8.26 sec)

PostgreSQL takes only 0.163 seconds:

    s=# explain select count(*) from tb1, tb2 where tb1.c1=tb2.c1;
                                  QUERY PLAN
    -------------------------------------------------------------------------
     Aggregate  (cost=23502.34..23502.35 rows=1 width=0)
       ->  Hash Join  (cost=31.50..23474.97 rows=10947 width=0)
             Hash Cond: (tb2.c1 = tb1.c1)
             ->  Seq Scan on tb2  (cost=0.00..18334.00 rows=1000000 width=4)
             ->  Hash  (cost=19.00..19.00 rows=1000 width=4)
                   ->  Seq Scan on tb1  (cost=0.00..19.00 rows=1000 width=4)
    (6 rows)

    Time: 0.690 ms
    s=# select count(*) from tb1, tb2 where tb1.c1=tb2.c1;
     count
    -------
     10068
    (1 row)

    Time: 163.868 ms

However, this particular performance difference has little to do with statistics: PostgreSQL supports Nested Loop Join, Merge Join, and Hash Join, while MySQL supports only Nested Loop Join, so when the index needed for an indexed nested loop is missing, the join becomes very slow.
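The algorithmic gap can be illustrated with a toy example (a sketch over made-up tables, not either engine's implementation): an unindexed nested loop performs |tb1| × |tb2| comparisons, while a hash join builds a hash table on the small input and probes it once per row of the large input.

```python
from collections import Counter

def nested_loop_join_count(outer, inner):
    """Count join matches the way an unindexed nested loop does:
    compare every pair of rows (O(N*M) comparisons)."""
    return sum(1 for a in outer for b in inner if a == b)

def hash_join_count(small, large):
    """Count join matches the way a hash join does: build a hash
    table on the small input, then probe it once per row of the
    large input (O(N+M) work)."""
    build = Counter(small)
    return sum(build[v] for v in large)

tb1 = list(range(1000))                  # hypothetical small table's c1 values
tb2 = [v % 2000 for v in range(20000)]   # hypothetical large table's c1 values
assert nested_loop_join_count(tb1, tb2) == hash_join_count(tb1, tb2)
print(hash_join_count(tb1, tb2))  # 10000
```

Both functions return the same count; only the amount of work differs, which is why the missing join algorithm (or missing index) dominates the runtime here.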

4. Conclusion
1. MySQL collects very few statistics: only the number of rows in a table and the number of distinct values of indexed columns. As a result, the MySQL optimizer often has no accurate picture of the data sizes involved and produces execution plans with poor performance.
2. The efficiency of MySQL join operations depends heavily on indexes (I have twice helped colleagues tune MySQL SQL statements simply by adding an index). This does not mean that PostgreSQL joins get no benefit from indexes, but PostgreSQL is far less sensitive to a missing index than MySQL is. In the example above where MySQL ran for more than a minute, once an index was added the execution time of both MySQL and PG immediately dropped below 10 milliseconds. Therefore, when designing a table, developers should anticipate the likely query patterns and build the appropriate indexes (neither too few nor too many).
3. In contrast, PG collects the value distribution of every column, and in addition to distinct-value counts it keeps histograms and most-common-value lists; this is what lets the PG optimizer make correct decisions. Presumably it is also why the PG community believes the PostgreSQL optimizer is smart enough that Oracle-style hints should not be added to the PG kernel (hints can be abused, making a system hard to maintain; however, you can install the pg_hint_plan extension yourself if you really want them).
