Basic principles of the distinct statement in MySQL and its comparison with groupby _ MySQL

Source: Internet
Author: User
This article mainly introduces the basic principles of the distinct statement in MySQL and its comparison with groupby. Generally, the implementation principles of groupby and distinct are similar and the performance is slightly better, for more information, see DISTINCT. it is actually very similar to the implementation of the group by operation, except that only one record is retrieved in each GROUP after group. Therefore, the implementation of DISTINCT is similar to that of group by, and there is no big difference. It can also be achieved through loose index scanning or compact index scanning. of course, MySQL can only complete DISTINCT by using a temporary table when the index cannot be used only. However, the difference from group by is that DISTINCT does not need to be sorted. That is to say, MySQL uses a temporary table to "cache" data when only the Query operated by DISTINCT cannot complete operations by using indexes ", but does not perform the filesort operation on the data in the temporary table. Of course, if we use group by for DISTINCT and use aggregate function operations like MAX, we cannot avoid filesort.

Below we will use a few simple Query examples to demonstrate the implementation of DISTINCT.

1. First, let's take a look at how to complete the DISTINCT operation through a loose index scan:

sky@localhost : example 11:03:41> EXPLAIN SELECT DISTINCT group_id   -> FROM group_messageG

*************************** 1. row ***************************      id: 1 SELECT_type: SIMPLE    table: group_message     type: rangepossible_keys: NULL     key: idx_gid_uid_gc   key_len: 4     ref: NULL     rows: 10    Extra: Using index for group-by1 row in set (0.00 sec)

We can clearly see that the Extra information in the execution plan is "Using index for group-by". what does this mean? When I did not perform the GROUP BY operation, the execution plan will tell me that the GROUP BY operation is performed through the index? In fact, this is related to the implementation principle of DISTINCT. during the implementation of DISTINCT, it is also necessary to group and then retrieve one from each group of data and return it to the client. The Extra information here tells us that MySQL has completed the entire operation by using loose index scanning. Of course, if MySQL Query Optimizer can replace the information here with "Using index for distinct", it will be easier to understand.

2. let's take a look at the compact index scan example:

sky@localhost : example 11:03:53> EXPLAIN SELECT DISTINCT user_id   -> FROM group_message  -> WHERE group_id = 2G

*************************** 1. row ***************************      id: 1 SELECT_type: SIMPLE    table: group_message     type: refpossible_keys: idx_gid_uid_gc     key: idx_gid_uid_gc   key_len: 4     ref: const     rows: 4    Extra: Using WHERE; Using index1 row in set (0.00 sec)

The display here is exactly the same as implementing group by through compact index scanning. In fact, during the implementation of this Query, MySQL will let the storage engine scan all the index keys of group_id = 2 to get all user_id, and then use the sorted feature of the index, each time you replace the index key value of user_id, you can complete the entire DISTINCT operation after scanning all the index keys of gruop_id = 2.

3. let's look at how DISTINCT can be completed without using an index separately:

sky@localhost : example 11:04:40> EXPLAIN SELECT DISTINCT user_id   -> FROM group_message  -> WHERE group_id > 1 AND group_id < 10G

*************************** 1. row ***************************      id: 1 SELECT_type: SIMPLE    table: group_message     type: rangepossible_keys: idx_gid_uid_gc     key: idx_gid_uid_gc   key_len: 4     ref: NULL     rows: 32    Extra: Using WHERE; Using index; Using temporary1 row in set (0.00 sec)

When MySQL cannot perform the DISTINCT operation simply by relying on indexes, it has to use a temporary table for corresponding operations. However, we can see that when MySQL uses a temporary table to complete DISTINCT, there is a difference from processing group by, that is, filesort is missing. In fact, in the MySQL grouping algorithm, grouping operations are not necessarily performed BY sorting. I have mentioned this in the GROUP BY optimization tips above. In fact, MySQL implements the DISTINCT operation after grouping without sorting, so the filesort sorting operation is missing.

4. try again with group:

sky@localhost : example 11:05:06> EXPLAIN SELECT DISTINCT max(user_id)   -> FROM group_message  -> WHERE group_id > 1 AND group_id < 10  -> GROUP BY group_idG

*************************** 1. row ***************************      id: 1 SELECT_type: SIMPLE    table: group_message     type: rangepossible_keys: idx_gid_uid_gc     key: idx_gid_uid_gc   key_len: 4     ref: NULL     rows: 32    Extra: Using WHERE; Using index; Using temporary; Using filesort1 row in set (0.00 sec)

Finally, let's take a look at the example of using an aggregate function with group by. compared with the third example above, we can see that there are more filesort sorting operations, it is because we use the MAX function. To obtain the MAX value after the group, you cannot use the index to complete the operation. you can only sort the operation.

Who is better at mysql distinct and group?
1. preparations before testing

// Prepare a test TABLE mysql> create table 'test _ test' (-> 'id' int (11) not null auto_increment,-> 'num' int (11) not null default '0',-> primary key ('id')->) ENGINE = MyISAM default charset = utf8 AUTO_INCREMENT = 1;


Query OK, 0 rows affected (0.05 sec) 


Mysql> delimiter | // change the mysql command Terminator to | // Insert 10 million pieces of data into the table during the creation of a storage process. mysql> create procedure p_test (pa int (11 )) -> begin-> declare max_num int (11) default 100000;-> declare I int default 0;-> declare rand_num int;-> select count (id) into max_num from test_test;-> while I <pa do-> if max_num <100000 then-> select cast (rand () * 100 as unsigned) into rand_num; -> insert into test_test (num) values (rand_num);-> end if;-> set I = I + 1;-> end while;-> end |

Query OK, 0 rows affected (0.00 sec) 


mysql> call p_test(100000)|| 

Query OK, 1 row affected (5.66 sec) 


Mysql> delimiter; // change the mysql command Terminator to; mysql> select count (id) from test_test; // data is included

+-----------+ | count(id) | +-----------+ |  100000 | +-----------+ 1 row in set (0.00 sec) 


Mysql> show variables like "% pro %"; // check whether the profiling is enabled and disabled by default.

+---------------------------+-------+ | Variable_name       | Value | +---------------------------+-------+ | profiling         | OFF  | | profiling_history_size  | 15  | | protocol_version     | 10  | | slave_compressed_protocol | OFF  | +---------------------------+-------+ 4 rows in set (0.00 sec) 


Mysql> set profiling = 1; // enabled

Query OK, 0 rows affected (0.00 sec) 

2. test

// Four groups of mysql> select distinct (num) from test_test; mysql> select num from test_test group by num; mysql> show profiles; // view the results

+ ---------- + ------------ + ----------------------------------------- + | Query_ID | Duration | Query | + ---------- + ------------ + Duration + | 1 | 0.07298225 | select distinct (num) from test_test | 2 | 0.07319975 | select num from test_test group by num | 3 | 0.07313525 | select num from test_test group by num | 4 | 0.07317725 | select distinct (num) from test_test | 5 | 0.07275200 | select distinct (num) from test_test | 6 | 0.07298600 | select num from test_test group by num | 7 | 0.07500700 | select num from test_test group by num | 8 | 0.07331325 | select distinct (num) from test_test | 9 | 0.57831575 | create index num_index on test_test (num) | // I added an index here | 10 | 0.00243550 | select distinct (num) from test_test | 11 | 0.00121975 | select num from test_test group by num | 12 | 0.00116550 | select distinct (num) from test_test | 13 | 0.00107650 | select num from test_test group by num | + ---------- + ------------ + ----------------------------------------- + 13 rows in set (0.00 sec)

The above 1-8 is four groups of data without an index. we can see that distinct is a little better than group.
10-13 is two groups of data that are indexed. we can see that group by is a little better than distinct.
In general, associated fields in tables with a large data volume are indexed, and the retrieval time after the index is added is only about 1/6 of the previous time.

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.