The MySQL DISTINCT statement for querying duplicate records, and a related performance discussion

This article introduces the DISTINCT statement in MySQL for handling duplicate records and discusses the related performance issues. Its main point is that, under certain circumstances, you should avoid using distinct in top-level queries. A MySQL query result may include duplicate values. That is usually not a problem, but sometimes you want to list only the different values.

The DISTINCT keyword returns only unique values, that is, it removes duplicates. Its usage is simple:

SELECT DISTINCT * FROM tableName

The DISTINCT keyword filters out duplicate records, keeping only one copy of each.
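The effect can be tried out with a minimal sketch. It uses Python's built-in sqlite3 module instead of a MySQL server so that it is self-contained; the visits table and its rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("create table visits (city text, country text)")  # hypothetical table
conn.executemany(
    "insert into visits values (?, ?)",
    [("GOTHAM", "US"), ("GOTHAM", "US"), ("METROPOLIS", "US")],
)

# Without DISTINCT every stored row comes back, duplicates included.
all_rows = conn.execute("select * from visits order by city").fetchall()

# With DISTINCT identical rows are collapsed into one.
distinct_rows = conn.execute(
    "select distinct * from visits order by city").fetchall()

print(len(all_rows), distinct_rows)
```

Rows count as duplicates only when every selected column matches, so adding a column with differing values to the select list brings the "duplicates" back.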

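In addition, if you want to deduplicate on a single field, try:

SELECT *, COUNT(DISTINCT nowamagic) FROM tableName GROUP BY nowamagic

This usage does not work on very old MySQL versions. Note also that since MySQL 5.7.5 the default SQL mode includes ONLY_FULL_GROUP_BY, which rejects SELECT * combined with GROUP BY on a single column unless the other columns are functionally dependent on it.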

Before writing a query, we should rank the filtering conditions. The truly efficient conditions (there may be several, possibly involving the same table) are the main driving force of the query; inefficient conditions play only a secondary role. What defines an efficient filtering condition? First of all, it is how quickly the condition reduces the amount of data that must be processed. We must therefore pay attention to how each condition is written.
Assume that there are four tables: customers, orders, orderdetail, and articles. Now suppose the SQL statement must find all the customers who live in Gotham and have purchased a Batmobile in the last six months. There are, of course, many ways to write this query. Advocates of ANSI SQL might write the following:

select distinct c.custname
from customers c
  join orders o
    on o.custid = c.custid
  join orderdetail od
    on od.ordid = o.ordid
  join articles a
    on a.artid = od.artid
where c.city = 'GOTHAM'
  and a.artname = 'BATMOBILE'
  and o.ordered >= somefunc

Here somefunc is a function that returns the date six months ago. Note the distinct: it is needed because a customer may be a big buyer who has recently ordered several Batmobiles.
Leaving aside how the optimizer might rewrite this query, let's look at what this code means. First, only the rows whose city is Gotham are kept from the customers table. Next, the orders table is searched, which means the custid field had better be indexed; otherwise, the only ways to keep the query fast are to sort and merge, or to scan the orders table and build a hash table. The orders table is also filtered on the order date. A smart optimizer filters out some data before the join to reduce the volume processed later; a less intelligent optimizer may join first and filter afterwards. In that case, specifying the filter condition in the join can improve performance, for example:

join orders o
  on o.custid = c.custid
  and o.ordered >= somefunc

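Note: the situation is different for an outer join, for example:

left outer join orders o
  on o.custid = c.custid
  and o.ordered >= somefunc

Here the condition in the on clause does not filter the left table. Because this is a left outer join, every row of the left table appears in the join result set, and customers with no matching order are kept with nulls in the orders columns.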
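The difference between filtering in the on clause and filtering in the where clause of a left outer join can be seen in a small sketch. It runs on Python's built-in sqlite3 module rather than MySQL, the sample rows are invented, and a fixed cutoff date stands in for somefunc.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite as a stand-in for MySQL
conn.executescript("""
    create table customers (custid integer primary key, custname text);
    create table orders (ordid integer primary key, custid integer, ordered text);
    insert into customers values (1, 'Wayne'), (2, 'Kyle');
    insert into orders values (10, 1, '2024-06-01'), (11, 2, '2023-01-15');
""")
cutoff = "2024-01-01"  # hypothetical value that somefunc would return

# Date filter inside the ON clause: every customer row is kept,
# customers without a recent order get NULL in the orders columns.
on_rows = conn.execute("""
    select c.custname, o.ordid
    from customers c
      left outer join orders o
        on o.custid = c.custid and o.ordered >= ?
    order by c.custid""", (cutoff,)).fetchall()

# Date filter in the WHERE clause: rows with NULL order dates are discarded,
# so the outer join behaves like an inner join.
where_rows = conn.execute("""
    select c.custname, o.ordid
    from customers c
      left outer join orders o
        on o.custid = c.custid
    where o.ordered >= ?
    order by c.custid""", (cutoff,)).fetchall()

print(on_rows)
print(where_rows)
```

Kyle's only order is older than the cutoff, so he survives in the first result with a NULL order and disappears from the second.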
Whatever the filtering conditions, the optimizer is also constrained by the indexes available for the join. For example, if the primary key of orderdetail is (ordid, artid), that is, ordid is the first attribute of the index, the index can be used to find the records related to an order. If the primary key is (artid, ordid), however, we are out of luck (note that in terms of relational theory the two versions are exactly the same). Access will then be less efficient than with (ordid, artid) as the index, and some database products cannot use the index at all. The only hope is a separate index on ordid.
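The effect of column order in a composite key can be observed in a rough sketch. It uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module as a stand-in, so the plan text is SQLite's and not MySQL's; the table names od_by_order and od_by_article and the qty column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite as a stand-in for MySQL
conn.executescript("""
    -- same columns, only the order of the primary key columns differs
    create table od_by_order   (ordid integer, artid integer, qty integer,
                                primary key (ordid, artid));
    create table od_by_article (ordid integer, artid integer, qty integer,
                                primary key (artid, ordid));
""")

def plan(table):
    """Return SQLite's plan description for a lookup on ordid alone."""
    rows = conn.execute(
        f"explain query plan select * from {table} where ordid = 42"
    ).fetchall()
    return rows[0][-1]  # the last column holds the human-readable detail

print(plan("od_by_order"))    # an index search on the leading column
print(plan("od_by_article"))  # a scan, since ordid is not the leading column
```

In MySQL the equivalent check would be done with EXPLAIN; the leftmost-prefix behaviour of composite B-tree indexes is the same idea.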
After the orderdetail and orders tables are joined, the articles table is checked. This is not a problem because the orderdetail table contains the artid field. Finally, the value in articles is checked to see whether it is Batmobile, and the query ends. Because distinct is used, the customer names that survive all these filters must also be sorted to remove duplicates.
Avoiding distinct at the top level should be a basic rule. The reason is that even if we omit a join condition, distinct makes the query appear to run correctly. Duplicate data is easy to spot, whereas inaccurate data is hard to spot, and distinct hides exactly the symptom that would have revealed the mistake.
Incorrect results are even harder to detect. For example, if several customers are named "Wayne", distinct removes not only the duplicates produced by multiple orders from the same customer, but also the duplicates produced by different customers with the same name. In fact, the unique customer ID should be returned together with the customer name to guarantee a complete list of Batmobile buyers.
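The "Wayne" problem can be reproduced in a short sketch. It runs on Python's built-in sqlite3 module rather than MySQL, the sample rows are invented, and the date condition is left out to keep the data small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite as a stand-in for MySQL
conn.executescript("""
    create table customers (custid integer primary key, custname text, city text);
    create table orders (ordid integer primary key, custid integer);
    create table orderdetail (ordid integer, artid integer, primary key (ordid, artid));
    create table articles (artid integer primary key, artname text);
    -- two different customers share the name Wayne; customer 1 ordered twice
    insert into customers values (1, 'Wayne', 'GOTHAM'), (2, 'Wayne', 'GOTHAM'),
                                 (3, 'Kyle', 'GOTHAM');
    insert into articles values (100, 'BATMOBILE'), (101, 'BATARANG');
    insert into orders values (10, 1), (11, 1), (12, 2), (13, 3);
    insert into orderdetail values (10, 100), (11, 100), (12, 100), (13, 101);
""")

joins = """
    from customers c
      join orders o on o.custid = c.custid
      join orderdetail od on od.ordid = o.ordid
      join articles a on a.artid = od.artid
    where c.city = 'GOTHAM' and a.artname = 'BATMOBILE'
"""

# DISTINCT on the name alone merges the two different Waynes into one row.
names_only = conn.execute("select distinct c.custname" + joins).fetchall()

# Returning the unique customer ID as well keeps both buyers.
with_ids = conn.execute(
    "select distinct c.custid, c.custname" + joins + " order by c.custid"
).fetchall()

print(names_only)
print(with_ids)
```

The first result looks plausible, which is exactly why the error is hard to notice: nothing in it hints that a buyer is missing.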
To get rid of distinct, consider the following idea: we want the customers in Gotham who satisfy an existence test, namely having purchased a Batmobile in the last six months. Note that most (but not all) SQL dialects support the following syntax:

select c.custname
from customers c
where c.city = 'GOTHAM'
  and exists (select null
              from orders o,
                   orderdetail od,
                   articles a
              where a.artname = 'BATMOBILE'
                and a.artid = od.artid
                and od.ordid = o.ordid
                and o.custid = c.custid
                and o.ordered >= somefunc )

In the example above, the same name may appear several times, but each customer appears only once, no matter how many orders he has placed. Some may think I am a little harsh on the ANSI SQL syntax (in the "Batmobile buyer" example), since the customers table is no less prominent in the code above. In fact, the key difference is that in the new query the customers table is the only source of the query results (the nested subquery is responsible for finding the subset of customers), whereas the previous query uses joins.
This nested subquery is closely tied to the outer select statement. As the condition o.custid = c.custid shows, the subquery refers to the current record of the outer query. The inner subquery is therefore what is called a correlated subquery.
This type of subquery has a weakness: it cannot be executed before the current customer is known. If the optimizer does not rewrite the query, each customer must first be identified and then checked, one by one, against the existence test. When there are very few customers from Gotham, execution is very efficient; otherwise it can be very bad (in which case a good optimizer should try other ways to execute the query). The query can also be written as follows:

select custname
from customers
where city = 'GOTHAM'
  and custid in
        (select o.custid
         from orders o,
              orderdetail od,
              articles a
         where a.artname = 'BATMOBILE'
           and a.artid = od.artid
           and od.ordid = o.ordid
           and o.ordered >= somefunc)

In this example, the inner query no longer depends on the outer query. It has become an uncorrelated subquery, which needs to be executed only once. Obviously, this code reverses the original order of execution. In the previous example, the customers meeting the location criterion (all those from Gotham) had to be found first, and then each of their orders checked in turn. Now, the customers who have purchased a Batmobile are obtained first, through the inner query.
A closer look at the code, however, reveals more subtle differences between the two versions. In the code with the correlated subquery, it is vital that the custid field of the orders table is indexed. This does not matter for the other version, because the index used there (if any) is the primary key index of the customers table.
You may have noticed that the new version of the query performs an implicit distinct. Indeed, because of the joins, the subquery may return several records for the same customer. These duplicates do not affect the result, because the in condition only checks whether a value appears in the list returned by the subquery; it does not matter whether the value appears once or one hundred times. For consistency, the same rule should be applied to the subquery as to the main query, that is, an existence test should also be used inside the subquery:

select custname
from customers
where city = 'GOTHAM'
  and custid in
        (select o.custid
         from orders o
         where o.ordered >= somefunc
           and exists (select null
                       from orderdetail od,
                            articles a
                       where a.artname = 'BATMOBILE'
                         and a.artid = od.artid
                         and od.ordid = o.ordid))

Or

select custname
from customers
where city = 'GOTHAM'
  and custid in
        (select custid
         from orders
         where ordered >= somefunc
           and ordid in (select od.ordid
                         from orderdetail od,
                              articles a
                         where a.artname = 'BATMOBILE'
                           and a.artid = od.artid))
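That the in condition is indifferent to duplicates in the subquery, and that the exists and in forms return each customer exactly once, can be checked with a small sketch. It uses Python's built-in sqlite3 module rather than MySQL, the sample rows are invented, and the date condition is omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite as a stand-in for MySQL
conn.executescript("""
    create table customers (custid integer primary key, custname text, city text);
    create table orders (ordid integer primary key, custid integer);
    create table orderdetail (ordid integer, artid integer, primary key (ordid, artid));
    create table articles (artid integer primary key, artname text);
    insert into customers values (1, 'Wayne', 'GOTHAM'), (2, 'Wayne', 'GOTHAM'),
                                 (3, 'Kyle', 'GOTHAM');
    insert into articles values (100, 'BATMOBILE'), (101, 'BATARANG');
    insert into orders values (10, 1), (11, 1), (12, 2), (13, 3);
    insert into orderdetail values (10, 100), (11, 100), (12, 100), (13, 101);
""")

# Shared body of the subquery: all orders that contain a Batmobile.
batmobile_orders = """
    from orders o, orderdetail od, articles a
    where a.artname = 'BATMOBILE'
      and a.artid = od.artid
      and od.ordid = o.ordid
"""

# On its own, the subquery returns customer 1 twice (two orders).
sub_ids = [r[0] for r in conn.execute(
    "select o.custid" + batmobile_orders + " order by o.custid")]

# Uncorrelated IN: duplicates in the list make no difference.
in_rows = conn.execute(
    "select custid, custname from customers where city = 'GOTHAM' and custid in "
    "(select o.custid" + batmobile_orders + ") order by custid").fetchall()

# Correlated EXISTS: same result, one row per qualifying customer.
exists_rows = conn.execute(
    "select c.custid, c.custname from customers c where c.city = 'GOTHAM' and exists "
    "(select null" + batmobile_orders + " and o.custid = c.custid) order by c.custid"
).fetchall()

print(sub_ids, in_rows, exists_rows)
```

Both Waynes are listed, each once, without any distinct in the outer query.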

Although the nesting becomes deeper and harder to follow, the rule for choosing between exists and in is the same inside the subquery: the choice depends on how selective the date condition and the article condition are. Unless business has been very slow over the past six months, the article name should be the most selective filter, so in is better than exists in the subquery. The reason is that it is faster to find all the orders for Batmobiles first and then check whether the sale took place in the last six months than to do it the other way round. This approach is faster if the artid field of the orderdetail table is indexed. Otherwise, this clever move falls flat.
When performing an existence check against a large number of records, you must consider whether to use in or exists.
In most SQL dialects, an uncorrelated subquery can be rewritten as an inline view in the from clause. However, it is important to remember that in implicitly removes duplicates. When the subquery is rewritten as an inline view in the from clause, the duplicates must be removed explicitly. For example:

select custname
from customers
where city = 'GOTHAM'
  and custid in
        (select o.custid
         from orders o,
              (select distinct od.ordid
               from orderdetail od,
                    articles a
               where a.artname = 'BATMOBILE'
                 and a.artid = od.artid) x
         where o.ordered >= somefunc
           and x.ordid = o.ordid)
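Why the explicit distinct matters once a subquery moves into the from clause can be seen in a simplified sketch. Unlike the query above, it joins the inline view directly to customers at the top level, omits the date condition, and runs on Python's built-in sqlite3 module rather than MySQL; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite as a stand-in for MySQL
conn.executescript("""
    create table customers (custid integer primary key, custname text, city text);
    create table orders (ordid integer primary key, custid integer);
    create table orderdetail (ordid integer, artid integer, primary key (ordid, artid));
    create table articles (artid integer primary key, artname text);
    insert into customers values (1, 'Wayne', 'GOTHAM'), (2, 'Wayne', 'GOTHAM');
    insert into articles values (100, 'BATMOBILE');
    insert into orders values (10, 1), (11, 1), (12, 2);
    insert into orderdetail values (10, 100), (11, 100), (12, 100);
""")

# Inline view x lists the customers behind Batmobile orders;
# {distinct} is filled in with either nothing or the keyword.
view = """
    (select {distinct} o.custid
     from orders o, orderdetail od, articles a
     where a.artname = 'BATMOBILE'
       and a.artid = od.artid
       and od.ordid = o.ordid) x
"""
query = ("select c.custid, c.custname from customers c, " + view +
         " where x.custid = c.custid and c.city = 'GOTHAM' order by c.custid")

# Without DISTINCT, customer 1 is repeated once per order.
plain_rows = conn.execute(query.format(distinct="")).fetchall()

# With DISTINCT in the view, each customer appears once, as with IN.
distinct_rows = conn.execute(query.format(distinct="distinct")).fetchall()

print(plain_rows)
print(distinct_rows)
```

The distinct sits inside the view, on the key, so it removes duplicates without merging different customers who share a name.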

Conclusion: making sure that a SQL statement returns the correct results is only the first step towards writing the best SQL statement.
