Distinct statements in Mysql query for duplicate records and related performance discussions

Distinct statements in Mysql query for duplicate records and related performance discussions _mysql

Last Update:2017-01-19 Source: Internet

Author: User

Tags mysql query mysql version ord

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

In a MySQL query, duplicate values may be included. This is not a problem, but sometimes you may want to list only the different values (distinct).

Keyword DISTINCT is used to return only a different value, that is, to go heavy. The usage is also very simple:

SELECT DISTINCT * FROM TableName

DISTINCT this keyword to filter out the extra duplicates and keep only one.

In addition, if you want to weight a field, you can try:

SELECT *, COUNT (DISTINCT nowamagic) from table GROUP by nowamagic

In this usage, the MySQL version cannot be too low.

Before writing a query, we should even sort the filter conditions, and the really efficient conditions (which may have multiple, involved in the same table) are the main driving force of the query, and the inefficient condition only plays an auxiliary role. So what are the criteria for defining efficient filtration conditions? First, it depends on whether the filtration conditions can reduce the amount of data that must be processed as quickly as possible. Therefore, we must pay more attention to the way the conditions are written.
Suppose there are four tables: Customers, Orders, OrderDetail, articles, now assume that the problem with SQL is to find all the customers who have lived in the city of Gotham in the last six months and ordered a bat car. Of course, there are several ways to write this query, and the proponents of ANSI SQL may write the following statements:

Select distinct c.custname from
customers c
join orders o on
o.custid = C.custid
join OrderDetail od
On od.ordid = O.ordid
Join articles A on
A.artid = Od.artid
where c.city = ' GOTHAM ' and
a.artname = ' Batmo Bile ' and
o.ordered >= SomeFunc

Where SomeFunc is a function that returns the exact date six months ago. Note that the distinct is used because, given that a customer can be a big buyer, several bat cars have been ordered recently.
Without considering how the optimizer will overwrite this query, let's take a look at the meaning of this code first. First, the data from the Customers table should retain only the records of the city named Gotham. Next, search the Orders table, which means that the CustID field is best indexed, otherwise the query speed can be guaranteed only by creating a hash table by sorting, merging, or scanning orders tables. For the Orders table, also filter for order dates: If the optimizer is smarter, it will filter out some data before joining (join), thereby reducing the amount of data to be processed later; a less intelligent optimizer may make a connection and then filter it, in which case the filter condition is specified in the connection to improve performance. For example:

Join orders o on
o.custid = C.custid and
a.ordered >= SomeFunc

Note that if you are:

Left OUTER join orders o on
o.custid = C.custid and
a.ordered >= SomeFunc

The filter condition for the left table is invalidated here because it is a left outer join and all the columns in the table will appear in this connection result set. The optimizer is affected by filtering conditions even if the filter condition is independent of the connection (join). For example, if the primary key for OrderDetail is (Ordid, artid), that is, Ordid is the first attribute of the index, then we can use the index to find the records related to the order. But if the primary key is (ArtID, ordid) is too unfortunate (note that, in relation to theory, regardless of which version is exactly the same), at this time the access efficiency ratio (Ordid, artid) as the index is poor, even some database products can not use the index (note 3), The only hope is to add an independent index to the Ordid.
After the table OrderDetail and orders are connected, it is not problematic to look at the articles table, because the table order includes the ArtID field. Finally, check to see if the value in the articles is Batmobile. This is the end of the query, because with distinct, the customer names filtered through layers must also be sorted to eliminate duplicate items. The
avoid using distinct at the top level should be a basic rule. The reason is that even if we omit a condition of the connection, distinct makes the query "seemingly correct"-admittedly, it is easy to find duplicate data, it is difficult to find data inaccurate, so it should be a basic rule to avoid using distinct at the highest level.
Finding incorrect results is harder, for example, if multiple customers happen to be called "Wayne", distinct will not only remove duplicates from multiple orders from the same customer, but also remove duplicate items from different customers with the same name. In fact, a unique customer ID and customer name should be returned to ensure a complete list of the buyers of the bat vehicle.
to get rid of distinct, consider the following ideas: Customers in the Gohtam city, and meet the existence of the test, that is, in the last six months ordered a bat car. Note that most (but not all) SQL dialects support the following syntax:

Select C.custname
from customers c
where c.city = ' GOTHAM '
and exists (select null
from Orders O,
Ord Erdetail OD,
articles a
where a.artname = ' Batmobile ' and
A.artid = Od.artid and
od.ordid =
O.ordid and O.custid = C.custid and
o.ordered >= somefunc)

The existence test of the previous example, the same name may appear several times, but each customer only appears once, no matter how many orders he has. Some people think I'm a bit harsh on ANSI SQL syntax (refer to the example of a "bat car buyer"), because the Customers table in the code above does not have a lower status. The key difference, in fact, is that the Customers table in the new query is the only source of the query's results (nested subqueries are responsible for finding a subset of the customer), while the previous query uses a join.
This nested subquery is closely related to the outer Select. As shown in line 11th of the Code (in bold), the subquery references the current record of the outer query, so the inner subquery is the so-called associated subquery (correlated subquery).
Such subqueries have a weakness that cannot be performed before the current customer is determined. If the optimizer does not overwrite this query, it must first identify each customer, and then check to see if the existence test is met, and when the customer from Gotham is very young to perform efficiently, otherwise the situation will be bad (at this point, the good optimizer should try the other way to execute the query).

Select CustName from
customers
where city = ' GOTHAM ' and
CustID in
(select O.custid from
orders o,< C6/>orderdetail OD,
articles a
where a.artname = ' Batmobile ' and
A.artid = Od.artid and
od.ordid = O.ord ID and
o.ordered >= somefunc)

In this example, the inner query no longer relies on the outer query, and it has become a uncorrelated subquery (subquery), which needs to be executed only once. Obviously, this code uses the original execution process. In the previous example of this section, you must first search for customers that meet the location criteria (such as all from GOTHAM), and then check each order sequentially. Now, the customer who ordered the bat car can be obtained through the inner-layer query.
However, if you look more closely, there are some more subtle differences between the two versions of the code. In the code that contains the associated subquery, it is important that the CustID field in the Orders table be indexed, which is not important for another piece of code, because the index (if any) to be used is the primary key index of the table customers.
You may have noticed that the new version of the query implemented an implicit distinct. Indeed, because of the connection operation, subqueries may return multiple records about one customer. However, duplicate items do not have an impact because the in condition only checks that the item appears in the list returned by the subquery, and in does not care if a value appears in the list once or 100 times. But for consistency, as a whole, you should apply the same rules to both subqueries and the main query, which is to include the presence test in the subquery:

Select CustName from
customers
where city = ' GOTHAM ' and
CustID in
(select O.custid from
orders O
   where o.ordered >= SomeFunc
and exists (select null
from OrderDetail OD,
articles a
where A.artname = ' Batmobile ' and
A.artid = Od.artid and
od.ordid = O.ordid)

Select CustName from
customers
where city = ' GOTHAM ' and
CustID in
(select CustID from
orders< C20/>where ordered >= SomeFunc
and Ordid in (select Od.ordid
from OrderDetail od,
articles a
where a.artname = ' Batmobile ' and
A.artid = Od.artid)

Although nesting becomes deeper and more difficult to understand, the subquery should choose the same selection rule as exists or in: This choice depends on the validity of the date and the commodity condition. Unless the business is very light in the last six months, the name of the product should be the most effective filtration condition, so the subquery is better than exists, because it is quicker to find out all the orders for the bat and then check if the sales occurred in the last six months. If the table OrderDetail artid field is indexed, this method will be faster, otherwise, this clever action will be eclipsed.
Whenever a large number of records are checked for existence, choose in or exists to be considered.
For most SQL dialects, unrelated subqueries can be rewritten as inline views in the FROM clause. However, it is important to remember that in implicitly excluding duplicate items, you must explicitly eliminate duplicates when the subquery is rewritten as an inline view in the FROM clause. For example:

Select CustName from
customers
where city = ' GOTHAM ' and
CustID in
(select O.custid from
orders o,< c6/> (SELECT distinct Od.ordid
from OrderDetail od,
articles a
where a.artname = ' Batmobile '
and A.artid = od.artid) x
where o.ordered >= somefunc and
x.ordid = O.ordid)

Summary: Ensuring that SQL statements return the correct results is just the first step in establishing the best SQL statement.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More