Difference and usage of distinct and row_number() over() in SQL

Source: Internet
Author: User

1 Preface

When we write SQL statements against a database, we sometimes run into an annoying problem: we want each value of a given field to appear only once in the result, but the table contains several records with the same value, so the query returns all of them. That is not what we intended. To avoid it, we need to "deduplicate" the result. What is "deduplication"? Put simply, when several records have the same content in a given field, only one of them is displayed.

So how can we implement "deduplication"? There are two ways.

First, add the distinct keyword to the select statement;

Second, use row_number() over() in the select statement.

Both methods can deduplicate a result. What are the similarities and differences between the two? The following sections describe them in detail.

2 distinct

In SQL, the distinct keyword is used to return only unique values. The syntax is:

select distinct column_name from table_name

Assume that there is a table "CESHIDEMO", which contains two fields: NAME and AGE. The specific format is as follows:

Looking at the table above, we find two records with the same NAME and three records with the same AGE. If we run the following SQL statement,

/** PPPRDER is the schema name, that is, the CESHIDEMO table is in the PPPRDER schema */
select name from PPPRDER.CESHIDEMO

The following result is displayed:

Looking at this result, we find that two of the four records have the same NAME value: the 2nd and 3rd records are both "gavin". What if we want each NAME to be displayed only once? This is where the distinct keyword comes in. Next, run the following SQL statement,

select distinct name from PPPRDER.CESHIDEMO

The following result is displayed:

This result clearly meets our requirement. But what happens if distinct is applied to two fields at once? Let's run the following SQL statement,

select distinct name, age from PPPRDER.CESHIDEMO

The result is as follows:

At first glance, distinct seems to have had no effect: all the records are displayed. There are still two records with the same NAME and three records with the same AGE. But this is in fact the correct result, because when distinct is applied to multiple fields, it removes only records in which all of those fields have the same values. None of the four records matches another in both NAME and AGE, so distinct treats them all as different. To verify this, let's add a record to the table "CESHIDEMO" that is completely identical to an existing one. The table after adding the record is as follows:

Run the following SQL statement,

select distinct name, age from PPPRDER.CESHIDEMO

The result is as follows:

In this result the duplicated record appears only once, which confirms our conclusion.
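The behaviour described in this section can be reproduced with a small sketch. It uses Python's built-in sqlite3 module and an in-memory database in place of the author's database, omits the PPPRDER schema prefix, and fills the table with hypothetical rows chosen only to match the description (two records with the same NAME, three with the same AGE); none of these choices come from the original article.

```python
import sqlite3

# Hypothetical sample rows; the article's actual table contents are not shown.
ROWS = [("tom", 20), ("gavin", 18), ("gavin", 20), ("lily", 20)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CESHIDEMO (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO CESHIDEMO VALUES (?, ?)", ROWS)

# distinct on one column: the two 'gavin' rows collapse into one.
one_col = conn.execute(
    "SELECT DISTINCT name FROM CESHIDEMO ORDER BY name"
).fetchall()

# distinct on two columns: no (name, age) pair repeats, so all four rows remain.
two_col = conn.execute(
    "SELECT DISTINCT name, age FROM CESHIDEMO ORDER BY name, age"
).fetchall()

# Add an exact duplicate of an existing row, then repeat the two-column query.
conn.execute("INSERT INTO CESHIDEMO VALUES ('gavin', 18)")
two_col_after = conn.execute(
    "SELECT DISTINCT name, age FROM CESHIDEMO ORDER BY name, age"
).fetchall()

print(one_col)
print(two_col)
print(two_col_after)
```

With these rows, the two-column query returns the same four rows before and after the duplicate is inserted, because only the fully identical record is collapsed.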

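In addition, pay special attention to the following point: the distinct keyword must come immediately after select, before the first field, and it always applies to the whole list of selected fields. Writing distinct(name), age does not limit it to NAME; the parentheses have no effect. Placing distinct in front of a later field (for example, select name, distinct age) is a syntax error in most databases.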
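The placement of distinct can be checked directly. The sketch below again uses Python's built-in sqlite3 module with an in-memory database and hypothetical rows (neither is from the original article). It shows what SQLite does; other databases may behave or word their errors differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CESHIDEMO (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO CESHIDEMO VALUES (?, ?)",
    [("tom", 20), ("gavin", 18), ("gavin", 20), ("lily", 20)],  # hypothetical rows
)

# Parentheses around the first column do not limit distinct to that column:
# it still applies to the whole (name, age) row.
plain = conn.execute(
    "SELECT DISTINCT name, age FROM CESHIDEMO ORDER BY name, age"
).fetchall()
parenthesized = conn.execute(
    "SELECT DISTINCT (name), age FROM CESHIDEMO ORDER BY name, age"
).fetchall()
same_result = plain == parenthesized

# distinct in front of a later column is rejected by SQLite's parser.
try:
    conn.execute("SELECT name, DISTINCT age FROM CESHIDEMO")
    misplaced_rejected = False
except sqlite3.OperationalError:
    misplaced_rejected = True

print(same_result, misplaced_rejected)
```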

3 row_number() over()

Databases such as SQL Server and Oracle provide a function row_number() that numbers the records returned by a query. It is always followed by an over() clause, which specifies how the records are grouped and sorted before they are numbered. The syntax is as follows:

ROW_NUMBER() OVER(PARTITION BY COLUMN1 ORDER BY COLUMN2)

The records are grouped by the field COLUMN1, sorted by the field COLUMN2 within each group, and numbered from 1 in each group.

partition by: specifies the grouping

order by: specifies the sort order within each group

Next, we test this with the data in the table "CESHIDEMO". First, here is the query result without row_number() over():

Then, run the following SQL statement,

select PPPRDER.CESHIDEMO.*, row_number() over(partition by age order by name desc) from PPPRDER.CESHIDEMO

The result is as follows:

The result shows that a column of numbers has been added to the original table. Comparing it with the SQL statement we ran, we see that the records are indeed grouped by AGE and, within each group, sorted by NAME in descending order and numbered from 1. The function therefore behaves as described.
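The numbering can be reproduced with a minimal sketch. It assumes Python's built-in sqlite3 module linked against SQLite 3.25 or later (the first version with window functions), drops the PPPRDER schema prefix, and uses hypothetical rows rather than the article's actual data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CESHIDEMO (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO CESHIDEMO VALUES (?, ?)",
    [("tom", 20), ("gavin", 18), ("gavin", 20), ("lily", 20)],  # hypothetical rows
)

# Number the rows separately for each age, highest name first.
numbered = conn.execute(
    """
    SELECT name, age,
           ROW_NUMBER() OVER (PARTITION BY age ORDER BY name DESC) AS rn
    FROM CESHIDEMO
    ORDER BY age, rn
    """
).fetchall()

for row in numbered:
    print(row)
```

With these rows, the single record with age 18 is numbered 1, and the three records with age 20 are numbered 1 to 3 in descending order of name (tom, lily, gavin).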

Next, let's look at how to use row_number() over() for "deduplication". The result above suggests an approach: if we group by NAME, sort by AGE, and then keep only the first record of each group, each NAME should appear only once. Run the following SQL statement,

/** Here, rn is the alias of the added row-number column; t is an alias for the subquery, which some databases require */
select * from (select PPPRDER.CESHIDEMO.*, row_number() over(partition by name order by age desc) rn from PPPRDER.CESHIDEMO) t where rn = 1

After running, the result is as follows:

The result shows that deduplication has indeed been achieved. Looking carefully, however, we find a drawback: the statement above also filters out the record whose NAME is "gavin" and whose AGE is "18". It is dropped because the records are grouped by NAME alone, so only one record per NAME survives. In real life, it is perfectly normal for people of different ages to share the same NAME, so this record is not really a duplicate.
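The deduplication query and its side effect can be sketched as follows. The example assumes Python's built-in sqlite3 module with SQLite 3.25 or later, omits the PPPRDER schema prefix, and uses hypothetical rows that include two "gavin" records with different ages.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CESHIDEMO (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO CESHIDEMO VALUES (?, ?)",
    [("tom", 20), ("gavin", 18), ("gavin", 20), ("lily", 20)],  # hypothetical rows
)

# Keep only the first row of each name group, oldest first.
deduped = conn.execute(
    """
    SELECT name, age
    FROM (
        SELECT name, age,
               ROW_NUMBER() OVER (PARTITION BY name ORDER BY age DESC) AS rn
        FROM CESHIDEMO
    ) AS t
    WHERE rn = 1
    ORDER BY name
    """
).fetchall()

print(deduped)
# The younger 'gavin' is gone even though it was not an exact duplicate.
print(("gavin", 18) in deduped)
```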

4 Summary

From the content above, we now know that data can be deduplicated with either the distinct keyword or row_number() over(). When using them, however, we need to pay attention to how each one behaves and how they differ.

The distinct keyword behaves differently on a single field and on multiple fields. Applied to a single field, it removes all records whose value in that field duplicates another record's. Applied to multiple fields, it removes only records in which all of the listed fields have the same values as another record's.

With row_number() over(), the records are first grouped and sorted, and then the first record of each group is kept (this is the approach used in this article). We can of course deduplicate under other conditions by changing the partition by and order by clauses; which conditions to use depends on the requirement.
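As one illustration of such different conditions, the sketch below compares two variants: grouping by both NAME and AGE, so that only fully identical records are collapsed, and grouping by NAME but sorting by AGE in ascending order, so that the youngest record of each NAME is kept. It assumes Python's built-in sqlite3 module with SQLite 3.25 or later and hypothetical rows; these variants are examples, not the author's own queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CESHIDEMO (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO CESHIDEMO VALUES (?, ?)",
    # Hypothetical rows; ('gavin', 18) appears twice as an exact duplicate.
    [("tom", 20), ("gavin", 18), ("gavin", 20), ("lily", 20), ("gavin", 18)],
)

# Variant 1: group by name AND age, so only identical rows are collapsed.
by_name_age = conn.execute(
    """
    SELECT name, age
    FROM (
        SELECT name, age,
               ROW_NUMBER() OVER (PARTITION BY name, age ORDER BY age) AS rn
        FROM CESHIDEMO
    ) AS t
    WHERE rn = 1
    ORDER BY name, age
    """
).fetchall()

# Variant 2: group by name only, but keep the youngest row of each name.
youngest = conn.execute(
    """
    SELECT name, age
    FROM (
        SELECT name, age,
               ROW_NUMBER() OVER (PARTITION BY name ORDER BY age ASC) AS rn
        FROM CESHIDEMO
    ) AS t
    WHERE rn = 1
    ORDER BY name
    """
).fetchall()

print(by_name_age)
print(youngest)
```

The first variant gives the same rows as select distinct name, age, while the second still returns one record per NAME and only changes which record survives.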

In this blog post, I have described my understanding of data deduplication with the distinct keyword and row_number() over(). I hope it helps you!
