Illustration of how to compare data of each group of two tables in SQL Server

Source: Internet
Author: User

Start

A while ago, a project encountered such an SQL query requirement. There were two tables with the same structure (table_left & table_right), as shown below:

Figure 1.

Check whether there is a group of (groupId) data in table table_right that is exactly the same as its data.

1. We can see that the table_left and table_right tables have two groups of data that are completely equal:

Figure 2.

Analysis

From the above two tables, we can know that they store a group of data. Then, I will use the column lift method and operation of the mathematical set for analysis.

First, the data of each group of the two tables is described by using the column method of the Set:

Figure 3.

There are only two cases: equal and unequal. For not equal, it can be divided into parts equal, inclusive, and completely unequal. Use the set description. intersection, subset, and union can be used. As shown in figure 4 below, I will list these common situations:

Figure 4.

Implementation

In the database, there are many methods to locate the groups in which the table table_left and table table_right have the same data. Here I will list two common methods.

(The following SQL script is based on the data shown in Figure 4)

Method 1:

Use "Select... From... Order... Xml for path ('')" sorts the data columns of each group (for example, figure 4. the table_left group #11 column data into a "data1-data2-data3"), other groups (including the table table_right) in this way to achieve data column data together; then, after comparing the two tables, check whether the fields are equal. If they are equal, the two groups of data are equal, it can be used to determine which data group of the table table_left exists in the table table_right, which is exactly the same as the data group.

For method 1, you need to add a field dataPath to the original table to store the results of the data column data partition, such:

Copy codeThe Code is as follows: alter table table_left add dataPath nvarchar (200)
Alter table table_right add dataPath nvarchar (200)

Group the external data column data and update it to the newly added column dataPath, for example:

Copy codeThe Code is as follows: update
Set dataPath = B. dataPath
From table_left
Cross apply (select '-' + x. data from table_left x where x. groupId =. groupId order by x. data for xml path ('') as dataPath) B

Update
Set dataPath = B. dataPath
From table_right
Cross apply (select '-' + x. data from table_right x where x. groupId =. groupId order by x. data for xml path ('') as dataPath) B

The following is the query, for example:

Copy codeThe Code is as follows: select distinct a. groupId
From table_left
Where exists (select 1 from table_right x where x. dataPath = a. dataPath)

Complete code:

Copy codeThe Code is as follows: View Code
Use tempdb
Go
If object_id ('table _ left') is not null drop table table_left
If object_id ('table _ right') is not null drop table table_right
Go
Create table table_left (groupId nvarchar (5), data nvarchar (10 ))
Create table table_right (groupId nvarchar (5), data nvarchar (10 ))
Go
Alter table table_left add dataPath nvarchar (200)
Alter table table_right add dataPath nvarchar (200)
Go
Create nonclustered index ix_left on table_left (dataPath)
Create nonclustered index ix_right on table_right (dataPath)
Go
Set nocount on
Go
Insert into table_right (groupId, data)
Select '# 1', 'data1' union all
Select '# 1', 'data2' union all
Select '# 1', 'date3' union all
Select '# 2', 'data55' union all
Select '# 2', 'data55' union all
Select '# 3', 'data91' union all
Select '# 3', 'data92' union all
Select '# 4', 'data65' union all
Select '# 4', 'data66' union all
Select '# 4', 'data67' union all
Select '# 4', 'data68' union all
Select '# 4', 'data69' union all
Select '# 5', 'data77' union all
Select '# 5', 'data79'
Insert into table_left (groupId, data)
Select '# 11', 'data1' union all
Select '# 11', 'data2' union all
Select '# 11', 'date3' union all
Select '# 22', 'data55' union all
Select '# 22', 'data57' union all
Select '# 33', 'data99' union all
Select '# 33', 'data99' union all
Select '# 44', 'data66' union all
Select '# 44', 'data68' union all
Select '# 55', 'data77' union all
Select '# 55', 'data78' union all
Select '# 55', 'data79'
Go
Update
Set dataPath = B. dataPath
From table_left
Cross apply (select '-' + x. data from table_left x where x. groupId =. groupId order by x. data for xml path ('') as dataPath) B
Update
Set dataPath = B. dataPath
From table_right
Cross apply (select '-' + x. data from table_right x where x. groupId =. groupId order by x. data for xml path ('') as dataPath) B
--
Select distinct a. groupId
From table_left
Where exists (select 1 from table_right x where x. dataPath = a. dataPath)

Method 2:

The set operator "distinct T" provided by SQL Sever is used to determine the two groups of non-duplicate data. If neither of the two groups has non-duplicate data for each other, the two groups are completely equal. For example, if the group #11 in Table table_left and the group #1 in Table table_right are used to calculate the "T" set of column data, none of them are (# 11à #1) if you perform the distinct t set operation or (#1 à #11) perform the distinct t set operation, all null results are returned. This means that the data in group #1 and #11 are completely equal, for example:

Copy codeThe Code is as follows: select data from table_left where groupId = '# 11' doesn't select data from table_right where groupId =' #1'
Select data from table_right where groupId = '# 1' doesn't select data from table_left where groupId =' #11'

Similarly, we perform the "distinct T" set operation on column data for groups #11 in table_left and #2 in table_right, for example:

Copy codeThe Code is as follows: select data from table_left where groupId = '# 11' doesn't select data from table_right where groupId =' #2'
Select data from table_right where groupId = '# 2' doesn't select data from table_left where groupId =' #11'

As long as there is a record in the calculation results of the (# 11à #2) or (# 2à #11) "distinct T" set, the data in the two groups is not equal.

All groups in the two tables are compared. We need to use the following SQL script, for example:

Copy codeThe Code is as follows: select distinct a. groupId
From table_left
Inner join table_right B on B. data = a. data
Where not exists (select x. data from table_left x where x. groupId = a. groupId before t select y. data from table_right y where y. groupId = B. groupId)
And not exists (select x. data from table_right x where x. groupId = B. groupId before t select y. data from table_left y where y. groupId = a. groupId)

Complete code:

Copy codeThe Code is as follows: View Code
Use tempdb
Go
If object_id ('table _ left') is not null drop table table_left
If object_id ('table _ right') is not null drop table table_right
Go
Create table table_left (groupId nvarchar (5), data nvarchar (10 ))
Create table table_right (groupId nvarchar (5), data nvarchar (10 ))
Go
Create nonclustered index ix_left on table_left (data)
Create nonclustered index ix_right on table_right (data)
Go
Set nocount on
Go
Insert into table_right (groupId, data)
Select '# 1', 'data1' union all
Select '# 1', 'data2' union all
Select '# 1', 'date3' union all
Select '# 2', 'data55' union all
Select '# 2', 'data55' union all
Select '# 3', 'data91' union all
Select '# 3', 'data92' union all
Select '# 4', 'data65' union all
Select '# 4', 'data66' union all
Select '# 4', 'data67' union all
Select '# 4', 'data68' union all
Select '# 4', 'data69' union all
Select '# 5', 'data77' union all
Select '# 5', 'data79'
Insert into table_left (groupId, data)
Select '# 11', 'data1' union all
Select '# 11', 'data2' union all
Select '# 11', 'date3' union all
Select '# 22', 'data55' union all
Select '# 22', 'data57' union all
Select '# 33', 'data99' union all
Select '# 33', 'data99' union all
Select '# 44', 'data66' union all
Select '# 44', 'data68' union all
Select '# 55', 'data77' union all
Select '# 55', 'data78' union all
Select '# 55', 'data79'
Go
-- Select
Select distinct a. groupId
From table_left
Inner join table_right B on B. data = a. data
Where not exists (select x. data from table_left x where x. groupId = a. groupId before t select y. data from table_right y where y. groupId = B. groupId)
And not exists (select x. data from table_right x where x. groupId = B. groupId before t select y. data from table_left y where y. groupId = a. groupId)

Method 1 Vs. Method 2:

Both method 1 and method 2 can find that table_left has a group with exactly the same data in table_right #11. However, from the performance perspective, method 2 is slightly better than method 1. You can view the statistics of their execution process:

Method 1:

Figure 5.

Method 2:

Figure 6.

If the data volume is large, method 2 has obvious advantages over method 1. Because method 1 has two more dataPath updates and the data volume increases, the location update consumes a lot of resources. If the data size of the dataPath column exceeds 900 bytes, the index cannot be created in dataPath, which affects the performance of the Select query.

Extension

Here, the extension is mainly for the above method 2. When the data size of a column exceeds 900 bytes, or multiple data columns need to be compared, check whether the corresponding columns of the two groups (groupId) are one-to-one equal.

Figure 7.

In this case, you can create a hash index for the dataSub1 & dataSub2 fields, such:

Copy codeThe Code is as follows: alter table table_left add dataChecksum as checksum (dataSub1, dataSub2)
Alter table table_right add dataChecksum as checksum (dataSub1, dataSub2)
Go
Create nonclustered index ix_table_left_cs on table_right (dataChecksum)
Create nonclustered index table_right_cs on table_right (dataChecksum)

The following select query statement can be slightly changed in the Inner Join Section, for example:

Copy codeThe Code is as follows: select distinct a. groupId
From table_left
Inner join table_right B on B. dataChecksum = a. dataChecksum
And B. dataSub1 = a. dataSub1
And B. dataSub2 = a. dataSub2
Where not exists (select x. dataSub1, x. dataSub2 from table_left x where x. groupId =. groupId limit t select y. dataSub1, y. dataSub2 from table_right y where y. groupId = B. groupId)
And not exists (select x. dataSub1, x. dataSub2 from table_right x where x. groupId = B. groupId limit t select y. dataSub1, y. dataSub2 from table_left y where y. groupId =. groupId)

Complete code:

Copy codeThe Code is as follows: View Code
Use tempdb
Go
If object_id ('table _ left') is not null drop table table_left
If object_id ('table _ right') is not null drop table table_right
Go
Create table table_left (groupId nvarchar (5), dataSub1 nvarchar (10), dataSub2 nvarchar (10 ))
Create table table_right (groupId nvarchar (5), dataSub1 nvarchar (10), dataSub2 nvarchar (10 ))
Go
Alter table table_left add dataChecksum as checksum (dataSub1, dataSub2)
Alter table table_right add dataChecksum as checksum (dataSub1, dataSub2)
Go
Create nonclustered index ix_table_left_cs on table_left (dataChecksum)
Create nonclustered index table_right_cs on table_right (dataChecksum)
Go
Set nocount on
Go
Insert into table_right (groupId, dataSub1, dataSub2)
Select '# 1', 'data1', 'data7' union all
Select '# 1', 'data2', 'data8' union all
Select '# 1', 'data3', 'data9' union all
Select '# 2', 'data55', 'data4' union all
Select '# 2', 'data55', 'data5'
Insert into table_left (groupId, dataSub1, dataSub2)
Select '# 11', 'data1', 'data7' union all
Select '# 11', 'data2', 'data8' union all
Select '# 11', 'data3', 'data9' union all
Select '# 22', 'data55', 'data0' union all
Select '# 22', 'data57', 'data2' union all
Select '# 33', 'data99', 'data4' union all
Select '# 33', 'data99', 'data6'
Go
-- Select
Select distinct a. groupId
From table_left
Inner join table_right B on B. dataChecksum = a. dataChecksum
And B. dataSub1 = a. dataSub1
And B. dataSub2 = a. dataSub2
Where not exists (select x. dataSub1, x. dataSub2 from table_left x where x. groupId =. groupId limit t select y. dataSub1, y. dataSub2 from table_right y where y. groupId = B. groupId)
And not exists (select x. dataSub1, x. dataSub2 from table_right x where x. groupId = B. groupId limit t select y. dataSub1, y. dataSub2 from table_left y where y. groupId =. groupId)

Summary

There may be other or better solutions to this problem. in addition, the actual production environment may encounter different situations. In any case, you need to analyze and experiment more to find the optimal solution.

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.