08. Delete duplicate and massive data


Duplicate data comes in two forms: completely duplicated records, in which the values of all fields are identical, and partially duplicated records, in which only some field values are repeated.

1. Deleting completely duplicated records
Completely duplicated data usually results from the absence of a primary key or unique key constraint.
Test data:

if OBJECT_ID('duplicate_all') is not null
    drop table duplicate_all
GO
create table duplicate_all
(
    c1 int,
    c2 int,
    c3 varchar(100)
)
GO
insert into duplicate_all
select 1,100,'aaa' union all
select 1,100,'aaa' union all
select 1,100,'aaa' union all
select 1,100,'aaa' union all
select 1,100,'aaa' union all
select 2,200,'bbb' union all
select 3,300,'ccc' union all
select 4,400,'ddd' union all
select 5,500,'eee'
GO

 

(1) Use a temporary table

You can use DISTINCT to extract the unique records into a temporary table, delete the affected rows from the source table, and then insert the deduplicated records back.

if OBJECT_ID('tempdb..#tmp') is not null
    drop table #tmp
GO
select distinct * into #tmp
from duplicate_all
where c1 = 1
GO
delete duplicate_all where c1 = 1
GO
insert into duplicate_all select * from #tmp

If the table is not large, you can export all records into a temporary table at once, truncate the source table, and then insert the DISTINCT records back; this avoids the logging overhead of the DELETE.
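A minimal sketch of this truncate-based variant against the duplicate_all test table above (it assumes the whole table can be deduplicated in one pass):

```sql
-- Deduplicate the whole table: copy out DISTINCT rows, truncate, copy back.
if OBJECT_ID('tempdb..#tmp') is not null
    drop table #tmp
GO
select distinct * into #tmp from duplicate_all
GO
-- TRUNCATE deallocates pages with minimal logging, unlike DELETE
truncate table duplicate_all
GO
insert into duplicate_all select * from #tmp
GO
drop table #tmp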

 

(2) Use row_number

with tmp as
(
    select *,
        ROW_NUMBER() OVER(PARTITION BY c1,c2,c3 ORDER BY (getdate())) as num
    from duplicate_all
    where c1 = 1
)
delete tmp where num > 1

If completely duplicated rows are spread across multiple tables, you can use UNION (without ALL) to combine the tables into a new table with the same structure; SQL Server removes the duplicate rows between the tables automatically.
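For example, a sketch with a hypothetical second table duplicate_all2 of the same structure; the plain UNION removes rows that appear in both tables:

```sql
-- Hypothetical second table with the same structure as duplicate_all
if OBJECT_ID('duplicate_all2') is not null
    drop table duplicate_all2
GO
create table duplicate_all2(c1 int, c2 int, c3 varchar(100))
GO
insert into duplicate_all2
select 1,100,'aaa' union all   -- duplicates a row in duplicate_all
select 6,600,'fff'
GO
-- UNION (without ALL) eliminates duplicates across both inputs
select c1, c2, c3 into merged
from
(
    select c1, c2, c3 from duplicate_all
    union
    select c1, c2, c3 from duplicate_all2
) t
GO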

 

2. Deleting partially duplicated records
When only some columns contain duplicate data, the table usually does have a primary key or unique key constraint, and the duplicate values in the non-key columns are typically introduced by application logic.
Test data:

if OBJECT_ID('duplicate_col') is not null
    drop table duplicate_col
GO
create table duplicate_col
(
    c1 int primary key,
    c2 int,
    c3 varchar(100)
)
GO
insert into duplicate_col
select 1,100,'aaa' union all
select 2,100,'aaa' union all
select 3,100,'aaa' union all
select 4,100,'aaa' union all
select 5,500,'eee'
GO

 

(1) Unique index

A unique index can be created with the IGNORE_DUP_KEY option, which silently discards duplicate rows on insert instead of raising an error. The same option is available when creating a primary key or unique key constraint.

if OBJECT_ID('tmp') is not null
    drop table tmp
GO
create table tmp
(
    c1 int,
    c2 int,
    c3 varchar(100),
    constraint UQ_01 unique(c2,c3) with(IGNORE_DUP_KEY = ON)
)
GO
insert into tmp select * from duplicate_col
select * from tmp

 

(2) Use the primary key/unique key to delete
Usually the row with the maximum or minimum primary key/unique key value is retained and the other rows in the group are deleted. The following keeps only the row with the smallest c1 in each (c2, c3) group:

delete from duplicate_col
where exists
(
    select 1 from duplicate_col b
    where duplicate_col.c1 > b.c1
      and duplicate_col.c2 = b.c2
      and duplicate_col.c3 = b.c3
)
-- or
delete from duplicate_col
where c1 not in (select min(c1) from duplicate_col group by c2, c3)

If you want to retain the Nth row of each group instead, see 05. obtain certain rows of data in the SQL group.

 

(3) ROW_NUMBER
The statement is essentially the same as the one for deleting completely duplicated records, except that the PARTITION BY clause covers only the duplicated columns.

with tmp as
(
    select *,
        ROW_NUMBER() OVER(PARTITION BY c2,c3 ORDER BY (getdate())) as num
    from duplicate_col
)
delete tmp where num > 1
select * from duplicate_col

 

3. Deleting massive data
When deleting massive amounts of data, the key factor for performance is reducing the amount of logging.

1. Deleting a full table
There are three ways to empty a table: DROP, TRUNCATE, and DELETE.
(1) DROP/TRUNCATE
DROP and TRUNCATE are DDL operations and generate very little log: only the deallocation of data pages is recorded, not the contents of each page. All data pages are released and the corresponding flags in the IAM, PFS, GAM, and SGAM pages are reset; the released pages can then be reused by other tables.

The difference is that DROP also removes the table definition from the system catalog, so all objects defined on the table (indexes, constraints, triggers, and so on) are deleted as well, and the allocation pages related to the table are released outright rather than just having their flags reset.

(2) DELETE
A DELETE without a WHERE clause can also remove all rows from a table, but every deleted row is fully logged, so it is inefficient for emptying a table and is not recommended.

 

2. Deleting part of a table
When deleting part of the data in a table: if the table is partitioned, it is best to truncate the affected partitions directly; even a DELETE restricted to part of a partition performs reasonably well.
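On SQL Server 2016 and later, individual partitions can be truncated directly with the WITH (PARTITIONS ...) clause; sales_part below is a hypothetical partitioned table:

```sql
-- Truncate a single partition of a hypothetical partitioned table (SQL Server 2016+)
TRUNCATE TABLE sales_part WITH (PARTITIONS (2));
-- A range of partitions can also be given
TRUNCATE TABLE sales_part WITH (PARTITIONS (2 TO 4));
```

On earlier versions, the usual workaround is to switch the partition out to a staging table and truncate that table.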

If the table is not partitioned:

(1) Deleting a small amount of data
Use DELETE directly.

(2) Deleting most of the data
Export the few records to be retained into a temporary table, truncate the original table, and then insert the retained data back.
Example:

SELECT * INTO tmp FROM TAB_NAME WHERE DATE_COL >= GETDATE() - 1
TRUNCATE TABLE TAB_NAME
INSERT INTO TAB_NAME SELECT * FROM tmp
DROP TABLE tmp

If you do not want to copy the data back, you can instead drop the original table TAB_NAME and rename tmp to the original table name; just remember to recreate the original table's objects (indexes, constraints, triggers, and so on) on tmp.

EXEC sp_rename 'TAB_NAME', 'TAB_NAME_OLD'
EXEC sp_rename 'tmp', 'TAB_NAME'
-- create index/constraint/trigger ... on the new TAB_NAME
DROP TABLE TAB_NAME_OLD

(3) Deleting about half of the data

In this case, if the table is not partitioned, there is no clearly better option: the operation will be noticeably slow, and you can only delete the rows with DELETE.
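When DELETE is the only option, deleting in batches keeps each transaction small and gives the log a chance to truncate (or be backed up) between batches. A sketch, where TAB_NAME, DATE_COL, and the batch size of 10000 are placeholders:

```sql
-- Batched delete: repeat until no more rows match the condition
declare @rows int = 1;
while @rows > 0
begin
    delete top (10000) from TAB_NAME
    where DATE_COL < GETDATE() - 30;
    set @rows = @@ROWCOUNT;   -- 0 when nothing was deleted
end
```

Each iteration is its own transaction, so the log never has to hold the entire deletion at once.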

Incidentally, Oracle's NOLOGGING option is similar to SQL Server's BULK_LOGGED recovery model: it takes effect only for bulk operations such as SELECT ... INTO (CREATE TABLE AS SELECT in Oracle), CREATE/ALTER INDEX, and so on. It is not in effect for ordinary row-by-row operations.
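In SQL Server, the recovery model is switched at the database level; MyDB below is a hypothetical database name, and a log backup afterwards is advisable because BULK_LOGGED breaks point-in-time recovery for the interval:

```sql
-- Enable minimal logging for bulk operations, then restore full logging
ALTER DATABASE MyDB SET RECOVERY BULK_LOGGED;
-- ... run SELECT ... INTO / CREATE INDEX / BULK INSERT here ...
ALTER DATABASE MyDB SET RECOVERY FULL;
```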

Summary:

(1) When deleting massive data, choose the method that generates the least log;
(2) The NOLOGGING option / BULK_LOGGED recovery model reduces logging only for bulk import or bulk modification operations, not for DELETE.

 
