How to Delete redundant data in a database

How to Delete redundant data in a database _ MySQL

Last Update:2018-04-20 Source: Internet

Author: User

Tags add numbers

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

As database developers, we often face the task of finding and deleting redundant data in the database. if there is a large amount of redundant data in the database (as a percentage of the total ), the data accuracy and reliability will be affected, and the database performance will also be affected. how can this problem be solved? I will discuss the solution to this problem below. as a database developer, we often face the task of finding and deleting redundant data in the database, if the database contains a large amount of redundant data (too much percentage), the accuracy and reliability of the data will be affected, and the database performance will also be affected. how can this problem be solved? Next, I will discuss the solution to this problem. oracle also provides us with a solution, but the solution provided by Oracle is not perfect, and it is very slow to solve the problem of mass data.

How should we delete redundant data?
Here we apply a PL/SQL solution (a custom stored procedure) or an SQL statement solution (using an analytical function RANK () and a nested subquery) to eliminate redundant data and then control the record to be retained

What is redundant data?
Redundant data is a data table. the rows in this table contain some identical values. These values should theoretically be unique (these values can usually determine a record). For example, A collection of social insurance numbers, surnames, and names. we call the data contained in such rows with the same information redundant data. now all database tables have primary key constraints, and the primary key records the unique values in a row of records, from the database perspective, each row is unique, but from the perspective of our users, these records are the same, because they all contain the same key value (First Name + Last Name), even if they have different primary keys
ID Last Name First Name City Phone
------------------------------------------------------
1005 Krieger Jeff San Ramon 9252997100
1012 Krieger Jeff San Ramon 9252997100
1017 Krieger Jeff San Ramon 9252997100

So how does this redundant data appear? There are two common cases: 1. load or merge data from different tables

Input data through graphical user interfaces, and a unique key is generated by the computer as the primary key of this record.

How can we find redundant data? Let's create a table named "Customer" and add redundant data to it. let's see Table 1. as you can see, we have not made any restrictions on this table to prevent redundant data, the following code creates a unique constraint to prevent redundant data generation.
SQL
Listing 1. create a Customer table
We intentionally added redundant data to this table.
Drop table mers cascade constraints;

Create table Customers (
Id integer not null,
LastName VARCHAR2 (15) not null,
FirstName VARCHAR2 (10 ),
Address VARCHAR2 (20 ),
City VARCHAR2 (15 ),
State CHAR (2 ),
Zip VARCHAR2 (10 ),
Phone VARCHAR2 (10 ),
RecDate DATE,
CONSTRAINT Customers_PK
Primary key (ID ))
Tablespace tallydata;

INSERT INTO MERS
VALUES (1000, 'Bradley ', 'Tom', '1970 3rd Str. #12 ',
'San Francisco ', 'CA', '123', '123 ',
TO_DATE ('01-JAN-2000 ', 'DD-MON-YYYY '));

INSERT INTO MERS
VALUES (1001, 'Stone ', 'Tony', '12 Main St. Apt. 3 ',
'Oakland', 'CA', '123', '123 ',
TO_DATE ('12-MAR-2001 ', 'DD-MON-YYYY '));

INSERT INTO MERS
VALUES (1002, 'Chang', 'Jim ', '2017 26th Ave .',
'Seattle ', 'wa', '123', '123 ',
TO_DATE ('15-JUN-2000 ', 'DD-MON-YYYY '));

INSERT INTO MERS
VALUES (1003, 'loene', 'julil', '12 Keith St .',
'Astro Valley ', 'CA', '123', '123 ',
TO_DATE ('22-NOV-2000 ', 'DD-MON-YYYY '));

INSERT INTO MERS
VALUES (1004, 'King', 'Chuck ', '1970 Broadway St .',
'Pleasant Hill ', 'CA', '123', '123 ',
TO_DATE ('14-APR-2001 ', 'DD-MON-YYYY '));

INSERT INTO MERS
VALUES (1005, 'krieger ', 'Jeff', '2017 Mercury Rd .',
'San ramon', 'CA', '123', '123 ',
TO_DATE ('02-FEB-2001 ', 'DD-MON-YYYY '));

INSERT INTO MERS
VALUES (1006, 'Monroe ', 'John', '1970 West Ave .',
'Phoenix ', 'az', '123', '123 ',
TO_DATE ('02-OCT-2000 ', 'DD-MON-YYYY '));

INSERT INTO MERS
VALUES (1007, 'Lord', 'Don', '573 Hill view ',
'Atlanta ', 'Gal', '123', '123 ',
TO_DATE ('12-DEC-2000 ', 'DD-MON-YYYY '));

INSERT INTO MERS
VALUES (1008, 'gride', 'David', '10 Fulton Rd. apt.4 ',
'San Francisco ', 'CA', '123', '123 ',
TO_DATE ('15-DEC-1999 ', 'DD-MON-YYYY '));

INSERT INTO MERS
VALUES (1009, 'Simon ', 'Michael', '573 Hill view ',
'Santa Monica ', 'CA', '123', '123 ',
TO_DATE ('22-MAY-1999 ', 'DD-MON-YYYY '));

INSERT INTO MERS
VALUES (1010, 'Simon ', 'Michael', '573 Hill view ',
'Santa Monica ', 'CA', '123', '123 ',
TO_DATE ('02-OCT-1999 ', 'DD-MON-YYYY '));

INSERT INTO MERS
VALUES (1011, 'Stone ', 'Tony', '12 Main St. Apt. 3 ',
'Oakland', 'CA', '123', '123 ',
TO_DATE ('07-DEC-1999 ', 'DD-MON-YYYY '));

INSERT INTO MERS
VALUES (1012, 'krieger ', 'Jeff', '2017 Mercury Rd .',
'San ramon', 'CA', '123', '123 ',
TO_DATE ('15-JUN-1999 ', 'DD-MON-YYYY '));

INSERT INTO MERS
VALUES (1013, 'Blue ', 'Don', '2017 Saturn Rd .',
'Pleasancon', 'CA', '20160301', '20160301 ',
TO_DATE ('09-SEP-1999 ', 'DD-MON-YYYY '));

INSERT INTO MERS
VALUES (1014, 'Stone ', 'Tony', '12 Main St. Apt. 3 ',
'Oakland', 'CA', '123', '123 ',
TO_DATE ('11-APR-2000 ', 'DD-MON-YYYY '));

INSERT INTO MERS
VALUES (1015, 'Mason ', 'Paul', '53 Orange way ',
'Las Vegas ', 'nv', '123', '123 ',
TO_DATE ('04-JUL-2000 ', 'DD-MON-YYYY '));

INSERT INTO MERS
VALUES (1016, 'Stone ', 'Tony', '12 Main St. Apt. 3 ',
'Oakland', 'CA', '123', '123 ',
TO_DATE ('30-DEC-2000 ', 'DD-MON-YYYY '));

INSERT INTO MERS
VALUES (1017, 'krieger ', 'Jeff', '2017 Mercury Rd .',
'San ramon', 'CA', '123', '123 ',
TO_DATE ('11-JAN-2001 ', 'DD-MON-YYYY '));

INSERT INTO MERS
VALUES (1018, 'bucke', 'Becky ', '1970 Venus St .',
'Salt Lake City', 'ut', '123', '123 ',
TO_DATE ('12-JUN-2001 ', 'DD-MON-YYYY '));

INSERT INTO MERS
VALUES (1019, 'Stone ', 'Tony', '12 Main St. Apt. 3 ',
'Oakland', 'CA', '123', '123 ',
TO_DATE ('03-NOV-2000 ', 'DD-MON-YYYY '));

INSERT INTO MERS
VALUES (1020, 'Hill ', 'Larry', '1970 century St .',
'Santa Rosa ', 'CA', '123', '123 ',
TO_DATE ('24-AUG-2000 ', 'DD-MON-YYYY '));

COMMIT;
The following code adds a unique constraint on the last name and name fields (of course, you can add this constraint when creating a table to prevent redundant data)
ALTER TABLE MERS
Add constraint Customers_LastFirst
UNIQUE (LastName, FirstName );

The redundant keys in the Customer table are the set of LastName and FirstName. we Group and collect statistics on the data containing the redundant keys.

SELECT LastName, FirstName, COUNT (*) FROM MERS
Group by LastName, FirstName
Order by LastName, FirstName;

Listing 2 shows the output of this statement. we can see that the output of three rows is greater than 1, which means that the table contains three sets of redundant data.
Listing 2. identify redundancy

Lastname firstname count (*)
-----------------------------------
Blake Becky 1
Blue Don 1
Bradley Tom 1
Chang Jim 1
Griffin David 1
Hill Larry 1
King Chuck 1
Krieger Jeff 3
Loney Julie 1
Lord Don 1
Mason Paul 1
Monroe John 1
Simon Michael 2
Stone Tony 5

14 rows selected.

We add the Having () statement to the statement to filter out non-redundant data.
SELECT LastName, FirstName, COUNT (*)
FROM MERS
Group by LastName, FirstName
Having count (*)> 1;

SQL
Listing 3. filtering redundancy
The Having () statement is added to filter out non-redundant data.
Lastname firstname count (*)
-----------------------------------
Krieger Jeff 3
Simon Michael 2
Stone Tony 5

3 rows selected.
Listing 3 displays the input of the above code. However, these query results do not show fields that can identify each row, we use the previous statement as a nested query to display the IDs of these records.
Select id, LastName, FirstName
FROM MERS
WHERE (LastName, FirstName) IN
(SELECT LastName, FirstName
FROM MERS
Group by LastName, FirstName
Having count (*)> 1)
Order by LastName, FirstName;

Listing 4 shows the results of the above code. these queries show three groups of redundancy and there are 10 rows in total. we should keep the records 1001, in these groups and then delete the records, 1014, redundant entries.
SQL
Listing 4. find the unique key
Statement output
ID LASTNAME FIRSTNAME
------------------------------
1005 Krieger Jeff
1012 Krieger Jeff
1017 Krieger Jeff
1009 Simon Michael
1010 Simon Michael
Stone Tony 1001
Stone Tony 1011
Stone Tony 1016
Stone Tony 1019
Stone Tony 1014

10 rows selected.

A solution provided by Oracle
Oracle provides us with a solution to delete redundant data. this solution uses a collection function MIN () or MAX () of Oracle to solve this problem. MIN () the function can obtain all the values that should be retained in each group (redundant non-redundant. (as we can see, input redundant values that do not contain those large IDs

Select min (ID) as id, LastName, FirstName
FROM MERS
Group by LastName, FirstName;
Output of this command
Listing 5. Output of MIN () query
This command shows all non-redundant data, and other lines should be deleted.
ID LASTNAME FIRSTNAME
------------------------------
1018 Blake Becky
1013 Blue Don
1000 Bradley Tom
1002 Chang Jim
1008 Griffin David
1020 Hill Larry
1004 King Chuck
1005 Krieger Jeff
1003 Loney Julie
1007 Lord Don
1015 Mason Paul
1006 Monroe John
1009 Simon Michael
Stone Tony 1001

14 rows selected.

In this way, you can delete all rows not in this table and use the previous statement as a subquery to construct a statement.

DELETE FROM MERS
WHERE ID NOT IN
(Select min (ID)
FROM MERS
Group by LastName, FirstName );

Despite this, the theory is feasible, but this solution is not so effective, because in this way, DBMS needs to scan two tables to complete this task. for a large amount of data, this is simply not feasible. in order to test its performance, I created a Customer table with approximately 5000,000 rows and 45,000 redundant rows. the command (9%) and above ran for an hour, no output results, it exhausted my patience, so I killed this process
This scheme has another disadvantage. you cannot control the rows to be retained in each group.

A pl/SQL solution: Delete redundant data using a stored procedure, called a stored procedure of DeleDuplicate. The structure of this process is clear.
SQL
Listing 6. The DeleteDuplicate stored procedure

It selects one of these redundant rows to a cursor, extracts each redundant row from the table, compares it with the row in the cursor, and then determines whether to delete the row.
Create or replace procedure DeleteDuplicates (
PCommitBatchSize in integer: = 5000) IS

CURSOR csr_Duplicates IS
Select id, LastName, FirstName
FROM MERS
WHERE (LastName, FirstName) IN
(SELECT LastName, FirstName
FROM MERS
Group by LastName, FirstName
Having count (*)> 1)
Order by LastName, FirstName;

/* Save the Last surname and name */
VLastName Customers. LastName % TYPE: = NULL;
VFirstName Customers. FirstName % TYPE: = NULL;

VCounter INTEGER: = 0;

BEGIN

FOR vDuplicates IN csr_Duplicates
LOOP
IF vLastName IS NULL OR
(VDuplicates. LastName! = VLastName
Or nvl (vDuplicates. FirstName ,'')! = NVL (vFirstName ,''))
THEN
/* Get the row or a new row for the first time
Save the value of its surname and name */
VLastName: = vDuplicates. LastName;
VFirstName: = vDuplicates. FirstName;

ELSE
/* Remove redundant data */
DELETE
FROM MERS
Where id = vDuplicates. ID;

VCounter: = vCounter 1;

/* Submit the result */
/* Commit every pCommitBatchSize rows */
If mod (vCounter, pCommitBatchSize) = 0
THEN
COMMIT;
End if;

End if;
End loop;

IF vCounter> 0
THEN
COMMIT;
End if;

DBMS_OUTPUT.PUT_LINE (TO_CHAR (vCounter) |
'Duplicates have been deleted .');

EXCEPTION
WHEN OTHERS
THEN
DBMS_OUTPUT.PUT_LINE ('error' |
TO_CHAR (SQLCODE) | ':' | SQLERRM );
ROLLBACK;

END DeleteDuplicates;

It selects redundant data to a cursor and groups the data according to (LastName, FirstName) (in our solution). then it opens the cursor and cyclically retrieves each row, then compare it with the previously retrieved key value. if this is the first time this value is taken out, or this value is not a redundant key, skip this record and take the next one. otherwise, this is the redundant record in this group, so delete it.
Let's run this stored procedure.
BEGIN
DeleteDuplicates;
END;
/

SELECT LastName, FirstName, COUNT (*)
FROM MERS
Group by LastName, FirstName
Having count (*)> 1;

The last query statement has no return value, so the process of retrieving redundant data from the table is completely implemented by the SQL statement defined in the csr_Duplicates cursor, PL/SQL is only used to delete redundant data. Can I use SQL statements completely?

II. SQL solution, using RANK () to delete redundant data
The Oracle8i analysis function RANK () is used to enumerate the elements in each group. in our solution, we apply this solution. We use this function to dynamically arrange redundant data in a continuous manner and add numbers, groups are separated by the Partintion by statement and then grouped by Order by statement.
Select id, LastName, FirstName,
RANK () OVER (partition by LastName,
FirstName order by id) SeqNumber
FROM MERS
Order by LastName, FirstName;

SQL
Listing 7. Output of single SQL statement that uses RANK ()
Display is based on the number of records, especially for redundant data
ID LASTNAME FIRSTNAME SEQNUMBER
----------------------------------------
1018 Blake Becky 1
1013 Blue Don 1
1000 Bradley Tom 1
1002 Chang Jim 1
1008 Griffin David 1
1020 Hill Larry 1
1004 King Chuck 1
1005 Krieger Jeff 1
1012 Krieger Jeff 2
1017 Krieger Jeff 3
1003 Loney Julie 1
1007 Lord Don 1
1015 Mason Paul 1
1006 Monroe John 1
1009 Simon Michael 1
1010 Simon Michael 2
1001 Stone Tony 1
Stone Tony 2 1011
Stone Tony 3 1014
Stone Tony 4 1016
Stone Tony 5 1019

We can see that the value in the SeqNumber column, redundant data is sorted by the ID number from small to large, and the SqlNumber of all redundant data is greater than one, all non-redundant data is equal to one, so we need to delete the data that is useless.
Select id, LastName, FirstName
FROM
(Select id, LastName, FirstName,
RANK () OVER (partition by LastName,
FirstName order by id) AS SeqNumber
From mers)
WHERE SeqNumber> 1;

SQL
Listing 8. key value of the redundant key
Seven rows must be deleted
ID LASTNAME FIRSTNAME
------------------------------
1012 Krieger Jeff
1017 Krieger Jeff
1010 Simon Michael
Stone Tony 1011
Stone Tony 1014
Stone Tony 1016
Stone Tony 1019

7 rows selected.

This shows that there are seven rows to be deleted, or I tested this code using the previous table, it took 77 seconds to delete all the data. the SQL statement is ready to delete the redundant data. Version 1 executes the code for 135 seconds.

DELETE
FROM MERS
WHERE ID IN
(SELECT ID
FROM
(Select id, LastName, FirstName,
RANK () OVER (partition by LastName,
FirstName order by id) AS SeqNumber
From mers)
WHERE SeqNumber> 1 );
We can see that the last two rows of statements are sorted by table data, which is not valid. so we can optimize the last query statement and set Rank () the function is applied to a group that only contains redundant data, rather than all columns.
The following statement is efficient, although it is not as simple as the previous query.
Select id, LastName, FirstName
FROM
(Select id, LastName, FirstName,
RANK () OVER (partition by LastName,
FirstName order by id) AS SeqNumber
FROM
(Select id, LastName, FirstName
FROM MERS
WHERE (LastName, FirstName) IN
(SELECT LastName, FirstName
FROM MERS
Group by LastName, FirstName
Having count (*)> 1 )))
WHERE SeqNumber> 1;
It takes only 26 seconds to select redundant data, which improves the performance by 67%.
The efficiency of deleting a query as a subquery,
DELETE
FROM MERS
WHERE ID IN
(SELECT ID
FROM
(Select id, LastName, FirstName,
RANK () OVER (partition by LastName,
FirstName order by id) AS SeqNumber
FROM
(Select id, LastName, FirstName
FROM MERS
WHERE (LastName, FirstName) IN
(SELECT LastName, FirstName
FROM MERS
Group by LastName, FirstName
Having count (*)> 1 )))
WHERE SeqNumber> 1 );
Now, the above tasks are completed in only 47 seconds, which is a great improvement compared to the last 136 seconds. In contrast, the stored procedure takes 56 seconds, in this way, the stored procedure is somewhat slow. Using PL/SQL statements, we and the above code will get better and more precise code and improve the efficiency of your code execution, although there is no difference between PL/SQL for enumerating data from the database, PL/SQL is much faster than SQL for data comparison, however, if the redundant data volume is small, we try to use SQL instead of PL/SQL.

If your data table does not have a primary key, you can refer to other technologies

Rank () Other methods
Using the Rank () function, you can select the data you keep (or the data with a small ID or a large ID, which is determined by the RECDate column, you can add REcdate to the (Order by) clause, in reverse or forward Order.

This is a solution to retain the largest Id

DELETE
FROM MERS
WHERE ID IN
(SELECT ID
FROM
(Select id, LastName, FirstName,
RANK () OVER (partition by LastName,
FirstName order by RecDate DESC, ID) AS SeqNumber
FROM
(Select id, LastName, FirstName, RecDate
FROM MERS
WHERE (LastName, FirstName) IN
(SELECT LastName, FirstName
FROM MERS
Group by LastName, FirstName
Having count (*)> 1 )))
WHERE SeqNumber> 1 );

This technology ensures that you can control the reserved groups in each table. assume that you have a database, a promotion, or a discount information. for example, a group can use this promotion five times, you can also use this discount for three times. to specify the number of groups to be retained, you can set it in the where and having clauses. then, you will delete all redundant groups greater than the number of groups you set.
DELETE
FROM MERS
WHERE ID IN
(SELECT ID
FROM
(Select id, LastName, FirstName,
RANK () OVER (partition by LastName,
FirstName order by id) AS SeqNumber
FROM
(Select id, LastName, FirstName
FROM MERS
WHERE (LastName, FirstName) IN
(SELECT LastName, FirstName
FROM MERS
Group by LastName, FirstName
Having count (*)> 3 )))
WHERE SeqNumber> 3 );
As you can see, using the RANK () function allows you to eliminate duplicates in
Single SQL statement and gives you more capabilities by extending the power of your
Queries.
As you can see, using Rank () can eliminate redundant data and provide great scalability.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More