Recently I found that several tables in the project database contained a large amount of duplicate data, and the rows were complete duplicates: every field was identical, including the primary-key column (so the tables evidently had no primary-key constraint enforced). The task was to delete the duplicates and keep exactly one copy of each record. After asking a few people and searching online for half a day, I summarized the following methods for deleting duplicate data:
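Before deleting anything, it is worth confirming which rows are duplicated. A minimal check, assuming (as in the scripts below) a table named tableName whose rows are identified by an id column:

```sql
-- List each id that appears more than once, together with its number of copies.
-- tableName and id are placeholders; substitute your actual table and key column.
SELECT id, COUNT(*) AS copies
FROM tableName
GROUP BY id
HAVING COUNT(*) > 1
```

If this query returns no rows, the table is already free of duplicates.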
1. The cursor method, a common choice when the data volume is small and efficiency is not critical, as in the following script:
Code
-- Declare two variables: @id holds the duplicated key, @max the number of copies
DECLARE @max INT, @id INT
-- Declare a local cursor over every id that appears more than once
DECLARE cusCursor CURSOR LOCAL FOR
    SELECT id, COUNT(*) FROM tableName GROUP BY id HAVING COUNT(*) > 1
-- Open the cursor
OPEN cusCursor
-- Read the current row into the variables
FETCH NEXT FROM cusCursor INTO @id, @max
-- @@FETCH_STATUS is 0 while the last fetch succeeded, so loop until the cursor is exhausted
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Delete one fewer row than the number of copies, so that one copy survives
    SELECT @max = @max - 1
    -- Cap the number of rows affected by the next statement
    SET ROWCOUNT @max
    -- Delete the duplicates; exactly one record is retained
    DELETE FROM tableName WHERE id = @id
    -- Advance the cursor to the next duplicated id
    FETCH NEXT FROM cusCursor INTO @id, @max
END
-- Close the cursor
CLOSE cusCursor
-- Release the cursor
DEALLOCATE cusCursor
-- Reset ROWCOUNT to unlimited
SET ROWCOUNT 0
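To see why exactly one copy survives in the loop above: if an id occurs three times, @max becomes 2 and SET ROWCOUNT 2 caps the DELETE at two rows. A standalone sketch of this behavior, using a hypothetical scratch table #demo:

```sql
-- SET ROWCOUNT N limits subsequent statements to affecting at most N rows.
CREATE TABLE #demo (id INT)
INSERT INTO #demo VALUES (1)
INSERT INTO #demo VALUES (1)
INSERT INTO #demo VALUES (1)

SET ROWCOUNT 2                 -- the next DELETE removes at most two rows
DELETE FROM #demo WHERE id = 1
SET ROWCOUNT 0                 -- reset to unlimited

SELECT COUNT(*) FROM #demo     -- one row with id = 1 remains
DROP TABLE #demo
```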
2. The temporary-table method. It deletes duplicate records efficiently but has a limitation, as in the following script:
Code
-- SELECT DISTINCT collapses fully duplicate rows to a single copy; write the result to the temporary table #tempTable
SELECT DISTINCT * INTO #tempTable FROM tableName
-- Empty the source table
TRUNCATE TABLE tableName
-- Write the deduplicated data back into the source table
INSERT INTO tableName SELECT * FROM #tempTable
-- Drop the temporary table
DROP TABLE #tempTable
The limitation of this method is that it fails when the table contains an image or text column: SQL Server reports an error because DISTINCT cannot be applied to those data types.
3. When the second method's limitation applies, the data volume is large, and efficiency also matters, a third method works as follows:
Code
-- Copy only the structure of the source table into the temporary table #tempTable (the WHERE clause matches no rows)
SELECT * INTO #tempTable FROM tableName WHERE 1 <> 1
-- Create a unique index that silently ignores duplicate key values
-- (replace duplicate_column_name with the column that identifies a duplicate row)
CREATE UNIQUE INDEX idx_temp ON #tempTable (duplicate_column_name) WITH IGNORE_DUP_KEY
-- Copy the data; any row whose key duplicates one already inserted is ignored, so one copy of each is kept
INSERT INTO #tempTable SELECT * FROM tableName
-- Empty the source table
TRUNCATE TABLE tableName
-- Write the deduplicated data back into the source table
INSERT INTO tableName SELECT * FROM #tempTable
-- Drop the temporary table
DROP TABLE #tempTable
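The key step is the IGNORE_DUP_KEY option: an insert whose key value already exists in the index is silently skipped instead of raising an error. A small sketch of the effect, again using a hypothetical #demo table:

```sql
-- Rows that would violate the unique index are discarded with a warning, not an error.
CREATE TABLE #demo (id INT)
CREATE UNIQUE INDEX idx_demo_id ON #demo (id) WITH IGNORE_DUP_KEY

INSERT INTO #demo VALUES (1)   -- inserted
INSERT INTO #demo VALUES (1)   -- skipped with the warning "Duplicate key was ignored."
INSERT INTO #demo VALUES (2)   -- inserted

SELECT COUNT(*) FROM #demo     -- two rows remain: id 1 and id 2
DROP TABLE #demo
```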
These are the three methods I have recently summarized for deleting duplicate database records. The third is efficient and widely applicable, and it performed well for me on tables of around ten million rows. One caveat: IGNORE_DUP_KEY deduplicates on the indexed column alone, so if two rows share that key but differ in other fields, only one of them survives; that is safe here because the rows are fully duplicated.