How do I... Identify and delete duplicates from SQL Server tables?

Source: Internet
Author: User
Tags microsoft sql server 2005
  • Source: http://blogs.techrepublic.com.com/datacenter? P = 372 & tag = rbxccnbtr1
  • Date: June 3rd, 2008
  • Author: Susan Harkins
  • Category: Microsoft, databases, SQL Server
  • Tags:Microsoft SQL Server, server, column, record, table, databases, storage, enterprise software, software, data management
  • Special reports» see more posts on: how do I

 

Duplicate records in a database are bad. you can't tolerate them. at the very least, they'll produce misleading analysis. at the worst, they'll totally wreck everything. your application will probably run, but everything it generates will be suspect. you have to find them and delete them or your client might just delete you!

Of course, the applications you develop prevent duplicates from the get go-pat yourself on the back. however, your keen insight into the potential problems duplicates pose won't help you if you inherit a legacy application or have to import dirty data. the good news is that finding and deleting duplicate records in Microsoft's SQL Server isn' t a difficult task. however, determining just what constitutes a duplicate can be a bit tricky.

This blog post is also available in the PDF format in a techrepublic download.

The definition of duplicate

Before you start deleting records, you need to decide what the term duplicate means. most of us wowould define any record where the data in one record matches another column per column (including or excluding the primary key) as a duplicate. the discussion becomes more complex as you narrow the focus from all columns to just a few.

In most cases, business rules, and not repetitive values, will determine what repetitive data creates duplicates. no definition is absolute and knowing your data is the key. that means there's no silver bullet. the solution inListingIs generic; you'll have to adjust it to fit your situation.

Listing
SELECT * | DELETE FROM tableWHERE uniquefield IN  (SELECT uniquefield    FROM table   WHERE EXISTS(      SELECT NULL      FROM table AS tmptable      WHERE table.field1 = tmptable.field1         [AND table.field2 = tmptable.field2         [AND ...]]         HAVING table.uniquefield > MIN(tmptable.uniquefield)    )   )

TableLists the variables that you must update.

Table A: update these variables

Variable

Purpose

table
The table from which you're deleting duplicate records.
uniquefield
Any column that uniquely identifies each record. This can be a primary key or identity column.
tmptable
An alias for the table created by a subquery.
field1, field2
The column (s) that comprise the duplicate.

The Select null subquery retrieves records where the duplicate columns inTableAnd an aliased copy of the table match. The having clause further filters those rows so that only those rows whereUniquefieldValue is greater than the minimum value for the other rows are returned.

You can think of the minimum value as the original record, but it doesn' t really matter. the Code uses it as an anchor-it's simply a place to start, regardless of the input order of the matching records. the two subqueries return duplicate records for each original record. run the delete statement to delete those records fromTable. After deleting, only the record with the minimumUniquefieldValue remains inTable.

Run the select first and examine its results before you actually delete records. the and clause is optional if you're comparing a single column. for each additional column you add to the duplicate comparison, you must add an and clause. be sure to update the column references in the WHERE clause accordingly.

A quick example

The table inFigureContains a few duplicate records. Specifically, three records repeat the same zipcode value. Two records repeat the same zipcode and city value.

Figure athis table contains duplicate records.

Running the SELECT statement shown inFigure BAgainst this table returns two records. There's one zipcode value that occurs three times, but the statement returns only those records whereUniqueidentifier-Codeid, in this case-is greater than that column's minimum value.Figure CShows what happens when an and Clause compares both the zipcode and city columns. There's only one record where both values are duplicated.

Figure BThe SELECT statement returns two duplicate records. Figure cexpanding the comparison limits CES the number of duplicate records.

While this approach is simple, you'll want to consider a few things:

  • This statement isn't dynamic, which means you'll have to update the references each time you use it. if you delete duplicates regularly on different tables, you might want to find a different approach. if you run it regularly against the same table, consider converting it to a stored procedure.
  • If you're uncomfortable deleting records, make a copyTableAnd hang on to it for a while. or, don't delete the records at all. instead, move them to an archive table, or flag them as deleted in the original table. either way allows you to filter the records from current ProcessingAndMaintain a historical perspective.
  • This solution works on a single table. If you're deleting records based on values in a related table, you need to consider far more than just duplicate values.
  • IfUniquefieldIs an identity column, you don't have to worry about missing values. The code works just fine. Nor do the records have to be in any specific order.
Reduce duplicates and your work

You'll find this solution for finding duplicate records dependable. It's easy to implement because it requires no temporary tables and cursors. In addition, the simple code can be quickly updated.

Susan sales Harkins is an independent consultant and the author of several articles and books on database technologies. her most recent book is "mastering Microsoft SQL Server 2005 Express," with Mike gunderloy, published by sybex. other collaborations with Mike gunderloy are "automating Microsoft Access 2003 with VBA," "upgrader's Guide to Microsoft Office System 2003," "ICDL exam cram 2, "and" absolute beginner's Guide to Microsoft Access 2003 "all by que. currently, Susan volunteers as the publications ctor for database advisors at http://www.databaseadvisors.com. you can reach her at ssharkins@gmail.com.

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.