Showing posts with label Delete Duplicate Data. Show all posts
Showing posts with label Delete Duplicate Data. Show all posts

Wednesday, May 12, 2010

Deleting Duplicate Data from Table - Part 2

In my earlier post (Deleting Duplicate Data from Table - Part 1 ), we saw there could be different types for row being duplicated.  Just to keep the blog smaller I posted the queries to deal with one type of row duplication – Entire row being duplicated.

Second type of duplication is different than the first one, here only key columns being duplicated where other columns could have different values. If you want to remove the duplicate data in such scenario, that means you want to keep one non key column value for one row and loose the other one. There could be different logic to decide which one to keep and which one to delete…

NOTE: Here I have taken a decision to keep the first row and delete the latest entries and have written the queries accordingly. If your business logic need you to keep the latest one and delete the older ones you must change the queries accordingly. (scenario selected here is deals with US state data, where duplicate entries could be a negligence of the operator or importing same data multiple times… in scenario of Employee or Client data, you may keep the latest records as it could have updated address/contact number etc…)

Script to populate create and populate USStates table with sample data… (ID column is made as Identity) considered Abbreviation column as Key Column.

CREATE TABLE USStates (ID INT IDENTITY(1,1) PRIMARY KEY,

Abbreviation VARCHAR(5),

StateName VARCHAR(100))

GO

INSERT INTO USStates (Abbreviation,StateName)

VALUES ('AL', 'ALABAMA'),

('AK', 'ALASKA'),

('AS', 'AMERICAN SAMOA'),

('AZ', 'ARIZONA '),

('AR', 'ARKANSAS'),

('CA', 'CALIFORNIA '),

('CO', 'COLORADO '),

('CT', 'CONNECTICUT'),

('DE', 'DELAWARE'),

('DC', 'DISTRICT OF COLUMBIA'),

('FM', 'FEDERATED STATES OF MICRONESIA'),

('FL', 'FLORIDA'),

('GA', 'GEORGIA'),

('GU', 'GUAM '),

('HI', 'HAWAII'),

('ID', 'IDAHO'),

('IL', 'ILLINOIS'),

('IN', 'INDIANA'),

('IA', 'IOWA'),

('KS', 'KANSAS'),

('KY', 'KENTUCKY'),

('LA', 'LOUISIANA'),

('ME', 'MAINE'),

('MH', 'MARSHALL ISLANDS'),

('MD', 'MARYLAND'),

('MA', 'MASSACHUSETTS'),

('MI', 'MICHIGAN'),

('MN', 'MINNESOTA'),

('MS', 'MISSISSIPPI'),

('MO', 'MISSOURI'),

('MT', 'MONTANA'),

('NE', 'NEBRASKA'),

('NV', 'NEVADA'),

('NH', 'NEW HAMPSHIRE'),

('NJ', 'NEW JERSEY'),

('NM', 'NEW MEXICO'),

('NY', 'NEW YORK'),

('NC', 'NORTH CAROLINA'),

('ND', 'NORTH DAKOTA'),

('MP', 'NORTHERN MARIANA ISLANDS'),

('OH', 'OHIO'),

('OK', 'OKLAHOMA'),

('OR', 'OREGON'),

('PW', 'PALAU'),

('PA', 'PENNSYLVANIA'),

('PR', 'PUERTO RICO'),

('RI', 'RHODE ISLAND'),

('SC', 'SOUTH CAROLINA'),

('SD', 'SOUTH DAKOTA'),

('TN', 'TENNESSEE'),

('TX', 'TEXAS'),

('UT', 'UTAH'),

('VT', 'VERMONT'),

('VI', 'VIRGIN ISLANDS'),

('VA', 'VIRGINIA '),

('WA', 'WASHINGTON'),

('WV', 'WEST VIRGINIA'),

('WI', 'WISCONSIN'),

('WY', 'WYOMING')


GO 3 --DONT MISS THE NUMBER 3, THIS WILL EXECUTE THE BATCH 3 TIMES TO POPULATE DUPLICATE ENTRIES


Following are different approaches to delete duplicate data based on Key columns.

--USING CTE – preferred one

;WITH CTEUSStates AS (SELECT ROW_NUMBER() OVER(PARTITION BY Abbreviation ORDER BY ID ) RN

      FROM USStates)

DELETE CTEUSStates

WHERE RN>1




--USING CO-RELATED SUBQUERY 1

DELETE D

FROM USStates D

WHERE EXISTS (SELECT 1 FROM USStates O WHERE O.Abbreviation = D.Abbreviation AND O.ID < D.ID )




--USING CO-RELATED SUBQUERY 2

DELETE D

FROM USStates D

WHERE ID > (SELECT MIN(ID) FROM USStates O WHERE O.Abbreviation = D.Abbreviation )




--USING SUBQUERY

DELETE      USStates

WHERE ID NOT IN (SELECT MIN(O.ID) FROM USStates O GROUP BY Abbreviation)




NB : When there is referential integrity  and table is being referenced from other table and the ID of possibly duplicate row being referred by reference table, we cannot delete such rows… FK constraint prevent us from deleting such rows. In such case we need to replace the ID of those rows with the ID being retained. In our case I have written below update statements that replaces the ID being deleted with the ID being retained for the key.
In such scenario Update statements should be run prior to delete statements.



Script to create the reference table and populate with sample data


CREATE TABLE USStateProperties (ID INT IDENTITY(1,1), USStateID INT NOT NULL REFERENCES USStates(ID),

      Property VARCHAR(50),

      PropertyValue VARCHAR(100))


GO



INSERT INTO USStateProperties (USStateID,Property,PropertyValue)

VALUES(51, 'Capital', 'Austin'),

(110, 'Area', '268820 SqMiles'),

(169, 'NickName', 'Lone Star State'),

(35, 'Capital', 'Trenton'),

(94, 'Area', '8729 sqml'),

(153, 'HighestPoint', '1,803 ft  (550 m)')




Update statements to replace the ID being deleted with the ID being retained…

--UDPATE 1

UPDATE SP

      SET SP.USStateID = O.ID

FROM USStateProperties SP

INNER JOIN USStates S ON S.ID = SP.USStateID

INNER JOIN (SELECT Abbreviation,

                        MIN(ID) AS ID

                  FROM USStates

                  GROUP BY Abbreviation )O ON O.Abbreviation = S.Abbreviation

WHERE O.ID < S.ID




--UPDATE 2

UPDATE SP

      SET SP.USStateID = (SELECT MIN(O.ID) FROM USStates O WHERE O.Abbreviation = S.Abbreviation)

FROM USStateProperties SP

INNER JOIN USStates S ON S.ID = SP.USStateID

     

There are other ways to delete such type of duplicate data and update the reference table… I have just posted couple of for reference.



Monday, April 26, 2010

Deleting Duplicate Data from Table - Part 1

Recent few days I heard the same question – How to delete duplicate data from table? Question was asked by peoples who are not that familiar to TSQL…but still I found it worthy to put efforts and prepare some samples for the same. As I believe, there is always an option in SQL, you just need to choose the right one.

 

That’s what triggered me to post this topic here…something that we must know. I think few SQL fellows are also not familiar with all these techniques and could help them to decide the right one for them. And some Interviewers must having old traditional ways in their mind and they must be expecting the same answer from the candidate… guys with new versions of SQL, you are equipped with new tools to handle old issues.

 

When you say, duplicate data, there are different type of duplicate values in the table, 1 – Entire Row being duplicate or 2 – Key column duplicates where ID and other non-key columns are different.

You can’t handle both the issues with same strategy.

 

First, let’s deal with duplicate data where Entire table row is being duplicate, this happens usually on heap table.

 

Script to create a table with some dummy data..

CREATE TABLE USStates (ID INT, Abbreviation VARCHAR(5), StateName VARCHAR(100))

GO

 

INSERT INTO USStates (ID,Abbreviation,StateName)

VALUES (1, 'AL', 'ALABAMA'),

(2, 'AK', 'ALASKA'),

(3, 'AS', 'AMERICAN SAMOA'),

(4, 'AZ', 'ARIZONA '),

(5, 'AR', 'ARKANSAS'),

(6, 'CA', 'CALIFORNIA '),

(7, 'CO', 'COLORADO '),

(8, 'CT', 'CONNECTICUT'),

(9, 'DE', 'DELAWARE'),

(10, 'DC', 'DISTRICT OF COLUMBIA'),

(11, 'FM', 'FEDERATED STATES OF MICRONESIA'),

(12, 'FL', 'FLORIDA'),

(13, 'GA', 'GEORGIA'),

(14, 'GU', 'GUAM '),

(15, 'HI', 'HAWAII'),

(16, 'ID', 'IDAHO'),

(17, 'IL', 'ILLINOIS'),

(18, 'IN', 'INDIANA'),

(19, 'IA', 'IOWA'),

(20, 'KS', 'KANSAS'),

(21, 'KY', 'KENTUCKY'),

(22, 'LA', 'LOUISIANA'),

(23, 'ME', 'MAINE'),

(24, 'MH', 'MARSHALL ISLANDS'),

(25, 'MD', 'MARYLAND'),

(26, 'MA', 'MASSACHUSETTS'),

(27, 'MI', 'MICHIGAN'),

(28, 'MN', 'MINNESOTA'),

(29, 'MS', 'MISSISSIPPI'),

(30, 'MO', 'MISSOURI'),

(31, 'MT', 'MONTANA'),

(32, 'NE', 'NEBRASKA'),

(33, 'NV', 'NEVADA'),

(34, 'NH', 'NEW HAMPSHIRE'),

(35, 'NJ', 'NEW JERSEY'),

(36, 'NM', 'NEW MEXICO'),

(37, 'NY', 'NEW YORK'),

(38, 'NC', 'NORTH CAROLINA'),

(39, 'ND', 'NORTH DAKOTA'),

(40, 'MP', 'NORTHERN MARIANA ISLANDS'),

(41, 'OH', 'OHIO'),

(42, 'OK', 'OKLAHOMA'),

(43, 'OR', 'OREGON'),

(44, 'PW', 'PALAU'),

(45, 'PA', 'PENNSYLVANIA'),

(46, 'PR', 'PUERTO RICO'),

(47, 'RI', 'RHODE ISLAND'),

(48, 'SC', 'SOUTH CAROLINA'),

(49, 'SD', 'SOUTH DAKOTA'),

(50, 'TN', 'TENNESSEE'),

(51, 'TX', 'TEXAS'),

(52, 'UT', 'UTAH'),

(53, 'VT', 'VERMONT'),

(54, 'VI', 'VIRGIN ISLANDS'),

(55, 'VA', 'VIRGINIA '),

(56, 'WA', 'WASHINGTON'),

(57, 'WV', 'WEST VIRGINIA'),

(58, 'WI', 'WISCONSIN'),

(59, 'WY', 'WYOMING')

 

GO 3 --DONT MISS THE NUMBER 3, THIS WILL EXECUTE THE BATCH 3 TIMES TO POPULATE DUPLICATE ENTRIES

 

--VERIFY DUPLICATE ROWS

SELECT * FROM USStates

 

Following are the different approaches to delete duplicate data from table USStates..

Ø  Using CTE (Common Table Expression) –Personally this one is the BEST approach.

SET NOCOUNT ON

      ;WITH cteUSStates AS

            (SELECT ROW_NUMBER() OVER(PARTITION BY ID,Abbreviation,StateName ORDER BY ID) AS RN,

                  ID,

                  Abbreviation,

                  StateName

            FROM USStates )

      DELETE

      FROM cteUSStates

      WHERE RN > 1

 

Ø  Using Identity Column and Hash Values

ALTER TABLE USStates

            ADD RowID INT IDENTITY(1,1)

 

SET NOCOUNT ON

 

DELETE S

FROM USStates S

INNER JOIN (

            SELECT MIN(RowID) RowID, CHECKSUM(O.ID,O.Abbreviation,O.StateName) AS RCheckSum

            FROM USStates O

            GROUP BY CHECKSUM(O.ID,O.Abbreviation,O.StateName)

            ) X ON CHECKSUM(S.ID,S.Abbreviation,S.StateName) = x.RCheckSum

                  AND S.RowID > X.RowID

                 

ALTER TABLE USStates

            DROP COLUMN RowID

     

GO

 

Ø  Using Identity column and group by clause

      ALTER TABLE USStates

            ADD RowID INT IDENTITY(1,1)

      SET NOCOUNT ON

 

      DELETE S

      FROM USStates S

      INNER JOIN (

            SELECT MIN(RowID) RowID, O.ID,O.Abbreviation,O.StateName

            FROM USStates O

            GROUP BY O.ID,O.Abbreviation,O.StateName

            ) X ON S.ID = X.ID AND S.Abbreviation = X.Abbreviation AND S.StateName = X.StateName

                  AND S.RowID > X.RowID

                 

      ALTER TABLE USStates

            DROP COLUMN RowID

           

      GO

 

Ø  Using temporary table

            SET NOCOUNT ON

 

            SELECT DISTINCT *

            INTO #USStates

            FROM USStates

 

            DELETE

            FROM USStates

 

            INSERT INTO USStates

            SELECT * FROM #USStates

 

            DROP TABLE #USStates

 

GO

 

Ø  Using Cursor with Set RowCount / Top Operators

      SET NOCOUNT ON

      DECLARE curState CURSOR LOCAL FAST_FORWARD READ_ONLY FOR

            SELECT CHECKSUM(*), COUNT(*) -1

            FROM USStates

            GROUP BY CHECKSUM(*)

            HAVING COUNT(*) > 1

 

      DECLARE @CheckSum INT, @Count INT

      OPEN curState

 

      FETCH NEXT FROM curState INTO @CheckSum, @Count

      WHILE @@FETCH_STATUS = 0

      BEGIN

            --COMMENT BELOW STATEMENT IF USING SET ROWCOUNT OPTION

            DELETE TOP (@Count)

            FROM USStates

            WHERE CHECKSUM(*) = @CheckSum

 

      /*--UNCOMMENT THIS BLOCK IF USING SET ROWCOUNT OPTION

            SET ROWCOUNT @Count

 

            DELETE

            FROM USStates

            WHERE CHECKSUM(*) = @CheckSum

            */

 

      FETCH NEXT FROM curState INTO @CheckSum, @Count

      END

            SET ROWCOUNT 0

      CLOSE curState

      DeAllocate CurState

 

      GO

 

Just to keep this post smaller, I will post the sample for other possibility of duplicate data in next post soon.

 

Comments are always welcome.