Recent few days I heard the same question – How to delete duplicate data from table? Question was asked by peoples who are not that familiar to TSQL…but still I found it worthy to put efforts and prepare some samples for the same. As I believe, there is always an option in SQL, you just need to choose the right one.
That’s what triggered me to post this topic here…something that we must know. I think few SQL fellows are also not familiar with all these techniques and could help them to decide the right one for them. And some Interviewers must having old traditional ways in their mind and they must be expecting the same answer from the candidate… guys with new versions of SQL, you are equipped with new tools to handle old issues.
When you say, duplicate data, there are different type of duplicate values in the table, 1 – Entire Row being duplicate or 2 – Key column duplicates where ID and other non-key columns are different.
You can’t handle both the issues with same strategy.
First, let’s deal with duplicate data where Entire table row is being duplicate, this happens usually on heap table.
Script to create a table with some dummy data..
CREATE TABLE USStates (ID INT, Abbreviation VARCHAR(5), StateName VARCHAR(100))
GO
INSERT INTO USStates (ID,Abbreviation,StateName)
VALUES (1, 'AL', 'ALABAMA'),
(2, 'AK', 'ALASKA'),
(3, 'AS', 'AMERICAN SAMOA'),
(4, 'AZ', 'ARIZONA '),
(5, 'AR', 'ARKANSAS'),
(6, 'CA', 'CALIFORNIA '),
(7, 'CO', 'COLORADO '),
(8, 'CT', 'CONNECTICUT'),
(9, 'DE', 'DELAWARE'),
(10, 'DC', 'DISTRICT OF COLUMBIA'),
(11, 'FM', 'FEDERATED STATES OF MICRONESIA'),
(12, 'FL', 'FLORIDA'),
(13, 'GA', 'GEORGIA'),
(14, 'GU', 'GUAM '),
(15, 'HI', 'HAWAII'),
(16, 'ID', 'IDAHO'),
(17, 'IL', 'ILLINOIS'),
(18, 'IN', 'INDIANA'),
(19, 'IA', 'IOWA'),
(20, 'KS', 'KANSAS'),
(21, 'KY', 'KENTUCKY'),
(22, 'LA', 'LOUISIANA'),
(23, 'ME', 'MAINE'),
(24, 'MH', 'MARSHALL ISLANDS'),
(25, 'MD', 'MARYLAND'),
(26, 'MA', 'MASSACHUSETTS'),
(27, 'MI', 'MICHIGAN'),
(28, 'MN', 'MINNESOTA'),
(29, 'MS', 'MISSISSIPPI'),
(30, 'MO', 'MISSOURI'),
(31, 'MT', 'MONTANA'),
(32, 'NE', 'NEBRASKA'),
(33, 'NV', 'NEVADA'),
(34, 'NH', 'NEW HAMPSHIRE'),
(35, 'NJ', 'NEW JERSEY'),
(36, 'NM', 'NEW MEXICO'),
(37, 'NY', 'NEW YORK'),
(38, 'NC', 'NORTH CAROLINA'),
(39, 'ND', 'NORTH DAKOTA'),
(40, 'MP', 'NORTHERN MARIANA ISLANDS'),
(41, 'OH', 'OHIO'),
(42, 'OK', 'OKLAHOMA'),
(43, 'OR', 'OREGON'),
(44, 'PW', 'PALAU'),
(45, 'PA', 'PENNSYLVANIA'),
(46, 'PR', 'PUERTO RICO'),
(47, 'RI', 'RHODE ISLAND'),
(48, 'SC', 'SOUTH CAROLINA'),
(49, 'SD', 'SOUTH DAKOTA'),
(50, 'TN', 'TENNESSEE'),
(51, 'TX', 'TEXAS'),
(52, 'UT', 'UTAH'),
(53, 'VT', 'VERMONT'),
(54, 'VI', 'VIRGIN ISLANDS'),
(55, 'VA', 'VIRGINIA '),
(56, 'WA', 'WASHINGTON'),
(57, 'WV', 'WEST VIRGINIA'),
(58, 'WI', 'WISCONSIN'),
(59, 'WY', 'WYOMING')
GO 3 --DONT MISS THE NUMBER 3, THIS WILL EXECUTE THE BATCH 3 TIMES TO POPULATE DUPLICATE ENTRIES
--VERIFY DUPLICATE ROWS
SELECT * FROM USStates
Following are the different approaches to delete duplicate data from table USStates..
Ø Using CTE (Common Table Expression) –Personally this one is the BEST approach.
SET NOCOUNT ON
;WITH cteUSStates AS
(SELECT ROW_NUMBER() OVER(PARTITION BY ID,Abbreviation,StateName ORDER BY ID) AS RN,
ID,
Abbreviation,
StateName
FROM USStates )
DELETE
FROM cteUSStates
WHERE RN > 1
Ø Using Identity Column and Hash Values
ALTER TABLE USStates
ADD RowID INT IDENTITY(1,1)
SET NOCOUNT ON
DELETE S
FROM USStates S
INNER JOIN (
SELECT MIN(RowID) RowID, CHECKSUM(O.ID,O.Abbreviation,O.StateName) AS RCheckSum
FROM USStates O
GROUP BY CHECKSUM(O.ID,O.Abbreviation,O.StateName)
) X ON CHECKSUM(S.ID,S.Abbreviation,S.StateName) = x.RCheckSum
AND S.RowID > X.RowID
ALTER TABLE USStates
DROP COLUMN RowID
GO
Ø Using Identity column and group by clause
ALTER TABLE USStates
ADD RowID INT IDENTITY(1,1)
SET NOCOUNT ON
DELETE S
FROM USStates S
INNER JOIN (
SELECT MIN(RowID) RowID, O.ID,O.Abbreviation,O.StateName
FROM USStates O
GROUP BY O.ID,O.Abbreviation,O.StateName
) X ON S.ID = X.ID AND S.Abbreviation = X.Abbreviation AND S.StateName = X.StateName
AND S.RowID > X.RowID
ALTER TABLE USStates
DROP COLUMN RowID
GO
Ø Using temporary table
SET NOCOUNT ON
SELECT DISTINCT *
INTO #USStates
FROM USStates
DELETE
FROM USStates
INSERT INTO USStates
SELECT * FROM #USStates
DROP TABLE #USStates
GO
Ø Using Cursor with Set RowCount / Top Operators
SET NOCOUNT ON
DECLARE curState CURSOR LOCAL FAST_FORWARD READ_ONLY FOR
SELECT CHECKSUM(*), COUNT(*) -1
FROM USStates
GROUP BY CHECKSUM(*)
HAVING COUNT(*) > 1
DECLARE @CheckSum INT, @Count INT
OPEN curState
FETCH NEXT FROM curState INTO @CheckSum, @Count
WHILE @@FETCH_STATUS = 0
BEGIN
--COMMENT BELOW STATEMENT IF USING SET ROWCOUNT OPTION
DELETE TOP (@Count)
FROM USStates
WHERE CHECKSUM(*) = @CheckSum
/*--UNCOMMENT THIS BLOCK IF USING SET ROWCOUNT OPTION
SET ROWCOUNT @Count
DELETE
FROM USStates
WHERE CHECKSUM(*) = @CheckSum
*/
FETCH NEXT FROM curState INTO @CheckSum, @Count
END
SET ROWCOUNT 0
CLOSE curState
DeAllocate CurState
GO
Just to keep this post smaller, I will post the sample for other possibility of duplicate data in next post soon.
Comments are always welcome.
Hi Sajid,
ReplyDeleteThanks for post...
Got some more differant ways to delete duplicate records...
~Prasad