We have a process in my place of work where some data is placed into a staging table. a batch job runs and takes this data (tracking numbers, delivery times etc) and attempts to put it into another table, but the job often fails because for some reason there's duplicate data, ie often the delivery guy presses the enter button twice, so it has two records for one delivery.
In the staging table, we have about 8 columns, but the 3 important ones we look for are the tracking number, action code and action date. If there are 2 or more instances of the same tracking number, action code and action date, the job fails due to duplicates.
Tracking Number Action Code action date
12345 C 14052012 1106
12345 C 14052012 1106
So the process we have at the moment is to run this code against the staging table -
SELECT COUNT(*) AS Expr1, TrackingNumber + ActionCode + ActionDate AS Expr2
GROUP BY TrackingNumber + ActionCode + ActionDate
ORDER BY COUNT(*) DESC
This will then highlight the duplicates, like below -
So the duplicates here are highlighted as the top 2. The way we remove these duplicates is a bit complicated. When the job tries to move the data from the staging table to another table, when it fails it outputs a .txt file with all the data. We take the tracking numbers, and one by one do a 'find' in the .txt file, look for each tracking number and delete one of the duplicate rows.
What i'd like to be able to do, is to just run some SQL code that would identify the duplicates, which we're already doing, but then for the next step it would delete the duplicate data without us having to go and search for 80 duplicate tracking numbers in a file with over 1000 rows!
with CTE as
ROW_NUMBER() OVER(PARTITION BY TrackingNumber, ActionCode, ActionDate
ORDER BY stgID) as RowNo
INSERT INTO myProductionTaable (....)
SELECT ... FROM CTE where RowNo = 1
Add an Identity column to your staging table.
Then run this code to remove the duplicates, keeping only the latest entry:
from stagingtable as A
inner join stagingtable as B
on A.TrackingNumber = B.TrackingNumber
and A.ActionCode = B.ActionCode
and A.ActionDate = B.ActionDate
and A.IdentityValue < B.IdentityValue
If it's not practically useful, then it's practically useless.