Now, as you can see , the record email@example.com is a duplicate in the table. However, the first record with firstname.lastname@example.org contains more data then the second one (birthday field is completed in the first record).
I would like to deduplicate my table but retain the record with the most data in it. So in this case the second record (ID 2) should be deleted since it contains less data then ID 1
Is this possible?
And if so , how would you proceed ?
what is the criteria of more data/less data? are there any mandatory attributes that help define this more data criteria? like name/ DOB ?
-- query to find duplicate records
Select EMail from <tablename> Group by Email having count(1) >1
--query to find the duplicate records with more data ( incase, records with name and dob)
Select T.ID, T.Name, T.DOB, T.Email
From <tablename> T
Inner join (Select EMail from <tablename> Group by Email having count(1) >1) Dup
On Dup.Email = T.Email
And T.Name is not null and T.DOB is not null
Thank you for your reply. I had tried very similar to the second solution but unfortunately it did not work out exactly the way I want it. Sometimes it's DOB that contains data, sometimes it's another field , and I need to merge all data of all duplicates into 1 new record.
I had that query running for about 4 hours but it still was busy. (over 250 million rows was just too much too handle).
I have now used a commercial tool called DataQualityTools to do exactly what I need and it works out great (and lightning fast).