  1. #1
    Join Date
    Dec 2007
    Posts
    6

    Need help with complex query

    Hi All

    I need some help with a complex query (or maybe it's not that complex for the experts here).

    I have a table which contains a large amount of data, among it a lot of duplicates. Below is the table structure:


    ID  BIRTHDAY    NAME  EMAIL
    1   12/12/1980  John  john@example.com
    2               John  john@example.com
    3   11/11/1970  Lisa  lisa@example.com


    Now, as you can see, the record for john@example.com is a duplicate in the table. However, the first record with john@example.com contains more data than the second one (the birthday field is filled in in the first record).

    I would like to deduplicate my table but retain the record with the most data in it. So in this case the second record (ID 2) should be deleted, since it contains less data than ID 1.

    Is this possible?
    And if so, how would you proceed?


    Thank you in advance,

  2. #2
    Join Date
    Feb 2004
    Location
    Bangalore, India
    Posts
    250
    What is the criterion for more data/less data? Are there any mandatory attributes that help define this "more data" criterion, like Name or DOB?

    -- Query to find the e-mail addresses that occur more than once
    SELECT Email
    FROM <tablename>
    GROUP BY Email
    HAVING COUNT(*) > 1;

    -- Query to find, among the duplicates, the records with more data
    -- (in this case: records where both Name and DOB are filled in)
    SELECT T.ID, T.Name, T.DOB, T.Email
    FROM <tablename> T
    INNER JOIN (SELECT Email
                FROM <tablename>
                GROUP BY Email
                HAVING COUNT(*) > 1) Dup
        ON Dup.Email = T.Email
    WHERE T.Name IS NOT NULL
      AND T.DOB IS NOT NULL;
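
    (Not part of the original reply: a sketch of what the actual cleanup step could look like, under the assumption that "more data" simply means more non-null fields. The table name Contacts is hypothetical; exact DELETE syntax varies by engine, and MySQL in particular requires wrapping the subquery in a derived table before it will delete from the same table it selects from.)

    -- Hypothetical cleanup sketch: delete every row that shares its Email
    -- with a row that has strictly more non-null fields. Rows that tie on
    -- completeness are all kept and would need a tie-breaker such as ID.
    DELETE FROM Contacts
    WHERE ID IN (
        SELECT T.ID
        FROM Contacts T
        INNER JOIN Contacts K
            ON K.Email = T.Email
           AND K.ID <> T.ID
        WHERE (CASE WHEN K.DOB  IS NOT NULL THEN 1 ELSE 0 END +
               CASE WHEN K.Name IS NOT NULL THEN 1 ELSE 0 END)
            > (CASE WHEN T.DOB  IS NOT NULL THEN 1 ELSE 0 END +
               CASE WHEN T.Name IS NOT NULL THEN 1 ELSE 0 END)
    );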


    Try this one.
    Cheers....

    baburajv

  3. #3
    Join Date
    Dec 2007
    Posts
    6
    Dear baburajv

    Thank you for your reply. I had tried something very similar to the second solution, but unfortunately it did not work out exactly the way I want it. Sometimes it's DOB that contains data, sometimes it's another field, and I need to merge all the data of all duplicates into one new record.
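
    (An illustrative sketch, not from the thread: one generic way to merge all duplicates into a single new record per e-mail is to group by Email and aggregate each remaining column. MAX ignores NULLs, so every column picks up a filled-in value whenever any duplicate has one; if duplicates carry different non-null values, it simply keeps the largest. Table and column names follow the example above and are assumptions.)

    -- Sketch: collapse duplicates into one merged row per e-mail address.
    SELECT Email,
           MAX(Birthday) AS Birthday,  -- NULLs are skipped, filled values win
           MAX(Name)     AS Name
    FROM Contacts
    GROUP BY Email;

    On a table of this size, an index on Email (or enough memory for a hash aggregate) matters far more for runtime than the exact shape of the query.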

    I had that query running for about 4 hours and it was still busy (over 250 million rows was just too much to handle).

    I have now used a commercial tool called DataQualityTools to do exactly what I need and it works out great (and lightning fast).

    Thank you very much again for your efforts !
