If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > Database Server Software > MySQL > Need help with complex query

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 11-29-11, 15:03
Bram Van Daele Bram Van Daele is offline
Registered User
 
Join Date: Dec 2007
Posts: 6
Lightbulb Need help with complex query

Hi All

I need some help with a complex query (or maybe it's not that complex for the experts here).

I have a table which contains a large amount of data. Among them a lot of duplicates. below tablestructure :


ID BIRTHDAY NAME EMAIL
1 12/12/1980 John john@example.com
2 John john@example.com
3 11/11/1970 Lisa lisa@example.com


Now, as you can see , the record john@example.com is a duplicate in the table. However, the first record with john@example.com contains more data then the second one (birthday field is completed in the first record).

I would like to deduplicate my table but retain the record with the most data in it. So in this case the second record (ID 2) should be deleted since it contains less data then ID 1

Is this possible?
And if so , how would you proceed ?


thank you in advance,
Reply With Quote
  #2 (permalink)  
Old 11-30-11, 00:13
baburajv baburajv is offline
Registered User
 
Join Date: Feb 2004
Location: Bangalore, India
Posts: 242
what is the criteria of more data/less data? are there any mandatory attributes that help define this more data criteria? like name/ DOB ?

-- query to find duplicate records
Select EMail from <tablename> Group by Email having count(1) >1

--query to find the duplicate records with more data ( incase, records with name and dob)
Select T.ID, T.Name, T.DOB, T.Email
From <tablename> T
Inner join (Select EMail from <tablename> Group by Email having count(1) >1) Dup
On Dup.Email = T.Email
And T.Name is not null and T.DOB is not null


try this one
__________________
Cheers....

baburajv
Reply With Quote
  #3 (permalink)  
Old 11-30-11, 02:12
Bram Van Daele Bram Van Daele is offline
Registered User
 
Join Date: Dec 2007
Posts: 6
Dear baburajv

Thank you for your reply. I had tried very similar to the second solution but unfortunately it did not work out exactly the way I want it. Sometimes it's DOB that contains data, sometimes it's another field , and I need to merge all data of all duplicates into 1 new record.

I had that query running for about 4 hours but it still was busy. (over 250 million rows was just too much too handle).

I have now used a commercial tool called DataQualityTools to do exactly what I need and it works out great (and lightning fast).

Thank you very much again for your efforts !
Reply With Quote
Reply

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On