If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > Database Server Software > MySQL > Query help with a duplicates mess

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 03-29-10, 16:12
zoonix zoonix is offline
Registered User
 
Join Date: Mar 2010
Location: Tampa, FL
Posts: 5
Query help with a duplicates mess

I have data from two sources that has created a mess of duplicates. I've been trying to figure out a way to remove the duplicates, but I think it's just beyond my skill level. Here's a sample...

SMITH JOHN 5432112 WHOVILLE
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN 5522456 WHOVILLE
SMITH JOHN 5522456 WHOVILLE
SMITH JOHN WHOVILLE
SMITH JOHN WATSON

Is there a way to get this to remove the duplicates? I'm hoping it's possible.

Thanks,
Randy
Reply With Quote
  #2 (permalink)  
Old 03-29-10, 22:29
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,525
Quote:
Originally Posted by zoonix View Post
I'm hoping it's possible.
of course it's possible

but don't expect us to define the rules for what to keep and what to remove -- that's up to you

if you want help with the SQL, you're going to have to dump the table for us, i.e. generate the CREATE TABLE and INSERT statements, with sufficient rows to illustrate the problems

what you posted doesn't make a whole lot of sense, there aren't even any column names
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
  #3 (permalink)  
Old 03-30-10, 07:31
zoonix zoonix is offline
Registered User
 
Join Date: Mar 2010
Location: Tampa, FL
Posts: 5
This thing is such a mess I'm not even sure how to describe the problem. So no I don't expect anything, I just hope someone can point in the right direction.

Sorry, I thought the column names were fairly self explanatory...
lastname, firstname, num1, num2, city (i left out the other columns for the sake of brevity).

The problem is that I am importing the data from two sources each of which have duplicates. So what I tried to illustrate is the one person might have
duplicates with the first id#,
dupes with the second id#,
dupes with both id's and
dupes with neither id

I've been trying to work it out by using temporary tables and multiple select distinct queries, but have thus far failed miserably.

Any ideas???
Reply With Quote
  #4 (permalink)  
Old 03-30-10, 08:11
mike_bike_kite mike_bike_kite is offline
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
Quote:
Any ideas???
Your explanation is still quite poor I'm afraid. Do you want any form of duplicate removed? What would you want to keep from the data you supplied? Which fields should be unique in your table? The real data would be much more useful as the current extract just looks like poorly defined homework.
__________________
Mike
Reply With Quote
  #5 (permalink)  
Old 03-30-10, 08:45
zoonix zoonix is offline
Registered User
 
Join Date: Mar 2010
Location: Tampa, FL
Posts: 5
Sorry, I can't post the real data, it's sensitive. But you are looking at a sample of what I have to working with. I was given two csv files generated from separate sources and asked to combine them and create a unique list. The list are +200K lines each so I thought it best to import them to a database and try to query out the duplicates.

If I had a unique field it would be simple, but the all the possible uniques have potential duplicates. So say you had this little table...

first last pid cid city
SMITH JOHN ........... 5432112 WHOVILLE
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN 5522456 ........... WHOVILLE
SMITH JOHN 5522456 ........... WHOVILLE
SMITH JOHN ........... ........... WHOVILLE
SMITH JOHN ........... ........... WATSON

How would you write a query to return this...
first last pid cid city
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN ........... ........... WATSON

I hope I'm explaining it better. I'm having a hard time spitting out!

Thanks,
Randy

Last edited by zoonix; 03-30-10 at 08:56.
Reply With Quote
  #6 (permalink)  
Old 03-30-10, 08:47
healdem healdem is offline
Jaded Developer
 
Join Date: Nov 2004
Location: out on a limb
Posts: 9,250
I think your real issue at present is trying to identify which row is 'the' row, ie the one you want to keep, and which row(s) are the ones you want to drop

First off you need to scope the problem, how many rows have you got with such issues
running various queries using a join or the count + group by function should give you a handle on that.

after that you've got to find out how you got duplicates in the first place and stop that happening again (sounds like you knwo of a way of making a row unique but its not been implemented.

bear in mind that some duplicates are down to data entry errors can can be tricky for a SQL statement to identify unless you get involved with trickery such as double phenome or soundex

things like this may help;
Detecting outliers in US census data Data Applied: Beyond Reporting
__________________
I'd rather be riding my Versys or my Tiger 800 let alone the Norton
Reply With Quote
  #7 (permalink)  
Old 03-30-10, 08:54
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,525
Quote:
Originally Posted by zoonix View Post
So say you had this little table...

first last pid cid city
SMITH JOHN 5432112 WHOVILLE
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN 5522456 WHOVILLE
SMITH JOHN 5522456 WHOVILLE
SMITH JOHN WHOVILLE
SMITH JOHN WATSON
no wonder i couldn't follow what you were doing

the first time you posted, the columns were ~not~ obvious, because you didn't mention any

now you've changed them again

and to compound the confusion, you didn't format the fake data into columns, like this --
Code:
first last pid     cid     city    
SMITH JOHN 5432112         WHOVILLE
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN 5522456         WHOVILLE
SMITH JOHN 5522456         WHOVILLE
SMITH JOHN                 WHOVILLE
SMITH JOHN                 WATSON
there's no way you're going to be able to write a query to reduce all those rows down to this ----
Code:
first last pid     cid     city    
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN                 WATSON
i suppose we could still help you write the queries, but you still haven't made any attempt to explain how to combine this data

for example, you've got 5432112 under pid, and yet your suggested output shows 5432112 only under cid -- how does that work?
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
  #8 (permalink)  
Old 03-30-10, 08:59
zoonix zoonix is offline
Registered User
 
Join Date: Mar 2010
Location: Tampa, FL
Posts: 5
Sorry again. I keep messing up the format of my post. It should look like this

first last pid cid city
SMITH JOHN ........... 5432112 WHOVILLE
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN 5522456 ........... WHOVILLE
SMITH JOHN 5522456 ........... WHOVILLE
SMITH JOHN ........... ........... WHOVILLE
SMITH JOHN ........... ........... WATSON

How would you write a query to return this...
first last pid cid city
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN ........... ........... WATSON

- Randy
Reply With Quote
  #9 (permalink)  
Old 03-30-10, 09:10
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,525
Quote:
Originally Posted by zoonix View Post
How would you write a query to return this...
first last pid cid city
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN ........... ........... WATSON
i guess you didn't see my previous post, in which i formatted your data for you and also remarked that you ~cannot~ write a query to produce those results from that data

and you still haven't made any attempt to state the rules for what to combine

it ~looks~ like you want to consolidate based on pid and cid, but see my remark about the number 5432112
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
  #10 (permalink)  
Old 03-30-10, 09:12
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,525
here, try this and see what you get --
Code:
SELECT first, last, MAX(pid), MAX(cid), city
  FROM daTable
GROUP BY first, last, city
i think it works on your fake data, but of course on the real data it's going to mess things up rather nicely
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
  #11 (permalink)  
Old 03-30-10, 09:18
mike_bike_kite mike_bike_kite is offline
vaguely human
 
Join Date: Jun 2007
Location: London
Posts: 2,519
This is a million miles from perfect but I think it will produce something similar to what you are after. I wouldn't run it on your main copy of the data as we 're still a way of a complete solution yet. At least it starts you off:
Code:
create table temp_table
   select lastname,firstname,city, max( concat(num1,'-',num2) ) as keeper
   from   your_table
   group by lastname, firstname, city;

delete from your_table
where  concat(num1,'-',num2) != (
        select keeper from temp_table t2
        where  t2.lastname = your_table.lastname
               and t2.firstname = your_table.firstname
               and t2.city = your_table.city );

select * from your_table;
The issue above is working out which id's are the ones to keep.
What if 2 people have the same names in same city?
What if one person has 2 sets of ids in city - which do you keep?
__________________
Mike
Reply With Quote
  #12 (permalink)  
Old 03-30-10, 09:44
zoonix zoonix is offline
Registered User
 
Join Date: Mar 2010
Location: Tampa, FL
Posts: 5
Thanks mike_bike_kite, that looks like the road I was going down, but I couldn't quite get it worked out. This should get me over the hump I think. I'm going to work on it for a bit and see if I can make it work. I post what I come up with and see what everyone thinks.

Thanks for your patience as well Rudy, I'm throwing a copy of your book on the company CC in appreciation!
Reply With Quote
  #13 (permalink)  
Old 03-30-10, 10:15
r937 r937 is offline
SQL Consultant
 
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,525
thanks, that was unexpected, considering all the grief i gave you

i hope you will enjoy it, even though it's not at a partricularly advanced level, which is what i think you'll need for this problem
__________________
r937.com | rudy.ca
please visit Simply SQL and buy my book
Reply With Quote
Reply

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On