| |
|
If this is your first visit, be sure to check out the FAQ by clicking the link above.
You may have to register before you can post: click the register link above to proceed.
To start viewing messages, select the forum that you want to visit from the selection below.
|
 |

03-29-10, 16:12
|
|
Registered User
|
|
Join Date: Mar 2010
Location: Tampa, FL
Posts: 5
|
|
|
Query help with a duplicates mess
|
|
I have data from two sources that has created a mess of duplicates. I've been trying to figure out a way to remove the duplicates, but I think it's just beyond my skill level. Here's a sample...
SMITH JOHN 5432112 WHOVILLE
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN 5522456 WHOVILLE
SMITH JOHN 5522456 WHOVILLE
SMITH JOHN WHOVILLE
SMITH JOHN WATSON
Is there a way to get this to remove the duplicates? I'm hoping it's possible.
Thanks,
Randy
|
|

03-29-10, 22:29
|
|
SQL Consultant
|
|
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,525
|
|
Quote:
Originally Posted by zoonix
I'm hoping it's possible.
|
of course it's possible
but don't expect us to define the rules for what to keep and what to remove -- that's up to you
if you want help with the SQL, you're going to have to dump the table for us, i.e. generate the CREATE TABLE and INSERT statements, with sufficient rows to illustrate the problems
what you posted doesn't make a whole lot of sense, there aren't even any column names
|
|

03-30-10, 07:31
|
|
Registered User
|
|
Join Date: Mar 2010
Location: Tampa, FL
Posts: 5
|
|
|
|
This thing is such a mess I'm not even sure how to describe the problem. So no I don't expect anything, I just hope someone can point in the right direction.
Sorry, I thought the column names were fairly self explanatory...
lastname, firstname, num1, num2, city (i left out the other columns for the sake of brevity).
The problem is that I am importing the data from two sources each of which have duplicates. So what I tried to illustrate is the one person might have
duplicates with the first id#,
dupes with the second id#,
dupes with both id's and
dupes with neither id
I've been trying to work it out by using temporary tables and multiple select distinct queries, but have thus far failed miserably.
Any ideas???
|
|

03-30-10, 08:11
|
|
vaguely human
|
|
Join Date: Jun 2007
Location: London
Posts: 2,519
|
|
Your explanation is still quite poor I'm afraid. Do you want any form of duplicate removed? What would you want to keep from the data you supplied? Which fields should be unique in your table? The real data would be much more useful as the current extract just looks like poorly defined homework.
|
|

03-30-10, 08:45
|
|
Registered User
|
|
Join Date: Mar 2010
Location: Tampa, FL
Posts: 5
|
|
Sorry, I can't post the real data, it's sensitive. But you are looking at a sample of what I have to working with. I was given two csv files generated from separate sources and asked to combine them and create a unique list. The list are +200K lines each so I thought it best to import them to a database and try to query out the duplicates.
If I had a unique field it would be simple, but the all the possible uniques have potential duplicates. So say you had this little table...
first last pid cid city
SMITH JOHN ........... 5432112 WHOVILLE
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN 5522456 ........... WHOVILLE
SMITH JOHN 5522456 ........... WHOVILLE
SMITH JOHN ........... ........... WHOVILLE
SMITH JOHN ........... ........... WATSON
How would you write a query to return this...
first last pid cid city
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN ........... ........... WATSON
I hope I'm explaining it better. I'm having a hard time spitting out!
Thanks,
Randy
|
Last edited by zoonix; 03-30-10 at 08:56.
|

03-30-10, 08:47
|
|
Jaded Developer
|
|
Join Date: Nov 2004
Location: out on a limb
Posts: 9,250
|
|
I think your real issue at present is trying to identify which row is 'the' row, ie the one you want to keep, and which row(s) are the ones you want to drop
First off you need to scope the problem, how many rows have you got with such issues
running various queries using a join or the count + group by function should give you a handle on that.
after that you've got to find out how you got duplicates in the first place and stop that happening again (sounds like you knwo of a way of making a row unique but its not been implemented.
bear in mind that some duplicates are down to data entry errors can can be tricky for a SQL statement to identify unless you get involved with trickery such as double phenome or soundex
things like this may help;
Detecting outliers in US census data Data Applied: Beyond Reporting
__________________
I'd rather be riding my Versys or my Tiger 800 let alone the Norton
|
|

03-30-10, 08:54
|
|
SQL Consultant
|
|
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,525
|
|
Quote:
Originally Posted by zoonix
So say you had this little table...
first last pid cid city
SMITH JOHN 5432112 WHOVILLE
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN 5522456 WHOVILLE
SMITH JOHN 5522456 WHOVILLE
SMITH JOHN WHOVILLE
SMITH JOHN WATSON
|
no wonder i couldn't follow what you were doing
the first time you posted, the columns were ~not~ obvious, because you didn't mention any
now you've changed them again
and to compound the confusion, you didn't format the fake data into columns, like this --
Code:
first last pid cid city
SMITH JOHN 5432112 WHOVILLE
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN 5522456 WHOVILLE
SMITH JOHN 5522456 WHOVILLE
SMITH JOHN WHOVILLE
SMITH JOHN WATSON
there's no way you're going to be able to write a query to reduce all those rows down to this ----
Code:
first last pid cid city
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN WATSON
i suppose we could still help you write the queries, but you still haven't made any attempt to explain how to combine this data
for example, you've got 5432112 under pid, and yet your suggested output shows 5432112 only under cid -- how does that work?
|
|

03-30-10, 08:59
|
|
Registered User
|
|
Join Date: Mar 2010
Location: Tampa, FL
Posts: 5
|
|
Sorry again. I keep messing up the format of my post. It should look like this
first last pid cid city
SMITH JOHN ........... 5432112 WHOVILLE
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN 5522456 ........... WHOVILLE
SMITH JOHN 5522456 ........... WHOVILLE
SMITH JOHN ........... ........... WHOVILLE
SMITH JOHN ........... ........... WATSON
How would you write a query to return this...
first last pid cid city
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN ........... ........... WATSON
- Randy
|
|

03-30-10, 09:10
|
|
SQL Consultant
|
|
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,525
|
|
Quote:
Originally Posted by zoonix
How would you write a query to return this...
first last pid cid city
SMITH JOHN 5522456 5432112 WHOVILLE
SMITH JOHN ........... ........... WATSON
|
i guess you didn't see my previous post, in which i formatted your data for you and also remarked that you ~cannot~ write a query to produce those results from that data
and you still haven't made any attempt to state the rules for what to combine
it ~looks~ like you want to consolidate based on pid and cid, but see my remark about the number 5432112
|
|

03-30-10, 09:12
|
|
SQL Consultant
|
|
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,525
|
|
here, try this and see what you get --
Code:
SELECT first, last, MAX(pid), MAX(cid), city
FROM daTable
GROUP BY first, last, city
i think it works on your fake data, but of course on the real data it's going to mess things up rather nicely
|
|

03-30-10, 09:18
|
|
vaguely human
|
|
Join Date: Jun 2007
Location: London
Posts: 2,519
|
|
This is a million miles from perfect but I think it will produce something similar to what you are after. I wouldn't run it on your main copy of the data as we 're still a way of a complete solution yet. At least it starts you off:
Code:
create table temp_table
select lastname,firstname,city, max( concat(num1,'-',num2) ) as keeper
from your_table
group by lastname, firstname, city;
delete from your_table
where concat(num1,'-',num2) != (
select keeper from temp_table t2
where t2.lastname = your_table.lastname
and t2.firstname = your_table.firstname
and t2.city = your_table.city );
select * from your_table;
The issue above is working out which id's are the ones to keep.
What if 2 people have the same names in same city?
What if one person has 2 sets of ids in city - which do you keep?
|
|

03-30-10, 09:44
|
|
Registered User
|
|
Join Date: Mar 2010
Location: Tampa, FL
Posts: 5
|
|
Thanks mike_bike_kite, that looks like the road I was going down, but I couldn't quite get it worked out. This should get me over the hump I think. I'm going to work on it for a bit and see if I can make it work. I post what I come up with and see what everyone thinks.
Thanks for your patience as well Rudy, I'm throwing a copy of your book on the company CC in appreciation!
|
|

03-30-10, 10:15
|
|
SQL Consultant
|
|
Join Date: Apr 2002
Location: Toronto, Canada
Posts: 19,525
|
|
thanks, that was unexpected, considering all the grief i gave you
i hope you will enjoy it, even though it's not at a partricularly advanced level, which is what i think you'll need for this problem
|
|
| Thread Tools |
Search this Thread |
|
|
|
| Display Modes |
Linear Mode
|
Posting Rules
|
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is Off
|
|
|
|
|