Results 1 to 13 of 13
  1. #1
    Join Date
    Mar 2010
    Location
    Tampa, FL
    Posts
    5

    Unanswered: Query help with a duplicates mess

    I have data from two sources that has created a mess of duplicates. I've been trying to figure out a way to remove the duplicates, but I think it's just beyond my skill level. Here's a sample...

    SMITH JOHN 5432112 WHOVILLE
    SMITH JOHN 5522456 5432112 WHOVILLE
    SMITH JOHN 5522456 5432112 WHOVILLE
    SMITH JOHN 5522456 WHOVILLE
    SMITH JOHN 5522456 WHOVILLE
    SMITH JOHN WHOVILLE
    SMITH JOHN WATSON

    Is there a way to get this to remove the duplicates? I'm hoping it's possible.

    Thanks,
    Randy

  2. #2
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    Quote Originally Posted by zoonix View Post
    I'm hoping it's possible.
    of course it's possible

    but don't expect us to define the rules for what to keep and what to remove -- that's up to you

    if you want help with the SQL, you're going to have to dump the table for us, i.e. generate the CREATE TABLE and INSERT statements, with sufficient rows to illustrate the problems

    what you posted doesn't make a whole lot of sense, there aren't even any column names
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  3. #3
    Join Date
    Mar 2010
    Location
    Tampa, FL
    Posts
    5
    This thing is such a mess I'm not even sure how to describe the problem. So no I don't expect anything, I just hope someone can point in the right direction.

    Sorry, I thought the column names were fairly self explanatory...
    lastname, firstname, num1, num2, city (i left out the other columns for the sake of brevity).

    The problem is that I am importing the data from two sources each of which have duplicates. So what I tried to illustrate is the one person might have
    duplicates with the first id#,
    dupes with the second id#,
    dupes with both id's and
    dupes with neither id

    I've been trying to work it out by using temporary tables and multiple select distinct queries, but have thus far failed miserably.

    Any ideas???

  4. #4
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    Any ideas???
    Your explanation is still quite poor I'm afraid. Do you want any form of duplicate removed? What would you want to keep from the data you supplied? Which fields should be unique in your table? The real data would be much more useful as the current extract just looks like poorly defined homework.

  5. #5
    Join Date
    Mar 2010
    Location
    Tampa, FL
    Posts
    5
    Sorry, I can't post the real data, it's sensitive. But you are looking at a sample of what I have to working with. I was given two csv files generated from separate sources and asked to combine them and create a unique list. The list are +200K lines each so I thought it best to import them to a database and try to query out the duplicates.

    If I had a unique field it would be simple, but the all the possible uniques have potential duplicates. So say you had this little table...

    first last pid cid city
    SMITH JOHN ........... 5432112 WHOVILLE
    SMITH JOHN 5522456 5432112 WHOVILLE
    SMITH JOHN 5522456 5432112 WHOVILLE
    SMITH JOHN 5522456 ........... WHOVILLE
    SMITH JOHN 5522456 ........... WHOVILLE
    SMITH JOHN ........... ........... WHOVILLE
    SMITH JOHN ........... ........... WATSON

    How would you write a query to return this...
    first last pid cid city
    SMITH JOHN 5522456 5432112 WHOVILLE
    SMITH JOHN ........... ........... WATSON

    I hope I'm explaining it better. I'm having a hard time spitting out!

    Thanks,
    Randy
    Last edited by zoonix; 03-30-10 at 09:56.

  6. #6
    Join Date
    Nov 2004
    Location
    out on a limb
    Posts
    13,692
    Provided Answers: 59
    I think your real issue at present is trying to identify which row is 'the' row, ie the one you want to keep, and which row(s) are the ones you want to drop

    First off you need to scope the problem, how many rows have you got with such issues
    running various queries using a join or the count + group by function should give you a handle on that.

    after that you've got to find out how you got duplicates in the first place and stop that happening again (sounds like you knwo of a way of making a row unique but its not been implemented.

    bear in mind that some duplicates are down to data entry errors can can be tricky for a SQL statement to identify unless you get involved with trickery such as double phenome or soundex

    things like this may help;
    Detecting outliers in US census data Data Applied: Beyond Reporting
    I'd rather be riding on the Tiger 800 or the Norton

  7. #7
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    Quote Originally Posted by zoonix View Post
    So say you had this little table...

    first last pid cid city
    SMITH JOHN 5432112 WHOVILLE
    SMITH JOHN 5522456 5432112 WHOVILLE
    SMITH JOHN 5522456 5432112 WHOVILLE
    SMITH JOHN 5522456 WHOVILLE
    SMITH JOHN 5522456 WHOVILLE
    SMITH JOHN WHOVILLE
    SMITH JOHN WATSON
    no wonder i couldn't follow what you were doing

    the first time you posted, the columns were ~not~ obvious, because you didn't mention any

    now you've changed them again

    and to compound the confusion, you didn't format the fake data into columns, like this --
    Code:
    first last pid     cid     city    
    SMITH JOHN 5432112         WHOVILLE
    SMITH JOHN 5522456 5432112 WHOVILLE
    SMITH JOHN 5522456 5432112 WHOVILLE
    SMITH JOHN 5522456         WHOVILLE
    SMITH JOHN 5522456         WHOVILLE
    SMITH JOHN                 WHOVILLE
    SMITH JOHN                 WATSON
    there's no way you're going to be able to write a query to reduce all those rows down to this ----
    Code:
    first last pid     cid     city    
    SMITH JOHN 5522456 5432112 WHOVILLE
    SMITH JOHN                 WATSON
    i suppose we could still help you write the queries, but you still haven't made any attempt to explain how to combine this data

    for example, you've got 5432112 under pid, and yet your suggested output shows 5432112 only under cid -- how does that work?
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  8. #8
    Join Date
    Mar 2010
    Location
    Tampa, FL
    Posts
    5
    Sorry again. I keep messing up the format of my post. It should look like this

    first last pid cid city
    SMITH JOHN ........... 5432112 WHOVILLE
    SMITH JOHN 5522456 5432112 WHOVILLE
    SMITH JOHN 5522456 5432112 WHOVILLE
    SMITH JOHN 5522456 ........... WHOVILLE
    SMITH JOHN 5522456 ........... WHOVILLE
    SMITH JOHN ........... ........... WHOVILLE
    SMITH JOHN ........... ........... WATSON

    How would you write a query to return this...
    first last pid cid city
    SMITH JOHN 5522456 5432112 WHOVILLE
    SMITH JOHN ........... ........... WATSON

    - Randy

  9. #9
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    Quote Originally Posted by zoonix View Post
    How would you write a query to return this...
    first last pid cid city
    SMITH JOHN 5522456 5432112 WHOVILLE
    SMITH JOHN ........... ........... WATSON
    i guess you didn't see my previous post, in which i formatted your data for you and also remarked that you ~cannot~ write a query to produce those results from that data

    and you still haven't made any attempt to state the rules for what to combine

    it ~looks~ like you want to consolidate based on pid and cid, but see my remark about the number 5432112
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  10. #10
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    here, try this and see what you get --
    Code:
    SELECT first, last, MAX(pid), MAX(cid), city
      FROM daTable
    GROUP BY first, last, city
    i think it works on your fake data, but of course on the real data it's going to mess things up rather nicely
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

  11. #11
    Join Date
    Jun 2007
    Location
    London
    Posts
    2,527
    This is a million miles from perfect but I think it will produce something similar to what you are after. I wouldn't run it on your main copy of the data as we 're still a way of a complete solution yet. At least it starts you off:
    Code:
    create table temp_table
       select lastname,firstname,city, max( concat(num1,'-',num2) ) as keeper
       from   your_table
       group by lastname, firstname, city;
    
    delete from your_table
    where  concat(num1,'-',num2) != (
            select keeper from temp_table t2
            where  t2.lastname = your_table.lastname
                   and t2.firstname = your_table.firstname
                   and t2.city = your_table.city );
    
    select * from your_table;
    The issue above is working out which id's are the ones to keep.
    What if 2 people have the same names in same city?
    What if one person has 2 sets of ids in city - which do you keep?

  12. #12
    Join Date
    Mar 2010
    Location
    Tampa, FL
    Posts
    5
    Thanks mike_bike_kite, that looks like the road I was going down, but I couldn't quite get it worked out. This should get me over the hump I think. I'm going to work on it for a bit and see if I can make it work. I post what I come up with and see what everyone thinks.

    Thanks for your patience as well Rudy, I'm throwing a copy of your book on the company CC in appreciation!

  13. #13
    Join Date
    Apr 2002
    Location
    Toronto, Canada
    Posts
    20,002
    thanks, that was unexpected, considering all the grief i gave you

    i hope you will enjoy it, even though it's not at a partricularly advanced level, which is what i think you'll need for this problem
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •