Results 1 to 12 of 12
  1. #1
    Join Date
    Jan 2002
    Posts
    189

    Unanswered: Can someone proofread my 'remove duplicates' script?

    Code:
    DELETE
    FROM tblContacts
    WHERE tblContacts.ID IN(
    	SELECT F.ID
    	FROM tblContacts AS F
    	WHERE Exists (
    		SELECT email, Count(ID)
    		FROM tblContacts
    		WHERE tblContacts.email = F.email
    		GROUP BY tblContacts.email
    		HAVING Count(tblContacts.ID) > 1
    	)
    )
    AND tblContacts.ID NOT IN(
    	SELECT Min(ID)
    	FROM tblContacts AS F
    	WHERE Exists (
    		SELECT email, Count(ID)
    		FROM tblContacts
    		WHERE tblContacts.email = F.email
    		GROUP BY tblContacts.email
    		HAVING Count(tblContacts.ID) > 1
    	)
    	GROUP BY email
    )

    I readily admit that I've shamelessly copied 'n pasted this from a tutorial and then taken a stab at tweaking it for my own ends. But I really don't understand what it's doing.

    Really, all I want to know is that it will remove records with duplicate email fields. But I could also do with confirming - looking at the "SELECT Min(ID)" bit - does that mean that if it finds a duplicate, it'll delete the latest-added one? And if so, that changing it to remove the earliest-added one is simply a case of changing MIN to MAX?


    Thanks

  2. #2
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    A good tip for keeping your sanity is to always run scripts like this on a testing version of your database and then confirm that it has worked before even contemplating running it on prod. As such - the below is based on my best reading of the script.

    Yes it will work. Yes it will delete the most recently inserted record(s) (assuming that the ID field is a monotonically increasing value such as an identity and a higher number always indicates a more recently inserted record). And yes - you can change MIN to MAX to retain the most recently inserted record.

    Have a read through the script a few times though - even if you don't consider it necesary it would be nice to know what it is doing and why.

    HTH
    Testimonial:
    pootle flump
    ur codings are working excelent.

  3. #3
    Join Date
    Jan 2002
    Posts
    189
    Yeah, a test run would be advisable. Good point. I've tried reading my way through it and I just get bogged down in "so we get one list that's.... and those exist in... and that doesn't exist.... and...." and the will to live rapidly leaves me. I think I get it now, though. I guess I just needed someone to tell me it did do what I thought before I tried to figure out exactly how.

    Thanks for the help.

  4. #4
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    While it lookes to me like the code you posted should work, I'd suggest a simpler approach. It is a lot easier to read (at least for me anyway), and probably easier to understand.

    Code:
    DELETE FROM tblContacts
       WHERE ID <> (SELECT Max(z.ID)
          FROM tblContacts AS z
          WHERE  z.email = tblContacts.email))
    -PatP

  5. #5
    Join Date
    Jan 2002
    Posts
    189
    Ooh, that's easier Nice one

  6. #6
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    Oh yeah... One thing I ought to mention before you go trundling off, this code snippet will trip over any rows that have an ID column that is NULL. This may make you need to add an ID IS NOT NULL to the outer clause if there is any chance of encountering a legitimate NULL value in the ID column. This is unlikely, but some schemas will permit it, and the results can be catastrophic!

    -PatP

  7. #7
    Join Date
    Jan 2003
    Location
    Massachusetts
    Posts
    5,800
    Provided Answers: 11
    A handy trick I have used in such cases is rewrite the statement as a SELECT, rather than an update or delete. See what records you are about to modify/maim, and if you have no objections, then you can run the actual data modification.

  8. #8
    Join Date
    Jan 2002
    Posts
    189
    Quote Originally Posted by Pat Phelan
    Code:
    DELETE FROM tblContacts
       WHERE ID <> (SELECT Max(z.ID)
          FROM tblContacts AS z
          WHERE  z.email = tblContacts.email))
    Actually - reading this more carefully, I'm a bit confused. It looks like it'll only do one at a time? Is that right? If so, that's not a bad thing - in fact, it'd be good to know that I've managed to guess what a statement will do before I run it

    edit:

    Well... having read MCrowley's excellent advice: clearly it doesn't. It gives me a big list of duplicates. But I don't understand how? It selects MAX - which is only going to return one record, right? And then it selects (or deletes) WHERE ID <> - not "is not in [a range]", but "does not equal [a value]".

    I'm confused again
    Last edited by Spudhead; 10-03-06 at 05:16.

  9. #9
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    The key is in the corrolation
    Code:
    DELETE FROM tblContacts WHERE ID <> (SELECT Max(z.ID) FROM tblContacts AS z WHERE z.email = tblContacts.email))
    If you remove the bit in bold then yes - you would get one ID returned. However the bit in bold corrolates the inner and outer query so there is one MAX(ID) returned per email address.

    It can be rewritten as a select query as a join of two tables that might make it more obvious:

    Code:
    SELECT tblContacts.* FROM tblContacts INNER JOIN (SELECT Max(z.ID) AS TheMaxID, email FROM tblContacts GROUP BY email) AS z ON z.email = tblContacts.email WHERE tblContacts.ID <> TheMaxID


    HTH
    Testimonial:
    pootle flump
    ur codings are working excelent.

  10. #10
    Join Date
    Jan 2002
    Posts
    189
    The more I think about this, the harder it gets I swear there's a SQL gene.

    Anyway - thanks for the help and patience, everyone. I think I need to just go and play with these statements and get my head round them.

  11. #11
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Just practice - SQL is very easy to learn but rather tricky to master.

    Run the inner query on its own:
    Code:
    SELECT Max(z.ID) AS TheMaxID, 
    email
    FROM     tblContacts 
    GROUP BY email
    That might illuminate....
    Testimonial:
    pootle flump
    ur codings are working excelent.

  12. #12
    Join Date
    Feb 2004
    Location
    In front of the computer
    Posts
    15,579
    Provided Answers: 54
    When you have a subquery (one query that is logically "nested inside" of another query), the subquery gets re-evaluated for each row returned by the outer query. If the DELETE query materializes a million rows from the tblContacts table, then the SELECT query would be evaluated a million times, once for each of the rows materialized by the DELETE query.

    For each row in tblContacts, that DELETE checks to see if that row has the Max(ID) value for a given email address. If the row does not have the Max(ID) for that email, then the row is deleted.

    While this will sometimes confuse people, it does not confuse SQL! The only point that can confuse SQL is NULL values. A NULL email is simply ignored, considered junque and deleted (the explanation of that gets a bit tricky, just take it on faith for now). A NULL ID is also deleted, for a different (but similar) reason.

    Once you understand the way this works for non-NULL values, we'll worry about the NULL values. They are almost assuredly garbage anyway, so don't burn much time on them yet.

    -PatP

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •