Results 1 to 6 of 6

Thread: Deduping DB

  1. #1
    Join Date
    Jun 2004
    Posts
    6

    Unanswered: Deduping DB

    I have an Sql database with several tables. It has about 75,000 records in it but also has a lot of dupes. Can anyone help me with a script to isolate the most possible dupes? One of the most common things I'm seeing is mispelled names.
    Thanks.

  2. #2
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    Provided Answers: 1
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  3. #3
    Join Date
    Mar 2004
    Location
    Glasgow, Scotland
    Posts
    107
    A good starting point would be to quickly link an Access front-end to your DB and then run the Access Find Duplicates query wizard to get an idea of how it does it. Unfortunately there is absolutely no way you are ever going to remove all duplicates from a database. I have had quite a lot of experience with this - NEVER promise anyone that you can do it!

    One of the problems with removing cutomer/supplier dups is that if the duplicate customer/supplier also has records in other related tables you will then want to link that data to the duplicate you are keeping if u know what I mean...

    Cheers

  4. #4
    Join Date
    Jun 2004
    Posts
    6
    I've tried access and the number I am getting seems unbelieveable. Do you know of any tools that might work?

  5. #5
    Join Date
    Mar 2004
    Location
    Glasgow, Scotland
    Posts
    107
    Hi,

    There is software in the marketplace that will do this sort of thing but it all depends on how conisitently your data has been entered in the first place e.g. some users might enter 'Mr Matt McDonald' into a name field whilst others might enter 'Matt McDonald' - a standard database de-duping routine obviously wouldn't pick this up.

    Whilst working for a mailing house and de-duping customer data I used software developed by QAS (www.qas.com) but it is very expensive and only works well if you have consistent customer address fields as it looks for postcodes etc. - as far as I'm aware most of the de-duping software works like this.

    You say that Access returns more records that you think it should - maybe you need to change the criteria to specify what really are duplicates. Have you tried double-checking some of the results to see if they really are duplicates?? Other than that the only way round this is to write your own custom procedure using a combination of code (maybe VBA in access??) and Queries using wildcard characters e.g. Like * *

    I don't know if this is any help to you but de-duping is different for each organisation and for each set of data....there is no one template fits all.

    Matt

  6. #6
    Join Date
    Nov 2002
    Location
    Jersey
    Posts
    10,322
    Sure...you'll get an answer very fast if you post the DDL

    AND what you consider a dup to be...it's not always black and white with some people

    A dup to me is everything on the row is exactly the same...

    Not just the PK

    If you have a PK...which you don't because then you wouldn't have a dup
    Brett
    8-)

    It's a Great Day for America everybody!

    dbforums Yak CorralRadio 'Rita
    dbForums Member List
    I'm Good Once as I ever was

    The physical order of data in a database has no meaning.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •