Results 1 to 7 of 7
  1. #1
    Join Date
    Jan 2003
    Posts
    8

    Question Unanswered: Matching algorithm

    I have a db full of names and identifiers. Was wondering if anyone could direct me towards a good dupe eliminating algorithm or bit of logic to look at?

    Thanks!

  2. #2
    Join Date
    Feb 2002
    Posts
    2,232
    What will be your basis for eliminating duplicates ?

  3. #3
    Join Date
    Jan 2003
    Posts
    8
    Originally posted by rnealejr
    What will be your basis for eliminating duplicates ?
    Name dupes and identifiers, some names may be similar, some addresses may be simialr. Each row will have approximately 10 identifiers, but some will be nnull. It is fairly dirty data.

  4. #4
    Join Date
    Feb 2002
    Posts
    2,232
    But how do you determine which rows to delete (like keying off the most recent timestamp) or will it be a visual comparison (so will you need to route these duplicates to a holding table) ?

  5. #5
    Join Date
    Jan 2003
    Posts
    8
    Originally posted by rnealejr
    But how do you determine which rows to delete (like keying off the most recent timestamp) or will it be a visual comparison (so will you need to route these duplicates to a holding table) ?
    Yes I wil need to move them to a holding table. It will be compared visually using all.

  6. #6
    Join Date
    Dec 2002
    Location
    Czech Republic
    Posts
    249
    You should post more info about your table of duplicities.
    Generic query looks like this

    select d1.*
    into ClearTable
    from DuplTable d1
    join
    (
    select IdCol1,IdCol2,...
    from DuplTable d2
    group by IdCol1,IdCol2,...
    having count(*)=1
    ) x on (d1.IdCol1=x.IdCol1 or (d1.IdCol1 is null and x.IdCol1 is null))
    and (d1.IdCol2=x.IdCol2 or (d1.IdCol2 is null and x.IdCol2 is null))
    .
    .
    .
    select d1.*
    into HoldingTable
    from DuplTable d1
    join
    (
    select IdCol1,IdCol2,...
    from DuplTable d2
    group by IdCol1,IdCol2,...
    having count(*)>1
    ) x on (d1.IdCol1=x.IdCol1 or (d1.IdCol1 is null and x.IdCol1 is null))
    and (d1.IdCol2=x.IdCol2 or (d1.IdCol2 is null and x.IdCol2 is null))
    .
    .
    .
    select * from ClearTable
    select * from HoldingTable

  7. #7
    Join Date
    Jan 2003
    Posts
    8

    Smile

    Originally posted by ispaleny
    You should post more info about your table of duplicities.
    Awesome, I am going to try that format. My original table is pretty dirty. It is Lname, fname, SSN, address, city, state, country, phone number, etc., about 5 more identifiers. It is around 3 million records, some complete, some not. Spellings are different in some cases, middle names are there in other cases.

    Thanks for your help.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •