Results 1 to 5 of 5
  1. #1
    Join Date
    Jul 2009
    Posts
    2

    Question Data Cleansing of a Country field

    Hi All,

    Not sure where to put this but here seems like a fair enough place. I'm looking at doing some data cleansing of an old Oracle RDB database on a field that records a country. The field had no validation on the front end, and as a result there are many different misspellings of countries. Given that I have a list of properly spelled countries, I was wondering if anyone knew of an algorithm or technique to analyse the field and attempt to match it from the known list of countries?

    Cheers,

    Tom.

  2. #2
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    The two most commonly used techniques available to you are soundex (also try looking at metaphone) and distance algorithms (such as levenshtein).
    A big advantage of soundex type functions is they accept a single input so you can do a load of pre-processing before attempting to match. Distance algorithms are not great at matching large sets if time is an issue.
    I have no idea what is available in Oracle, nor the best way to implement these however there will be loads of info available on the internets.
    You would also want to check for abbreviations too I suppose.
    Last edited by pootle flump; 07-08-09 at 04:35.
    Testimonial:
    pootle flump
    ur codings are working excelent.

  3. #3
    Join Date
    Feb 2004
    Location
    One Flump in One Place
    Posts
    14,912
    Just in case it is not obvious, set up constraints to ensure this never happens again once your data is cleaned.
    Testimonial:
    pootle flump
    ur codings are working excelent.

  4. #4
    Join Date
    Jun 2003
    Location
    Ohio
    Posts
    12,592
    This is a relatively simple algorithm which could easily be ported to Oracle:

    sqlblindman private pastebin - collaborative debugging tool

    If you do, please credit my original work in your code, add your own credits, and share the results.

    Thanks.
    If it's not practically useful, then it's practically useless.

    blindman
    www.chess.com: "sqlblindman"
    www.LobsterShot.blogspot.com

  5. #5
    Join Date
    Jul 2009
    Posts
    2
    Thanks to both of you - the levenshtein distance algorithm worked the best.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •