If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

 
Go Back  dBforums > General > Database Concepts & Design > Data Cleansing of a Country field

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #1 (permalink)  
Old 07-07-09, 22:02
tellisnz tellisnz is offline
Registered User
 
Join Date: Jul 2009
Posts: 2
Question Data Cleansing of a Country field

Hi All,

Not sure where to put this but here seems like a fair enough place. I'm looking at doing some data cleansing of an old Oracle RDB database on a field that records a country. The field had no validation on the front end, and as a result there are many different misspellings of countries. Given that I have a list of properly spelled countries, I was wondering if anyone knew of an algorithm or technique to analyse the field and attempt to match it from the known list of countries?

Cheers,

Tom.
Reply With Quote
  #2 (permalink)  
Old 07-08-09, 03:31
pootle flump pootle flump is offline
King of Understatement
 
Join Date: Feb 2004
Location: One Flump in One Place
Posts: 14,905
The two most commonly used techniques available to you are soundex (also try looking at metaphone) and distance algorithms (such as levenshtein).
A big advantage of soundex type functions is they accept a single input so you can do a load of pre-processing before attempting to match. Distance algorithms are not great at matching large sets if time is an issue.
I have no idea what is available in Oracle, nor the best way to implement these however there will be loads of info available on the internets.
You would also want to check for abbreviations too I suppose.
__________________
Testimonial:
Quote:
pootle flump
ur codings are working excelent.

Last edited by pootle flump; 07-08-09 at 03:35.
Reply With Quote
  #3 (permalink)  
Old 07-08-09, 03:36
pootle flump pootle flump is offline
King of Understatement
 
Join Date: Feb 2004
Location: One Flump in One Place
Posts: 14,905
Just in case it is not obvious, set up constraints to ensure this never happens again once your data is cleaned.
__________________
Testimonial:
Quote:
pootle flump
ur codings are working excelent.
Reply With Quote
  #4 (permalink)  
Old 07-08-09, 09:37
blindman blindman is offline
World Class Flame Warrior
 
Join Date: Jun 2003
Location: Ohio
Posts: 11,726
This is a relatively simple algorithm which could easily be ported to Oracle:

sqlblindman private pastebin - collaborative debugging tool

If you do, please credit my original work in your code, add your own credits, and share the results.

Thanks.
__________________
If it's not practically useful, then it's practically useless.

blindman
www.chess.com: "sqlblindman"
Reply With Quote
  #5 (permalink)  
Old 07-09-09, 00:09
tellisnz tellisnz is offline
Registered User
 
Join Date: Jul 2009
Posts: 2
Thanks to both of you - the levenshtein distance algorithm worked the best.
Reply With Quote
Reply

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On