Not sure where to put this but here seems like a fair enough place. I'm looking at doing some data cleansing of an old Oracle RDB database on a field that records a country. The field had no validation on the front end, and as a result there are many different misspellings of countries. Given that I have a list of properly spelled countries, I was wondering if anyone knew of an algorithm or technique to analyse the field and attempt to match it from the known list of countries?
The two most commonly used techniques available to you are soundex (also try looking at metaphone) and distance algorithms (such as levenshtein).
A big advantage of soundex type functions is they accept a single input so you can do a load of pre-processing before attempting to match. Distance algorithms are not great at matching large sets if time is an issue.
I have no idea what is available in Oracle, nor the best way to implement these however there will be loads of info available on the internets.
You would also want to check for abbreviations too I suppose.