Results 1 to 8 of 8
  1. #1
    Join Date
    Oct 2003
    Posts
    268

    Fuzzy String matching/searching?

    If my database has an entry for "Albertson" and the user types in "AlbertZson", "Albertsons", or some other variant, I would like a matching algorithm that would properly match. Ideally it would provide a match score as well. I would also like to use this to suggest several close variations if there isn't one obvious one.

    I can pre-index my database if necessary.

    FWIW, I'm currently using a MySQL database.

    Thanks!

  2. #2
    Join Date
    Mar 2004
    Posts
    370
    First of all,
    this doesn't have anything to do with database design.Just some suggestions:
    1.If you want to retrive the strings with similar sounding form you can use SOUNDEX function in SQL.
    2.If you need a ranking score I think you should go through one of *Information Retrieval Models* such as vector model that is sufficient for your job in my opinion.
    Google "vector model" "IR model" for more information.
    -Good luck!

  3. #3
    Join Date
    Sep 2005
    Posts
    22
    I am faced with this problem at the moment.
    We have a person table of 41000 names.
    What I came up with is an "extended soundex" - soundex of the surname plus two initials.
    (if there is no second name replace it with zero).
    Combined with the suburb this appears to work well with our data.
    Mind you our data guarantees that the suburb will be spelled correctly.

  4. #4
    Join Date
    Oct 2002
    Location
    Baghdad, Iraq
    Posts
    697
    Quote Originally Posted by RogerWilco
    If my database has an entry for "Albertson" and the user types in "AlbertZson", "Albertsons", or some other variant, I would like a matching algorithm that would properly match.
    Look up Soundex or (better) Metaphone.

    What these algorithms do is reduce each word to the most important sounds. So similar vowel and consonant sounds (based on English phonics) are "normalized." It's a bit like doing a case-insensitive match by storing everything as upper case.

    What you're trying to do entails this:

    When you add a name, you'll add both the name and its hash in two separate columns. When you search for a name, you'll hash the name you're searching for and check that against the hash column.

    Ideally it would provide a match score as well. I would also like to use this to suggest several close variations if there isn't one obvious one.

    I can pre-index my database if necessary.

    FWIW, I'm currently using a MySQL database.

    Thanks!
    Match score: first, names that don't match Metaphone or Soundex are probably so far off you don't want to match them at all.

    Nevertheless, there's an easy way: compare the actual values, and if they're equal give those a score of 1, otherwise 0.

    The harder way: count the number of characters that are different. This is harder than it sounds because you have to do a "longest common subsequence" match to handle insertions and deletions. It's too much work for what you're trying to do, honestly, though perl (at least) has libraries that have that, and the Metaphone algorithm.

    (Hint: hit search.cpan.org for Text::Metaphone and Algorithm:iff)

    Pre-index: yes, you'll need to maintain the hash column. If you're using MySQL 5, triggers could help you here.

    (Completely irrelevant anecdote: I had been doing much the same thing and, by chance, had flown to California to go to the Mac Developer's conference. My brother lives in the area, and we went to a party with some friends of his in Oakland. So I'm talking to this woman and I mention how good Metaphone is. She says, "oh, wait, I have to get my husband." I'm a little baffled, but he comes back, and she asks me to say what I just did. Turns out the guy was Lawrence Phillips, who invented Metaphone...)

  5. #5
    Join Date
    Oct 2002
    Location
    Baghdad, Iraq
    Posts
    697

    Tries

    BTW, there's another way of doing this that is more along the lines of "write your own DBMS." Look up "trie" on wikipedia if you're curious. That's how most spell checkers and the T9 algorithms on cell phones are implemented.

  6. #6
    Join Date
    Oct 2003
    Posts
    268
    Thanks guys. No, this isn't a database design topic but it's a related subject. Most users expect this from standard data driven applications.

    If I search Amazon for "zpsp", for example, it assumes I meant "psp" (popular video game machine). This is the functionality my client wants. Note that "zpsp" isn't phoenetically similar to "psp". I assume that the matching algorithm works based on the popularity of the term "psp". There must be either some hashing algorithm or indexing structure that supports this efficiently.

    I've read on soundex/metaphone and they aren't what I want since they are designed for phonetics and are not designed to catch typos or non-phonetical misspellings.

    I understand the "trie" data structure and "suffix trees" but I don't see any fuzzy matching capabilities. I also am familiar with Burrow Wheelers Transforms and the indexing capabilities of those. However, I don't want to simply pick one based on arbitrary, abstract, or academic criteria.

    Ideally, I would know which algorithm that Amazon uses or which is used by premier search engines such as Google.

    thanks again!
    Last edited by RogerWilco; 11-07-05 at 17:16. Reason: spelling

  7. #7
    Join Date
    Oct 2002
    Location
    Baghdad, Iraq
    Posts
    697
    Quote Originally Posted by RogerWilco
    I understand the "trie" data structure and "suffix trees" but I don't see any fuzzy matching capabilities. I also am familiar with Burrow Wheelers Transforms and the indexing capabilities of those. However, I don't want to simply pick one based on arbitrary, abstract, or academic criteria.
    Tries will work. Honestly, they really are used in spell checkers. <a href="http://www.cs.mcgill.ca/~tim/tries/approxString.ps">I didn't just make that up.</a> (Postscript file)

    Ideally, I would know which algorithm that Amazon uses or which is used by premier search engines such as Google.
    Well, instead of your arbitrary, abstract, academic criterion that it be Amazon or Google's technique, how about <a href="http://en.wikipedia.org/wiki/Spell_checker">looking up how it's done</a>? The article in question also has a half dozen open source spell checkers that are all doing exactly what you're trying to do. That way you don't need to waste away waiting for an Amazon or Google engineer to stop by dbforums.

  8. #8
    Join Date
    Mar 2004
    Location
    Toronto, ON, Canada
    Posts
    513
    How cool is THIS? Sounds like it does exactly what you want... unfortunately it's mysql.

    http://dev.mysql.com/doc/refman/5.0/...expansion.html

    Another example could be searching for books by Georges Simenon about Maigret, when a user is not sure how to spell “Maigret”. A search for “Megre and the reluctant witnesses” finds only “Maigret and the Reluctant Witnesses” without query expansion. A search with query expansion finds all books with the word “Maigret” on the second pass.
    --
    Jonathan Petruk
    DB2 Database Consultant

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •