Fuzzy string matching
Posted by Kelvin on 03 Jan 2007 at 08:47 am | Tagged as: programming
I've been recently peripherally involved in a project which attempts to perform a fuzzy match on names in a MySQL database. With "Homethinking":http://www.homethinking.com, we had to do something similar regarding matching for realtor and brokerage names. Its also related to some of the Lucene consulting I've been involved with.
Its an interesting problem. There's an article on "Wikipedia":http://en.wikipedia.org/wiki/Approximate_string_matching on it. "Project Dedupe":http://tickett.net/dedupe/index.php/Main_Page also has some "interesting links":http://tickett.net/dedupe/index.php/Existing_software worth checking out.
The first (simple) approaches that come to mind are:
* Edit/Levenstein distance
* Phonetic matching, e.g. soundex, metaphone, double metaphone (though these much less useful for names)
* Regex based wildcard matching
One way or going with the regex approach by constructing multiple regex patterns, adding wildcards before each letter. Let me illustrate:
Given the word: *foo*, the following regex patterns are created
* /.*foo/
* /f.*oo/
* /fo.*o/
* /foo.*/
Keeping in mind that we're trying to match for name misspellings, which generally consist of single or at most 2 character variations, sometimes additonal, sometimes missing characters, its not hard to see how this regex will probably turn up the correctly spelled word.
Another variation on the regex approach I was toying around with, was generating the following regex, given the same word *foo*
* /[foo]{3,}/
So look for all (or if the word was longer, some) characters in the word, which takes into account any possible letter transpositions.
Ideally, for fuzzy matches, we'd want to be able to give a guesstimate score of how close we think a match is to the word.
On this project, we didn't need that, and were also constrained by needing to perform queries against MySQL (and not wanting to write/install any MySQL-specific code/functions). MySQL (v5+ I think) does come with a native soundex function which we'll probably end up using.
Of course, IMHO, the smart way to do it is to index the fields using Lucene, then can do n-gram and edit-distance matching, but once again, am constrained for this project.
Comments Off on Fuzzy string matching