I've been away from these boards for a really long time, but still they're the first place I thought of to come and ask for help, to throw around and play around a bit with some ideas in regards to what I think is an interesting little in-house project.
I wasn't sure where to post exactly since the subject is pretty language/platform independent, although in reality for me the platform is predefined and not up for arguments. I do hope I didn't post on the wrong board *shrug*.
Anyways, to the point.
Picture something like this:
1. You have a source that gives you around 8 million names (people's names), and other basic information (sex, DOB, place of birth).
2. A name consists of Paternal name, First Name, Middle Name and Maternal Name of which only Paternal and First are guaranteed to exist.
3. This is your only trustworthy source of information and it isn't all that trustworthy because the only information you are guaranteed for each record is the name.
4. You have many other sources of information that also give you names (again the only semi-guaranteed information is the name).
5. You've got to take the names from these last sources and pin point at best which one of the original 8 million a name represents, at worst narrow it down to a few possibilities.
6. Spelling cannot stop you from finding matches.
7. Names may be in one or two different languages.
That's about the general overview.
I'm not looking for someone to solve my problems for me, mind you. Just thought I could use some fresh out-of-the-box ideas, and maybe be told that my solution(s) is/are completely crazy or not...
Well, onto what I've got...
1. A phonetic normalization heuristic that "fixes" misspellings and other small differences in names.
2. A string matching routine that implements a combination of a couple of solid (or so I believe) comparison metrics (JaroWinkler and Levenshtein Distance).
Both of these have been tuned to the input, and have done well in all the tests I've thrown at them. I feel that these two are about as solid as they can get, except for speed, they may be optimizable in that respect. Of course, I might be wrong, but let's work with the assumption that they're good enough.
...and what I'm thinking of...
1. Taking the two pieces of data I can count on Paternal and First name, normalizing and hashing them. Building a database with these hash values where I relate them to all the records that hash to that specific value...
This is where I'm sort of stuck... when one hashes one wants to minize collisions, and all the literature I've found works on that assumption, and rightly so... but in this case I *want* collisions... I want a hash function that produces similar if not equal hashes for similar strings...
What I've thought of is to use the phonetic normalization as part of the hashing scheme, but I'd like a little more flexibility than that... I'm trying to find out if it's reasonable... that's one the reasons I'm here.. hehe...
The idea would be to "normalize" names and then hash them with a function that'll give collisions where reasonable... something between the excess of collisions given by an additive hash and the too few collision given by good hash...
I've tried a few of the "mediocre" hashes suggested at http://eternallyconfuzzled.com/tuts/hashing.html, but they're a bit "too good"... hehe...
It occurs to me that I could compartmentalize the hashes into groups of similar hashes squeasing them or hashing the hashes, I don't know... *shrug*
What do you guys think?
Thanks in advance...,