Database setup for searching based on string distance

Printable View

02-19-2010
Mario F.

Database setup for searching based on string distance
Ok, I need your help trying to come up with a satisfying method to handle the following requirement:

Synopsis
Access an on-site Postal Code database with a little over 300,000 records, do a real-time full-text search based on some criteria and return results ordered by relevance.

The Database
Each record is composed of 13 fields. Of interest, there's just 8.

A full street address is composed by concatenating 5 fields. This is so because a street name in Portugal can have the following format (example):
- Street Type: Rua
- 1st Preposition: do
- Title: Senhor
- 2nd Preposition: das
- Name: Pedras
In English a direct translation would read something like Street of the Lord of the Stones. Each field underlined. Some streets do not use all fields. For instance, Street Saint Michael, would use Street Type, Title, Name.

3 fields provide the actual postal code: 2750-671 Cascais
Finally, there is a door number field that is used on some cases. It names the door number range and parity (odd, even) for a given postal code, since many streets can have more than one postal code.

Context
As the user fills a form consisting of data that includes address information, it is required they can quickly search for the correct postal code based on the information previously introduced on the address field (meaning the whole address above, excluding postal code exists in just one textbox) of that form.

The form is filled from Guarantee Coupons the customer previously filled in and sent to our company. Problem is that many customers (~77%) either don't fill the postal code fields, fill them only partially, or fill them with wrong information. This creates a problem to our data entry team which wastes considerable time searching for the correct postal code (even though I've been attenuating this over the past months with new functionality to the data entry application).

Search Criteria
The contents of the address textbox.
......

Current Method being analysed

I'm currently considering proposing a method that involves transforming the current postal code table, moving it to memory, and then apply a Damerau-Levenshtein Distance algorithm on a record basis. In more detail:

Preparation
- Generate a spellchecker database consisted by all unique words in the postal code database and nothing more. This will help the data entry team spot spelling mistakes. The application is already doing spellchecking so this is only a matter of applying a different spellchecker database to the address field. No sweat.
- Generate a new street address table with 4 fields. ID (foreign key to the postal code table primary key) and STREET (concatenation of the Street Type, Title and Name fields of the postal code table), DOORMIN and DORMAX (street door range. Parity is inferred).
Execution
The application loads the street address table to memory on launch and keeps it there. When the user requests a postal code search, the following steps take place:
- Parse the address textbox to exclude prepositions and other "ligatures", any punctuation, in so building the search string.
- Further parse the resulting search string to remove and store on another variable the door number, if existing.
- For each record apply Damerau-Levenshtein Distance on the STREET field vs the search string. Retrieve into a result set only those records which resulting distance fall into a previously defined maximum by the application settings.
- If existing, compare door numbers. If matching, simply do a distance--.
- Display the results ordered by relevance (min distance to max distance).
Given the size of the database, the somewhat complex nature of the search do you think this is a satisfactory method and I can move on to code this, or do you think there are better ways?
02-20-2010
Mario F.

I understand for a question, this is too large of a post. I wasn't very lucky with the post title choice either. Just typed it and didn't give it a second thought. Can't edit it now.

But if you are interested in text searching, string metrics and full-text database search optimizations, I'd appreciate if you gave this thread some of your free time. Cheers.
02-22-2010
Mario F.

Ok, I've been implementing a basic setup to test this solution and the resulting performance seemed excessive. While acceptable from a practical POV (wouldn't take more than 3-4 seconds to come up with the results), profilling results just nagged me. This couldn't be it.

I got a tip elsewhere and I switched to Levenshtein distance algorithm which only considers addition, subtraction and substitution and things became more acceptable and performance increased by about 60%. It also makes more sense, since Damerau-Levenshtein is well suited for spellchecking and I didn't need that because the address is already spellchecked when it comes the time to find the postal code.

There are further improvements to be made, like setting the distance threshold and returning results from the database as they are gathered on a different thread. But this concludes our show today. Thanks for listening.
02-23-2010
Perspective

here's a simple solution, which can also support synonyms and stemming easily.

Use a lucene inverted index, mapping address strings to postal codes. Use the built in approximate search with the user input, then resort the results using your edit distance function (you may also want to try q-gram distance since it's position invariant.)

So you end up with a bunch of postal codes as "document" ids, and each can have an arbitrary amount of text associated with it (all the street address info, along with synonyms abbreviations etc..). The result of the inverted index query gives you a set of candidates very quickly (recall oriented) which you can then re-rank based on syntactic distance.

Lucene also lets you store all the data as separate fields so there is no ambiguity as to which part of the address the search matched.
02-24-2010
Mario F.

Awesomely interesting Perspective.

I took a look at the C++ port and it's marvelous. Really handles most of my concerns. And the q-gram tip is also great. I wasn't aware of this method. It's one of the last in a long list on my current source for information and I wasn't giving it too much importance. It has a strong applicability elsewhere where I needed a fuzzy search method.

Two thumbs up, mate!