Database setup for searching based on string distance
Ok, I need your help trying to come up with a satisfying method to handle the following requirement:
Access an on-site Postal Code database with a little over 300,000 records, do a real-time full-text search based on some criteria and return results ordered by relevance.
Each record is composed of 13 fields. Of interest, there's just 8.
A full street address is composed by concatenating 5 fields. This is so because a street name in Portugal can have the following format (example):
- Street Type: Rua
- 1st Preposition: do
- Title: Senhor
- 2nd Preposition: das
- Name: Pedras
In English a direct translation would read something like Street of the Lord of the Stones. Each field underlined. Some streets do not use all fields. For instance, Street Saint Michael, would use Street Type, Title, Name.
3 fields provide the actual postal code: 2750-671 Cascais
Finally, there is a door number field that is used on some cases. It names the door number range and parity (odd, even) for a given postal code, since many streets can have more than one postal code.
As the user fills a form consisting of data that includes address information, it is required they can quickly search for the correct postal code based on the information previously introduced on the address field (meaning the whole address above, excluding postal code exists in just one textbox) of that form.
The form is filled from Guarantee Coupons the customer previously filled in and sent to our company. Problem is that many customers (~77%) either don't fill the postal code fields, fill them only partially, or fill them with wrong information. This creates a problem to our data entry team which wastes considerable time searching for the correct postal code (even though I've been attenuating this over the past months with new functionality to the data entry application).
The contents of the address textbox.
Current Method being analysed
I'm currently considering proposing a method that involves transforming the current postal code table, moving it to memory, and then apply a Damerau-Levenshtein Distance algorithm on a record basis. In more detail:
- Generate a spellchecker database consisted by all unique words in the postal code database and nothing more. This will help the data entry team spot spelling mistakes. The application is already doing spellchecking so this is only a matter of applying a different spellchecker database to the address field. No sweat.
- Generate a new street address table with 4 fields. ID (foreign key to the postal code table primary key) and STREET (concatenation of the Street Type, Title and Name fields of the postal code table), DOORMIN and DORMAX (street door range. Parity is inferred).
The application loads the street address table to memory on launch and keeps it there. When the user requests a postal code search, the following steps take place:
- Parse the address textbox to exclude prepositions and other "ligatures", any punctuation, in so building the search string.
- Further parse the resulting search string to remove and store on another variable the door number, if existing.
- For each record apply Damerau-Levenshtein Distance on the STREET field vs the search string. Retrieve into a result set only those records which resulting distance fall into a previously defined maximum by the application settings.
- If existing, compare door numbers. If matching, simply do a distance--.
- Display the results ordered by relevance (min distance to max distance).
Given the size of the database, the somewhat complex nature of the search do you think this is a satisfactory method and I can move on to code this, or do you think there are better ways?