Thread: Database setup for searching based on string distance

  1. #1
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446

    Database setup for searching based on string distance

    Ok, I need your help trying to come up with a satisfying method to handle the following requirement:

    Synopsis
    Access an on-site Postal Code database with a little over 300,000 records, do a real-time full-text search based on some criteria and return results ordered by relevance.

    The Database

    Each record is composed of 13 fields. Of interest, there's just 8.

    A full street address is composed by concatenating 5 fields. This is so because a street name in Portugal can have the following format (example):


    • Street Type: Rua
    • 1st Preposition: do
    • Title: Senhor
    • 2nd Preposition: das
    • Name: Pedras

    In English a direct translation would read something like Street of the Lord of the Stones. Each field underlined. Some streets do not use all fields. For instance, Street Saint Michael, would use Street Type, Title, Name.

    3 fields provide the actual postal code: 2750-671 Cascais
    Finally, there is a door number field that is used on some cases. It names the door number range and parity (odd, even) for a given postal code, since many streets can have more than one postal code.

    Context
    As the user fills a form consisting of data that includes address information, it is required they can quickly search for the correct postal code based on the information previously introduced on the address field (meaning the whole address above, excluding postal code exists in just one textbox) of that form.

    The form is filled from Guarantee Coupons the customer previously filled in and sent to our company. Problem is that many customers (~77%) either don't fill the postal code fields, fill them only partially, or fill them with wrong information. This creates a problem to our data entry team which wastes considerable time searching for the correct postal code (even though I've been attenuating this over the past months with new functionality to the data entry application).

    Search Criteria

    The contents of the address textbox.
    ......

    Current Method being analysed

    I'm currently considering proposing a method that involves transforming the current postal code table, moving it to memory, and then apply a Damerau-Levenshtein Distance algorithm on a record basis. In more detail:

    Preparation

    • Generate a spellchecker database consisted by all unique words in the postal code database and nothing more. This will help the data entry team spot spelling mistakes. The application is already doing spellchecking so this is only a matter of applying a different spellchecker database to the address field. No sweat.
    • Generate a new street address table with 4 fields. ID (foreign key to the postal code table primary key) and STREET (concatenation of the Street Type, Title and Name fields of the postal code table), DOORMIN and DORMAX (street door range. Parity is inferred).

    Execution
    The application loads the street address table to memory on launch and keeps it there. When the user requests a postal code search, the following steps take place:

    • Parse the address textbox to exclude prepositions and other "ligatures", any punctuation, in so building the search string.
    • Further parse the resulting search string to remove and store on another variable the door number, if existing.
    • For each record apply Damerau-Levenshtein Distance on the STREET field vs the search string. Retrieve into a result set only those records which resulting distance fall into a previously defined maximum by the application settings.
    • If existing, compare door numbers. If matching, simply do a distance--.
    • Display the results ordered by relevance (min distance to max distance).


    Given the size of the database, the somewhat complex nature of the search do you think this is a satisfactory method and I can move on to code this, or do you think there are better ways?
    Last edited by Mario F.; 02-19-2010 at 09:41 PM.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  2. #2
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    I understand for a question, this is too large of a post. I wasn't very lucky with the post title choice either. Just typed it and didn't give it a second thought. Can't edit it now.

    But if you are interested in text searching, string metrics and full-text database search optimizations, I'd appreciate if you gave this thread some of your free time. Cheers.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  3. #3
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    Ok, I've been implementing a basic setup to test this solution and the resulting performance seemed excessive. While acceptable from a practical POV (wouldn't take more than 3-4 seconds to come up with the results), profilling results just nagged me. This couldn't be it.

    I got a tip elsewhere and I switched to Levenshtein distance algorithm which only considers addition, subtraction and substitution and things became more acceptable and performance increased by about 60%. It also makes more sense, since Damerau-Levenshtein is well suited for spellchecking and I didn't need that because the address is already spellchecked when it comes the time to find the postal code.

    There are further improvements to be made, like setting the distance threshold and returning results from the database as they are gathered on a different thread. But this concludes our show today. Thanks for listening.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  4. #4
    Crazy Fool Perspective's Avatar
    Join Date
    Jan 2003
    Location
    Canada
    Posts
    2,640
    here's a simple solution, which can also support synonyms and stemming easily.

    Use a lucene inverted index, mapping address strings to postal codes. Use the built in approximate search with the user input, then resort the results using your edit distance function (you may also want to try q-gram distance since it's position invariant.)

    So you end up with a bunch of postal codes as "document" ids, and each can have an arbitrary amount of text associated with it (all the street address info, along with synonyms abbreviations etc..). The result of the inverted index query gives you a set of candidates very quickly (recall oriented) which you can then re-rank based on syntactic distance.

    Lucene also lets you store all the data as separate fields so there is no ambiguity as to which part of the address the search matched.

  5. #5
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    Awesomely interesting Perspective.

    I took a look at the C++ port and it's marvelous. Really handles most of my concerns. And the q-gram tip is also great. I wasn't aware of this method. It's one of the last in a long list on my current source for information and I wasn't giving it too much importance. It has a strong applicability elsewhere where I needed a fuzzy search method.

    Two thumbs up, mate!
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Unable to compare string with 'getter' returned string.
    By Swerve in forum C++ Programming
    Replies: 2
    Last Post: 10-30-2009, 05:56 PM
  2. find a string based on the location of another string
    By rlilley in forum C Programming
    Replies: 3
    Last Post: 02-19-2009, 12:29 PM
  3. String issues
    By The_professor in forum C++ Programming
    Replies: 7
    Last Post: 06-12-2007, 09:11 AM
  4. string handling
    By lessrain in forum C Programming
    Replies: 3
    Last Post: 04-24-2002, 07:36 PM