Thread: Implement Spelling Checker

  1. #1
    Anirban Ghosh
    Join Date
    Jan 2006
    Posts
    278

    Implement Spelling Checker

    Can anyone give any idea on how to implement a spelling checker that suggests nearest words as in MS-WORD and others.

    Also what are the main data structures needed?

  2. #2
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Well you could use this, to find words with similar sounds.
    Soundex - Wikipedia, the free encyclopedia

    > Also what are the main data structures needed?
    Arrays, lists, trees - you know, the usual ones.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    Quote Originally Posted by anirban View Post
    Can anyone give any idea on how to implement a spelling checker that suggests nearest words as in MS-WORD and others.

    Also what are the main data structures needed?
    Near string matches can be done with a number of algorithms. The Levenshtein Distance returns a single int with the calculated distance from one string to another and is an easy to implement option. Its space requirements are only an array to hold the calculated running results the size of (string1 + 1) x (string2 + 1).

    See the wikipedia article on it. Its rather comprehensive. Make a point of reading the section of possible improvements to the algorithm. I suspect no one uses the algorithm as it is presented in the pseudocode. Depending on the actual requirements, one wants to optimize it. For the purpose of spell checking, you definitely are only interested in testing against a threshold value, and that's at least optimization nš 4 on that list.

    That article also links you to the broader concept of string matching which you can use to study numerous other solutions.
    Last edited by Mario F.; 03-09-2011 at 06:53 AM.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  4. #4
    Anirban Ghosh
    Join Date
    Jan 2006
    Posts
    278
    Okay but is between 2 strings. But given a wrong string do I have to calculate the distances to all the words? Then it will take great time. Correct me if I am wrong!

  5. #5
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Does MS word always assume that you always got at least the first letter correct?
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  6. #6
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    Quote Originally Posted by anirban View Post
    Okay but is between 2 strings. But given a wrong string do I have to calculate the distances to all the words? Then it will take great time. Correct me if I am wrong!
    You are correct. But how else do you propose to test words near to each other if you don't test for all the words in your dictionary?

    In any case, spell checkers use a distance threshold to determine what words to list as being near to a misspelled word. They may establish, for instance, that they will only build the word suggestions list with words that have a distance of 3 or less to the misspelled word. So if I type "egadgemant", they will list engagement (distance is 3), but not engorgement (distance is 4). In fact, distance 3 is a common maximum distance internal setting when using spell checkers. Some do allow users to change this setting, but when not allowing users to change it, you can safely bet on distance 3 as a sensible general purpose setting.

    So, with this in mind, you can apply some optimizations. One of them being the fact that since the algorithm operates on the principle of running totals, as soon as you hit a distance of 4, you break out from that word (you are not interested on it anymore) and test for the next word in line in your dictionary. Since the number of words that are likely to have 3 or less distance is certainly smaller than the words that can have 4 or more differences, you are doing a very important optimization. Still, you will be required to test for all words. You can however avoid running the most expensive part of the algorithm altogether for most of the words in your dictionary, by simply testing for string length. Any string with a length difference of 4 or more has a distance of 4 or more. So, those you don't even need to test for inserts or deletes and can be rejected right at the beginning of the algorithm or even before you run it by filtering the dictionary with only those words with a length interval of 3.

    In any case, Levenshtein Distance is indeed a "slow" algorithm. But I don't see why you shouldn't implement it. It's not slow enough for you to note a difference in your application performance, except if you have critical performance requirements or are using extremely slow computers. Many spell checkers implement it in one way or another. Certainly, as you get used to it you will also find ways to further optimize it.

    edit: I use it as it should NOT be used and still I'm quite satisfied with the results. I use to test for addresses against a database of 300.000 postal codes and accept a distance of 7 in order to suggest a list of addresses from a misspelled or incomplete address. I don't recall anymore the timings I got when I was doing performance tests (it's been an year or so) but I can tell you it runs over that database of 300.000 postal codes almost always under 1 second on a dual core at 2.2 Mhz.
    Last edited by Mario F.; 03-09-2011 at 09:38 AM.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  7. #7
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    I use a heavily generalized form of the Metaphone algorithm with great success.

    But I dropped by to say that "nearest word" is locale specific. If you don't know what this means, you are better off downloading an existing implementation than building your own.

    O_o

    Unless this is a learning exercise, in which case, go crazy. You'll find a list of all the data structures you might need at the link. I'm not even joking.

    Soma

    List of data structures - Wikipedia, the free encyclopedia

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. spelling checker
    By Tom.b in forum C Programming
    Replies: 8
    Last Post: 11-12-2009, 11:17 PM
  2. Implement "Whats This" using Win32
    By hemanth.balaji in forum Windows Programming
    Replies: 1
    Last Post: 05-29-2005, 06:03 AM
  3. Spell checker poem.
    By adrianxw in forum A Brief History of Cprogramming.com
    Replies: 4
    Last Post: 01-13-2004, 10:49 AM
  4. IDEA: Spelling Checker
    By fletch in forum Contests Board
    Replies: 14
    Last Post: 08-17-2002, 11:20 PM
  5. Can't Implement Tabbing
    By Invincible in forum Windows Programming
    Replies: 10
    Last Post: 02-27-2002, 05:09 AM