## Implement Spelling Checker

1. Can anyone give me an idea of how to implement a spelling checker that suggests the nearest words, as MS Word and other editors do?

Also what are the main data structures needed?

2. Well, you could use this to find words with similar sounds:
Soundex - Wikipedia, the free encyclopedia

> Also what are the main data structures needed?
Arrays, lists, trees - you know, the usual ones.

3. Originally Posted by anirban
Can anyone give me an idea of how to implement a spelling checker that suggests the nearest words, as MS Word and other editors do?

Also what are the main data structures needed?
Near string matches can be done with a number of algorithms. The Levenshtein distance returns a single int giving the calculated distance from one string to another, and it is an easy-to-implement option. Its space requirement is only an array of size (len(string1) + 1) x (len(string2) + 1) to hold the calculated running results.

See the Wikipedia article on it. It's rather comprehensive. Make a point of reading the section on possible improvements to the algorithm. I suspect no one uses the algorithm exactly as it is presented in the pseudocode; depending on the actual requirements, one wants to optimize it. For the purpose of spell checking, you are definitely only interested in testing against a threshold value, and that is at least optimization no. 4 on that list.

That article also links to the broader concept of string matching, which you can use to study numerous other solutions.

4. Okay, but that is between two strings. Given a wrong string, do I have to calculate the distance to all the words? That would take a lot of time. Correct me if I am wrong!

5. Does MS Word assume that you always got at least the first letter correct?

6. Originally Posted by anirban
Okay, but that is between two strings. Given a wrong string, do I have to calculate the distance to all the words? That would take a lot of time. Correct me if I am wrong!
You are correct. But how else do you propose to find words near the misspelled one if you don't test against all the words in your dictionary?

In any case, spell checkers use a distance threshold to determine which words to list as being near a misspelled word. They may establish, for instance, that they will only build the suggestion list from words that have a distance of 3 or less to the misspelled word. So if I type "egadgemant", they will list engagement (distance 3), but not engorgement (distance 4). In fact, distance 3 is a common maximum-distance internal setting in spell checkers. Some allow users to change this setting, but when they don't, you can safely bet on distance 3 as a sensible general-purpose setting.

So, with this in mind, you can apply some optimizations. One of them stems from the fact that the algorithm operates on the principle of running totals: as soon as every entry in the current row is at distance 4 or more (the row minimum never decreases in later rows), you can break out of that word, since you are not interested in it anymore, and test the next word in your dictionary. Since the number of words likely to be within distance 3 is certainly much smaller than the number of words with 4 or more differences, this is a very important optimization. Still, you will be required to test all words. You can, however, avoid running the most expensive part of the algorithm altogether for most of the words in your dictionary by simply testing string lengths. Any pair of strings whose lengths differ by 4 or more has a distance of 4 or more, so those words don't even need to be tested for inserts or deletes; they can be rejected right at the beginning of the algorithm, or even before you run it, by filtering the dictionary down to words whose length is within 3 of the misspelled word's.

In any case, the Levenshtein distance is indeed a "slow" algorithm. But I don't see why you shouldn't implement it. It's not slow enough for you to notice a difference in your application's performance, unless you have critical performance requirements or are using extremely slow computers. Many spell checkers implement it in one way or another. Certainly, as you get used to it, you will also find ways to optimize it further.

edit: I use it in a way it should NOT be used and am still quite satisfied with the results. I use it to test addresses against a database of 300,000 postal codes, accepting a distance of 7 in order to suggest a list of addresses from a misspelled or incomplete address. I don't recall anymore the timings I got when I was doing performance tests (it's been a year or so), but I can tell you it runs over that database of 300,000 postal codes almost always in under 1 second on a dual core at 2.2 GHz.

7. I use a heavily generalized form of the Metaphone algorithm with great success.

But I dropped by to say that "nearest word" is locale-specific. If you don't know what this means, you are better off downloading an existing implementation than building your own.

O_o

Unless this is a learning exercise, in which case, go crazy. You'll find a list of all the data structures you might need at the link. I'm not even joking.

Soma

List of data structures - Wikipedia, the free encyclopedia