Spell Checker need help

**panosstylianou** · 11-02-2011

Hey i have a project in which i have to create a spell checker in c. I have a dictionary file named words.txt and i want the user to type in a filename from which the words are going to be compared to the dictionary's (words.txt). Is there a way i can compare the words from the dictionary file itself? i tried copying the whole dictionary in an array but something seems to be wrong with my code. It runs but for some reason it goes into an infinite loop. Could that be caused by the size of the file? if not can someone tell me what is the mistake i have done? the following code is just for copying the two files into two different arrays before comparing them.

Code:

#include <stdio.h>

Code:

#include<string.h>
#include <stdlib.h>


char dictionary[1000000];
char inputfile[100000];


int main(void) 
{
    FILE* dict_file;
    FILE* input_file;
    int bytes_read;
    char* p;
    char filename[50];


    
    dict_file = fopen("words.txt", "r"); /* Open the dictionary file */
if (dict_file != 0) /* Test if the file is NOT empty */
    {   
/* Read the dictionary */
        p = dictionary;
        p = fgets(p, 1000000, dict_file);
        while (p != 0) 
        {
            while (*p != '\0') 
            { 
                p += 1; 
            }
            p = fgets(p, 1000000, dict_file);
        }
    }
    else
    {
printf("\nThe dictionary file is empty. Unable to open file\n");
    }
    
    
    char* pfilename = NULL;
    pfilename = (char*)malloc(50);
printf("Please write the name of the input file\n");
    scanf("%s", filename);
    
    input_file = fopen(filename, "r");
    if (input_file != 0) 
    {
/* Read the input file */
        p = inputfile;
        bytes_read = fread(p, 1, 1000, input_file);
        p += bytes_read;
        while (bytes_read != 0) 
        {
            bytes_read = fread(p, 1, 1000, input_file);
            p += bytes_read;
        }
        *p = 0;
    }
    else
    {
          printf("\nThe text file is empty. Unable to open file\n");
    }
}

**jimblumberg** · 11-02-2011

For starters this:

Code:

char dictionary[1000000];

Allocates memory for one string that can have 1000000 individual characters. You probably want an array of strings. Something like:

Code:

char dictionary[1000][100];

This will allocate memory for 1000 strings of up to 100 characters.

Jim

**~~CommonTater~~** · 11-03-2011

Some stuff for you to research... Binary Search, Dynamic Arrays, Case Insensitive Compare, Phonetic Search .... and more.

This is not a trivial project. Writing a spell checker (or other correction mechanism) that both works well and attains decent speeds is a very interesting challenge... especially when you want it to make suggestions instead of simply saying "this is wrong" on every third word...

**panosstylianou** · 11-03-2011

Originally Posted by jimblumberg

For starters this:

Code:

char dictionary[1000000];

Allocates memory for one string that can have 1000000 individual characters. You probably want an array of strings. Something like:

Code:

char dictionary[1000][100];

This will allocate memory for 1000 strings of up to 100 characters.

Jim

wouldn't that be a small array for a dictionary array? or should i do it

Code:

 char dictionary[1000000][100];

thanks a lot

Panos

**panosstylianou** · 11-03-2011

Originally Posted by CommonTater

Some stuff for you to research... Binary Search, Dynamic Arrays, Case Insensitive Compare, Phonetic Search .... and more.

This is not a trivial project. Writing a spell checker (or other correction mechanism) that both works well and attains decent speeds is a very interesting challenge... especially when you want it to make suggestions instead of simply saying "this is wrong" on every third word...

yeah i know it is very complicated. and i also need it done by the next week. I don't know if thats possible

**Adak** · 11-03-2011

Write up a small utility function and have it tell you what is the longest word in your word list. Replace the "100" with the number you get back from your longest word function.

You can't waste memory like 100 char's for each word!

Aren't you glad you don't need to find the longest word by counting each letter, of each word?

If you want to have a LOT of words, you'll want to separate them into several files, and load up your array, only with the words from one file at a time. I liked the idea of separating the words according to the length of the word: words4.txt words5.txt, etc. Then, inside each file, the words are all alphabetized (again by a program), and that allows a binary search (really fast), to search for any entered word - very quick and easy.

**panosstylianou** · 11-03-2011

Originally Posted by Adak

Write up a small utility function and have it tell you what is the longest word in your word list. Replace the "100" with the number you get back from your longest word function.

You can't waste memory like 100 char's for each word!

Aren't you glad you don't need to find the longest word by counting each letter, of each word?

If you want to have a LOT of words, you'll want to separate them into several files, and load up your array, only with the words from one file at a time. I liked the idea of separating the words according to the length of the word: words4.txt words5.txt, etc. Then, inside each file, the words are all alphabetized (again by a program), and that allows a binary search (really fast), to search for any entered word - very quick and easy.

yeah thats a very good idea.. what i'm thinking is that even though the search is being sorted in word sizes even if its faster, when i will have to find a word to suggest maybe the wrong word has extra characters or less. That means i will also have to search in other files except the one with the same size. But that could just be the one more and one less letter text files right?

**~~CommonTater~~** · 11-03-2011

Originally Posted by panosstylianou

yeah i know it is very complicated. and i also need it done by the next week. I don't know if thats possible

In a week you can most likely get it to somehow mark misspelled words... but I wouldn't be making a big try for much more.

My suggestion would be that you ...

1) Write a small utility to load, sort and rewrite your wordlist so it's in alphabetical order. You can also use this utility to find the array size; the number of words and longest word. (I believe the longest word in the english language is "Antidisestablishmentarianism"... 28 letters + a null... if that helps.)

2) Create a quick software to load the entire word file in one piece into an array... You'll need to use malloc() for the array because it's absolutely going to be too big to allocate on the stack... probably something like... char *worlist = malloc((nwords * longest) * sizeof (char));
(For 100,000 words that probably means an array of 100000 * 32 bytes == ~3mb.

3) Now take the file you are going to examine and load it entirely into memory as well. (use malloc() for this too!)

4) Create a quick algorythm to set a pointer that floats from the beginning of one word to the next... or you can tokenize the text file with strtok() building an array of pointers to each word.

5) using a binary search algorythm (because it's amazingly fast, especially in memory) search your word list for every word in the text file... If it's not found, it's misspelled, print it on the screen; something like "Word 46 : Arduous"... If it is found, move on to the next word...

6) when you finish that, free() all your memory and exit the program.

Now, I'm going to give you some advice that the others will balk at...

Yes I am advising you to load whole large files, potentially megabytes in size into memory. It's "old think" to try to do this with a gazillion small files and it's slow like pudding. On modern computers with 2 and 4 gb of ram and 32bit array indexing, it's trivial to load a 5mb file... so don't balk at the idea. The arrays will be huge but the checker will be blistering fast this way. I have software out there, in daily use that loads 150mb index files with no problems at all... so go for it. Your code will be simpler and ultimately that means it will work better.

This much you should be able to do in a week. Beyond that, you're looking at a lot of complex search algorythms and a pop-up mechanism to display suggestions, etc... that's the hard part.

**~~CommonTater~~** · 11-03-2011

This will give you some idea how to load the dictionary file... (assuming one word per line)

Code:

#include <stdio.h>
#include <string.h>


#define DICTFILE "D:\\MyProjects\\Speller\\dictionary.txt"

int main(void)
  { file *Dict = fopen(DICTFILE,"r");             // dictionary file
    int  sList = 100;                             // size of Words array
    char *Words = malloc(100 * sizeof(char*));    // main word list
    int count = 0;                                // current word count
    char word[100];

     while(fgets(Dict,100,word))
        { Words[count++] = strdup(word);
           if (count >= sList)
            { sList += 100;
              Words = realloc(Words, sList * sizeof(char*)); } }

     fclose(Dict);

     // more code here

This may not be exactly it, but it should get you on the right track. It will load a wordlist of any size, with each string the exact size for each word, so you get very efficient memory usage.

**panosstylianou** · 11-03-2011

Originally Posted by CommonTater

In a week you can most likely get it to somehow mark misspelled words... but I wouldn't be making a big try for much more.

My suggestion would be that you ...

1) Write a small utility to load, sort and rewrite your wordlist so it's in alphabetical order. You can also use this utility to find the array size; the number of words and longest word. (I believe the longest word in the english language is "Antidisestablishmentarianism"... 28 letters + a null... if that helps.)

2) Create a quick software to load the entire word file in one piece into an array... You'll need to use malloc() for the array because it's absolutely going to be too big to allocate on the stack... probably something like... char *worlist = malloc((nwords * longest) * sizeof (char));
(For 100,000 words that probably means an array of 100000 * 32 bytes == ~3mb.

3) Now take the file you are going to examine and load it entirely into memory as well. (use malloc() for this too!)

4) Create a quick algorythm to set a pointer that floats from the beginning of one word to the next... or you can tokenize the text file with strtok() building an array of pointers to each word.

5) using a binary search algorythm (because it's amazingly fast, especially in memory) search your word list for every word in the text file... If it's not found, it's misspelled, print it on the screen; something like "Word 46 : Arduous"... If it is found, move on to the next word...

6) when you finish that, free() all your memory and exit the program.

Now, I'm going to give you some advice that the others will balk at...

Yes I am advising you to load whole large files, potentially megabytes in size into memory. It's "old think" to try to do this with a gazillion small files and it's slow like pudding. On modern computers with 2 and 4 gb of ram and 32bit array indexing, it's trivial to load a 5mb file... so don't balk at the idea. The arrays will be huge but the checker will be blistering fast this way. I have software out there, in daily use that loads 150mb index files with no problems at all... so go for it. Your code will be simpler and ultimately that means it will work better.

This much you should be able to do in a week. Beyond that, you're looking at a lot of complex search algorythms and a pop-up mechanism to display suggestions, etc... that's the hard part.

thanks a lot!! that was extremely helpful. I will start implementing all of these, along with your code. I hope i can do this in a week. thanks a lot again!!!!!!!

**Adak** · 11-04-2011

I agree with Tater, and using one sorted array of words is very fast, coupled with a binary search. On a fast PC you can ID every word from an entire book, in less than a second.

I like the idea of having less common words in separate files, rather than part of the main words array, but that may not be necessary. It depends on the number of words you want in your spell checker, and the PC it's running on.

Words like "adze" give rise to an elegant idea for this. Adze (a woodworking tool), is a word, but do you want to keep words like this, in your main array of words? Very Doubtful. Making the word array into a struct containing the word array and an int for popularity, will allow your program to copy the most common words, into the main words file and array, and leave the others still available, but not in the main memory of the PC. That will greatly expand the size of your word list, while keeping 95% or so of the speed.

In your spell checking, if a word matches the word being checked, then show that - but if the word doesn't match, you'll want to show the 5 words closest to it, alphabetically, on either side of it. It's a "dumb" checker, because it won't know anything about the meaning of the word to guide the suggestions, but doing anything based on word meaning, is VERY much more difficult! I'd stay FAR away from that.

**~~CommonTater~~** · 11-04-2011

Originally Posted by Adak

I agree with Tater, and using one sorted array of words is very fast, coupled with a binary search. On a fast PC you can ID every word from an entire book, in less than a second.

I like the idea of having less common words in separate files, rather than part of the main words array, but that may not be necessary. It depends on the number of words you want in your spell checker, and the PC it's running on.

A properly written binary search can locate 1 of 130,000 items in 17 tries. Moving 1/2 of the words to a separate file, only reduces the number of tries needed to 16... There's not much to be gained by that. In fact you end up with a penalty because now you have to either load two files, causing two sets of 16 tries or you end up with complicated logic to flip files and decide which file you need.

This is why a simple "high/low" binary search is so efficient... doubling your array or file size only needs 1 more try to find stuff.

Words like "adze" give rise to an elegant idea for this. Adze (a woodworking tool), is a word, but do you want to keep words like this, in your main array of words? Very Doubtful. Making the word array into a struct containing the word array and an int for popularity, will allow your program to copy the most common words, into the main words file and array, and leave the others still available, but not in the main memory of the PC. That will greatly expand the size of your word list, while keeping 95% or so of the speed.

Old think, my friend... I did a test run of this last night with the 120,000 word list from an open source spell checker and the loading function I showed above... less than 3mb of space... NOT worth all the trouble to try to optimize...

Frankly, given the way binary searches work... splitting stuff up for "efficiency" might well end up hurting overall performance.

In your spell checking, if a word matches the word being checked, then show that - but if the word doesn't match, you'll want to show the 5 words closest to it, alphabetically, on either side of it. It's a "dumb" checker, because it won't know anything about the meaning of the word to guide the suggestions, but doing anything based on word meaning, is VERY much more difficult! I'd stay FAR away from that.

Interesting suggestion, but I'd say that, for now, it's enough for the OP to get the file loaders and search working...

**~~CommonTater~~** · 11-04-2011

@Adak.... I don't know how you do your binary searches (probably better than I) but here's a bit of toy code you and the OP can mess with to see what I mean...

Code:

// simple binary search demonstration
#include <stdio.h>
#include <stdlib.h>
#include <time.h>


#define ASIZE 100

// optimized search for sorted arrays
// give it the array, array size, and value to find
// returns array index or -1 if not found.

int BinarySearch(int *Array, int aSize, int Find)
  { int high = aSize -1;   
    int low = 0; 
    int target; 
    
    while (high >= low)   
      { target = low + ((high - low) / 2);

        // show tries for demonstration only
        printf("%d, ",target);
        
        if ( Array[target] == Find )
          return target;
        else if ( Find > Array[target] )
          low = target + 1;
        else if ( Find < Array[target] )
          high = target - 1;  }
    return -1; }



// main demonstration program
int main (void)
  { int num;
    int index;
    int *list = malloc(ASIZE * sizeof(int));

    srand(time(NULL));  // initialize randomizer

    // this section is for demonstration purposes only
    // you would not do this in a real application
    
    // generate an array of random numbers
    for (int i = 0; i < ASIZE; i++)
      list[i] = rand() % (ASIZE * 100);

     // binary searches need sorted arrays
    { int highest;
      int temp;

      for (int i = ASIZE - 1; i > 0; i--)
        { highest = 0;

          for (int j = 0; j < i; j++)
            if (list[j] > list[highest])
              highest = j;  
          
          if (list[highest] > list[i])
            { temp = list[i];
              list[i] = list[highest];
              list[highest] = temp; } }

        printf("Here is our number list...\n");
        for (int x = 0; x < ASIZE; x++)
          printf("%d\t",list[x]);  }


    // OK! lets do some searching...
    do
      { printf("\n\nEnter a number from the list (-1 to exit) : ");
        scanf("%d",&num);

        // and do a binary search
        index = BinarySearch(list,ASIZE,num);
        if (index > -1)
          printf("%d was found at index %d",num,index);
        else
          printf("%d is not on the list",num); }
    while(num > 0);

    free(list);
    printf("\n\n");

    return 0; }

Try changing the array sizes and see what it does to the number of tries...
Doubling the array size only adds one more try... the bigger the array the more effiecient it becomes. This is why I was suggesting he just load the whole thing and use this type of search.

**laserlight** · 11-04-2011

It seems that you are required to have words.txt store a list of words, but if not, an interesting approach is to store your dictionary of words in a trie. A quick search brings up this blog post on using a trie for a spell checker: How I Trie to Make Spelling Suggestions.

**Adak** · 11-04-2011

Quite true, Tater. I was thinking that 120,000 words might not be all the words he needs, and if he does need a really large dictionary, splitting out the least commonly accessed words, would be a good idea. If you can fit all the words you need into your array and still have memory enough for the PC to be responsive, then forget it -- you're golden.

Writing out the nearest misses from a binary search is an old trick from way back when 80286's roamed the earth!

I don't know whatever happened to them either - I think the TRex ate 'em!

Tries are a very elegant solution for a spell checker. Time consuming to work out any bugs though, imo.

Thread: Spell Checker need help

Thread Tools

Search Thread

Display

Spell Checker need help

Similar Threads

Spell checker help

Simple spell checker

Spell Checker

spell checker in c needs help

spell checker