Thread: Spell Checker need help

  1. #1
    Registered User
    Join Date
    Nov 2011
    Posts
    13

    Spell Checker need help

    Hey i have a project in which i have to create a spell checker in c. I have a dictionary file named words.txt and i want the user to type in a filename from which the words are going to be compared to the dictionary's (words.txt). Is there a way i can compare the words from the dictionary file itself? i tried copying the whole dictionary in an array but something seems to be wrong with my code. It runs but for some reason it goes into an infinite loop. Could that be caused by the size of the file? if not can someone tell me what is the mistake i have done? the following code is just for copying the two files into two different arrays before comparing them.

    Code:
    #include <stdio.h>
    Code:
    #include<string.h>
    #include <stdlib.h>
    
    
    char dictionary[1000000];
    char inputfile[100000];
    
    
    int main(void) 
    {
        FILE* dict_file;
        FILE* input_file;
        int bytes_read;
        char* p;
        char filename[50];
    
    
        
        dict_file = fopen("words.txt", "r"); /* Open the dictionary file */
    if (dict_file != 0) /* Test if the file is NOT empty */
        {   
    /* Read the dictionary */
            p = dictionary;
            p = fgets(p, 1000000, dict_file);
            while (p != 0) 
            {
                while (*p != '\0') 
                { 
                    p += 1; 
                }
                p = fgets(p, 1000000, dict_file);
            }
        }
        else
        {
    printf("\nThe dictionary file is empty. Unable to open file\n");
        }
        
        
        char* pfilename = NULL;
        pfilename = (char*)malloc(50);
    printf("Please write the name of the input file\n");
        scanf("%s", filename);
        
        input_file = fopen(filename, "r");
        if (input_file != 0) 
        {
    /* Read the input file */
            p = inputfile;
            bytes_read = fread(p, 1, 1000, input_file);
            p += bytes_read;
            while (bytes_read != 0) 
            {
                bytes_read = fread(p, 1, 1000, input_file);
                p += bytes_read;
            }
            *p = 0;
        }
        else
        {
              printf("\nThe text file is empty. Unable to open file\n");
        }
    }

  2. #2
    Registered User
    Join Date
    May 2010
    Posts
    4,632
    For starters this:
    Code:
    char dictionary[1000000];
    Allocates memory for one string that can have 1000000 individual characters. You probably want an array of strings. Something like:
    Code:
    char dictionary[1000][100];
    This will allocate memory for 1000 strings of up to 100 characters.


    Jim

  3. #3
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Some stuff for you to research... Binary Search, Dynamic Arrays, Case Insensitive Compare, Phonetic Search .... and more.

    This is not a trivial project. Writing a spell checker (or other correction mechanism) that both works well and attains decent speeds is a very interesting challenge... especially when you want it to make suggestions instead of simply saying "this is wrong" on every third word...

  4. #4
    Registered User
    Join Date
    Nov 2011
    Posts
    13
    Quote Originally Posted by jimblumberg View Post
    For starters this:
    Code:
    char dictionary[1000000];
    Allocates memory for one string that can have 1000000 individual characters. You probably want an array of strings. Something like:
    Code:
    char dictionary[1000][100];
    This will allocate memory for 1000 strings of up to 100 characters.

    Jim

    wouldn't that be a small array for a dictionary array? or should i do it

    Code:
     char dictionary[1000000][100];
    thanks a lot

    Panos

  5. #5
    Registered User
    Join Date
    Nov 2011
    Posts
    13
    Quote Originally Posted by CommonTater View Post
    Some stuff for you to research... Binary Search, Dynamic Arrays, Case Insensitive Compare, Phonetic Search .... and more.

    This is not a trivial project. Writing a spell checker (or other correction mechanism) that both works well and attains decent speeds is a very interesting challenge... especially when you want it to make suggestions instead of simply saying "this is wrong" on every third word...
    yeah i know it is very complicated. and i also need it done by the next week. I don't know if thats possible

  6. #6
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Write up a small utility function and have it tell you what is the longest word in your word list. Replace the "100" with the number you get back from your longest word function.

    You can't waste memory like 100 char's for each word!

    Aren't you glad you don't need to find the longest word by counting each letter, of each word?

    If you want to have a LOT of words, you'll want to separate them into several files, and load up your array, only with the words from one file at a time. I liked the idea of separating the words according to the length of the word: words4.txt words5.txt, etc. Then, inside each file, the words are all alphabetized (again by a program), and that allows a binary search (really fast), to search for any entered word - very quick and easy.

  7. #7
    Registered User
    Join Date
    Nov 2011
    Posts
    13
    Quote Originally Posted by Adak View Post
    Write up a small utility function and have it tell you what is the longest word in your word list. Replace the "100" with the number you get back from your longest word function.

    You can't waste memory like 100 char's for each word!

    Aren't you glad you don't need to find the longest word by counting each letter, of each word?

    If you want to have a LOT of words, you'll want to separate them into several files, and load up your array, only with the words from one file at a time. I liked the idea of separating the words according to the length of the word: words4.txt words5.txt, etc. Then, inside each file, the words are all alphabetized (again by a program), and that allows a binary search (really fast), to search for any entered word - very quick and easy.
    yeah thats a very good idea.. what i'm thinking is that even though the search is being sorted in word sizes even if its faster, when i will have to find a word to suggest maybe the wrong word has extra characters or less. That means i will also have to search in other files except the one with the same size. But that could just be the one more and one less letter text files right?

  8. #8
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by panosstylianou View Post
    yeah i know it is very complicated. and i also need it done by the next week. I don't know if thats possible
    In a week you can most likely get it to somehow mark misspelled words... but I wouldn't be making a big try for much more.

    My suggestion would be that you ...

    1) Write a small utility to load, sort and rewrite your wordlist so it's in alphabetical order. You can also use this utility to find the array size; the number of words and longest word. (I believe the longest word in the english language is "Antidisestablishmentarianism"... 28 letters + a null... if that helps.)

    2) Create a quick software to load the entire word file in one piece into an array... You'll need to use malloc() for the array because it's absolutely going to be too big to allocate on the stack... probably something like... char *worlist = malloc((nwords * longest) * sizeof (char));
    (For 100,000 words that probably means an array of 100000 * 32 bytes == ~3mb.

    3) Now take the file you are going to examine and load it entirely into memory as well. (use malloc() for this too!)

    4) Create a quick algorythm to set a pointer that floats from the beginning of one word to the next... or you can tokenize the text file with strtok() building an array of pointers to each word.

    5) using a binary search algorythm (because it's amazingly fast, especially in memory) search your word list for every word in the text file... If it's not found, it's misspelled, print it on the screen; something like "Word 46 : Arduous"... If it is found, move on to the next word...

    6) when you finish that, free() all your memory and exit the program.


    Now, I'm going to give you some advice that the others will balk at...

    Yes I am advising you to load whole large files, potentially megabytes in size into memory. It's "old think" to try to do this with a gazillion small files and it's slow like pudding. On modern computers with 2 and 4 gb of ram and 32bit array indexing, it's trivial to load a 5mb file... so don't balk at the idea. The arrays will be huge but the checker will be blistering fast this way. I have software out there, in daily use that loads 150mb index files with no problems at all... so go for it. Your code will be simpler and ultimately that means it will work better.

    This much you should be able to do in a week. Beyond that, you're looking at a lot of complex search algorythms and a pop-up mechanism to display suggestions, etc... that's the hard part.

  9. #9
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    This will give you some idea how to load the dictionary file... (assuming one word per line)
    Code:
    #include <stdio.h>
    #include <string.h>
    
    
    #define DICTFILE "D:\\MyProjects\\Speller\\dictionary.txt"
    
    int main(void)
      { file *Dict = fopen(DICTFILE,"r");             // dictionary file
        int  sList = 100;                             // size of Words array
        char *Words = malloc(100 * sizeof(char*));    // main word list
        int count = 0;                                // current word count
        char word[100];
    
         while(fgets(Dict,100,word))
            { Words[count++] = strdup(word);
               if (count >= sList)
                { sList += 100;
                  Words = realloc(Words, sList * sizeof(char*)); } }
    
         fclose(Dict);
    
         // more code here
    This may not be exactly it, but it should get you on the right track. It will load a wordlist of any size, with each string the exact size for each word, so you get very efficient memory usage.
    Last edited by CommonTater; 11-03-2011 at 06:18 PM.

  10. #10
    Registered User
    Join Date
    Nov 2011
    Posts
    13
    Quote Originally Posted by CommonTater View Post
    In a week you can most likely get it to somehow mark misspelled words... but I wouldn't be making a big try for much more.

    My suggestion would be that you ...

    1) Write a small utility to load, sort and rewrite your wordlist so it's in alphabetical order. You can also use this utility to find the array size; the number of words and longest word. (I believe the longest word in the english language is "Antidisestablishmentarianism"... 28 letters + a null... if that helps.)

    2) Create a quick software to load the entire word file in one piece into an array... You'll need to use malloc() for the array because it's absolutely going to be too big to allocate on the stack... probably something like... char *worlist = malloc((nwords * longest) * sizeof (char));
    (For 100,000 words that probably means an array of 100000 * 32 bytes == ~3mb.

    3) Now take the file you are going to examine and load it entirely into memory as well. (use malloc() for this too!)

    4) Create a quick algorythm to set a pointer that floats from the beginning of one word to the next... or you can tokenize the text file with strtok() building an array of pointers to each word.

    5) using a binary search algorythm (because it's amazingly fast, especially in memory) search your word list for every word in the text file... If it's not found, it's misspelled, print it on the screen; something like "Word 46 : Arduous"... If it is found, move on to the next word...

    6) when you finish that, free() all your memory and exit the program.


    Now, I'm going to give you some advice that the others will balk at...

    Yes I am advising you to load whole large files, potentially megabytes in size into memory. It's "old think" to try to do this with a gazillion small files and it's slow like pudding. On modern computers with 2 and 4 gb of ram and 32bit array indexing, it's trivial to load a 5mb file... so don't balk at the idea. The arrays will be huge but the checker will be blistering fast this way. I have software out there, in daily use that loads 150mb index files with no problems at all... so go for it. Your code will be simpler and ultimately that means it will work better.

    This much you should be able to do in a week. Beyond that, you're looking at a lot of complex search algorythms and a pop-up mechanism to display suggestions, etc... that's the hard part.
    thanks a lot!! that was extremely helpful. I will start implementing all of these, along with your code. I hope i can do this in a week. thanks a lot again!!!!!!!

  11. #11
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    I agree with Tater, and using one sorted array of words is very fast, coupled with a binary search. On a fast PC you can ID every word from an entire book, in less than a second.

    I like the idea of having less common words in separate files, rather than part of the main words array, but that may not be necessary. It depends on the number of words you want in your spell checker, and the PC it's running on.

    Words like "adze" give rise to an elegant idea for this. Adze (a woodworking tool), is a word, but do you want to keep words like this, in your main array of words? Very Doubtful. Making the word array into a struct containing the word array and an int for popularity, will allow your program to copy the most common words, into the main words file and array, and leave the others still available, but not in the main memory of the PC. That will greatly expand the size of your word list, while keeping 95% or so of the speed.

    In your spell checking, if a word matches the word being checked, then show that - but if the word doesn't match, you'll want to show the 5 words closest to it, alphabetically, on either side of it. It's a "dumb" checker, because it won't know anything about the meaning of the word to guide the suggestions, but doing anything based on word meaning, is VERY much more difficult! I'd stay FAR away from that.
    Last edited by Adak; 11-04-2011 at 03:40 AM.

  12. #12
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by Adak View Post
    I agree with Tater, and using one sorted array of words is very fast, coupled with a binary search. On a fast PC you can ID every word from an entire book, in less than a second.

    I like the idea of having less common words in separate files, rather than part of the main words array, but that may not be necessary. It depends on the number of words you want in your spell checker, and the PC it's running on.
    A properly written binary search can locate 1 of 130,000 items in 17 tries. Moving 1/2 of the words to a separate file, only reduces the number of tries needed to 16... There's not much to be gained by that. In fact you end up with a penalty because now you have to either load two files, causing two sets of 16 tries or you end up with complicated logic to flip files and decide which file you need.

    This is why a simple "high/low" binary search is so efficient... doubling your array or file size only needs 1 more try to find stuff.

    Words like "adze" give rise to an elegant idea for this. Adze (a woodworking tool), is a word, but do you want to keep words like this, in your main array of words? Very Doubtful. Making the word array into a struct containing the word array and an int for popularity, will allow your program to copy the most common words, into the main words file and array, and leave the others still available, but not in the main memory of the PC. That will greatly expand the size of your word list, while keeping 95% or so of the speed.
    Old think, my friend... I did a test run of this last night with the 120,000 word list from an open source spell checker and the loading function I showed above... less than 3mb of space... NOT worth all the trouble to try to optimize...

    Frankly, given the way binary searches work... splitting stuff up for "efficiency" might well end up hurting overall performance.

    In your spell checking, if a word matches the word being checked, then show that - but if the word doesn't match, you'll want to show the 5 words closest to it, alphabetically, on either side of it. It's a "dumb" checker, because it won't know anything about the meaning of the word to guide the suggestions, but doing anything based on word meaning, is VERY much more difficult! I'd stay FAR away from that.
    Interesting suggestion, but I'd say that, for now, it's enough for the OP to get the file loaders and search working...

  13. #13
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    @Adak.... I don't know how you do your binary searches (probably better than I) but here's a bit of toy code you and the OP can mess with to see what I mean...

    Code:
    // simple binary search demonstration
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    
    
    #define ASIZE 100
    
    // optimized search for sorted arrays
    // give it the array, array size, and value to find
    // returns array index or -1 if not found.
    
    int BinarySearch(int *Array, int aSize, int Find)
      { int high = aSize -1;   
        int low = 0; 
        int target; 
        
        while (high >= low)   
          { target = low + ((high - low) / 2);
    
            // show tries for demonstration only
            printf("%d, ",target);
            
            if ( Array[target] == Find )
              return target;
            else if ( Find > Array[target] )
              low = target + 1;
            else if ( Find < Array[target] )
              high = target - 1;  }
        return -1; }
    
    
    
    // main demonstration program
    int main (void)
      { int num;
        int index;
        int *list = malloc(ASIZE * sizeof(int));
    
        srand(time(NULL));  // initialize randomizer
    
        // this section is for demonstration purposes only
        // you would not do this in a real application
        
        // generate an array of random numbers
        for (int i = 0; i < ASIZE; i++)
          list[i] = rand() % (ASIZE * 100);
    
         // binary searches need sorted arrays
        { int highest;
          int temp;
    
          for (int i = ASIZE - 1; i > 0; i--)
            { highest = 0;
    
              for (int j = 0; j < i; j++)
                if (list[j] > list[highest])
                  highest = j;  
              
              if (list[highest] > list[i])
                { temp = list[i];
                  list[i] = list[highest];
                  list[highest] = temp; } }
    
            printf("Here is our number list...\n");
            for (int x = 0; x < ASIZE; x++)
              printf("%d\t",list[x]);  }
    
    
        // OK! lets do some searching...
        do
          { printf("\n\nEnter a number from the list (-1 to exit) : ");
            scanf("%d",&num);
    
            // and do a binary search
            index = BinarySearch(list,ASIZE,num);
            if (index > -1)
              printf("%d was found at index %d",num,index);
            else
              printf("%d is not on the list",num); }
        while(num > 0);
    
        free(list);
        printf("\n\n");
    
        return 0; }
    Try changing the array sizes and see what it does to the number of tries...
    Doubling the array size only adds one more try... the bigger the array the more effiecient it becomes. This is why I was suggesting he just load the whole thing and use this type of search.
    Last edited by CommonTater; 11-04-2011 at 08:46 AM.

  14. #14
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    It seems that you are required to have words.txt store a list of words, but if not, an interesting approach is to store your dictionary of words in a trie. A quick search brings up this blog post on using a trie for a spell checker: How I Trie to Make Spelling Suggestions.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  15. #15
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quite true, Tater. I was thinking that 120,000 words might not be all the words he needs, and if he does need a really large dictionary, splitting out the least commonly accessed words, would be a good idea. If you can fit all the words you need into your array and still have memory enough for the PC to be responsive, then forget it -- you're golden.

    Writing out the nearest misses from a binary search is an old trick from way back when 80286's roamed the earth! I don't know whatever happened to them either - I think the TRex ate 'em!


    Tries are a very elegant solution for a spell checker. Time consuming to work out any bugs though, imo.
    Last edited by Adak; 11-04-2011 at 12:44 PM.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Spell checker help
    By MyNameIs.. in forum C Programming
    Replies: 16
    Last Post: 11-25-2009, 05:43 PM
  2. Simple spell checker
    By purplechirin in forum C Programming
    Replies: 31
    Last Post: 03-19-2008, 07:17 AM
  3. Spell Checker
    By DeepFyre in forum Tech Board
    Replies: 2
    Last Post: 02-11-2005, 12:17 PM
  4. spell checker in c needs help
    By madmax in forum C Programming
    Replies: 3
    Last Post: 03-13-2003, 09:36 AM
  5. spell checker
    By bob20 in forum Windows Programming
    Replies: 3
    Last Post: 12-03-2002, 02:35 AM