Thread: spell check in C using a dictionary file

  1. #1
    Registered User
    Join Date
    Nov 2004
    Posts
    73

    spell check in C using a dictionary file

    I need to create a C program that spell-checks the words in a data file I created, called "data.txt". It does this by checking whether each word in the data file matches a word in "words.txt", a dictionary file of accepted English words.

    If a word in the data file does not match any word in the dictionary file, the program should print the misspelled word along with the line number where it appears in the data file. Strings in the data file that are purely numeric, such as '1980', should not be checked. The spell check should only be performed on alphanumeric data, and each word should be compared against every word of the dictionary file until either a match is found or the end of the file is reached.

    I have a vague idea of how to do this, shown below in pseudocode:

    Code:
    open files
    while data.txt and words.txt are not EOF { 
        // check the word in the data file with every word in the dictionary file
       if (word[data.txt] != word[words.txt]) {  
          print the incorrectly spelled word
          print the line number of that word
       }
    }
    I am having trouble turning this into functional C code. Any advice or suggestions would be greatly appreciated.
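
    For reference, one way the pseudocode could be turned into C is sketched below. It is only a sketch under some assumptions: the dictionary is already in memory as an array of strings, words in the data file are separated by whitespace and common punctuation, and matching is case-sensitive. The names in_dictionary, is_number and check_line are made up for illustration. Note that a word is reported only after the whole dictionary has been scanned without a match, not on the first dictionary entry that differs.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Linear scan: compare the word with every dictionary entry until a
   match is found or the end of the list is reached. */
static int in_dictionary(const char *word, const char *const dict[], size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(word, dict[i]) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Returns 1 if the token consists only of digits, e.g. "1980". */
static int is_number(const char *tok)
{
    if (*tok == '\0') {
        return 0;
    }
    for (; *tok != '\0'; ++tok) {
        if (!isdigit((unsigned char)*tok)) {
            return 0;
        }
    }
    return 1;
}

/* Splits one line of the data file into words and reports every word
   that is missing from the dictionary, together with the line number.
   The buffer is modified by strtok. Returns the number of words reported. */
int check_line(char *line, int line_no,
               const char *const dict[], size_t n, FILE *out)
{
    static const char delims[] = " \t\r\n.,;:!?\"()";  /* assumed separators */
    int misses = 0;

    for (char *tok = strtok(line, delims); tok != NULL;
         tok = strtok(NULL, delims)) {
        if (is_number(tok)) {
            continue;               /* purely numeric strings are not checked */
        }
        if (!in_dictionary(tok, dict, n)) {
            fprintf(out, "line %d: %s\n", line_no, tok);
            ++misses;
        }
    }
    return misses;
}
```

    In a full program, an outer loop would read data.txt one line at a time with fgets, keep a running line counter, and pass each line to check_line.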

    P.S. I have included the data file that I created called "data.txt" as a reference. I can't include the dictionary file as an attachment because it is WAY too big but I can tell you that the dictionary file displays only one word on a line and it has 45,427 words and 45,427 lines in the file.

  2. #2
    Registered User
    Join Date
    Nov 2004
    Posts
    2
    Can you ZIP compress the dictionary file then attach it?

  3. #3
    ~viaxd() viaxd's Avatar
    Join Date
    Aug 2003
    Posts
    246
    // check the word in the data file with every word in the dictionary
    That's way too slow. Imagine if both your lists were 100,000 words each.
    One thing you can do to improve performance is make sure that the word list is sorted and then just do a binary search...
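
    A minimal sketch of that binary search, using the standard library's bsearch. It assumes the dictionary has already been loaded into an array of strings and that the array is sorted in strcmp (byte) order; a list sorted case-insensitively would need a matching comparison function. The names cmp_word and dict_contains are made up for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Comparison callback for bsearch. The first argument is the key (a C
   string); the second points at an array element (a pointer to a string). */
static int cmp_word(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char *const *)elem);
}

/* Returns 1 if word occurs in the sorted array dict of n strings. */
int dict_contains(const char *word, const char *const dict[], size_t n)
{
    return bsearch(word, dict, n, sizeof dict[0], cmp_word) != NULL;
}
```

    For a sorted list of 45,427 words this needs at most about 16 string comparisons per lookup, compared with up to 45,427 for a linear scan.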
    :wq

  4. #4
    Registered User
    Join Date
    Nov 2004
    Posts
    73

    Re: Dictionary file in ZIP

    Here is the dictionary file called "words.txt" in a ZIP archive. This should help.

  5. #5
    Registered User
    Join Date
    Nov 2004
    Posts
    73

    Re:

    To respond to what viaxd said, the dictionary file is sorted. I have now posted it as a ZIP archive.

  6. #6
    Registered User
    Join Date
    Nov 2004
    Posts
    73

    Re:

    I think the strstr() function has to be used in some kind of loop to compare the two files and see whether a word from the data.txt file matches a word from the dictionary file. I don't quite know how to put that into working code, though.
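
    One thing worth keeping in mind here: strstr() looks for a substring anywhere inside another string, so "cat" is found inside "concatenate". For a whole-word match against a one-word-per-line dictionary, strcmp() after removing the newline that fgets leaves behind is probably closer to what is needed. A small sketch of the difference (the helper names chomp, same_word and contains are made up for illustration):

```c
#include <string.h>

/* Removes the trailing newline (and carriage return) left by fgets, if any. */
void chomp(char *s)
{
    s[strcspn(s, "\r\n")] = '\0';
}

/* Whole-word comparison: true only if both strings are identical. */
int same_word(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* Substring test, shown for contrast. */
int contains(const char *haystack, const char *needle)
{
    return strstr(haystack, needle) != NULL;
}
```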

  7. #7
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Hm... It would have to be a text file, wouldn't it? Is it one word per line? How is the dictionary file set up? You could speed it up a bit I suppose, assuming you have to use it as a text file and can't convert it to a binary, if each word is on its own line. Build an index of where each new alphabet letter first starts.

    A is on line 1 (zero if you're using array-indexing notation).
    B is on line 348
    C is on line ...

    Skip ahead by simply calling fgets N times until you're at the index of the letter with which your word in question starts. Now start comparing by looping through with fgets calls until you find your word.
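
    A rough sketch of that index, assuming a sorted dictionary with one word per line, ASCII letters, and no line longer than MAX_WORD - 1 characters (MAX_WORD and both function names are made up for illustration). The index stores zero-based line numbers, and the lookup skips ahead with repeated fgets calls as described above.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORD 64   /* assumed upper bound on the length of a dictionary line */

/* Records, for each letter a..z, the zero-based line on which the first
   word starting with that letter appears, or -1 if there is none. */
void build_letter_index(FILE *fp, long first_line[26])
{
    char buf[MAX_WORD];
    long line_no = 0;

    for (int i = 0; i < 26; ++i) {
        first_line[i] = -1;
    }
    rewind(fp);
    while (fgets(buf, sizeof buf, fp) != NULL) {
        int c = tolower((unsigned char)buf[0]);
        if (c >= 'a' && c <= 'z' && first_line[c - 'a'] == -1) {
            first_line[c - 'a'] = line_no;
        }
        ++line_no;
    }
}

/* Returns 1 if word is in the file, 0 otherwise. Only the section of
   the file that starts with the same letter is compared. */
int lookup_word(FILE *fp, const long first_line[26], const char *word)
{
    char buf[MAX_WORD];
    int c = tolower((unsigned char)word[0]);

    if (c < 'a' || c > 'z' || first_line[c - 'a'] < 0) {
        return 0;
    }
    rewind(fp);
    for (long i = 0; i < first_line[c - 'a']; ++i) {   /* skip ahead */
        if (fgets(buf, sizeof buf, fp) == NULL) {
            return 0;
        }
    }
    while (fgets(buf, sizeof buf, fp) != NULL) {
        buf[strcspn(buf, "\r\n")] = '\0';              /* drop the newline */
        if (tolower((unsigned char)buf[0]) != c) {
            break;                                     /* left this letter's section */
        }
        if (strcmp(buf, word) == 0) {
            return 1;
        }
    }
    return 0;
}
```

    Skipping by line count still reads every earlier line on each lookup. Storing the byte offset from ftell and jumping with fseek would avoid that, at the cost of departing slightly from the fgets-only approach.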

    Now, that'll only work for matching words, after you squash the case. If you have to spell check the word, that'll be a bit trickier. Hell, I stump http://dictionary.reference.com all the time.
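
    Squashing the case could look something like this, assuming the dictionary is stored in lower case; to_lower_in_place is a made-up helper name:

```c
#include <ctype.h>

/* Converts every character of s to lower case, in place, so that
   "The" and "the" compare equal afterwards. */
void to_lower_in_place(char *s)
{
    for (; *s != '\0'; ++s) {
        *s = (char)tolower((unsigned char)*s);
    }
}
```

    The cast to unsigned char matters: passing a negative char value to tolower is undefined behaviour.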

    Quzah.
    Hope is the first step on the road to disappointment.

  8. #8
    Registered User
    Join Date
    Nov 2004
    Posts
    73

    Re:

    The "data.txt" data file which is being checked for spelling errors has multiple words on a line but the dictionary file "words.txt" only has one word per line.

    I'll try to piece together something using your advice Quzah. Thanks.

  9. #9
    Registered User
    Join Date
    Nov 2004
    Posts
    73

    Re:

    I've loaded the dictionary file into a large array:

    FILE *speller;

    while(fgets(spellchecker, 50000, speller)); which is a necessity

    and now I have to use strstr() to compare the words and check if there is a match or not.

    I am still stuck on how to go about doing the looping involving strstr() to check between the two files.
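
    A note on the loop above: each fgets call reads a single line and overwrites whatever the buffer held before, so after the loop finishes the buffer contains only the last line of the file, not the whole dictionary. To keep every word, each line needs its own slot. A sketch under the assumption that no word is longer than MAX_WORD_LEN - 1 characters (MAX_WORD_LEN and load_words are made-up names):

```c
#include <stdio.h>
#include <string.h>

#define MAX_WORD_LEN 32   /* assumed longest word plus terminator */

/* Reads one word per line from fp into words[0] .. words[max_words - 1].
   Blank lines are skipped. Returns the number of words stored. */
size_t load_words(FILE *fp, char words[][MAX_WORD_LEN], size_t max_words)
{
    size_t n = 0;

    while (n < max_words && fgets(words[n], MAX_WORD_LEN, fp) != NULL) {
        words[n][strcspn(words[n], "\r\n")] = '\0';   /* drop the newline */
        if (words[n][0] != '\0') {
            ++n;                                      /* keep non-empty lines only */
        }
    }
    return n;
}
```

    The array itself could be declared as static char words[50000][MAX_WORD_LEN]; at file scope. That is about 1.6 MB, which is usually too much for a local variable on the stack.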

    I'll keep trying stuff though.

  10. #10
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,005
    Hmmm. Maybe...
    Code:
    #include <stdio.h>
    #include <string.h>
    
    int main(void)
    {
       const char filename[] = __FILE__, keyword[] = "line";
       FILE *file = fopen(filename, "r");
       if ( file )
       {
          char line [ 256 ];
          while ( fgets(line, sizeof line, file) )
          {
             char *word = line;
             while ( (word = strstr(word, keyword)) != NULL )
             {
                int i;
                fputs(line, stdout);
                for (i = 0; i < word - line; ++i)
                {
                   putchar(' ');
                }
                puts("^");
                word += strlen(keyword);
             }
          }
          fclose(file);
       }
       else
       {
          perror(filename);
       }
       return 0;
    }
    My output is then...
    Code:
       const char filename[] = __FILE__, keyword[] = "line";
                                                      ^
          char line [ 256 ];
               ^
          while ( fgets(line, sizeof line, file) )
                        ^
          while ( fgets(line, sizeof line, file) )
                                     ^
             char *word = line;
                          ^
                fputs(line, stdout);
                      ^
                for (i = 0; i < word - line; ++i)
                                       ^
    7. It is easier to write an incorrect program than understand a correct one.
    40. There are two ways to write error-free programs; only the third one works.*

  11. #11
    Registered User
    Join Date
    Nov 2004
    Posts
    73

    Re:

    Thanks for the suggestion, Dave. I'll take a look at what that does.
