Thread: Spellchecker refinement needed

  1. #1
    Registered User
    Join Date
    Dec 2011
    Posts
    13

    Spellchecker refinement needed

    I am writing a spellchecker program and have some issues. First thing is that I need to make better use of the memory. My code iterates through a large dictionary file repeatedly. I know its better to read this into memory first then iterate through the block but I really don't know how. The search should be optimized. The other problem is ignoring the non alphanumerical characters in the problem. Would really appreciate some help with this.

    Code:
    #include <stdio.h>#include <string.h>
    #include <stdlib.h>
    
    
    int main(void)
    {
    /*Open files and test that they open*/
    FILE *fp1;
    FILE *fp2;
    char fname[20];
    char wordcheck[45];/*The longest word in the English Language is 45 letters long*/
    char worddict[45];
    char dummy;
    int i;
    int notfound;
    
    
    fp1 = fopen("C:/Users/Aaron/ProgrammingAssignment/dictionary.txt","r");
    
    
    if (fp1 == NULL)
    {
    printf("The dictionary file did not open.");
    exit(0);
    }
    
    
    printf("Please enter the path of the file you wish to check:\n");
    scanf("%s", fname);
    scanf("%c", &dummy);
    
    
    fp2 = fopen(fname, "r");
    	if (fp2 == NULL)
    		{
    		printf("Your file did not open, please check your filepath and try again.\n");
    	
    		printf("Please enter path of file you wish to check: \n");
    		scanf("%20s",fname);
    	
    		fp2 = fopen(fname, "r");
    		}
    	
    	else
    		{
    		printf("Your file opened correctly\n");
    		}
    
    
    /*When files are open, read each word from the text file into an array:*/
    	
    	while(fscanf(fp2,"%s", wordcheck)!=EOF)//Reads word from text file into array//
    	{
    
    
    		for (i=0; wordcheck[i]; i++)
    		{
    			wordcheck[i] = tolower(wordcheck[i]);//makes all characters lower case//
    		}
    			fseek(fp1,0,SEEK_SET);
    
    
    		/*printf("%s", wordcheck);//debugger*/
    		
    			while(fscanf(fp1,"%s", worddict)!=EOF)
    			{	
    				notfound = 1;
    				
    				if(strcmp(wordcheck, worddict)==0)//compare strings//
    				{
    				printf("This word: %s is in the dictionary\n", wordcheck);//debugger//
    				notfound = 0;
    				break;
    				}
    			}
    			if(notfound == 1)
    				{
    				printf("%s is not in dictionary\n", wordcheck);
    				}
    	}
    	printf("Your file has been checked. Please check any errors found");
    	fclose(fp1);
    	fclose(fp2);
    			
    return 0;
    
    }


  2. #2
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Read .... THIS

    Hint... sometimes it's very helpful to search the forums for previous threads on the same projects.

  3. #3
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    So break the problem down into a series of steps.

    - Read the file into memory
    - Read the file into memory (efficiently)
    - Simple brute force linear search
    - Efficient binary search
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  4. #4
    Registered User
    Join Date
    Dec 2011
    Posts
    13
    Thanks guys. Tater I found that post before but I don't really understand it. Binary searches, binary trees... I've only been programming for a few weeks so I'm quite lost with this. Would you guys mind giving me a few hints even on how to implement such a program?

    Thanks!

  5. #5
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    When you were a kid, or even a teen, did you ever get asked to "pick a number between 1 and 10"? Maybe who got to play quarterback, first or something?

    That's all a binary search is - but computers use "guess the number", just as well as the smartest kid on your block, (and for the same reason).

    So guess a number, I'm thinking of:
    Range is from 1 to 10
    You: 5 //smart, right in the middle
    Me: too high
    Range is now cut almost in half: 1 to 4
    You: 2 //again, right in half
    Me: too low
    Range is now again cut in half: 3 to 4
    You: 3
    Me: too low
    Range is now a single number: 4
    You: 4
    Me: Correct!

    Binary trees are beyond your scope right now. Binary search, is simply the smart kid on the block, playing "Guess the Number".

    If you don't have a binary search function in that thread (I believe it does), or you have more questions about it, post back. Note that the array of data MUST be in sorted (ascending) value. Descending order is also ok, but you have to adjust the search logic a bit.

  6. #6
    Registered User
    Join Date
    Dec 2011
    Posts
    13
    Thanks Adak,

    Ok I have managed to read my dictionary file into memory using a function. Here's the code:

    Code:
    void readfile(char *dictionary) //Reads the dictionary file into memory for optimized searching//
    {
    FILE *fp1 = fopen("dictionary.txt","rb");
    
    
    fseek(fp1,0,SEEK_END);
    float endpos = ftell(fp1);
    
    
    fseek(fp1,0,SEEK_SET);
    
    
    char *dictionary = malloc(endpos);
    fread(dictionary, endpos, 1, fp1);
    
    
    fclose(fp1);
    }
    As the dictionary is already in ascending alphabetical order, and stored in memory, am I right in saying I can scan the users text file word by word, and check each word against the dictionary using a binary search?

    Sorry if this is way off I'm trying to get to grips with it and it's due tomorrow
    Last edited by adohertyd; 12-21-2011 at 05:34 AM.

  7. #7
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    ftell() doesn't return a float, it returns a long int:
    Code:
    Syntax:
    long int ftell(FILE *stream);
    
     
    
    Declared in:
    <stdio.h>
    
     
    
    Description:
    The ftell function returns the current value of the file position indicator for the stream pointed to by stream. For a binary stream, the value is the number of characters from the beginning of the file. For a text stream, its file position indicator contains unspecified information, usable by the fseek function for returning the file position indicator for the stream to its position at the time of the ftell call; the difference between two such return values is not necessarily a meaningful measure of the number of characters written or read.
    
     
    Returns:
    The current file position on success, otherwise -1 and errno contains the error code.
    I'm confused because all I see is wordcheck[] which has only space for 45 char's. So loading a whole bunch of words from a file, isn't going to work.

    Don't you need a 2D array (with rows and columns)?

    words[0] = "a"
    words[1] = "aardvark"
    words[2] = "abba"
    etc.

    You can count each word in the word list first, if you need to make an array equal to just that number of words, or you can just say to hell with the saving memory, and make an array large enough for all the words - but use either malloc(), or declare your array in global scope (above main), so your memory will come from the large heap - local arrays declared inside a function, just get a small stack to get memory from, and it won't be enough to hold a large word list.

    The question is, with your current code, are you able to move quickly to each word, using either an index to the array[index], or a pointer (and I don't advise using the pointer way, due to your time constraints).

  8. #8
    Registered User
    Join Date
    Dec 2011
    Posts
    13
    My opinion was that I read the dictionary file into memory as I have done(I amended the float to long int). Repeat process for user file. Dictionary is already sorted alphabetically so no need to do that. Read the first word of the user file into an array(possible?) and do a binary search for that word in the dictionary.

    I'm probably way off. I'm so lost with this stuff. I know I'm asking a lot but I need a lot of help!

  9. #9
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    But can you now access each word, in that memory? Just having a word somewhere in memory isn't enough by itself, to use a binary search.

    This is the kind of thing you might study:

    There is code in here for an index logic, also - you don't need that. In tests (this code was the testing tool), the indexing did NOT provide enough advantage to use it. With slower hardware, it was quite useful, but not now. Ignore the index part of this, and study the rest of it.

    Code:
    /* 
    This is a text based (not binary), word searching program, It's purpose is educational, rather than utilitarian. It shows how to 
    automatically build an index array, and how that index array can then be used to cut down the search space, causing 30% fewer
    comparisons, in this case. Every letter of indexing you add, would increase that percentage.
    
    The indices are passed on to the binary search, to narrow the range of words in the binary search's search space. It is designed 
    to work with a large wordlist.txt file, with each word on it's own row, e.g.:
    
    a
    aardvark
    aback
    abacus
    ...
    zygote
    
    The list must be alphabetized and, for this example program, the words must have less than MAX-1 letters. The total wordList.txt 
    file must have less than 39,300 words presently.
    
    The words array wastes a lot of memory, but saving memory isn't what this program is trying to show. I did not want to distract 
    from it's purpose of showing an indexed search.
    
    This version finds every word in a text file, and searches for it, giving just a summary of the words found and not found, and 
    displaying the elapsed time of the program's run. 
    
    
    Note: This is version 0.1 
    
    The changes include:
    
    1) All newlines have been removed from the words
    
    2) Input is from a file, rather than from the user. The file is the Gettysburg
       Address, The Art Of War, Kidnapped, or A Tale of Two Cities, right now. All the books were ebooks free from Project Guttenberg.
       
    3) The run is timed by the program. 
    
    4) Educational message output is commented out from the active code.
    
    5) A larger wordlist.txt file is used, and longer words can be searched for. The wordlist file
       has been uploaded to Swoopshare at:
    
    6) The words being searched for is changed from uppercase to lowercase, if needed.
    
    7) The display is slightly more organized.
    
    Created on the Pelles C (free) Windows C compiler.
    
    Extensive testing has shown that indexing is of little use on this fast hardware (overclocked i7 cpu). More word comparisons are
    needed, but they are very cheap, time-wise. When A Tale of Two Cities can have every word checked in less than 1/10th of a 
    second, it makes no sense to work with a better algorithm. 
    
    The advantage of indexing has sharply diminished with today's powerful PC's. Definitely not worth doing unless for some reason, 
    comparisons again become more costly. 
    */
    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>
    #include <time.h>
    
    #define MAX 17 
    #define MaxWords 39300
    //longest word is 15 chars
    
    char words[MaxWords][MAX];
    
    int binarySearch(const char *, int, int);
    int getInput(char *, int);
     
    
    int main(void) {
       int c, i, idx, n, lo, hi, len, inWord=0, found, notFound, inList=0;
       FILE *fp, *fpget;
       int index[27]={0};
       char goal[MAX];
       char testchar;
       clock_t start, stop;
    
       if((fp=fopen("wordlist.txt", "r")) == NULL) {
          perror("Error: unable to open the wordlist.txt file");
          return 0;
       }
       printf("\nLoading words\n"); 
    
       //fill the index array
       testchar='a'-1;
       n=0; i=idx=0;
       while((fgets(words[n], MAX, fp)) != NULL) {
          len = strlen(words[n]);
          if(words[n][len-1]=='\n') //remove the newline
             words[n][len-1]='\0';
          //words[n][strspn(words[n], "\n")]='\0';
    
          if((words[n][0]) > testchar) {  
             testchar = words[n][0];
             index[testchar-'a'] = n;  //char - 'a' makes a=0, b=1, c=2, etc.
          }
          ++n;  //line counter
       }
       
       inList = index[26] = n-1;
       fclose(fp);
       printf("\nThe Index Array Table:\nLetter\tStarts At Line #\n"); //show the index array
       printf("======================\n");
       for(i=0;i<26;i++) { 
          printf(" %c: %10d\n", (i+'a'), index[i]); 
       }
      
       start = clock();
       putchar('\n'); 
       //get user input
       if((fpget=fopen("ArtOfWar.txt", "r")) == NULL) {   //was ATaleOfTwoCities.txt "Kidnapped.txt", "ArtOfWar.txt" "Gettysburg.txt"
           perror("Error: Unable to open the Gettysburg.txt file\n");
          return 0;
       }
       //get a word from the text file
       i=found=notFound=0;
    
       while((c=fgetc(fpget)) != EOF) {
          if(isalpha(c)) {
             inWord=1;
             goal[i]=c;
             goal[i]=tolower(goal[i]);
             //printf("%d: %c ", i, goal[i]); getchar();
             ++i;
          }
          else {
             goal[i]='\0';
             if(inWord) 
                inWord=0;
             //printf("  goal: %s \n", goal); getchar();
               
             //set the lo and hi indices
             i = goal[0]-'a';  //<-- 
             //printf("\n low word in the search: %s\nhigh word in the search: %s\n\n",words[index[i]], words[index[i+1]-1]); getchar();
             lo = index[i]; hi = index[i+1]+1;  //NOT[i+1]-1; 
             idx=binarySearch(goal, lo, hi);
             if(idx > -1) {
                // printf(" The word: %s was found: %s at index#: %d \n\n", goal, words[idx], idx);
                ++found;
             }else  {
                ////printf("%s\n", goal);
                ++notFound;
             }
             goal[0]='\0';
             i=0;       
             //getchar();
          }
       }
    
       fclose(fpget);
       stop=clock();
       printf("\n\nThe word list has %d entries in it, out of a total size of %d\n\n", inList, MaxWords);
       printf("%d words were found and %d words were not found in the wordlist\n\n", found, notFound);
       printf("Elaspsed time for this run: %.3lf seconds.\n", (double)(stop-start)/CLOCKS_PER_SEC);
       
       getchar();
       return 0;
    }
    /*
       //reset for the comparison run                <<  Reset >>
       
       start=clock();
       putchar('\n'); 
       //get input
       if((fpget=fopen("ArtOfWar.txt", "r")) == NULL) {   //was "Gettysburg.txt" ATaleOfTwoCities.txt
           perror("Error: Unable to open the Gettysburg.txt file\n");
          return 0;
       }
       //get a word from the text file
       i=found=notFound=idx=0;
       while((c=fgetc(fpget)) != EOF) {
          if(isalpha(c)) {
             inWord=1;
             goal[i]=c;
             goal[i]=tolower(goal[i]);
             //printf("%d: %c ", i, goal[i]); getchar();
             ++i;
          }
          else {
             goal[i]='\0';
             //putchar('\n');
             if(inWord) 
                inWord=0;
                                  
          lo=0;hi=n-1;
          idx+=binarySearch(goal, lo, hi);
          if(idx > -1) {
             //printf(" The word: %s was found: %s at index#: %d \n\n", goal, words[idx], idx);
             ++found;
          }else  {
            // //printf("%s zzz\n", goal);
             ++notFound;
          }
          goal[0]='\0';
          i=0;       
          //getchar();
             }
          }
       }
    
       fclose(fpget);
       stop=clock();
       printf("\n\n*The word list has %d entries in it, out of a total size of %d\n\n", inList, MaxWords);
       printf("%d words were found and %d words were not found in the wordlist\n\n", found, notFound);
       printf("Elaspsed time for this run: %.3lf seconds.\n", (double)(stop-start)/CLOCKS_PER_SEC);
       printf("\nNumber of comparisons: %d\n", idx);
       getchar();
       //goto Cut2;
       
       return 0;
    } */
    
    int binarySearch(const char goal[], int left, int right) {
       int lo, mid, hi,count=0;
       lo=left;
       hi=right;
       while (lo <= hi) {
          ++count;
          mid = (lo + hi) / 2;
          /* watch how the binary search zerooes in on the goal word  */
      
          //printf("goal: %s mid: %d  words[mid]: %s  ", goal, mid, words[mid]); 
          //printf("stringcmp returns: %d\n", strcmp(goal, words[mid]));
          //getchar();
          
          if (strcmp(goal, words[mid]) < 0)     //goal < words[mid]: -1
             hi = mid-1;
          else if(strcmp(goal, words[mid]) > 0) //goal > words[mid]:  1 
             lo = mid+1;
          else
             return mid;
       }
       return -1;
    }
       /* needs guard code to stop empty input from crashing it, etc.
          In this version, this function is never called.
       */
    
    int getInput(char goal[], int i) {
       int len;
       
       if(i) {
          printf("\nEnter a word to be searched for, up to %d letters: ", MAX-2);
          fgets(goal, MAX, stdin);  
          len = strlen(goal);
          if(goal[len-1]=='\n')
             goal[len-1]='\0';
          printf("String Length is %d\n", strlen(goal));
       }
       return 0;
    }
    
    /*
    // makes wordlistx.txt file 
    int main(void) {
       int i, j, n, len;
       FILE *in, *out;
       char goal[30];
       //char testchar;
       
       
      /* if((in=fopen("TEMPDICX", "r")) == NULL) {
          perror("Error: unable to open the wordlist.txt file");
          return 0;
       }
       */
    /*   
       if((out=fopen("wordlistx.txt", "r")) == NULL) {
          perror("Error: unable to open the wordlistx.txt file");
          return 0;
       }
       printf("\n\n");
       i=n=0;
       while((fgets(goal, 30, out)) != NULL) {
          //sscanf("%s", goal);
          //for(j=0;j<30;j++) {
          //   if(isalpha(goal[j]) || goal[j]=='-')
          //      ;
          //   else
          //      goal[j]='\0';
          //}
          
          //fprintf(out,"%s\n",goal);
          len = strlen(goal);
          if(len > i) {
             i = len;
             printf("maxlength: %d\n", i);
          }
          ++n;
          //if(n<20) {
          //   printf("%s\n",goal);
          //   getchar();
          //}
       }
       //fclose(in);
       fclose(out);
    
    
       return 0;
    }
        */
    Note the declaration of the words[][] array, above main(). That's a global array, and all arrays can access it without parameter passing. Generally, a bad thing, but here, it shines.
    Last edited by Adak; 12-21-2011 at 06:57 AM.

  10. #10
    Registered User
    Join Date
    Dec 2011
    Posts
    13
    Thanks Adak you have been a great help but I'm still none the wiser if I'm honest. Basic string searching etc is fine for me but this stuff is just over my head. I think my original slow code will have to do instead because this really is just beyond me.

    Thanks so much

  11. #11
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Unfortunately, I've over-complicated it with some code that was just convenient.

    You have a wordlist.txt file, yes?

    You declare a words[][] array, and lets say we make it big:

    words[MAXwords][MAXlen]

    where MAX's are defines:
    Code:
    #include <stdio.h>
    #include <string.h>
    
    #define MAXwords 39000
    #define MAXlen   42 
    
    char words[MAXwords][MAXlen];
    
    int binarySearch(char buffer[], int left, int right);
    
    int main(void) {
    
       int i, inList, n, numWords;
       FILE *fp;
       char buffer[50];
       fp =fopen("wordList.txt", "r");
       if(fp==NULL) {
          printf("Error: input file didn't open!\n");
          return 0;
       }
    
       //read in the words from the wordslist file
       i = 0;
       while((fgets(words[i], MAXlen-2, fp)) != NULL) {
            //leave the newline
            ++i;
       }
       fclose(fp);
       //get a word from the user
       printf("Enter a word to be spell checked: ");
       fgets(buffer, MAXlen-2, stdin);
       n=binarySearch(buffer, 0, i);
       
       if(n>-1) {//the word was found
          printf("That is correctly spelled\n");
       }
       else {
          printf("I hope one of the suggested words was helpful\n");
       }
       buffer[0]='\0'; //delete old word string
       //and this should be made into a loop for the user, so they can 
       //check multiple words, if they want to. This would be the end of the   
      //loop.
    
      //}
       
       return 0;
    }
    int binarySearch(char buffer[], int left, int right) {
       int i, lo, mid, hi,count=0;
       lo=left;
       hi=right;
       while (lo <= hi) {
          ++count;
          mid = (lo + hi) / 2; //our index to the mid point of the range
            
          if (strcmp(buffer, words[mid]) < 0)     //goal < words[mid]: -1
             hi = mid-1;
          else if(strcmp(buffer, words[mid]) > 0) //goal > words[mid]:  1 
             lo = mid+1;
          else
             return mid;
       }
       printf("Sorry, incorrect Some suggestions: \n\n");
      
       for(i=-7;i<7;i++) 
          printf("  %s \n", words[mid +i]);
    
    
       return -1;
    }
    Note that the above is completely unchecked - I've never compiled it, lt alone run it and checked it for accuracy. It's so basic, I'm not sure it satisfies the definition of a spell checker - it's more of a word checker/suggester - if there is such a thing.

    Forget the code I posted previously. I tried to quickly "touch it up", and instead I "fouled it up", in my rush of the morning.

    Good luck.
    Last edited by Adak; 12-21-2011 at 08:28 AM.

  12. #12
    Registered User
    Join Date
    Dec 2011
    Posts
    13
    Thanks Adak I see exactly what's happening now. And I suppose a word checker is the right word for my problem rather than a spellchecker. I got bogged down in a lot of stuff and this has cleared it right up for me. Think there are a few small issues with the code but that's for me to work out. Just needed a big push in the right direction!

  13. #13
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    So much for the OP learning by themselves through a series of baby steps, when yet another complete (and overly complex) example gets served up on a plate.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  14. #14
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by adohertyd View Post
    My opinion was that I read the dictionary file into memory as I have done(I amended the float to long int). Repeat process for user file. Dictionary is already sorted alphabetically so no need to do that. Read the first word of the user file into an array(possible?) and do a binary search for that word in the dictionary.

    I'm probably way off. I'm so lost with this stuff. I know I'm asking a lot but I need a lot of help!
    What you're missing is that you need a pointer to the beginning of each word so that the binary search knows where to look...

    Now that you have the entire file in memory you have several tasks...
    1) word count the file .... and add a few extra for good measure.
    2) create an array of pointers to characters with malloc() based on #1
    3) tokenize the big file adding the pointer to each word to the array from #2

    I described a far simpler (and potentially faster) method in that other thread ... HERE

    If you don't understand some of the functions used, look them up in your compiler's library documentation (If you don't have it... get it)

  15. #15
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    It appears Adak, that you didn't pay any attention to any of the stuff I posted in the last thread on this topic.

    A suggestor was definitely not required this time, and outputting some of the lexicographically nearby words will give ridiculous "suggestions" of words that are not even remotely similar. Not to mention that the code would buffer underrun or overrun.

    That's also an inefficient binary search, and storing all the words in a single array is wasteful.
    But hey on the other hand it could be far worse, holding them in memory at all is a huge improvement.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Spellchecker program
    By vice in forum C Programming
    Replies: 2
    Last Post: 12-05-2005, 05:59 PM
  2. Help needed please.
    By TaktaK in forum C++ Programming
    Replies: 7
    Last Post: 10-04-2005, 09:32 PM
  3. Help Needed !!
    By vodlum in forum C++ Programming
    Replies: 3
    Last Post: 10-03-2005, 02:33 PM
  4. Help needed.
    By hyrule in forum C++ Programming
    Replies: 2
    Last Post: 09-26-2005, 09:51 AM
  5. Creating a spellchecker
    By warny_maelstrom in forum C Programming
    Replies: 14
    Last Post: 03-12-2004, 09:37 AM

Tags for this Thread