Thread: Detecting and separating single words in a text file

  1. #1
    Registered User
    Join Date
    Dec 2010
    Posts
    12

    Detecting and separating single words in a text file

    I am trying to create a simple text compression program in C that can compress .txt files. I want to replace all the "for", "and", "I" and other repetitive words with single numbers. Please help me write code that can replace the repeated words. I have written some code, but the problem is that it replaces the whole text with a single number such as 1.
    Any help would be greatly appreciated.

  2. #2
    and the hat of int overfl Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    Show us what code you have so far.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User
    Join Date
    Dec 2010
    Posts
    12
    Quote Originally Posted by Salem View Post
    Show us what code you have so far.
    Code:
    #include <conio.h>
    #include <stdio.h>
    #include <string.h>
    #include <Windows.h>
    
    void main ()
    {
    	char wd1[4]="for";
    
    	char word[5];
    	char code[2]="1";
    	FILE *fp=NULL, *ft=NULL;
    	fp=fopen("d:\\check.txt", "r+");
    	ft=fopen("d:\\comp.txt","w");
    	fgets(word,4,fp);
    	printf("%s\n", word);
    
    	strcpy(word,code);
    	//fp=fopen("d:\\check.txt", "r+");
    	
    
    	printf("%s", word);
    	
    	//fclose(fp);
    	
    	fputs(word, ft);
    	
    
    	fclose(fp);
    	fclose(ft);
    		getch();
    }

  4. #4
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Well, your program does compress the text file -- no doubt!

    Your compression scheme is an odd duck, but OK. To make it work, why not make a text file of the 100 or so most common words in English (and please, only words longer than 1 char <LOL>). This info is available on the net, btw.

    Then read these words into a two-dimensional char array, and keep the word list in sorted order (so a binary search can be used). Then check each word of the text file against your word list.

    Let's say "and", is the first word in this sorted list. So:

    Code:
    word[0] = "and"
    Now you can quickly do a binary search and test whether strcmp(word[0], YourTextFileWord) == 0 (you need to include <string.h> for this).

    Since "and" is number 0 in the common word list, you could represent it in the compressed file as <space> 0 <space>, and always be able to rebuild the file when needed by looking up word[0], see?

    You won't be able to add more words to the list of most common words once you start compressing files with this, of course, since inserting a word into the sorted list would shift the numbers that already-compressed files rely on.

    You might want to read up on text compression techniques, and see if this is really how you want to do this, but if it is, the above should give you a nudge (OK, a big nudge), in the right direction.
    Last edited by Adak; 01-05-2011 at 07:54 AM.

  5. #5
    Registered User
    Join Date
    Dec 2010
    Posts
    12
    lol, I know the logic is lame, I'm just a freshman in college and have only studied C for 3 months or so... so it is OK for me if it starts working, of course. For the same reason I don't have any idea what "binary searching" is... I would really appreciate it if you could shed some light on it. Thanks.

  6. #6
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    "Binary search" is the high tech computer programming lingo for the old kids game of "guess the number I'm thinking of, between 1 and 10".

    Seriously, that's all it is.

    I'm thinking of a number between 1 and 10. You have 5 guesses:

    You guess 5. Nope, says I, too low.

    Now comes the binary part - you can either guess higher or lower than 5. Since 5 was too low, anyone would guess higher:

    You guess 7. Nope, says I, too high.

    Again, a binary choice - higher or lower, and this time nobody would guess higher:

    You guess 6. You guessed it right!

    This simple binary guessing game is amazingly good when you have a very large number of things that have to be eliminated - say thousands or millions, instead of just 10 numbers. With the very first guess, in a quantity of a million numbers, I'll eliminate 500,000 numbers, if I do it right, won't I?

    For a very small quantity of numbers, you can search sequentially with a computer just as fast, because the buffers (caches) that computers use will hold several numbers at a time.

    So, how many words do you want to use as "common" words, to be replaced by numbers?

  7. #7
    Registered User
    Join Date
    Dec 2010
    Posts
    12
    between 10-20

  8. #8
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    So a simple sequential search will be fine.

    Got your word list of common words? I ask only because freshmen are generally smart, but well - lame.

    The most common 25 words in English:
    1. the
    2. of
    3. and
    4. a
    5. to
    6. in
    7. is
    8. you
    9. that
    10. it
    11. he
    12. was
    13. for
    14. on
    15. are
    16. as
    17. with
    18. his
    19. they
    20. I
    21. at
    22. be
    23. this
    24. have
    25. from
    "I" is no good to compress, of course. Ditto "a".
    Last edited by Adak; 01-05-2011 at 07:51 AM.

  9. #9
    Registered User \007
    Join Date
    Dec 2010
    Posts
    179
    Are you bound to C with this? This is a perfect job for Perl or Sed.

  10. #10
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by Adak View Post
    So a simple sequential search will be fine.

    Got your word list of common words? I ask only because freshmen are generally smart, but well - lame.

    The most common 25 words in English:


    "I" is no good to compress, of course. Ditto "a".
    There might be some contest on your list Adak...
    Problem is the software here won't let me print the *real* list... (Grin).

  11. #11
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by \007 View Post
    Are you bound to C with this? This is a perfect job for Perl or Sed.
    He's studying strings in C, so I'm sure he's anxious to drop it and do it in another language.


    @CT, keep that "other" list at least until he's a Sophomore.

  12. #12
    Registered User \007
    Join Date
    Dec 2010
    Posts
    179
    That's unfortunate, C is pretty rough with strings. I guess technically C doesn't even have a string type, just null-terminated character arrays... but who is keeping track of this anyway!



    I think working with characters and "strings" in plain C is rather daunting at first because the language wasn't made to handle them too well. C++ added a lot of helping features though.

    Good luck OP!

  13. #13
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    If you put the 23 words (25 minus 2 for "I" and "a") into a file named words.txt, one per line, a program like this would make sense to start with, imo:

    Code:
    #include <stdio.h>
    #include <string.h>
    
    #define NWORDS 23   /* 25 common words minus "I" and "a" */
    
    int main(void) {
      int i, n = 0;
      size_t len;
      char words[NWORDS][10];
      /* trial text for the word-separating step that comes next */
      char text[] =
        "Edelweiss, edelweiss, \n\tevery morning you greet me.\n"
        "Small and white, clean and bright, \n\tyou look happy to meet me.\n"
        "Blossom of snow, may you bloom and grow, \n\tbloom and grow, forever.\n"
        "Edelweiss, edelweiss, \n\tbless my homeland forever.";
      FILE *fp = fopen("words.txt", "r");
      if (fp == NULL) {
        printf("Error opening words.txt\n");
        return 1;
      }
      /* read one word per line; stop early if the file is short */
      for (i = 0; i < NWORDS; i++) {
        if (fgets(words[i], sizeof(words[0]), fp) == NULL)
          break;
        len = strlen(words[i]);
        if (len > 0 && words[i][len-1] == '\n') { //remove the newline, & shorten the word
          words[i][len-1] = '\0';                 //by one char
        }
        printf("\n%s", words[i]);
        n++;                                      //count the words actually read
      }
      fclose(fp);
      printf("\n\n%d words read\n", n);
      printf("\n  %s", text);
    
      printf("\n\n\t\t\t     press enter when ready\n");
    
      (void) getchar();
      return 0;
    }
    You probably know the song, but here's a nice duet with Julie Andrews and some Henry John fella you might know:

    YouTube - Edelweiss: John Denver and Julie Andrews

    The purpose of the text[] array is to serve as trial content for separating words. If you can separate words in a char array, then you'll have very few problems separating words in a file.

    And that's the next part of the program - separating the words of the song lyrics in text[]. The strtok() function (declared in string.h) can do this, but so will two nested while loops, as in this sketch, which works on the text[] array above:
    Code:
    char one[16] = "";
    int i = 0, j;
    while (text[i] != '\0') {        // while not at the end of the string
      j = 0;
      // copy chars until a space, the end of the string, or one[] is full
      while (text[i] != ' ' && text[i] != '\0' && j < (int)sizeof(one) - 1) {
        one[j] = text[i];
        ++i;
        ++j;
      }
      one[j] = '\0';                 // make one a legit string
      if (j > 0)
        printf("%s\n", one);         // print the word just collected
      if (text[i] == ' ')
        ++i;                         // step over the space, never past the '\0'
    }
    That's the basic logic. It splits on spaces only, so the commas, tabs and newlines in the lyrics stay attached to the words. If your text[] is formatted differently, you'll need to change this logic, of course.
    Last edited by Adak; 01-05-2011 at 07:21 PM.
