Thread: Need help reading from txt

  1. #1
    Registered User
    Join Date
    Nov 2012
    Location
    Heraklion, Greece, Greece
    Posts
    26

    Need help reading from txt

    Hello guys. I need to read a text file that contains a paragraph of plain text and put that paragraph into an array, where each array entry is one word. For example, the text <I am superman> would be stored in the array like this:
    Array[0]="I", Array[1]="am", Array[2]="superman"
    Thanks in advance.

  2. #2
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,613
    fscanf is actually pretty good at reading words. Just make sure that your storage is large and that the size is in the format string, like, "%1023s" for a 1KB array.
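
    A minimal sketch of that advice, assuming a 1 KB buffer. The names read_word, WORD_MAX and STR are made up for illustration; the macro only pastes the width into the format string so the two cannot drift apart.

```c
#include <stdio.h>

#define WORD_MAX 1023            /* longest word we accept (assumed limit) */
#define STR2(x) #x
#define STR(x)  STR2(x)          /* turns 1023 into the string "1023" */

/* Reads the next whitespace-delimited word from fp into buf, which must
   hold at least WORD_MAX + 1 bytes. Returns 1 if a word was read and
   0 at end of file or on a read error. */
int read_word(FILE *fp, char *buf)
{
    /* the format expands to "%1023s", so fscanf() never writes past buf */
    return fscanf(fp, "%" STR(WORD_MAX) "s", buf) == 1;
}
```

    Call it in a loop with a char buf[WORD_MAX + 1] until it returns 0. A word longer than the limit is not lost; it is simply split across two reads.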

  3. #3
    Registered User
    Join Date
    May 2012
    Posts
    505
    Firstly you need to define what you mean by a word. Is "it's" a word? What about hyphenated words?

    Then there are several ways of breaking out the words. For a non-throwaway program, it's probably best to read the entire text into a buffer in memory. Then scan through it, trying to match words. Store them in a char **, which you grow by one using realloc() on each entry. Don't worry about efficiency at this stage. Make the strings dynamic; don't corrupt your text buffer.
    The matcher can be ad hoc. It will go something like "If an alphabetical character, read until you come to a non-alphabetical. If it is a single quote or a hyphen, is it followed by an alphabetical character? If yes, treat as a letter."
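
    One way the matcher described above might look, as a rough sketch rather than a finished tokenizer. It assumes the text is already in memory as a NUL-terminated string; split_words, free_words and is_word_char are hypothetical names, and the rule for quotes and hyphens (letter on both sides) is just one possible definition of a word.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* True if text[i] belongs to a word: a letter, or an apostrophe or hyphen
   sitting between two letters (so "it's" and "well-known" stay whole). */
static int is_word_char(const char *text, size_t i)
{
    unsigned char c = (unsigned char)text[i];
    if (isalpha(c))
        return 1;
    if ((c == '\'' || c == '-') && i > 0)
        return isalpha((unsigned char)text[i - 1]) &&
               isalpha((unsigned char)text[i + 1]);
    return 0;
}

/* Frees an array returned by split_words(). */
void free_words(char **words, size_t count)
{
    for (size_t i = 0; i < count; i++)
        free(words[i]);
    free(words);
}

/* Copies each word of text into its own malloc'd string; text itself is
   never modified. Returns the array and stores its length in *count.
   Returns NULL with *count == 0 if there are no words or memory runs out. */
char **split_words(const char *text, size_t *count)
{
    char **words = NULL;
    size_t n = 0, i = 0;

    while (text[i] != '\0') {
        if (!is_word_char(text, i)) {       /* skip separators */
            i++;
            continue;
        }
        size_t start = i;
        while (is_word_char(text, i))       /* stops at '\0' as well */
            i++;
        size_t len = i - start;

        /* grow the pointer array by one slot per word */
        char **grown = realloc(words, (n + 1) * sizeof *grown);
        if (grown != NULL)
            words = grown;
        char *copy = (grown != NULL) ? malloc(len + 1) : NULL;
        if (copy == NULL) {                 /* out of memory: clean up */
            free_words(words, n);
            *count = 0;
            return NULL;
        }
        memcpy(copy, text + start, len);
        copy[len] = '\0';
        words[n++] = copy;
    }
    *count = n;
    return words;
}
```

    Growing by one slot per word is deliberately naive, in line with "don't worry about efficiency at this stage"; doubling the capacity would be the usual refinement.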

  4. #4
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    For a single paragraph, I'd use a fixed words[200][20] type of structure. Yes, it wastes some memory, but it makes the program quite trivial with fscanf() using %s as the format, and words[i] as the destination. Run the while loop as long as fscanf() doesn't return EOF.

    You'll have punctuation like commas and periods on the end of some words, but they're easy enough to remove, if present.

  5. #5
    Registered User
    Join Date
    Nov 2012
    Posts
    32
    Quote Originally Posted by Malcolm McLean View Post
    Firstly you need to define what you mean by a word. Is "it's" a word? What about hyphenated words?

    Then there are several ways of breaking out the words. For a non-throwaway program, it's probably best to read the entire text into a buffer in memory. Then scan through it, trying to match words. Store them in a char **, which you grow by one using realloc() on each entry. Don't worry about efficiency at this stage. Make the strings dynamic; don't corrupt your text buffer.
    The matcher can be ad hoc. It will go something like "If an alphabetical character, read until you come to a non-alphabetical. If it is a single quote or a hyphen, is it followed by an alphabetical character? If yes, treat as a letter."
    and write '\0' over the separator symbol after each word

  6. #6
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by Shurik View Post
    and write '\0' after each word over separator symbol
    That's not a problem if you use while(fscanf(...) != EOF) with %s as the format specifier. It automatically handles each word, and puts the end-of-string char in its proper place.

    It handles hyphens and apostrophes correctly, as well.

    The only thing (in alpha text, not other chars) that it doesn't handle is the comma, period, etc. at the end of a word or sentence. It includes those as part of the word.

    Two ways to fix it:

    1) Use an exclusion scanset.

    or

    2) Remove them from words[i] after the fscanf(), but inside the while loop, by:

    a) getting the length of the word with strlen(),

    b) then, if words[i][length-1] is not alphabetic (test it with isalpha()), overwriting that char with the end-of-string char '\0'.

    If you use a char pointer, this becomes pretty simple:

    char *pch = &words[i][length-1]; then
    Code:
    if(!isalpha(*pch)) {
       *pch = '\0';
    }
    Last edited by Adak; 12-18-2012 at 11:23 AM.
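
    Option 1, the exclusion scanset, could be sketched roughly like this. The function name, the DELIMS set and the array sizes are assumptions for illustration (the sizes follow the words[200][20] idea from earlier in the thread). Note the separate call that skips delimiters: without it, the scanset would stop at the first comma and fail to match on every call after that.

```c
#include <stdio.h>

#define MAX_WORDS 200
#define WORD_LEN  20                    /* includes the terminating '\0' */
#define DELIMS    " \t\r\n.,;:!?"       /* assumed set of word terminators */

/* Fills words[] from fp, leaving the delimiter characters out.
   Returns the number of words stored. */
int read_words_scanset(FILE *fp, char words[][WORD_LEN], int max_words)
{
    int n = 0;
    while (n < max_words) {
        /* discard any run of delimiters; a failed match here is harmless,
           so the return value is deliberately ignored */
        (void)fscanf(fp, "%*[" DELIMS "]");
        /* read up to 19 characters that are NOT delimiters
           (19 must stay in step with WORD_LEN - 1) */
        if (fscanf(fp, "%19[^" DELIMS "]", words[n]) != 1)
            break;                      /* end of file or read error */
        n++;
    }
    return n;
}
```

    Apostrophes and hyphens are not in DELIMS, so they stay inside the word; whether that is right depends on how you define a word.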

  7. #7
    Registered User
    Join Date
    Nov 2012
    Location
    Heraklion, Greece, Greece
    Posts
    26
    Hello guys, thanks for the responses.
    So far I have done this.
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    int main(){
    char **A;
    char *fn;
    int n,i;
    int cpr;
    int c=0;
    FILE *fp;
    char *str;
    
    
    
    fp=fopen("text.txt", "r");
    
    
    do{
     fscanf(fp, "%s",str);
     c++;printf("%s",str);
     printf(" ");
    }while(fscanf(fp, "%s",str)!=EOF);
    printf("%d",c);
    
    
    return 0;
    }
    When I input text.txt I get some weird results, like some words are lost. Thanks in advance.

  8. #8
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,613
    Quote Originally Posted by Konstantinos View Post
    When I input text.txt I get some weird results, like some words are lost. Thanks in advance.
    Does this even work? Somehow you got away with using absolutely no storage: str is an uninitialized pointer, so fscanf() is writing wherever it happens to point. Some people told you to use an array of arrays and a while loop.

    The fact that you have called fscanf() twice in the loop explains the missing words, because each call will read a different word.

    If you use an array of arrays you will get each word in a cell, like you want.
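
    To make both points concrete, here is a small sketch, assuming the fixed words[200][20] layout suggested earlier in the thread. read_words is a made-up helper name. There is exactly one fscanf() call per word, and it writes into real storage.

```c
#include <stdio.h>

#define MAX_WORDS 200
#define WORD_LEN  20                    /* includes the terminating '\0' */

/* Reads whitespace-separated words from fp into words[], calling
   fscanf() exactly once per word. Returns the number of words stored. */
int read_words(FILE *fp, char words[][WORD_LEN], int max_words)
{
    int count = 0;
    /* %19s leaves room for the '\0' in a 20-byte row */
    while (count < max_words && fscanf(fp, "%19s", words[count]) == 1)
        count++;
    return count;
}
```

    In main() you would declare char words[MAX_WORDS][WORD_LEN], open text.txt, check that fopen() did not return NULL, and pass the stream in. Since the read happens only in the loop condition, no word is consumed without being stored.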

  9. #9
    Registered User
    Join Date
    Nov 2012
    Location
    Heraklion, Greece, Greece
    Posts
    26
    Wow, thanks. I am doing this just for checking. fscanf reads until whitespace, therefore I use a loop in order to print all my text.

  10. #10
    Registered User
    Join Date
    Nov 2012
    Posts
    1,393
    Using fscanf is fine if you know the format in advance. But for your example "I am superman" I assume you want to be able to handle paragraphs of any length. To assign each word to an array entry you need to do tokenization. For simple purposes, a token is a word separated by delimiters. If the delimiters are space and period, then the text

    I am super-man.

    Contains the tokens "I", "am" and "super-man"

    So the basic approach becomes:

    Read a paragraph
    Tokenize the paragraph

    If you've already read the paragraph, then tokenizing can be done in this way:

    Code:
    #include <stdio.h>
    #include <string.h>
    #include <stdbool.h>
     
    #define MAXPARA 100000
    #define MAXWORDS 1000
    char *words[MAXWORDS];
    int words_count=0;
    
    const char DELIM[] = ".,! ;?@\n";
    char para[MAXPARA] = "\n\
    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. \n\
    Maecenas porttitor congue massa. Fusce posuere, magna sed \n\
    pulvinar ultricies, purus lectus malesuada libero, sit amet \n\
    commodo magna eros quis urna. Nunc viverra imperdiet enim. \n\
    Fusce est.";
    char para_s[MAXPARA];
    
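    int main(void)
    {
        // begin tokenization
        strcpy(para_s, para); // strtok() modifies its input, so work on a copy
        char *str = para_s;
        while (words_count < MAXWORDS) { // stop once the array is full
            char *tok;
            if ((tok = strtok(str, DELIM)) == NULL)
                break; // no more tokens
            str = NULL; // later calls continue from the saved position
            // save token into words array (strdup() is POSIX, not ISO C before C23)
            if ((words[words_count] = strdup(tok)) == NULL)
                break; // out of memory
            words_count++;
        }
        // end tokenization
        return 0;
    }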

  11. #11
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,613
    The *scanf approach is actually not too much different from the tokenizing approach, if you're able to see *scanf as a tokenizer itself. The scanf functions are essentially tokenizing chunks of text delimited by white space. You essentially need space for the biggest word possible, and then space for the array of words. Then call *scanf consecutively. As the function reads a new word, copy it into the array of words.

  12. #12
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by whiteflags View Post
    The *scanf approach is actually not too much different from the tokenizing approach, if you're able to see *scanf as a tokenizer itself. The scanf functions are essentially tokenizing chunks of text delimited by white space. You essentially need space for the biggest word possible, and then space for the array of words. Then call *scanf consecutively. As the function reads a new word, copy it into the array of words.
    C99, whiteflags' idea is very easy, and works well - just a slight wrinkle to remove the punctuation from the end of the words.

    For example:
    Code:
    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>
    
    #define SIZE 40
    
    int main(void) {
       int i,len;
       char *ch;
       char words[SIZE][15]={0};
       FILE *fp=fopen("Edelweiss.txt", "r");
       if(!fp) {
          printf("Error opening file!\n");
          return 1;
       }
       i=0;
       while(fscanf(fp,"%s",words[i])!= EOF) {
          len=strlen(words[i]);    //handles comma's, periods, etc.
          ch=&words[i][len-1];
          if(!isalpha(*ch)) {
             *ch='\0';
          }
          printf("%2d: %s\n",i+1,words[i]);      
          
          ++i;
          if(i>SIZE-1) break;   //not necessary, but I like it   
       }
       fclose(fp);
       return 0;
    }
    
    //Output:
     1: Edelweiss's     //just to check on apostrophe's being included
     2: edelweiss
     3: every
     4: morning
     5: you
     6: greet
     7: me
     8: Small
     9: and
    10: white
    11: clean
    12: and
    13: bright
    14: you
    15: look
    16: happy
    17: to
    18: meet
    19: me
    20: Blossom
    21: of's
    22: snow
    23: may
    24: you
    25: bloom
    26: and
    27: grow
    28: bloom
    29: and
    30: grow
    31: forever
    32: Edelweiss
    33: edelweiss
    34: bless
    35: my
    36: homeland
    37: forever

  13. #13
    Registered User
    Join Date
    Nov 2012
    Posts
    1,393
    As a rule, use the solution which works. If this approach works for you, then by all means use it. As whiteflags mentioned, this approach is equivalent to using tokenization with delimiters of " \t\n". The suggestion to remove the final non-alphabetic character from each token may be appropriate in some cases, but certainly not in all. However, this has nothing to do with tokenization.

    Also, Adak's approach above suffers from a buffer overflow if a word that is too long is encountered. I would do

    Code:
    fscanf(fp,"%14s",words[i])
    Last edited by c99tutorial; 12-19-2012 at 01:43 AM.
