Thread: Reading large complicated data files

  1. #1
    Registered User
    Join Date
    May 2006
    Posts
    9

    Reading large complicated data files

    Hi,

    I have a little experience with C but I am having problems with a particular data file.

    This file contains 3700+ entries split into 14 columns of mixed integer and floating-point numbers. The main problem is that not every column contains data in every row. This has given me endless problems: I would like to use the fscanf() function, but I believe every column has to be given a variable in the call. Is this the case? I only need the first six columns, and those are no problem since every entry has data in them; it is the blank stretches in the other columns that cause trouble.

    Does anyone have any other ideas as to how I could read this data?

    Thanks for your time,

    Dodzy

  2. #2
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Is your file tab delimited, or space delimited? Read a line at a time with fgets. If tab delimited, use strtok to break it into pieces, keeping track of which "piece" you're on, so you know what to do with it. If space delimited, copy N characters at a time into a buffer, where N is the size of the chunk. Then, depending on which "chunk" you are on, handle it accordingly.


    Quzah.
    Hope is the first step on the road to disappointment.

  3. #3
    Registered User hk_mp5kpdw's Avatar
    Join Date
    Jan 2002
    Location
    Northern Virginia/Washington DC Metropolitan Area
    Posts
    3,817
    I'd also suggest using fgets to read in an entire line of data. Then I'd suggest perhaps using sscanf on that line you just read in to get those first six values... you don't have to do anything else with the remainder of the data from the lines.
    "Owners of dogs will have noticed that, if you provide them with food and water and shelter and affection, they will think you are god. Whereas owners of cats are compelled to realize that, if you provide them with food and water and shelter and affection, they draw the conclusion that they are gods."
    -Christopher Hitchens

  4. #4
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    His file has the problem of not having six values, or any value for that matter in some columns. I took a quick glance, and it has something like:
    Code:
    a b c d e f
    a   c d   f
    a b       f
    Things like that, so you wouldn't always get values for some of them. For example, if we tried to apply sscanf to the above, 'f' in the third line would wind up in variable 'c', unless you did painful stuff with format and size specifiers.


    Quzah.

  5. #5
    Registered User
    Join Date
    May 2006
    Posts
    9
    Thanks for the suggestions guys. Really appreciate the help.

    Dodzy

  6. #6
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,005
    I think a bigger issue is going to be handling the way some columns run together and do not necessarily all begin on the same column.
    Code:
     245 94   63097.247   14.436 1840835.644   14.440
     245 95   61892.531    3.300 1841258.006    3.317
     245 96   60998.567    2.739 1841369.616    2.759
     245 97   61808.788    2.492 1839777.041    2.514
     245 98   63377.375    3.088 1837426.101    3.106
     245 99   66431.0    200.0  1833590.0    200.0
     245100   70214.0    275.0  1829025.0    275.0
      
     246 94   65388.508   15.2751846615.706   15.279
     246 95   64987.973   18.193 1846233.887   18.196
     246 96   62611.835    2.233 1847827.671    2.258
     246 97   63961.835   60.042 1845695.317   60.042
     246 98   64084.818    2.250 1844789.981    2.275
     246 99   67965.0    224.0  1840127.0    224.0
     246100   70124.274   39.594 1837185.816   39.595
    (Slightly trimmed snippet of input.)
    7. It is easier to write an incorrect program than understand a correct one.
    40. There are two ways to write error-free programs; only the third one works.*

  7. #7
    Registered User
    Join Date
    May 2006
    Posts
    9
    I apologise for this in advance as this will seem like a stupid question to someone who is experienced with C but I don't understand the fgets() function.

    The syntax for the fgets() function is:

    fgets(char *s, int N, FILE *stream).

    What I don't understand is the nature of the N (i.e. the max number of array elements of the character array s). No matter how low a value I assign to N it will read in the complete string (or line of numbers representing a string in my case) with no regard for my maximum array size. Is this only a result of my fgets() being the condition of a while loop?

    Sorry I am quite confused by this function and, hence, my post must also seem quite confusing!!

    Dodzy

  8. #8
    Registered User
    Join Date
    May 2006
    Posts
    9
    Quote Originally Posted by Dave_Sinkula
    I think a bigger issue is going to be handling the way some columns run together and do not necessarily all begin on the same column.
    (Slightly trimmed snippet of input.)
    Yes, that was a problem with the early version of the .txt file that I posted here; however, it has been rectified thanks to a boring morning spent separating these values.

    This should no longer be a problem!

    Dodzy

  9. #9
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,005
    Quote Originally Posted by dodzy
    What I don't understand is the nature of the N (i.e. the max number of array elements of the character array s). No matter how low a value I assign to N it will read in the complete string (or line of numbers representing a string in my case) with no regard for my maximum array size.
    man fgets, first hit:
    char *
    fgets(char * restrict str, int size, FILE * restrict stream);


    The fgets() function reads at most one less than the number of characters specified by size from the given stream and stores them in the string str. Reading stops when a newline character is found, at end-of-file or error. The newline, if any, is retained. If any characters are read and there is no error, a `\0' character is appended to end the string.
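    [edit] It stops when it runs out of room in the buffer or when it reads a newline, whichever comes first. Inside a while loop, the next call picks up where the previous one stopped, so a small N returns the line in several pieces rather than cutting it off.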
    Last edited by Dave_Sinkula; 05-17-2006 at 09:26 AM.

  10. #10
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,005
    Quote Originally Posted by dodzy
    Yes, that was a problem with the early version of the .txt file that I posted here; however, it has been rectified thanks to a boring morning spent separating these values.

    This should no longer be a problem!
    Posting the current file would allow me and others to continue playing along. I was working with this (I renamed your file for my testing).
    Code:
    #include <stdio.h>
    
    int main(void)
    {
       static const char filename[] = "mass.txt";
       FILE *file = fopen(filename, "r");
       if ( file != NULL )
       {
          char line[BUFSIZ];
          /* Read one line at a time; fgets returns NULL at end-of-file. */
          while ( fgets(line, sizeof line, file) != NULL )
          {
             int a, b;
             double c, d, e, f;
             /* %3d reads at most 3 digits, so "245100" splits into 245 and 100.
                Lines that do not yield all six values are skipped. */
             if ( sscanf(line, "%3d%3d%lf%lf%lf%lf", &a, &b, &c, &d, &e, &f) == 6 )
             {
                printf("%3d %3d %10.3f %8.3f %11.3f %8.3f\n", a, b, c, d, e, f);
             }
          }
          fclose(file);
       }
       else
       {
          perror(filename);
       }
       return 0;
    }
    Last edited by Dave_Sinkula; 05-17-2006 at 09:42 AM. Reason: Changed output line formatting.

  11. #11
    Registered User
    Join Date
    May 2006
    Posts
    9

    New .txt file

    Sorry, I had only included it as a guide; I didn't realise anyone was working with it.

    Here is the newer version.

    Cheers,

    Dodzy

  12. #12
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,005
    Thanks.

    In the couple of troublesome spots I looked at, it seems to be okay with the code I posted. My output (snipped):
    Code:
    245  94  63097.247   14.436 1840835.644   14.440
    245  95  61892.531    3.300 1841258.006    3.317
    245  96  60998.567    2.739 1841369.616    2.759
    245  97  61808.788    2.492 1839777.041    2.514
    245  98  63377.375    3.088 1837426.101    3.106
    245  99  66431.000  200.000 1833590.000  200.000
    245 100  70214.000  275.000 1829025.000  275.000
    246  94  65388.508   15.275 1846615.706   15.279
    246  95  64987.973   18.193 1846233.887   18.196
    246  96  62611.835    2.233 1847827.671    2.258
    246  97  63961.835   60.042 1845695.317   60.042
    246  98  64084.818    2.250 1844789.981    2.275
    246  99  67965.000  224.000 1840127.000  224.000
    246 100  70124.274   39.594 1837185.816   39.595

  13. #13
    Registered User
    Join Date
    May 2006
    Posts
    9

    Wow!!!

    Yeah I just compiled and ran it myself and you have no idea how pleased I am!! I did not mean for anyone to write the code for me but you have made it seem really simple!!

    Thank you very much and thank you again!

    Dodzy

  14. #14
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Quote Originally Posted by Dave_Sinkula
    I think a bigger issue is going to be handling the way some columns run together and do not necessarily all begin on the same column.
    My first reply would have handled that scenario, if we took the method for "space delimited", since the idea with it was simply to count N bytes as a "chunk". They also wouldn't have had to edit their data file for it to work.


    Quzah.

  15. #15
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,005
    Quote Originally Posted by quzah
    My first reply would have handled that scenario, if we took the method for "space delimited", since the idea with it was simply to count N bytes as a "chunk". They also wouldn't have had to edit their data file for it to work.
    Not exactly; the original file was not well formatted. Some fields were delimited neither by whitespace nor by a fixed column position. And the sscanf call handles the whitespace-delimited fields just fine too.
