Thread: efficiency in file operations

  1. #1
    Registered User stormbringer
    Join Date
    Jul 2002
    Posts
    90

    efficiency in file operations

    hi

    I have to write a parser that replaces text in a file, finds several tokens, and so on. Now my question: there are several ways to access a file. I could use fgetc and work with single chars (the direct and "easy" way), or I could use functions that read a lot of data at once, place the bytes read in a buffer, and then work on the buffer. An example to make clear what I mean (if it isn't yet): to copy a file I can fputc(fgetc(f)), or I could fgets some bytes, store them in a buffer, and then put them.
    Which way is faster (because disk operations are slow)? Is fgets or fgetc faster?

  2. #2
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Your hard disk has a cache
    The operating system has a cache for the currently open files
    Your standard C library also has a buffer for the currently open files
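    Of those three layers, the C library's buffer is the one a program can adjust directly. A minimal sketch, assuming a hypothetical helper name (count_bytes_buffered) and a caller-chosen buffer size: setvbuf is standard C and has to be called after fopen and before any other operation on the stream. The sketch supplies its own buffer because some libraries may ignore the size argument when the buffer pointer is NULL.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: counts the bytes in a file with fgetc(), after
 * asking stdio to use a caller-sized buffer for the stream.
 * bufsize should be greater than zero.
 * Returns the byte count, or -1 if the file cannot be opened. */
long count_bytes_buffered(const char *path, size_t bufsize)
{
    FILE *in = fopen(path, "rb");
    char *buf;
    long count = 0;
    int ch;

    if (in == NULL)
        return -1;

    /* setvbuf must come before the first read on the stream.
     * If malloc or setvbuf fails, the stream keeps its default buffer. */
    buf = malloc(bufsize);
    if (buf != NULL)
        setvbuf(in, buf, _IOFBF, bufsize);

    /* Each fgetc() is served from the buffer; the library refills it
     * from the operating system only when it runs empty. */
    while ((ch = fgetc(in)) != EOF)
        count++;

    fclose(in);
    free(buf);   /* only after fclose(): the stream may use buf until then */
    return count;
}
```

    Calling count_bytes_buffered("4a.txt", 65536) would read the file through a 64 KiB buffer; whether that is measurably faster than the default is something to time rather than assume.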

    You can be pretty damn sure that each fgetc call does NOT result in your process going to sleep whilst it waits a few ms for the disk to deliver the goods.

    IMO, the only reason fgets will be faster is that it is only one function call vs. the many you make with fgetc. At some point both calls will get caught up waiting for the disk when the various internal buffers are empty, at which point the milliseconds this takes will swamp the microseconds of difference between making a few function calls.
    And the time taken to read a char is likely to be small (either way) in comparison to the rest of the work your program has to perform.

    Write the program, then start measuring the performance using a profiler. Bottlenecks are not always in the obvious places...

    Or just test it
    Code:
    #include <stdio.h>
    #include <unistd.h>   /* sleep() */
    
    // copy using fread
    void copy1 ( const char *infile, const char *outfile ) {
        FILE   *in = fopen( infile, "rb" );
        FILE   *out= fopen( outfile, "wb" );
        char    buff[BUFSIZ];
        size_t  n;
        if ( in != NULL && out != NULL ) {
            while ( (n=fread(buff,1,BUFSIZ,in)) != 0 ) {
                fwrite( buff, 1, n, out );
            }
        }
        if ( in != NULL ) fclose( in );
        if ( out != NULL ) fclose( out );
    }
    
    // copy using fgets on each line
    void copy2 ( const char *infile, const char *outfile ) {
        FILE   *in = fopen( infile, "rt" );
        FILE   *out= fopen( outfile, "wt" );
        char    buff[BUFSIZ];
        if ( in != NULL && out != NULL ) {
            while ( fgets( buff, BUFSIZ, in ) != NULL ) {
                fputs( buff, out );
            }
        }
        if ( in != NULL ) fclose( in );
        if ( out != NULL ) fclose( out );
    }
    
    // copy using fgetc on each char
    void copy3 ( const char *infile, const char *outfile ) {
        FILE   *in = fopen( infile, "rt" );
        FILE   *out= fopen( outfile, "wt" );
        int     ch;
        if ( in != NULL && out != NULL ) {
            while ( (ch=fgetc(in)) != EOF ) {
                fputc( ch, out );
            }
        }
        if ( in != NULL ) fclose( in );
        if ( out != NULL ) fclose( out );
    }
    
    /*
     * This macro found on
     * http://www.c-for-dummies.com/compilers/djgpp_asm.html
     */
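    /* Reads the x86 time-stamp counter into llptr. GCC on 32-bit x86 only:
     * the "=A" constraint means the edx:eax pair only on 32-bit targets.
     * eax/edx are not repeated as clobbers because newer GCC versions
     * reject clobbers that overlap an output operand. */
    #define RDTSC(llptr) ({ \
            __asm__ __volatile__ ( \
            ".byte 0x0f; .byte 0x31" \
            : "=A" (llptr) ); })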
    
    int main ( void ) {
        unsigned long long a, b;
        RDTSC(a);
        RDTSC(b);
        printf( "Overhead: %llu -> %llu = %10llu\n", a, b, b-a );
        RDTSC(a);
        sleep(1);   /* calibration: cycles per second */
        RDTSC(b);
        printf( "Sleep(1): %llu -> %llu = %10llu\n", a, b, b-a );
        RDTSC(a);
        copy1( "4a.txt", "5a.txt" );
        RDTSC(b);
        printf( "Copy1:    %llu -> %llu = %10llu\n", a, b, b-a );
        RDTSC(a);
        copy2( "4b.txt", "5b.txt" );
        RDTSC(b);
        printf( "Copy2:    %llu -> %llu = %10llu\n", a, b, b-a );
        RDTSC(a);
        copy3( "4c.txt", "5c.txt" );
        RDTSC(b);
        printf( "Copy3:    %llu -> %llu = %10llu\n", a, b, b-a );
        return 0;
    }
    
    I got these results
    Overhead: 6286077368030 -> 6286077368103 =         73
    Sleep(1): 6286077462890 -> 6286307919537 =  230456647
    Copy1:    6286308653110 -> 6286315490382 =    6837272
    Copy2:    6286315587619 -> 6286348807055 =   33219436
    Copy3:    6286349278809 -> 6286386417099 =   37138290
    On my machine, that's like 0.01 of a second slower to process a 1/2 MB file, using fgetc instead of fgets. If you're really determined to do something, then just fread() the whole file into memory in one go.
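    The "fread() the whole file in one go" idea can be sketched as follows. This is only an illustration under stated assumptions: the helper name read_whole_file is made up, the file must fit in memory, its size must fit in a long, and finding the size with fseek/ftell is assumed to work for regular files opened in binary mode (true on common platforms, not guaranteed by the C standard).

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads an entire file into a malloc'd buffer.
 * On success returns the buffer (the caller frees it) and stores the
 * length in *len_out; returns NULL on any error. */
char *read_whole_file(const char *path, size_t *len_out)
{
    FILE *in = fopen(path, "rb");
    char *data;
    long size;

    if (in == NULL)
        return NULL;

    /* Find the file size by seeking to the end. */
    if (fseek(in, 0L, SEEK_END) != 0 || (size = ftell(in)) < 0) {
        fclose(in);
        return NULL;
    }
    rewind(in);

    /* +1 so that an empty file does not turn into malloc(0). */
    data = malloc((size_t)size + 1);
    if (data != NULL) {
        if (fread(data, 1, (size_t)size, in) == (size_t)size) {
            *len_out = (size_t)size;
        } else {
            free(data);      /* short read: treat as failure */
            data = NULL;
        }
    }
    fclose(in);
    return data;
}
```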
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User stormbringer
    Join Date
    Jul 2002
    Posts
    90
    well, i'm working on 2gb log files. would be alittle much to read in memory :-)

    thanks
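
    For files too large to hold in memory, the same fread() pattern works on fixed-size chunks, so memory use stays constant whatever the file size. A minimal sketch, assuming a hypothetical helper name (count_byte_chunked) and an arbitrary 64 KiB chunk; counting one byte value stands in for the real parsing work. A real token parser would also have to handle tokens that straddle a chunk boundary, which this sketch does not attempt.

```c
#include <stdio.h>

#define CHUNK_SIZE (64 * 1024)   /* arbitrary choice for this sketch */

/* Hypothetical helper: counts occurrences of the byte `target` in a
 * file, reading CHUNK_SIZE bytes at a time.
 * Returns the count, or -1 on an open or read error. */
long long count_byte_chunked(const char *path, unsigned char target)
{
    static unsigned char buf[CHUNK_SIZE];   /* static: kept off the stack */
    FILE *in = fopen(path, "rb");
    long long count = 0;
    size_t n, i;

    if (in == NULL)
        return -1;

    /* fread() returns how many bytes it actually delivered; the last
     * chunk is usually shorter than CHUNK_SIZE. */
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        for (i = 0; i < n; i++)
            if (buf[i] == target)
                count++;
    }

    if (ferror(in))
        count = -1;
    fclose(in);
    return count;
}
```

    For example, count_byte_chunked("big.log", '\n') would count the lines of a (hypothetical) log file without ever holding more than 64 KiB of it.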

  4. #4
    Registered User
    Join Date
    Oct 2005
    Posts
    6
    By the way, how come you've got 2GB log files?
    What are you actually trying to do?

  5. #5
    Unregistered User
    Join Date
    Sep 2005
    Location
    Antarctica
    Posts
    341
    Reading and writing a char at a time, especially on large files, is extremely slow. You need to do things with buffers; the larger the buffer, the better.

  6. #6
    Gawking at stupidity
    Join Date
    Jul 2004
    Location
    Oregon, USA
    Posts
    3,218
    The only time I'd suppose that using your own buffer would be faster is if the data you need to read is larger than BUFSIZ. The standard I/O library needs to make a system call every time it needs to refill its buffer, so if you can use low-level I/O calls instead, then it will be faster to do that in chunks larger than BUFSIZ than to use the standard library functions. After all, system calls are relatively slow, and if you can make just one call to read() instead of two, then you can probably beat the standard library functions in terms of speed.
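    A sketch of that low-level approach on a POSIX system, assuming a hypothetical helper name (copy_lowlevel) and an arbitrary 1 MiB buffer. Both read() and write() may transfer fewer bytes than requested, so the write side loops until the whole chunk is out. Retrying after EINTR is left out to keep the sketch short.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define COPY_BUF_SIZE (1024 * 1024)   /* arbitrary; larger than a typical BUFSIZ */

/* Hypothetical helper: copies infile to outfile with read()/write().
 * Returns 0 on success, -1 on any error. */
int copy_lowlevel(const char *infile, const char *outfile)
{
    int in = -1, out = -1, rc = -1;
    char *buf = malloc(COPY_BUF_SIZE);
    ssize_t n = -1;

    if (buf == NULL)
        return -1;

    in = open(infile, O_RDONLY);
    if (in < 0)
        goto done;
    out = open(outfile, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0)
        goto done;

    /* One read() system call per chunk of up to COPY_BUF_SIZE bytes. */
    while ((n = read(in, buf, COPY_BUF_SIZE)) > 0) {
        ssize_t off = 0;
        /* write() may accept only part of the chunk; keep going. */
        while (off < n) {
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w < 0)
                goto done;
            off += w;
        }
    }
    if (n == 0)          /* read() returned 0: clean end of file */
        rc = 0;

done:
    if (in >= 0)
        close(in);
    if (out >= 0)
        close(out);
    free(buf);
    return rc;
}
```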
    If you understand what you're doing, you're not learning anything.

  7. #7
    Gawking at stupidity
    Join Date
    Jul 2004
    Location
    Oregon, USA
    Posts
    3,218
    fread() and fwrite() don't seem to be restricted to a certain buffer size.

    I created a garbage file:
    Code:
    head -c 1048576 /dev/urandom > garbage
    Then I wrote 3 different programs:
    Code:
    itsme@itsme:~/C$ cat speed1.c
    #include <stdio.h>
    
    int main(void)
    {
      FILE *fp;
      int c;
    
      fp = fopen("garbage", "r");
      if(fp == NULL)
        return 1;
      while((c = fgetc(fp)) != EOF)
        fputc(c, stdout);
    
      fclose(fp);
      return 0;
    }
    Code:
    itsme@itsme:~/C$ cat speed2.c
    #include <stdio.h>
    
    int main(void)
    {
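      FILE *fp;
      unsigned char buf[1048576];
      size_t n;

      fp = fopen("garbage", "r");
      if(fp == NULL)
        return 1;
      n = fread(buf, 1, sizeof buf, fp);  /* may return fewer bytes than asked */
      fwrite(buf, 1, n, stdout);          /* write only what was read */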
    
      fclose(fp);
      return 0;
    }
    Code:
    itsme@itsme:~/C$ cat speed3.c
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    
    int main(void)
    {
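      unsigned char buf[1048576];
      ssize_t n;
      int fd;

      fd = open("garbage", O_RDONLY);
      if(fd < 0)
        return 1;
      n = read(fd, buf, sizeof buf);      /* may return fewer bytes than asked */
      if(n > 0)
        write(1, buf, (size_t)n);         /* write only what was read */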
    
      close(fd);
      return 0;
    }
    And then I timed each one:
    Code:
    itsme@itsme:~/C$ time ./speed1 > /dev/null
    
    real    0m0.089s
    user    0m0.087s
    sys     0m0.002s
    itsme@itsme:~/C$ time ./speed2 > /dev/null
    
    real    0m0.004s
    user    0m0.001s
    sys     0m0.002s
    itsme@itsme:~/C$ time ./speed3 > /dev/null
    
    real    0m0.004s
    user    0m0.001s
    sys     0m0.002s
    speed2 (fread()) seems to perform just as well as speed3 (read()). Running an strace on speed2 verifies that there's only a single call to both read() and write().

    However, speed1 was significantly slower. Running an strace on that showed a multitude of read() and write() calls looking like:
    Code:
    read(3, "i\247\31g\230\353\4;A\263\276\310\232\344\252f\310jE\275"..., 4096) = 4096
    write(1, "\367\10d\fs/$Qn\244d\275\333e\223n\246B\313\254\213%\220"..., 4096) = 4096
    Hope that information is helpful.
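
    The 4096 in those strace lines is the chunk size stdio used for its own buffer on that system. A small sketch for inspecting the related values, assuming hypothetical helper names (preferred_block_size, report_buffer_sizes): BUFSIZ is the standard default from <stdio.h>, and st_blksize is the filesystem's preferred I/O block size reported by POSIX stat(). As far as I know glibc sizes stream buffers from st_blksize, but treat that as an assumption and check with strace on your own system.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: returns the filesystem's preferred I/O block
 * size for `path` (st_blksize), or -1 if stat() fails. */
long preferred_block_size(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_blksize;
}

/* Hypothetical helper: prints both values side by side for `path`. */
void report_buffer_sizes(const char *path)
{
    printf("BUFSIZ     = %d\n", BUFSIZ);
    printf("st_blksize = %ld\n", preferred_block_size(path));
}
```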
    Last edited by itsme86; 10-04-2005 at 12:42 PM.
