Thread: How to calculate an average of all bytes in a given file?

  1. #16
    Registered User
    Join Date
    Apr 2006
    Posts
    2,149
    Quote Originally Posted by Sebastiani View Post
    An average is simply all the values added together and then divided by the number of values. Problem is, adding a bunch of bytes together using a single byte will likely overflow, so you'll either need a larger data type or use perhaps a clever mathematical technique to overcome that issue.
    Using a 32-bit unsigned int, overflow won't be a problem unless the file is over 16.8 megabytes, so it shouldn't be an issue.
    It is too clear and so it is hard to see.
    A dunce once searched for fire with a lighted lantern.
    Had he known what fire was,
    He could have cooked his rice much sooner.

  2. #17
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    There's not a possibility of running into overflow -- there's essentially a certainty of it (assuming ASCII text and, say, four characters). You'll have to store the sum elsewhere. (EDIT: I'm assuming you intend to store the sum in an unsigned char, for whatever reason. If you're thinking of an int, then no worries.)

  3. #18
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    If an unsigned integer is used to accumulate the total sum of all bytes, and if all bytes were 0xFF, then the file could still be 16MB before an overflow would occur. In that case, a float or double would suffice just fine.
    Mainframe assembler programmer by trade. C coder when I can.

  4. #19
    Registered User
    Join Date
    Apr 2006
    Posts
    2,149
    Using float would be worse than a long unsigned int.

    A double has 52 fraction bits, which means 53 bits of precision. That is a significant gain over a 32-bit unsigned long int. So use double if you think the file may be larger than 16.8 MB.

  5. #20
    Registered User
    Join Date
    May 2009
    Posts
    60
    Here is a sample. You will have to modify it; use at your own risk ^^

    Code:
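    /*
     * @author		bvkim
     * @date		19 August, 2009
     * @file		sample.c
     * @note		Averages the values of all bytes in the file "test".
     */
    
    #define DEBUG
    
    #include <stdio.h>
    #include <stdlib.h>
    
    typedef struct DATA {
    	unsigned int	sum;	/* running total of the byte values */
    	unsigned int	count;	/* number of bytes read */
    } DATA;
    
    
    /* Returns 1 on success, 0 if the file could not be opened. */
    int	average_bytes( const char *file, DATA *data )
    {
    	FILE* fp;
    	int c;	/* int, not unsigned: fgetc() returns the negative value EOF at end of file */
    	
    	fp = fopen( file, "rb" );	/* binary mode, so no byte is translated or dropped */
    	if( fp == NULL ) {
    		return 0;
    	}
    	while( (c = fgetc(fp)) != EOF ) {
    		#ifdef DEBUG
    		printf("%d - ", c);
    		#endif
    		data->sum += (unsigned int) c;
    		data->count++;
    	}
    	#ifdef DEBUG
    	printf("\n");
    	#endif
    	fclose( fp );
    	
    	return 1;	
    }
    
    int
    main( int argc, char *argv[] )
    {
    	DATA *data;
    	float avg;
    	data = malloc( sizeof *data );	/* size of the struct, not of the pointer */
    	if( data == NULL ) {
    		printf("[error] out of memory\n");
    		exit(1);
    	}
    	data->sum = 0;
    	data->count = 0;
    	if( average_bytes("test", data) == 0 ) {	
    		printf("[error] could not open file\n");
    		free( data );
    		exit(1);
    	}
    	#ifdef DEBUG
    	printf("Sum %u - Count %u\n", data->sum, data->count);
    	#endif
    	/* divide in floating point (not integer division) and guard against an empty file */
    	avg = data->count ? (float) ((double) data->sum / data->count) : 0.0f;
    	
    	printf("byte average: %.2f \n", avg );
    	
    	free( data );
    	
    	return 0;	
    }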

  6. #21
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Quote Originally Posted by King Mir View Post
    Using float would be worse than a long unsigned int.

    A double has 52 fraction bits, which means 53 bits of precision. That is a significant gain over a 32-bit unsigned long int. So use double if you think the file may be larger than 16.8 MB.
    I don't think so. Max range for a float is 10**38. That's a helluva lot more than an unsigned int. We don't care about fractions in this case.

  7. #22
    Registered User
    Join Date
    Apr 2006
    Posts
    2,149
    Quote Originally Posted by Dino View Post
    I don't think so. Max range for a float is 10**38. That's a helluva lot more than an unsigned int. We don't care about fractions in this case.
    Yes, but you want integer precision, since the result of the final calculation is expected to be in the range 0-255. For that, the integer you want to store must have at most as many significant bits as the format can store. That is one plus the number of stored significand bits in the IEEE floating-point representation.

    Otherwise you'll have rounding errors. Specifically, when you try to add a number to your running total, the number will have its least significant bits rounded off.

  8. #23
    Registered User
    Join Date
    Aug 2009
    Posts
    12
    I doubt you solved the problem correctly. Assume BLOCK_SIZE = 3; the average should be 3/1 = 3. Why add the values of the bytes and divide by 3? It should be the number of bytes added up and divided by the number of blocks. Assuming the size of the given file in bytes is no greater than a multiple of 256 bytes, how could we solve this problem? Use BLOCK_SIZE = 256 instead.

    Quote Originally Posted by Dino View Post
    If a file had 3 characters in it: A B and C. The average would be B. (65+66+67)/3 = 66 = B

  9. #24
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Quote Originally Posted by King Mir View Post
    Yes, but you want integer precision, since the result of the final calculation is expected to be in the range 0-255. For that, the integer you want to store must have at most as many significant bits as the format can store. That is one plus the number of stored significand bits in the IEEE floating-point representation.

    Otherwise you'll have rounding errors. Specifically, when you try to add a number to your running total, the number will have its least significant bits rounded off.
    I'm not sure I follow you. The only opportunity for a fractional value comes at divide time at the end, just as there would be when using an int for an accumulator. Then, the average can be cast to an int (or unsigned char) and you are done. When accumulating, there is no rounding and there is no fractional portion of the total - it's a whole number.

    In terms of working with whole numbers, a float is just like an int, but a lot bigger bucket in the same amount of space (both are 4 bytes).

    What are you saying that I'm not understanding?

  10. #25
    Registered User
    Join Date
    Apr 2006
    Posts
    2,149
    Quote Originally Posted by Dino View Post
    I'm not sure I follow you. The only opportunity for a fractional value comes at divide time at the end, just as there would be when using an int for an accumulator. Then, the average can be cast to an int (or unsigned char) and you are done. When accumulating, there is no rounding and there is no fractional portion of the total - it's a whole number.

    In terms of working with whole numbers, a float is just like an int, but a lot bigger bucket in the same amount of space (both are 4 bytes).

    What are you saying that I'm not understanding?
    When adding a small number to a large number, the smaller number can be rounded off.

    For example, imagine using a format with only four fraction bits in the significand, and adding 1.0000x2^5 (32) and 1.1000x2^1 (3). The result should be 1.00011x2^5 (35), but the last digit does not fit in the encoding, so it must be rounded off, either down or up, giving either 34 or 36.

    IEEE floating point formats are similar and suffer from the outlined problem.

    Normally, you don't need 24 bits of precision, so this is not a problem. But in this case you want integer precision, no matter how high the accumulated total gets.

  11. #26
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Quote Originally Posted by ljin View Post
    I doubt you solved the problem correctly. Assume BLOCK_SIZE = 3; the average should be 3/1 = 3. Why add the values of the bytes and divide by 3? It should be the number of bytes added up and divided by the number of blocks. Assuming the size of the given file in bytes is no greater than a multiple of 256 bytes, how could we solve this problem? Use BLOCK_SIZE = 256 instead.
    BLOCK_SIZE in the fread() has nothing to do with the average of all bytes in a file. With BLOCK_SIZE, you're just reading chunks of the file at a time.

    Now, if your task was to calculate how many blocks were in a file of a given size, then you wouldn't even have to read the file to figure that out.

  12. #27
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Quote Originally Posted by Dino View Post
    I'm not sure I follow you. The only opportunity for a fractional value comes at divide time at the end, just as there would be when using an int for an accumulator. Then, the average can be cast to an int (or unsigned char) and you are done. When accumulating, there is no rounding and there is no fractional portion of the total - it's a whole number.

    In terms of working with whole numbers, a float is just like an int, but a lot bigger bucket in the same amount of space (both are 4 bytes).

    What are you saying that I'm not understanding?
    For example:
    Code:
    #include <stdio.h>
    #include <math.h>
    
    int main() {
        float b = powf(2.0f,32.0f);
        float c = b + 255.0;
        if ((c-b) == 0.0f) {
            printf("Oops, I guess that byte didn't get added after all!\n");
        }
        return 0;
    }

    Last Post: 12-26-2004, 10:46 AM