How to calculate an average of all bytes in a given file?

**King Mir** · 08-18-2009

Originally Posted by Sebastiani

An average is simply all the values added together and then divided by the number of values. The problem is that adding a bunch of bytes together in a single byte will likely overflow, so you'll need either a larger data type or perhaps a clever mathematical technique to overcome that issue.

Using a 32-bit unsigned int, overflow won't be a problem unless the file is over 16.8 megabytes, so it shouldn't be an issue.
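As a quick check of that figure, here is a minimal sketch, assuming a 32-bit accumulator (uint32_t stands in for unsigned int, and the function name is invented for illustration):

```c
#include <stdint.h>

/* Largest number of bytes that can be summed in a 32-bit unsigned
 * accumulator without overflow, assuming the worst case where
 * every byte is 0xFF (255). */
uint32_t max_safe_byte_count(void)
{
    return UINT32_MAX / 255u;   /* 4294967295 / 255 = 16843009 */
}
```

That is 16,843,009 bytes, or roughly 16.8 million. A file whose bytes are mostly smaller than 0xFF can be larger than that before the sum wraps.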

**tabstop** · 08-18-2009

There's not a possibility of running into overflow -- there's essentially a certainty of it (assuming ASCII text and, say, four characters). You'll have to store the sum elsewhere. (EDIT: I'm assuming you're intending to store the sum in an unsigned char, for whatever reason. If you're thinking of an int, then no worries.)
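To make the four-character case concrete, a small sketch, assuming the usual 8-bit char (the function name is hypothetical):

```c
/* Sums the bytes of buf into an unsigned char accumulator, which
 * wraps modulo 256 (assuming CHAR_BIT == 8). */
unsigned char sum_in_one_byte(const unsigned char *buf, unsigned n)
{
    unsigned char sum = 0;
    for (unsigned i = 0; i < n; i++)
        sum = (unsigned char)(sum + buf[i]);   /* high bits are discarded */
    return sum;
}
```

For the four bytes "ABCD" the true sum is 65+66+67+68 = 266, but the function returns 266 - 256 = 10.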

**Dino** · 08-18-2009

If an unsigned integer is used to accumulate the total sum of all bytes, and if all bytes were 0xFF, then the file could still be 16MB before an overflow would occur. In that case, a float or double would suffice just fine.

**King Mir** · 08-18-2009

Using float would be worse than a long unsigned int.

Double has 52 fraction bits, which means 53 bits of precision. That is a significant gain over an unsigned long int's 32 bits. So use double if you think the file may be larger than 16.8 MB.
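The 53-bit limit can be probed with a short sketch, assuming an IEEE 754 double (the helper name is made up for illustration):

```c
/* Returns 1 if x + 1.0 is stored exactly in a double, 0 otherwise. */
int add_one_is_exact(double x)
{
    volatile double y = x + 1.0;   /* volatile: force the result to be rounded to double */
    return (y - x) == 1.0;
}
```

Under that assumption, add_one_is_exact(9007199254740991.0), which is 2^53 - 1, reports 1, while add_one_is_exact(9007199254740992.0), which is 2^53, reports 0 because 2^53 + 1 has no double representation. A running total of bytes therefore stays exact in a double up to about 2^53 / 255, roughly 35 trillion bytes of 0xFF.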

**bvkim** · 08-18-2009

Here is a sample. You will have to modify it; use at your own risk ^^

Code:

/*
 * @author		bvkim
 * @date		19 August, 2009
 * @file		sample.c
 * @note
 */

#define DEBUG

#include <stdio.h>
#include <stdlib.h>

typedef struct _DATA {
	unsigned int	sum;
	unsigned int	count;
} DATA;


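int	average_bytes( const char *file, DATA *data )
{
	FILE* fp;
	int c;	/* int, not unsigned: fgetc() returns the negative value EOF at end of file */
	
	fp = fopen( file, "rb" );	/* binary mode, so no byte is translated or dropped */
	if( fp == NULL ) {
		return 0;
	}
	/* add every byte to the running sum until end of file */
	while( (c = fgetc(fp)) != EOF ) {
		#ifdef DEBUG
		printf("%d - ", c);
		#endif
		data->sum += (unsigned int) c;
		data->count++;
	}
	#ifdef DEBUG
	printf("\n");
	#endif
	fclose( fp );
	
	return 1;	
}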

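int
main( int argc, char *argv[] )
{
	DATA *data;
	double avg;
	data = malloc( sizeof *data );	/* size of the struct, not of the pointer */
	if( data == NULL ) {
		printf("[error] out of memory\n");
		exit(1);
	}
	data->sum = 0;
	data->count = 0;
	if( average_bytes("test", data) == 0 ) {	
		printf("[error] cannot open file\n");
		free( data );
		exit(1);
	}
	if( data->count == 0 ) {	/* empty file: avoid dividing by zero */
		printf("[error] file is empty\n");
		free( data );
		exit(1);
	}
	#ifdef DEBUG
	printf("Sum %u - Count %u\n", data->sum, data->count);
	#endif
	avg = (double) data->sum / data->count;	/* cast first, or the division truncates */
	
	printf("byte average: %.2f \n", avg );
	
	free( data );
	
	return 0;	
}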

**Dino** · 08-18-2009

Originally Posted by King Mir

Using float would be worse than a long unsigned int.

Double has 52 fraction bits, which means 53 bits of precision. That is a significant gain over an unsigned long int's 32 bits. So use double if you think the file may be larger than 16.8 MB.

I don't think so. Max range for a float is 10**38. That's a helluva lot more than an unsigned int. We don't care about fractions in this case.

**King Mir** · 08-18-2009

Originally Posted by Dino

I don't think so. Max range for a float is 10**38. That's a helluva lot more than an unsigned int. We don't care about fractions in this case.

Yes, but you want integer precision, since the result of the final calculation is expected to be in the range 0-255. For that, the integer you want to store must have no more significant bits than the format can hold. That is one plus the number of stored significand bits in the IEEE floating point representation.

Otherwise you'll have rounding errors. Specifically, when you add a number to your running total, the number will have its least significant bits rounded off.

**ljin** · 08-19-2009

I doubt you solved the problem correctly. Assume BLOCK_SIZE = 3; the average should be 3/1 = 3. Why add the values of the bytes and divide them by 3? It should be adding up the number of bytes and dividing by the number of blocks. Assuming that the size of the given file in bytes is no greater than a multiple of 256 bytes, how could we solve this problem? Change to BLOCK_SIZE = 256 instead.

Originally Posted by Dino

If a file had 3 characters in it, A, B and C, the average would be B: (65+66+67)/3 = 66 = B

**Dino** · 08-19-2009

Originally Posted by King Mir

Yes, but you want integer precision, since the result of the final calculation is expected to be in the range 0-255. For that, the integer you want to store must have no more significant bits than the format can hold. That is one plus the number of stored significand bits in the IEEE floating point representation.

Otherwise you'll have rounding errors. Specifically, when you add a number to your running total, the number will have its least significant bits rounded off.

I'm not sure I follow you. The only opportunity for a fractional value comes at divide time at the end, just as there would be when using an int for an accumulator. Then, the average can be cast to an int (or unsigned char) and you are done. When accumulating, there is no rounding and there is no fractional portion of the total - it's a whole number.

In terms of working with whole numbers, a float is just like an int, but a lot bigger bucket in the same amount of space (both are 4 bytes).

What are you saying that I'm not understanding?

**King Mir** · 08-19-2009

Originally Posted by Dino

I'm not sure I follow you. The only opportunity for a fractional value comes at divide time at the end, just as there would be when using an int for an accumulator. Then, the average can be cast to an int (or unsigned char) and you are done. When accumulating, there is no rounding and there is no fractional portion of the total - it's a whole number.

In terms of working with whole numbers, a float is just like an int, but a lot bigger bucket in the same amount of space (both are 4 bytes).

What are you saying that I'm not understanding?

When adding a small number to a large number, the smaller number can be rounded off.

For example, imagine a format with only four fraction bits in the significand, and adding 1.0000 x 2^5 (32) and 1.1000 x 2^1 (3). The result should be 1.00011 x 2^5 (35), but the last digit does not fit in the encoding, so it must be rounded off, either up or down, giving either 34 or 36.

IEEE floating point formats are similar and suffer from the outlined problem.

Normally, you don't need 24 binary digits of precision, so this is not a problem. But in this case you want integer precision, no matter how high the accumulated total gets.
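The cumulative effect on a running total can be seen with a rough sketch, assuming an IEEE 754 single-precision float with round-to-nearest (both function names are invented for illustration):

```c
#include <stdint.h>

/* Adds the value 255 'count' times into a float accumulator.
 * Once the total passes 2^24 a float can no longer hold every
 * integer, so individual additions start to be rounded. */
float sum_ff_in_float(uint32_t count)
{
    volatile float sum = 0.0f;   /* volatile: force rounding to float at each step */
    for (uint32_t i = 0; i < count; i++)
        sum = sum + 255.0f;
    return sum;
}

/* The same total in a 64-bit integer, which is exact. */
uint64_t sum_ff_in_u64(uint32_t count)
{
    return (uint64_t)count * 255u;
}
```

For 1,000 bytes the two totals agree, since 255,000 is well below 2^24. For 100,000 bytes the exact total is 25,500,000, which is above 2^24, and the float total no longer matches it.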

**Dino** · 08-19-2009

Originally Posted by ljin

I doubt you solved the problem correctly. Assume BLOCK_SIZE = 3; the average should be 3/1 = 3. Why add the values of the bytes and divide them by 3? It should be adding up the number of bytes and dividing by the number of blocks. Assuming that the size of the given file in bytes is no greater than a multiple of 256 bytes, how could we solve this problem? Change to BLOCK_SIZE = 256 instead.

BLOCK_SIZE in the fread() has nothing to do with the average of all bytes in a file. With BLOCK_SIZE, you're just reading chunks of the file at a time.

Now, if your task was to calculate how many blocks were in a file of a given size, then you wouldn't even have to read the file to figure that out.
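A minimal sketch of that point, assuming a chunked fread() loop with a 64-bit accumulator (the function name, the BLOCK_SIZE value and the return convention are illustrative, not the code discussed earlier in the thread):

```c
#include <stdio.h>
#include <stdint.h>

#define BLOCK_SIZE 256   /* chunk size only (assumed value); it does not affect the result */

/* Reads the stream in BLOCK_SIZE chunks and stores the mean byte
 * value in *avg. Returns 0 on success, -1 on read error or empty input. */
int average_stream(FILE *fp, double *avg)
{
    unsigned char buf[BLOCK_SIZE];
    uint64_t sum = 0, count = 0;
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        for (size_t i = 0; i < n; i++)
            sum += buf[i];       /* add each byte value to the total */
        count += n;              /* number of bytes, not number of blocks */
    }
    if (ferror(fp) || count == 0)
        return -1;

    *avg = (double)sum / (double)count;
    return 0;
}
```

For a three-byte file containing A, B and C this yields 66 whether BLOCK_SIZE is 3 or 256, because the divisor is the byte count. The stream should be opened in binary mode ("rb") so that no byte is translated.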

**tabstop** · 08-19-2009

Originally Posted by Dino

I'm not sure I follow you. The only opportunity for a fractional value comes at divide time at the end, just as there would be when using an int for an accumulator. Then, the average can be cast to an int (or unsigned char) and you are done. When accumulating, there is no rounding and there is no fractional portion of the total - it's a whole number.

In terms of working with whole numbers, a float is just like an int, but a lot bigger bucket in the same amount of space (both are 4 bytes).

What are you saying that I'm not understanding?

For example:

Code:

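#include <stdio.h>
#include <math.h>

int main(void) {
    float b = powf(2.0f, 32.0f);   /* 2^32: neighboring floats are 512 apart here */
    float c = b + 255.0f;          /* 255 is less than half that gap, so c rounds back to b */
    if ((c-b) == 0.0f) {
        printf("Oops, I guess that byte didn't get added after all!\n");
    }
    return 0;
}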
