# Thread: Storing a float in 16 bits

1. ## Storing a float in 16 bits

I need to store 10 million+ floats in memory for a program I'm writing, so I wanted to be able to store them in 2 bytes instead of 4. I was wondering if there is a good way to truncate a float down to 2 bytes and if needed go up from the 2 byte representation to a float.

Thanks!

2. It boils down to range and precision.
Can you maintain the range and precision you want in 16 bits?

3. Originally Posted by Salem
It boils down to range and precision.
Can you maintain the range and precision you want in 16 bits?
Yeah, I'm willing to sacrifice precision, and in terms of range, the values I want to store only vary from -100 to +100, so range is not a big issue

4. There are some standard 16-bit floating point formats, but they are pretty limited in usability - graphics processors sometimes use them to store "floating point pixels".

What range are your numbers? Would it be suitable to store them as fixed point, with a 12-bit signed integer part and a 4-bit fraction part?

The following does that. Note that you lose quite a bit of precision this way (4 fraction bits means steps of 1/16, or 0.0625), but you can't have both a compact format and a lot of precision.

Code:
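```
#include <stdio.h>
#include <math.h>

/* Convert a float to 12.4 fixed point: 12-bit integer part, 4-bit fraction. */
short ftofix16(float num) {
    short s = 1, i, f;

    if (fabs(num) > 2047.999f) {
        printf("Error: number out of range (num=%f)\n", num);
        /* Clamp so the conversion below cannot overflow a short. */
        num = (num < 0) ? -2047.999f : 2047.999f;
    }

    /* Encode the magnitude and apply the sign last, mirroring fix16tof(),
       so that values such as -0.5 round-trip with the correct sign. */
    if (num < 0) {
        s = -1;
        num = -num;
    }
    i = (short)num;
    f = (short)(num * 16) & 15;
    return s * ((i << 4) | f);
}

/* Convert 12.4 fixed point back to a float. */
float fix16tof(int n)
{
    float s = 1.0f;
    if (n < 0) {
        s = -1.0f;
        n = -n;
    }
    return s * ((float)(n >> 4) + ((n & 15) / 16.0f));
}

int main(int argc, char **argv) {
    float f, g, h;
    short a, b, c;
    /* Stop at end of input or on a bad value instead of looping forever. */
    while (scanf("%f %f %f", &f, &g, &h) == 3) {
        a = ftofix16(f);
        b = ftofix16(g);
        c = ftofix16(h);
        /* Mask to 16 bits so negative values print as four hex digits. */
        printf("%04x, %04x, %04x\n", a & 0xffffu, b & 0xffffu, c & 0xffffu);
        printf("%f, %f, %f\n", fix16tof(a), fix16tof(b), fix16tof(c));
    }
    return 0;
}
```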

5. I wrote the code above while you were replying, so I hadn't seen your range yet. With such a small range, you could go for "8.8" (8 integer bits, 8 fraction bits) instead. The code would need to change from shifting by 4 to shifting by 8, and from & 15 to & 255. The multiplication by 16 becomes multiplication by 256, the division by 16.0f becomes division by 256.0f, and the range limit drops from 2047.999 to about 127.996. Otherwise it's the same idea.
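For anyone following along, here is a rough sketch of what that 8.8 variant might look like. The names `ftofix88`, `fix88tof`, `FIX88_MAX` and `fix88_selfcheck` are made up for this example, and the silent clamping of out-of-range input is an assumption rather than something from the code above. The magnitude is encoded first and the sign applied last, so negative values decode with the correct sign.

```c
#include <assert.h>

/* Largest magnitude that fits in 8.8: 127 + 255/256 (hypothetical name). */
#define FIX88_MAX 127.99609375f

/* float -> 8.8 fixed point, truncating toward zero. */
short ftofix88(float num) {
    short s = 1, i, f;

    if (num < 0) {
        s = -1;
        num = -num;
    }
    /* Assumption: clamp silently instead of reporting an error. */
    if (num > FIX88_MAX) {
        num = FIX88_MAX;
    }
    i = (short)num;               /* integer part, 0..127 */
    f = (short)(num * 256) & 255; /* fraction in 1/256ths */
    return (short)(s * ((i << 8) | f));
}

/* 8.8 fixed point -> float. */
float fix88tof(int n) {
    float s = 1.0f;

    if (n < 0) {
        s = -1.0f;
        n = -n;
    }
    return s * ((float)(n >> 8) + ((n & 255) / 256.0f));
}

/* A few spot checks; call this from main() when trying the sketch out. */
void fix88_selfcheck(void) {
    assert(ftofix88(1.5f) == 0x0180);       /* 1 * 256 + 128 */
    assert(fix88tof(ftofix88(-0.5f)) == -0.5f);
    assert(fix88tof(ftofix88(-100.25f)) == -100.25f);
}
```

With 8 fraction bits the worst-case truncation error is just under 1/256 (about 0.004), compared with just under 1/16 for the 12.4 layout.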

--
Mats

6. Thanks mats! That was really helpful.

7. Originally Posted by kara3434
Yeah, I'm willing to sacrifice precision, and in terms of range, the values I want to store only vary from -100 to +100, so range is not a big issue
Then use an 8.8 fixed point representation. That gives a range of the whole part from -128 to 127, and steps of 1/256 (about 0.004) in the fractional part.
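One way to get exactly that -128 to 127 range is to treat the 16-bit value as a plain two's complement integer holding `x * 256`. The sketch below assumes finite input, rounds to the nearest step rather than truncating, and clamps anything out of range. The names `q8_8_from_float`, `q8_8_to_float` and `q8_8_pack` are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define Q8_8_SCALE 256.0f /* 2^8: eight fraction bits */

/* float -> 8.8, rounded to the nearest 1/256 and clamped to the int16_t range.
   Assumes x is finite (not NaN). */
int16_t q8_8_from_float(float x) {
    float scaled = x * Q8_8_SCALE;

    if (scaled > 32767.0f) {
        scaled = 32767.0f;
    }
    if (scaled < -32768.0f) {
        scaled = -32768.0f;
    }
    /* Add half a step away from zero, then let the cast truncate. */
    return (int16_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
}

/* 8.8 -> float. Exact, since every 16-bit integer divided by 256 fits in a float. */
float q8_8_to_float(int16_t q) {
    return (float)q / Q8_8_SCALE;
}

/* Convert a whole array, e.g. the 10 million values from the first post. */
void q8_8_pack(const float *src, int16_t *dst, size_t n) {
    size_t k;

    assert(src != NULL && dst != NULL);
    for (k = 0; k < n; k++) {
        dst[k] = q8_8_from_float(src[k]);
    }
}
```

Rounding instead of truncating halves the worst-case error to 1/512 (about 0.002). Ten million values stored this way take about 20 MB instead of 40 MB.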

8. If you would rather have a 16-bit float than a 16-bit fixed point value, then check out my website (link in sig). Go to the Useful classes page. You should find Shortfloat there.
It's C++, though you could break it up into lots of little C functions instead.
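For comparison, a 16-bit float conversion in plain C could look roughly like the sketch below. This is not the Shortfloat class mentioned above. It assumes the IEEE 754 half-precision layout (1 sign bit, 5 exponent bits, 10 mantissa bits) and a 32-bit IEEE 754 `float`. The names `float_to_half` and `half_to_float` are made up, and to stay short it truncates the mantissa, flushes tiny values to zero and turns too-large values into infinity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "assumes a 32-bit float");

/* float -> half-precision bit pattern (truncating, no subnormal output). */
uint16_t float_to_half(float f) {
    uint32_t x;
    uint32_t sign, man;
    int32_t exp;

    memcpy(&x, &f, sizeof x);                   /* read the float's raw bits */
    sign = (x >> 16) & 0x8000u;
    man = x & 0x007FFFFFu;
    exp = (int32_t)((x >> 23) & 0xFFu) - 127 + 15; /* rebias 8-bit -> 5-bit */

    if (((x >> 23) & 0xFFu) == 0xFFu) {         /* infinity or NaN */
        return (uint16_t)(sign | 0x7C00u | (man ? 0x0200u : 0u));
    }
    if (exp >= 31) {                            /* too large: infinity */
        return (uint16_t)(sign | 0x7C00u);
    }
    if (exp <= 0) {                             /* too small: signed zero */
        return (uint16_t)sign;
    }
    return (uint16_t)(sign | ((uint32_t)exp << 10) | (man >> 13));
}

/* Half-precision bit pattern -> float. */
float half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp = (h >> 10) & 0x1Fu;
    uint32_t man = h & 0x03FFu;
    uint32_t bits;
    float f;

    if (exp == 0) {
        /* Zero or subnormal: the value is man * 2^-24. */
        f = (float)man * (1.0f / 16777216.0f);
        return sign ? -f : f;
    }
    if (exp == 31) {
        bits = sign | 0x7F800000u | (man << 13);         /* infinity or NaN */
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13); /* 112 = 127 - 15 */
    }
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Bear in mind that with 10 mantissa bits, values between 64 and 128 in magnitude are spaced 1/16 apart, so near ±100 this format is coarser than 8.8 fixed point. It is finer near zero and covers a much wider range.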