# Thread: How can I work with doubles and still get exact values?

1. ## How can I work with doubles and still get exact values?

Basically I need a way to make sure that something like 1.1 will be stored as 1.1 and not as 1.1000000088...
I guess one way would be to multiply each number by 100000 and cast it into a long int, but because of the way my program is designed, this wouldn't be all that efficient to implement...
Is there another way to solve this problem, maybe a method that "cuts off" the unwanted part?
Any help would be appreciated...

2. I'm afraid this kind of thing goes with the territory when representing floating-point values on a computer. If exactness is that important to me, I always use scaled integers.
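For anyone unsure what "scaled integers" means here, this is a minimal sketch, assuming five decimal places are enough (the scale suggested in the question). The names `SCALE`, `scaled_make`, `scaled_add` and `scaled_format` are made up for illustration, and negative values are left out to keep it short.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical scale: 100000 units per 1.0, i.e. five decimal places. */
#define SCALE 100000L

/* Build a scaled value from a whole part and a fractional part,
   e.g. (1, 10000) stands for 1.10000. */
long scaled_make(long whole, long frac)
{
    assert(frac >= 0 && frac < SCALE);
    return whole * SCALE + frac;
}

/* Addition works directly on scaled values and is exact. */
long scaled_add(long a, long b)
{
    return a + b;
}

/* Format a non-negative scaled value as decimal text into buf. */
void scaled_format(long v, char *buf, size_t size)
{
    assert(v >= 0);
    snprintf(buf, size, "%ld.%05ld", v / SCALE, v % SCALE);
}
```

With this, 1.1 is held as the integer 110000, so adding 1.1 and 2.2 gives exactly 330000, which formats as 3.30000.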

3. Pick the precision you want, say two trailing decimals.

so precision = 100;

Code:
```
float trunc(const float value, int precision)
{
    int truncate = (int) (value * precision);
    float result = (float)(truncate / precision);
    return result;
}

void trunc( float *value, int precision)
{
    int truncate = (int) ((*value) * precision);
    *value = (float)(truncate / precision);
}
```

4. Code:
```
float trunc(const float value, int precision)
{
    int truncate = (int) (value * precision);
    float result = (float)(truncate / precision);
    return result;
}
```

That won't work: truncate / precision is an integer division, so even if you convert the result afterwards you only get the integer part.
(float) (truncate) / precision won't return the wanted value either, so it seems that the only correct way to do this is to use scaled integer values.
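To make the integer-division point concrete, here is a small sketch of the two variants side by side. The function names are hypothetical and only there for comparison.

```c
#include <assert.h>

/* Integer division: int / int discards the fraction before the cast,
   so only the whole part survives. */
float trunc_int_div(float value, int precision)
{
    assert(precision > 0);
    int truncated = (int)(value * precision);
    return (float)(truncated / precision);
}

/* Floating-point division: converting one operand to float first
   keeps the decimals in the quotient. */
float trunc_float_div(float value, int precision)
{
    assert(precision > 0);
    int truncated = (int)(value * precision);
    return (float)truncated / precision;
}
```

For 1.62592 and a precision of 1000, the first returns 1.0 and the second returns 1.625. Keep in mind that the result is still a binary float, so a target such as 1.62 can only be approximated, which is the argument for scaled integers.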

5. You're right. But a slight modification makes it work.

Code:
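```
#include <stdio.h>
#include <conio.h>   /* non-standard; provides getch() on DOS/Windows compilers */

float trunc(const float value, int precision)
{
    int truncate = value * precision;          /* drops the unwanted decimals */
    float intermediate = (float)truncate;      /* back to float before dividing */
    float result = intermediate / precision;   /* floating-point division */
    return result;
}

int main()
{
    float num = 1.62592f;
    num = trunc(num, 1000);
    printf("%f", num);             // will print "1.625000"
    getch();                       /* wait for a key press */
    return 0;
}
```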

6. And using any of those on modern computers is a waste of time. The double data type is a 64-bit value, and it will be handled by the FPU if you have floating point turned on in your compiler. A type cast will not waste nearly as many clock cycles as the other methods.

To store a 64-bit floating-point value into an integer, all you need is one opcode: fistp [integervar]. This stores st(0) into integervar and then pops st(0). It does not take long at all. Note that fistp converts using the FPU's current rounding mode (round to nearest by default), so it only matches a truncating type cast when the rounding mode is set to truncate. I'm not exactly sure what compilers do when you cast from double (64-bit) to int (16-bit in a 16-bit environment), but that is how I would do it in assembly. On a machine with an FPU it is best to use double and long double.
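For reference, a C cast from double to an integer type truncates toward zero, whatever instruction the compiler ends up emitting. A minimal sketch contrasting that with rounding to nearest follows; the helper names are hypothetical, and the range checks assume only that long is at least 32 bits wide.

```c
#include <assert.h>

/* A plain cast drops the fraction: 2.7 -> 2, -2.7 -> -2. */
long truncate_toward_zero(double x)
{
    /* Converting an out-of-range double to an integer is undefined. */
    assert(x > -2147483647.0 && x < 2147483647.0);
    return (long)x;
}

/* Round to nearest, with halves going away from zero: 2.7 -> 3, -2.7 -> -3. */
long round_half_away(double x)
{
    assert(x > -2147483647.0 && x < 2147483647.0);
    return (long)(x < 0.0 ? x - 0.5 : x + 0.5);
}
```

Which of the two you want depends on whether the stray digits should be cut off or rounded away.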

Here are the 16-bit data types, with their widths and ranges.

| Type | Width | Range |
|---|---|---|
| unsigned char | 8 bits | 0 to 255 |
| char | 8 bits | -128 to 127 |
| enum | 16 bits | -32,768 to 32,767 |
| unsigned int | 16 bits | 0 to 65,535 |
| short int | 16 bits | -32,768 to 32,767 |
| int | 16 bits | -32,768 to 32,767 |
| unsigned long | 32 bits | 0 to 4,294,967,295 |
| long | 32 bits | -2,147,483,648 to 2,147,483,647 |
| float | 32 bits | 3.4 x 10^-38 to 3.4 x 10^+38 |
| double | 64 bits | 1.7 x 10^-308 to 1.7 x 10^+308 |
| long double | 80 bits | 3.4 x 10^-4932 to 1.1 x 10^+4932 |
| near (pointer) | 16 bits | not applicable |
| far (pointer) | 32 bits | not applicable |

7. Multiply by the number of decimals.
Store the decimals (%) as a double.
Subtract the decimals from the original.
Divide back into the original.

8. It's impossible to store .1 exactly in a binary representation, as it is a repeating fraction in binary. You will have to specify exactly what the problem is, but you could try rounding when printing and comparing a value against x ± epsilon. You could also try your own binary-coded decimal representation, but that would probably be slow.
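The two workarounds mentioned here, comparing within an epsilon and rounding only when printing, could look roughly like this. It is a sketch: the function names are hypothetical, and the right epsilon depends on the magnitude of your values.

```c
#include <assert.h>
#include <stdio.h>

/* Return 1 if a and b differ by no more than epsilon, otherwise 0. */
int nearly_equal(double a, double b, double epsilon)
{
    assert(epsilon >= 0.0);
    double diff = a - b;
    if (diff < 0.0)
        diff = -diff;
    return diff <= epsilon;
}

/* Write value into buf, rounded to the given number of decimals.
   Only the printed text is rounded; the stored double is unchanged. */
void format_rounded(double value, int decimals, char *buf, size_t size)
{
    assert(decimals >= 0);
    snprintf(buf, size, "%.*f", decimals, value);
}
```

For example, `nearly_equal(0.1 + 0.2, 0.3, 1e-9)` is true even though the two sides are not bit-for-bit equal, and `format_rounded(1.1, 1, buf, sizeof buf)` writes "1.1" regardless of the extra digits held internally.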
