# Thread: How do I convert a float to its exact integer division?

1. ## How do I convert a float to its exact integer division?

Hello All,

Let's say that I have the number 123.3443.

Now, I want this to be re-written as 16166984/131072.

So, I kind of have a basic understanding of floats now. There's the sign (1 bit), the exponent (11 bits for doubles) and the mantissa (52 stored bits, 53 counting the implicit leading 1, as I'm using doubles). Or I guess I could use floats instead of doubles but nevertheless...

So, 123.3443 would be written as +1.233443 x 10^2.

sign : 0
exponent : 127 + 2 = 129 (given that the exponent is zeroed at 127)
mantissa : 1.233443

Oh wait, I'm screwing this up. Basically, if we had 3.75 this would be written in its binary form as 11.11

Why?

3 = 11, obviously.

But .75 = 3/4 = 1/2 + 1/4, which is how the digits after the point work: each position stands for the next negative power of two.

So, I need to write code that converts the fractional part into a sum, because 3 + .75 is indeed 3.75.

I know I need to count the length of the binary fraction, or at least I need it to create the denominator. And I need to convert the fractional part into a fraction, so I need the numerator as well.

Does anyone have an idea of how I would start this?

2. I suppose we should start with "why 16166984/131072". Is this supposed to be a literal translation of the actual number stored (i.e. convert the mantissa to an int, then divide by the correct power of 2 as given by the exponent), or the nearest feasible fraction to what is stored, or what? I'm assuming you don't want simplification, since both the numerator and denominator are even.

3. Okay, this is supposed to be the nearest feasible fraction. So I guess I have the number I want to convert and its conversion. The difference between the two should be something like 1e-15, if I can control that.

Evaluating 16166984 / (2 ^ 17) = 123.3442993, just using my good ol' TI-84

4. Originally Posted by MutantJohn
Okay, this is supposed to be the nearest feasible fraction. So I guess I have the number I want to convert and its conversion. The difference between the two should be something like 1e-15, if I can control that.

Evaluating 16166984 / (2 ^ 17) = 123.3442993, just using my good ol' TI-84
Heh, I was typing and not reading what I wrote. By "nearest feasible fraction" I meant 1233443/10000, i.e. the fraction within tolerance with the smallest denominator. I may have a reference for that (Knuth's TAOCP vol 2 as I recall).

5. I'm trying to reduce the error from floating point arithmetic, so I'm looking for something as exact as can be. I only really need this math for calculating determinants of 2 x 2 matrices, but if the floats have such and such error parameters, I need to instead replace the float arithmetic with exact integers.

Instead of .25 - .75, I need 1/4 - 3/4 because the sign of this operation is really what I'm after and I need to know for absolutely certain that 1 - 3 is less than 0. Granted, I wouldn't bust this algorithm out for this but this is the general idea.

A 2x2 is essentially a*d - b*c and I need the sign of this operation exactly.

6. Example:

Bit pattern for 9.28125 is
01000000 00100010 10010000 00000000 00000000 00000000 00000000 00000000

The exponent is 1026 - 1023 = 3.

Using ** for exponentiation, the significand bits (including the one that is not represented) represent this sum:

2**3 + 2**0 + 2**(-2) + 2**(-5)

To turn this into an integer fraction, multiply by 2**5 (and put the result over that as the divisor).

2**8 + 2**5 + 2**3 + 2**0
-------------------------------
2**5

which is 297 / 32 == 9.28125.
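The steps above can be sketched in C with `frexp`/`scalbn`, which pull the significand and exponent apart without bit twiddling. This is a sketch under stated assumptions: the input is a finite, non-negative double, and the reduced numerator and denominator fit in 64 bits (fine for modest exponents; subnormals and huge exponents need the bignum treatment mentioned below).

```c
#include <math.h>
#include <stdint.h>

/* Convert a finite double into the exact fraction num/den, with den a
   power of two.  Sketch only: assumes x >= 0 and that the reduced
   numerator and denominator fit in 64-bit integers. */
void to_fraction(double x, int64_t *num, int64_t *den)
{
    if (x == 0.0) { *num = 0; *den = 1; return; }

    int e;
    double m = frexp(x, &e);            /* x == m * 2^e, 0.5 <= m < 1 */
    int64_t n = (int64_t)scalbn(m, 53); /* the 53 significand bits    */
    e -= 53;                            /* ...so x == n * 2^e         */

    while (n != 0 && (n & 1) == 0) {    /* cancel common powers of 2  */
        n >>= 1;
        e++;
    }
    if (e >= 0) {
        *num = n << e;                  /* whole number: den is 1     */
        *den = 1;
    } else {
        *num = n;
        *den = (int64_t)1 << -e;
    }
}
```

Running it on 9.28125 reproduces the worked example: numerator 297, denominator 32.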

EDIT:
But beware denormalized (subnormal) representations!

7. Originally Posted by MutantJohn
I'm trying to reduce the error from floating point arithmetic so I'm looking for something as exact as can be.
You could forgo the use of floating point values and use your own "rational" type.

8. You could also use a bignum library (GNU MP), which has pseudo-infinite precision.

9. I'm unsure as to why you think you need to reduce the error introduced by a floating point variable in the calculation of a matrix determinant. Care to show us what kind of algorithm you are using?

10. It's a common thing in computational geometry to worry about such a thing. I'm essentially performing the insphere test for a tetrahedron and a point.

If you consider a positively oriented tetrahedron t (made up of vertices a, b, c and d) and a point p, it is possible to determine whether p is contained by the circumsphere of the tetrahedron: the test is negative if p is inside, positive if outside, and exactly zero if p lies on the sphere's surface.

In this instance, the value of the number is not so much the issue as is the sign.

If we consider a moving set of points that we wish to triangulate at each timestep (basically, each point has a velocity and we triangulate over time), the positions may not be accurately represented and we run the risk of arriving at the wrong sign. This happens when the result of the insphere test is incredibly small or, in the jargon, nearly degenerate.

I've noticed throughout my meshing travels that I'll get determinants as small as 1e-15 and when this happens, I get infinite flip-flopping in terms of repair routines meaning the same configuration of tetrahedra cannot be repaired.

Using exact integers is nice because it gives you the correct sign no matter what. I have heard of GMP being used in this regard but I almost kind of wanted to see if I could do it myself. But maybe GMP would just be easier.

Eh, maybe I butchered that, but that's my stab at explaining it. If I was totally wrong then, nevertheless, it's a thing I read and got curious about. I was wondering if I'd be able to do it myself, but GMP is probably the best choice as it's starting to sound hard.

Edit :

insphere test is the determinant of (where q = x^2 + y^2 + z^2, the squared magnitude):

1 a.x a.y a.z a.q
1 b.x b.y b.z b.q
1 c.x c.y c.z c.q
1 d.x d.y d.z d.q
1 p.x p.y p.z p.q

11. Well in that case indeed you must use an arbitrary precision arithmetic library, as suggested. Converting the floating point representation of a real to a fraction won't do you any good; you can just as easily get the sign from the double you are working with. The processor doesn't know how to perform fractional arithmetic, so any sign errors you get during floating point operations will necessarily be duplicated in the fraction representation.

But libraries like the already mentioned GMP, or MPFR, will serve you well. I suggest MPFR: GMP is a broader library and you would only need its MPF group of functions, while MPFR is slightly more capable for arbitrary precision floating point arithmetic, since that is its sole focus.

12. For the fraction:
Isn't it just the fractional bits (to the right of the implied binary point) read as an integer, divided by 2^n, where n is the number of those bits?

For example: .101 = 5 / 2^3 = 5 / 8

If there was a whole number part then add that in multiplied by the same denominator

11.101 would become (3x8 + 5) / 8 = 29 / 8
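The 11.101 → 29/8 reading above can be sketched directly: accumulate every digit into one integer and count the digits after the point. A hedged sketch (`binstr_to_fraction` is a hypothetical helper; it assumes a well-formed string of '0'/'1' digits with at most one '.'):

```c
#include <stdint.h>

/* Read a binary string such as "11.101" as an exact fraction: the
   digits become one integer (binary 11101 -> 29) and the denominator
   is 2^n, where n counts digits right of the point (2^3 == 8 here). */
void binstr_to_fraction(const char *s, int64_t *num, int64_t *den)
{
    int64_t n = 0;
    int frac_bits = 0, seen_point = 0;

    for (; *s != '\0'; s++) {
        if (*s == '.') { seen_point = 1; continue; }
        n = n * 2 + (*s - '0');   /* shift in the next bit        */
        if (seen_point)
            frac_bits++;          /* count digits after the point */
    }
    *num = n;
    *den = (int64_t)1 << frac_bits;
}
```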

So the problem reduces to looking at the number of digits precision (or more properly: where is the right-most occurring '1').

This would fall apart for any repeating fraction (such as 0.3333... in base ten, for example), because we cannot deduce the EXACT integer numerator and denominator when we know some of the precision has been cut off.

13. There is the inconvenient fact that not all decimal values of the form 123.3443 can be represented exactly using floating point. I haven't bothered to check whether 123.3443 can be represented in typical floating point representations or not.

Anyhoo ....

In any event, since you seem to be happy fiddling with the bits to get mantissa, sign, and exponents separately, the approach should be simple.

Assuming you have some means of working with a rational number in the form of two integers (including operations for adding, multiplying, dividing, and simplifying), it should be easy. The bits of the mantissa represent a sum of negative powers of two, the first being 1/2. So mantissa bits 1001110 represent 1/2 + 0/4 + 0/8 + 1/16 + 1/32 + 1/64 + 0/128 = (64 + 8 + 4 + 2)/128 = 78/128. The greatest common divisor of 78 and 128 is 2, so that simplifies to 39/64. [Assuming I've done the maths right.]

Depending on the floating point representation, the exponent can represent a power of a base which is typically 2, 10, or 16. For negative powers, simply multiply the denominator (bottom of the fraction) by the appropriate power of the base and simplify. For positive powers, multiply the numerator (top of the fraction) by the appropriate power of the base, and simplify.

If there are other features of the floating point representation (able to represent infinities, NaNs, etc) you need to take that into account.

Done.

If you are starting with user input like 123.3443 then DON'T read it as a float and try to convert back (since there will be some imprecision introduced in the conversion to float). Work with it as a string. First loop over the digits to the left of the decimal point to produce an integer 123 (or a rational of the form 123/1). Once past the decimal point, simply keep track of the negative power of 10 corresponding to each digit. For each digit, form a fraction of the form digit/10^power and add. So .3443 would produce the sequence 3/10 + 4/100 + 4/1000 + 3/10000. Then (using the fact that you have a way to add rationals) add them up and add the result to the 123. If you have an input of the form 123.3443E5 then multiply by 10, 5 times.
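The string-parsing recipe above can be sketched in one pass: accumulate every digit into the numerator and multiply the denominator by 10 for each digit past the point (mathematically the same as summing 3/10 + 4/100 + ...). A hedged sketch (`decstr_to_fraction` is a hypothetical helper; it assumes a well-formed non-negative number with no exponent suffix, and leaves the fraction unsimplified):

```c
#include <stdint.h>

/* Parse a decimal literal such as "123.3443" straight into an exact
   fraction (1233443/10000), never going through a float.  The E5
   handling described in the text would be a follow-up step:
   multiply the numerator by 10, five times. */
void decstr_to_fraction(const char *s, int64_t *num, int64_t *den)
{
    int64_t n = 0, d = 1;
    int seen_point = 0;

    for (; *s != '\0'; s++) {
        if (*s == '.') { seen_point = 1; continue; }
        n = n * 10 + (*s - '0');  /* accumulate every digit            */
        if (seen_point)
            d *= 10;              /* one more power of ten below the line */
    }
    *num = n;
    *den = d;
}
```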