Thread: How do I convert a float to its exact integer division?

  1. #1
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665

    How do I convert a float to its exact integer division?

    Hello All,

    Let's say that I have the number 123.3443.

    Now, I want this to be re-written as 16166984/131072.

    So, I kind of have a basic understanding of floats now. There's the sign (1 bit), the exponent (8 bits? 13?) and the mantissa (53 in my case, as I'm using doubles). Or I guess I could use floats instead of doubles but nevertheless...

    So, 123.3443 would be written as +1.233443 x 10^2.

    sign : 0
    exponent : 127 + 2 = 129 (given that the exponent is zeroed at 127)
    mantissa : 1.233443

    Oh wait, I'm screwing this up. Basically, if we had 3.75 this would be written in its binary form as 11.11

    Why?

    3 = 11, obviously.

    But .75 = 3 / 4 = 1/2 + 1/4 which is how the trailing numbers after the . work, diving by terms in the binary sequence.

    So, I need to write code that convert the the remainder part more as a sum because 3 + .75 is indeed 3.75.

    I know I need to count the length of the binary or at least, I need it to create the denominator. And I need to convert the decimal part into a fraction so I need the numerator as well.

    Does anyone have an idea of how I would start this?

  2. #2
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    I suppose we should start with "why 16166984/131072". Is this supposed to be a literal translation of the actual number stored (ie convert the mantissa to an int, then divide by the correct power of 2 as given by the exponent), or the nearest feasible fraction to what is stored, or what? I'm assuming you don't want simplification, since both the numerator and denominator are even.

  3. #3
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    Okay, this is supposed to be nearest feasible fraction. So I guess I have the number I want to convert and it's conversion. The difference between the two should be something like 1e-15, if I can control that.

    Evaluating 16166984 / (2 ^ 17) = 123.3442993, just using my good ol' TI-84

  4. #4
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Quote Originally Posted by MutantJohn View Post
    Okay, this is supposed to be nearest feasible fraction. So I guess I have the number I want to convert and it's conversion. The difference between the two should be something like 1e-15, if I can control that.

    Evaluating 16166984 / (2 ^ 17) = 123.3442993, just using my good ol' TI-84
    Heh, I was typing and not reading what I wrote. By "nearest feasible fraction" I meant 1233443/1000, ie the fraction within tolerance with the smallest denominator. I may have a reference for that (Knuth's TAOCP vol 2 as I recall).

  5. #5
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    I'm trying to reduce the error from floating point arithmetic so I'm looking for something as exact as can be. I only really need this math for calculating determinants of 2 x 2 matrices but if the floats have such and such error parameters, I need to instead replace the float arithmetic with exact integers.

    Instead of .25 - .75, I need 1/4 - 3/4 because the sign of this operation is really what I'm after and I need to know for absolutely certain that 1 - 3 is less than 0. Granted, I wouldn't bust this algorithm out for this but this is the general idea.

    A 2x2 is essentially a*d - b*c and I need the sign of this operation exactly.

  6. #6
    - - - - - - - - oogabooga's Avatar
    Join Date
    Jan 2008
    Posts
    2,808
    Example:
    Bit pattern for 9.28125 is
    01000000 00100010 10010000 00000000 00000000 00000000 00000000 00000000

    The exponent is 1026 - 1023 = 3.

    Using ** for exponentiation, the significand bits (including the one that is not represented) represent this sum:

    2**3 + 2**0 + 2**(-2) + 2**(-5)

    To turn this into an integer fraction, multiply by 2**5 (and put the result over that as the divisor).

    2**8 + 2**5 + 2**3 + 2**0
    -------------------------------
    2**5

    which is 297 / 32 == 9.28125.

    EDIT:
    But beware denormalized representations!.
    Last edited by oogabooga; 01-09-2014 at 12:32 PM. Reason: added caveat
    The cost of software maintenance increases with the square of the programmer's creativity. - Robert D. Bliss

  7. #7
    - - - - - - - - oogabooga's Avatar
    Join Date
    Jan 2008
    Posts
    2,808
    Quote Originally Posted by MutantJohn View Post
    I'm trying to reduce the error from floating point arithmetic so I'm looking for something as exact as can be.
    You could forgo the use of floating point values and use your own "rational" type.
    The cost of software maintenance increases with the square of the programmer's creativity. - Robert D. Bliss

  8. #8
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    You could also use a bignum library (gnuMP), which has pseudo-infinite precision.

  9. #9
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    I'm unsure as to why you think you need to reduce the error introduced by a floating point variable in the calculation of a matrix determinant. Care to show us what kind of algorithm you are using?
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  10. #10
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    It's a common thing in computational geometry to worry about such a thing. I'm essentially performing the insphere test for a tetrahedron and a point.

    If you consider a positively oriented tetrahedron t (made up of vertices a, b, c and d) and a point, p, it is possible to determine if p is contained by the circumsphere of the tetrahedron. If the test is negative, it's inside, positive if outside and on the surface if exactly equal to 0.

    In this instance, the value of the number is not so much the issue as is the sign.

    If we consider a moving set of points that we wish to triangulate at a timestep (basically, each point has a velocity and we triangulate with time), the positions may not be accurately represented and we run the risk of arriving at the wrong sign. This happens when the result of the insphere test is incredibly small or as the jargon persists, nearly degenerate.

    I've noticed throughout my meshing travels that I'll get determinants as small as 1e-15 and when this happens, I get infinite flip-flopping in terms of repair routines meaning the same configuration of tetrahedra cannot be repaired.

    Using exact integers is nice because it gives you the correct sign no matter what. I have heard of GMP being used in this regard but I almost kind of wanted to see if I could do it myself. But maybe GMP would just be easier.

    Eh, maybe I butchered that but that's my stab at explaining it. If I was totally wrong the nevertheless, it's a thing I read and got curious about. I was wondering if I'd be able to do it but GMP is probably the best choice as it's starting to sound hard.

    Edit :

    insphere test is the determinant of : (q = magnitude)

    1 a.x a.y a.z a.q
    1 b.x b.y b.z b.q
    1 c.x c.y c.z c.q
    1 d.x d.y d.z d.q
    1 p.x p.y p.z p.q

  11. #11
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    Well in that case indeed you must use an arbitrary precision arithmetic library as suggested. Converting the floating point representation of a real to a fraction won't do you any good. You can as easily get the sign from the double you are working with. The processor doesn't know how to perform fractional arithmetic. Any sign errors you get during floating point operations will necessarily be duplicated on the fraction representation.

    But libraries like the already mentioned GMP, or MPFR will do you good. I suggest MPFR. GMP is a broader library and you would only need the MPF group of functions there. It's also slightly more functional for arbitrary precision arithmetic than GMP/MPF since that's its sole focus.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  12. #12
    Registered User
    Join Date
    Sep 2008
    Location
    Toronto, Canada
    Posts
    1,834
    For the fraction:
    Isn't it just the number of '1's in the fractional part (to the right of the implied decimal point) divided by 2^n where n is the same number.

    For example: .101 = 5 / 2^3 = 5 / 8

    If there was a whole number part then add that in multiplied by the same denominator

    11.101 would become (3x8 + 5) / 8 = 29 / 8

    So the problem reduces to looking at the number of digits precision (or more properly: where is the right-most occurring '1').

    This would fall apart for any repeated fraction (such as 0.3333 in base ten for example) - because we can not deduce the EXACT integer numerator & denominator when we know some of the precision had been cut off.
    Last edited by nonoob; 01-10-2014 at 02:01 PM.

  13. #13
    Registered User
    Join Date
    Jun 2005
    Posts
    6,815
    There is the inconvenient fact that not all decimal values of the form 123.3443 can be represented exactly using floating point. I haven't bothered to check whether 123.3443 can be represented in typical floating point representations or not.

    Anyhoo ....

    In any event, since you seem to be happy fiddling with the bits to get mantissa, sign, and exponents separately, the approach should be simple.

    Assuming you have some means of working with a rational number in the forms of two integers (including operations for adding, multiplying, dividing, and simplifying) it should be easy. The most significant bits of the mantissa will represent the sum of negative powers of two, where the first is non-zero. So the the bits in the mantissa might be 1001110 representing 1/2 + 0/4 + 0/8 + 1/16 +1/32 + 1/64 + 0/128 = (64 + 8 + 4 + 2)/128 = 78/128. Greatest common divisor of 78 and 128 is 2, so that simplifies to 39/64. [Assuming I've done the maths right].

    Depending on floating point representation, the exponent can represent a power of a base which is 2,10, or 16 (typically). For negative powers, simply multiply the denominator (bottom of the fraction) by the appropriate power of the base and simplify. For positive powers, multiply the numerator (top of the fraction) by appropriate power of the base, and simplify.

    If there are other features of the floating point representation (able to represent infinities, NaNs, etc) you need to take that into account.

    Done.

    If you are starting with user input like 123.3443 then DON'T read it as a float, and try to convert back (since there will be some imprecision introduced in the conversion to float). Work with it as a string. First loop over the digits to the left of the decimal points to produce an integer 123 (or a rational of the form 123/1). Once past the decimal point, simply keep track of the negative power of 10 corresponding to each digit. For each digit, form a fraction of the form digit/10^power and add. So .3443 would produce a sequence 3/10 + 4/100 + 4/1000 + 3/1000. Then (using the fact you have an approach to add rationals) add them up and add to the 123. If you have an input of the form 123.3443E5 then multiply by 10, 5 times.
    Right 98% of the time, and don't care about the other 3%.

    If I seem grumpy or unhelpful in reply to you, or tell you you need to demonstrate more effort before you can expect help, it is likely you deserve it. Suck it up, Buttercup, and read this, this, and this before posting again.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. C Float: Should use Division or Multiplication
    By hqt in forum C Programming
    Replies: 8
    Last Post: 01-01-2012, 03:30 AM
  2. Replies: 9
    Last Post: 07-15-2011, 03:22 PM
  3. int division to float
    By rimig88 in forum C++ Programming
    Replies: 4
    Last Post: 04-22-2008, 08:48 AM
  4. Replies: 8
    Last Post: 07-08-2005, 09:12 AM
  5. Is there a: convert float to integer function??
    By AssistMe in forum C Programming
    Replies: 4
    Last Post: 03-07-2005, 11:42 PM