Thread: How can I work with doubles and still get exact values?

  1. #1
    Unregistered
    Guest

    Question How can I work with doubles and still get exact values?

    Basically I need a way to make sure that something like 1.1 will be stored as 1.1 and not as 1.1000000088...
    I guess one way would be to multiply each number by 100000 and cast it to a long int, but because of the way my program is designed, this wouldn't be all that efficient to implement...
    Is there another way to solve this problem, maybe a method that "cuts off" the unwanted part?
    Any help would be appreciated...

  2. #2
    It's full of stars adrianxw's Avatar
    Join Date
    Aug 2001
    Posts
    4,829
    I'm afraid that this kind of thing goes with the territory when working with floating point values on a computer. If it is that important to me, I always use scaled integers.
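
    For anyone wondering what scaled integers look like in practice, here is a minimal sketch assuming two decimal places. The names SCALE, scaled_make, scaled_add, scaled_mul and scaled_format are made up for illustration, and the sketch only handles non-negative values.

    ```c
    #include <stdio.h>

    /* Assumed scale: 100 units per 1.0, i.e. two decimal places. */
    #define SCALE 100L

    /* Build a scaled value from a whole part and hundredths: (1, 10) means 1.10.
       Sketch only: handles non-negative values. */
    long scaled_make(long whole, long hundredths)
    {
        return whole * SCALE + hundredths;
    }

    /* Adding scaled values is plain integer addition, so no rounding error creeps in. */
    long scaled_add(long a, long b)
    {
        return a + b;
    }

    /* A product of two scaled values carries SCALE twice; divide once to rescale.
       The division truncates any digits beyond two decimals. */
    long scaled_mul(long a, long b)
    {
        return a * b / SCALE;
    }

    /* Write the value as "W.HH" into buf (non-negative values only). */
    void scaled_format(long v, char *buf, size_t n)
    {
        snprintf(buf, n, "%ld.%02ld", v / SCALE, v % SCALE);
    }
    ```

    Adding scaled_make(1, 10) ten times gives exactly 1100, which scaled_format writes as 11.00. The price is that the number of decimals has to be fixed up front.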

  3. #3
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    Pick the precision you want, say two trailing decimals.

    so precision = 100;

    Code:
    float trunc(const float value, int precision)
    {
        int truncate = (int) (value * precision);
        float result = (float)(truncate / precision);
        return result;
    }

    void trunc( float *value, int precision)
    {
        int truncate = (int) ((*value) * precision);
        *value = (float)(truncate / precision);
    }

  4. #4
    Unregistered
    Guest
    Code:
    float trunc(const float value, int precision)
    {
        int truncate = (int) (value * precision);
        float result = (float)(truncate / precision);
        return result;
    }


    That won't work: truncate / precision is an integer division, so even if you convert the result afterwards you only get the integer part.
    (float) (truncate) / precision won't return the wanted value either, so it seems that the only correct way to do this is to use scaled integer values.
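
    The integer-division point can be checked with a small sketch. The function names div_as_int and div_as_float are hypothetical; they isolate only the division step of the truncation function quoted above.

    ```c
    /* Both operands are int, so the division discards the fraction
       before the cast to float ever happens. */
    float div_as_int(int truncated, int precision)
    {
        return (float)(truncated / precision);
    }

    /* Converting one operand first makes it a floating-point division,
       so the fraction survives (as the nearest representable float). */
    float div_as_float(int truncated, int precision)
    {
        return (float)truncated / precision;
    }
    ```

    With truncated = 162 and precision = 100, div_as_int returns 1.0 while div_as_float returns a value close to 1.62. It is only close because 1.62 has no exact binary representation, which is why scaled integers remain the exact option.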

  5. #5
    Unregistered
    Guest
    You're right. But a slight modification makes it work.

    Code:
    #include <stdio.h>

    float trunc(const float value, int precision)
    {
        int truncate = (int)(value * precision);   /* drop the digits past the chosen precision */
        float intermediate = (float)truncate;      /* convert first so the division is floating point */
        float result = intermediate / precision;
        return result;
    }

    int main()
    {
        float num = 1.62592f;
        num = trunc(num, 1000);
        printf("%f", num);             // will print "1.625000"
        return 0;
    }

  6. #6
    Registered User VirtualAce's Avatar
    Join Date
    Aug 2001
    Posts
    9,607
    And using any of those on modern computers is a waste of time. The double data type is a 64-bit value, and it will be handled by the FPU if you have floating point turned on in your compiler. The type cast will not waste nearly as many clock cycles as the other methods.

    To store a floating-point value into an integer, all you need is one opcode: fistp [integervar]. This stores st(0) into integervar and then pops st(0). It does not take long at all. Note that fistp rounds according to the FPU's current rounding mode (round-to-nearest by default), so the rounding mode has to be set to truncate if you want the same result as a C type cast. I'm not exactly sure what compilers do when you typecast from double (64-bit) to integer (16-bit, in a 16-bit environment), but that is how I would do it in assembly. On a machine with an FPU it is best to use double and long double.
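
    Whatever instruction the compiler ends up emitting, the C language itself defines a floating-point to integer cast as discarding the fractional part, provided the result fits in the target type. A minimal sketch of the difference between truncating and rounding follows; the function names are made up, and the rounding helper is a simple illustration that ignores edge cases very close to a half.

    ```c
    /* A cast from double to an integer type truncates toward zero. */
    long truncate_cast(double x)
    {
        return (long)x;
    }

    /* Simple round-half-away-from-zero built on top of the truncating cast.
       Illustration only: assumes x is well within the range of long. */
    long round_half_away(double x)
    {
        return (long)(x < 0 ? x - 0.5 : x + 0.5);
    }
    ```

    So truncate_cast(1.99) gives 1 and truncate_cast(-1.99) gives -1, while round_half_away(2.5) gives 3.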

    Here are the data types in a 16-bit environment, with their widths and ranges:

    Type            Width     Range
    unsigned char    8 bits   0 to 255
    char             8 bits   -128 to 127
    enum            16 bits   -32,768 to 32,767
    unsigned int    16 bits   0 to 65,535
    short int       16 bits   -32,768 to 32,767
    int             16 bits   -32,768 to 32,767
    unsigned long   32 bits   0 to 4,294,967,295
    long            32 bits   -2,147,483,648 to 2,147,483,647
    float           32 bits   3.4 x 10^-38 to 3.4 x 10^+38
    double          64 bits   1.7 x 10^-308 to 1.7 x 10^+308
    long double     80 bits   3.4 x 10^-4932 to 1.1 x 10^+4932
    near (pointer)  16 bits   not applicable
    far (pointer)   32 bits   not applicable

  7. #7
    A Banana Yoshi's Avatar
    Join Date
    Oct 2001
    Posts
    859
    Multiply by the number of decimals.
    Store the decimals (%) as a double.
    Subtract the decimals from the original.
    Divide back into the original.
    Yoshi

  8. #8
    Blank
    Join Date
    Aug 2001
    Posts
    1,034
    It's impossible to store .1 exactly in a binary floating point representation, because its binary expansion repeats forever. You will have to specify what exactly the problem is, but you could try rounding when printing and comparing a value to x +- epsilon. You could also try your own binary coded decimal representation, but that would probably be slow.
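
    A minimal sketch of both suggestions, rounding on output and comparing within an epsilon. The helper names nearly_equal and format_rounded are hypothetical, and the right epsilon depends on the size of the values involved, so it is left to the caller.

    ```c
    #include <stdio.h>

    /* Returns 1 when a and b differ by no more than eps, 0 otherwise.
       eps is an absolute tolerance chosen by the caller. */
    int nearly_equal(double a, double b, double eps)
    {
        double diff = a - b;
        if (diff < 0)
            diff = -diff;
        return diff <= eps;
    }

    /* Writes x into buf rounded to the given number of decimals.
       The stored value is unchanged; only the printed text is rounded. */
    void format_rounded(double x, int decimals, char *buf, size_t n)
    {
        snprintf(buf, n, "%.*f", decimals, x);
    }
    ```

    For example, the sum of ten 0.1 values compares equal to 1.0 with nearly_equal(sum, 1.0, 1e-9), and format_rounded(1.1, 1, buf, sizeof buf) writes 1.1 even though the stored double is not exactly 1.1.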
