Thread: Floating Point Addition

  1. #1
    Registered User
    Join Date
    Sep 2008
    Posts
    22

    Floating Point Addition

    I am writing a program that performs floating-point addition by working directly on the bit patterns: it shifts the mantissa of one operand to line up the exponents and should then produce the sum of the two floating-point numbers. On paper I can get this to compute the correct sum, but I must be missing something in my program, because the output (in base-2 scientific notation) is not correct. Can anyone see where my error is? Thank you for the help!

    Code:
    #include <stdlib.h>
    #include <stdio.h>
    #include <ctype.h>
    #include <assert.h>
    
    int isNegative (float f)
    {
        unsigned int* iptr = (unsigned int*)&f;
        return ( ((*iptr) & 0x80000000) ? 1:0);
    }
    
    unsigned char getExponent (float f)
    {
             unsigned int* iptr = (unsigned int*)&f;
             return (((*iptr >> 23) & 0xff) - 127);
             
    }
    
    unsigned int getMantissa (float f)
    {
             unsigned int* iptr = (unsigned int*)&f;
             if( *iptr == 0 ) return 0;
             return ((*iptr & 0xFFFFFF) | 0x800000 );
            
    }
    
    float sum (float left, float right)
    {
          unsigned int littleMan;
          unsigned int bigMan;
          unsigned char littleExp;
          unsigned char bigExp;
          unsigned char lexp = getExponent(left);
          unsigned char rexp = getExponent(right);
          
          int   Dexponent;
       
    if (lexp > rexp)
    {
             bigExp = lexp;
             bigMan = getMantissa(left);
             littleExp = rexp;
             littleMan = getMantissa(right);
    }
    else
    {
        bigExp = rexp;
        bigMan = getMantissa(right);
        littleExp = lexp;
        littleMan = getMantissa(left);
    }
    
    printf("little: %x %x\n", littleExp, littleMan);
    printf("big:    %x %x\n", bigExp, bigMan);
    
        
    void shift(  unsigned int *valToShift, int bitsToShift )
    {
        // Masks is used to mask out bits to check for a "sticky" bit.
        
        static unsigned masks[24] =
        {
            0, 1, 3, 7, 0xf, 0x1f, 0x3f, 0x7f, 
            0xff, 0x1ff, 0x3ff, 0x7ff, 0xfff, 0x1fff, 0x3fff, 0x7fff,
            0xffff, 0x1ffff, 0x3ffff, 0x7ffff, 0xfffff, 0x1fffff, 0x3fffff, 0x7fffff
        };
            
        // HOmasks - masks out the H.O. bit of the value masked by the masks entry.
        
        static unsigned HOmasks[24] =
        {
            0, 
            1, 2, 4, 0x8, 0x10, 0x20, 0x40, 0x80, 
            0x100, 0x200, 0x400, 0x800, 0x1000, 0x2000, 0x4000, 0x8000, 
            0x10000, 0x20000, 0x40000, 0x80000, 0x100000, 0x200000, 0x400000
        };
            
        // shiftedOut- Holds the value that will be shifted out of a mantissa
        // during the denormalization operation (used to round a denormalized value).
        
        int             shiftedOut;
        
        assert( bitsToShift <= 23 );
        
        
        //  Grabs the bits we're going to shift out (so we can determine
        // how to round this value after the shift).
        
        shiftedOut = *valToShift & masks[ bitsToShift ];
        
        // Shift the value to the right the specified number of bits:
        
        *valToShift = *valToShift >> bitsToShift;
        
        // If necessary, round the value:
        
        if( shiftedOut > HOmasks[ bitsToShift ] )
        {
            // If the bits we shifted out are greater than 1/2 the L.O. bit, then
            // round the value up by one.
           
            *valToShift = *valToShift + 1;
        }
        else if( shiftedOut == HOmasks[ bitsToShift ] )
        {
            // If the bits we shifted out are exactly 1/2 of the L.O. bit's value,
            // then round the value to the nearest number whose L.O. bit is zero.
           
            *valToShift = *valToShift + ((*valToShift & 1) == 1);
        }
        // else we round the value down to the previous value.  The current
        // value is already truncated (rounded down), so we don't have to do anything.
    }
    
        // I got two actual floating point values.  I want to add them together.
        // 1. "denormalize" one of the operands if their exponents aren't
        // the same (when adding or subtracting values, the exponents must be the same).
        // 
        // Algorithm: choose the value with the smaller exponent.  Shift its mantissa
        // to the right the number of bits specified by the difference between the two
        // exponents.
        
        if( rexp > lexp )
        {
            shift( &littleMan, (rexp - lexp));
            Dexponent = rexp;
        }
        else if( rexp < lexp )
        {
            shift( &littleMan, (lexp - rexp));
            Dexponent = lexp;
        }
        
      
    
    
    unsigned int result = Dexponent;
    
    float fresult = *(float*)&result;
    return(fresult);
    
    }
    
    int main()
    {
        const int SIZE = 256;
        char line[SIZE];
        
        while (1)
        {
              float f1;
              float f2;
              float left = f1;
              float right = f2;
              
              printf("Please enter the first float ( \"q\" to quit):");
              fgets(line,SIZE,stdin);
              
              if (toupper(line[0]) =='Q')
              break;
              
              f1 = atof(line);
              
              printf("Please enter the second float ( \"q\" to quit):");
              fgets(line,SIZE,stdin);
              
              if (toupper(line[0]) == 'Q')
              break;
              
              f2 = atof(line);
              
              if (isNegative(f1) || isNegative(f2))
              printf ("One of these is negative, but %g + %g == %g\n", f1,f2,sum(f1,f2));
              else
              printf("%g + %g == %g\n", f1,f2,sum(f1,f2));
    }
    
    return(EXIT_SUCCESS);
    }

  2. #2
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    You've got this function "shift" nested in your other function, you naughty gcc user you. You probably want to define it in a legal way (somewhere else, that is) and just call it there in your sum function.
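
    A minimal sketch of what "define it somewhere else" could look like: the helper lives at file scope and the caller is an ordinary function defined after it. The names shift_right_round and align_mantissa are hypothetical, and the lookup tables from the original are swapped for masks computed on the fly, which is intended to round the same way (to nearest, ties to even) for shifts of 1 to 23 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical file-scope replacement for the nested shift(): shifts *val
   right by `bits` and rounds to nearest, with ties going to the even value. */
void shift_right_round(uint32_t *val, unsigned bits)
{
    assert(bits <= 23);
    if (bits == 0)
        return;                                  /* nothing is shifted out */

    uint32_t lost = *val & ((1u << bits) - 1u);  /* bits about to fall off */
    uint32_t half = 1u << (bits - 1);            /* half of the new L.O. bit */

    *val >>= bits;
    if (lost > half || (lost == half && (*val & 1u)))
        *val += 1u;                              /* round up, or break a tie toward even */
}

/* An ordinary caller, defined after the helper it uses. */
uint32_t align_mantissa(uint32_t mantissa, unsigned exp_diff)
{
    shift_right_round(&mantissa, exp_diff);
    return mantissa;
}
```

    For example, align_mantissa(7, 1) shifts out exactly half of the new low-order bit from an odd result, so the tie rounds up to 4, while align_mantissa(5, 1) stays at 2.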

  3. #3
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Reading again, you do call shift.... but where do you add? At some point you should add bigMan to littleMan, right?

  4. #4
    Registered User
    Join Date
    Sep 2008
    Posts
    22
    OK, I think I see my error: I am not passing anything from the addition along to the following code

    Code:
    unsigned int result = Dexponent;
    
    float fresult = *(float*)&result;
    return(fresult);
    and what I was passing in to shift(*valToShift, int bitsToShift) was just what I had on paper, but I need to pass in (*bigMan, int littleMan), I think?

    Thank you for the help !
    Last edited by ggraz; 10-11-2008 at 08:48 PM.

  5. #5
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Quote Originally Posted by ggraz View Post
    OK, I think I see my error: I am not passing anything from the addition along to the following code

    Code:
    unsigned int result = Dexponent;
    
    float fresult = *(float*)&result;
    return(fresult);
    and what I was passing in to shift(*valToShift, int bitsToShift) was just what I had on paper, but I need to pass in (*bigMan, int littleMan), I think?

    Thank you for the help !
    No, that needed to be valToShift and bitsToShift. I still don't see "floatMan = bigMan + littleMan", nor do I see you actually even trying to build the answer with the mantissa-of-answer and exponent-of-answer. In other words, this:
    Code:
    // Algorithm: choose the value with the smaller exponent.  Shift its mantissa
        // to the right the number of bits specified by the difference between the two
        // exponents.
    is steps 1 and 2 of the 4 or 5 that are actually necessary to add two numbers together.
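
    To make the remaining steps concrete, here is a rough sketch of the whole sequence (unpack, pick the operand with the larger exponent, align, add, renormalize, repack). It is not the original poster's code: the function names are made up, it assumes a 32-bit IEEE 754 float, and it only handles positive, normalized inputs whose sum does not overflow. It also truncates the shifted-out bits instead of rounding them.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy a float's bits into an integer (avoids the pointer-cast trick). */
uint32_t float_to_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Sketch only: positive, normalized operands; no zero, denormal, INF,
   NaN, sign, overflow, or rounding handling. */
float add_positive_floats(float a, float b)
{
    assert(sizeof(float) == sizeof(uint32_t));
    uint32_t ua = float_to_bits(a);
    uint32_t ub = float_to_bits(b);

    /* Step 1: unpack biased exponents and 24-bit mantissas (hidden bit restored). */
    int big_exp = (int)((ua >> 23) & 0xFF);
    int little_exp = (int)((ub >> 23) & 0xFF);
    uint32_t big_man = (ua & 0x7FFFFF) | 0x800000;
    uint32_t little_man = (ub & 0x7FFFFF) | 0x800000;

    /* Step 2: make "big" the operand with the larger exponent. */
    if (big_exp < little_exp) {
        int tmp_exp = big_exp;
        uint32_t tmp_man = big_man;
        big_exp = little_exp;
        big_man = little_man;
        little_exp = tmp_exp;
        little_man = tmp_man;
    }

    /* Step 3: align by shifting the smaller operand's mantissa RIGHT. */
    int diff = big_exp - little_exp;
    little_man = (diff < 24) ? (little_man >> diff) : 0;

    /* Step 4: add the aligned mantissas. */
    uint32_t sum_man = big_man + little_man;

    /* Step 5: renormalize if the sum carried into bit 24. */
    if (sum_man & 0x1000000) {
        sum_man >>= 1;
        ++big_exp;
    }

    /* Step 6: repack, dropping the hidden bit again. */
    return bits_to_float(((uint32_t)big_exp << 23) | (sum_man & 0x7FFFFF));
}
```

    With these assumptions, add_positive_floats(2.5f, 2.0f) yields 4.5f. Note that the exponent written back is the (possibly incremented) larger exponent, not the difference between the two.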

  6. #6
    Registered User
    Join Date
    Sep 2008
    Posts
    22
    OK, got it (well, not really), but I will have to rethink how I get steps 1 and 2 into the program. I see the logic; I just have to put it into the program. I will add to this post once I figure out what I have to add. Thank you for the help!

  7. #7
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    What makes you think there's anything wrong with steps 1 and 2 as you have them?

  8. #8
    Registered User
    Join Date
    Sep 2008
    Posts
    22
    Sorry (programming all day is starting to get to me!), you're right. I have to put the rest of the steps together.

  9. #9
    Registered User
    Join Date
    Sep 2008
    Posts
    22
    OK, I revised this one a lot, and I think I am on the right track. Basically I need to do floating-point addition using the exponents and the mantissas. My desired output is a floating-point number, e.g. 2.5 + 2 = 4.5.

    Like I said, I think I am close, but the program is not working. Any ideas?

    Code:
    #include <stdlib.h>
    #include <stdio.h>
    #include <ctype.h>
    #include <assert.h>
    
    int isNegative (float f)
    {
        unsigned int* iptr = (unsigned int*)&f;
        return ( ((*iptr) & 0x80000000) ? 1:0);
    }
    
    unsigned char getExponent (float f)  // Purpose is to return the 8 bit exponent of the floating point value
    {
             unsigned int* iptr = (unsigned int*)&f;
             return (((*iptr >> 23) & 0xff)-127) ;
             
    }
    
    unsigned int getMantissa (float f) // Purpose to return the 24 bit mantissa of the floating point value.
    {
             unsigned int* iptr = (unsigned int*)&f;
             if( *iptr == 0 ) return 0;
             return ((*iptr & 0xFFFFFF) | 0x800000 );
            
    }
    
    float sum (float left, float right) // Purpose to return the sum of the floating point values left & right .
    {
          // Will obtain the exponents of the left & right and will obtain the mantissa.
          unsigned int littleMan;
          unsigned int bigMan;
          unsigned char littleExp;
          unsigned char bigExp;
          unsigned char lexp = getExponent(left);
          unsigned char rexp = getExponent(right);
          
         
       
    if (lexp > rexp)
    {
             bigExp = lexp;
             bigMan = getMantissa(left);
             littleExp = rexp;
             littleMan = getMantissa(right);
    }
    else
    {
        bigExp = rexp;
        bigMan = getMantissa(right);
        littleExp = lexp;
        littleMan = getMantissa(left);
    }
    
    
    printf("little: %x %x\n", littleExp, littleMan);
    printf("big:    %x %x\n", bigExp, bigMan);
    
    // Purpose is to extract the difference in exponent values to determine how much to shift the mantissa
    int expSub = (bigExp - littleExp);
    printf("Subtraction of the Exp: %x\n", expSub);
    
    // Purpose is to shift the mantissas to align the binary points.
    int shifta = (littleMan << expSub);
    printf("The value of the Exp after the shift:  %x\n", shifta);
    
    // Purpose is to add mantissas
    int addMantissa = (bigMan + shifta);
    printf("The value of the two Mantissas added:  %x\n", addMantissa);  
    
    
    // Purpose: if the mantissa sum is too big (it carries into bit 24), shift it right one bit to fit, update bigExp to compensate for the shift, and strip the hidden bit.
    if (addMantissa >= 0x1000000)
    {
        shifta(&addMantissa,1);
        ++bigExp;
    }
    else
    {
        if (addMantissa != 0)
        {
          while( (addMantissa < 0x800000) && (bigExp > -127))
          { 
                 addMantissa  = addMantissa << 1;
                 --bigExp;
          }
         }
         
          
      
    // Purpose is to reassemble the floating point number 
    
    unsigned int result = ( (expSub + 127)<<23) | ( addMantissa & 0x7fffff));
    
    float fresult = *(float*)&result;
    return(fresult);
    
    }
    
    int main()
    {
        const int SIZE = 256;
        char line[SIZE];
        
        while (1)
        {
              float f1;
              float f2;
              float left = f1;
              float right = f2;
              
              printf("Please enter the first float ( \"q\" to quit):");
              fgets(line,SIZE,stdin);
              
              if (toupper(line[0]) =='Q')
              break;
              
              f1 = atof(line);
              
              printf("Please enter the second float ( \"q\" to quit):");
              fgets(line,SIZE,stdin);
              
              if (toupper(line[0]) == 'Q')
              break;
              
              f2 = atof(line);
              
              if (isNegative(f1) || isNegative(f2))
              printf ("One of these is negative, but %g + %g == %g\n", f1,f2,sum(f1,f2));
              else
              printf("%g + %g == %g\n", f1,f2,sum(f1,f2));
    }
    
    return(EXIT_SUCCESS);
    }

    Thank you for the help.
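
    As a worked check of the 2.5 + 2 = 4.5 target, the fields of each value can be pulled apart and inspected. The sketch below assumes a 32-bit IEEE 754 float and normalized values (zero and denormals would need special-casing); the struct and function names are hypothetical. Both 2.5 and 2 have unbiased exponent 1, so no alignment shift is needed: 0xA00000 + 0x800000 = 0x1200000, which carries into bit 24, so the sum is shifted right once to 0x900000 and the exponent becomes 2, which is exactly the encoding of 4.5.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three fields of a 32-bit IEEE 754 float. */
struct float_fields {
    int      negative;   /* 1 if the sign bit is set */
    int      exponent;   /* unbiased exponent, e.g. 1 for 2.5 */
    uint32_t mantissa;   /* 24 bits with the hidden bit restored */
};

/* Splits a normalized float into sign, unbiased exponent, and mantissa. */
struct float_fields decode_float(float f)
{
    uint32_t bits;
    struct float_fields out;

    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &f, sizeof bits);

    out.negative = (int)(bits >> 31);
    out.exponent = (int)((bits >> 23) & 0xFF) - 127;   /* remove the bias */
    out.mantissa = (bits & 0x7FFFFF) | 0x800000;       /* restore hidden bit */
    return out;
}
```

    Keeping the exponent in a plain int lets negative exponents (values below 1.0) survive the bias subtraction.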

  10. #10
    Registered User
    Join Date
    Sep 2008
    Location
    Toronto, Canada
    Posts
    1,834
    Wouldn't compile.

    Code:
    shifta(&addMantissa,1);
    'shifta' is not a function.

    I'm pretty sure there's a missing closing brace just before "// Purpose is to reassemble the floating point number "

    Code:
    unsigned int result = ( (expSub + 127)<<23) | ( addMantissa & 0x7fffff);
    I had to get rid of a closing parenthesis.

    Code:
    while( (addMantissa < 0x800000) && (bigExp > -127))
    Comparison is always true due to limited range of data type.
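
    A small sketch of why that warning appears, assuming 8-bit chars and a 32-bit IEEE 754 float. exponent_as_uchar mirrors the getExponent() from the posted code, and exponent_as_int is a hypothetical alternative that keeps the sign. An unsigned char can only hold 0 to 255, so after promotion to int it is always greater than -127, and a negative exponent such as -1 wraps around to 255.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the posted getExponent(): the result is squeezed into 0..255. */
unsigned char exponent_as_uchar(float f)
{
    uint32_t bits;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &f, sizeof bits);
    return (unsigned char)(((bits >> 23) & 0xFF) - 127);
}

/* Hypothetical alternative: a signed int keeps negative exponents intact. */
int exponent_as_int(float f)
{
    uint32_t bits;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &f, sizeof bits);
    return (int)((bits >> 23) & 0xFF) - 127;
}
```

    With the unsigned char version, 0.5 reports an exponent of 255 and so compares as larger than 2.0 (exponent 1), which would also pick the wrong operand as the "big" one.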

    How do you expect something as complicated as floating point to work if you can't get the syntax issues worked out?
    Last edited by nonoob; 10-12-2008 at 09:33 AM.

  11. #11
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    There isn't any better way to understand how floating point works than emulating it yourself, huh!

    So, should you be expecting your function to do the usual thing for +/- INF? How about for NaNs? I got those all working when I made my floating point emulation class. This is an assignment, right?
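
    For reference, a minimal sketch of how those special values can be spotted from the bit pattern, assuming a 32-bit IEEE 754 float: an all-ones exponent field with a zero fraction is an infinity, and an all-ones exponent with a nonzero fraction is a NaN. The enum and function names are made up for illustration and are not taken from the emulation class mentioned above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum float_class { FC_FINITE, FC_INFINITE, FC_NAN };

/* Classifies a raw 32-bit pattern by its exponent and fraction fields. */
enum float_class classify_bits(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFF;
    uint32_t fraction = bits & 0x7FFFFF;

    if (exponent != 0xFF)
        return FC_FINITE;                        /* zero, denormal, or normal */
    return fraction == 0 ? FC_INFINITE : FC_NAN; /* the sign bit is ignored */
}

/* Convenience wrapper that takes the float itself. */
enum float_class classify_float(float f)
{
    uint32_t bits;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &f, sizeof bits);
    return classify_bits(bits);
}
```

    A sum() that wants the usual behaviour could check both operands this way before touching the mantissas and return the special value directly.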
