Floating Point Addition

**ggraz** · 10-11-2008

I am writing a program that adds two floating-point numbers by working directly on their bit patterns, shifting the mantissas and so on to obtain the sum. On paper I can get the correct sum, but I must be missing something in my program, because the output (in base-2 scientific notation) is not correct. Can anyone see where my error is? Thank you for the help!

Code:

#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>
#include <assert.h>

int isNegative (float f)
{
    unsigned int* iptr = (unsigned int*)&f;
    return ( ((*iptr) & 0x80000000) ? 1:0);
}

unsigned char getExponent (float f)
{
         unsigned int* iptr = (unsigned int*)&f;
         return (((*iptr >> 23) & 0xff) - 127);
         
}

unsigned int getMantissa (float f)
{
         unsigned int* iptr = (unsigned int*)&f;
         if( *iptr == 0 ) return 0;
         return ((*iptr & 0xFFFFFF) | 0x800000 );
        
}

float sum (float left, float right)
{
      unsigned int littleMan;
      unsigned int bigMan;
      unsigned char littleExp;
      unsigned char bigExp;
      unsigned char lexp = getExponent(left);
      unsigned char rexp = getExponent(right);
      
      int   Dexponent;
   
if (lexp > rexp)
{
         bigExp = lexp;
         bigMan = getMantissa(left);
         littleExp = rexp;
         littleMan = getMantissa(right);
}
else
{
    bigExp = rexp;
    bigMan = getMantissa(right);
    littleExp = lexp;
    littleMan = getMantissa(left);
}

printf("little: %x %x\n", littleExp, littleMan);
printf("big:    %x %x\n", bigExp, bigMan);

    
void shift(  unsigned int *valToShift, int bitsToShift )
{
    // Masks is used to mask out bits to check for a "sticky" bit.
    
    static unsigned masks[24] =
    {
        0, 1, 3, 7, 0xf, 0x1f, 0x3f, 0x7f, 
        0xff, 0x1ff, 0x3ff, 0x7ff, 0xfff, 0x1fff, 0x3fff, 0x7fff,
        0xffff, 0x1ffff, 0x3ffff, 0x7ffff, 0xfffff, 0x1fffff, 0x3fffff, 0x7fffff
    };
        
    // HOmasks - masks out the H.O. bit of the value masked by the masks entry.
    
    static unsigned HOmasks[24] =
    {
        0, 
        1, 2, 4, 0x8, 0x10, 0x20, 0x40, 0x80, 
        0x100, 0x200, 0x400, 0x800, 0x1000, 0x2000, 0x4000, 0x8000, 
        0x10000, 0x20000, 0x40000, 0x80000, 0x100000, 0x200000, 0x400000
    };
        
    // shiftedOut- Holds the value that will be shifted out of a mantissa
    // during the denormalization operation (used to round a denormalized value).
    
    int             shiftedOut;
    
    assert( bitsToShift <= 23 );
    
    
    //  Grabs the bits we're going to shift out (so we can determine
    // how to round this value after the shift).
    
    shiftedOut = *valToShift & masks[ bitsToShift ];
    
    // Shift the value to the right the specified number of bits:
    
    *valToShift = *valToShift >> bitsToShift;
    
    // If necessary, round the value:
    
    if( shiftedOut > HOmasks[ bitsToShift ] )
    {
        // If the bits we shifted out are greater than 1/2 the L.O. bit, then
        // round the value up by one.
       
        *valToShift = *valToShift + 1;
    }
    else if( shiftedOut == HOmasks[ bitsToShift ] )
    {
        // If the bits we shifted out are exactly 1/2 of the L.O. bit's value,
        // then round the value to the nearest number whose L.O. bit is zero.
       
        *valToShift = *valToShift + ((*valToShift & 1) == 1);
    }
    // else we round the value down to the previous value.  The current
    // value is already truncated (rounded down), so we don't have to do anything.
}

    // I got two actual floating point values.  I want to add them together.
    // 1. "denormalize" one of the operands if their exponents aren't
    // the same (when adding or subtracting values, the exponents must be the same).
    // 
    // Algorithm: choose the value with the smaller exponent.  Shift its mantissa
    // to the right the number of bits specified by the difference between the two
    // exponents.
    
    if( rexp > lexp )
    {
        shift( &littleMan, (rexp - lexp));
        Dexponent = rexp;
    }
    else if( rexp < lexp )
    {
        shift( &littleMan, (lexp - rexp));
        Dexponent = lexp;
    }
    
  


unsigned int result = Dexponent;

float fresult = *(float*)&result;
return(fresult);

}

int main()
{
    const int SIZE = 256;
    char line[SIZE];
    
    while (1)
    {
          float f1;
          float f2;
          float left = f1;
          float right = f2;
          
          printf("Please enter the first float ( \"q\" to quit):");
          fgets(line,SIZE,stdin);
          
          if (toupper(line[0]) =='Q')
          break;
          
          f1 = atof(line);
          
          printf("Please enter the second float ( \"q\" to quit):");
          fgets(line,SIZE,stdin);
          
          if (toupper(line[0]) == 'Q')
          break;
          
          f2 = atof(line);
          
          if (isNegative(f1) || isNegative(f2))
          printf ("One of thse is negative, but %g + %g == %g\n", f1,f2,sum(f1,f2));
          else
          printf("%g + %g == %g\n", f1,f2,sum(f1,f2));
}

return(EXIT_SUCCESS);
}

**tabstop** · 10-11-2008

You've got this function "shift" nested in your other function, you naughty gcc user you. You probably want to define it in a legal way (somewhere else, that is) and just call it there in your sum function.

**tabstop** · 10-11-2008

Reading again, you do call shift.... but where do you add? At some point you should add bigMan to littleMan, right?

**ggraz** · 10-11-2008

OK, I think I see my error: I am not passing anything along to the following code for the addition.

Code:

unsigned int result = Dexponent;

float fresult = *(float*)&result;
return(fresult);

And what I was passing in to shift(*valToShift, int bitsToShift) was just what I had on paper, but I need to pass in (*bigMan, int littleMan). I think?

Thank you for the help !

**tabstop** · 10-11-2008

No, that needed to be valToShift and bitsToShift. I still don't see "floatMan = bigMan + littleMan", nor do I see you actually even trying to build the answer with the mantissa-of-answer and exponent-of-answer. In other words, this:

Code:

// Algorithm: choose the value with the smaller exponent.  Shift its mantissa
    // to the right the number of bits specified by the difference between the two
    // exponents.

is steps 1 and 2 of the 4 or 5 that are actually necessary to add two numbers together.

**ggraz** · 10-11-2008

OK, got it (well, not really), but I will have to rethink how I get steps 1 and 2 into the program. I see the logic; I just have to put it into the program. I will add to this post once I figure out what I have to add. Thank you for the help!

**tabstop** · 10-11-2008

What makes you think there's anything wrong with steps 1 and 2 as you have them?

**ggraz** · 10-11-2008

Sorry (programming all day is starting to get to me!), you're right. I have to put the rest of the steps together.

**ggraz** · 10-12-2008

OK, I revised this one a lot, and I think I am on the right track. Basically, I need to do floating-point addition using the exponents and the mantissas. My desired output is a floating-point number, e.g. 2.5 + 2 = 4.5.

Like I said, I think I am close, but the program is not working. Any ideas?

Code:

#include <stdlib.h>
#include <stdio.h>
#include <ctype.h>
#include <assert.h>

int isNegative (float f)
{
    unsigned int* iptr = (unsigned int*)&f;
    return ( ((*iptr) & 0x80000000) ? 1:0);
}

unsigned char getExponent (float f)  // Purpose is to return the 8 bit exponent of the floating point value
{
         unsigned int* iptr = (unsigned int*)&f;
         return (((*iptr >> 23) & 0xff)-127) ;
         
}

unsigned int getMantissa (float f) // Purpose to return the 24 bit mantissa of the floating point value.
{
         unsigned int* iptr = (unsigned int*)&f;
         if( *iptr == 0 ) return 0;
         return ((*iptr & 0xFFFFFF) | 0x800000 );
        
}

float sum (float left, float right) // Purpose to return the sum of the floating point values left & right .
{
      // Will obtain the exponents of the left & right and will obtain the mantissa.
      unsigned int littleMan;
      unsigned int bigMan;
      unsigned char littleExp;
      unsigned char bigExp;
      unsigned char lexp = getExponent(left);
      unsigned char rexp = getExponent(right);
      
     
   
if (lexp > rexp)
{
         bigExp = lexp;
         bigMan = getMantissa(left);
         littleExp = rexp;
         littleMan = getMantissa(right);
}
else
{
    bigExp = rexp;
    bigMan = getMantissa(right);
    littleExp = lexp;
    littleMan = getMantissa(left);
}


printf("little: %x %x\n", littleExp, littleMan);
printf("big:    %x %x\n", bigExp, bigMan);

//Purpose is to extract difference in exponet values to determin how much to shift the mantissa    
int expSub = (bigExp - littleExp);
printf("Subtraction of the Exp: %x\n", expSub);

// Purpose is shift the mantissas to allign binary points.
int shifta = (littleMan << expSub);
printf("The value of the Exp after the shift:  %x\n", shifta);

// Purpose is to add mantissas
int addMantissa = (bigMan + shifta);
printf("The value of the two Mantissas added:  %x\n", addMantissa);  


// Purpose is if the mantissa is too big , extending into the 24 bit , shift over to to fit mantissa and update bigExp to compensate for the shift and strip the hidden bit. 
if (addMantissa >= 0x1000000)
{
    shifta(&addMantissa,1);
    ++bigExp;
}
else
{
    if (addMantissa != 0)
    {
      while( (addMantissa < 0x800000) && (bigExp > -127))
      { 
             addMantissa  = addMantissa << 1;
             --bigExp;
      }
     }
     
      
  
// Purpose is to reassemble the floating point number 

unsigned int result = ( (expSub + 127)<<23) | ( addMantissa & 0x7fffff));

float fresult = *(float*)&result;
return(fresult);

}

int main()
{
    const int SIZE = 256;
    char line[SIZE];
    
    while (1)
    {
          float f1;
          float f2;
          float left = f1;
          float right = f2;
          
          printf("Please enter the first float ( \"q\" to quit):");
          fgets(line,SIZE,stdin);
          
          if (toupper(line[0]) =='Q')
          break;
          
          f1 = atof(line);
          
          printf("Please enter the second float ( \"q\" to quit):");
          fgets(line,SIZE,stdin);
          
          if (toupper(line[0]) == 'Q')
          break;
          
          f2 = atof(line);
          
          if (isNegative(f1) || isNegative(f2))
          printf ("One of thse is negative, but %g + %g == %g\n", f1,f2,sum(f1,f2));
          else
          printf("%g + %g == %g\n", f1,f2,sum(f1,f2));
}

return(EXIT_SUCCESS);
}

Thank you for the help.

**nonoob** · 10-12-2008

Wouldn't compile.

Code:

shifta(&addMantissa,1);

'shifta' is not a function.

I'm pretty sure there's a missing closing brace just before "// Purpose is to reassemble the floating point number "

Code:

unsigned int result = ( (expSub + 127)<<23) | ( addMantissa & 0x7fffff);

I had to get rid of a closing parenthesis.

Code:

while( (addMantissa < 0x800000) && (bigExp > -127))

Comparison is always true due to limited range of data type.

How do you expect something as complicated as floating point to work if you can't get the syntax issues worked out?

**iMalc** · 10-12-2008

There isn't any better way to understand how floating point works than emulating it yourself, huh!

So, should you be expecting your function to do the usual thing for +/- INF? How about for NaNs? I got those all working when I made my floating point emulation class. This is an assignment, right?
