# Floating Point Bit Level Arithmetic

1. ## Floating Point Bit Level Arithmetic

Hello, for a CS lab I am supposed to take two floating point numbers and some operation to perform on them as input, and then perform the operation between the two numbers at the bit level. This means that initially I have to extract the separate parts of each floating point number, such as the sign bit, exponent, and fraction part, and store them as integers.

I seem to be getting a bit of unexpected behavior when doing this, however.

Code:
```c
#define SIGNBIT ((uint)0x80000000)
#define EXPONENT ((uint)0x7F800000)
#define FRACTION ((uint)0x007FFFFF)
#define BIT32 ((uint)0x80000000)
#define BIT24 ((uint)0x00800000)

typedef unsigned int uint;

float doComp(uint *xf1, uint *xf2, char op)
{
    // 1: extract and display sign, biased and
    //    unbiased exponent, plus fraction bit parts

    float SUM;

    uint SIGN1 = (*xf1 & SIGNBIT) >> 31;
    uint SIGN2 = (*xf2 & SIGNBIT) >> 31;

    uint EXP1 = ((*xf1 & EXPONENT) >> 23) - 127;
    uint EXP2 = ((*xf2 & EXPONENT) >> 23) - 127;

    uint FRAC1 = (*xf1 & FRACTION);
    uint FRAC2 = (*xf2 & FRACTION);

    char sign;

    printf("xf1: %lf = %c 1.%u * 2^(%u)\n", *(float*)xf1, SIGN1?'-':'+', FRAC1, EXP1);
    printf("xf2: %lf = %c 1.%u * 2^(%u)\n", *(float*)xf2, SIGN2?'-':'+', FRAC2, EXP2);

    // 2: compute f1 op f2 at the bit level by
    //    appropriately shifting and manipulating
    //    the bit components -- normalize result

    return 0.0;
}
```
That is the code that I am using to try and extract and print the separate parts of the floating point number. However, I am getting approximate fraction parts rather than exact numbers in odd cases. For example, inputting 1.5 would give me a fraction part of .4194304. I may simply be misunderstanding IEEE Floating point numbers, but I did not think that 1.5 would have to be rounded... If this is just normal floating point round off error, then all's well, I suppose. I'm just not sure.

Also, another question I have regards a part of performing the operations. For adding and subtracting, I will obviously have to compare the exponents of the two numbers and change one so that they have the same exponent. Then I also have to shift the fraction part of the number. The problem is, how do I deal with the implied one in the fraction part when shifting? Is there an easy way to add it in? I have a few ideas I'm gonna try tonight but any help/hints/prods in the right direction would be appreciated!

Thanks!

2. The significand does not work like you seem to expect it to. It's not simply a 23-bit integer. Rather, you have to take each bit as a negative power of 2: the top bit is 2**(-1), the next is 2**(-2), then 2**(-3) and so on. There's also an implied 23rd (or 24th if you start counting from 1) bit whose value is zero if the exponent is zero, and one otherwise (that is, it's 2**0).

So look at your fraction field when you pass in 1.5: the value is 0x400000 (4194304 in decimal, which is the number you saw printed), or 0b10000000000000000000000 in binary. Note that the top (non-implied) bit is 1, which means 0.5, and the rest are zero. Hence your fractional part of 0.5. If you had 1.75, you'd get 0x600000, which means the top two bits are set: 0.5 + 0.25.

Don't forget that when the exponent is zero, you don't subtract 127, but instead treat it as if the exponent is -126 (and, of course, 0xff will be Inf or NaN).
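To make the field layout described above concrete, here is a minimal sketch of unpacking a float into its sign, unbiased exponent, and a 24-bit significand with the implied bit made explicit. The names `fp_parts` and `fp_unpack` are made up for this illustration and are not part of the lab code; the sketch assumes `float` is a 32-bit IEEE 754 value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the unpacked fields (not from the lab handout). */
struct fp_parts {
    uint32_t sign;        /* 0 or 1 */
    int32_t  exponent;    /* unbiased; -126 for zero/denormals, 128 for Inf/NaN */
    uint32_t significand; /* implied bit (bit 23) plus the 23 fraction bits */
};

static struct fp_parts fp_unpack(float f)
{
    struct fp_parts p;
    uint32_t bits, biased;

    assert(sizeof bits == sizeof f);   /* assumes a 32-bit IEEE 754 float */
    memcpy(&bits, &f, sizeof bits);    /* copy the raw bit pattern */

    p.sign = bits >> 31;
    biased = (bits >> 23) & 0xFFu;
    p.significand = bits & 0x007FFFFFu;

    if (biased == 0) {
        p.exponent = -126;             /* zero or denormal: implied bit is 0 */
    } else {
        p.exponent = (int32_t)biased - 127;
        if (biased != 0xFFu)
            p.significand |= 0x00800000u;  /* normal number: implied bit is 1 */
    }
    return p;
}
```

For 1.5 this gives an exponent of 0 and a significand of 0xC00000, that is, the implied 1 followed by the 0.5 bit.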

3. Oh... well that makes a lot more sense, I guess I was just being a little dumb about it

I'll be taking another go at the lab tonight, and if I run into any other major road blocks, hopefully someone here can save me again

4. Ok, so I'm still having a bit of trouble. I realize now that the decimal is not stored how I thought it was. But now I'm not sure how I can store the decimal for performing the operations and printing. For example, one of the first things we are supposed to do is print each input in IEEE format ( +/- 1.FRAC * 2 ^ (EXP) ). I can display the sign and exponent fine, but I don't know how to handle the fraction part so that it will display properly. We are only supposed to use integer operations, so I don't think I can manually convert the 23bit fraction to a decimal value. Anybody here know what I can do? Anything to get me going in the right path is appreciated!

5. Originally Posted by DanV2
The problem is, how do I deal with the implied one in the fraction part when shifting? Is there an easy way to add it in?
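To insert the implied 1 in the fraction, bitwise OR the unpacked fraction with the manifest constant BIT24. Apply it to the extracted fraction, not to the packed word *xf1, where bit 23 belongs to the exponent:
Code:
`FRAC1 |= BIT24;`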

6. Originally Posted by DanV2
Ok, so I'm still having a bit of trouble. I realize now that the decimal is not stored how I thought it was. But now I'm not sure how I can store the decimal for performing the operations and printing. For example, one of the first things we are supposed to do is print each input in IEEE format ( +/- 1.FRAC * 2 ^ (EXP) ). I can display the sign and exponent fine, but I don't know how to handle the fraction part so that it will display properly. We are only supposed to use integer operations, so I don't think I can manually convert the 23bit fraction to a decimal value. Anybody here know what I can do? Anything to get me going in the right path is appreciated!
After extracting the fractional part of the float into an integer, shift it all the way to the left. Now extract each bit from the integer, and multiply and divide by powers of 10 and 2 respectively.
In pseudocode it'd be something like

1. (binary fraction) 1.11 == 0x00600000 (fractional part extracted into int)
2. left shift by 9, so it's now 0xC0000000
3. extract the leftmost bit; the k-th bit extracted is worth 1/2^k = 5^k/10^k, so add that to a running total, say n
4. left shift int by 1
5. repeat steps 3 and 4 until int is zero.
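A closely related way to get the decimal digits with integer operations only is to multiply the 23-bit fraction by 10 repeatedly and peel off whatever lands above bit 22 as the next digit. This is a variant of the steps above rather than a literal transcription of them, and `frac_to_decimal` is a hypothetical helper name. The caller is assumed to supply a buffer of at least 24 bytes (up to 23 digits plus the terminator).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the decimal digits of frac / 2^23 into buf
   (at least 24 bytes), using integer operations only. */
static void frac_to_decimal(uint32_t frac, char *buf)
{
    size_t n = 0;

    assert(buf != NULL);
    frac &= 0x007FFFFFu;                 /* keep only the 23 fraction bits */
    if (frac == 0)
        buf[n++] = '0';
    while (frac != 0) {
        frac *= 10;                      /* at most 10 * (2^23 - 1), fits in 32 bits */
        buf[n++] = (char)('0' + (frac >> 23));  /* integer part is the next digit */
        frac &= 0x007FFFFFu;             /* keep the remaining fraction */
    }
    buf[n] = '\0';
}
```

Each multiplication by 10 removes at least one trailing binary place, so the loop stops after at most 23 digits. With the fraction 0x400000 from 1.5, the buffer ends up holding "5", which can be printed after the "1." prefix.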

7. Okay, well that all makes sense and seems very logical. Now, however, I am still a bit puzzled. I emailed my TA the other day but haven't gotten a response yet, so I turn to here.

So I can extract all of the separate parts and add the implied one bit to the fraction part of each input. I also believe I could easily shift the bits of the fraction parts the necessary amount if the exponents of the two inputs are different (I believe it would just involve calculating the difference between the exponents and shifting the fraction part of the number with the smaller exponent to the right). Then, to calculate the sum of the two numbers, you would have to add the fraction parts, then use the bitwise or operator to put all of the parts back into a floating point number. For example, I can add 127 back to the exponent part and shift it back to the left 23 bits, but then how would I get that into a floating point number? I assume I'd have to use the bitwise or in some way, but I can't seem to figure out the syntax.

For example, I have the value 0x43000000 stored in a uint. I want to change the corresponding bits of a float pointer variable SUM to be the same as those in the uint EXP1.

I tried the command SUM = *SUM | EXP1; but I get compiler errors about having the wrong operands. What is the proper command/syntax for getting the bits in my float pointer to match those of the exponent?

Thanks!
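The align, add, and repack sequence described in this post can be sketched as follows. This is only an illustration under narrow assumptions: both inputs are positive normal numbers, the result does not overflow, and bits shifted out are truncated rather than rounded. The function name `add_positive_normals` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch: add two positive, normal floats with integer operations only.
   Truncates instead of rounding; no zero/denormal/Inf/NaN/overflow handling. */
static float add_positive_normals(float a, float b)
{
    uint32_t ua, ub, ma, mb, sum, bits;
    int32_t ea, eb, e;
    float r;

    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);

    ea = (int32_t)((ua >> 23) & 0xFFu);      /* biased exponents */
    eb = (int32_t)((ub >> 23) & 0xFFu);
    assert((ua >> 31) == 0 && (ub >> 31) == 0);
    assert(ea > 0 && ea < 255 && eb > 0 && eb < 255);

    ma = (ua & 0x007FFFFFu) | 0x00800000u;   /* add the implied 1 */
    mb = (ub & 0x007FFFFFu) | 0x00800000u;

    /* Align: shift the operand with the smaller exponent to the right. */
    if (ea >= eb) {
        e = ea;
        mb = (ea - eb < 32) ? mb >> (ea - eb) : 0;
    } else {
        e = eb;
        ma = (eb - ea < 32) ? ma >> (eb - ea) : 0;
    }

    sum = ma + mb;                           /* at most 25 bits */
    if (sum & 0x01000000u) {                 /* carry out of bit 23: renormalize */
        sum >>= 1;
        e += 1;
    }

    /* Drop the implied 1 and OR the fields back together. */
    bits = ((uint32_t)e << 23) | (sum & 0x007FFFFFu);
    memcpy(&r, &bits, sizeof r);
    return r;
}
```

Because the exponent is kept in biased form throughout, there is no need to add 127 back before shifting it into place.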

8. Originally Posted by DanV2
Ok, so I'm still having a bit of trouble. I realize now that the decimal is not stored how I thought it was. But now I'm not sure how I can store the decimal for performing the operations and printing. For example, one of the first things we are supposed to do is print each input in IEEE format ( +/- 1.FRAC * 2 ^ (EXP) ). I can display the sign and exponent fine, but I don't know how to handle the fraction part so that it will display properly. We are only supposed to use integer operations, so I don't think I can manually convert the 23bit fraction to a decimal value. Anybody here know what I can do? Anything to get me going in the right path is appreciated!
Hint: use the paper 'n pencil method of converting a binary fraction into a decimal one:
Code:
`0.1101 (binary) == 1 * (5/10)^1 + 1 * (5/10)^2 + 0 * (5/10)^3 + 1 * (5/10)^4 == 0.8125 (decimal)`

9. Originally Posted by DanV2
For example, I have the value 0x43000000 stored in a uint. I want to change the corresponding bits of a float pointer variable SUM to be the same as those in the uint EXP1.

I tried the command SUM = *SUM | EXP1; but I get compiler errors about having the wrong operands. What is the proper command/syntax for getting the bits in my float pointer to match those of the exponent?
Change SUM from float* to uint* and point it to the float.
Then assign the uint into the location pointed to by SUM.
Code:
```c
EXP1 = 0x43000000;
SUM = (uint *) &f;
*SUM = EXP1;
```
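A side note on the pointer cast: reading or writing a float through a `uint *` works on most compilers, but it formally violates C's aliasing rules. Copying the bytes with `memcpy` does the same job portably. The helper names `float_to_bits` and `bits_to_float` below are hypothetical, and the sketch assumes a 32-bit float.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the raw bit pattern of a float. */
static uint32_t float_to_bits(float f)
{
    uint32_t u;

    assert(sizeof u == sizeof f);   /* assumes a 32-bit float */
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Hypothetical helper: build a float from a raw bit pattern. */
static float bits_to_float(uint32_t u)
{
    float f;

    assert(sizeof f == sizeof u);
    memcpy(&f, &u, sizeof f);
    return f;
}
```

For the value in the question, `bits_to_float(0x43000000)` yields 128.0f (biased exponent 0x86, fraction zero).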

10. Ok, so you guys have been VERY helpful so far, especially itCbitC, and I definitely appreciate it. I am almost done with the program but have one small problem I can't seem to figure out with division. Here is the relevant code:

Code:
```c
#include <stdio.h>
#include <stdlib.h>

int showmode = 1;

#define ZERO ((uint)0x00000000)
#define SIGNBIT ((uint)0x80000000)
#define EXPONENT ((uint)0x7F800000)
#define FRACTION ((uint)0x007FFFFF)
#define BIT32 ((uint)0x80000000)
#define BIT24 ((uint)0x00800000)
#define PEXP ((uint)0x3F800000)

typedef unsigned int uint;

float doComp(uint *xf1, uint *xf2, char op)
{
    // 1: extract and display sign, biased and
    //    unbiased exponent, plus fraction bit parts

    uint SIGN1 = (*xf1 & SIGNBIT) >> 31;
    uint SIGN2 = (*xf2 & SIGNBIT) >> 31;

    uint EXP1 = ((*xf1 & EXPONENT) >> 23) - 127;
    uint EXP2 = ((*xf2 & EXPONENT) >> 23) - 127;

    uint FRAC1 = (*xf1 & FRACTION);
    uint FRAC2 = (*xf2 & FRACTION);

    FRAC1 |= BIT24;
    FRAC2 |= BIT24;

    float pFRAC1 = ZERO;
    float pFRAC2 = ZERO;
    uint *temp1 = (uint*)&pFRAC1;
    uint *temp2 = (uint*)&pFRAC2;

    *temp1 |= (FRAC1 & FRACTION);
    *temp2 |= (FRAC2 & FRACTION);
    *temp1 |= PEXP;
    *temp2 |= PEXP;

    printf("\nf1:\t%lf = %c %lf * 2^(%i)\n", *(float*)xf1, SIGN1?'-':'+', pFRAC1, EXP1);
    printf("f2:\t%lf = %c %lf * 2^(%i)\n\n", *(float*)xf2, SIGN2?'-':'+', pFRAC2, EXP2);

    // 2: compute f1 op f2 at the bit level by
    //    appropriately shifting and manipulating
    //    the bit components -- normalize result

    float FRES;
    uint *RES = (uint*) &FRES;
    uint FRACRES;
    uint EXPRES;
    uint SIGNRES;
    *RES = ZERO;
    int SHIFT = 0;

    if(op == '-')
    {
        op = '+';
        if(SIGN2 == 0)
            SIGN2 = 1;
        else
            SIGN2 = 0;
    }

    switch(op)
    {
    case '+':
        ...
        break;

    case '*':
        ...
        break;

    case '/':
        if((*xf2 & FRACTION) == 0 && (*xf2 & EXPONENT) == 0)
        {
            if((*xf1 & FRACTION) == 0 && (*xf1 & EXPONENT) == 0)
            {
                *RES |= BIT32;
                *RES |= EXPONENT;
                *RES |= ((uint)(0x000FFFFF));
                break;
            }
            *RES |= (SIGN1 << 31);
            *RES |= EXPONENT;
            break;
        }
        if((*xf1 & FRACTION) == 0 && (*xf1 & EXPONENT) == 0)
            break;

        if(EXP2 == 128 && (*xf2 & FRACTION) == 0)
        {
            if(EXP1 == 128 && (*xf1 & FRACTION) == 0)
            {
                *RES |= BIT32;
                *RES |= EXPONENT;
                *RES |= ((uint)(0x000FFFFF));
                break;
            }
            break;
        }

        FRACRES = ZERO;
        FRAC1 >>= 8;
        FRAC2 >>= 8;

        //printf("FRAC1 = %u\nFRAC2 = %u\n", FRAC1, FRAC2);

        float TEMP = (float)FRAC1 / FRAC2;
        uint *uTEMP = (uint*)&TEMP;
        //printf("(float)FRAC1 / FRAC2 = %lf\n", TEMP);
        FRACRES = *uTEMP & FRACTION;
        FRACRES |= BIT24;
        //printf("*uTEMP & FRACTION = %u\n", *uTEMP & FRACTION);
        //printf("FRACRES = %u\n", FRACRES);

        while(FRACRES < 8388608 && FRACRES != 0)
        {
            SHIFT--;
            FRACRES <<= 1;
        }

        while(FRACRES > 16777215)
        {
            SHIFT++;
            FRACRES >>= 1;
        }

        //printf("FINAL FRACRES = %u\n", FRACRES);

        EXPRES = EXP1 - EXP2 + 127 + SHIFT;
        printf("EXPRES = %i\n", (int)EXPRES - 127);
        EXPRES <<= 23;

        SIGNRES = SIGN1 ^ SIGN2;

        *RES |= SIGNRES << 31;
        *RES |= EXPRES;
        *RES |= (FRACRES & FRACTION);

        break;
    }

    //printf("RES = %u\n", *SUM);
    return FRES;
}

int main(int argc, char *argv[])
{
    float f1, f2, r1, r2;
    uint *xf1 = (uint *)&f1;
    uint *xf2 = (uint *)&f2;
    char op;
    int stop = 0, nitem;

    while ((nitem = scanf("%f %c %f", &f1, &op, &f2)) == 3)
    {
        switch (op)
        {
        case '+': r1 = f1 + f2; break;
        case '-': r1 = f1 - f2; break;
        case '*': r1 = f1 * f2; break;
        case '/': r1 = f1 / f2; break;
        default:
            stop = 1;
            break;
        }

        if (stop) break;

        if (showmode)
        {
            printf("\n%+10.6f = %08x\n", f1, *xf1);
            printf("%+10.6f = %08x\n", f2, *xf2);
        }

        r2 = doComp(xf1, xf2, op);
        printf("result: %f (%f)\n", r1, r2);
    }
    if (nitem != EOF)
        printf("input expression format error\n");
}
```
Now, for some values, the division switch above works fine. But for values like 1 / 5, it'll give me .4 instead of .2. Is there something obvious here I'm missing? I'll be spending more time on this tomorrow, going over it with a fine toothed comb, but any outside help would be appreciated.

Also, I do apologize about the lack of comments. I'm trying to get into the habit of adding them as I go but it isn't easy to remember. Consequently, if the purpose of a piece of code is unclear, ask and I can explain (hopefully).

And yes, the code is a bit messy, but I spent ~10 hours puzzling this out today, so I have no motivation to organize it tonight.... sorry

And of course, thanks again!

11. Originally Posted by DanV2
Now, for some values, the division switch above works fine. But for values like 1 / 5, it'll give me .4 instead of .2. Is there something obvious here I'm missing?
Post a snippet of the code that demonstrates the above problem, instead of the whole thing.

12. It's okay, I actually managed to work it out. I don't know why, but when dividing values that resulted in a fraction part with a value other than 0, the exponent of the result would be off by one bit. If you'd like to try and help me find out why, that'd be welcome. I have, however, already turned in the lab, so this would be for my own personal benefit

Here is the interesting/problematic part of code, now with a few comments!
Code:
```c
typedef unsigned int uint;

float doComp(uint *xf1, uint *xf2, char op)
{
    .......

    switch(op)
    {
    ...

    case '/':
        //Check for exceptions
        ...

        //Align radix points and set exponent bits of result
        ...

        //Perform division of mantissas
        FRAC1 >>= 8;
        FRAC2 >>= 8;

        //Store result in float temporarily
        //Manually change to int later
        float TEMP = (float)FRAC1 / FRAC2;
        uint *uTEMP = (uint*)&TEMP;
        FRACRES = *uTEMP & FRACTION;
        FRACRES |= BIT24;

        //Normalize mantissa of result
        while(FRACRES < 8388608 && FRACRES != 0)
        {
            SHIFT--;
            FRACRES <<= 1;
        }

        while(FRACRES > 16777215)
        {
            SHIFT++;
            FRACRES >>= 1;
        }

        //Set result bits
        EXPRES = EXP1 - EXP2 + 127 + SHIFT;

        //I don't know why I need this, but I do
        //Otherwise, division where the fraction part is not 0
        //Will end up being twice what they should be
        //i.e. -- 1/4 = 0.25
        //but  -- 1/5 = 0.4
        //This if fixes that :D
        if((FRACRES & FRACTION) != ZERO)
            EXPRES--;

        EXPRES <<= 23;
        SIGNRES = SIGN1 ^ SIGN2;

        *RES |= SIGNRES << 31;
        *RES |= EXPRES;
        *RES |= (FRACRES & FRACTION);

        break;
    }

    return FRES;
}
```
To try and make it a bit more readable, I left out parts of the text which I am mostly certain are working properly.

The part in red (the `if` that decrements EXPRES, right after the "I don't know why I need this" comment) fixed the mysterious error. I am still not sure why it happened. As far as I know, it is just a part of IEEE floating point notation of which I am unaware. In testing, however, I did find that the problem only occurred when the 23 fraction bits of the result were NOT all zero.

As I said above, I have turned in the assignment, but I still would like to know why this happened, if anyone knows. Also, if my code is still too much or if something doesn't make sense, just ask.

Thanks!
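One likely explanation, going only by the code posted above: both significands lie in [1, 2), so their quotient lies in (0.5, 2). Whenever FRAC1 is smaller than FRAC2 the quotient is below 1, and `TEMP` stores it with an exponent of -1. Only the fraction bits of `TEMP` are copied into FRACRES, so that -1 is lost and the result comes out doubled (1/5 becomes 0.4). The `if` hides this in the cases where the quotient is below 1, but it would also fire for a quotient such as 1.5 (from 3 / 2), where nothing needs adjusting. The sketch below keeps the float division from the post and simply reads back the exponent that was being dropped; `quotient_exponent` is a hypothetical helper name, and it expects the significands before the `>>= 8`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: unbiased exponent of the significand quotient.
   frac1 and frac2 are 24-bit significands with the implied 1 (bit 23) set. */
static int quotient_exponent(uint32_t frac1, uint32_t frac2)
{
    float temp;
    uint32_t bits;

    assert((frac1 >> 23) == 1 && (frac2 >> 23) == 1);
    temp = (float)(frac1 >> 8) / (float)(frac2 >> 8);  /* same division as in the post */
    memcpy(&bits, &temp, sizeof bits);
    return (int)((bits >> 23) & 0xFFu) - 127;  /* 0 if the quotient is >= 1, else -1 */
}
```

With that, the result exponent would be `EXP1 - EXP2 + 127 + quotient_exponent(FRAC1, FRAC2)`, with no conditional decrement afterwards.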

13. Originally Posted by DanV2
It's okay, I actually managed to work it out. I don't know why, but when dividing values that resulted in a fraction part with a value other than 0, the exponent of the result would be off by one bit. If you'd like to try and help me find out why, that'd be welcome. I have, however, already turned in the lab, so this would be for my own personal benefit

Here is the interesting/problematic part of code, now with a few comments!
Code:
```c
typedef unsigned int uint;

float doComp(uint *xf1, uint *xf2, char op)
{
    .......

    switch(op)
    {
    ...

    case '/':
        //Check for exceptions
        ...

        //Align radix points and set exponent bits of result
        ...

        //Perform division of mantissas

        /* not sure what's the need to divide the mantissas */
        FRAC1 >>= 8;
        FRAC2 >>= 8;

        //Store result in float temporarily
        //Manually change to int later
        float TEMP = (float)FRAC1 / FRAC2;
        uint *uTEMP = (uint*)&TEMP;
        FRACRES = *uTEMP & FRACTION;
        FRACRES |= BIT24;

        //Normalize mantissa of result

        /* what's the reason for comparing FRACRES with all these constants */
        while(FRACRES < 8388608 && FRACRES != 0)
        {
            SHIFT--;
            FRACRES <<= 1;
        }

        while(FRACRES > 16777215)
        {
            SHIFT++;
            FRACRES >>= 1;
        }

        //Set result bits
        EXPRES = EXP1 - EXP2 + 127 + SHIFT;

        //I don't know why I need this, but I do
        //Otherwise, division where the fraction part is not 0
        //Will end up being twice what they should be
        //i.e. -- 1/4 = 0.25
        //but  -- 1/5 = 0.4
        //This if fixes that :D
        if((FRACRES & FRACTION) != ZERO)
            EXPRES--;

        EXPRES <<= 23;
        SIGNRES = SIGN1 ^ SIGN2;

        *RES |= SIGNRES << 31;
        *RES |= EXPRES;
        *RES |= (FRACRES & FRACTION);

        break;
    }

    return FRES;
}
```
To try and make it a bit more readable, I left out parts of the text which I am mostly certain are working properly.

The part in red (the `if` that decrements EXPRES, right after the "I don't know why I need this" comment) fixed the mysterious error. I am still not sure why it happened. As far as I know, it is just a part of IEEE floating point notation of which I am unaware. In testing, however, I did find that the problem only occurred when the 23 fraction bits of the result were NOT all zero.

As I said above, I have turned in the assignment, but I still would like to know why this happened, if anyone knows. Also, if my code is still too much or if something doesn't make sense, just ask.

Thanks!
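It'd make sense if you care to explain the part of the code that is highlighted in blue (the `FRAC1 >>= 8` / `FRAC2 >>= 8` shifts and the two normalization loops, marked with my comments above).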

14. I understand what all the code is doing, as I've done this and much more previously.
Unfortunately you seem to have partly missed the point of this exercise...
It seems to me that you're supposed to emulate IEEE754 floating point maths without using any code that would involve using floating point instructions. Yet this line of code here just straight out does floating point division, the entire thing you're trying to emulate through bit-manipulation:
Code:
`float TEMP = (float)FRAC1 / FRAC2;`
Now if it were me I certainly wouldn't give you a zero mark on this because clearly a fair amount of the code here is actually more or less correct, such as the blue parts that re-normalise the significand.

Were you supposed to make it work for denormalised values as well?
What about NANs, is that handled by the bits you omitted?

What about unit tests; had you been taught how to make those, or do you get given the unit test framework for this? Or how else do you make sure that your code gives the right answers?

15. Originally Posted by itCbitC
It'd make sense if you care to explain the part of the code that is highlighted in blue.
Right, like iMalc mentioned, the while loops are for normalizing the mantissa of the result. The fraction part of an IEEE Floating point number should be 23 bits with an implied 1 bit, so the while loops shift right or left as necessary to normalize the result.

As for shifting the two FRAC values to the right, I'm not completely sure why we do, but the TA told us that it was necessary to truncate the values so that the result would fit into an integer. At least, that's what I remember from the explanation he gave.

Originally Posted by iMalc
I understand what all the code is doing, as I've done this and much more previously.
Unfortunately you seem to have partly missed the point of this exercise...
It seems to me that you're supposed to emulate IEEE754 floating point maths without using any code that would involve using floating point instructions. Yet this line of code here just straight out does floating point division, the entire thing you're trying to emulate through bit-manipulation:
Code:
`float TEMP = (float)FRAC1 / FRAC2;`
Now if it were me I certainly wouldn't give you a zero mark on this because clearly a fair amount of the code here is actually more or less correct, such as the blue parts that re-normalise the significand.

Were you supposed to make it work for denormalised values as well?
What about NANs, is that handled by the bits you omitted?

What about unit tests; had you been taught how to make those, or do you get given the unit test framework for this? Or how else do you make sure that your code gives the right answers?
For the floating point division, I was a bit uncertain. However, in the lab instructions, the TA indicated that for multiplication, you should store the result in a temporary floating point number. When I inquired about it, he said it should be fine if we use a temporary floating point number to store the result of the division. I'm not sure how else you would be able to calculate the result of the division of the mantissas as integer division would eliminate the fraction part of the result.

Operations that resulted in +/- inf or nan we didn't have to handle, but I put in checks for them anyways to return the proper results. We only had to check for 0/0 and division by 0, both of which I did but eliminated from the above code as I believe it works fine.

I'm not sure what you mean by making it work for denormalized values or by unit tests. As for making sure we return the right values, the main function will perform normal floating point operations and print both that value and the value from the doComp function. The main function I was given also already had code to read in the two float values and the operation from the user and to convert the float pointers to unsigned int pointers.
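On the question of how to divide the mantissas when integer division throws away the fraction: one common approach is to shift the dividend left first, so that the integer quotient still carries the fractional bits. The sketch below is only an illustration under stated assumptions: both inputs are 24-bit significands of normal numbers with the implied 1 set, a 64-bit intermediate is allowed, and extra bits are truncated rather than rounded. The name `div_significands` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: divide two 24-bit significands (implied 1 set) with integers only.
   Returns a normalized 24-bit significand and stores 0 or -1 in *exp_adjust,
   to be added to the difference of the two exponents. */
static uint32_t div_significands(uint32_t m1, uint32_t m2, int *exp_adjust)
{
    uint64_t q;

    assert(exp_adjust != NULL);
    assert((m1 >> 23) == 1 && (m2 >> 23) == 1);

    /* Pre-shift so the quotient keeps 24 fractional bits: q = (m1 / m2) * 2^24. */
    q = ((uint64_t)m1 << 24) / m2;

    if (q >= ((uint64_t)1 << 24)) {   /* quotient in [1, 2): drop one extra bit */
        *exp_adjust = 0;
        return (uint32_t)(q >> 1);
    }
    *exp_adjust = -1;                 /* quotient in (0.5, 1): already 2 * quotient */
    return (uint32_t)q;
}
```

For 1 / 5 the significands are 0x800000 and 0xA00000; the sketch returns 0xCCCCCC (about 1.6) with an adjustment of -1, so the result is 1.6 * 2^(0 - 2 - 1), or roughly 0.2. The last bit differs from the hardware result because of the truncation.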
