FPN math

**awsdert** · 10-29-2019

No, I'm not trying to convert a simple integer. I'm trying to construct an FPN from 2 separate integers, one for the whole-number part and another for the decimal part. I'm doing it that way because the function in mitsy will work that way: it starts with a normal integer, and when it hits '.' it passes the value into another variable, resets the one it is currently working with, and continues normally. After the loop ends it checks the aforementioned variable and enters FPN mode if it's not 0. Now that I think of it, that will exclude 0.N, so I'll have to rework that, but it doesn't change that I need 2 variables to keep track of. The reason I don't just use a native float is that I need to support compiling to non-native systems, and the simplest way to do that is to construct the binary directly. The binary can then be passed into instructions or, in preprocessor mode, be passed into a native float via functions and then used in the expression given to the preprocessor.
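The loop described above could be sketched roughly as follows. This is not mitsy's actual code: `dec_parts` and `split_decimal` are hypothetical names, and the sketch assumes a plain ASCII decimal literal with no sign, exponent, or overflow checking. It tracks an explicit "saw a dot" flag and a count of fractional digits, which is one way around the 0.N problem and also keeps leading zeros after the '.' from being lost.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical holder for the two parts of a decimal literal. */
typedef struct {
    uint64_t whole;       /* digits before the '.' */
    uint64_t frac;        /* digits after the '.', read as an integer */
    unsigned frac_digits; /* how many digits followed the '.' */
    bool     is_fpn;      /* true once a '.' has been seen */
} dec_parts;

/* Split a literal such as "3.14" into whole = 3, frac = 14, frac_digits = 2.
 * Assumption: no overflow checking, no sign or exponent handling. */
static dec_parts split_decimal(const char *s)
{
    dec_parts p = {0, 0, 0, false};

    for (; *s; ++s) {
        if (*s == '.' && !p.is_fpn) {
            p.is_fpn = true;            /* switch to the fractional part */
        } else if (*s >= '0' && *s <= '9') {
            unsigned d = (unsigned)(*s - '0');
            if (p.is_fpn) {
                p.frac = p.frac * 10 + d;
                ++p.frac_digits;
            } else {
                p.whole = p.whole * 10 + d;
            }
        } else {
            break;                      /* not part of the literal */
        }
    }
    return p;
}
```

With this layout, `split_decimal("0.5")` still reports `is_fpn` as true even though the whole part is 0, and `split_decimal("3.05")` keeps `frac_digits` at 2 so the fraction is not mistaken for .5.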

**flp1969** · 10-29-2019

Originally Posted by awsdert

No, not trying to convert a simple integer, trying to construct an FPN from 2 separate integers, one for the whole number part and another for the decimal part

Fixed point to floating point, then? If you already have the bits, isn't it just a matter of shifting them to the correct position?

Of course, you have to recalculate the fractional part. If you are dealing with the integral and fractional parts as 32-bit values, 2^32 (fractional) is the same as 1.0, so:

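n = (2^32*f)/(10^(floor(log10(f)) + 1)),

where floor(log10(f)) + 1 is the number of decimal digits in f. Count the digits as typed: for a fraction such as .05 the divisor is 10^2, not 10^1.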
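As a rough illustration of that scaling step, here is a minimal sketch in integer-only C. The function name `frac_to_q32` is made up for this example, and it takes the digit count as a parameter instead of computing a logarithm, on the assumption that the parser already knows how many digits followed the '.'. It also assumes at most 9 fractional digits so that the 64-bit intermediate cannot overflow.

```c
#include <stdint.h>

/* Scale a decimal fraction f / 10^digits to a 32-bit binary fraction,
 * i.e. return floor(2^32 * f / 10^digits).
 * Assumptions: digits <= 9 and f < 10^digits, so f << 32 fits in 64 bits
 * and the result fits in 32 bits. */
static uint32_t frac_to_q32(uint32_t f, unsigned digits)
{
    uint64_t pow10 = 1;

    for (unsigned i = 0; i < digits; ++i)
        pow10 *= 10;

    return (uint32_t)(((uint64_t)f << 32) / pow10);
}
```

For example, `frac_to_q32(14, 2)` yields 601295421 (0x23D70A3D), and `frac_to_q32(5, 1)` yields 0x80000000, which is 0.5 in this representation. The division truncates, so the low bit can be off by one from a correctly rounded value.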

Taking 3.14 as your example: f=14 can be encoded as (2^32*14)/(10^2) = 601295421 (0b00100011110101110000101000111101 in 32-bit binary; of course this calculation must be done with enough precision to avoid overflows). So 3.14 can be encoded as 0b11.[00100011110101110000101000111101]. Shifting the binary point 1 position to the left we get 0b1.100100011110101110000101000111101 and e=1. Now we have our "implicit" one and a fractional part that fits the float format once restricted to 23 bits: [0b1.100_1000_1111_0101_1100_0010]_1000111101.

So, M=0b10010001111010111000010 (0x48f5c2 - 23 bits), E=128 (0x80) (E=e+127) and S=0. Almost exactly what is expected for a floating point (float) value. The only difference is rounding. Notice the final _1000111101 part: if its msb is 1 we need to add 1 to M, which gives exactly the correct value, M=0x48f5c3.

To be sure:

v = (1 + 0x48f5c3 / 2^23) * 2^1 = (1+4781507/2^23)*2 = 3.14000010490417480468
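The shift, round, and pack steps from the 3.14 example can be put together roughly like this. It is only a sketch under stated assumptions: `build_float_bits` and `bits_to_float` are hypothetical names, the inputs are a 32-bit whole part and a 32-bit binary fraction as computed above, the result is always positive (S=0), and rounding is the simple "add 1 if the first discarded bit is 1" rule from the post, which differs from IEEE 754 round-to-nearest-even on exact ties. `bits_to_float` is only meaningful on a host whose `float` is IEEE 754 binary32.

```c
#include <stdint.h>
#include <string.h>

/* Pack whole.frac32 (a 32.32 fixed-point value) into IEEE 754 binary32 bits.
 * Assumptions: positive values only; round by adding 1 when the first
 * discarded bit is set; the exponent always lands in the normal range
 * because the input spans only 2^-32 .. 2^32. */
static uint32_t build_float_bits(uint32_t whole, uint32_t frac32)
{
    uint64_t fixed = ((uint64_t)whole << 32) | frac32;
    if (fixed == 0)
        return 0;                       /* +0.0 */

    int p = 63;                         /* position of the leading 1 bit */
    while (!((fixed >> p) & 1))
        --p;

    int e = p - 32;                     /* unbiased exponent */
    uint64_t mant;                      /* 24 bits: implicit 1 plus M */

    if (p > 23) {
        int shift = p - 23;
        mant = fixed >> shift;
        if ((fixed >> (shift - 1)) & 1) /* msb of the discarded bits */
            ++mant;
        if (mant >> 24) {               /* rounding carried into bit 24 */
            mant >>= 1;
            ++e;
        }
    } else {
        mant = fixed << (23 - p);       /* nothing discarded */
    }

    uint32_t E = (uint32_t)(e + 127);   /* biased exponent */
    return (E << 23) | ((uint32_t)mant & 0x7FFFFFu);
}

/* Reinterpret the bits as a native float (IEEE 754 hosts only). */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Calling `build_float_bits(3, 601295421u)` gives 0x4048F5C3, that is S=0, E=0x80, M=0x48f5c3, matching the values worked out by hand above.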
