Thread: FPN math

  1. #46
    Registered User Sir Galahad's Avatar
    Join Date
    Nov 2016
    Location
    The Round Table
    Posts
    277
    Quote Originally Posted by awsdert View Post
    Some of you will already know this but I'm working on a function for a compiler I'm making, this function is supposed to read both integers and floating numbers based on text provided. I was struggling to identify what variables I needed where and how to handle them, I then thought to do a small project/file with fixed data to verify those details, thing is I've now hit a roadblock and struggling to come up with ideas of how to fix it, I think my problem lies in how I calculate the mantissa (parameter man) but it could also be the exponent (parametet exp) or both, I started with the floating part as 0 to check the exponent so I don't think it's that though. Here's what I have so far:
    Code:
    #include <stdlib.h>
    #include <stdio.h>
    #include <inttypes.h>
    #include <float.h>
    typedef unsigned char uchar;
    typedef unsigned long long ullong;
    #define TYPE float
    typedef union LDBL {
    	TYPE fpn;
    	uchar hex[sizeof(TYPE)];
    	struct {
    		ulong man : 23;
    		ulong exp : 8;
    		ulong sig : 1;
    	};
    } LDBL;
    
    void print( char *text, LDBL val ) {
    	printf("%s",text);
    	for ( size_t i = 0; i < sizeof(TYPE); ++i ) {
    		printf( " %02X", val.hex[i] );
    	}
    }
    #define NUM 2
    #define FPN 0
    int main() {
    	ulong num = 3, fpn = 0, exp = 0;
    	ulong bit = 1, pos = 0;
    	LDBL test = {0}, mine = {0};
    	test.fpn = fpn;
    	test.fpn /= 10;
    	test.fpn += num;
    	printf("%E\n", test.fpn);
    	for ( pos = 0; bit; bit <<= 1 ) {
    		mine.man |= bit;
    		if ( num & bit ) mine.exp = pos;
    		++pos;
    		if ( pos == FLT_MAX_EXP ) break;
    	}
    	mine.exp += FLT_MAX_EXP;
    	if ( exp )
    		mine.exp = (mine.exp == (FLT_MAX_EXP) * 2) ? -1 : mine.exp + exp;
    	num *= 10;
    	bit = 1;
    	pos = 0;
    	if ( mine.exp != (FLT_MAX_EXP + exp) )
    		for ( ; !(bit & num); bit <<= 1, ++pos );
    	for ( ; pos < 23; ++pos ) {
    		fpn <<= 1;
    		mine.man <<= 1;
    		if ( !fpn ) break;
    		mine.man |= (fpn & bit);
    	}
    	print( "test", test );
    	putchar('\n');
    	print( "mine", mine );
    	putchar('\n');
    	printf( "%08X %08X\n", test.exp, mine.exp );
    	return 0;
    }
    I'll take a look at some bugs mentioned of mitsy before taking a break until I think of something or someone else manages to think of something (and posts it)

    Edit: Oh and here are my results:
    Code:
    gcc -Wall -o "test_fpn" "test_fpn.c" && "./test_fpn"
    3.000000E+00
    test 00 00 40 40
    mine FE FF FF 40
    00000080 00000081
    Compilation finished successfully.

    I was actually about to start on a very similar problem. What about something along these lines?

    Code:
    double fpn(long ipart, long rpart, long epart)
    { 
     const double epsilon = 0.000001;
     double i = ipart, r = rpart, e = epart;
     while(r > 1)
      r /= 10
     double sum = i + r, 
      result = pow(sum, e),
      check = pow(result, 1 / e),
      distance = abs(sum - check);
     return distance < epsilon 
      && distance >= 0 ? 
       result : NAN;
    }
    Not sure whether or not that would be robust enough for you but honestly just wondering myself if something like that might work.
    Last edited by Sir Galahad; 11-16-2019 at 07:32 PM.

  2. #47
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    1,733
    Quote Originally Posted by Sir Galahad View Post
    I was actually about to start on a very similar problem. What about something along these lines?

    Code:
    double fpn(long ipart, long rpart, long epart)
    { 
     const double epsilon = 0.000001;
     double i = ipart, r = rpart, e = epart;
     while(r > 1)
      r /= 10
     double sum = i + r, 
      result = pow(sum, e),
      check = pow(result, 1 / e),
      distance = abs(sum - check);
     return distance < epsilon 
      && distance >= 0 ? 
       result : NAN;
    }
    Not sure whether or not that would be robust enough for you but honestly just wondering myself if something like that might work.
    I appreciate the effort but that effecttively amounts to just a hack, I'm planning for this to support literals like 0.1f36 or 0.1f128 and not just base 2, 10 & 16 but anything between 2 to 62, to that end I need to programme around manual value creation as the base, automated limits the size that can be used to just 64 bits (or 80 if long double is extended)

    Anyways I've managed to refine my function further but not managed to fix the incorrect results, can anyone see any common points on these? (besides exponent equaling 10 since other values with that exponent value are correct):
    Code:
    gcc -Wall -o "test_fpn" "test_fpn.c" && "./test_fpn"
    pos = 24
    gcc = 001000111110100110101100
    mcc = 010001111101001101010111
    test.man = 000000023E9AC mine.man = 000000047D357 exp t = 0A0 m = 0A0
    fpn t = 1.100000e+10 m = 1.341007e+10 value '1.1e+10'
    pos = 25
    gcc = 011010011111001010111111
    mcc = 001001111100101011111101
    test.man = 000000069F2BF mine.man = 000000027CAFD exp t = 0A1 m = 0A1
    fpn t = 3.140000e+10 m = 2.252078e+10 value '3.14e+10'
    pos = 24
    gcc = 001000111110100110101100
    mcc = 010001111101001101010111
    test.man = 000000023E9AC mine.man = 000000047D357 exp t = 0A0 m = 0A0
    fpn t = -1.100000e+10 m = -1.341007e+10 value '-1.1e+10'
    pos = 25
    gcc = 011010011111001010111111
    mcc = 001001111100101011111101
    test.man = 000000069F2BF mine.man = 000000027CAFD exp t = 0A1 m = 0A1
    fpn t = -3.140000e+10 m = -2.252078e+10 value '-3.14e+10'
    Count = 106, Wrong = 4
    Compilation finished successfully.
    I only know that I need to find some condition for increasing pos before it is used since that's used to clamp the loop:
    Code:
    #include <limits.h>
    #include <inttypes.h>
    #include <float.h>
    #include <string.h>
    #include <stdio.h>
    typedef signed char schar;
    typedef unsigned char uchar;
    typedef unsigned long ulong;
    typedef signed long long sllong;
    typedef unsigned long long ullong;
    
    #define HALF_MAN_BIT 10
    #define HALF_EXP_BIT 5
    #define LDBL_MAN_BIT 52
    #define LDBL_EXP_BIT 37
    
    #define FLT_SCNf "%f"
    #define FLT_SCNe "%e"
    #define FLT_MAN_BIT 23
    #define FLT_EXP_BIT 8
    typedef union FLT_UNION
    {
    	float fpn;
    	uchar raw[sizeof(float)];
    	struct
    	{
    		ullong man:FLT_MAN_BIT;
    		ullong exp:FLT_EXP_BIT;
    		ullong sig:1;
    	};
    } FLT_UNION;
    
    #define DBL_SCNf "%lf"
    #define DBL_SCNe "%le"
    #define DBL_MAN_BIT 52
    #define DBL_EXP_BIT 11
    typedef union DBL_UNION
    {
    	double fpn;
    	uchar raw[sizeof(double)];
    	struct
    	{
    		ullong man:DBL_MAN_BIT;
    		ullong exp:DBL_EXP_BIT;
    		ullong sig:1;
    	};
    } DBL_UNION;
    
    #define FPN_PFX(VAL) FLT##_##VAL
    #define FPN_MAX FPN_PFX(_MAX)
    #define FPN_SCNf FPN_PFX(SCNf)
    #define FPN_SCNe FPN_PFX(SCNe)
    #define FPN_MAN_BIT FPN_PFX(MAN_BIT)
    #define FPN_EXP_BIT FPN_PFX(EXP_BIT)
    #define FPN_MAX_EXP FPN_PFX(MAX_EXP)
    #define FPN_MAX_10_EXP FPN_PFX(MAX_10_EXP)
    typedef FPN_PFX(UNION) FPN_UNION;
    typedef struct FPN_RET {
    	long pos;
    	FPN_UNION val;
    } FPN_RET;
    
    int fpn_read (char *text);
    
    char *fpn_text[] = {
    	"0",
    	"1", "10", "16", "100", "101",
    	"0.1", "0.01", "0.001", "0.101",
    	"1.1", "1.01", "1.001", "1.101", "3.14",
    	"1e+1", "1e+8", "1e+10", "1e+100", "3e+1",
    	"1e-1", "1e-8", "1e-10", "1e-100", "3e-1",
    	".1e+1", ".1e+8", ".1e+10", ".1e+100", ".3e+1",
    	".1e-1", ".1e-8", ".1e-10", ".1e-100", ".3e-1",
    	"1.1e+1", "1.1e+8", "1.1e+10", "1.1e+100", "3.1e+1",
    	"1.1e-1", "1.1e-8", "1.1e-10", "1.1e-100", "3.1e-1",
    	"3.14e+1", "3.14e+8", "3.14e+10", "3.14e+100",
    	"3.14e-1", "3.14e-8", "3.14e-10", "3.14e-100",
    	"-0",
    	"-1", "-10", "-16", "-100", "-101",
    	"-0.1", "-0.01", "-0.001", "-0.101",
    	"-1.1", "-1.01", "-1.001", "-1.101", "-3.14",
    	"-1e+1", "-1e+8", "-1e+10", "-1e+100", "-3e+1",
    	"-1e-1", "-1e-8", "-1e-10", "-1e-100", "-3e-1",
    	"-0.1e+1", "-0.1e+8", "-0.1e+10", "-0.1e+100", "-0.3e+1",
    	"-0.1e-1", "-0.1e-8", "-0.1e-10", "-0.1e-100", "-0.3e-1",
    	"-1.1e+1", "-1.1e+8", "-1.1e+10", "-1.1e+100", "-3.1e+1",
    	"-1.1e-1", "-1.1e-8", "-1.1e-10", "-1.1e-100", "-3.1e-1",
    	"-3.14e+1", "-3.14e+8", "-3.14e+10", "-3.14e+100",
    	"-3.14e-1", "-3.14e-8", "-3.14e-10", "-3.14e-100",
    	NULL
    };
    
    int main ()
    {
    	size_t i = 0, wrong = 0;
    	while (fpn_text[i] && i < 110)
    		wrong += fpn_read (fpn_text[i++]);
    	while (fpn_text[i]) ++i;
    	printf( "Count = %zu, Wrong = %zu", i, wrong );
    	return 0;
    }
    
    void printb( char const *text, void const * addr, size_t const bits ) {
    	size_t size = bits / CHAR_BIT, b = 0;
    	uchar const *a = addr;
    	uchar val, bit;
    	if ( bits % CHAR_BIT ) ++size;
    	printf(text);
    	while ( size-- ) {
    		for ( val = a[size], bit = 1; bit; val <<= 1, bit <<= 1, ++b ) {
    			//if ( b == bits ) return;
    			putchar( '0' + !!(val & SCHAR_MIN) );
    		}
    		if ( b >= bits ) return;
    	}
    }
    
    FPN_RET fpn_make ( ulong base, ullong num, ullong fpn, long exp, long dig)
    {
    	FPN_RET ret = {0};
    	long //_exp = exp,
    		exp_bias = FPN_MAX_EXP,
    		exp_10_max = FPN_MAX_10_EXP,
    		pos = 0, pos_max = FPN_MAN_BIT;
    	ullong one = 1, NUM;
    	if (dig < 1) return ret;
    	if ( exp <= -exp_10_max )
    		return ret;
    	if ( exp >= exp_10_max ) {
    		ret.val.exp = ~0;
    		return ret;
    	}
    	if ( num ) for ( NUM = num; NUM; NUM /= base, --dig );
    	else if ( fpn ) --dig;
    	else return ret;
    	/* Calculate one */
    	for (; one <= fpn; one *= base, --dig );
    	for (; dig > 0; one *= base, --dig );
    	/* Preserve exponent in case we need it later */
    	pos = exp;
    	while ( pos > 0 ) {
    		num *= base;
    		fpn *= base;
    		if ( fpn >= one ) {
    			NUM = fpn / one;
    			fpn -= NUM * one;
    			num += NUM;
    		}
    		--pos;
    	}
    	while ( pos < 0 ) {
    		NUM = num % base;
    		fpn += NUM * one;
    		one *= base;
    		num /= base;
    		++pos;
    	}
    	/* Calculate normal exponent */
    	pos = 0;
    	if ( num ) {
    		for ( ; !(num & 1); ret.val.exp++, num >>= 1 );
    		for ( NUM = num; NUM > 1; ++pos, NUM >>= 1 );
    	}
    	else for ( NUM = fpn; NUM < one; --pos, NUM <<= 1 );
    	/* Set exponent and mantissa */
    	ret.val.man = num;
    	ret.val.exp += exp_bias + pos - 1;
    	ret.pos = pos;
    	for ( ; pos < pos_max; ++pos)
    	{
    		fpn *= 2;
    		ret.val.man <<= 1;
    		if (fpn >= one)
    		{
    			ret.val.man |= 1;
    			fpn -= one;
    		}
    	}
    	fpn *= 2;
    	if ( fpn >= one ) ret.val.man++;
    	return ret;
    }
    
    int fpn_read (char *text)
    {
    	char *_text = text;
    	ulong base = 10;
    	long exp = 0, dig = 0;
    	ullong num = 0, fpn = 0;
    	FPN_UNION test = { 0 };
    	FPN_RET ret = {0};
    	_Bool same = 0, neg = (*text == '-');
    	sscanf (text, FPN_SCNf, &(test.fpn));
    	if ( neg || *text == '+' ) {
    		++text;
    	}
    	while (*text == '0')
    		++text;
    	for (; *text >= '0' && *text <= '9'; ++text)
    	{
    		num *= base;
    		num += (*text - '0');
    		++dig;
    	}
    	if (dig < 1)
    		dig = 1;
    	if (*text == '.')
    	{
    		for (++text; *text >= '0' && *text <= '9'; ++text)
    		{
    			fpn *= base;
    			fpn += (*text - '0');
    			++dig;
    		}
    	}
    	if (*text == 'e' || *text == 'E')
    	{
    		if (*(++text) == '-')
    		{
    			for (++text; *text >= '0' && *text <= '9'; ++text)
    			{
    				exp *= base;
    				exp -= (*text - '0');
    			}
    		}
    		else
    		{
    			for (++text; *text >= '0' && *text <= '9'; ++text)
    			{
    				exp *= base;
    				exp += (*text - '0');
    			}
    		}
    	}
    	ret = fpn_make (base, num, fpn, exp, dig);
    	ret.val.sig = neg;
    	same = ( test.man == ret.val.man &&
    		test.exp == ret.val.exp && test.sig == ret.val.sig );
    	if ( !same ) {
    		printf("pos = %ld\n", ret.pos );
    #if 0
    		printb( "num = ", &num, FPN_MAN_BIT );
    		putchar('\n');
    		printb( "fpn = ", &fpn, FPN_MAN_BIT );
    		putchar('\n');
    #endif
    		num = test.man;
    		printb( "gcc = ", &num, FPN_MAN_BIT );
    		putchar('\n');
    		num = ret.val.man;
    		printb( "mcc = ", &num, FPN_MAN_BIT );
    		putchar('\n');
    		printf ("test.man = %013llX ", (ullong)(test.man) );
    		printf ("mine.man = %013llX ", (ullong)(ret.val.man));
    		printf ("exp t = %03X m = %03X\n", test.exp, ret.val.exp);
    		printf ("fpn t = " FPN_SCNe " m = " FPN_SCNe " ", test.fpn, ret.val.fpn);
    		printf ("value '%s'\n", _text);
    	}
    	return !same;
    }

  3. #48
    misoturbutc Hodor's Avatar
    Join Date
    Nov 2013
    Posts
    1,791
    Well, dtoa.c is roughly 6000 lines long (including comments, of course) to do the same thing you're trying to do. I reckon that if David M. Gay, Apple, BSD, gcc and a whole bunch of other people could have made it shorter and less complex they probably would have. It uses arbitrary precision integers ("big ints") for the internal calculations and Knuth (in TAOCP, Volume 2, Seminumerical Algorithms in the sections about floating point) alludes to big integers as well. From what I've read and the code that I've looked at the biggest hassle seems to be handling edge cases and rounding modes. Of course, the C Standard does not require IEEE-754 compliance, it only suggests it (and using the gcc flag -funsafe-math-optimizations basically makes gcc abandon it completely). So I think if it was me doing this I'd ask the question "Do I really need this to comply with IEEE-754?" or I'd just use one of the implementations already available with a compatible open source license.

    Edit: If I was hiring someone to write this (I admit that there is no way I'd do it myself and even if I did attempt it I wouldn't trust the result) I'd want them to have a degree in mathematics (better still a PhD) because from what I've read and from the source code I've looked at it's not simple and not straightforward.
    Last edited by Hodor; 11-17-2019 at 11:53 PM.

  4. #49
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    1,733
    Quote Originally Posted by Hodor View Post
    Well, dtoa.c is roughly 6000 lines long (including comments, of course) to do the same thing you're trying to do. I reckon that if David M. Gay, Apple, BSD, gcc and a whole bunch of other people could have made it shorter and less complex they probably would have. It uses arbitrary precision integers ("big ints") for the internal calculations and Knuth (in TAOCP, Volume 2, Seminumerical Algorithms in the sections about floating point) alludes to big integers as well. From what I've read and the code that I've looked at the biggest hassle seems to be handling edge cases and rounding modes. Of course, the C Standard does not require IEEE-754 compliance, it only suggests it (and using the gcc flag -funsafe-math-optimizations basically makes gcc abandon it completely). So I think if it was me doing this I'd ask the question "Do I really need this to comply with IEEE-754?" or I'd just use one of the implementations already available with a compatible open source license.

    Edit: If I was hiring someone to write this (I admit that there is no way I'd do it myself and even if I did attempt it I wouldn't trust the result) I'd want them to have a degree in mathematics (better still a PhD) because from what I've read and from the source code I've looked at it's not simple and not straightforward.
    Well IEE754 has the highest likelihood of matching what modern CPUs do atm so I'd rather have results matching that and WOW 6000 lines, man that is a lot, did they have a list of those edge cases? And yes to some extent I will be using big int math in the final function but that more to ensure it can handle upto f1024 (I've read somewhere that some integers can go that high so matching that seems a reasonable limit), another reason for matching IEE754 is that the result should be easy to transform into other formats if need be (haven't checked that since that is for later targets)

  5. #50
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    1,733
    Well I've just tried using a gcc extension (__int128) to see if any bits were being cropped before setting pos, it was not.

  6. #51
    Registered User Sir Galahad's Avatar
    Join Date
    Nov 2016
    Location
    The Round Table
    Posts
    277
    Quote Originally Posted by awsdert View Post
    I appreciate the effort but that effecttively amounts to just a hack, I'm planning for this to support literals like 0.1f36 or 0.1f128 and not just base 2, 10 & 16 but anything between 2 to 62, to that end I need to programme around manual value creation as the base, automated limits the size that can be used to just 64 bits (or 80 if long double is extended)
    Yes but bit twiddling with floats and doubles isn't something you want to be doing either unless you absolutely have to. And in this case you don't because there does exist SOME way to construct them directly from floating point numbers using the constituent integer parts in a satisfactory manner.

    Quote Originally Posted by Hodor View Post
    Well, dtoa.c is roughly 6000 lines long (including comments, of course) to do the same thing you're trying to do. I reckon that if David M. Gay, Apple, BSD, gcc and a whole bunch of other people could have made it shorter and less complex they probably would have. It uses arbitrary precision integers ("big ints") for the internal calculations and Knuth (in TAOCP, Volume 2, Seminumerical Algorithms in the sections about floating point) alludes to big integers as well. From what I've read and the code that I've looked at the biggest hassle seems to be handling edge cases and rounding modes. Of course, the C Standard does not require IEEE-754 compliance, it only suggests it (and using the gcc flag -funsafe-math-optimizations basically makes gcc abandon it completely). So I think if it was me doing this I'd ask the question "Do I really need this to comply with IEEE-754?" or I'd just use one of the implementations already available with a compatible open source license.

    Edit: If I was hiring someone to write this (I admit that there is no way I'd do it myself and even if I did attempt it I wouldn't trust the result) I'd want them to have a degree in mathematics (better still a PhD) because from what I've read and from the source code I've looked at it's not simple and not straightforward.
    It's a tough standard to adhere to with enough corner cases to keep even an expert programmer busy for several weeks/months.

  7. #52
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    1,733
    Quote Originally Posted by Sir Galahad View Post
    Yes but bit twiddling with floats and doubles isn't something you want to be doing either unless you absolutely have to. And in this case you don't because there does exist SOME way to construct them directly from floating point numbers using the constituent integer parts in a satisfactory manner.
    As I stated before I plan to implement larger sized floats than what is supported by even long double, to this end I need to implement the logic myself (which I have fixed just now btw, just need to identify correct condition for incrementing the mantissa)
    Quote Originally Posted by Sir Galahad View Post
    It's a tough standard to adhere to with enough corner cases to keep even an expert programmer busy for several weeks/months.
    As I asked before is there a list of those corner cases for me to test against?

    Finally, I want to say this in the loudest voice possible sooo... I'VE DONE IT!!! i FIXED THE INCORRECTLY POSITIONED MANTISSA!!!

    Ahem, sorry 'bout that, couldn't resist after spending so long on it, turned out to be a rather simplistic solution
    Code:
    ...
    	if ( pos > pos_max ) {
    		pos_max = pos - pos_max;
    		ret.val.man = num >> pos_max;
    		fpn <<= pos_max;
    	}
    	for ( ; pos < pos_max; ++pos)
    ...
    Edit:
    ...
    ...
    ...
    ...
    ...
    Code:
    gcc -Wall -o "test_fpn" "test_fpn.c" && "./test_fpn"
    Count = 106, Wrong = 0
    Compilation finished successfully.
    Moved the code I previously posted into onlinegdb:
    GDB online Debugger | Code, Compile, Run, Debug online C, C++
    Last edited by awsdert; 11-18-2019 at 01:47 PM.

  8. #53
    Registered User Sir Galahad's Avatar
    Join Date
    Nov 2016
    Location
    The Round Table
    Posts
    277
    Quote Originally Posted by awsdert View Post
    As I stated before I plan to implement larger sized floats than what is supported by even long double, to this end I need to implement the logic myself (which I have fixed just now btw, just need to identify correct condition for incrementing the mantissa)

    As I asked before is there a list of those corner cases for me to test against?

    Finally, I want to say this in the loudest voice possible sooo... I'VE DONE IT!!! i FIXED THE INCORRECTLY POSITIONED MANTISSA!!!

    Ahem, sorry 'bout that, couldn't resist after spending so long on it, turned out to be a rather simplistic solution
    Code:
    ...
    	if ( pos > pos_max ) {
    		pos_max = pos - pos_max;
    		ret.val.man = num >> pos_max;
    		fpn <<= pos_max;
    	}
    	for ( ; pos < pos_max; ++pos)
    ...
    Edit:
    ...
    ...
    ...
    ...
    ...
    Code:
    gcc -Wall -o "test_fpn" "test_fpn.c" && "./test_fpn"
    Count = 106, Wrong = 0
    Compilation finished successfully.
    Moved the code I previously posted into onlinegdb:
    GDB online Debugger | Code, Compile, Run, Debug online C, C++
    Well that's good news. You obviously have a much better grasp of the problem than I could ever hope to. And no I don't know of any actual list of corner cases, just a memory of things like the popular "What every programmer should know about floating point numbers" article, brief perusals of the IEEE standard, etc.

  9. #54
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    1,733
    Quote Originally Posted by Sir Galahad View Post
    Well that's good news. You obviously have a much better grasp of the problem than I could ever hope to. And no I don't know of any actual list of corner cases, just a memory of things like the popular "What every programmer should know about floating point numbers" article, brief perusals of the IEEE standard, etc.
    Ahh ok, well anyways I'm now testing on hexadecimal as well, I've got it to work where either side of the decimal point is 0 but still need to work on the both sides set scenario, for now I'm thinking it's just the data I'm feeding in so I'm playing around with that for time being, either way it's sufficient to be able to read in decimal values for time being, hexadecimal will be a later task to worry about.

  10. #55
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    1,733
    Well partially resolved the incorrect mantissa on hexadecimal, added a new line in the following block:
    Code:
    	if ( num ) {
    		for ( ; !(num & 1u); ret.val.exp++, num >>= 1 );
    		for ( NUM = num; NUM > 1; ++pos, NUM >>= 1 );
    		if ( fpn && base > 10 ) while ( !(fpn & 1u) ) fpn >>= 1; // This line
    	}
    Which left me with these results:
    Code:
    gcc -Wall -o "test_fpn" "test_fpn.c" && "./test_fpn"
    pos = 1
    gcc = 010010100000000000000000
    mcc = 010101000000000000000000
    test.man = 00000004A0000 mine.man = 0000000540000 exp t = 082 m = 082
    fpn t = 1.262500e+01 m = 1.325000e+01 value '0xC.Ap+A'
    Index = 144
    pos = 1
    gcc = 010010100000000000000000
    mcc = 010101000000000000000000
    test.man = 00000004A0000 mine.man = 0000000540000 exp t = 082 m = 082
    fpn t = 1.262500e+01 m = 1.325000e+01 value '0xC.Ap-A'
    Index = 149
    pos = 0
    gcc = 000000000000000000000000
    mcc = 000000000000000000000000
    test.man = 0000000000000 mine.man = 0000000000000 exp t = 000 m = 000
    fpn t = 0.000000e+00 m = -0.000000e+00 value '-C.Ap+A'
    Index = 196
    pos = 0
    gcc = 000000000000000000000000
    mcc = 000000000000000000000000
    test.man = 0000000000000 mine.man = 0000000000000 exp t = 000 m = 000
    fpn t = 0.000000e+00 m = -0.000000e+00 value '-C.Ap-A'
    Index = 201
    Tried = 210, Count = 210, Wrong = 4
    Compilation finished successfully.
    So it seems I'm definitly providing the right data, I just gotta figure out the current edge case and I should be good to go on hexadecimal floats too, not so sure on binary but should be fine in theory

  11. #56
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    1,733
    I got it
    GDB online Debugger | Code, Compile, Run, Debug online C, C++
    Changed this:
    Code:
    if ( num ) {
        for ( ; !(num & 1u); ret.val.exp++, num >>= 1 );
        for ( NUM = num; NUM > 1; ++pos, NUM >>= 1 );
        if ( fpn && base > 10 ) while ( !(fpn & 1u) ) fpn >>= 1;
    }
    to this:
    Code:
    if ( num ) {
        for ( ; !(num % base); ret.val.exp++, num >>= 1 );
        for ( NUM = num; NUM > 1; ++pos, NUM >>= 1 );
    }
    Edit: Works fully for float type but double & probably long double each have a corner case,
    in the instances of #.#?+/-100 the mantissa and exponent are completely wrong
    Last edited by awsdert; 11-19-2019 at 05:11 PM.

  12. #57
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    1,733
    GDB online Debugger | Code, Compile, Run, Debug online C, C++
    So I've been transforming the function parameters and where it puts the exponent and mantissa so I can begin using the bignum math, I noticed some errors crop up along the way and in the course of fixing them I had created a function to check pretty much every number in a given range (tested as far as 100) for all 3 types of FPN, namely: N, 0.N & N.N and all those plus e+/-N (look for exponent() in the provided link if you still don't understand) and I found that on both FLT_* and DBL_* based information the raw comes out correctly which now leads me to believe the reason I was getting incorrect raw on doubles and not floats is actually because of how I read the text in fpn_read(). I've looked at the code but I don't see where the problem actually occurs, perhaps someone else can spot it by monday (gotta go to work and have a long day tomorrow so unlikely to get round to looking again until monday).

  13. #58
    misoturbutc Hodor's Avatar
    Join Date
    Nov 2013
    Posts
    1,791
    Quote Originally Posted by awsdert View Post
    GDB online Debugger | Code, Compile, Run, Debug online C, C++
    So I've been transforming the function parameters and where it puts the exponent and mantissa so I can begin using the bignum math, I noticed some errors crop up along the way and in the course of fixing them I had created a function to check pretty much every number in a given range (tested as far as 100) for all 3 types of FPN, namely: N, 0.N & N.N and all those plus e+/-N (look for exponent() in the provided link if you still don't understand) and I found that on both FLT_* and DBL_* based information the raw comes out correctly which now leads me to believe the reason I was getting incorrect raw on doubles and not floats is actually because of how I read the text in fpn_read(). I've looked at the code but I don't see where the problem actually occurs, perhaps someone else can spot it by monday (gotta go to work and have a long day tomorrow so unlikely to get round to looking again until monday).

    Do the results change depending on whether you use optimisation or not (or, indeed, gcc vs. clang)? Your use of union -- writing to a member but then reading from a different member, especially if the read member is a different type -- is not legal C so the compiler is free to do whatever it wants. The standard says that if you set member x that you must read member x of the union, and that reading member y after it was set using .x is not legal). The approach you're using is a common thing though so I'm not sure (apart from it not being "legal" C). I can't think of a way to avoid the illegal use of the union, apart from using C++ style casts which are out of the question (clearly, because this is C), so hmm... dunno. I think it's kind of safe even though it's technically illegal. Personally I wouldn't use bitfields because the compiler can arrange their order however it wants to. With shifts and masks and preprocessing you can deal with endianess (and even architectures that use mixed endian where integers and floating point endianess can have different endianess -- e.g. some ARM processors -- in a predictable and determinable way. There is no way to reliably predict and account for how bitfields are packed, as far as I know, so unsigned ints, shifts and masks (which can be reliably predicted and accounted for) is the way to go.

    Edit: Apart from the use of bitfields (which I would not use; I'd use shifts and masks) I'd use the union approach that you're using as well... gotta get the job done *somehow*. So, I'm not saying you're wrong doing that I'm just raising it as a point to be aware of and as a question
    Last edited by Hodor; 11-23-2019 at 07:06 AM.

  14. #59
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    1,733
    While I get your point in this scenario that's not an issue since using a known field to check against, also as you will note from the new code I'm no longer using the bitfields, they're just still there in case I decide I need to use them which is getting unlikely I'm merely struggling to find what is being passed with incorrect data or handled incorrectly to produce the few wrong results out of 210 fixed values

  15. #60
    misoturbutc Hodor's Avatar
    Join Date
    Nov 2013
    Posts
    1,791
    Quote Originally Posted by awsdert View Post
    While I get your point in this scenario that's not an issue since using a known field to check against, also as you will note from the new code I'm no longer using the bitfields, they're just still there in case I decide I need to use them which is getting unlikely I'm merely struggling to find what is being passed with incorrect data or handled incorrectly to produce the few wrong results out of 210 fixed values

    Hmm. Yeah, that's true. I think I was looking at an older version of the code somehow; sorry about that.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. C++ and Math
    By darren78 in forum C++ Programming
    Replies: 2
    Last Post: 07-08-2010, 09:19 AM
  2. hex math
    By kroiz in forum C Programming
    Replies: 25
    Last Post: 01-20-2009, 03:46 PM
  3. Basic Math Problem. Undefined Math Functions
    By gsoft in forum C Programming
    Replies: 1
    Last Post: 12-28-2004, 03:14 AM
  4. math.h
    By sweets in forum C++ Programming
    Replies: 2
    Last Post: 05-05-2003, 01:27 PM
  5. Math Help
    By CAP in forum C Programming
    Replies: 2
    Last Post: 08-19-2002, 12:03 AM

Tags for this Thread