# floating point number comparison

• 05-18-2009
lehe
Hi,
I've heard that you shouldn't directly compare two floating point numbers, since their representations in the computer are inexact. But I still have some questions.

1. Does this apply only to the results of arithmetic computation? If a floating point number gets its value not from a computation but from a direct assignment or from a command line argument, is it the same case? e.g.
Code:

```
double v = 0; // or other real values
if (v > 0) {
    ...
} else if (v == 0) {
} else if (v < 0) {
}
```
2. I also searched the internet on this topic. Everything I found seems to be about checking whether two floating point numbers are equal within some tolerance. To check whether one floating point number is bigger than another, do I first need to check whether they are equal within the tolerance, and only then ("else if") check whether one is bigger than the other? Or can I do this without comparing for equality first?

3. Can anyone recommend a function or STL function for comparing two floating point numbers (one that can deal with both float and double types)?

Thanks and regards!
• 05-18-2009
matsp
1. ALL floating point values are subject to approximation. Many values that look precise to us turn into an infinite series of digits when stored in binary, just as 1/3 is 0.3333333333... in decimal, whereas 1/3 in base 3 is 0.1 - NOT an infinite series of digits. In binary, 0.1 cannot be described in a finite number of digits, so what gets stored is the nearest representable value (for a double, roughly 0.1000000000000000055). Depending on EXACTLY what the compiler does when producing the code to compare numbers, it may match constant values that come from somewhere else, etc. But it's also entirely possible that:
Code:

```
double d = 0.1;
if (d == 0.1) {
    printf("True\n");
}
```
actually doesn't print "True" - because of some of the intermediate operations done between the loading of the constant 0.1 and the value held in d.

2. Yes, you obviously start by taking the difference of the two numbers. Once you have a difference, you can do:
a) take the absolute value (using fabs()) to see if it's "close enough" [however that is done].
b) see if it's above or below zero. That will tell you which of the two numbers is greater.
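
A minimal sketch of steps a) and b), assuming the caller picks the tolerance. The name `compare_with_tolerance` and its -1/0/+1 return convention are made up for illustration; this is not a standard library function, and NaN inputs are not handled.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not from the standard library).
// Returns 0 when a and b are within tol of each other,
// +1 when a is clearly greater, -1 when a is clearly smaller.
template <typename T>
int compare_with_tolerance(T a, T b, T tol)
{
    assert(tol >= T(0));           // a negative tolerance makes no sense
    const T diff = a - b;          // start from the difference
    if (std::fabs(diff) <= tol)    // a) "close enough" counts as equal
        return 0;
    return diff > T(0) ? 1 : -1;   // b) the sign says which one is greater
}
```

Because the template takes a single type `T`, the same function works for float and double, as long as all three arguments have the same type.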

--
Mats
• 05-18-2009
Aisthesis
I'm pretty new to this, but I would think that the program itself would give some kind of tolerance range, determined by where it rounds floats. If you want to make it more precise and determine your own tolerance for equality, then I'd use doubles or long doubles.

If you're worried about this, I'd experiment around with how much floats get rounded in calculations and take it from there.

I think there's also a tendency to round large numbers much more than small ones. Just guessing (without running this), I think the following holds:

Code:

```
float x;
x = 1000000.001;
cout << x;
```
I think the value printed here will just be 1000000 (cout's default formatting shows it as 1e+06).

However,
Code:

```
float x;
x = .004;
cout << x;
```
I think float will hold x as .004 (and hence output .004) in this case.
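
A rough way to check both guesses without relying on how cout formats the number: push a double through a float and see what comes back. `through_float` is a made-up helper name, and the sketch assumes the usual 32-bit IEEE 754 float.

```cpp
// Hypothetical helper: the value a double ends up with after being
// stored in a float and widened back to double.
double through_float(double v)
{
    return static_cast<double>(static_cast<float>(v));
}
```

Under that assumption, `through_float(1000000.001)` comes back as exactly 1000000.0, because neighbouring floats near one million are 0.0625 apart. `through_float(0.004)` comes back very close to 0.004 (off by roughly 2e-10) but not bit-for-bit equal to the double 0.004.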

Also, for equality, you really don't care normally on a float whether we're talking about 1,000,000.1 or 1,000,000 with a difference of .1. But you typically do care about the difference between .01 and .09, which is of course smaller.

If it were me, I would just let the default rounding of the computer decide on equality for floats unless there's some specific context where it really makes a difference.

In that case, I'd use doubles and then take the trouble to make exact definitions of when 2 values are equal (i.e., put an upper bound on absolute value of difference).

Back to your original question of how to compare: You can definitely do this without a bunch of if/else if statements by using std::abs() from <cmath>:

If x and y are two floats and Tol is a value of the same type (a constant, or the result of some more complex function) that says what our tolerance is, then all you have to do is use:
std::abs(x - y) < Tol instead of x == y.
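
Wrapped up as a function, that test might look like the sketch below. The name `approx_equal` is invented for illustration. Qualifying the call as `std::abs` (with <cmath> included) is a deliberate choice here, so that the floating point overloads are picked rather than an integer `abs`.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: true when x and y differ by less than tol.
// tol is an absolute tolerance chosen by the caller.
template <typename T>
bool approx_equal(T x, T y, T tol)
{
    assert(tol > T(0));              // with tol == 0 nothing would ever match
    return std::abs(x - y) < tol;    // replaces x == y
}
```

For example, `approx_equal(0.1 + 0.2, 0.3, 1e-12)` is true even though `0.1 + 0.2 == 0.3` is false for doubles.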
• 05-18-2009
matsp
Aisthesis: There are two issues you bring up [I'm not sure if you intended to do that...]

So, the first is what's called an epsilon value - which is the smallest value that can be added/subtracted and still changes the actual value. This value depends on the size of the actual value - for float it is about 2^-23 * 2^floor(log2(x)), where x is the actual number - so put another way, it's about 2^23 times SMALLER than the actual number. For double, the value is about 2^52 times smaller.
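
As a sketch of how that scaling could be computed: `std::numeric_limits<T>::epsilon()` is the spacing at 1.0, and shifting it by the binary exponent of x gives the spacing near x. The helper name `spacing_near` is made up, and the sketch ignores zero, infinities, NaN and subnormal values.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: approximate gap between x and the next
// representable value of larger magnitude.
// epsilon() is the gap at 1.0; std::ilogb(x) is floor(log2(|x|)).
template <typename T>
T spacing_near(T x)
{
    assert(x != T(0) && std::isfinite(x));
    return std::ldexp(std::numeric_limits<T>::epsilon(), std::ilogb(x));
}
```

With 32-bit floats, `spacing_near(1000000.0f)` is 0.0625, which matches what `std::nextafter` reports for the distance to the next float above one million.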

The second, somewhat related, but not necessarily directly related is the magnitude of difference that is considered "the same", and yes, for small numbers, we want only a small difference.

One solution that sometimes works, if we want to compare a and b, is to divide a/b and subtract 1.0. If the value is 0 [or its distance from zero is less than, say, 0.01], then they were equal. If it's greater than 0, b is smaller than a. If less than zero, b is greater than a [both of these assume a and b are positive]. This doesn't work very well if you get values of (near) zero for a or b. Note that dividing by a TINY number is valid, but 1/0.00001 is 100000, so you can quickly end up with HUGE values.
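
A small sketch of the ratio idea, restricted to positive inputs so that the sign of a/b - 1 can be read directly as "a is bigger" or "a is smaller". The function name and the -1/0/+1 convention are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: compares a and b through the ratio a / b.
// rel_tol is a relative tolerance, e.g. 0.01 for "within about 1%".
// Only meaningful for positive a and b; near zero the ratio blows up.
int compare_by_ratio(double a, double b, double rel_tol)
{
    assert(a > 0.0 && b > 0.0);
    const double r = a / b - 1.0;
    if (std::fabs(r) <= rel_tol)
        return 0;                  // equal within the relative tolerance
    return r > 0.0 ? 1 : -1;       // +1: a is greater, -1: b is greater
}
```

Unlike a fixed absolute tolerance, the same `rel_tol` behaves sensibly for values around 1e-9 and around 1e9, which is the main attraction of this approach.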

I don't believe there is ONE solution that will always work. It has to be adjusted to the range and type of numbers you can expect to see.

--
Mats
• 05-18-2009
Aisthesis
Nicely stated, Mats. After what you've said, I agree fully that tolerance needs to have a flexible solution depending on context: In some contexts, the difference between .02 and .04 is huge, in others it's negligible.

Also very informative on the rounding for floats and doubles.

Do you think the same thing from your code example (with doubles) can happen with floats? I guess it probably has to if it does so for doubles. And that really means that we should indeed define equality ourselves using a tolerance range every time.
• 05-18-2009
matsp
The ONLY difference between float and double is the size of the numbers that are supported (number of digits the number consists of, and the range it can hold). The float type commonly uses 32 bits for the number, with 1 bit for sign, 8 bits of exponent and 23+1 bits of mantissa. Double uses 64 bits: 1 bit of sign, 11 bits of exponent, and 52+1 bits of mantissa. That gives an 8x larger exponent range, and a bit more than twice the number of digits. Besides that, there is no functional difference - all problems you get with double exist in float as well. It actually gets even more interesting when you MIX them:
Say you have this:
Code:

```
float f = 1.4f;
double d = 1.4;
if (d == f) {
    ...
}
```
You may think that this would work perfectly, right? But what happens with a floating point number when we extend it to double [which the compiler must do to compare the two!]? It is padded with zeros to fill out the number. Since 1.4, like 0.1, has no finite binary representation, it cannot be stored precisely. So f's value is 1.399997 or some such, and d's value is 1.39999999999998. We then extend f to double precision: 1.3999970000000 - not the same as d, is it?

[Note the last digit, I just made up. It could be just about anything, but probably greater than 5].
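
The effect can be checked with a couple of one-line helpers. This is only a sketch: both function names are made up, and it assumes the common IEEE 754 float and double formats, where 1.4f is about 1.3999999762 and the double 1.4 is about 1.3999999999999999.

```cpp
// Hypothetical helper: what "d == f" effectively does. The float is
// widened to double, which keeps its value but adds no new precision.
bool equal_after_widening(float f, double d)
{
    return static_cast<double>(f) == d;
}

// Hypothetical alternative: round the double down to float precision
// first, so both sides have been rounded the same way.
bool equal_at_float_precision(float f, double d)
{
    return f == static_cast<float>(d);
}
```

Under those assumptions, `equal_after_widening(1.4f, 1.4)` is false while `equal_at_float_precision(1.4f, 1.4)` is true. A value such as 1.5, which is exactly representable in binary, compares equal either way.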

--
Mats
• 05-18-2009
brewbuck
Quote:

Originally Posted by lehe
Code:

```
double v = 0; // or other real values
if (v > 0) {
    ...
} else if (v == 0) {
} else if (v < 0) {
}
```

This specific example is fine. Floats might not be infinitely precise, but that doesn't change the fact that a number is either negative, positive, or zero. Not being able to test in a simple way whether a number is negative or positive would be ludicrous.

Also, the number zero is represented perfectly. Imagine if it was not. Multiplying something by zero could result in a non-zero result. Again, that would be ludicrous.

Using floating point types requires thinking, not paranoia.
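
To make that concrete, here is a tiny sketch of a sign test that needs no tolerance at all. The helper name `sign_of` is invented; note that it returns 0 for both +0.0 and -0.0 (they compare equal), and also for NaN, which is neither above nor below zero.

```cpp
// Hypothetical helper: -1 for negative, 0 for zero, +1 for positive.
// Exact comparisons against zero are fine, because zero is
// represented exactly.
template <typename T>
int sign_of(T v)
{
    return (v > T(0)) - (v < T(0));
}
```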
• 05-18-2009
matsp
Quote:

Originally Posted by brewbuck
This specific example is fine. Floats might not be infinitely precise, but that doesn't change the fact that a number is either negative, positive, or zero. Not being able to test in a simple way whether a number is negative or positive would be ludicrous.

Also, the number zero is represented perfectly. Imagine if it was not. Multiplying something by zero could result in a non-zero result. Again, that would be ludicrous.

Using floating point types requires thinking, not paranoia.

A very good point!

--
Mats
• 05-18-2009
nonoob
You may find the following useful:
Re: Whose fault is my problem? <-- link

It shows the algorithm that is used by the language APL to do its floating point comparisons. The algorithm goes something like this:

Code:

```
fuzz = 1.0e-10;
m = max(abs(X), abs(Y));    /* larger magnitude, so negative values work too */
if (m == 0 || abs(X - Y) / m < fuzz) {
    /* they're equal */
} else {
    /* they're not equal */
}
```
... although the real way it's implemented is to mask off the lowest so many bits of precision of the mantissa at the machine language level.
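
The bit-masking idea can be illustrated roughly as follows. This is NOT APL's actual implementation, which isn't shown here; it is only a sketch under the assumption of a 64-bit IEEE 754 double, with made-up function names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: the bit pattern of x with its lowest `bits`
// mantissa bits cleared.
std::uint64_t masked_pattern(double x, unsigned bits)
{
    static_assert(sizeof(std::uint64_t) == sizeof(double),
                  "sketch assumes a 64-bit double");
    assert(bits < 52);                       // stay inside the mantissa
    std::uint64_t u;
    std::memcpy(&u, &x, sizeof u);           // well-defined way to read the bits
    return u & ~((std::uint64_t(1) << bits) - 1);
}

// Two doubles are treated as equal when they agree in everything
// except the lowest `bits` bits.
bool equal_ignoring_low_bits(double x, double y, unsigned bits)
{
    return masked_pattern(x, bits) == masked_pattern(y, bits);
}
```

A known weakness of plain masking is that two neighbouring values can straddle a mask boundary and compare unequal even though they are only one step apart, and that +0.0 and -0.0 have different bit patterns. A production version would need to deal with both.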
• 05-18-2009
VirtualAce
It is common practice to use an epsilon as has been stated. It is not guaranteed to work as matsp has clearly pointed out. If it does not work then the advantage is you can lower or raise the epsilon which again may not work in all cases.

In general (unless comparing with 0.0f) I avoid using == on any floats or doubles. If your code depends on that then you may need to re-think it a bit to get around the comparison. I rarely find a reason why I would ever 'have' to precisely compare one float to another. Usually >= or <= is enough or some combination thereof.
• 05-18-2009
phantomotap
I've always liked this little example program for what it does. (Well, actually, I like its longer, interactive cousin.) I shouldn't look at it too long, but it is illustrative of what has been explained. (Where it will compile and run--almost "everywhere" considering the "wintel" box.)

Soma

Code:

```
#include <cmath>
#include <iomanip>
#include <iostream>
#include <limits>
#include <sstream>
#include <string>

typedef unsigned long ui32;      // whatever 32 bit type
typedef unsigned long long ui64; // whatever 64 bit type

template <typename T>
struct get_bad_alias
{
};

template <>
struct get_bad_alias<float>
{
        typedef ui32 type;
};

template <>
struct get_bad_alias<double>
{
        typedef ui64 type;
};

template <typename T>
bool compare_test(T a, T b, T r, T e)
{
        using std::abs;
        if(abs(a - b) < e)
        {
                return(true);
        }
        T re(0);
        if(abs(b) > abs(a))
        {
                re = abs((a - b) / b);
        }
        else
        {
                re = abs((a - b) / a);
        }
        if(re <= r)
        {
                return(true);
        }
        return(false);
}

template <typename T>
std::string fppair(const T & a, const T & b)
{
        std::ostringstream out;
        out.setf(std::ios::fixed, std::ios::floatfield);
        out.precision(16);
        out << '(' << a << ',' << ' ' << b << ')';
        return(out.str());
}

template <typename T>
void dump(T a, T b, T r, T e)
{
        using std::cout;
        T diff(std::numeric_limits<T>::epsilon());
        cout << fppair(a, b) << ' ' << compare_test(a + diff, b, r, e) << '\n';
        cout << fppair(a, b) << ' ' << compare_test(a, b, r, e) << '\n';
        cout << fppair(a, b) << ' ' << compare_test(a - diff, b, r, e) << '\n';
        cout << '\n';
}

template <typename T>
void generate_on_a(T a, T b, T r, T e)
{
        typedef typename get_bad_alias<T>::type alias;
        alias & aliasa(*reinterpret_cast<alias *>(&a));
        aliasa -= 2;
        dump(a, b, r, e);
        ++aliasa;
        dump(a, b, r, e);
        ++aliasa;
        dump(a, b, r, e);
        ++aliasa;
        dump(a, b, r, e);
        ++aliasa;
        dump(a, b, r, e);
}

template <typename T>
void generate_on_b(T a, T b, T r, T e)
{
        typedef typename get_bad_alias<T>::type alias;
        alias & aliasb(*reinterpret_cast<alias *>(&b));
        aliasb -= 2;
        dump(a, b, r, e);
        ++aliasb;
        dump(a, b, r, e);
        ++aliasb;
        dump(a, b, r, e);
        ++aliasb;
        dump(a, b, r, e);
        ++aliasb;
        dump(a, b, r, e);
}

template <typename T>
void generate(T a, T b, T r, T e)
{
        generate_on_a(a, b, r, e);
        std::cout << "\n\n";
        generate_on_b(a, b, r, e);
        std::cout << "\n\n\n\n";
}

template <typename T>
void epsilon_test(T a, T b)
{
        generate(a, b, std::numeric_limits<T>::epsilon(), std::numeric_limits<T>::epsilon());
}

int main()
{
        std::cout << std::boolalpha;
        epsilon_test(+0.250f, +0.250f);
        epsilon_test(+0.250f, -0.250f);
        epsilon_test(-0.250f, +0.250f);
        epsilon_test(-0.250f, -0.250f);
        epsilon_test(+0.250, +0.250);
        epsilon_test(+0.250, -0.250);
        epsilon_test(-0.250, +0.250);
        epsilon_test(-0.250, -0.250);
        return(0);
}
```