Thread: Faster bitwise operator

  1. #1
    Unregistered User Yarin's Avatar
    Join Date
    Jul 2007
    Posts
    2,158

    Faster bitwise operator

    Is != or < faster?
    In the bignum class I made, != is faster. And from what I can best figure, it would be faster for a plain processor register comparison too.
    But I just wanted to confirm this with you guys - which is faster?

  2. #2
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,612
    I would wager that both are done in constant time, and whether one is faster than the other depends on processing power.

  3. #3
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    For simple types (e.g. integers and floats), they are identical. I actually fail to see how they would be majorly different in a bignum class or similar either - you compare from the most significant element, and whether the current element is different or greater/lesser makes no difference to the processor - it's still a compare operation (or a memory read in processors like the 68K, PDP-11, VAX, 6502, 6809) followed by a branch/jump instruction.

    So unless there is some major difference in the implementation in your bignum class [e.g. one reads single bytes and the other does larger chunks, 32- or 64-bit words], I don't understand why one would be better than the other.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  4. #4
    The larch
    Join Date
    May 2006
    Posts
    3,573
    Take a string, for example. If the lengths differ, then != is immediately true. On the other hand, even strings of unequal length might begin with a very long equal sequence before they differ, so < still has to scan it.
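    Just to illustrate what I mean - a rough sketch, not how any particular library actually implements it (function names made up):
    Code:
    #include <string>
    
    // != can bail out as soon as the lengths differ...
    bool not_equal(const std::string &a, const std::string &b)
    {
        if (a.size() != b.size()) return true;   // cheap early exit
        return a.compare(b) != 0;                // only scans when the lengths match
    }
    
    // ...while < has to walk the common prefix to decide the ordering.
    bool less_than(const std::string &a, const std::string &b)
    {
        return a.compare(b) < 0;
    }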
    I might be wrong.

    Thank you, anon. You sure know how to recognize different types of trees from quite a long way away.
    Quoted more than 1000 times (I hope).

  5. #5
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by anon View Post
    Take a string, for example. If the lengths differ, then != is immediately true. On the other hand, even strings of unequal length might begin with a very long equal sequence before they differ, so < still has to scan it.
    Sure. But I'd bet most string compares just compare the string contents, without comparing the lengths first! In many cases it's either equal lengths (and thus no benefit) or the strings differ near the beginning anyway.

    Certainly, MS VS 2008 implementation of basic_string::compare() doesn't compare lengths.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  6. #6
    Registered User
    Join Date
    May 2007
    Posts
    147
    Back to "Is != or < faster?" specifically, let me invoke a hint of history here.

    This is the same in C or C++; these are elementary operations in both languages.

    C was originally written to replace an assembler, essentially to be a chip independent assembler or just slightly higher than assembler.

    Many constructs in the elementary syntax of C (and therefore C++) translate into a single assembler instruction, when the chip supports that. Each of these comparisons (!=, <, >, >=, <=, etc.) happens to translate to an x86 compare followed by the corresponding conditional jump, and those all cost the same.

    As such, the speed of such comparisons, or of any of the elementary constructs like this, will relate closely to the chip on which you're executing. So, from a language viewpoint, it hardly makes sense to think of this in terms of "language speed": implementations differ, optimizations may significantly affect some of the basic constructs, and you generally shouldn't focus on microscopic performance issues too early. There are very rare exceptions to this general rule, and by the time you get to the point where it matters in your work, you'll have some sense of the assembler implications of your C/C++ code anyway, and the question will be moot.

  7. #7
    and the hat of sweating
    Join Date
    Aug 2007
    Location
    Toronto, ON
    Posts
    3,545
    Does anyone know which (x86) instructions those operators boil down to?
    All you'd need to do is see how many clock cycles each of those instructions takes and compare their speed, right?

    Although, without knowing too much about chip implementations, the only guess I can make as to why one might be faster is that != would just compare the binary representations of the two numbers, whereas < or > would also need to check the sign of a signed int before comparing the magnitudes (whether the CPU can check the sign and do the compare at the same time, I have no idea).
    "I am probably the laziest programmer on the planet, a fact with which anyone who has ever seen my code will agree." - esbo, 11/15/2008

    "the internet is a scary place to be thats why i dont use it much." - billet, 03/17/2010

  8. #8
    Registered User
    Join Date
    Apr 2006
    Posts
    2,149
    Quote Originally Posted by cpjust View Post
    Does anyone know which (x86) instructions those operators boil down to?
    All you'd need to do is see how many clock cycles each of those instructions takes and compare their speed, right?
    For integers, it should boil down to CMP R2,R1 (or however the registers are named) followed by a conditional jump - JNE/JE for != and JL/JGE for <. Those jumps just check the flags set by the compare, and are therefore trivial operations. For floating point it would use FCOM instead of CMP.

    So for primitives there is no difference.
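    For instance, something like this (just a sketch; the function names are made up and the exact instructions depend on the compiler and on signedness):
    Code:
    bool not_equal(int a, int b) { return a != b; }  // typically: cmp, then setne / jne
    bool less_than(int a, int b) { return a < b; }   // typically: cmp, then setl / jl
    // Same cmp either way; only the condition checked afterwards differs,
    // and those conditional instructions all cost the same.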
    Last edited by King Mir; 04-27-2009 at 09:11 PM.
    It is too clear and so it is hard to see.
    A dunce once searched for fire with a lighted lantern.
    Had he known what fire was,
    He could have cooked his rice much sooner.

  9. #9
    Registered User
    Join Date
    May 2007
    Posts
    147
    cpjust

    x86 does as King Mir states - cmp is the generic instruction for performing a comparison, and it is very much like a subtraction. In x86, subtractions and cmp set flags in the CPU, indicating things like sign, zero and overflow - various technical results. After that, a conditional jump instruction is what makes the decision as to the truth or falsehood of a particular comparison like <, !=, etc. All of these have the same cycle count - so the answer you've received is the best you can get: they are basically the same. There is no point in asking which is faster.

    What makes this a little more complicated is the various CPUs' ability to predict branches, cache reaction to branches, hyperthreading, speed of RAM, alignment in RAM, whether the cmp is applied to a register or a memory location (I forget if/when that was or was not possible in x86)....

    In other words, you're focused on such a microscopic level of performance that you probably can't get the results you're hoping for without deciding to code in assembler - and even that isn't likely to be all that productive in this case.

    Each model CPU and combination of hardware could make some difference.
    Last edited by JVene; 04-27-2009 at 09:56 PM.

  10. #10
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    For the processor (all the common ones at least) they are surely absolutely equal.
    For a bignum class the < would be slightly slower because it involves two comparisons in the loop (excluding the loop condition) rather than just one:
    Code:
    // operator< needs two comparisons per element (most significant element first):
    for (size_t i = n; i-- > 0; )   // n = number of elements
    {
        if (a.data[i] < b.data[i]) return true;
        if (a.data[i] > b.data[i]) return false;
        // else equal so far - continue iterating.
    }
    return false;   // all elements equal, so !(a < b)
    
    // vs operator== (negate the result for !=) - one comparison per element:
    for (size_t i = n; i-- > 0; )
    {
        if (a.data[i] != b.data[i]) return false;
    }
    return true;    // all elements equal
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  11. #11
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    iMalc - good point, but we could do that in a different way:
    Code:
       int cmp;
       for(...)
       {
          if ((cmp = a.data[i] - b.data[i]) != 0)
          {
             return (cmp > 0)?true:false; 
          }
       }
    This assumes that the common case is similar numbers. But I don't think it's much worse for the case of immediately different numbers - and that is the fast case in either variant anyway.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  12. #12
    pwns nooblars
    Join Date
    Oct 2005
    Location
    Portland, Or
    Posts
    1,094
    I think the question that should actually be asked is: have you done any testing and determined that this is where the bottleneck in your code is? Honestly, the difference between the two isn't going to show itself very easily in most cases. And even if one is faster than the other, I'm pretty sure you could optimize another part of your code and get a bigger net gain than you would from this comparison check.

    So go ahead and profile your code and see where you are hitting some hiccups. Also you can easily (via polymorphism) replace the function that does the comparison dynamically and do some profiling yourself to determine which is ultimately faster.
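    If you want to measure it yourself, something along these lines would do as a starting point (a crude sketch with made-up names, not a rigorous benchmark):
    Code:
    #include <ctime>
    #include <cstddef>
    #include <vector>
    
    typedef bool (*cmp_fn)(int, int);   // swap the comparison in and out at runtime
    
    bool ne(int x, int y) { return x != y; }
    bool lt(int x, int y) { return x <  y; }
    
    // Crude timing of one comparison strategy over some test data.
    double time_it(const std::vector<int>& a, const std::vector<int>& b, cmp_fn cmp)
    {
        std::clock_t start = std::clock();
        volatile bool sink = false;     // stop the compiler optimizing the loop away
        for (std::size_t i = 0; i < a.size(); ++i)
            sink = cmp(a[i], b[i]);
        (void)sink;
        return double(std::clock() - start) / CLOCKS_PER_SEC;
    }
    
    // e.g. compare time_it(xs, ys, ne) against time_it(xs, ys, lt)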

  13. #13
    Registered User
    Join Date
    May 2007
    Posts
    147
    I risk promoting unwanted pedantic continuation, but....

    For some reason I kept thinking about this question as I tried to sleep.

    To further illustrate why this particular subject of optimization does not make sense to pursue:

    The speed of the comparison doesn't lie simply in the clock tick count of cmp and a conditional jump (the x86 assembler resulting from an integer comparison of any kind). The actual result - that is, whether the jump is taken or not, which is also to say whether the test is true or not - affects the speed too.

    If a jump is taken, we can't easily account for the cost of what happens to the instruction cache, or the delay in RAM for fetching the jump-destination instruction vs the 'next' instruction should the jump NOT be taken. This means the data fed into a loop that uses such a test could affect the performance.

    If you changed the comparison, it might affect how often the branch predictor is correct, and - here's where it really suggests the futility of the idea - the previous instructions in the CPU's pipeline might affect the speed of the transition between the cmp and the conditional jump.

    Say there were a multiplication prior to the cmp, and assume that the results of the multiplication are not involved with the operands of the cmp (the instructions are not interdependent).

    Depending on alignment relative to the availability of execution units at the moment the cmp is to be scheduled, the CPU might be performing the previous multiplication instruction at the same time - parallel execution like this is part of modern CPU operation.

    However, the pending conditional jump may not execute until the multiplication is finished, which takes more clock cycles than a cmp. In essence, the cmp may be held up by the multiplication which preceded it in the logical stream of instructions.

    In other words, the surrounding code would have an impact on the performance of a particularly fast instruction like cmp, relative to a question like this, and indeed points us back to the observation that we're dealing with assembler-level optimizations that must be CPU-aware (since different CPUs have different characteristics: execution units, branch prediction robustness, cache sizes, and so on...).

    As Wraithan and others point out, this can't be the location of a bottleneck, and optimizing code over instructions that amount to 1 or 2 clock ticks is a focus in the wrong direction.

    For any improvement one might hope to gain, you'd get much more significant results with an extra 100 MHz, a larger cache or a smarter branch predictor than from any microscopic comparison at this level.

    All of this suggests that any answer beyond the fact that the instructions have the same timing has no meaning. That is, if someone conducts a test, and as a result of comparing != to < and the others somehow sees data suggesting != is faster or slower, it can't be taken as a generalized result that applies to all uses of !=. In a different context, != might be slower, and anyone assuming the previous speed test was valid would have been misled.

  14. #14
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    JVene,

    True, the surrounding code, branch prediction, cache content, predictive fetching of cache, and many other things DO indeed affect the execution time.

    Most of the time when discussing optimization at instruction level, we assume that the code is in cache - otherwise, making sure the code is in cache (e.g. making the code smaller) will have a better impact than changing the instructions themselves. Note also that a single branch that goes the other way (to a not-yet-executed path, so therefore not in cache) is very rarely the target of performance optimization, since it only happens once. If two branches are about equally likely, then both would most likely be in cache if the code is executed enough to warrant looking at in the first place.

    Another point I'd like to rectify. If we have something like:
    Code:
    a = b * c;
    if (d > 7)
        ...
    then the processor WILL NOT wait for the multiplication to finish before performing the jump for the d > 7 comparison. I'm not saying there are NO processors that do so, but certainly none of the common modern CPUs (e.g. x86, SPARC, etc.).

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  15. #15
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by matsp View Post
    iMalc - good point, but we could do that in a different way:
    Code:
       int cmp;
       for(...)
       {
          if ((cmp = a.data[i] - b.data[i]) != 0)
          {
             return (cmp > 0)?true:false; 
          }
       }
    This assumes that the common case is similar numbers. But I don't think it's much worse for the case with immediately different numbers - and that is the fast case in either variant, anyways.

    --
    Mats
    That at first seemed like a great idea, and it would work for people who have implemented something where each decimal digit is stored in an int, or something wasteful like that, but it won't work for a "wasteless" implementation where the bigint is an array of unsigned longs. The reason is that the result of the subtraction would have to range from 0xFFFFFFFFU - 0 to 0 - 0xFFFFFFFFU. So that's -0xFFFFFFFF to +0xFFFFFFFF, requiring an extra bit, as well as casting to a signed datatype. Of course it will work if you cast both to a larger datatype first; however, I imagine the efficiency will then come out about the same as the original way I posted. Well, maybe still a little ahead if you're lucky.
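    Something along these lines, purely as a sketch (names made up; assumes the limbs are 32-bit unsigned longs and that a 64-bit long long is available):
    Code:
    // Cast both limbs up before subtracting, so the difference
    // (-0xFFFFFFFF .. +0xFFFFFFFF) fits without losing the sign.
    long long cmp_limb(unsigned long x, unsigned long y)
    {
        return static_cast<long long>(x) - static_cast<long long>(y);
    }
    
    // In the loop, something like:
    //     long long d = cmp_limb(a.data[i], b.data[i]);
    //     if (d != 0) return d > 0;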
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"
