Odd assembly problem

**abachler** · 11-07-2007

Except that the results seem to indicate that the MS-generated code with the inc in the loop runs slower, even though it 'should' run faster. I'm just testing right now; I will probably write the final code in SSE, since speed and accuracy are both important.

**dwks** · 11-07-2007

You do have compiler optimisations enabled, right? With GCC, there is a huge difference even between "gcc" and "gcc -O1".

**abachler** · 11-11-2007

Yes, and while the GetTickCount() code is non-trivial, it is consistent, so the results would still be valid for a matrix solution to find the relative speeds of the routines. Now if I could only remember how you do those matrix solutions, heh...

**abachler** · 11-11-2007

I retouched the assembly to perform an optimized loop

Code:

#include <windows.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char* argv[]){
    double Input = 1.000;
    __declspec(align(16))  double SIMD_Input[] = {1.0 , 1.0};
    DWORD Start, Stop;   // declared here so the snippet compiles on its own
    DWORD temp;

    // C/C++ version
    Input = 1.0;
    Start = GetTickCount();
    Stop = Start + 100000;
    temp = 0;
    while(temp < 16777216){
        Input = sin(atan(Input));
        temp++;
        }
    printf("%lu\t%f\n" , (GetTickCount() - Start) , Input);

    // inline assembly version
    Input = 1.0;
    Start = GetTickCount();
    Stop = Start + 100000;

        __asm push  ecx
        __asm mov   ecx , 0x01000000
begin:  __asm fld   Input
        __asm fld1
        __asm fpatan
        __asm fsin
        __asm fstp  Input
        __asm loop  begin
        __asm pop   ecx

    printf("%lu\t%f\n" , (GetTickCount() - Start) , Input);

    return 0;
    }

the new results are:

4015 ms for the C/C++
3094 ms for the assembly

Anyone care to suggest an optimized loop in C/C++, if you think that's the problem?

**swoopy** · 11-12-2007

To be honest, the C++ loop and asm loop don't even look the same. The C++ loop uses some magic number (16777216) for the number of iterations. I can't even tell what the asm loop uses for the number of iterations.

**CornedBee** · 11-12-2007

The magic number is the same as the hex number in the ASM loop, so that's fine.

Try as I might, I can't get GCC to generate a loop instruction for the C++ version, nor get it to use the math intrinsics. Quite frustrating.

**matsp** · 11-12-2007

Originally Posted by CornedBee

The magic number is the same as the hex number in the ASM loop, so that's fine.

Try as I might, I can't get GCC to generate a loop instruction for the C++ version, nor get it to use the math intrinsics. Quite frustrating.

Try adding "-ffast-math" to your compile command-line.

@abachler:
Using the same decimal constant in both loops would reduce the number of "queries" along the lines of "the code doesn't look like it's doing the same thing".

Also, you don't need to push and pop ecx - the compiler will understand register usage in __asm statements.

I prefer to use

Code:

__asm { 
   ... multiple 
   ... lines 
   ... of
   ... code
}

If you want to improve it a bit by "cheating", don't do the fstp/fld inside the loop - just leave the result of the previous iteration in st(0).

Your variable "Stop" is never used, right?

Have you checked the assembler output from the while-loop in C? Have you tried using a

Code:

for(temp=0; temp < 16777216; temp++)

instead? Usually, the compiler produces better code for for-loops than while-loops when it comes to constant limit expressions - it may of course be that it makes no difference here.

--
Mats

**CornedBee** · 11-12-2007

-ffast-math indeed uses the intrinsics, unless I use -mfpmath=sse. Over a whole program, the advantage of SSE over x87 may well offset this loss.

Another thing to note: According to some information I've found, software sine implementations using SSE math could well be faster than the CPU x87 sine instruction. Especially if you do it on multiple data, and especially if you then go on to do other FP operations on the data. (x87<->sse moves are costly).

Also, 64-bit CPUs might show generally better results, because the x86-64 ABI specifies that floating point parameters are passed in SSE registers, while in the x86-32 ABI all floating point parameters must be passed on the stack. This leads to a severe slowdown when calling external trig implementations. I can actually confirm that from comparing the code generated for my Athlon64 and the Core1 I'm writing this on.

Tricky business, optimizing.

**matsp** · 11-12-2007

Seeing as SSE doesn't HAVE a sin/atan function, it's quite difficult to make it "use intrinsics" - it needs to copy the SSE register to x87 register and then use fsin to perform the actual fpu operation and then reverse that register to register move. It's slow and not very neat - and you don't really care if it's inlined or not - the overhead of a function call is in the fractions of a percent compared to the 150+ cycles that this function will take.

--
Mats

**abachler** · 11-12-2007

Originally Posted by swoopy

To be honest, the C++ loop and asm loop don't even look the same. The C++ loop uses some magic number (16777216) for the number of iterations. I can't even tell what the asm loop uses for the number of iterations.

that would be

Code:

        __asm mov   ecx , 0x01000000
 
and
 
        __asm loop begin

0x01000000 == 16777216

Originally Posted by CornedBee

The magic number is the same as the hex number in the ASM loop, so that's fine.

Try as I might, I can't get GCC to generate a loop instruction for the C++ version, nor get it to use the math intrinsics. Quite frustrating.

I'm using VC 6.0 Enterprise with service pack 5 and the processor pack installed.

You might have to create a custom EMIT for the opcode:
E2 cb    LOOP rel8    Decrement count; jump short if count ≠ 0.

As for SSE not having trig functions, it's still so much faster that you can derive the functions from simpler methods faster than you could get them from FPU instructions.

**swoopy** · 11-12-2007

Originally Posted by abachler

that would be

Code:

        __asm mov   ecx , 0x01000000
 
and
 
        __asm loop begin

0x01000000 == 16777216

Thanks. After I read CornedBee's post it made sense.

**iMalc** · 11-13-2007

Have you looked into higher-level optimisations first?
For example the trigonometric identity:

Code:

sin(arctan(x)) = x / sqrt(1 + x*x)

According to: http://en.wikipedia.org/wiki/List_of...ric_identities

Of course that may well be slower, but it depends on how it fits with the rest of the surrounding code. E.g. if you're by chance calculating that exact denominator for some other reason anyway, then it would have to be faster. There's also Carmack's fast rsqrtf function to speed up such things.
There's more than one way to skin a cat!

**CornedBee** · 11-13-2007

Or perhaps you take the square of that thing anyway - then you can omit the sqrt completely.
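If the squared value really is what the surrounding code needs, the whole thing reduces to one multiply, one add and one divide. A minimal sketch, with `sin_atan_squared` being a made-up name for this example:

```c
#include <assert.h>
#include <math.h>

/* Hypothetical name: sin(atan(x))^2 = x*x / (1 + x*x), no sqrt or trig call. */
double sin_atan_squared(double x)
{
    double xx;

    assert(isfinite(x));   /* huge finite x can still overflow x*x */
    xx = x * x;
    return xx / (1.0 + xx);
}
```

For x = 3 this gives 9 / 10 = 0.9, which matches squaring sin(atan(3)) to within rounding error.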

**Salem** · 11-13-2007

Now that you've spent a week on this micro-optimisation, how has this affected the overall result?

For a start, have you run a profile of the program to find out where the real hot-spots are, and not just where you've guessed them to be?

Or as others have started to hint, seek out alternative algorithms which do the same job in a more efficient manner.

**matsp** · 11-13-2007

Originally Posted by iMalc

Have you looked into higher-level optimisations first?
For example the trigonometric identity:

Code:

sin(arctan(x)) = x / sqrt(1 + x*x)

According to: http://en.wikipedia.org/wiki/List_of...ric_identities

Of course that may well be slower, but it depends on how it fits with the rest of the surrounding code. E.g. if you're by chance calculating that exact denominator for some other reason anyway, then it would have to be faster. There's also Carmack's fast rsqrtf function to speed up such things.
There's more than one way to skin a cat!

That should certainly execute faster than the 240 clocks that the original code takes.

Code:

       fld   input      ;; 4
       fmul  input      ;; 6    x * x
       fld1             ;; 4
       fadd             ;; 4    x * x + 1
       fsqrt            ;; 35
       fdivr input      ;; 22
       fstp  input      ;; 2
                        ;; ==
                        ;; 77 clocks - so approximately 3 times faster than the original code.
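Since the final code is meant to end up in SSE anyway, the same x / sqrt(1 + x*x) sequence can be sketched with SSE2 intrinsics working on two doubles at once. This assumes a compiler that ships `<emmintrin.h>`; the function name `sin_atan_pair` and the `HAVE_SSE2` macro are made up for this example, and there is a scalar fallback so the snippet still builds on targets without SSE2. No timings are claimed for it.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1      /* hypothetical macro, local to this example */
#endif

/* Hypothetical helper: replaces each of two doubles x with x / sqrt(1 + x*x). */
void sin_atan_pair(double values[2])
{
    assert(values != NULL);
#ifdef HAVE_SSE2
    {
        __m128d x   = _mm_loadu_pd(values);   /* unaligned load, no alignment assumed */
        __m128d one = _mm_set1_pd(1.0);
        __m128d den = _mm_sqrt_pd(_mm_add_pd(one, _mm_mul_pd(x, x)));
        _mm_storeu_pd(values, _mm_div_pd(x, den));
    }
#else
    /* Scalar fallback for targets without SSE2. */
    values[0] = values[0] / sqrt(1.0 + values[0] * values[0]);
    values[1] = values[1] / sqrt(1.0 + values[1] * values[1]);
#endif
}
```

With a 16-byte aligned array such as the SIMD_Input from the original test, the unaligned load and store could be swapped for `_mm_load_pd` and `_mm_store_pd`.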

--
Mats