Odd assembly problem

**CornedBee** · 11-13-2007

Ah, now this is a nice inner loop.

Code:

.L2:
        movsd   &#37;xmm1, %xmm0
        decl    %eax
        mulsd   %xmm1, %xmm0
        addsd   %xmm2, %xmm0
        sqrtsd  %xmm0, %xmm0
        divsd   %xmm0, %xmm1
        jne     .L2

g++ -S -o asm.s -O3 -ffast-math -mfpmath=sse asm.cpp

Note that xmm2 simply holds the constant 1.0, whereas xmm1 is the home register of x.

**matsp** · 11-13-2007

Originally Posted by CornedBee

Ah, now this is a nice inner loop.

Code:

.L2:
        movsd   %xmm1, %xmm0   ;; 2
        decl    %eax
        mulsd   %xmm1, %xmm0     ;; 4 
        addsd   %xmm2, %xmm0    ;; 4
        sqrtsd  %xmm0, %xmm0     ;; 27
        divsd   %xmm0, %xmm1      ;; 20
        jne     .L2

g++ -S -o asm.s -O3 -ffast-math -mfpmath=sse asm.cpp

Note that xmm2 simply holds the constant 1.0, whereas xmm1 is the home register of x.

That makes 57 cycles (not counting the "looping" instructions).

Now, with "packed double":

Code:

.L2:
        movapd   %xmm1, %xmm0   ;; 2    [2]
        sub      $2, %eax
        mulpd   %xmm1, %xmm0     ;; 5    [4]
        addpd   %xmm2, %xmm0    ;; 5    [4]
        sqrtpd  %xmm0, %xmm0     ;; 51  [27]
        divpd   %xmm0, %xmm1      ;; 37  [20]
        jne     .L2

Exactly 100 cycles (again, not counting the "loop" part), a saving of 7 cycles per iteration - or about 6%. Not much, because the sqrt and divide aren't in parallel.

Numbers in brackets are for "new generation AMD" - the Quad Core processors. Now we are talking - 57 clocks for twice the processing [and the same speed for the original code].

If single precision is enough, we could do it even faster: First of all, we could double again the number of calculations, but also change the final calculation to use RSQRTPS - which calculates 1/sqrt(x) in 5 cycles instead of 27, and mulss instead of divps at the end, which is 4 cycles instead of 20, making the total count 19 for the loop, instead of 57. But the inverted square root is only available in single precision [and probably not quite as precise].

--
Mats

**CornedBee** · 11-13-2007

How can you double (or even quadruple) the instructions if one loop iteration depends on the result of the previous?

**matsp** · 11-13-2007

Originally Posted by CornedBee

How can you double (or even quadruple) the instructions if one loop iteration depends on the result of the previous?

True, you can't. However, if, as I think, the actual code [as opposed to this benchmark code] isn't iteratively calculating the value of x = sin(arctan(x)) for 16 million times, but rather calculating y = sin(arctan(x)) for a great number of x (and y) values - in which case you'd be gaining a fair bit by doubling or quadrupling.

--
Mats

**abachler** · 11-18-2007

Yes, the final code will of course be performing this calculation on vectors of data with several hundred thousand entries, but I need double precision, otherwise id use single and get the twice as any (and faster) ops per instruction.

Code:

    double* pInput; // assigned a value previous to this code
    double* pWeight;
    DWORD dwCount = 1048576;
    __m128 Zero[] = {0.0 , 0.0};

    __asm push      eax
    __asm xor       eax , eax
    __asm movapd    xmm1 , Zero
theloop:
    __asm movapd    xmm0 , pInput 
    __asm add       pInput , 0x00000008
    __asm mulpd     xmm0 , pWeight
    __asm add       pWeight , 0x00000008
    __asm addpd     xmm1 , xmm0
    __asm add       eax , 0x00000002
    __asm cmp       eax , dwCount
    __asm jl        theloop
    __asm movapd    tempout , xmm1
    __asm fld       tempout
    tempout++;
    __asm fld       tempout
    __asm fadd
    __asm fstp      st(0)
    __asm fld1            
    __asm fpatan    
    __asm fsin
    __asm fstp      Output
    __asm pop       eax

so as you can see, the time taken to perform the sin and tan are trivial when compared to the overall calculation. cant really use teh sqrt method as it only hold approx in the positive domain, and there will be results in the negative domain. For those of you that havent picked up on what the above code does, its an SSE implementation of a simple perceptron, using the sin(atan(x)) as the sigmoid.

**CornedBee** · 11-19-2007

Code:

 __m128 Zero[] = {0.0 , 0.0};

Be aware that VC++ doesn't automatically align this on 16-byte boundaries, as it should. At least, I've seen the issue come up many times.

**matsp** · 11-19-2007

Originally Posted by abachler

Yes, the final code will of course be performing this calculation on vectors of data with several hundred thousand entries, but I need double precision, otherwise id use single and get the twice as any (and faster) ops per instruction.

Code:

    double* pInput; // assigned a value previous to this code
    double* pWeight;
    DWORD dwCount = 1048576;
    __m128 Zero[] = {0.0 , 0.0};
add "static const" to make sure this isn't being initialized every call - not that it's REALLY much time, but it's saves some and with no ill effects.

    __asm push      eax
    __asm xor       eax , eax
    __asm movapd    xmm1 , Zero
theloop:
    __asm movapd    xmm0 , pInput 
    __asm add       pInput , 0x00000008
Isn't an xmm register 16 bytes??
    __asm mulpd     xmm0 , pWeight
    __asm add       pWeight , 0x00000008
    __asm addpd     xmm1 , xmm0
    __asm add       eax , 0x00000002
    __asm cmp       eax , dwCount
Start eax with dwcount, and count it down instead of a two-step add/compare.
    __asm jl        theloop
    __asm movapd    tempout , xmm1
    __asm fld       tempout
    tempout++;
    __asm fld       tempout
Skip tempout++ and used tempout+4 instead?
    __asm fadd
    __asm fstp      st(0)
Unnecessary operation??
    __asm fld1            
    __asm fpatan    
    __asm fsin
    __asm fstp      Output
    __asm pop       eax

so as you can see, the time taken to perform the sin and tan are trivial when compared to the overall calculation. cant really use teh sqrt method as it only hold approx in the positive domain, and there will be results in the negative domain. For those of you that havent picked up on what the above code does, its an SSE implementation of a simple perceptron, using the sin(atan(x)) as the sigmoid.

Comments in red above.

--
Mats

**matsp** · 11-19-2007

Assmung the proportion of positive and negative inputs to your function isn't mainly negative, I think the gain from using sqrt() for positive numbers is big enough that it's worth doing an "if (input >=0) ... else ..." for the two options. [Unless we can do "temp = sgn(input); input = abs(input); <calculate using sqrt>; output *= temp;" to "fix up" the sign - I don't know if -sin(atan(x)) == sin(atan(-x)) or not].

Both options should give a noticable benefit despite the overheads.

--
Mats

**abachler** · 11-19-2007

add "static const" to make sur

actually I use __declspec(align(16)) static , I just snipped a lot of the function to just show the part of concern

Isn't an xmm register 16 bytes??

Yes it is, but its exactly the same as 2 doubles right next to each other or 4 flots, depending on which functions you use. This is what makes it so fast at handling vectors, there is no conversion process needed.

Start eax with dwcount, and count it down instead of a two-step add/compare.

in which case it just becomes a 2 step subtract and compare, and it causes cache issues with the read-ahead logic. remember, we ar processing more than one vector member at a time, so a simple LOOP wont work here, although I will optimize the add by putting 0x00000002 into ebx and doing a register to register add, 0x00000008 into ecx and dwCount into edx

Skip tempout++ and used tempout+4 instead?

yeah, I missed that one, forgot that the r/m field has several common increments encoded in it. That will also gt rid of the sloppy C++ in the middle of an asm block

__asm fadd
__asm fstp st(0)
Unnecessary operation??

for some reason my compiler refuses to accept the faddp opcode without complaining, so i added this as a quick fix until I figure otu what the problem is, it was late when I wrote that code.

As for the portion of positive to negative inputs and wieghts, its completely unknown, it could be all negative or all positive, or somewhere in between.

**matsp** · 11-19-2007

Originally Posted by abachler

>Start eax with dwcount, and count it down instead of a two-step add/compare.
in which case it just becomes a 2 step subtract and compare, and it causes cache issues with the read-ahead logic. remember, we ar processing more than one vector member at a time, so a simple LOOP wont work here, although I will optimize the add by putting 0x00000002 into ebx and doing a register to register add, 0x00000008 into ecx and dwCount into edx

Eh, no, that's not any better. There is no performance difference betweed a small constant and a register in adding - it's better to let the compiler use those registers for addresssing data, etc.

But dwCount _IS_ the number of items to do, right? So if you cound down to zero/negative, you don't need a compare at all - just a "jump not zero" or "jump not signed" - saves one instruction. Also, try to shuffle this up a few instructions, so that the subtract has time to finish before the branch is taken.

You can also improve the speed by unrolling the loop [calculating more than one set of values each time].

My comment on xmm registers being 16 bytes:
Are you actually intending to calculate

Code:

{ pInput[0], pInput[1] }
{ pInput[1], pInput[2] }
{ pInput[2], pInput[3] }
{ pInput[3], pInput[4] }

The braces indicate the pair being calculated each iteration.

I would have thought you wanted:

Code:

{ pInput[0], pInput[1] }
{ pInput[2], pInput[3] }
{ pInput[4], pInput[5] }
{ pInput[6], pInput[7] }

Although I expect that MOVAPD would actually fail if you add 8, because the next read is now unaligned.

And the same applies to pWeight, obviously.

Oh, and you still don't need to push and pop eax - the compiler will figure that out itself.

--
Mats

**abachler** · 11-19-2007

Originally Posted by matsp

Eh, no, that's not any better. There is no performance difference betweed a small constant and a register in adding - it's better to let the compiler use those registers for addresssing data, etc.

But dwCount _IS_ the number of items to do, right? So if you cound down to zero/negative, you don't need a compare at all - just a "jump not zero" or "jump not signed" - saves one instruction. Also, try to shuffle this up a few instructions, so that the subtract has time to finish before the branch is taken.

You can also improve the speed by unrolling the loop [calculating more than one set of values each time].

My comment on xmm registers being 16 bytes:
Are you actually intending to calculate

Code:

{ pInput[0], pInput[1] }
{ pInput[1], pInput[2] }
{ pInput[2], pInput[3] }
{ pInput[3], pInput[4] }

The braces indicate the pair being calculated each iteration.

I would have thought you wanted:

Code:

{ pInput[0], pInput[1] }
{ pInput[2], pInput[3] }
{ pInput[4], pInput[5] }
{ pInput[6], pInput[7] }

Although I expect that MOVAPD would actually fail if you add 8, because the next read is now unaligned.

And the same applies to pWeight, obviously.

Oh, and you still don't need to push and pop eax - the compiler will figure that out itself.

--
Mats

dwCount can be an odd number, and while the vector will always contain th eproper even number of members, the actual dwCount must be preserved elsewhere in the program. You cannot count down because for one, an odd number of dwCount would cause the eax to underrun to 0xffffffff and cause a serious memory problem since eax could never be zero it woudl keep going until it produced a protection fault, adn two there woudl still be the compare, since LOOP uses ECX and decrements it by one on its own, so there woudl still be two instructions to dec it by 2. Not to mention that INC/DEC are deprecated in x64 which would make this a non-portable solution. The primary issue with puttign the constants into registers isnt to speed up the operation, but to reduce the memory bandwidth usage, which in large vector calculations is the REAL bottleneck.

And I do need to push and pop eax because these are __asm lines, not a single __asm {} block. I do this because it improves performance, since an __asm block preserves the entire machine state, whereas I can just preserve the part im going to modify.

You are correct about the 8, it should be 0x00000010 not 0x00000008 , again, sorry it was late when I wrote that code

**matsp** · 11-19-2007

Originally Posted by abachler

dwCount can be an odd number, and while the vector will always contain th eproper even number of members, the actual dwCount must be preserved elsewhere in the program. You cannot count down because for one, an odd number of dwCount would cause the eax to underrun to 0xffffffff and cause a serious memory problem since eax could never be zero it woudl keep going until it produced a protection fault, adn two there woudl still be the compare, since LOOP uses ECX and decrements it by one on its own, so there woudl still be two instructions to dec it by 2. Not to mention that INC/DEC are deprecated in x64 which would make this a non-portable solution. The primary issue with puttign the constants into registers isnt to speed up the operation, but to reduce the memory bandwidth usage, which in large vector calculations is the REAL bottleneck.

I'm not suggesting you change the content or meaning of dwCount itself. But if you decrement instead of increment, then you do not need a compare instruction - the processor sets the condition flags in decrement. If you have an odd value, you would have to "round up" before you start the loop [assuming it's valid to use the final, uninitialized element - otherwise your code is broken anyways], so that you always have an even value. If you do this, you can finish on zero, so you don't have to compare - and saving one instruction that is dependant on some other instruction is definitely a win - even if the individual instruction makes little difference.

You do not use any memory bandwidth by using register instead of constant for add or subtract - aside from POSSIBLY one byte extra code-space, but the code is in cache once it's been read anyways, so it does not compete with your data fetches.

Also, to clarify your misunderstanding: The INC/DEC instructions are still available in x86_64, but the simple opcode form is no longer available [there are two ways to produce exactly the same instruction in the x86 instruction set - in the 64-bit version, one of them was removed to make it a prefix instead, as there was no other "long sequence of single byte opcodes that can be changed", and double byte prefix quickly become a mess in the instruction decoder and code generators etc]. The assembler/compiler can still generate INC/DEC instructions, they just start with 0xFF ... [if my memory serves right].

And I do need to push and pop eax because these are __asm lines, not a single

Code:

__asm {}

block. I do this because it improves performance, since an __asm block preserves the entire machine state, whereas I can just preserve the part im going to modify.

Ok, fair enough - although "saves entire" state is not correct - it saves what it has to.

--
Mats

**abachler** · 11-19-2007

even if the const is in the cache, its still 4 bytes (its a DWORD const) of memory bandwidth that gets used, plus it competes with other instructions in the pipeline for access to the cache, whereas a register/register add only competes for the ALU. It could use the ADD r/m32 , imm8 opcode, but thats still 3 bytes versus 2 for teh register to register. plus it woudl be easier to realign other ops if the offset was 2 versus 3. true that I could use a NOP to realign, but id rather use some opcode that would improve performance by doing actual work. both ways are valid, it would depend on the specific tradeoffs you wanted to make. In this case I think Im goign to rewrite ti once again to improve performance even further, ill post the code later.

**abachler** · 11-19-2007

Code:

// this code is copyright 2007 Anthony Q. Bachler
// you may use this code free of charge for non-commercial purposes only
// for commercial use contact the author.  This code is not released under the GPL

void CNeuron::Process(double* Input , double* Output){
    double Scratch = 0.0;
    __declspec(align(16)) __m128 Zero[] = {0.0 , 0.0};
    __declspec(align(16)) __m128 tempout[] = {0.0 , 0.0};
    double* pInput;
    double* pWeight;
    DWORD dwCount;

    pInput = Input;
    pWeight = this->Weights;
    dwCount = this->dwNumberOfInputs;
    dwCount = ((dwCount+7)/8);  // calculate total iterations
    __asm push      eax
    __asm push      edx
    __asm push      ecx
    __asm xor       eax , eax
    __asm mov       edx , 0x00000040
    __asm mov       ecx , dwCount
    __asm movapd    xmm1 , Zero
    __asm movapd    xmm3 , Zero
    __asm movapd    xmm5 , Zero
    __asm movapd    xmm7 , Zero
theloop:
    __asm movapd    xmm0 , pInput
    __asm mulpd     xmm0 , pWeight
    __asm addpd     xmm1 , xmm0
    __asm movapd    xmm2 , pInput+16
    __asm mulpd     xmm2 , pWeight+16
    __asm addpd     xmm3 , xmm2
    __asm movapd    xmm4 , pInput+32
    __asm mulpd     xmm4 , pWeight+32
    __asm addpd     xmm5 , xmm4
    __asm movapd    xmm6 , pInput+48
    __asm add       pInput , edx
    __asm mulpd     xmm6 , pWeight+48
    __asm add       pWeight , edx
    __asm addpd     xmm7 , xmm6
    __asm loop      theloop
    __asm addpd     xmm1 , xmm3
    __asm addpd     xmm5 , xmm7
    __asm addpd     xmm1 , xmm5
    __asm movapd    tempout , xmm1
    __asm fld       tempout
    __asm fld       tempout+8
    __asm fadd
    __asm fstp      st(0)
    __asm fld1            
    __asm fpatan    
    __asm fsin
    __asm fstp      Output
    __asm pop       ecx
    __asm pop       edx
    __asm pop       eax
/*
    for(DWORD temp=0;temp<this->dwNumberOfInputs;temp++){
        Scratch += Input[temp] * this->Weights[temp];
        }
    Output[0] = sin(atan(Scratch));
//*/
    return;
    }

**matsp** · 11-20-2007

Have you disassembled the compiled code to see if pInput and pWeight are in registers? If they are not, you may wan to consider forcing that in your assembler code.

--
Mats

Thread: Odd assembly problem

Thread Tools

Search Thread

Display

Similar Threads

Bin packing problem....

Words and lines count problem

half ADT (nested struct) problem...

binary tree problem - help needed