Thread: Odd assembly problem

  1. #46
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Unfortunately, this is written as a static library, so I'll have to write an actual application to check how the compiler handles the pointers. I'm pretty sure I can use EDI and ESI; the compiler doesn't throw a warning for either one of them. I'll hack that up later and post the new code. I heavily modified it last night to increase performance a little more, and I already have a few more optimizations in mind.
    Last edited by abachler; 11-20-2007 at 07:37 AM.

  2. #47
    Registered User VirtualAce's Avatar
    Join Date
    Aug 2001
    Posts
    9,607
    Why don't you guys use __asm blocks? All those separate __asm lines are ugly.

  3. #48
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by Bubba View Post
    Why don't you guys use __asm blocks? All those separate __asm lines are ugly.
    I completely agree. Abachler seems to think that it generates worse code because "it saves the entire CPU state", but in my experience it saves only what needs saving, and as you say, it makes the code much tidier and easier to read.
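
    For illustration, the register setup and inner loop from earlier in the thread could be written as one block, something like this (just a sketch of the syntax, not compiled, using the same variable names as the posted routine):

    Code:
        // same setup as the per-line version, written as a single __asm block
        __asm {
            xorpd   xmm1 , xmm1         // running sum = 0
            mov     ecx , dwCount       // iteration count
            mov     esi , pInput        // input pointer
            mov     edi , pWeight       // weight pointer
        blockloop:
            movapd  xmm0 , [esi]
            add     esi , 16
            mulpd   xmm0 , [edi]
            add     edi , 16
            addpd   xmm1 , xmm0
            loop    blockloop
        }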

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  4. #49
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Tomato, tomahto. I find both methods equally easy to read. In any case, I hacked up a test program, and it seems my original method of sequencing the operations through all 8 of the basic XMM registers actually slows the routine down for some reason: it took 65 seconds to do a run where the FPU version only took 5 seconds. I rewrote the code to use the bare minimum of XMM registers and now it runs much faster, about 15-20% faster than the FPU version. I think the real bottleneck is ultimately memory bandwidth, but a 15% speed increase is still pretty good. I also took matsp's suggestion about putting the pointers into registers, and that seems to have helped as well. And I fixed the bug in the original code where I was assigning values to the pointer registers themselves rather than to the memory they pointed to. Anyway, here is the new code:

    Code:
    void CNeuron::Process(double* Input , double* Output){
        double Scratch = 0.0;
        // packed-double temporaries; Input and this->Weights must be 16-byte
        // aligned for the movapd/mulpd below
        __declspec(align(16)) __m128d Zero = {0.0 , 0.0};
        __declspec(align(16)) __m128d tempout = {0.0 , 0.0};
        double* pInput;
        double* pWeight;
        DWORD dwCount;
        DWORD dwExtra;
    
        pInput = Input;
        pWeight = this->Weights;
        dwCount = this->dwNumberOfInputs;
        dwCount = ((dwCount)/2);  // number of 2-wide SSE iterations
        dwExtra = this->dwNumberOfInputs - (dwCount * 2);  // 0 or 1 leftover input
        __asm push      eax
        __asm push      edx
        __asm push      ecx
        __asm push      esi
        __asm push      edi
        __asm lea       eax , Zero          // address of the zero constant
        __asm mov       edx , 0x00000010    // stride: 16 bytes = 2 doubles
        __asm mov       ecx , dwCount
        __asm mov       esi , pInput
        __asm mov       edi , pWeight
        __asm movapd    xmm1 , [eax]        // running sum = {0.0 , 0.0}
        __asm jecxz     aftermain           // fewer than 2 inputs: skip the SSE loop
    theloop:
        __asm movapd    xmm0 , [esi]
        __asm add       esi , edx
        __asm mulpd     xmm0 , [edi]
        __asm add       edi , edx
        __asm addpd     xmm1 , xmm0
        __asm loop      theloop
    aftermain:
        __asm mov       ecx , dwExtra
        __asm movapd    tempout , xmm1
        __asm fld       qword ptr tempout       // low half of the packed sum
        __asm fld       qword ptr tempout+8     // high half
        __asm faddp     st(1) , st(0)           // horizontal add on the x87 stack
        __asm cmp       ecx , 0x00000000
        __asm je        skiploop
    extraloop:
        __asm fld       qword ptr [esi]
        __asm add       esi , 8
        __asm fld       qword ptr [edi]
        __asm add       edi , 8
        __asm fmulp     st(1) , st(0)
        __asm faddp     st(1) , st(0)
        __asm loop      extraloop
    skiploop:
        __asm fld1                              // Output = sin(atan(sum))
        __asm fpatan
        __asm fsin
        __asm mov       eax , Output            // store to *Output, not to the pointer variable
        __asm fstp      qword ptr [eax]
        __asm pop       edi
        __asm pop       esi
        __asm pop       ecx
        __asm pop       edx
        __asm pop       eax
    /*  reference C version:
        for(DWORD temp=0;temp<this->dwNumberOfInputs;temp++){
            Scratch += Input[temp] * this->Weights[temp];
            }
        Output[0] = sin(atan(Scratch));
    //*/
        return;
        }
    Last edited by abachler; 11-26-2007 at 10:39 PM.

  5. #50
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Code:
        __asm movapd    tempout , xmm1
        __asm fld       qword ptr tempout
        __asm fld       qword ptr tempout+8
    This is "bad" because it forces the processor to flush out the result from the xmm registers and load into the FPU registers, and those are separate units - it's much better if you can continue using XMM registers all the way through and then do the FPU math at the very end.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  6. #51
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Yeah, I figure I can optimize it by doing a shift and add within SSE. Nice call, though. I'm not sure it will save any clock cycles, since the shift may take just as long as the write, and the theoretical savings in memory bandwidth falls below the granularity of the cache anyway. It may even be better to switch to the FPU early, thus allowing other threads to use the SSE unit (ain't hyperthreading great?).

  7. #52
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by abachler View Post
    Yeah, I figure I can optimize it by doing a shift and add within SSE. Nice call, though. I'm not sure it will save any clock cycles, since the shift may take just as long as the write, and the theoretical savings in memory bandwidth falls below the granularity of the cache anyway. It may even be better to switch to the FPU early, thus allowing other threads to use the SSE unit (ain't hyperthreading great?).
    Sorry, I mistakenly thought the "skiploop" part was part of the overall loop; it's not, so I think the difference between what you are doing and anything else will be in the noise.

    Since the x87 and SSE instructions use the same piece of hardware [and there's only one such unit], it won't make much difference what you do with hyperthreading. You are almost certainly filling the FPU/SSE unit fairly well either way.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  8. #53
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    x87 and SSE same piece of hardware? It's the MMX and the x87 that have to share; the SSE unit should be totally independent.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  9. #54
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by CornedBee View Post
    x87 and SSE same piece of hardware? It's the MMX and the x87 that have to share; the SSE unit should be totally independent.
    Well, it depends on where you "draw the line". MMX and x87 use THE SAME REGISTERS, and SSE has its own set, but I'm pretty sure there aren't two sets of full-fledged floating point add and mul units. I know for a fact that AMD's processors are designed with a single set of floating point execution units that are shared between x87 and SSE.
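
    The register aliasing is exactly why the EMMS instruction exists. A minimal sketch (hypothetical function, MMX intrinsics, not related to the code in this thread):

    Code:
    #include <mmintrin.h>   // MMX intrinsics
    
    // MM0-MM7 alias the x87 stack registers ST(0)-ST(7), so any MMX work
    // must be followed by EMMS before the x87 FPU is used again.
    double mix_mmx_and_x87(const int* a, const int* b, double x)
    {
        // assumes a and b each point to two packed 32-bit ints
        __m64 sum = _mm_add_pi32(*(const __m64*)a, *(const __m64*)b);
        int low = _mm_cvtsi64_si32(sum);   // pull the low lane back out
        _mm_empty();                       // EMMS: clear the aliased x87 tag word
        return x * low;                    // x87/SSE double math is safe again
    }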

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  10. #55
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    No, actually x87 and MMX use the same physical math circuitry, whereas SSE has its own separate math units. The Intel optimization guide even suggests you find ways to keep both units busy, as they effectively process in parallel. As for AMD doing it with combined hardware, that's just one more reason R&D people prefer Intel, that and the fact that the Intel x87 is literally twice as fast as the AMD. AMD's FPU only runs at half clock speed, whereas Intel's runs at full clock speed. Intel also still supports 80-bit precision, while I believe 80-bit was deprecated on the AMD a long time ago, back when Cyrix was still in the game.
    Last edited by abachler; 12-04-2007 at 10:27 PM.

  11. #56
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by abachler View Post
    No, actually x87 and MMX use the same physical math circuitry, whereas SSE has its own separate math units. The Intel optimization guide even suggests you find ways to keep both units busy, as they effectively process in parallel. As for AMD doing it with combined hardware, that's just one more reason R&D people prefer Intel, that and the fact that the Intel x87 is literally twice as fast as the AMD.
    Ok, I don't know the Intel architecture that well.

    AMD's FPU only runs at half clock speed, whereas Intel's runs at full clock speed.
    This is incorrect for all AMD processors that I'm familiar with [and I have previously worked AT AMD], which are the 486, 5x86, K5, K6, K7 and K8. I'm not sure whether some 387 co-processor may have had that limitation, as I never really worked with any 387 co-processors.

    Intel also still supports 80-bit precision, while I believe 80-bit was deprecated on the AMD a long time ago, back when Cyrix was still in the game.
    That is also definitely incorrect. AMD supports all x87 formats: 32, 64 and 80 bits.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  12. #57
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    I'm basing the half-speed claim about AMD's FPU on their own data, which shows that it takes twice as many clock cycles as the Intel. So unless AMD is misrepresenting the performance of their chips ...

    Maybe the data AMD is providing to customers is faulty. Whether the chips underperform or the data is faulty, either way I would choose to do business with Intel.

    In either case, I think Intel may be going the AMD route of integrating the x87 and SSE units. They both take up a lot of real estate on the die, and writing code that keeps both units busy in a meaningful and productive way is rarely feasible. Direct SSE-to-x87 register-to-register moves would help, but the real bottleneck in modern systems is the memory bandwidth itself.

    I routinely do calculations on multiple sets that have a million or more members. A given set might be 8MB, which needs to be processed against another 8MB set, so my data doesn't fit into the L2 cache, which means my throughput is determined more by the 800MHz memory than by the processing speed of the core itself.
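
    Just to illustrate the kind of loop I mean, a streaming dot product over large arrays with software prefetch looks something like this (a hypothetical helper using SSE2 intrinsics, not my actual class code; whether the prefetch distance helps at all depends on the memory system):

    Code:
    #include <cstddef>
    #include <xmmintrin.h>  // _mm_prefetch
    #include <emmintrin.h>  // SSE2 double intrinsics
    
    // assumes in and w are 16-byte aligned and n is a multiple of 2
    double dot_large(const double* in, const double* w, std::size_t n)
    {
        __m128d acc = _mm_setzero_pd();
        for (std::size_t i = 0; i < n; i += 2) {
            // hint the upcoming cache lines in ahead of use; NTA keeps the
            // streamed data from displacing everything else in the cache
            _mm_prefetch((const char*)(in + i) + 512, _MM_HINT_NTA);
            _mm_prefetch((const char*)(w + i) + 512, _MM_HINT_NTA);
            __m128d a = _mm_load_pd(in + i);
            __m128d b = _mm_load_pd(w + i);
            acc = _mm_add_pd(acc, _mm_mul_pd(a, b));
        }
        double lanes[2];
        _mm_storeu_pd(lanes, acc);      // spill the two partial sums
        return lanes[0] + lanes[1];     // horizontal add of the lanes
    }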

  13. #58
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by abachler View Post
    I'm basing the half-speed claim about AMD's FPU on their own data, which shows that it takes twice as many clock cycles as the Intel. So unless AMD is misrepresenting the performance of their chips ...
    Is this for SSE or x87 FPU operations? SSE operations are slower than on the more recent Intel products because AMD has only one 64-bit ADD and one 64-bit MUL unit, where Intel has a 128-bit ADD and a 128-bit MUL unit.

    I don't believe the data itself is flawed, although I have never made any effort to verify it [it wasn't part of my job to do that].

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  14. #59
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    It's probably based on SSE; the info I got wasn't that specific, only stating 'floating point'. Although most developers who need to process large arrays of data are now moving over to SSE, so the slower performance on the AMD would impact its value for those calculations. How much of an impact it would have depends on the actual calculations: for short vectors it may not affect throughput much, but for large vectors, where out-of-order execution really pays off for tight sequences of code, it would have a larger impact, I would think.

  15. #60
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Actually, most developers who need to process small amounts of data are going over to SSE, too. GCC generates SSE code by default for x64 targets. The reason is that the x87 execution model was always broken by design, and compiler writers detested the thing.

    It's true that the Core 2 is the first CPU with 128-bit SSE execution width, and thus able to do an SSE operation in one step, whereas all previous CPUs, including AMD's, process it in two 64-bit halves, thus taking twice as many cycles.
    What is not true is that AMD's x87 unit is slower than Intel's.
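
    To illustrate the default: the plain scalar loop below compiles to SSE2 (mulsd/addsd) on an x86-64 target with no special flags, while on 32-bit x86 you have to ask for it with something like g++ -O2 -msse2 -mfpmath=sse (a hypothetical example, just a sketch):

    Code:
    // scalar code; the codegen (x87 vs SSE2) depends on target and flags
    double dot(const double* a, const double* b, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; ++i)
            sum += a[i] * b[i];   // mulsd/addsd under SSE2 codegen, fmul/fadd under x87
        return sum;
    }
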
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law
