Thread: Odd assembly problem

  1. #46
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Unfortunately, this is written as a static library, so I'll have to write an actual application to check how the compiler handles the pointers. I'm pretty sure I can use EDI and ESI; the compiler doesn't throw a warning for either one of them. I'll hack that up later and post the new code. I heavily modified it last night to increase performance a little more, and I already have a few more optimizations in mind.
    Last edited by abachler; 11-20-2007 at 07:37 AM.

  2. #47
    Registered User VirtualAce's Avatar
    Join Date
    Aug 2001
    Posts
    9,607
    Why don't you guys use __asm blocks? All those separate __asm lines are ugly.

  3. #48
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by Bubba View Post
    Why don't you guys use __asm blocks? All those separate __asm lines are ugly.
    I completely agree. Abachler seems to think that it generates worse code because "it saves the entire CPU state", but in my experience it saves only what needs saving, and as you say, it makes the code much tidier and easier to read.
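
    For illustration, the register setup and inner loop from earlier in the thread could be written as one block, something like this (just a sketch of the syntax, not compiled, using the same variable names as the posted routine):

    Code:
        // same setup as the per-line version, written as a single __asm block
        __asm {
            xorpd   xmm1 , xmm1         // running sum = 0
            mov     ecx , dwCount       // iteration count
            mov     esi , pInput        // input pointer
            mov     edi , pWeight       // weight pointer
        blockloop:
            movapd  xmm0 , [esi]
            add     esi , 16
            mulpd   xmm0 , [edi]
            add     edi , 16
            addpd   xmm1 , xmm0
            loop    blockloop
        }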

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  4. #49
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Tomato, tomahto. I find both methods equally easy to read. In any case, I hacked up a test program, and it seems my original method of sequencing the operations through all 8 of the basic XMM registers actually slows the routine down for some reason: it took 65 seconds to do a run where the FPU version only took 5 seconds. I rewrote the code to use the bare minimum of XMM registers and now it runs much faster, about 15-20% faster than the FPU version. I think the real bottleneck is ultimately memory bandwidth, but a 15% speed increase is still pretty good. I also took matsp's suggestion about putting the pointers into registers, and that seems to have helped as well. And I fixed the bug in the original code where I was assigning values to the pointer registers themselves rather than to the memory they pointed to. Anyway, here is the new code:

    Code:
    void CNeuron::Process(double* Input , double* Output){
        double Scratch = 0.0;
        // packed-double temporaries; Input and this->Weights must be 16-byte
        // aligned for the movapd/mulpd below
        __declspec(align(16)) __m128d Zero = {0.0 , 0.0};
        __declspec(align(16)) __m128d tempout = {0.0 , 0.0};
        double* pInput;
        double* pWeight;
        DWORD dwCount;
        DWORD dwExtra;
    
        pInput = Input;
        pWeight = this->Weights;
        dwCount = this->dwNumberOfInputs;
        dwCount = ((dwCount)/2);  // number of 2-wide SSE iterations
        dwExtra = this->dwNumberOfInputs - (dwCount * 2);  // 0 or 1 leftover input
        __asm push      eax
        __asm push      edx
        __asm push      ecx
        __asm push      esi
        __asm push      edi
        __asm lea       eax , Zero          // address of the zero constant
        __asm mov       edx , 0x00000010    // stride: 16 bytes = 2 doubles
        __asm mov       ecx , dwCount
        __asm mov       esi , pInput
        __asm mov       edi , pWeight
        __asm movapd    xmm1 , [eax]        // running sum = {0.0 , 0.0}
        __asm jecxz     aftermain           // fewer than 2 inputs: skip the SSE loop
    theloop:
        __asm movapd    xmm0 , [esi]
        __asm add       esi , edx
        __asm mulpd     xmm0 , [edi]
        __asm add       edi , edx
        __asm addpd     xmm1 , xmm0
        __asm loop      theloop
    aftermain:
        __asm mov       ecx , dwExtra
        __asm movapd    tempout , xmm1
        __asm fld       qword ptr tempout       // low half of the packed sum
        __asm fld       qword ptr tempout+8     // high half
        __asm faddp     st(1) , st(0)           // horizontal add on the x87 stack
        __asm cmp       ecx , 0x00000000
        __asm je        skiploop
    extraloop:
        __asm fld       qword ptr [esi]
        __asm add       esi , 8
        __asm fld       qword ptr [edi]
        __asm add       edi , 8
        __asm fmulp     st(1) , st(0)
        __asm faddp     st(1) , st(0)
        __asm loop      extraloop
    skiploop:
        __asm fld1                              // Output = sin(atan(sum))
        __asm fpatan
        __asm fsin
        __asm mov       eax , Output            // store to *Output, not to the pointer variable
        __asm fstp      qword ptr [eax]
        __asm pop       edi
        __asm pop       esi
        __asm pop       ecx
        __asm pop       edx
        __asm pop       eax
    /*  reference C version:
        for(DWORD temp=0;temp<this->dwNumberOfInputs;temp++){
            Scratch += Input[temp] * this->Weights[temp];
            }
        Output[0] = sin(atan(Scratch));
    //*/
        return;
        }
    Last edited by abachler; 11-26-2007 at 10:39 PM.

  5. #50
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Code:
        __asm movapd    tempout , xmm1
        __asm fld       qword ptr tempout
        __asm fld       qword ptr tempout+8
    This is "bad" because it forces the processor to flush out the result from the xmm registers and load into the FPU registers, and those are separate units - it's much better if you can continue using XMM registers all the way through and then do the FPU math at the very end.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  6. #51
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Yeah, I figure I can optimize it by doing a shift and add within SSE. Nice call, though. I'm not sure it will save any clock cycles, since the shift may take just as long as the write, and the theoretical savings in memory bandwidth falls below the granularity of the cache anyway. It may even be better to switch to the FPU early, thus allowing other threads to use the SSE unit (ain't hyperthreading great?).

  7. #52
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by abachler View Post
    Yeah, I figure I can optimize it by doing a shift and add within SSE. Nice call, though. I'm not sure it will save any clock cycles, since the shift may take just as long as the write, and the theoretical savings in memory bandwidth falls below the granularity of the cache anyway. It may even be better to switch to the FPU early, thus allowing other threads to use the SSE unit (ain't hyperthreading great?).
    Sorry, I mistakenly thought the "skiploop" part was part of the overall loop; it's not, so I think the difference between what you are doing and anything else will be in the noise.

    Since the x87 and SSE instructions use the same piece of hardware [and there's only one such unit], it won't make much difference what you do with hyperthreading. You are almost certainly filling the FPU/SSE unit fairly well either way.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  8. #53
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    x87 and SSE same piece of hardware? It's the MMX and the x87 that have to share; the SSE unit should be totally independent.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  9. #54
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by CornedBee View Post
    x87 and SSE same piece of hardware? It's the MMX and the x87 that have to share; the SSE unit should be totally independent.
    Well, it depends on where you "draw the line". MMX and x87 use THE SAME REGISTERS, and SSE has its own set, but I'm pretty sure there aren't two sets of full-fledged floating point add and mul units. I know for a fact that AMD's processors are designed with a single set of floating point execution units that are shared between x87 and SSE.
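
    The register aliasing is exactly why the EMMS instruction exists. A minimal sketch (hypothetical function, MMX intrinsics, not related to the code in this thread):

    Code:
    #include <mmintrin.h>   // MMX intrinsics
    
    // MM0-MM7 alias the x87 stack registers ST(0)-ST(7), so any MMX work
    // must be followed by EMMS before the x87 FPU is used again.
    double mix_mmx_and_x87(const int* a, const int* b, double x)
    {
        // assumes a and b each point to two packed 32-bit ints
        __m64 sum = _mm_add_pi32(*(const __m64*)a, *(const __m64*)b);
        int low = _mm_cvtsi64_si32(sum);   // pull the low lane back out
        _mm_empty();                       // EMMS: clear the aliased x87 tag word
        return x * low;                    // x87/SSE double math is safe again
    }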

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  10. #55
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    No, actually x87 and MMX use the same physical math circuitry, whereas SSE has its own separate math units. The Intel optimization guide even suggests you find ways to keep both units busy, as they effectively process in parallel. As for AMD doing it with combined hardware, that's just one more reason R&D people prefer Intel, that and the fact that the Intel x87 is literally twice as fast as the AMD. AMD's FPU only runs at half clock speed, whereas Intel's runs at full clock speed. Intel also still supports 80-bit precision, while I believe 80-bit was deprecated on the AMD a long time ago, back when Cyrix was still in the game.
    Last edited by abachler; 12-04-2007 at 10:27 PM.

  11. #56
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by abachler View Post
    No, actually x87 and MMX use the same physical math circuitry, whereas SSE has its own separate math units. The Intel optimization guide even suggests you find ways to keep both units busy, as they effectively process in parallel. As for AMD doing it with combined hardware, that's just one more reason R&D people prefer Intel, that and the fact that the Intel x87 is literally twice as fast as the AMD.
    Ok, I don't know the Intel architecture that well.

    AMD's FPU only runs at half clock speed, whereas Intel's runs at full clock speed.
    This is incorrect for all AMD processors that I'm familiar with [and I have previously worked AT AMD], which are the 486, 5x86, K5, K6, K7 and K8. I'm not sure whether some 387 co-processor may have had that limitation, as I never really worked with any 387 co-processors.

    Intel also still supports 80-bit precision, while I believe 80-bit was deprecated on the AMD a long time ago, back when Cyrix was still in the game.
    That is also definitely incorrect. AMD supports all x87 formats: 32, 64 and 80 bits.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  12. #57
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    I'm basing the half-speed claim about AMD's FPU on their own data, which shows that it takes twice as many clock cycles as the Intel. So unless AMD is misrepresenting the performance of their chips ...

    Maybe the data AMD is providing to customers is faulty. Whether the chips underperform or the data is faulty, either way I would choose to do business with Intel.

    In either case, I think Intel may be going the AMD route of integrating the x87 and SSE units. They both take up a lot of real estate on the die, and writing code that keeps both units busy in a meaningful and productive way is rarely feasible. Direct SSE-to-x87 register-to-register moves would help, but the real bottleneck in modern systems is the memory bandwidth itself.

    I routinely do calculations on multiple sets that have a million or more members. A given set might be 8MB, which needs to be processed against another 8MB set, so my data doesn't fit into the L2 cache, which means my throughput is determined more by the 800MHz memory than by the processing speed of the core itself.
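
    Just to illustrate the kind of loop I mean, a streaming dot product over large arrays with software prefetch looks something like this (a hypothetical helper using SSE2 intrinsics, not my actual class code; whether the prefetch distance helps at all depends on the memory system):

    Code:
    #include <cstddef>
    #include <xmmintrin.h>  // _mm_prefetch
    #include <emmintrin.h>  // SSE2 double intrinsics
    
    // assumes in and w are 16-byte aligned and n is a multiple of 2
    double dot_large(const double* in, const double* w, std::size_t n)
    {
        __m128d acc = _mm_setzero_pd();
        for (std::size_t i = 0; i < n; i += 2) {
            // hint the upcoming cache lines in ahead of use; NTA keeps the
            // streamed data from displacing everything else in the cache
            _mm_prefetch((const char*)(in + i) + 512, _MM_HINT_NTA);
            _mm_prefetch((const char*)(w + i) + 512, _MM_HINT_NTA);
            __m128d a = _mm_load_pd(in + i);
            __m128d b = _mm_load_pd(w + i);
            acc = _mm_add_pd(acc, _mm_mul_pd(a, b));
        }
        double lanes[2];
        _mm_storeu_pd(lanes, acc);      // spill the two partial sums
        return lanes[0] + lanes[1];     // horizontal add of the lanes
    }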

  13. #58
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by abachler View Post
    I'm basing the half-speed claim about AMD's FPU on their own data, which shows that it takes twice as many clock cycles as the Intel. So unless AMD is misrepresenting the performance of their chips ...
    Is this for SSE or x87 FPU operations? SSE operations are slower than on the more recent Intel products because AMD has only one 64-bit ADD and one 64-bit MUL unit, where Intel has a 128-bit ADD and a 128-bit MUL unit.

    I don't believe the data itself is flawed, although I have never made any effort to verify it [it wasn't part of my job to do that].

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  14. #59
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    It's probably based on SSE; the info I got wasn't that specific, only stating 'floating point'. Although most developers who need to process large arrays of data are now moving over to SSE, so the slower performance on the AMD would impact its value for those calculations. How much of an impact it would have depends on the actual calculations: for short vectors it may not affect throughput much, but for large vectors, where out-of-order execution really pays off for tight sequences of code, it would have a larger impact, I would think.

  15. #60
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Actually, most developers who need to process small amounts of data are going over to SSE, too. GCC generates SSE code by default for x64 targets. The reason is that the x87 execution model was always broken by design, and compiler writers detested the thing.

    It's true that the Core 2 is the first CPU with 128-bit SSE execution width, and thus able to do an SSE operation in one step, whereas all previous CPUs, including AMD's, process it in two 64-bit halves, thus taking twice as many cycles.
    What is not true is that AMD's x87 unit is slower than Intel's.
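
    To illustrate the default: the plain scalar loop below compiles to SSE2 (mulsd/addsd) on an x86-64 target with no special flags, while on 32-bit x86 you have to ask for it with something like g++ -O2 -msse2 -mfpmath=sse (a hypothetical example, just a sketch):

    Code:
    // scalar code; the codegen (x87 vs SSE2) depends on target and flags
    double dot(const double* a, const double* b, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; ++i)
            sum += a[i] * b[i];   // mulsd/addsd under SSE2 codegen, fmul/fadd under x87
        return sum;
    }
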
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law
