Thread: Odd assembly problem

  1. #31
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Ah, now this is a nice inner loop.
    Code:
    .L2:
            movsd   %xmm1, %xmm0
            decl    %eax
            mulsd   %xmm1, %xmm0
            addsd   %xmm2, %xmm0
            sqrtsd  %xmm0, %xmm0
            divsd   %xmm0, %xmm1
            jne     .L2
    g++ -S -o asm.s -O3 -ffast-math -mfpmath=sse asm.cpp

    Note that xmm2 simply holds the constant 1.0, whereas xmm1 is the home register of x.
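For reference, a minimal C++ reconstruction of what asm.cpp's inner loop plausibly looks like (my guess, not the actual source; the identity sin(atan(x)) == x/sqrt(x*x + 1) is what reduces the body to a multiply, an add, a sqrt and a divide):

```cpp
#include <cmath>

// Hypothetical reconstruction of asm.cpp's hot loop. With -ffast-math
// and -mfpmath=sse, each iteration maps to the mulsd/addsd/sqrtsd/divsd
// sequence shown above.
double iterate(double x, int n) {
    for (int i = 0; i < n; ++i)
        x = x / std::sqrt(x * x + 1.0);  // == sin(atan(x))
    return x;
}
```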
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  2. #32
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by CornedBee View Post
    Ah, now this is a nice inner loop.
    Code:
    .L2:
            movsd   %xmm1, %xmm0   ;; 2
            decl    %eax
            mulsd   %xmm1, %xmm0     ;; 4 
            addsd   %xmm2, %xmm0    ;; 4
            sqrtsd  %xmm0, %xmm0     ;; 27
            divsd   %xmm0, %xmm1      ;; 20
            jne     .L2
    g++ -S -o asm.s -O3 -ffast-math -mfpmath=sse asm.cpp

    Note that xmm2 simply holds the constant 1.0, whereas xmm1 is the home register of x.
    That makes 57 cycles (not counting the "looping" instructions).

    Now, with "packed double":
    Code:
    .L2:
            movapd   %xmm1, %xmm0   ;; 2    [2]
            sub      $2, %eax
            mulpd   %xmm1, %xmm0     ;; 5    [4]
            addpd   %xmm2, %xmm0    ;; 5    [4]
            sqrtpd  %xmm0, %xmm0     ;; 51  [27]
            divpd   %xmm0, %xmm1      ;; 37  [20]
            jne     .L2
    Exactly 100 cycles (again, not counting the "loop" part) for two elements - a saving of 7 cycles per element, or about 12&#37;. Not much, because the sqrt and divide aren't done in parallel.

    Numbers in brackets are for "new generation AMD" - the quad-core processors. Now we are talking: 57 clocks for twice the processing [the same speed as the original scalar code].

    If single precision is enough, we could do it even faster: first of all, we could again double the number of calculations (four floats per register), but also change the final calculation to use RSQRTPS - which calculates 1/sqrt(x) in 5 cycles instead of 27 - and a MULPS instead of the DIVPS at the end, which is 4 cycles instead of 20, making the total count 19 cycles for the loop instead of 57. But the inverted square root is only available in single precision [and probably not quite as precise].


    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  3. #33
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    How can you double (or even quadruple) the instructions if one loop iteration depends on the result of the previous?
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  4. #34
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by CornedBee View Post
    How can you double (or even quadruple) the instructions if one loop iteration depends on the result of the previous?
    True, you can't. However, I suspect the actual code [as opposed to this benchmark code] isn't iteratively calculating x = sin(arctan(x)) 16 million times, but rather calculating y = sin(arctan(x)) for a great number of independent x (and y) values - in which case you'd gain a fair bit by doubling or quadrupling.
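The independent-elements case can be sketched like this (my sketch, not the thread's code): each output depends only on its own input, so there is no loop-carried dependency and the work can be done two doubles (or four floats) per SSE register.

```cpp
#include <cmath>
#include <cstddef>

// One y per x, no dependency between iterations - a compiler (or
// hand-written packed SSE) is free to process several elements at once.
void sin_atan_all(const double* x, double* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = x[i] / std::sqrt(x[i] * x[i] + 1.0);  // == sin(atan(x[i]))
}
```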

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  5. #35
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Yes, the final code will of course be performing this calculation on vectors of data with several hundred thousand entries, but I need double precision; otherwise I'd use single and get twice as many (and faster) ops per instruction.

    Code:
        double* pInput; // assigned a value previous to this code
        double* pWeight;
        DWORD dwCount = 1048576;
        __m128 Zero[] = {0.0 , 0.0};
    
        __asm push      eax
        __asm xor       eax , eax
        __asm movapd    xmm1 , Zero
    theloop:
        __asm movapd    xmm0 , pInput 
        __asm add       pInput , 0x00000008
        __asm mulpd     xmm0 , pWeight
        __asm add       pWeight , 0x00000008
        __asm addpd     xmm1 , xmm0
        __asm add       eax , 0x00000002
        __asm cmp       eax , dwCount
        __asm jl        theloop
        __asm movapd    tempout , xmm1
        __asm fld       tempout
        tempout++;
        __asm fld       tempout
        __asm fadd
        __asm fstp      st(0)
        __asm fld1            
        __asm fpatan    
        __asm fsin
        __asm fstp      Output
        __asm pop       eax
    So as you can see, the time taken to perform the sin and tan is trivial compared to the overall calculation. Can't really use the sqrt method, as it only holds in the positive domain, and there will be results in the negative domain. For those of you that haven't picked up on what the above code does, it's an SSE implementation of a simple perceptron, using sin(atan(x)) as the sigmoid.
    Last edited by abachler; 11-18-2007 at 11:14 PM.

  6. #36
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Code:
     __m128 Zero[] = {0.0 , 0.0};
    Be aware that VC++ doesn't automatically align this on 16-byte boundaries, as it should. At least, I've seen the issue come up many times.
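One way to guard against that (a sketch of mine, using the C++11 `alignas` spelling as a portable stand-in for MSVC's `__declspec(align(16))`): MOVAPD/MOVAPS fault on a memory operand that isn't 16-byte aligned, so the buffer must carry explicit alignment, and a cheap runtime check catches mistakes early.

```cpp
#include <cstdint>

// Explicitly 16-byte aligned, as MOVAPD requires.
alignas(16) double Zero[2] = {0.0, 0.0};

// Debug helper: true iff p sits on a 16-byte boundary.
bool is_16_byte_aligned(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 0xF) == 0;
}
```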
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  7. #37
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by abachler View Post
    Yes, the final code will of course be performing this calculation on vectors of data with several hundred thousand entries, but I need double precision; otherwise I'd use single and get twice as many (and faster) ops per instruction.

    Code:
        double* pInput; // assigned a value previous to this code
        double* pWeight;
        DWORD dwCount = 1048576;
        __m128 Zero[] = {0.0 , 0.0};
    add "static const" to make sure this isn't being initialized every call - not that it's REALLY much time, but it saves some, with no ill effects.
    
        __asm push      eax
        __asm xor       eax , eax
        __asm movapd    xmm1 , Zero
    theloop:
        __asm movapd    xmm0 , pInput 
        __asm add       pInput , 0x00000008
    Isn't an xmm register 16 bytes??
        __asm mulpd     xmm0 , pWeight
        __asm add       pWeight , 0x00000008
        __asm addpd     xmm1 , xmm0
        __asm add       eax , 0x00000002
        __asm cmp       eax , dwCount
    Start eax with dwCount and count it down, instead of a two-step add/compare.
        __asm jl        theloop
        __asm movapd    tempout , xmm1
        __asm fld       tempout
        tempout++;
        __asm fld       tempout
    Skip tempout++ and use tempout+4 instead?
        __asm fadd
        __asm fstp      st(0)
    Unnecessary operation??
        __asm fld1            
        __asm fpatan    
        __asm fsin
        __asm fstp      Output
        __asm pop       eax
    So as you can see, the time taken to perform the sin and tan is trivial compared to the overall calculation. Can't really use the sqrt method, as it only holds in the positive domain, and there will be results in the negative domain. For those of you that haven't picked up on what the above code does, it's an SSE implementation of a simple perceptron, using sin(atan(x)) as the sigmoid.
    My comments are interleaved in the quoted code above [originally in red].

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  8. #38
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Assuming the proportion of positive and negative inputs to your function isn't mainly negative, I think the gain from using sqrt() for positive numbers is big enough that it's worth doing an "if (input >= 0) ... else ..." for the two options. [Unless we can do "temp = sgn(input); input = abs(input); <calculate using sqrt>; output *= temp;" to "fix up" the sign - I don't know offhand whether -sin(atan(x)) == sin(atan(-x)) or not].

    Both options should give a noticeable benefit despite the overheads.
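On the open question: sin and atan are both odd functions, so their composition is odd too, i.e. sin(atan(-x)) == -sin(atan(x)), and the sign fix-up is valid. A small sketch (mine, with hypothetical function names) checking the identity and the proposed fix-up:

```cpp
#include <cmath>

// Reference: the straightforward transcendental form.
double sin_atan(double x) { return std::sin(std::atan(x)); }

// The fix-up matsp proposes: compute with the sqrt form on |x|,
// then restore the sign afterwards.
double sin_atan_via_abs(double x) {
    double s = (x < 0.0) ? -1.0 : 1.0;                    // sgn, sgn(0) = 1
    return s * (std::fabs(x) / std::sqrt(x * x + 1.0));   // sqrt form, x >= 0
}
```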

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  9. #39
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    add "static const" to make sure this isn't being initialized every call


    Actually I use __declspec(align(16)) static - I just snipped a lot of the function to show only the part of concern.

    Isn't an xmm register 16 bytes??
    Yes it is, but it's exactly the same as 2 doubles right next to each other, or 4 floats, depending on which instructions you use. This is what makes it so fast at handling vectors; there is no conversion process needed.

    Start eax with dwcount, and count it down instead of a two-step add/compare.
    In which case it just becomes a two-step subtract and compare, and it causes cache issues with the read-ahead logic. Remember, we are processing more than one vector member at a time, so a simple LOOP won't work here. I will, however, optimize the adds by putting 0x00000002 into ebx and doing a register-to-register add, 0x00000008 into ecx and dwCount into edx.

    Skip tempout++ and used tempout+4 instead?
    Yeah, I missed that one - forgot that the r/m field has several common increments encoded in it. That will also get rid of the sloppy C++ in the middle of an asm block.


    __asm fadd
    __asm fstp st(0)
    Unnecessary operation??
    For some reason my compiler refuses to accept the faddp opcode without complaining, so I added this as a quick fix until I figure out what the problem is; it was late when I wrote that code.

    As for the proportion of positive to negative inputs and weights, it's completely unknown; it could be all negative or all positive, or anywhere in between.
    Last edited by abachler; 11-19-2007 at 07:38 AM.

  10. #40
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by abachler View Post
    >Start eax with dwcount, and count it down instead of a two-step add/compare.
    In which case it just becomes a two-step subtract and compare, and it causes cache issues with the read-ahead logic. Remember, we are processing more than one vector member at a time, so a simple LOOP won't work here. I will, however, optimize the adds by putting 0x00000002 into ebx and doing a register-to-register add, 0x00000008 into ecx and dwCount into edx.
    Eh, no, that's not any better. There is no performance difference between a small constant and a register in adding - it's better to let the compiler use those registers for addressing data, etc.

    But dwCount _IS_ the number of items to do, right? So if you count down to zero/negative, you don't need a compare at all - just a "jump not zero" or "jump not sign" - which saves one instruction. Also, try to shuffle the subtract up a few instructions, so that it has time to finish before the branch is taken.

    You can also improve the speed by unrolling the loop [calculating more than one set of values each time].
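The count-down-without-compare form, written as C++ the compiler can map onto sub/jnz (a sketch of mine, with a hypothetical function name; the decrement itself sets the flags, so no separate cmp is emitted):

```cpp
#include <cstddef>

// Two elements per pass, counting n down to zero. n is assumed even
// and nonzero, matching the packed-double loop in the thread.
double dot2(const double* a, const double* b, std::size_t n) {
    double acc = 0.0;
    while (n) {                            // branch on the decremented count
        acc += a[0] * b[0] + a[1] * b[1];  // one "packed" step
        a += 2;
        b += 2;
        n -= 2;
    }
    return acc;
}
```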

    My comment on xmm registers being 16 bytes:
    Are you actually intending to calculate
    Code:
    { pInput[0], pInput[1] }
    { pInput[1], pInput[2] }
    { pInput[2], pInput[3] }
    { pInput[3], pInput[4] }
    The braces indicate the pair being calculated each iteration.

    I would have thought you wanted:
    Code:
    { pInput[0], pInput[1] }
    { pInput[2], pInput[3] }
    { pInput[4], pInput[5] }
    { pInput[6], pInput[7] }
    Although I expect that MOVAPD would actually fault if you add only 8, because the next read would then be unaligned.

    And the same applies to pWeight, obviously.

    Oh, and you still don't need to push and pop eax - the compiler will figure that out itself.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  11. #41
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Quote Originally Posted by matsp View Post
    Eh, no, that's not any better. There is no performance difference betweed a small constant and a register in adding - it's better to let the compiler use those registers for addresssing data, etc.

    But dwCount _IS_ the number of items to do, right? So if you cound down to zero/negative, you don't need a compare at all - just a "jump not zero" or "jump not signed" - saves one instruction. Also, try to shuffle this up a few instructions, so that the subtract has time to finish before the branch is taken.

    You can also improve the speed by unrolling the loop [calculating more than one set of values each time].

    My comment on xmm registers being 16 bytes:
    Are you actually intending to calculate
    Code:
    { pInput[0], pInput[1] }
    { pInput[1], pInput[2] }
    { pInput[2], pInput[3] }
    { pInput[3], pInput[4] }
    The braces indicate the pair being calculated each iteration.

    I would have thought you wanted:
    Code:
    { pInput[0], pInput[1] }
    { pInput[2], pInput[3] }
    { pInput[4], pInput[5] }
    { pInput[6], pInput[7] }
    Although I expect that MOVAPD would actually fail if you add 8, because the next read is now unaligned.

    And the same applies to pWeight, obviously.

    Oh, and you still don't need to push and pop eax - the compiler will figure that out itself.

    --
    Mats
    dwCount can be an odd number, and while the vector will always contain the proper even number of members, the actual dwCount must be preserved elsewhere in the program. You cannot count down because, for one, an odd dwCount would cause eax to underrun to 0xFFFFFFFF and cause a serious memory problem - since eax could never reach zero, it would keep going until it produced a protection fault. And two, there would still be a compare, since LOOP uses ECX and decrements it by one on its own, so there would still be two instructions to decrement it by 2. Not to mention that INC/DEC are deprecated in x64, which would make this a non-portable solution. The primary issue with putting the constants into registers isn't to speed up the operation, but to reduce memory bandwidth usage, which in large vector calculations is the REAL bottleneck.

    And I do need to push and pop eax, because these are individual __asm lines, not a single __asm {} block. I do this because it improves performance: an __asm block preserves the entire machine state, whereas I can preserve just the part I'm going to modify.

    You are correct about the 8 - it should be 0x00000010, not 0x00000008. Again, sorry, it was late when I wrote that code.
    Last edited by abachler; 11-19-2007 at 07:56 AM.

  12. #42
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by abachler View Post
    dwCount can be an odd number, and while the vector will always contain the proper even number of members, the actual dwCount must be preserved elsewhere in the program. You cannot count down because, for one, an odd dwCount would cause eax to underrun to 0xFFFFFFFF and cause a serious memory problem - since eax could never reach zero, it would keep going until it produced a protection fault. And two, there would still be a compare, since LOOP uses ECX and decrements it by one on its own, so there would still be two instructions to decrement it by 2. Not to mention that INC/DEC are deprecated in x64, which would make this a non-portable solution. The primary issue with putting the constants into registers isn't to speed up the operation, but to reduce memory bandwidth usage, which in large vector calculations is the REAL bottleneck.
    I'm not suggesting you change the content or meaning of dwCount itself. But if you decrement instead of increment, you do not need a compare instruction - the processor sets the condition flags in the decrement. If you have an odd value, you would have to "round up" before you start the loop [assuming it's valid to use the final, uninitialized element - otherwise your code is broken anyway], so that you always have an even value. If you do that, you can finish on zero, so you don't have to compare - and saving one instruction that another instruction depends on is definitely a win, even if the individual instruction makes little difference.

    You do not use any memory bandwidth by using register instead of constant for add or subtract - aside from POSSIBLY one byte extra code-space, but the code is in cache once it's been read anyways, so it does not compete with your data fetches.

    Also, to clarify your misunderstanding: the INC/DEC instructions are still available in x86_64, but the short single-byte opcode form is no longer available [there are two ways to encode exactly the same instruction in the x86 instruction set - in the 64-bit version, one of them was removed and reused as a prefix, as there was no other long run of single-byte opcodes that could be repurposed, and double-byte prefixes quickly become a mess in the instruction decoders and code generators]. The assembler/compiler can still generate INC/DEC instructions; they just start with 0xFF... [if my memory serves right].

    And I do need to push and pop eax because these are __asm lines, not a single __asm {} block. I do this because it improves performance, since an __asm block preserves the entire machine state, whereas I can just preserve the part I'm going to modify.
    Ok, fair enough - although "preserves the entire state" is not correct: it saves what it has to.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  13. #43
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Even if the const is in the cache, it's still 4 bytes (it's a DWORD const) of memory bandwidth that gets used, plus it competes with other instructions in the pipeline for access to the cache, whereas a register-to-register add only competes for the ALU. It could use the ADD r/m32, imm8 opcode, but that's still 3 bytes versus 2 for the register-to-register form. Plus it would be easier to realign other ops if the offset was 2 rather than 3. True, I could use a NOP to realign, but I'd rather use some opcode that improves performance by doing actual work. Both ways are valid; it depends on the specific tradeoffs you want to make. In this case I think I'm going to rewrite it once again to improve performance even further; I'll post the code later.
    Last edited by abachler; 11-19-2007 at 06:52 PM.

  14. #44
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Code:
    // this code is copyright 2007 Anthony Q. Bachler
    // you may use this code free of charge for non-commercial purposes only
    // for commercial use contact the author.  This code is not released under the GPL
    
    void CNeuron::Process(double* Input , double* Output){
        double Scratch = 0.0;
        __declspec(align(16)) __m128 Zero[] = {0.0 , 0.0};
        __declspec(align(16)) __m128 tempout[] = {0.0 , 0.0};
        double* pInput;
        double* pWeight;
        DWORD dwCount;
    
        pInput = Input;
        pWeight = this->Weights;
        dwCount = this->dwNumberOfInputs;
        dwCount = ((dwCount+7)/8);  // calculate total iterations
        __asm push      eax
        __asm push      edx
        __asm push      ecx
        __asm xor       eax , eax
        __asm mov       edx , 0x00000040
        __asm mov       ecx , dwCount
        __asm movapd    xmm1 , Zero
        __asm movapd    xmm3 , Zero
        __asm movapd    xmm5 , Zero
        __asm movapd    xmm7 , Zero
    theloop:
        __asm movapd    xmm0 , pInput
        __asm mulpd     xmm0 , pWeight
        __asm addpd     xmm1 , xmm0
        __asm movapd    xmm2 , pInput+16
        __asm mulpd     xmm2 , pWeight+16
        __asm addpd     xmm3 , xmm2
        __asm movapd    xmm4 , pInput+32
        __asm mulpd     xmm4 , pWeight+32
        __asm addpd     xmm5 , xmm4
        __asm movapd    xmm6 , pInput+48
        __asm add       pInput , edx
        __asm mulpd     xmm6 , pWeight+48
        __asm add       pWeight , edx
        __asm addpd     xmm7 , xmm6
        __asm loop      theloop
        __asm addpd     xmm1 , xmm3
        __asm addpd     xmm5 , xmm7
        __asm addpd     xmm1 , xmm5
        __asm movapd    tempout , xmm1
        __asm fld       tempout
        __asm fld       tempout+8
        __asm fadd
        __asm fstp      st(0)
        __asm fld1            
        __asm fpatan    
        __asm fsin
        __asm fstp      Output
        __asm pop       ecx
        __asm pop       edx
        __asm pop       eax
    /*
        for(DWORD temp=0;temp<this->dwNumberOfInputs;temp++){
            Scratch += Input[temp] * this->Weights[temp];
            }
        Output[0] = sin(atan(Scratch));
    //*/
        return;
        }
    Last edited by abachler; 11-19-2007 at 07:48 PM.
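For comparison, the same four-accumulator dot product can be written with SSE2 intrinsics instead of __asm (a sketch of mine, not abachler's code; the function name is hypothetical). It keeps the unrolled structure above, also compiles for x64 where MSVC drops inline assembly, and assumes 16-byte-aligned pointers and a count that is a multiple of 8, like the thread's code does.

```cpp
#include <emmintrin.h>
#include <cmath>
#include <cstddef>

// Four independent accumulators hide the addpd latency, exactly as in
// the __asm version; the final sum is folded and fed to the sigmoid.
double ProcessSketch(const double* in, const double* w, std::size_t n) {
    __m128d a0 = _mm_setzero_pd(), a1 = a0, a2 = a0, a3 = a0;
    for (std::size_t i = 0; i < n; i += 8) {
        a0 = _mm_add_pd(a0, _mm_mul_pd(_mm_load_pd(in + i),     _mm_load_pd(w + i)));
        a1 = _mm_add_pd(a1, _mm_mul_pd(_mm_load_pd(in + i + 2), _mm_load_pd(w + i + 2)));
        a2 = _mm_add_pd(a2, _mm_mul_pd(_mm_load_pd(in + i + 4), _mm_load_pd(w + i + 4)));
        a3 = _mm_add_pd(a3, _mm_mul_pd(_mm_load_pd(in + i + 6), _mm_load_pd(w + i + 6)));
    }
    __m128d s = _mm_add_pd(_mm_add_pd(a0, a1), _mm_add_pd(a2, a3));
    alignas(16) double t[2];
    _mm_store_pd(t, s);
    return std::sin(std::atan(t[0] + t[1]));  // the sigmoid
}
```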

  15. #45
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Have you disassembled the compiled code to see if pInput and pWeight are in registers? If they are not, you may want to consider forcing that in your assembler code.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.
