Thread: Odd assembly problem

  1. #16
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    except that the results seem to indicate that the MS generated code with the inc in the loop runs slower, even though it 'should' run faster. Im just testing right now, i will probably write the final code in SSE, since speed and accuracy are both important.
    Last edited by abachler; 11-07-2007 at 03:00 PM.

  2. #17
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    You do have compiler optimisations enabled, right? With GCC, there is a huge difference even between "gcc" and "gcc -O1".
    dwk

    Seek and ye shall find. quaere et invenies.

    "Simplicity does not precede complexity, but follows it." -- Alan Perlis
    "Testing can only prove the presence of bugs, not their absence." -- Edsger Dijkstra
    "The only real mistake is the one from which we learn nothing." -- John Powell


    Other boards: DaniWeb, TPS
    Unofficial Wiki FAQ: cpwiki.sf.net

    My website: http://dwks.theprogrammingsite.com/
    Projects: codeform, xuni, atlantis, nort, etc.

  3. #18
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Yes, and while the GetTickCount() code is non-trivial, it is consistent, so the results woudl still be valid for a matrix solution to find the relative speeds of the routines. Now if i coudl only remember how you do those matrix solutions, heh...

  4. #19
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    I retouched the assembly to perform an optimized loop

    Code:
    int main(int argc, char* argv[]){
    
        double Input = 1.000;
        __declspec(align(16))  double SIMD_Input[] = {1.0 , 1.0};
        DWORD temp;
        Input = 1.0;
        Start = GetTickCount();
        Stop = Start + 100000;
        temp = 0;
        while(temp < 16777216){
           
            
            Input = sin(atan(Input));
          
            temp++;
            
            }
        printf("%d\t%f\n" , (GetTickCount() - Start) , Input);
        
        Input = 1.0;
        Start = GetTickCount();
        Stop = Start + 100000;
            
            __asm push  ecx
            __asm mov   ecx , 0x01000000
    begin:  __asm fld   Input
            __asm fld1
            __asm fpatan
            __asm fsin
            __asm fstp  Input
            __asm loop  begin
            __asm pop   ecx
     
        printf("%d\t%f\n" , (GetTickCount() - Start) , Input);
        
        
        
        
        return 0;
        }
    the new results are:

    4015 ms for the C/C++
    3094 ms for the assembly

    anyone care to suggest an optimized loop in C/C++ if you think thats the problem?

  5. #20
    Registered User
    Join Date
    Oct 2001
    Posts
    2,934
    To be honest, the C++ loop and asm loop don't even look the same. The C++ loop uses some magic number (16777216) for the number of iterations. I can't even tell what the asm loop uses for the number of iterations.

  6. #21
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    The magic number is the same as the hex number in the ASM loop, so that's fine.

    Try as I might, I can't get GCC to generate a loop instruction for the C++ version, nor get it to use the math intrinsics. Quite frustrating.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  7. #22
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by CornedBee View Post
    The magic number is the same as the hex number in the ASM loop, so that's fine.

    Try as I might, I can't get GCC to generate a loop instruction for the C++ version, nor get it to use the math intrinsics. Quite frustrating.
    Try adding "-ffast-math" to your compile command-line.

    @abachler:
    Using the same decimal constant would reduce the number of "queries" on the "the code doesn't look like it's doing the same thing.

    Also, you don't need to push and pop ecx - the compiler will understand register usage in __asm statements.

    I prefer to use
    Code:
    __asm { 
       ... multiple 
       ... lines 
       ... of
       ... code
    }
    If you want to improve a bit [by "cheating", don't do the fstp/fld inside the loop - just leave the result of the previous loop in st(0).

    Your variable "Stop" is never used, right?

    Have you checked the assembler output from the while-loop in C? Have you tried using a
    Code:
    for(temp=0; temp < 167777216; temp++)
    instead? Usually, the compiler produces better code for for-loops than while-loops when it comes to constant limit expressions - it may of course be that it makes no difference here.


    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  8. #23
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    -ffast-math indeed uses the intrinsics, unless I use -mfpmath=sse. Over a whole program, the advantage of SSE over x87 may well offset this loss.

    Another thing to note: According to some information I've found, software sine implementations using SSE math could well be faster than the CPU x87 sine instruction. Especially if you do it on multiple data, and especially if you then go on to do other FP operations on the data. (x87<->sse moves are costly).

    Also, 64-bit CPUs might show generally better results, because the x86-64 ABI specifies that floating point parameters are passed in SSE registers, while in the x86-32 ABI all floating point parameters must be passed on the stack. This leads to a severe slowdown when calling external trig implementations. I can actually confirm that from comparing the code generated for my Athlon64 and the Core1 I'm writing this on.

    Tricky business, optimizing.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  9. #24
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Seeing as SSE doesn't HAVE a sin/atan function, it's quite difficult to make it "use intrinsics" - it needs to copy the SSE register to x87 register and then use fsin to perform the actual fpu operation and then reverse that register to register move. It's slow and not very neat - and you don't really care if it's inlined or not - the overhead of a function call is in the fractions of a percent compared to the 150+ cycles that this function will take.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  10. #25
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Quote Originally Posted by swoopy View Post
    To be honest, the C++ loop and asm loop don't even look the same. The C++ loop uses some magic number (16777216) for the number of iterations. I can't even tell what the asm loop uses for the number of iterations.
    that would be
    Code:
            __asm mov   ecx , 0x01000000
     
    and
     
            __asm loop begin
    0x01000000 == 16777216

    Quote Originally Posted by CornedBee View Post
    The magic number is the same as the hex number in the ASM loop, so that's fine.



    Try as I might, I can't get GCC to generate a loop instruction for the C++ version, nor get it to use the math intrinsics. Quite frustrating.
    Im using VC 6.0 Enterprise w/service pack 5 and the processor pack installed

    You might have to create a custom EMIT
    E2 cb LOOP rel8 Decrement count; jump short if count ≠ 0.

    as for SSE not having trig functions, its stil so much faster that You can derive the functions form simpler methods faster than you could get them from FPU instructions.
    Last edited by abachler; 11-12-2007 at 10:01 PM.

  11. #26
    Registered User
    Join Date
    Oct 2001
    Posts
    2,934
    Quote Originally Posted by abachler View Post
    that would be
    Code:
            __asm mov   ecx , 0x01000000
     
    and
     
            __asm loop begin
    0x01000000 == 16777216
    Thanks. After I read Corned Bee's post it made sense.

  12. #27
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Have you looked into higher-level optimisations first?
    For example the trigonometric identity:
    Code:
    sin(arctan(x)) = x / sqrt(1 + x*x)
    According to: http://en.wikipedia.org/wiki/List_of...ric_identities

    Of course that may well be slower, but it depends on how it fits with the rest of the surrounding code. E.g. if you're by chance calculating that exact denominator for some other reason anyway, then it would have to be faster. There's also Carmack's fast rsqrtf function to speed up such things.
    There's more than one way to skin a cat!
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  13. #28
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Or perhaps you take the square of that thing anyway - then you can omit the sqrt completely.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  14. #29
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Now that you've spend a week on this micro-optimisation, how has this affected the overall result?

    For a start, have you run a profile of the program to find out where the real hot-spots are, and not just where you've guessed them to be?

    Or as others have started to hint, seek out alternative algorithms which do the same job in a more efficient manner.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  15. #30
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by iMalc View Post
    Have you looked into higher-level optimisations first?
    For example the trigonometric identity:
    Code:
    sin(arctan(x)) = x / sqrt(1 + x*x)
    According to: http://en.wikipedia.org/wiki/List_of...ric_identities

    Of course that may well be slower, but it depends on how it fits with the rest of the surrounding code. E.g. if you're by chance calculating that exact denominator for some other reason anyway, then it would have to be faster. There's also Carmack's fast rsqrtf function to speed up such things.
    There's more than one way to skin a cat!
    That should certainly execute faster than the 240 clocks that the original code takes.

    Code:
           fld   input      ;; 4
           fmul input     ;; 6    x * x
           fld1               ;; 4
           fadd              ;; 4   x * x + 1
           fsqrt              ;; 35
           fdivr  input    ;; 22
           fstp   input    ;; 2
                                ;; ==
                                ;; 77 clocks - so approximately 3 times faster than the original code.
    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Bin packing problem....
    By 81N4RY_DR460N in forum C++ Programming
    Replies: 0
    Last Post: 08-01-2005, 05:20 AM
  2. Words and lines count problem
    By emo in forum C Programming
    Replies: 1
    Last Post: 07-12-2005, 03:36 PM
  3. half ADT (nested struct) problem...
    By CyC|OpS in forum C Programming
    Replies: 1
    Last Post: 10-26-2002, 08:37 AM
  4. binary tree problem - help needed
    By sanju in forum C Programming
    Replies: 4
    Last Post: 10-16-2002, 05:18 AM