Thread: Odd assembly problem

  1. #1
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195

    Odd assembly problem

    This code

    Code:
    int main(int argc, char* argv[]){
     
        double Input = 1.000;
        DWORD temp;
        Start = GetTickCount();
        Stop = Start + 10000;
        temp = 0;
        while(GetTickCount() < Stop){
            __asm fld   Input
            __asm fld1
            __asm fpatan
            __asm fsin
            __asm fst   Input
     
     
            temp++;
     
            }
        printf("%d\n" , (temp/10000));
     
     
     
        return 0;
        }
    Runs about 4-5 slower than this code

    Code:
    int main(int argc, char* argv[]){
    
        double Input = 1.000;
        DWORD temp;
        Start = GetTickCount();
        Stop = Start + 10000;
        temp = 0;
        while(GetTickCount() < Stop){
            Input = sin(atan(Input));            
    
            temp++;
            
            }
        printf("%d\n" , (temp/10000));
        
        
        
        return 0;
        }
    SHouldnt the assembly run faster?

  2. #2
    Registered User
    Join Date
    Oct 2001
    Posts
    2,934
    You could print out the assembly generated by the second version to compare, but wouldn't this instruction:
    Code:
            __asm fst   Input
    be optimised away by the compiler? Assuming fst is some sort of store instruction.

  3. #3
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    Yes, look at the generated asm for the C code.

    > Input = sin(atan(Input));
    The result is never used, so the compiler optimiser could just eliminate the entire statement.

    The optimiser will leave any __asm well alone.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  4. #4
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Having written similar things, I agree with Salem, the reason the assembler code is slower is simply that the compiler optimizes out the unused calculations all together. Add a print of the Input at the end of the function, and you should get a different result.

    I expect that there will be little difference once you do that - but perhaps the compiler will be more clever with ordering/forms of the instructions - although I'm not sure about that - compilers often have detailed understanding of FPU optimization that can be found in the public optimization guides from AMD and Intel. But I can't see anything obvious in your code to improve.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  5. #5
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Here's another thing: once there's assembly in your code, the optimizer has to be far more conservative about what it does.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  6. #6
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    It seems the compiler is generaing this code for the loop

    Code:
    0040103E   fld         qword ptr [esp+8]
    00401042   inc         edi
    00401043   fld1
    00401045   fpatan
    00401047   fsin
    00401049   fstp        qword ptr [esp+8]
    0040104D   call        esi
    0040104F   cmp         eax,dword ptr ds:[408FE0h]
    00401055   jb          0040103E
    whereas the inline assembly compiles to this

    Code:
    00401039   fld         qword ptr [ebp-8]
    0040103C   fld1
    0040103E   fpatan
    00401040   fsin
    00401042   fst         qword ptr [ebp-8]
    00401045   inc         edi
    00401046   call        esi
    00401048   cmp         eax,dword ptr ds:[408FE0h]
    0040104E   jb          00401039
    omfg, now that I look at it I see the problem. FPU stack over-run. the compiler was correctly using fstp to pop the value from st(0) to clear the stack, whereas I used the fst, which leaves the value in st(0), which bogs the FPU down with FPU exception after 16 ops.

    My code now runs slightly faster than the compiler generated code, over a 100 second run of each routine, the compilers code gave me 5316 ops/msec, my assembly gave 5351 ops/msec thats a 0.6&#37; increase in speed for this one operation woot.
    Last edited by abachler; 11-07-2007 at 08:05 AM.

  7. #7
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Yes, and with the inc edi between the fld and fld1, the MS generated code should be marginally faster [probably not a lot tho'], because it overlaps some of the float instructins.

    Of course, teh GetTicksCount() call will probably take more time than your entire loop, I suspect. Running a good number of iterations and measuring the time would probably be a better way to measure.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  8. #8
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Oh i ran a good number of iterations, Like I said, I ran the loop for 100 seconds ( 10 seconds in the originally posted code). I think the total run was something like 50 million iterations, I used a specific time length and counted iterations, rather than run specific number of iterations than calculate time, becuase the RTC isnt that acurate, but a counter definately keeps track of how many times its been ++ed
    Last edited by abachler; 11-07-2007 at 08:15 AM.

  9. #9
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by abachler View Post
    Oh i ran a good number of iterations, Like I said, I ran the loop for 100 seconds ( 10 seconds in the originally posted code). I think the total run was something like 50 million iterations.
    That's not what I meant: in a single iteration, you are executing 6 fpu instructions, then calling GetTickCount() which is a kernel system call. The Kernel system call will probably take about the same number of clock-cycles as the FPU instructions (or more - I don't know quite how fast/slow the fpatan and fsin really are).

    At least, do an inner loop that does 100 or 1000 loops per call to GetTickCount().

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  10. #10
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    thats a 0.6&#37; increase in speed
    In my (admittedly limited) experience, that falls under "statistical anomaly".
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  11. #11
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    I agree that the 0.6% difference is possibly statistical noise. However:
    On a 2GHz Athlon64, this loop should take about 120 ns [240 clocks @ 2GHz -> 120ns]. This is under "ideal circumstances" - but I don't think it's very far from that. Of course, to optimize this particular loop, one could move the fld input, fstp out of the loop, as they are only needed at the first and last iteration, so you could do them "out of loop". [which would save 6 theortical clock-cycles, and probably 0 practical clock-cycles, as there is register forwarding and such in the processor architecture, which allows the store->load to be "delay-less"].

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  12. #12
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Actually, if I run the loop without any fpu or maths, I get a result of about 250000, so I woudl say that the GetTickCount() is having a minimal impact on performance. In either case, the results are valid since both programs execute the exact same overhead functions. 0.6% is better than nothing, escpecially considering this is only a small test.

  13. #13
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    > My code now runs slightly faster than the compiler generated code, over a 100 second run of each routine
    But in that 100 seconds, how much other work is your desktop machine performing in the background? It wouldn't take much to account for the minute difference you see.

    You could look at QueryPerformanceCounter() for a very precise clock.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  14. #14
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    That's OK then.

    Code:
            __asm fld   Input	; 4
            __asm fld1		; 4
            __asm fpatan		; 136
            __asm fsin		; 93
            __asm fst   Input	; 2
    ;;; 240 clocks * 0.5ns = 120 ns -> 8.3 loops per microsecond/ 8333 loops per millisecond.
    Here's a summary of the time in clockcycles each instruction takes according to the Athlon64 Optimization guide. Those instructions are about as long as you can get in x87 assembler.

    If you are not particulary needing precise results, you may find that doing the math for sin/atan yourself will be quicker - maybe a polynomial of 4-6 steps will be sufficient to get the precision you need.

    fdiv takes about 20 clocks.
    fmul takes 11 clocks.
    fadd/fsub 6 clocks.
    So you could easily do a few of these in less time than the fsin/fpatan time. Using fmul to do divide by pre-calculating the inverse is even better, and I think that can be done for most of the sin/atan calculations.

    --
    Mats

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  15. #15
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    Of course, if you're looking for performance, you should start with a profile run to determine the real hot spots, then review your algorithm and data structure choices in the light of that.

    Tinkering with the detail like this will only net you the 1&#37;. If you want the 100% or 1000% improvements, you need to look at the bigger picture.

    Besides, a good optimising compiler will do a lot of this stuff for you, as evidenced by the movement of "inc edi" to take advantage of parallel execution between the integer and FP units.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Bin packing problem....
    By 81N4RY_DR460N in forum C++ Programming
    Replies: 0
    Last Post: 08-01-2005, 05:20 AM
  2. Words and lines count problem
    By emo in forum C Programming
    Replies: 1
    Last Post: 07-12-2005, 03:36 PM
  3. half ADT (nested struct) problem...
    By CyC|OpS in forum C Programming
    Replies: 1
    Last Post: 10-26-2002, 08:37 AM
  4. binary tree problem - help needed
    By sanju in forum C Programming
    Replies: 4
    Last Post: 10-16-2002, 05:18 AM