Thread: Assembly optimization?

  1. #16
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    For the outer loop (in debug):
    Code:
     Address   Line  Trace  Source                            Code Bytes      Total %  Timer samples
     0x41180c  73           for (int i = 0; i < max; i++)                     1.17     380
     0x411826  74               Array[i] = 12; //12345678;                    1.94     633
     0x411826               mov eax,[ebp-68h]                 8B 45 98        0.00     1
     0x411829               push eax                          50
     0x41182a               lea ecx,[ebp-34h]                 8D 4D CC        0.60     195
     0x41182d               call $-000006fch (0x100411131)    E8 FF F8 FF FF
     0x411832               mov [eax],0ch                     C6 00 0C        0.60     195
     0x411835               jmp $-20h (0x100411815)           EB DE           0.74     242
    
    1 file, 1 function, 2 lines, 6 instructions, Summary: 1013 samples, 6.95% of module samples, 3.11% of total samples
    For Array[], you have it, I posted it above.
    There are no samples for Realloc (which makes sense, since it's never called because I set the size before starting the loop).
    If I don't set the size, it takes about 400 ms longer, but there are still no samples for Realloc. I'm not sure whether I'm just using the profiler wrong in this case. I use CodeAnalyst, in case anyone needs to know.

    I found out why it wasn't getting symbols for operator [] in Release. Simple: the compiler inlined the code, saving 1000 ms. If I disable inlining, here are the samples for Release:
    Code:
     Address   Line  Trace  Source                                         Code Bytes  Total %  Timer samples
               49           CArrayImpl(T&)::operator [] (DWORD dwIndex)
     0x401070  50           {                                                          7.57     399
     0x401073  51               if (dwIndex >= m_dwSize)                               3.23     170
     0x401079  52                   Realloc(dwIndex - m_dwSize);
     0x401083  53               return m_pArray[dwIndex];                              4.82     254
     0x401089  54           }                                                          3.15     166
    
    1 file, 1 function, 6 lines, 0 instructions, Summary: 989 samples, 52.83% of module samples, 18.77% of total samples
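To make the profiled operator [] easier to picture, here is a minimal sketch of a grow-on-demand array with the same shape: one size check, an occasional reallocation, then the element access. The class name GrowArray and its members are hypothetical stand-ins, not the actual CArray code, and the doubling growth policy is an assumption.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>

// Hypothetical stand-in for the thread's CArray<char>; all names are illustrative.
class GrowArray {
public:
    GrowArray() : data_(nullptr), size_(0), capacity_(0) {}
    ~GrowArray() { std::free(data_); }
    GrowArray(const GrowArray&) = delete;
    GrowArray& operator=(const GrowArray&) = delete;

    // Setting the size up front keeps operator[] on its fast path in a hot loop.
    void resize(std::size_t n) {
        if (n > size_) grow_to(n);
    }

    // Same shape as the profiled code: a size check, then the access.
    char& operator[](std::size_t index) {
        if (index >= size_) grow_to(index + 1);
        return data_[index];
    }

    std::size_t size() const { return size_; }

private:
    // Only called with new_size > size_.
    void grow_to(std::size_t new_size) {
        if (new_size > capacity_) {
            std::size_t cap = capacity_ ? capacity_ : 16;
            while (cap < new_size) cap *= 2;  // assumed geometric growth
            void* p = std::realloc(data_, cap);
            if (!p) throw std::bad_alloc();
            data_ = static_cast<char*>(p);
            capacity_ = cap;
        }
        std::memset(data_ + size_, 0, new_size - size_);  // zero the new elements
        size_ = new_size;
    }

    char* data_;
    std::size_t size_;
    std::size_t capacity_;
};
```

With resize() called before the loop, the size check in operator [] is never true, which matches the observation that Realloc gets no samples when the size is set first.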
    Last edited by Elysia; 12-09-2007 at 04:15 PM.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  2. #17
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> There are no samples for Realloc...
    Well, with Realloc() not being called and Array[] being inlined, there's not much more you can do except add a "memset"-style method that uses a highly optimized version of memset() with SIMD instructions.

    Keeping your original for loop, you could try manually unrolling the loop.
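Both suggestions can be sketched roughly as follows. The function names fill_unrolled and fill_memset are hypothetical, the unroll factor of four is an arbitrary choice, and whether either beats the plain loop depends on the compiler and optimization settings.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: the original for loop, manually unrolled four times.
void fill_unrolled(char* dst, std::size_t n, char value) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = value;
        dst[i + 1] = value;
        dst[i + 2] = value;
        dst[i + 3] = value;
    }
    for (; i < n; ++i) {
        dst[i] = value;  // leftover 0 to 3 bytes
    }
}

// Hypothetical "memset-style" bulk fill that defers to the library routine,
// which may or may not use SIMD depending on the runtime library.
void fill_memset(char* dst, std::size_t n, char value) {
    std::memset(dst, static_cast<unsigned char>(value), n);
}
```

On a container, the second form would be exposed as a single bulk call instead of one operator [] call per element.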

    gg

  3. #18
    Chinese pâté foxman's Avatar
    Join Date
    Jul 2007
    Location
    Canada
    Posts
    404
    I'm back with my little off-topic.

    I ran the test with MSVC++ 6.0, with the build configuration set to "Release" and optimization set to "maximize speed" (the default for the Release configuration). I couldn't run it with gcc since I don't know how to write inline assembly for it. I redid the test in 3 separate runs (4 runs for each version) and got similar results:

    Code:
    Classic (for loop)  :   2213, 2173, 2163, 2233
    Assembly            :   260, 250, 250, 250
    Classic (while loop):   2203, 2133, 2133, 2143
    Note that it's possible to tweak the assembly version even more by writing doubleword instead of byte at a time. (but you could also do that in C)
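A minimal sketch of the doubleword idea without assembly, assuming a hypothetical helper named fill_dwords: the byte is replicated into a 32-bit pattern and stored four bytes at a time, with a byte loop for the tail. memcpy is used for the store to avoid alignment and aliasing problems; optimizing compilers usually turn it into a single 32-bit move, but that is not guaranteed.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: fill n bytes with value, four bytes per store.
void fill_dwords(char* dst, std::size_t n, unsigned char value) {
    // Replicate the byte into all four positions, e.g. 0x1E -> 0x1E1E1E1E.
    const std::uint32_t pattern = 0x01010101u * value;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        std::memcpy(dst + i, &pattern, 4);  // one doubleword store per iteration
    }
    for (; i < n; ++i) {
        dst[i] = static_cast<char>(value);  // leftover 0 to 3 bytes
    }
}
```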

    If you want, you could test it on your own computer; I'd be curious about the results if someone does. But I'm not really surprised by the result, since the IA-32 instruction set comes with some really efficient string instructions that I bet compilers make only minor use of (because of their very specific nature).

    I mean, just take a look at the code generated by the compiler (i added comments):

    Code:
    193:      for (i = 0; i < TAILLE; i++)
    0040C6AA C7 45 F0 00 00 00 00 mov         dword ptr [ebp-10h],0           // i = 0
    0040C6B1 EB 09                jmp         main+5Ch (0040c6bc)
    0040C6B3 8B 45 F0             mov         eax,dword ptr [ebp-10h]         // EAX = i
    0040C6B6 83 C0 01             add         eax,1                           // EAX++
    0040C6B9 89 45 F0             mov         dword ptr [ebp-10h],eax         // i = EAX
    0040C6BC 81 7D F0 00 E1 F5 05 cmp         dword ptr [ebp-10h],5F5E100h    // if (i >= TAILLE)
    0040C6C3 7D 0B                jge         main+70h (0040c6d0)             // jump out of the loop
    194:      {
    195:          tab[i] = VALEUR;
    0040C6C5 8B 4D F4             mov         ecx,dword ptr [ebp-0Ch]       // ECX = tab
    0040C6C8 03 4D F0             add         ecx,dword ptr [ebp-10h]       //  ECX = ECX + i
    0040C6CB C6 01 1E             mov         byte ptr [ecx],1Eh            //  [ecx] = 30 (VALEUR)
    196:      }
    0040C6CE EB E3                jmp         main+53h (0040c6b3)           // jmp at start of the loop
    vs

    Code:
    rep stosb
    I think it's pretty clear why it's so fast.
    Last edited by foxman; 12-09-2007 at 06:17 PM. Reason: Forgot equal sign

  4. #19
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    Out of curiosity: Do a run using just memset(). Also, in project settings -> C/C++ -> Code Generation, set the Processor to pentium pro.

    gg

  5. #20
    Chinese pâté foxman's Avatar
    Join Date
    Jul 2007
    Location
    Canada
    Posts
    404
    Haha. Well, this is a bit awkward. From the start I was in debug mode; I didn't switch configurations correctly. Sorry, my bad.

    Well, in fact, everything is quite similar in release mode. There's no really significant difference between the different versions, except for the while version, which is considerably slower (close to 2 times slower). Do you know if there's a way to see the disassembly in release mode? I'm curious to know what the compiler generates.

    But in debug mode, memset is as fast as the assembly version. What's also strange is that both memset and assembly are a bit slower in release mode.

    Anyway. I won't investigate any further.

  6. #21
    and the hat of sweating
    Join Date
    Aug 2007
    Location
    Toronto, ON
    Posts
    3,545
    Quote Originally Posted by Elysia View Post
    Just for fun, if anyone wants to help.
    So I have this little code:

    Code:
    	Stuff::CArray<char> Array;
    	double max_ = pow(10.0, 8.6);
    	DWORD dwTick1 = GetTickCount();
    	int max = (int)max_;
    
    	for (int i = 0; i < max; i++)
    		Array[i] = 12; //12345678;
    
    	DWORD dwTick2 = GetTickCount();
    	DWORD dwTick3 = dwTick2 - dwTick1;
    	cout << "Took " << dwTick3 << " ms total.\n";
    ...And I want to optimize it a little further, but it's kind of out of my hands right now.
    It's blazingly fast and according to the profiler, the loop is the culprit.
    Just for my own curiosity, what happens to the speed if you replace Stuff::CArray<char> with std::vector<char> and change the Array[i] = 12; to Array.push_back( (char)12 );

  7. #22
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Quote Originally Posted by foxman View Post
    Haha. Well, this is a bit awkward. From the start I was in debug mode; I didn't switch configurations correctly. Sorry, my bad.

    Well, in fact, everything is quite similar in release mode. There's no really significant difference between the different versions, except for the while version, which is considerably slower (close to 2 times slower). Do you know if there's a way to see the disassembly in release mode? I'm curious to know what the compiler generates.

    But in debug mode, memset is as fast as the assembly version. What's also strange is that both memset and assembly are a bit slower in release mode.

    Anyway. I won't investigate any further.
    Actually, MSVC6 generates pretty poor code. I could try on 2008 and see what happens.

    Quote Originally Posted by cpjust View Post
    Just for my own curiosity, what happens to the speed if you replace Stuff::CArray<char> with std::vector<char> and change the Array[i] = 12; to Array.push_back( (char)12 );
    For vector: best time was 3218 ms.
    For non-inline release CArray - 2500 ms.
    For inline release CArray - 1600 ms.
    If I set size before loop, inline release CArray - 1200 ms.
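For anyone who wants to repeat the vector comparison, here is a rough sketch, assuming a C++11 compiler. The names fill_by_push_back and time_fill_ms are hypothetical, and std::chrono::steady_clock stands in for the GetTickCount() calls used earlier in the thread. Calling reserve() first plays the same role as setting the size before the loop.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: append n copies of 12, optionally reserving capacity first.
std::vector<char> fill_by_push_back(std::size_t n, bool reserve_first) {
    std::vector<char> v;
    if (reserve_first) {
        v.reserve(n);  // one allocation up front, no regrowth inside the loop
    }
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<char>(12));
    }
    return v;
}

// Hypothetical timing wrapper: elapsed wall-clock time in milliseconds.
long long time_fill_ms(std::size_t n, bool reserve_first) {
    const auto start = std::chrono::steady_clock::now();
    const std::vector<char> v = fill_by_push_back(n, reserve_first);
    const auto stop = std::chrono::steady_clock::now();
    // Read the result so the fill is harder for the optimizer to discard.
    volatile char sink = v.empty() ? 0 : v.back();
    (void)sink;
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}
```

Comparing time_fill_ms(n, false) against time_fill_ms(n, true) for the same n shows how much of the vector time comes from regrowth rather than from the stores themselves.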

  8. #23
    Banned
    Join Date
    Nov 2007
    Posts
    678
    Quote Originally Posted by foxman View Post
    Sorry for this little off topic, but i wanted to compare between C and Assembly on a simple string operation. What's interesting is that Assembly was roughly 7.5 times faster. Here's the code snippet

    Code:
    #include <time.h>
    #include <stdlib.h>
    #include <stdio.h>
    #define TAILLE 100000000
    #define VALEUR 30
    int main()
    {
        clock_t debut, fin;
        char *tab = malloc(TAILLE);
        int i;
        if (tab == NULL)
            exit(1);
    
        // Classic
        debut = clock();
        for (i = 0; i < TAILLE; i++)
        {
            tab[i] = VALEUR;
        }
        fin = clock();
        printf(" &#37;u\n", fin - debut);
    
        // Assembleur
        debut = clock();
        __asm
        {
           push eax
           push ecx
           push edi
    
           mov ecx, TAILLE - 1
           mov al, VALEUR
           mov edi, tab
           rep stosb
           stosb
    
           pop edi
           pop ecx
           pop eax
        }
        fin = clock();
        printf(" %ld\n", (long)(fin - debut));
    
        // Quasi classique
        debut = clock();
        i = TAILLE;
        while (i)
        {
            tab[--i] = VALEUR;
        }
        fin = clock();
        printf(" %ld\n", (long)(fin - debut));
    
        free(tab);
        return 0;
    }
    Every "loop" does the same thing. The test was done on my old Pentium 3 (which has enough memory for this test, don't worry). Values returned (averages): 2200, 280, 2200

    I wanted to share that...
    Here are my results of 5 runs on MSVC .Net 2003 with /Og /O2
    78 47 78
    93 32 78
    78 47 78
    93 32 93
    93 47 78

  9. #24
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    I enabled every possible optimization I could find and the result was... *drum roll please*

    109
    94
    109

    Assembly for classic loop:

    Code:
    	debut = clock();
    00401827  mov         esi,dword ptr [__imp__clock (40209Ch)] 
    0040182D  call        esi  
    	for (i = 0; i < TAILLE; i++)
    	{
    		tab[i] = VALEUR;
    0040182F  push        5F5E100h 
    00401834  push        1Eh  
    00401836  push        ebx  
    00401837  mov         edi,eax 
    00401839  call        memset (4018B0h) 
    	}
    	fin = clock();
    0040183E  call        esi
    It even replaces your loop with a call to memset.

    Second loop assembly:

    Code:
    0040187E  call        esi  
    00401880  mov         edi,eax 
    	i = TAILLE;
    00401882  mov         eax,5F5E100h 
    	while (i)
    	{
    		tab[--i] = VALEUR;
    00401887  sub         eax,1 
    0040188A  mov         byte ptr [eax+ebx],1Eh 
    0040188E  jne         $LN14+60h (401887h) 
    	}
    	fin = clock();
    00401890  call        esi
    There's not much of a difference on today's compilers. Sure, there's some, so when really, really time critical, it might help, but otherwise...
