Thread: Assembly optimization?

  1. #1
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654

    Assembly optimization?

    Just for fun, if anyone wants to help.
    So I have this little code:

    Code:
    	Stuff::CArray<char> Array;
    	double max_ = pow(10.0, 8.6);
    	DWORD dwTick1 = GetTickCount();
    	int max = (int)max_;
    
    	for (int i = 0; i < max; i++)
    		Array[i] = 12; //12345678;
    
    	DWORD dwTick2 = GetTickCount();
    	DWORD dwTick3 = dwTick2 - dwTick1;
    	cout << "Took " << dwTick3 << " ms total.\n";
    ...And I want to optimize it a little further, but it's kind of out of my hands right now.
    It's blazingly fast and according to the profiler, the loop is the culprit.

    Code:
     	Address  	Line 	Trace 	Source                         	Code Bytes      	Total &#37; 	Timer samples 	
     	0x401de6 	     	      	cmp edi,[esp+14h]              	3B 7C 24 14     	        	              	
     	0x401dea 	     	      	jnz $+15h (0x401dff)           	75 13           	5.76    	218           	
     	0x401dec 	     	      	lea esp,[esp+00h]              	8D 64 24 00     	        	              	
     	0x401df0 	     	      	lea esi,[esp+14h]              	8D 74 24 14     	        	              	
     	0x401df4 	     	      	call $-00000c24h (0x1004011d0) 	E8 D7 F3 FF FF  	        	              	
     	0x401df9 	     	      	cmp [esp+14h],edi              	39 7C 24 14     	        	              	
     	0x401dfd 	     	      	jbe $-0dh (0x100401df0)        	76 F1           	        	              	
     	0x401dff 	     	      	mov ecx,[esp+24h]              	8B 4C 24 24     	        	              	
     	0x401e03 	     	      	mov [ecx+edi],0ch              	C6 04 39 0C     	5.61    	212           	
     	0x401e07 	     	      	inc edi                        	47              	11.03   	417           	
     	0x401e08 	     	      	cmp edi,ebx                    	3B FB           	0.03    	1             	
     	0x401e0a 	     	      	jl $-24h (0x100401de6)         	7C DA           	5.24    	198           	
    
    1 file, 1 function, 1 line, 12 instructions, Summary: 1046 samples, 100.00% of module samples, 27.66% of total samples
    If anyone have any pointers on how to optimize the loop in assembly? It's blazingly fast considering what it's actually doing, so I just wanted to see how much more performance can be gained through this code, to decrease the execution time from 1.6s to lower!

    What it does is a dynamic array that allocates more memory on-the-fly as you insert elements. Considering it's actually adding 10 ^ 8.6 bytes of memory, I say it's pretty efficient!

    Pretty please? ^_^
    I'm not an assembly expert.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  2. #2
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    A few things off the top of my head:
    - Set the array size first - assuming MFC, that would be Array.SetSize(max).
    - If you can access Array's buffer directly, use memset() on it. memset() should already be somewhat optimized.

    gg

  3. #3
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    It's not an MFC class (otherwise it wouldn't be in the namespace Stuff, would it?). It's my homebrew class and it does not require me to set a size first. It changes the size if the current buffer in not big enough to hold all the data automatically. And besides that, I don't think it will matter if I do or not since it seems the time is spent in the loop and the processor driver and not the class implementation itself.
    It's not possible to access the buffer directly. Even doing so, I'd have to do some precautions... Dunno if I actually might do that. But I'll think about it...
    I would also add additional functionality such as writing and serializing into it.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  4. #4
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    5.76 + 5.61 + 11.03 + 0.03 + 5.24 == 27.67&#37; (or 27.66% according to your output).

    Looks to me like the other 72.34% of the time is being spent in your CArray[] operator.

    gg

  5. #5
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Actually not. Here's what the profiler says (again):

    Code:
    Module Name              	64-bit 	Total &#37; 	Timer samples 	
    Help.exe                 	       	24.87   	1038          	
    AmdK8.sys                	       	21.45   	895           	
    ntkrnlpa.exe             	       	18.72   	781           	
    msvcr90.dll              	       	6.76    	282           	
    hal.dll                  	       	6.37    	266           	
    win32k.sys               	       	4.86    	203           	
    unknown module pid(2464) 	       	1.82    	76            	
    ntdll.dll                	       	1.82    	76            	
    SandBox.sys              	       	1.65    	69            	
    Unknown Kernel Samples   	       	1.46    	61            	
    qt-mt333.dll             	       	1.13    	47            	
    
    11 modules, Total: 3794 samples, 90.92% of total session samples
    For Help.exe:

    Code:
    CS:EIP   	Symbol + Offset 	64-bit 	Total % 	Timer samples 	
    0x401d30 	wmain           	       	24.87   	1038          	
    0x401e07 	wmain+215       	       	10.02   	418           	
    0x401e0a 	wmain+218       	       	5.18    	216           	
    0x401e03 	wmain+211       	       	4.94    	206           	
    0x401dea 	wmain+186       	       	4.65    	194           	
    0x401e08 	wmain+216       	       	0.05    	2             	
    0x401dff 	wmain+207       	       	0.02    	1             	
    0x401de6 	wmain+182       	       	0.02    	1             	
    
    1 function, 7 instructions, Total: 1038 samples, 100.00% of samples in the module, 24.87% of total session samples
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  6. #6
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    I don't see what that is supposed to show.

    I still don't see how much time is being spent in CArray[]. Your original profile data suggests to me that the CArray[] call wasn't profiled at all - since there's no data in the Total&#37; and TimerSamples columns.

    gg

  7. #7
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Now there's just the thing - it shows that all the time spent in Help.exe is spent in the for loop.
    The rest of the time is spent outside the program in API calls and such, which is not calculated towards Help.exe.
    I can post code for the array too, if that's required, but as I mentioned, the only samples are for Help.exe.

    Here is for Debug:
    Code:
     	Address  	Line 	Trace 	Source                                       	Code Bytes 			Total &#37; 	Timer samples 	
     	         	46   	      		CArrayImpl(T&)::operator [] (DWORD dwIndex) 	           	        	              	
     	0x412430 	47   	      		{                                           	           		27.14   	9134          	
     	0x412453 	48   	      			if (dwIndex == m_dwSize)                   	           	5.80    	1953          	
     	0x41245d 	49   	      				while (m_dwSize <= dwIndex) Realloc();    	           	        	              	
     	0x412471 	50   	      			return m_pArray[dwIndex];                  	           	1.11    	372           	
     	0x41247a 	51   	      		}                                           	           		4.64    	1563          	
    
    1 file, 1 function, 5 lines, 0 instructions, Summary: 3888 samples, 26.51% of module samples, 11.55% of total samples
    And
    Code:
    CS:EIP   	Symbol + Offset                    	64-bit 	Total % 	Timer samples 	
    0x412430 	Stuff::CArray<char>::operator[]    	       	38.70   	13022         	
    0x411720 	wmain                              	       	3.16    	1064          	
    0x411271 	ILT+620(__RTC_CheckEsp)            	       	0.64    	214           	
    0x411131 	ILT+300(??A?$CArrayDStuffQAEAADKZ) 	       	0.57    	193           	
    0x413990 	_RTC_CheckEsp                      	       	0.51    	173           	
    
    5 functions, 29 instructions, Total: 14666 samples, 100.00% of samples in the module, 43.58% of total session samples
    Last edited by Elysia; 12-09-2007 at 02:10 PM.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  8. #8
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    Code:
    0x411720 	wmain                              	       	3.16    	1064
    Was your for loop done in wmain() in this case?

    gg

  9. #9
    Chinese pâté foxman's Avatar
    Join Date
    Jul 2007
    Location
    Canada
    Posts
    404
    Code:
    lea esp,[esp+00h]
    This is one funny line. It does absolutely nothing. (it says, store in ESP the adress of the element at [esp + 0], basically, store in esp the value of esp).

    Anyway, you won't win much by removing only a single instruction heh. But i have to say, reading assembly produced by compiler is rather confusing, i mean it doesn't look really natural.

    Writing your loop in assembly would be rather straightforward if the code was written in C, but since it's not, with that [] operator whos translated in a function call, well, i just don't know (i'm really far from being a C++ expert). Could only help you if it was a "c style" array where the for loop could be write as...

    Code:
    __asm
    {
       // Assume that your array is char (1 byte) array
       push eax
       push ecx
       push edi
    
       mov ecx, max
       dec ecx
       mov al, 12
       mov edi, array
       rep stosb
       stosb
    
       pop edi
       pop ecx
       pop eax
    }
    Anyway.
    Last edited by foxman; 12-09-2007 at 02:39 PM. Reason: Small error on usage of rep instruction

  10. #10
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Quote Originally Posted by Codeplug View Post
    Code:
    0x411720 	wmain                              	       	3.16    	1064
    Was your for loop done in wmain() in this case?

    gg
    Yes, the code is the same:

    Code:
     	Address  	Line 	Trace 	Source                         	Code Bytes 	Total &#37; 	Timer samples 	
     	0x411800 	72   	      		for (int i = 0; i < max; i++) 	           	1.14    	383           	
     	0x41181a 	73   	      			Array[i] = 12; //12345678;   	           	2.02    	681           	
    
    1 file, 1 function, 2 lines, 0 instructions, Summary: 1064 samples, 7.25% of module samples, 3.16% of total samples
    As for C++... Well, basically ecx contains the address of the class (the this pointer) and arguments are simply pushed on the stack (this one's __cdecl.)
    Code:
     	Address  	Line 	Trace 	Source                         	Code Bytes      	Total % 	Timer samples 	
     	0x41181a 	     	      	mov eax,[ebp-68h]              	8B 45 98        	        	              	
     	0x41181d 	     	      	push eax                       	50              	        	              	
     	0x41181e 	     	      	lea ecx,[ebp-34h]              	8D 4D CC        	0.54    	181           	
     	0x411821 	     	      	call $-000006f0h (0x100411131) 	E8 0B F9 FF FF  	        	              	
    
    1 file, 1 function, 1 line, 4 instructions, Summary: 181 samples, 1.23% of module samples, 0.54% of total samples
    From this sample code we can see that it pushes the argument, the index to the stack and puts the this pointer (the class address) into ecx and calls the operator [].
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  11. #11
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    Code:
    CS:EIP   	Symbol + Offset                    	64-bit 	Total &#37; 	Timer samples 	
    0x412430 	Stuff::CArray<char>::operator[]    	       	38.70   	13022         	
    0x411720 	wmain                              	       	3.16    	1064
    >>>> Was your for loop done in wmain() in this case?
    >> Yes, the code is the same

    If your for loop was in wmain(), then wouldn't it contributed to the 3.16%?. Still looks like most of your time is spent in CArray[].

    Code:
    CArrayImpl(T&)::operator [] (DWORD dwIndex)
    {
        if (dwIndex == m_dwSize)
            while (m_dwSize <= dwIndex) 
                Realloc();
        return m_pArray[dwIndex];
    }
    Get rid of the 'if'. A 'compare/jump' is going to be done regardless. Adding the 'if' only introduces the possibility of a bug (by calling with a dwIndex > m_dwSize).

    I would optimize this by reducing the number of possible iterations that the while loop has to make. One way to do that is to pass a minimum-size "hint" to Realloc(). Then you operator[] would be reduced to:
    Code:
    CArrayImpl(T&)::operator [] (DWORD dwIndex)
    {
        if (dwIndex >= m_dwSize)
            Realloc(dwIndex);
        return m_pArray[dwIndex];
    }
    Then with that in place. you can simulate the MFC CArray equivalent of SetSize() by calling "Array[max] = 12;" prior to looping through the rest of the elements. Or add an explicit SetSize()-type of member function for pre-allocation.

    Another option is to add a "memset" member function to CArray, which could then be optimized accordingly based on the underlying implementation. This member function could even Realloc() as needed.

    In the end, I think your point of optimization is going to lie within Realloc().

    gg

  12. #12
    Chinese pâté foxman's Avatar
    Join Date
    Jul 2007
    Location
    Canada
    Posts
    404
    Sorry for this little off topic, but i wanted to compare between C and Assembly on a simple string operation. What's interesting is that Assembly was roughly 7.5 times faster. Here's the code snippet

    Code:
    #include <time.h>
    #include <stdlib.h>
    #include <stdio.h>
    #define TAILLE 100000000
    #define VALEUR 30
    int main()
    {
        clock_t debut, fin;
        char *tab = malloc(TAILLE);
        int i;
        if (tab == NULL)
            exit(1);
    
        // Classic
        debut = clock();
        for (i = 0; i < TAILLE; i++)
        {
            tab[i] = VALEUR;
        }
        fin = clock();
        printf(" %u\n", fin - debut);
    
        // Assembleur
        debut = clock();
        __asm
        {
           push eax
           push ecx
           push edi
    
           mov ecx, TAILLE - 1
           mov al, VALEUR
           mov edi, tab
           rep stosb
           stosb
    
           pop edi
           pop ecx
           pop eax
        }
        fin = clock();
        printf(" %u\n", fin - debut);
    
        // Quasi classique
        debut = clock();
        i = TAILLE;
        while (i)
        {
            tab[--i] = VALEUR;
        }
        fin = clock();
        printf(" %u\n", fin - debut);
    
        free(tab);
        return 0;
    }
    Every "loop" does the same thing. Test was done on my old Pentium 3 (which has enough memory for this test, don't worry). Value returned (average): 2200, 280, 2200

    I wanted to share that...

  13. #13
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    Code:
    // Classic
    // Assembleur
    // Quasi classique
    These should be done in 3 separate runs. Otherwise, CPU cache hits/misses could skew your results. Also be sure to turn on all possible compiler optimizations - to be fair

    gg

  14. #14
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Quote Originally Posted by Codeplug View Post
    >>>> Was your for loop done in wmain() in this case?
    >> Yes, the code is the same

    If your for loop was in wmain(), then wouldn't it contributed to the 3.16&#37;?. Still looks like most of your time is spent in CArray[].

    Code:
    CArrayImpl(T&)::operator [] (DWORD dwIndex)
    {
        if (dwIndex == m_dwSize)
            while (m_dwSize <= dwIndex) 
                Realloc();
        return m_pArray[dwIndex];
    }
    Get rid of the 'if'. A 'compare/jump' is going to be done regardless. Adding the 'if' only introduces the possibility of a bug (by calling with a dwIndex > m_dwSize).
    I missed that. Initially, I did >= I think. But it hit a speed barrier, so I did only == since it worked just as fine in my code. Didn't consider calling a higher index, though.
    Still, I removed the IF and left the while. The Realloc function commits one page (as I like to call it) of memory and not more. Further, it does not hurt performance very much due to the function is not called often. Even removing the if and letting while be, I saw no performance degradation (and no performance gain either). I have the check there to make sure you don't call an index that's higher than the buffer and so if I allocate the buffer and it turns out it's not enough...

    I would optimize this by reducing the number of possible iterations that the while loop has to make. One way to do that is to pass a minimum-size "hint" to Realloc(). Then you operator[] would be reduced to:
    Code:
    CArrayImpl(T&)::operator [] (DWORD dwIndex)
    {
        if (dwIndex >= m_dwSize)
            Realloc(dwIndex);
        return m_pArray[dwIndex];
    }
    Then with that in place. you can simulate the MFC CArray equivalent of SetSize() by calling "Array[max] = 12;" prior to looping through the rest of the elements. Or add an explicit SetSize()-type of member function for pre-allocation.
    I could take that into consideration since the function is fully designed now. But I don't think it will have much of an impact at all. If you call a high index, pages will be allocated (committed) and it won't happen again. Tests show that when using an initial size, I save 500ms. So the Realloc isn't that expensive.

    Another option is to add a "memset" member function to CArray, which could then be optimized accordingly based on the underlying implementation. This member function could even Realloc() as needed.

    In the end, I think your point of optimization is going to lie within Realloc().

    gg
    Will take into consideration. Will also do a new profile analyze pass.

    Quote Originally Posted by foxman View Post
    Sorry for this little off topic, but i wanted to compare between C and Assembly on a simple string operation. What's interesting is that Assembly was roughly 7.5 times faster. Here's the code snippet
    Are you sure you enabled all optimizations for the compiler? It should be pretty much equal in speed for that little sample I would believe.

    UPDATE:
    OK, so new data:

    Code:
     	Address  	Line 	Trace 	Source                                                          	Code Bytes 	Total % 	Timer samples 	
     	         	49   	      		CArrayImpl(T&)::operator [] (DWORD dwIndex)                    	           	        	              	
     	0x412430 	50   	      		{                                                              	           	27.56   	8989          	
     	0x412453 	51   	      			if (dwIndex >= m_dwSize)                                	        6.45    	2105          	
     	0x41245d 	52   	      				/*while (m_dwSize <= dwIndex) */Realloc(dwIndex - m_dwSize); 	           	        	              	
     	0x41246e 	53   	      			return m_pArray[dwIndex];                                     	        1.14    	373           	
     	0x412477 	54   	      		}                                                              	           	4.68    	1525          	
    
    1 file, 1 function, 6 lines, 0 instructions, Summary: 12992 samples, 89.16% of module samples, 39.84% of total samples
    The evil function call is sucking up the CPU time here and Release only shows the loop as the bottleneck.
    Last edited by Elysia; 12-09-2007 at 03:37 PM.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  15. #15
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    Where are the timer samples for the call to Realloc()? Can you run a profile session that shows samples and percentages for:
    - the outer for loop
    - the call to Array[]
    - the call to Realloc()
    So we can get a "big picture" of where time is being spent?

    [Edit]
    In release mode.
    [/Edit]

    gg
    Last edited by Codeplug; 12-09-2007 at 03:54 PM.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Learning Assembly
    By mrafcho001 in forum Tech Board
    Replies: 5
    Last Post: 03-12-2006, 05:00 PM
  2. C to assembly interface
    By Roaring_Tiger in forum C Programming
    Replies: 4
    Last Post: 02-04-2005, 03:51 PM
  3. assembly language...the best tool for game programming?
    By silk.odyssey in forum Game Programming
    Replies: 50
    Last Post: 06-22-2004, 01:11 PM
  4. True ASM vs. Fake ASM ????
    By DavidP in forum A Brief History of Cprogramming.com
    Replies: 7
    Last Post: 04-02-2003, 04:28 AM
  5. C,C++,Perl,Java
    By brusli in forum C Programming
    Replies: 9
    Last Post: 12-31-2001, 03:35 AM