Assembly optimization?

**Elysia** · 12-09-2007

For the outer loop (in debug):

Code:

Address   Line  Trace  Source                          Code Bytes      Total %  Timer samples
0x41180c  73           for (int i = 0; i < max; i++)                   1.17     380
0x411826  74           Array[i] = 12; //12345678;                      1.94     633
0x411826               mov eax,[ebp-68h]               8B 45 98        0.00     1
0x411829               push eax                        50
0x41182a               lea ecx,[ebp-34h]               8D 4D CC        0.60     195
0x41182d               call $-000006fch (0x100411131)  E8 FF F8 FF FF
0x411832               mov [eax],0ch                   C6 00 0C        0.60     195
0x411835               jmp $-20h (0x100411815)         EB DE           0.74     242

1 file, 1 function, 2 lines, 6 instructions, Summary: 1013 samples, 6.95% of module samples, 3.11% of total samples

For Array[], you already have it; I posted it above.
There are no samples for Realloc, which makes sense: it is never called, because I set the size before starting the loop.
If I don't set the size, the run takes about 400 ms longer, but there are still no samples for Realloc. I'm not sure whether I'm just using the profiler wrong in this case. I use CodeAnalyst, in case anyone needs to know.

I found out why it wasn't getting symbols for operator [] in Release. Simple: the compiler inlined the code, saving 1000 ms. With inlining disabled, here are the samples for Release:

Code:

Address   Line  Trace  Source                                       Code Bytes  Total %  Timer samples
          49           CArrayImpl(T&)::operator [] (DWORD dwIndex)
0x401070  50           {                                                        7.57     399
0x401073  51               if (dwIndex >= m_dwSize)                             3.23     170
0x401079  52                   Realloc(dwIndex - m_dwSize);
0x401083  53               return m_pArray[dwIndex];                            4.82     254
0x401089  54           }                                                        3.15     166

1 file, 1 function, 6 lines, 0 instructions, Summary: 989 samples, 52.83% of module samples, 18.77% of total samples

**Codeplug** · 12-09-2007

>> There are no samples for Realloc...
Well, with Realloc() not being called and Array[] being inlined, there's not much more you can do except add a "memset"-style method that uses a highly optimized version of memset() with SIMD instructions.

Keeping your original for loop, you could try manually unrolling the loop.

gg
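A rough sketch of the manual unrolling suggested above, for anyone who wants to try it. The helper name fill_unrolled and the unroll factor of four are assumptions for illustration and do not come from the thread; whether it helps at all depends on the compiler, which may already unroll the loop or replace it with memset.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the thread): fills n bytes at dst with value,
// writing four bytes per loop iteration.
void fill_unrolled(char* dst, std::size_t n, char value)
{
    assert(dst != nullptr || n == 0);

    std::size_t i = 0;
    // Main body: four stores per iteration, so the loop test runs about n/4 times.
    for (; i + 4 <= n; i += 4)
    {
        dst[i]     = value;
        dst[i + 1] = value;
        dst[i + 2] = value;
        dst[i + 3] = value;
    }
    // Tail: the 0 to 3 bytes left over when n is not a multiple of four.
    for (; i < n; ++i)
        dst[i] = value;
}
```

Calling fill_unrolled(buffer, size, 12) on a raw char buffer would stand in for the original for loop; timing it against the plain loop in a Release build is the only way to know whether it gains anything.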

**foxman** · 12-09-2007

I'm back with my little off-topic.

The test I did used MSVC++ 6.0, with the build configuration set to "Release" and optimization set to "Maximize Speed" (the default for the Release configuration). I couldn't run the test with gcc since I don't know how to write inline assembly for it. I repeated the test in 3 separate sessions (4 runs for each version) and got similar results:

Code:

Classic (for loop)  :   2213, 2173, 2163, 2233
Assembly            :   260, 250, 250, 250
Classic (while loop):   2203, 2133, 2133, 2143

Note that it's possible to tweak the assembly version even more by writing a doubleword at a time instead of a byte (but you could also do that in C).
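As a minimal sketch of the doubleword idea in plain C++ rather than assembly: the helper name fill_dwords is hypothetical, and it assumes nothing about alignment, which is why it stores through memcpy (compilers typically turn a fixed four-byte memcpy into a single store, though that is not guaranteed).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the thread): fills n bytes at dst with value,
// storing 32 bits at a time and finishing the remainder byte by byte.
void fill_dwords(char* dst, std::size_t n, unsigned char value)
{
    assert(dst != nullptr || n == 0);

    // Replicate the byte into all four lanes, e.g. 0x1E becomes 0x1E1E1E1E.
    const std::uint32_t pattern = 0x01010101u * value;

    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        std::memcpy(dst + i, &pattern, sizeof pattern); // alignment-safe 4-byte store
    for (; i < n; ++i)
        dst[i] = static_cast<char>(value);              // 0 to 3 trailing bytes
}
```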

If you want, you can test it on your own computer; I'd be curious about the results. But I'm not really surprised by the result, since the IA-32 instruction set comes with some really efficient string instructions that I bet compilers make only minor use of (because of their very specific nature).

I mean, just take a look at the code generated by the compiler (i added comments):

Code:

     for (i = 0; i < TAILLE; i++)
0040C6AA C7 45 F0 00 00 00 00 mov         dword ptr [ebp-10h],0           // i = 0
0040C6B1 EB 09                jmp         main+5Ch (0040c6bc)             // skip the increment on first entry
0040C6B3 8B 45 F0             mov         eax,dword ptr [ebp-10h]         // EAX = i
0040C6B6 83 C0 01             add         eax,1                           // EAX++
0040C6B9 89 45 F0             mov         dword ptr [ebp-10h],eax         // i = EAX
0040C6BC 81 7D F0 00 E1 F5 05 cmp         dword ptr [ebp-10h],5F5E100h    // if (i >= TAILLE)
0040C6C3 7D 0B                jge         main+70h (0040c6d0)             // jump out of the loop
     {
         tab[i] = VALEUR;
0040C6C5 8B 4D F4             mov         ecx,dword ptr [ebp-0Ch]         // ECX = tab
0040C6C8 03 4D F0             add         ecx,dword ptr [ebp-10h]         // ECX = ECX + i
0040C6CB C6 01 1E             mov         byte ptr [ecx],1Eh              // [ecx] = 30 (VALEUR)
     }
0040C6CE EB E3                jmp         main+53h (0040c6b3)             // jump back to the increment

vs

Code:

rep stosb

I think it's pretty clear why it's so fast.

**Codeplug** · 12-09-2007

Out of curiosity: do a run using just memset(). Also, in Project Settings -> C/C++ -> Code Generation, set the Processor to Pentium Pro.

gg
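One way the memset-only run could be timed, as a sketch. It assumes a C++11 compiler (newer than the ones used in this thread) for std::chrono, and the helper name time_memset_ms is made up for illustration; the thread's own code uses clock() and GetTickCount() instead.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not from the thread): runs memset over n bytes
// and returns the elapsed wall-clock time in whole milliseconds.
long long time_memset_ms(char* dst, std::size_t n, int value)
{
    assert(dst != nullptr);

    const auto start = std::chrono::steady_clock::now();
    std::memset(dst, value, n);
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}
```

For a fair comparison the buffer should be the same size as in the other versions (TAILLE bytes), and it is worth touching the buffer once before timing so that the first run does not pay for page faults.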

**foxman** · 12-09-2007

Haha. Well, this is a bit awkward. I was in debug mode from the start; I didn't switch configurations correctly. Sorry, my bad.

Well, in fact, everything is quite similar in release mode: there's no really significant difference between the versions, except for the while version, which is considerably slower (close to 2 times slower). Do you know if there's a way to see the disassembly in release mode? I'm curious to know what the compiler generates.

But in debug mode, memset is as fast as the assembly version. What's also strange is that both memset and assembly are a bit slower in release mode.

Anyway. I won't investigate any further.

**cpjust** · 12-09-2007

Originally Posted by Elysia

Just for fun, if anyone wants to help.
So I have this little code:

Code:

	Stuff::CArray<char> Array;
	double max_ = pow(10.0, 8.6);
	DWORD dwTick1 = GetTickCount();
	int max = (int)max_;

	for (int i = 0; i < max; i++)
		Array[i] = 12; //12345678;

	DWORD dwTick2 = GetTickCount();
	DWORD dwTick3 = dwTick2 - dwTick1;
	cout << "Took " << dwTick3 << " ms total.\n";

...And I want to optimize it a little further, but it's kind of out of my hands right now.
It's blazingly fast and according to the profiler, the loop is the culprit.

Just for my own curiosity, what happens to the speed if you replace Stuff::CArray<char> with std::vector<char> and change the Array[i] = 12; to Array.push_back( (char)12 );?
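A sketch of what that std::vector variant might look like. The function name fill_with_push_back and the reserve_first switch are assumptions added for illustration; reserve() plays roughly the role that setting the size before the loop plays for CArray.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the thread): builds a vector of count bytes
// using push_back, optionally reserving the capacity first.
std::vector<char> fill_with_push_back(std::size_t count, char value, bool reserve_first)
{
    std::vector<char> v;
    if (reserve_first)
        v.reserve(count);   // one allocation up front instead of repeated regrowth

    for (std::size_t i = 0; i < count; ++i)
        v.push_back(value); // still checks capacity on every call

    assert(v.size() == count);
    return v;
}
```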

**Elysia** · 12-10-2007

Originally Posted by foxman

Haha. Well, this is a bit awkward. I was in debug mode from the start; I didn't switch configurations correctly. Sorry, my bad.

Well, in fact, everything is quite similar in release mode: there's no really significant difference between the versions, except for the while version, which is considerably slower (close to 2 times slower). Do you know if there's a way to see the disassembly in release mode? I'm curious to know what the compiler generates.

But in debug mode, memset is as fast as the assembly version. What's also strange is that both memset and assembly are a bit slower in release mode.

Anyway. I won't investigate any further.

Actually, MSVC6 generates pretty poor code. I could try on 2008 and see what happens.

Originally Posted by cpjust

Just for my own curiosity, what happens to the speed if you replace Stuff::CArray<char> with std::vector<char> and change the Array[i] = 12; to Array.push_back( (char)12 );?

For vector: best time was 3218 ms.
For non-inline release CArray - 2500 ms.
For inline release CArray - 1600 ms.
If I set size before loop, inline release CArray - 1200 ms.
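To illustrate why setting the size before the loop matters, here is a minimal stand-in for a self-growing array. GrowArray, SetSize, and GrowCount are hypothetical names, and the class is backed by std::vector for brevity; the real CArray is not shown in full in this thread, so this is only a sketch of the same bounds-check-then-grow pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the thread's CArray<char> (assumed design, not the real class).
class GrowArray
{
public:
    // Pre-sizes the array so operator[] never takes the slow path inside the loop.
    void SetSize(std::size_t size) { m_data.resize(size); }

    // Grows on out-of-range access, mirroring the "if (dwIndex >= m_dwSize)" check.
    char& operator[](std::size_t index)
    {
        if (index >= m_data.size())
        {
            m_data.resize(index + 1);
            ++m_growCount;          // counts how often the slow path ran
        }
        assert(index < m_data.size());
        return m_data[index];
    }

    std::size_t Size() const { return m_data.size(); }
    std::size_t GrowCount() const { return m_growCount; }

private:
    std::vector<char> m_data;
    std::size_t m_growCount = 0;
};
```

Filling a default-constructed GrowArray element by element takes the grow path on every access, while calling SetSize(max) first leaves only the comparison in the loop.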

**~~manav~~** · 12-10-2007

Originally Posted by foxman

Sorry for this little off-topic, but I wanted to compare C and assembly on a simple string operation. What's interesting is that assembly was roughly 7.5 times faster. Here's the code snippet:

Code:

#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#define TAILLE 100000000
#define VALEUR 30
int main()
{
    clock_t debut, fin;
    char *tab = malloc(TAILLE);
    int i;
    if (tab == NULL)
        exit(1);

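    // Classic for loop
    debut = clock();
    for (i = 0; i < TAILLE; i++)
    {
        tab[i] = VALEUR;
    }
    fin = clock();
    printf(" %ld\n", (long)(fin - debut)); // elapsed clock ticks

    // Assembly (MSVC inline assembler, 32-bit x86 only)
    debut = clock();
    __asm
    {
       push eax
       push ecx
       push edi

       mov ecx, TAILLE - 1   ; repeat count for rep stosb
       mov al, VALEUR        ; byte to store
       mov edi, tab          ; destination pointer
       rep stosb             ; store AL to [EDI], ECX times
       stosb                 ; store the final byte

       pop edi
       pop ecx
       pop eax
    }
    fin = clock();
    printf(" %ld\n", (long)(fin - debut));

    // Almost classic: while loop counting down
    debut = clock();
    i = TAILLE;
    while (i)
    {
        tab[--i] = VALEUR;
    }
    fin = clock();
    printf(" %ld\n", (long)(fin - debut));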

    free(tab);
    return 0;
}

Every "loop" does the same thing. Test was done on my old Pentium 3 (which has enough memory for this test, don't worry). Value returned (average): 2200, 280, 2200

I wanted to share that...

Here are my results from 5 runs on MSVC .NET 2003 with /Og /O2 (columns in the program's output order: for loop, assembly, while loop):
78 47 78
93 32 78
78 47 78
93 32 93
93 47 78

**Elysia** · 12-10-2007

I enabled every possible optimization I could find and the result was... *drum roll please*

109
94
109

Assembly for classic loop:

Code:

	debut = clock();
00401827  mov         esi,dword ptr [__imp__clock (40209Ch)] 
0040182D  call        esi  
	for (i = 0; i < TAILLE; i++)
	{
		tab[i] = VALEUR;
0040182F  push        5F5E100h 
00401834  push        1Eh  
00401836  push        ebx  
00401837  mov         edi,eax 
00401839  call        memset (4018B0h) 
	}
	fin = clock();
0040183E  call        esi

It even replaces your loop with a call to memset.
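If you would rather not depend on the optimizer spotting the pattern, the same fill can be requested directly through the standard library. This is only a sketch: the wrapper name fill_bytes is made up, and although standard library implementations commonly lower a char fill to memset, that is an implementation detail and not something the standard promises.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical wrapper (not from the thread): same effect as the classic for loop,
// expressed with std::fill_n so the library can pick its fastest path.
void fill_bytes(char* tab, std::size_t count, char value)
{
    assert(tab != nullptr || count == 0);
    std::fill_n(tab, count, value);
}
```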

Assembly for the while loop:

Code:

0040187E  call        esi  
00401880  mov         edi,eax 
	i = TAILLE;
00401882  mov         eax,5F5E100h 
	while (i)
	{
		tab[--i] = VALEUR;
00401887  sub         eax,1 
0040188A  mov         byte ptr [eax+ebx],1Eh 
0040188E  jne         $LN14+60h (401887h) 
	}
	fin = clock();
00401890  call        esi

There's not much of a difference with today's compilers. Sure, there's some, so when the code is really, really time-critical, hand-written assembly might help, but otherwise...
