Just a quick analysis of the code generated for vector vs. array shows:
Code:
// array code
; COMDAT ?sum@A@@UAEHXZ
_TEXT SEGMENT
?sum@A@@UAEHXZ PROC NEAR ; A::sum, COMDAT
; _this$ = ecx
push ebx
push esi
xor eax, eax
push edi
add ecx, 32 ; 00000020H
mov esi, 128 ; 00000080H
$L11447:
mov edx, 16 ; 00000010H
$L11451:
mov ebx, DWORD PTR [ecx-24]
mov edi, DWORD PTR [ecx-28]
add edi, ebx
add edi, DWORD PTR [ecx-20]
add edi, DWORD PTR [ecx-16]
add edi, DWORD PTR [ecx-12]
add edi, DWORD PTR [ecx-8]
add edi, DWORD PTR [ecx-4]
add edi, DWORD PTR [ecx]
add eax, edi
add ecx, 32 ; 00000020H
dec edx
jne SHORT $L11451
dec esi
jne SHORT $L11447
pop edi
pop esi
pop ebx
ret 0
?sum@A@@UAEHXZ ENDP ; A::sum
_TEXT ENDS
// vector code:
_TEXT SEGMENT
?sum@V@@UAEHXZ PROC NEAR ; V::sum, COMDAT
; _this$ = ecx
push esi
mov esi, DWORD PTR [ecx+8]
push edi
xor eax, eax
add esi, 4
mov edi, 128 ; 00000080H
npad 1
$L11413:
mov ecx, DWORD PTR [esi]
add ecx, 8
mov edx, 16 ; 00000010H
npad 6
$L11417:
add eax, DWORD PTR [ecx-8]
add eax, DWORD PTR [ecx-4]
add eax, DWORD PTR [ecx]
add eax, DWORD PTR [ecx+4]
add eax, DWORD PTR [ecx+8]
add eax, DWORD PTR [ecx+12]
add eax, DWORD PTR [ecx+16]
add eax, DWORD PTR [ecx+20]
add ecx, 32 ; 00000020H
dec edx
jne SHORT $L11417
add esi, 16 ; 00000010H
dec edi
jne SHORT $L11413
pop edi
pop esi
ret 0
?sum@V@@UAEHXZ ENDP ; V::sum
The main difference, as I see it, is that the vector version has to do another indirection in the outer loop (it reloads the row's data pointer each time round), whereas the array version just does a simple decrement in the outer loop. I would actually expect the compiler to merge the two loops: loading edx with 16 and then using a separate register to count the outer loop seems excessive. Why not just load edx with 16 * 128?
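For reference, here is a minimal C++ sketch of the kind of source that would give listings shaped like the ones above. The posted code isn't reproduced in this message, so the names (ArrayGrid, VectorGrid, kSize) and the flat array layout are my assumptions, not the original classes. sum_merged is the single-loop form I would have expected for the array. The vector version can't be flattened the same way, since each row is a separate allocation and its pointer has to be fetched once per outer iteration.

```cpp
#include <cstddef>
#include <vector>

// Assumed dimension: 128, matching the listings above (the posted code used 100).
constexpr std::size_t kSize = 128;

// Hypothetical array-based class: all elements live in one contiguous block.
struct ArrayGrid {
    int cells[kSize * kSize] = {};

    // Two nested loops, the shape the compiler generated for the array case.
    int sum_nested() const {
        int total = 0;
        for (std::size_t row = 0; row < kSize; ++row)
            for (std::size_t col = 0; col < kSize; ++col)
                total += cells[row * kSize + col];
        return total;
    }

    // The "merged" form: one counter running over all kSize * kSize elements.
    int sum_merged() const {
        int total = 0;
        for (std::size_t i = 0; i < kSize * kSize; ++i)
            total += cells[i];
        return total;
    }
};

// Hypothetical vector-based class: every row is its own heap allocation.
struct VectorGrid {
    std::vector<std::vector<int>> rows;

    VectorGrid() : rows(kSize, std::vector<int>(kSize, 0)) {}

    int sum() const {
        int total = 0;
        for (const std::vector<int>& row : rows) {
            const int* p = row.data();  // the extra load per outer iteration
            for (std::size_t col = 0; col < kSize; ++col)
                total += p[col];
        }
        return total;
    }
};
```

Both array variants touch the same memory in the same order, so any speed difference between them comes down to loop overhead only.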
The above code is with "size" set to 128 rather than 100 as in the posted code. I doubt it makes much difference to the overall code generated; I changed it to 128 to see if gcc would do some better code generation. Currently, the MS compiler beats the gcc version by running BOTH loops quicker than one of the gcc functions.
--
Mats