Tomato, tuhmahtoe. I find both methods equally easy to read. In any case, I hacked up a test program, and it seems my original method of sequencing the operations across all 8 of the basic xmm registers actually slows the routine down for some reason: a run took 65 seconds where the FPU version took only 5. I rewrote the code to use only the bare minimum of xmm registers, and now it runs about 15-20% faster than the FPU version. I think the real bottleneck is ultimately memory bandwidth, but a 15% speed increase is still pretty good. I also took matsp's suggestion about putting the pointers into registers, and that seems to have helped as well. I also fixed the bug in the original code where I was assigning values to the pointer registers themselves rather than to the memory they pointed to. Anyway, here is the new code:
Code:
void CNeuron::Process(double* Input , double* Output){
double Scratch = 0.0;
__declspec(align(16)) __m128d Zero = {0.0 , 0.0}; // two packed doubles (__m128d, from <emmintrin.h>), not four floats
__declspec(align(16)) __m128d tempout = {0.0 , 0.0};
double* pInput;
double* pWeight;
DWORD dwCount;
DWORD dwExtra;
pInput = Input;
pWeight = this->Weights;
dwCount = this->dwNumberOfInputs;
dwCount = ((dwCount)/2); // calculate total iterations
dwExtra = this->dwNumberOfInputs - (dwCount * 2);
__asm push eax
__asm push edx
__asm push ecx
__asm push esi
__asm push edi
__asm lea eax , Zero // take the address of Zero; mov would read its contents
__asm mov edx , 0x00000010
__asm mov ecx , dwCount
__asm mov esi , pInput
__asm mov edi , pWeight
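__asm movapd xmm1 , [eax] // xmm1 = {0.0 , 0.0}, the two running sums
__asm test ecx , ecx
__asm jz pairsdone // fewer than 2 inputs: skip, or LOOP would wrap ecx around
theloop:
__asm movapd xmm0 , [esi] // movapd faults unless Input and Weights are 16-byte aligned
__asm add esi , edx
__asm mulpd xmm0 , [edi]
__asm add edi , edx
__asm addpd xmm1 , xmm0
__asm loop theloop
pairsdone: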
__asm mov ecx , dwExtra
__asm movapd tempout , xmm1
__asm fld qword ptr tempout // low double of the packed sum
__asm fld qword ptr tempout+8 // high double of the packed sum
__asm faddp st(1) , st(0)
__asm cmp ecx , 0x00000000
__asm je skiploop
extraloop:
__asm fld qword ptr [esi] // explicit size: the operand is a double
__asm add esi , 8
__asm fld qword ptr [edi]
__asm add edi , 8
__asm fmulp st(1) , st(0)
__asm faddp st(1) , st(0)
__asm loop extraloop
skiploop:
__asm fld1
__asm fpatan
__asm fsin
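__asm mov eax , Output // Output is a pointer, so fetch the address it holds
__asm fstp qword ptr [eax] // store the result there, not over the pointer itself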
__asm pop edi
__asm pop esi
__asm pop ecx
__asm pop edx
__asm pop eax
/*
for(DWORD temp=0;temp<this->dwNumberOfInputs;temp++){
Scratch += Input[temp] * this->Weights[temp];
}
Output[0] = sin(atan(Scratch));
//*/
return;
}
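
For anyone who would rather not maintain inline asm, here is a rough sketch of the same calculation using SSE2 intrinsics. The function name neuron_output and its signature are made up for illustration and are not part of CNeuron. It uses unaligned loads (_mm_loadu_pd), so it does not assume 16-byte-aligned buffers the way movapd does, and it falls back to the plain loop when SSE2 is not available. It has not been timed against the asm version.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define NEURON_HAVE_SSE2 1
#endif

// Hypothetical free-function version of CNeuron::Process.
// Returns sin(atan(sum of input[i] * weight[i])) over count elements.
double neuron_output(const double* input, const double* weight, std::size_t count) {
    assert(count == 0 || (input != nullptr && weight != nullptr));
    double sum = 0.0;
    std::size_t i = 0;
#ifdef NEURON_HAVE_SSE2
    __m128d acc = _mm_setzero_pd();            // two running partial sums
    for (; i + 2 <= count; i += 2) {           // naturally skipped when count < 2
        __m128d in = _mm_loadu_pd(input + i);  // unaligned load of two doubles
        __m128d w  = _mm_loadu_pd(weight + i);
        acc = _mm_add_pd(acc, _mm_mul_pd(in, w));
    }
    double lanes[2];
    _mm_storeu_pd(lanes, acc);                 // spill both partial sums
    sum = lanes[0] + lanes[1];
#endif
    // Leftover element when count is odd, or every element without SSE2.
    for (; i < count; ++i) {
        sum += input[i] * weight[i];
    }
    return std::sin(std::atan(sum));
}
```

The for-loop condition takes care of the "fewer than two inputs" case that the asm version needs an explicit check for, and the compiler handles register allocation, so there is no push/pop bookkeeping.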