I am trying to make the following code run as fast as possible:
Code:
#include <math.h> // for exp

void Foo( double *dest, const double *source, double *vec1, const double *vec2 )
{
  // source and dest are [64], vec1 and vec2 are [64*32]

  for (int j=0; j<64; j++)
  {
    double total = 0;
    static const int n = 32;
    for (int i=0; i<n; i++)
    {
      total += vec1[i] * vec2[i]; // inner (dot) product of two 32-element rows
    }
    total += source[j];
    dest[j] = 1/(1+exp(-total));
    vec1 += n;
    vec2 += n;
  }
}
Note how the i loop is essentially an inner (dot) product between vec1 and vec2; the j loop then adds a constant and applies a logistic curve. That smells like SSE / SIMD. However, I don't get faster results when I set Enhanced Instruction Set to either SSE (/arch:SSE) or SSE2 (/arch:SSE2) in MSVC 8's project settings (release mode, of course). In fact, with SSE2 it even became slower (!)
Changing the floating point model to fast (/fp:fast) doesn't make any noticeable improvement either.
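
For reference, this is roughly what I imagine a hand-written SSE2 version of the inner product would look like. As far as I understand, the /arch switch mostly makes the compiler emit scalar SSE instructions and does not vectorize the loop by itself, so intrinsics may be needed. This is only a sketch under assumptions: Dot32Sse2 is a name I made up, the length is fixed at 32, and I use unaligned loads because I can't guarantee 16-byte alignment of the arrays. I have not measured whether it is actually faster.

```cpp
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical helper: inner product of two 32-element double arrays.
// Assumes nothing about alignment (uses unaligned loads).
static inline double Dot32Sse2(const double *a, const double *b)
{
  // Two independent accumulators, each holding two partial sums.
  __m128d acc0 = _mm_setzero_pd();
  __m128d acc1 = _mm_setzero_pd();
  for (int i = 0; i < 32; i += 4)
  {
    acc0 = _mm_add_pd(acc0, _mm_mul_pd(_mm_loadu_pd(a + i),     _mm_loadu_pd(b + i)));
    acc1 = _mm_add_pd(acc1, _mm_mul_pd(_mm_loadu_pd(a + i + 2), _mm_loadu_pd(b + i + 2)));
  }
  // Combine the accumulators, then add the high lane to the low lane.
  __m128d acc = _mm_add_pd(acc0, acc1);
  __m128d hi  = _mm_unpackhi_pd(acc, acc);
  acc = _mm_add_sd(acc, hi);

  double result;
  _mm_store_sd(&result, acc);
  return result;
}
```

The i loop in Foo would then become total = Dot32Sse2(vec1, vec2). Note that the additions happen in a different order than in the scalar loop, so the last bits of the result may differ.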



Is there any special trickery I can use to optimize this? Loop unrolling, for example?
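
To make clear what I mean by unrolling, here is a plain C++ sketch that unrolls the inner loop by four and keeps four independent running sums, so the additions don't all depend on each other. Dot32Unrolled and FooUnrolled are names I made up for illustration, I made vec1 const here, and I haven't verified that this is faster with my compiler. The summation order changes, so results can differ slightly from the original in the last bits.

```cpp
#include <cmath>

// Hypothetical helper: 32-element inner product, unrolled by four
// with four independent accumulators.
static inline double Dot32Unrolled(const double *a, const double *b)
{
  double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
  for (int i = 0; i < 32; i += 4)
  {
    s0 += a[i]     * b[i];
    s1 += a[i + 1] * b[i + 1];
    s2 += a[i + 2] * b[i + 2];
    s3 += a[i + 3] * b[i + 3];
  }
  return (s0 + s1) + (s2 + s3);
}

// Same contract as Foo: source and dest are [64], vec1 and vec2 are [64*32].
void FooUnrolled(double *dest, const double *source,
                 const double *vec1, const double *vec2)
{
  const int n = 32;
  for (int j = 0; j < 64; j++)
  {
    double total = Dot32Unrolled(vec1 + j * n, vec2 + j * n) + source[j];
    dest[j] = 1.0 / (1.0 + std::exp(-total)); // logistic curve
  }
}
```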

(PS: using floats instead of doubles would also suit me fine, but so far I've found it to be faster with doubles.)