Optimized inner product of floats/doubles? (SSE?)
I am trying to get the following code as fast as possible:
Code:
void Foo( double *dest, const double *source, double *vec1, const double *vec2 )
{
    // source and dest are [64], vec1 and vec2 are [64*32]
    for (int j=0; j<64; j++)
    {
        double total = 0;
        static const int n = 32;
        for (int i=0; i<n; i++)
        {
            total += vec1[i] * vec2[i]; // vector inner product
        }
        total += source[j];
        dest[j] = 1/(1+exp(-total)); // logistic curve
        vec1 += n;
        vec2 += n;
    }
}
Note how the i loop is essentially an inner product between vec1 and vec2 (the outer loop then adds a constant and applies a logistic curve), which smells like SSE / SIMD. However, I don't get faster results when I set Enhanced Instruction Set to either Streaming SIMD Extensions (/arch:SSE) or Streaming SIMD Extensions 2 (/arch:SSE2) in MSVC 8's project settings (release mode, of course). In fact, with SSE2 it even became slower (!)
Changing the floating point model to fast (/fp:fast) doesn't make any noticeable improvement either.
:confused:
Is there any special trickery I can use to optimize this? Anything with loop unrolling, for example?
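To make the unrolling idea concrete, here is a rough sketch of what I have in mind for the inner loop: unroll by four and keep four separate running totals, so the additions don't all wait on a single variable. Dot4 is just a name I made up for illustration, it assumes n is a multiple of 4 (fine for n = 32), and I have not measured whether it actually helps.

```cpp
#include <cstddef>

// Hypothetical helper (not in my original code): inner product of a and b,
// unrolled by four. Assumes n is a multiple of 4, which holds for n = 32.
double Dot4(const double *a, const double *b, std::size_t n)
{
    // Four independent accumulators instead of one running total, so each
    // addition only depends on the previous value of its own accumulator.
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (std::size_t i = 0; i < n; i += 4)
    {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    // The summation order differs from the plain loop, so the result may
    // differ from it in the last bits.
    return (s0 + s1) + (s2 + s3);
}
```

In Foo, the i loop would then become total = Dot4(vec1, vec2, n).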
(PS: using floats instead of doubles would also suit me fine, but so far I noticed it's faster with doubles)
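And for the SSE angle, this is roughly what I imagine a hand-written SSE2 version of the inner product would look like, using the intrinsics from emmintrin.h. Treat it as a sketch only: DotSse2 is a made-up name, it assumes n is a multiple of 4, and it uses unaligned loads because I can't guarantee my arrays are 16-byte aligned. I don't know yet whether it beats the plain loop.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (__m128d holds two doubles)
#include <cstddef>

// Hypothetical helper (not in my original code): inner product of a and b
// with SSE2. Assumes n is a multiple of 4; no alignment is required because
// the loads are unaligned (_mm_loadu_pd).
double DotSse2(const double *a, const double *b, std::size_t n)
{
    // Two vector accumulators, each holding two partial sums.
    __m128d acc0 = _mm_setzero_pd();
    __m128d acc1 = _mm_setzero_pd();
    for (std::size_t i = 0; i < n; i += 4)
    {
        acc0 = _mm_add_pd(acc0, _mm_mul_pd(_mm_loadu_pd(a + i),
                                           _mm_loadu_pd(b + i)));
        acc1 = _mm_add_pd(acc1, _mm_mul_pd(_mm_loadu_pd(a + i + 2),
                                           _mm_loadu_pd(b + i + 2)));
    }
    // Combine the accumulators, then add the high lane to the low lane.
    __m128d sum  = _mm_add_pd(acc0, acc1);
    __m128d high = _mm_unpackhi_pd(sum, sum);
    double result;
    _mm_store_sd(&result, _mm_add_sd(sum, high));
    return result;
}
```

As with any reordering of the sum, the result may differ from the scalar loop in the last bits.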