Optimized inproduct of floats/doubles? (sse?)

05-13-2009
Phuzzillogic

I am trying to get the following code as fast as possible:

Code:

void Foo( double *dest, const double *source, double *vec1, const double *vec2 )
{
    // source and dest are [64], vec1 and vec2 are [64*32]
    for (int j=0; j<64; j++)
    {
        double total = 0;
        static const int n = 32;
        for (int i=0; i<n; i++)
        {
            total += vec1[i] * vec2[i]; // vector inproduct
        }
        total += source[j];
        dest[j] = 1/(1+exp(-total));
        vec1 += n;
        vec2 += n;
    }
}

Note how the i loop essentially does an inproduct between vec1 and vec2 (and then adds a constant and applies a logistic curve), which smells like SSE / SIMD. However, I don't get faster results when I set Enhanced Instruction Set to either Streaming SIMD Extensions (/arch:SSE) or Streaming SIMD Extensions 2 (/arch:SSE2) in msvc8's project settings (release mode of course). In fact with SSE2 it even became slower (!)
Also, changing the floating point model to fast doesn't make any noticeable improvement.

:confused:

Is there any special trickery I can use to optimize this? Anything with unrolling for example.

(PS: using floats instead of doubles would also suit me fine, but so far I noticed it's faster with doubles)
05-13-2009
brewbuck

I haven't had good experiences with MSVC's auto-SIMD. I would expect SIMD would make this quite a bit faster, but you might be stuck coding it yourself using SSE intrinsics.
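
For readers wondering what "coding it yourself using SSE intrinsics" might look like for the inner loop, here is a minimal sketch. It assumes an x86 compiler that provides `<emmintrin.h>` (SSE2) and an even element count; the function name `DotSse2` is made up for illustration and is not from the thread. It uses unaligned loads, so it makes no assumption about how the arrays were allocated.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics (__m128d holds two doubles)

// Hypothetical helper: dot product of two double arrays of even length n.
double DotSse2(const double *a, const double *b, std::size_t n)
{
    assert(n % 2 == 0);                 // two doubles are processed per step
    __m128d acc = _mm_setzero_pd();     // two running partial sums
    for (std::size_t i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);   // unaligned load of a[i], a[i+1]
        __m128d vb = _mm_loadu_pd(b + i);   // unaligned load of b[i], b[i+1]
        acc = _mm_add_pd(acc, _mm_mul_pd(va, vb));
    }
    // Add the upper lane to the lower lane and extract the scalar result.
    __m128d hi = _mm_unpackhi_pd(acc, acc);
    double result;
    _mm_store_sd(&result, _mm_add_sd(acc, hi));
    return result;
}
```

In the original function this would replace the i loop, called as `DotSse2(vec1, vec2, 32)` for each j. Because the partial sums are added in a different order than in the scalar loop, the last bits of the result may differ slightly.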

Care to tell what sort of project this is a part of?

EDIT: If it's for work, and you might be able to drop some cash on it, Intel's IPP library kicks ass.
05-13-2009
phantomotap

O_o

cache miss?

Depending on exactly what you are up to, you may get a performance boost by processing the various vectors--interleaving the data maybe--to be a little more friendly to the processor.

Soma
05-13-2009
iMalc

To get SIMD optimisations to kick in I've read that you need to write code more like this:

Code:

double total[4] = {};
for (int i=0; i<32;)
{
    total[0] += vec1[i] * vec2[i]; ++i; // vector inproduct
    total[1] += vec1[i] * vec2[i]; ++i; // vector inproduct
    total[2] += vec1[i] * vec2[i]; ++i; // vector inproduct
    total[3] += vec1[i] * vec2[i]; ++i; // vector inproduct
}
total[0] += total[1] + total[2] + total[3] + source[j];

Might not be exactly right, but you get the idea.
There is an example somewhere on Microsoft's website of this.
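
The split-accumulator idea above can be wrapped up as a standalone function, sketched below. The name `DotUnrolled4` is hypothetical, and the sketch assumes the length is a multiple of four (true for the 32-element rows in this thread). The point is that the four partial sums do not depend on each other, which gives the compiler and the CPU independent chains of additions to work with. Whether a given compiler actually turns this into SIMD code would have to be checked in the generated assembly.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: dot product using four independent partial sums.
double DotUnrolled4(const double *a, const double *b, std::size_t n)
{
    assert(n % 4 == 0);                 // the loop consumes four elements per pass
    double t0 = 0, t1 = 0, t2 = 0, t3 = 0;
    for (std::size_t i = 0; i < n; i += 4) {
        t0 += a[i]     * b[i];
        t1 += a[i + 1] * b[i + 1];
        t2 += a[i + 2] * b[i + 2];
        t3 += a[i + 3] * b[i + 3];
    }
    // Combine the partial sums once, after the loop.
    return (t0 + t1) + (t2 + t3);
}
```

Note that this adds the products in a different order than the single-accumulator loop, so the result can differ in the last bits.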
05-14-2009
Phuzzillogic

Just realized it's commonly called DOTproduct, not inproduct, sorry for the confusion :)

I'm using all this to pull out some statistics from large amounts of data (typically structured in matrix/vectors). I'm mainly experimenting now and I'll need to repeat these calculations on quite some datasets overnight.

I will look into the unroll/split method and Intel IPP lib. I assume the IPP lib doesn't come with source code? Dunno how flexible that lib is, but I've read that IPP especially performs well on large vectors (much larger than 32) and I do have quite a lot of these vectors but I need that logistics thing (1/exp) after every 32. So guess I'd need some modifications.
05-14-2009
JVene

I've had good luck with inline assembler for SSEn, but then I've only done that for non-portable work, and usually in VC++ - though lately I've had to work this in ARM VFP inline under GCC 4, which isn't as 'friendly' to inline assembler techniques IMO.

I too haven't quite found the 'trick' to reliable SSEn optimization from C/C++ code with any VC compiler settings, and perhaps the only really good way other than inline assembler is the intrinsic function API (I'm not a particular customer candidate for Intel IPP myself).

I can say this, however: If you choose inline assembler as the implementation of a Vector/Matrix/Quaternion class - the results are definitely worth it!
05-14-2009
Salem

> In fact with SSE2 it even became slower
Did you take note of the alignment requirements for the input data?
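
To make the alignment point concrete: the aligned SSE2 load `_mm_load_pd` requires a 16-byte aligned address, while `_mm_loadu_pd` does not. The sketch below assumes `<emmintrin.h>` with `_mm_malloc` / `_mm_free` available (as in MSVC, GCC and Clang on x86); the helper names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics; also provides _mm_malloc / _mm_free

// Hypothetical helper: true if p lies on a 16-byte boundary.
bool IsAligned16(const void *p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: allocate n doubles on a 16-byte boundary.
// Memory obtained this way must be released with FreeAligned.
double *AllocAlignedDoubles(std::size_t n)
{
    return static_cast<double *>(_mm_malloc(n * sizeof(double), 16));
}

void FreeAligned(double *p)
{
    _mm_free(p);
}

// Dot product using aligned loads; both pointers must be 16-byte aligned.
double DotSse2Aligned(const double *a, const double *b, std::size_t n)
{
    assert(n % 2 == 0 && IsAligned16(a) && IsAligned16(b));
    __m128d acc = _mm_setzero_pd();
    for (std::size_t i = 0; i < n; i += 2) {
        acc = _mm_add_pd(acc, _mm_mul_pd(_mm_load_pd(a + i), _mm_load_pd(b + i)));
    }
    __m128d hi = _mm_unpackhi_pd(acc, acc);   // move the upper lane down
    double result;
    _mm_store_sd(&result, _mm_add_sd(acc, hi));
    return result;
}
```

With rows of 32 doubles (256 bytes), every row of vec1 and vec2 stays 16-byte aligned as long as the start of each array is.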

Code:

for (int i=0; i<32; i += 4)
{
    total[0] += vec1[i] * vec2[i];
    total[1] += vec1[i+1] * vec2[i+1];
    total[2] += vec1[i+2] * vec2[i+2];
    total[3] += vec1[i+3] * vec2[i+3];
}

Making use of displacement addressing modes (see Art of Assembly: Chapter Four-2) and saving 3 R-M-W cycles on the index variable.