Thread: Optimized inproduct of floats/doubles? (sse?)

  1. #1
    Registered User
    Join Date
    Apr 2009
    Posts
    8

    Optimized inproduct of floats/doubles? (sse?)

    I am trying to get the following code as fast as possible:
    Code:
    void Foo( double *dest, const double *source, const double *vec1, const double *vec2 )
    {
      // source and dest are [64], vec1 and vec2 are [64*32]
      // (needs #include <cmath> for exp)

      for (int j=0; j<64; j++)
      {
        double total = 0;
        static const int n = 32;
        for (int i=0; i<n; i++)
        {
          total += vec1[i] * vec2[i]; // vector inproduct
        }
        total += source[j];
        dest[j] = 1/(1+exp(-total));
        vec1 += n;
        vec2 += n;
      }
    }
    Note how the i loop essentially does an inproduct between vec1 and vec2 (and adds a constant and applies a logistic curve), which smells like SSE / SIMD. However, I don't get faster results when I set Enhanced Instruction Set to either SSE or SSE2 in msvc8's project settings (release mode, of course). In fact, with SSE2 it even became slower (!)
    Changing the floating point model to Fast doesn't make any noticeable improvement either.



    Is there any special trickery I can use to optimize this? Unrolling, for example?

    (PS: using floats instead of doubles would also suit me fine, but so far I noticed it's faster with doubles)
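    For benchmarking purposes, the routine above can be made into a self-contained scalar baseline like this (a sketch; the function name is mine, not from the original post):

    ```cpp
    #include <cmath>

    // Scalar baseline of the routine above, for comparing against SIMD variants.
    // source and dest point to 64 doubles; vec1 and vec2 to 64*32 doubles.
    void FooScalar(double *dest, const double *source,
                   const double *vec1, const double *vec2)
    {
        static const int n = 32;
        for (int j = 0; j < 64; j++)
        {
            double total = 0;
            for (int i = 0; i < n; i++)
                total += vec1[i] * vec2[i]; // dot product of one 32-element row
            total += source[j];
            dest[j] = 1 / (1 + std::exp(-total)); // logistic curve
            vec1 += n;                            // advance to the next row
            vec2 += n;
        }
    }
    ```

    With all-zero inputs every dot product is 0, so every output should be 1/(1+e^0) = 0.5, which makes a handy sanity check.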

  2. #2
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    I haven't had good experiences with MSVC's auto-SIMD. I would expect SIMD to make this quite a bit faster, but you might be stuck coding it yourself using SSE intrinsics.

    Care to tell what sort of project this is a part of?

    EDIT: If it's for work, and you might be able to drop some cash on it, Intel's IPP library kicks ass.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}
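    For anyone who does go the intrinsics route, a minimal SSE2 sketch of the 32-element dot product might look like this (the function name is mine; it uses unaligned loads so it works on any input, at some cost — with 16-byte-aligned data you could switch to _mm_load_pd):

    ```cpp
    #include <emmintrin.h> // SSE2 intrinsics

    // Dot product of two 32-element double arrays using SSE2,
    // processing two doubles per iteration.
    double DotProduct32(const double *vec1, const double *vec2)
    {
        __m128d sum = _mm_setzero_pd();
        for (int i = 0; i < 32; i += 2)
        {
            __m128d a = _mm_loadu_pd(vec1 + i);      // load two doubles
            __m128d b = _mm_loadu_pd(vec2 + i);
            sum = _mm_add_pd(sum, _mm_mul_pd(a, b)); // multiply-accumulate
        }
        // horizontal add of the two lanes
        double parts[2];
        _mm_storeu_pd(parts, sum);
        return parts[0] + parts[1];
    }
    ```

    Note that the additions happen in a different order than in the scalar loop, so results can differ in the last bits.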

  3. #3
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    O_o

    cache miss?

    Depending on exactly what you are up to, you may get a performance boost by preprocessing the various vectors--interleaving the data, maybe--to be a little more friendly to the processor.

    Soma

  4. #4
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    To get SIMD optimisations to kick in I've read that you need to write code more like this:
    Code:
        double total[4] = {};  // four independent accumulators
        for (int i=0; i<32;)
        {
          total[0] += vec1[i] * vec2[i]; ++i; // vector inproduct
          total[1] += vec1[i] * vec2[i]; ++i;
          total[2] += vec1[i] * vec2[i]; ++i;
          total[3] += vec1[i] * vec2[i]; ++i;
        }
        total[0] += total[1] + total[2] + total[3] + source[j]; // j from the outer loop
    Might not be exactly right, but you get the idea.
    There is an example somewhere on Microsoft's website of this.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"
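    Putting that fragment back into a complete, compilable routine might look like this (my reconstruction of the idea, not code from the thread): the four accumulators carry no dependency on one another, so the multiply-adds within one iteration can proceed in parallel.

    ```cpp
    #include <cmath>

    // Unrolled variant of Foo with four independent accumulators.
    void FooUnrolled(double *dest, const double *source,
                     const double *vec1, const double *vec2)
    {
        for (int j = 0; j < 64; j++)
        {
            double total[4] = {0, 0, 0, 0};
            for (int i = 0; i < 32; i += 4)
            {
                total[0] += vec1[i]     * vec2[i];
                total[1] += vec1[i + 1] * vec2[i + 1];
                total[2] += vec1[i + 2] * vec2[i + 2];
                total[3] += vec1[i + 3] * vec2[i + 3];
            }
            double t = total[0] + total[1] + total[2] + total[3] + source[j];
            dest[j] = 1 / (1 + std::exp(-t));
            vec1 += 32;
            vec2 += 32;
        }
    }
    ```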

  5. #5
    Registered User
    Join Date
    Apr 2009
    Posts
    8
    Just realized it's commonly called a DOT product, not inproduct; sorry for the confusion.

    I'm using all this to pull some statistics out of large amounts of data (typically structured as matrices/vectors). I'm mainly experimenting now, and I'll need to repeat these calculations on quite a few datasets overnight.

    I will look into the unroll/split method and the Intel IPP lib. I assume the IPP lib doesn't come with source code? Dunno how flexible that lib is, but I've read that IPP performs especially well on large vectors (much larger than 32). I do have quite a lot of these vectors, but I need that logistic step (the 1/(1+exp(-x)) part) after every 32, so I guess I'd need some modifications.

  6. #6
    Registered User
    Join Date
    May 2007
    Posts
    147
    I've had good luck with inline assembler for SSEn, but then I've only done that for non-portable work, and usually in VC++ - though lately I've had to work this in ARM VFP inline under GCC 4, which isn't as 'friendly' to inline assembler techniques IMO.

    I too haven't quite found the 'trick' to reliable SSEn optimization from C/C++ code with any VC compiler settings, and perhaps the only really good way other than inline assembler is the intrinsic function API (I'm not a particular customer candidate for Intel IPP myself).

    I can say this, however: if you choose inline assembler as the implementation of a Vector/Matrix/Quaternion class, the results are definitely worth it!

  7. #7
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    > In fact with SSE2 it even became slower
    Did you take note of the alignment requirements for the input data?

    Code:
        for (int i=0; i<32; i += 4)
        {
        	total[0] += vec1[i] * vec2[i];
        	total[1] += vec1[i+1] * vec2[i+1];
        	total[2] += vec1[i+2] * vec2[i+2];
        	total[3] += vec1[i+3] * vec2[i+3];
        }
    This makes use of displacement addressing modes (Art of Assembly: Chapter Four-2) and saves 3 R-M-W updates of the index variable per iteration.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.
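    On that alignment point: the aligned SSE2 loads (movapd / _mm_load_pd) require 16-byte-aligned addresses, and plain new/malloc only guarantees 8-byte alignment for doubles, so misaligned buffers are one classic reason an SSE2 build runs slower (or crashes). A sketch of allocating suitably aligned buffers with _mm_malloc (helper name is mine):

    ```cpp
    #include <emmintrin.h> // brings in _mm_malloc / _mm_free

    // Allocate a 16-byte-aligned buffer of n doubles, suitable for
    // _mm_load_pd / _mm_store_pd. Must be released with _mm_free.
    double *AllocAligned(int n)
    {
        return static_cast<double *>(_mm_malloc(n * sizeof(double), 16));
    }
    ```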

