Faster squaring for a vector of floats

  1. #1
    Registered User
    Join Date
    Mar 2010
    Posts
    43

    Faster squaring for a vector of floats

    I wonder whether there is a faster way (faster than the obvious a=a*a) to square the contents of a vector of floats. I remember that bit shifting could be used somehow, but I'm not sure whether this works for floats. Any suggestions on how to achieve this are welcome (preferably with code). If you are not sure about floats, you may assume unsigned int.
    Thanks

  2. #2
    Registered User
    Join Date
    Jun 2005
    Posts
    6,528
    Bit shifting can be used for multiplication of (unsigned) integer types, but only by powers of two (e.g. a left/right shift by n is equivalent to multiplying/dividing by 2 to the n), so it does not square an arbitrary value. It won't work for floats at all, since they have a different structure (e.g. fields that represent mantissa and exponent). The results of twiddling bits in a floating point type (assuming you do some hackery, as C/C++ don't allow it directly) are specific to your machine's floating point representation.
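
    To make the difference concrete, here is a minimal sketch. The helper names (times_pow2, float_bits, bits_to_float) are made up for illustration, and the float part assumes a 32-bit IEEE 754 float, which the C++ standard does not guarantee.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: multiply an unsigned value by 2^n with a left shift.
std::uint32_t times_pow2(std::uint32_t x, unsigned n)
{
    return x << n;  // valid for n < 32; bits shifted past the top are discarded
}

// Hypothetical helper: copy a float's bytes into an integer to inspect them.
// Assumption: float is a 32-bit IEEE 754 value (not guaranteed by C++).
std::uint32_t float_bits(float f)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Hypothetical helper: the inverse of float_bits.
float bits_to_float(std::uint32_t bits)
{
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

    With these, times_pow2(3, 4) gives 48, whereas bits_to_float(float_bits(3.0f) << 1) is not 6.0f: the shift pushes exponent bits into the sign position, and on an IEEE 754 machine the result is a tiny negative number.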

    Have you actually determined that your code for squaring the elements of your vector is something that will benefit from hand-optimisation? By that, have you run a profiler and found the multiplication occurs in the middle of a tight loop, and determined (through analysis) that multiplication can't be moved out of the loop? If you haven't considered this type of thing, I'm concerned you are doing premature optimisation (spending a lot of time optimising code to achieve a very small real benefit).

  3. #3
    Registered User
    Join Date
    Mar 2010
    Posts
    43
    Actually, I'm working with external memory computations; that is the main reason small improvements could be of great benefit. I do not use a profiler (I don't know how to use one). Do you think using for_each (or transform) with the operation a*=a would be faster? That would work if I wanted to square the contents of a vector in place (changing the original vector), but what if I want to write the results to some other vector? (In that case a*=a would not be an optimization.)
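
    For reference, both variants can be sketched with std::transform as below. The function names are hypothetical, and the lambdas assume a C++11 compiler. There is no particular reason to expect this to beat a plain loop; it would need to be measured.

```cpp
#include <algorithm>
#include <vector>

// Squares every element in place (overwrites v).
void square_in_place(std::vector<float>& v)
{
    std::transform(v.begin(), v.end(), v.begin(),
                   [](float a) { return a * a; });
}

// Writes the squares of src into a new vector, leaving src unchanged.
std::vector<float> squared_copy(const std::vector<float>& src)
{
    std::vector<float> dst(src.size());
    std::transform(src.begin(), src.end(), dst.begin(),
                   [](float a) { return a * a; });
    return dst;
}
```

    The only difference between the two is the output iterator: v.begin() overwrites the source, dst.begin() fills a separate vector.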

  4. #4
    and the hat of wrongness Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    32,752
    How old is your machine?
    The Intel 8086 / 8088/ 80186 / 80286 / 80386 / 80486 Instruction Set
    An actual fmul instruction on anything better than a Pentium (which must be about a decade old now) is the least of your worries.

    > I do not use a profiler (dont know how to use it).
    Is that a question?
    Which compiler do you use?

    If it's gcc (a good choice), then look up 'gprof' and 'gcov'.
    If it's some freebie commercial thing, then you could well be SoL. Profilers tend to be expensive additions.

  5. #5
    Registered User Danielm103
    Join Date
    Jul 2010
    Location
    Shanghai
    Posts
    2
    how about using SIMD?

    a quick hack with VS
    Code:
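    #include "stdafx.h"
    #include <windows.h>   // LARGE_INTEGER, QueryPerformanceCounter
    #include <tchar.h>     // _tmain, _TCHAR
    #include <xmmintrin.h> // __m128, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps
    #include <cstdlib>     // system
    #include <iostream>
    #include <vector>
    
    // SSE version, 4 floats per iteration.
    // 0.0965613 (presumably the time printed by test1, in seconds)
    void sqrmmv(std::vector<float> &vin)
    {
      const size_t n = vin.size();
      size_t i = 0;
      // Unaligned load/store: std::vector does not promise 16-byte alignment
      for (; i + 4 <= n; i += 4)
      {
        __m128 v = _mm_loadu_ps(&vin[i]);
        _mm_storeu_ps(&vin[i], _mm_mul_ps(v, v));
      }
      // Scalar tail for the last size() % 4 elements
      for (; i < n; i++)
      {
        vin[i] = vin[i] * vin[i];
      }
    }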
    
    // Plain scalar version.
    // 0.242307 (presumably the time printed by test2, in seconds)
    void sqrnv(std::vector<float> &vin)
    {
      for(size_t idx = 0 ; idx < vin.size();idx++)
      {
        vin[idx] = vin[idx]*vin[idx];
      }
    }
    
    void test1(void)
    {
      const size_t kCnt = 160000000;
      std::vector<float> vin(kCnt);
    
      for(size_t idx = 0 ; idx < kCnt;idx++)
      {
        vin[idx]= (float)idx;
      }
    
      LARGE_INTEGER freq,start,end;   
      QueryPerformanceFrequency(&freq);
      QueryPerformanceCounter(&start);
    
      sqrmmv(vin);
    
      QueryPerformanceCounter(&end);
      std::wcout << (double)(end.QuadPart-start.QuadPart)/freq.QuadPart << std::endl;
    
      if(kCnt == 16)
      {
        for(size_t idx = 0 ; idx < kCnt;idx++)
        {
          std::wcout <<  " " << vin[idx];
        }
      }
      std::wcout << std::endl;
    }
    
    void test2(void)
    {
      const size_t kCnt = 160000000;
      std::vector<float> vin(kCnt);
    
      for(size_t idx = 0 ; idx < kCnt;idx++)
      {
        vin[idx]= (float)idx;
      }
    
      LARGE_INTEGER freq,start,end;   
      QueryPerformanceFrequency(&freq);
      QueryPerformanceCounter(&start);
    
      sqrnv(vin);
    
      QueryPerformanceCounter(&end);
      std::wcout << (double)(end.QuadPart-start.QuadPart)/freq.QuadPart << std::endl;
    
      if(kCnt == 16)
      {
        for(size_t idx = 0 ; idx < kCnt;idx++)
        {
          std::wcout <<  " " << vin[idx];
        }
      }
      std::wcout << std::endl;
    }
    
    int _tmain(int argc, _TCHAR* argv[])
    {
      test1();
      test2();
      system("pause");
      return 0;
    }
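
    The timing code above is Windows-only. As a rough portable alternative, the same measurement could be sketched with std::chrono. The names seconds_for and square_scalar are hypothetical, and a single run like this is only a coarse indication, not a substitute for a profiler.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: wall-clock seconds taken by one call of fn(data).
template <typename Fn>
double seconds_for(Fn fn, std::vector<float>& data)
{
    const auto start = std::chrono::steady_clock::now();
    fn(data);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Plain scalar squaring, equivalent to sqrnv above.
void square_scalar(std::vector<float>& v)
{
    for (std::size_t i = 0; i < v.size(); ++i)
    {
        v[i] = v[i] * v[i];
    }
}
```

    A call such as seconds_for(square_scalar, vin) would then replace the QueryPerformanceCounter pairs, and the duplicated test1/test2 bodies could share one harness.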

  6. #6
    Super Moderator VirtualAce
    Join Date
    Aug 2001
    Posts
    9,596
    > void sqrmmv(std::vector<float> &vin)
    I believe Visual Studio 2003 and later have intrinsic functions that do much the same. However, in this case I cannot see what benefit there is in doing all of that when we are just talking about simple FMULs, which are lightning fast on current hardware.

  7. #7
    Registered User Danielm103
    Join Date
    Jul 2010
    Location
    Shanghai
    Posts
    2
    Right, the intrinsic is _mm_mul_ps. The benefit is being able to work on 4 floats at a time. Of course, YMMV : )
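
    As a sketch of the same idea for the case raised earlier in the thread (writing the squares to a different vector), something like the following might work. The name square_into and the HAVE_SSE_SKETCH macro are made up for illustration. It assumes an x86 target with SSE and falls back to a scalar loop otherwise; unaligned loads and stores are used because std::vector does not promise 16-byte alignment.

```cpp
#include <cstddef>
#include <vector>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps
#define HAVE_SSE_SKETCH 1
#endif

// Hypothetical helper: writes src[i]*src[i] into dst, leaving src untouched.
void square_into(const std::vector<float>& src, std::vector<float>& dst)
{
    dst.resize(src.size());
    const std::size_t n = src.size();
    std::size_t i = 0;
#ifdef HAVE_SSE_SKETCH
    // Four floats per iteration.
    for (; i + 4 <= n; i += 4)
    {
        __m128 v = _mm_loadu_ps(&src[i]);
        _mm_storeu_ps(&dst[i], _mm_mul_ps(v, v));
    }
#endif
    // Scalar loop for the remaining elements (all of them without SSE).
    for (; i < n; ++i)
    {
        dst[i] = src[i] * src[i];
    }
}
```

    Whether this is actually faster than what the compiler generates for a plain loop depends on the compiler and its optimisation settings, so it should be timed on the real data.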
