# Faster squaring for a vector of floats

• 09-04-2010
onako
Faster squaring for a vector of floats
I wonder whether there is a faster way (faster than the obvious one, a = a*a) to square the contents of a vector of floats. I remember that bit shifting can be used somehow, but I'm not sure whether that works for floats. Any suggestions on how to achieve this are welcome (perhaps with code). If you are not sure about floats, you may assume unsigned int.
Thanks
• 09-04-2010
grumpy
Bit shifting can be used for multiplication of (unsigned) integer types (e.g. a left/right shift by n is equivalent to multiplying/dividing by 2 to the n). It won't work for floats, which have a different structure (e.g. fields that represent the mantissa and exponent). The results of twiddling bits in a floating-point type (assuming you do some hackery, as C/C++ don't allow it directly) are specific to your machine's floating-point representation.

Have you actually determined that your code for squaring the elements of your vector is something that will benefit from hand-optimisation? By that I mean: have you run a profiler and found that the multiplication occurs in the middle of a tight loop, and determined (through analysis) that the multiplication can't be moved out of the loop? If you haven't considered this type of thing, I'm concerned you are doing premature optimisation (spending a lot of time optimising code to achieve a very small real benefit).
• 09-04-2010
onako
Actually, I'm working with external-memory computations; that is the main reason small improvements could be of great benefit. I don't use a profiler (I don't know how to use one). Do you think using for_each (or transform) with the operation a *= a would be faster? If so, it would work if I wanted to square the contents of a vector in place (change the primary vector), but what if I want to transfer the results to some other vector? (a *= a would not be an option then.)
• 09-04-2010
Salem
The Intel 8086 / 8088/ 80186 / 80286 / 80386 / 80486 Instruction Set
An actual fmul instruction on anything better than a Pentium (which must be about a decade old now) is the least of your worries.

> I do not use a profiler (dont know how to use it).
Is that a question?
Which compiler do you use?

If it's gcc (a good choice), then look up 'gprof' and 'gcov'.
If it's some freebie commercial thing, then you could well be SoL. Profilers tend to be expensive additions.
• 09-04-2010
Danielm103

a quick hack with VS
Code:

```
#include "stdafx.h"
#include <windows.h>    // added: LARGE_INTEGER, QueryPerformanceCounter/Frequency
#include <xmmintrin.h>  // added: SSE intrinsics (__m128, _mm_mul_ps)
#include <iostream>
#include <vector>

// SSE version: squares 4 floats per iteration.  Measured: 0.0965613 s
// Note: assumes &vin[0] is 16-byte aligned and vin.size() is a multiple of 4.
void sqrmmv(std::vector<float> &vin)
{
  __m128* pSrc = (__m128*)&vin[0];
  size_t nLoop = vin.size() / 4;
  for (size_t i = 0; i < nLoop; i++)
  {
    *pSrc = _mm_mul_ps(*pSrc, *pSrc);
    pSrc++;
  }
}

// Naive version: one float per iteration.  Measured: 0.242307 s
void sqrnv(std::vector<float> &vin)
{
  for (size_t idx = 0; idx < vin.size(); idx++)
  {
    vin[idx] = vin[idx] * vin[idx];
  }
}

void test1(void)
{
  const size_t kCnt = 160000000;
  std::vector<float> vin(kCnt);
  for (size_t idx = 0; idx < kCnt; idx++)
  {
    vin[idx] = (float)idx;
  }
  LARGE_INTEGER freq, start, end;
  QueryPerformanceFrequency(&freq);
  QueryPerformanceCounter(&start);
  sqrmmv(vin);
  QueryPerformanceCounter(&end);
  std::wcout << (double)(end.QuadPart - start.QuadPart) / freq.QuadPart << std::endl;
  if (kCnt == 16)  // only print the results for a small sanity-check run
  {
    for (size_t idx = 0; idx < kCnt; idx++)
    {
      std::wcout << " " << vin[idx];
    }
  }
  std::wcout << std::endl;
}

void test2(void)
{
  const size_t kCnt = 160000000;
  std::vector<float> vin(kCnt);
  for (size_t idx = 0; idx < kCnt; idx++)
  {
    vin[idx] = (float)idx;
  }
  LARGE_INTEGER freq, start, end;
  QueryPerformanceFrequency(&freq);
  QueryPerformanceCounter(&start);
  sqrnv(vin);
  QueryPerformanceCounter(&end);
  std::wcout << (double)(end.QuadPart - start.QuadPart) / freq.QuadPart << std::endl;
  if (kCnt == 16)  // only print the results for a small sanity-check run
  {
    for (size_t idx = 0; idx < kCnt; idx++)
    {
      std::wcout << " " << vin[idx];
    }
  }
  std::wcout << std::endl;
}

int _tmain(int argc, _TCHAR* argv[])
{
  test1();
  test2();
  system("pause");
  return 0;
}
```
• 09-04-2010
VirtualAce
Quote:

void sqrmmv(std::vector<float> &vin)
I believe Visual Studio 2003+ has intrinsic functions that do much the same. However, in this case I cannot see what benefit there is to doing all of that when we are just talking about simple FMULs, which are lightning fast on current hardware.
• 09-04-2010
Danielm103
Right, the intrinsic is _mm_mul_ps; the benefit is being able to work on 4 floats at a time. Of course, ymmv : )