# Thread: Faster squaring for a vector of floats

1. I wonder whether there is a faster way (faster than the obvious one, a=a*a) to square the contents of a vector of floats. I remember that bit shifting could be used somehow, but I'm not sure whether it works for floats. Any suggestions on how to achieve this are welcome (ideally with code). If you are not sure about floats, you may assume unsigned int.
Thanks

2. Bit shifting can be used for multiplication of (unsigned) integer types (e.g. a left or right shift by n is equivalent to multiplying or dividing by 2 to the power n). It won't work for floats, which have a different structure (e.g. fields that represent the mantissa and exponent). The results of twiddling bits in a floating point type (assuming you do some hackery, as C/C++ don't allow it directly) are specific to your machine's floating point representation.
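To make the point above concrete, here is a minimal sketch. The helper names (`times_pow2`, `div_pow2`, `float_bits`, `bits_to_float`) are made up for illustration, and the float helpers assume the common 32-bit IEEE 754 single-precision layout. Note that a shift only ever multiplies by a power of two, so even for integers it is not a general way to square a value.

```cpp
#include <cstdint>
#include <cstring>

// Multiply an unsigned value by 2^n with a left shift (n must be < 32).
std::uint32_t times_pow2(std::uint32_t x, unsigned n) { return x << n; }

// Divide an unsigned value by 2^n with a right shift (n must be < 32).
std::uint32_t div_pow2(std::uint32_t x, unsigned n) { return x >> n; }

// Copy the bits of a float into an integer so they can be inspected.
// Assumption: float is 32 bits wide (checked at compile time).
std::uint32_t float_bits(float f)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Inverse of float_bits: reinterpret 32 integer bits as a float.
float bits_to_float(std::uint32_t bits)
{
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

Under that layout, `float_bits(1.0f)` is `0x3F800000` and `float_bits(2.0f)` is `0x40000000`: doubling the value changes the exponent field rather than shifting the whole bit pattern, so `bits_to_float(float_bits(3.0f) << 1)` does not give 6.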

Have you actually determined that your code for squaring the elements of your vector is something that will benefit from hand-optimisation? That is, have you run a profiler, found that the multiplication occurs in the middle of a tight loop, and determined (through analysis) that the multiplication can't be moved out of the loop? If you haven't considered this type of thing, I'm concerned you are doing premature optimisation (spending a lot of time optimising code to achieve a very small real benefit).
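Short of running a real profiler, a crude first check is to time the squaring loop on its own and compare that with the total run time. A minimal sketch, assuming C++11 `std::chrono`; the names `square_all` and `time_square_all` are made up for illustration:

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Square every element in place (the operation under discussion).
void square_all(std::vector<float>& v)
{
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = v[i] * v[i];
}

// Return the wall-clock time, in seconds, taken by one call to square_all.
double time_square_all(std::vector<float>& v)
{
    const auto start = std::chrono::steady_clock::now();
    square_all(v);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

If the figure returned here is a tiny fraction of the whole run, hand-optimising the multiplication is unlikely to pay off. This is only a stopwatch, not a substitute for a profiler.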

3. Actually, I'm working with external memory computations; that is the main reason small improvements could be of great benefit. I do not use a profiler (I don't know how to use one). Do you think using for_each (or transform) with the operation a*=a would be faster? If so, that would help if I wanted to square the contents of a vector in place (changing the original vector), but what if I want to write the results to some other vector (a*=a would not be an optimization there)?
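For the two cases asked about here, a sketch using `std::transform` might look like the following. The function names `square_in_place` and `square_into` are made up for illustration, and a C++11 compiler is assumed for the lambdas. Whether this is any faster than a hand-written loop is something only a measurement can tell.

```cpp
#include <algorithm>
#include <vector>

// Square every element, overwriting the original vector.
void square_in_place(std::vector<float>& v)
{
    std::transform(v.begin(), v.end(), v.begin(),
                   [](float a) { return a * a; });
}

// Write the squares into a second vector, leaving the source untouched.
void square_into(const std::vector<float>& src, std::vector<float>& dst)
{
    dst.resize(src.size());  // make room for one result per input element
    std::transform(src.begin(), src.end(), dst.begin(),
                   [](float a) { return a * a; });
}
```

The only difference between the two cases is the output iterator: `v.begin()` for the in-place version, `dst.begin()` for the copy.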

4. How old is your machine?
The Intel 8086 / 8088 / 80186 / 80286 / 80386 / 80486 Instruction Set
An actual fmul instruction on anything better than a Pentium (which must be about a decade old now) is the least of your worries.

> I do not use a profiler (dont know how to use it).
Is that a question?
Which compiler do you use?

If it's gcc (a good choice), then look up 'gprof' and 'gcov'.
If it's some freebie commercial thing, then you could well be SoL. Profilers tend to be expensive additions.

5. A quick hack with VS:
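Code:
```cpp
#include "stdafx.h"
#include <windows.h>    // LARGE_INTEGER, QueryPerformanceCounter, QueryPerformanceFrequency
#include <tchar.h>      // _tmain, _TCHAR
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_mul_ps
#include <cstdlib>      // system
#include <iostream>
#include <vector>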

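//0.0965613
void sqrmmv(std::vector<float> &vin)
{
    if (vin.empty())
        return;

    float* p = &vin[0];
    const size_t nLoop = vin.size() / 4;   // number of full 4-float groups

    // Square four floats per iteration. Unaligned load/store is used because
    // std::vector<float> does not guarantee 16-byte alignment.
    for (size_t i = 0; i < nLoop; i++, p += 4)
    {
        const __m128 v = _mm_loadu_ps(p);
        _mm_storeu_ps(p, _mm_mul_ps(v, v));
    }

    // Square the 0-3 leftover elements when size() is not a multiple of 4.
    for (size_t idx = nLoop * 4; idx < vin.size(); idx++)
    {
        vin[idx] = vin[idx] * vin[idx];
    }
}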

//0.242307
void sqrnv(std::vector<float> &vin)
{
    for (size_t idx = 0; idx < vin.size(); idx++)
    {
        vin[idx] = vin[idx] * vin[idx];
    }
}

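void test1(void)
{
    const size_t kCnt = 160000000;
    std::vector<float> vin(kCnt);

    for (size_t idx = 0; idx < kCnt; idx++)
    {
        vin[idx] = (float)idx;
    }

    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);

    sqrmmv(vin);

    QueryPerformanceCounter(&end);

    // Report the elapsed time in seconds (ticks divided by ticks per second).
    std::wcout << L"sqrmmv: "
               << (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart
               << L" s" << std::endl;

    // Dump the results only when running with a tiny test size.
    if (kCnt == 16)
    {
        for (size_t idx = 0; idx < kCnt; idx++)
        {
            std::wcout << L" " << vin[idx];
        }
    }
    std::wcout << std::endl;
}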

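void test2(void)
{
    const size_t kCnt = 160000000;
    std::vector<float> vin(kCnt);

    for (size_t idx = 0; idx < kCnt; idx++)
    {
        vin[idx] = (float)idx;
    }

    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);

    sqrnv(vin);

    QueryPerformanceCounter(&end);

    // Report the elapsed time in seconds (ticks divided by ticks per second).
    std::wcout << L"sqrnv: "
               << (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart
               << L" s" << std::endl;

    // Dump the results only when running with a tiny test size.
    if (kCnt == 16)
    {
        for (size_t idx = 0; idx < kCnt; idx++)
        {
            std::wcout << L" " << vin[idx];
        }
    }
    std::wcout << std::endl;
}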

int _tmain(int argc, _TCHAR* argv[])
{
    test1();
    test2();
    system("pause");
    return 0;
}
```

6. > void sqrmmv(std::vector<float> &vin)

I believe Visual Studio 2003 and later has intrinsic functions that do much the same. However, in this case I cannot see what benefit there is to doing all of that when we are just talking about simple FMULs, which are lightning fast on current hardware.

7. Right, the intrinsic is _mm_mul_ps; the benefit is being able to work on 4 floats at a time. Of course, YMMV :)
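To illustrate what "4 floats at a time" means, here is a minimal sketch, assuming an x86 target with SSE available. The function name `square4` is made up for illustration.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps

// Square exactly four consecutive floats with a single SSE multiply.
// Unaligned load/store is used so the pointer needs no special alignment.
void square4(float* p)
{
    const __m128 v = _mm_loadu_ps(p);     // load p[0..3] into one 128-bit register
    _mm_storeu_ps(p, _mm_mul_ps(v, v));   // p[i] = p[i] * p[i] for i = 0..3
}
```

Calling `square4` on an array holding 1, 2, 3, 4 leaves 1, 4, 9, 16 in it. Whether this beats the plain loop in practice depends on the compiler (which may vectorise the loop itself) and on how much of the run time is memory traffic rather than arithmetic.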