One thing I did notice about SIMD is that it requires full vectorization (all four inputs) to execute efficiently.
For example, using
Code:
float fInput1[3] = {30.3F, 100.0F, 140.1F};
will cause the execution time of the _mm_load_ps and _mm_store_ps calls to spike, increasing the overall time to compute the square root. Both calls always move four floats (16 bytes) at a time, so a three-element array is one float short of what they access.
So, to eliminate this spike, pad the input with a fourth element when you need the square root of three floats:
Code:
float fInput2[4] = {30.3F, 100.0F, 140.1F, 0.0F};
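
For what it's worth, here is a minimal sketch of how the padded version could look end to end. It assumes a C11 compiler and an x86 target with SSE; the helper name sqrt_up_to_4 and the scratch buffer buf are made up for illustration. Note that _mm_load_ps and _mm_store_ps also expect a 16-byte aligned address, so the sketch aligns the buffer as well as padding it.
Code:
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_sqrt_ps, _mm_store_ps */

/* Hypothetical helper: square roots of up to four floats in one SSE op. */
static void sqrt_up_to_4(const float *in, float *out, size_t n)
{
    /* Padded to four floats and 16-byte aligned, as the aligned
       load/store intrinsics require. */
    _Alignas(16) float buf[4] = {0.0F, 0.0F, 0.0F, 0.0F};

    assert(n <= 4);
    memcpy(buf, in, n * sizeof(float));   /* unused lanes stay 0.0F */

    __m128 v = _mm_load_ps(buf);          /* aligned load of four floats */
    v = _mm_sqrt_ps(v);                   /* four square roots at once */
    _mm_store_ps(buf, v);                 /* aligned store of four floats */

    memcpy(out, buf, n * sizeof(float));  /* copy back only the real results */
}
```
The copies in and out are only there to keep the sketch safe for any caller. If you control the declaration, declaring the input as an aligned four-element array in the first place (like fInput2 above) avoids them.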