Originally Posted by matsp
In my experience, SSE intrinsics in Visual Studio produce pretty poor code compared to inline assembler, so I wouldn't necessarily take that as a good indication of "true SSE performance".
Carmack's approximation seems interesting, as it's basically 2 iterations of the customary loop. I wonder how well it performs on a larger range of numbers. It also "messes up" the floating point & integer units, as it is overlaying FPU data with integer data to do integer subtraction on it. It's a bad idea to do that unless absolutely necessary, since it forces the processor to sync the FPU with the integer unit - normally the integer unit operates independently of the FPU, and both units "prefer" to work independently.
Being a high-level code jockey, I have to admit the above flew right over my head.
In general, SIMD operations are only "meaningful" if there is a complete set of data.
--
Mats
One thing I did notice about SIMD is that it needs a full vector (all four float inputs) to execute efficiently.
For example, using
Code:
float fInput1[3] = {30.3F, 100.0F, 140.1F};
will cause the execution time of the _mm_load_ps and _mm_store_ps calls to spike, which increases the overall time of the square-root computation. These intrinsics always move four packed floats from a 16-byte-aligned address, so a three-element array doesn't satisfy their requirements.
So, to eliminate this spike, pad the array out to four elements when computing the square roots of three floats:
Code:
float fInput2[4] = {30.3F, 100.0F, 140.1F, 0.0F};