Originally Posted by matsp
In my experience, SSE intrinsics in Visual Studio produce pretty poor code compared to inline assembler, so I wouldn't necessarily take that as a good indication of "true SSE performance".
Carmack's approximation seems interesting, as it's basically 2 iterations of the customary loop. I wonder how well it performs on a larger range of numbers. It also "messes up" the floating point & integer units, as it is overlaying FPU data with integer data to do integer subtraction on it. It's a bad idea to do that unless absolutely necessary, since it forces the processor to sync the FPU with the integer unit - normally the integer unit operates independently of the FPU, and both units "prefer" to work independently.
Being a high-level code jockey, I have to admit the above flew right over my head.
In general, SIMD operations are only "meaningful" if there is a complete set of data.
--
Mats
One thing I did notice about SIMD is that it needs a full vector (all four float inputs) to execute efficiently.
For example, using
Code:
float fInput1[3] = {30.3F, 100.0F, 140.1F};
will cause the execution time of the _mm_load_ps and _mm_store_ps calls to spike, which increases the overall time of the square-root computation. These intrinsics always move four packed floats from a 16-byte-aligned address, so a three-element array doesn't satisfy their requirements.
So, to eliminate this spike, pad the array out to four elements when computing the square roots of three floats:
Code:
float fInput2[4] = {30.3F, 100.0F, 140.1F, 0.0F};