code to find th square root of a number

Printable View

11-03-2008
CODY21

code to find th square root of a number

I'm a newbie and I'm writing a program preform Pythagoras’ theorem but I cant work out the code to square a number.
If anyone out there knows it please help me.
Thanks
11-03-2008
Akkernight

Ehh... I think it's in math.h header, and I think it was something like... sqroot(), but probably not, try searching "Square Root in C++" or something on Google ;)
11-03-2008
EVOEx

It's sqrt for doubles, sqrtf for a double and sqrtl for a long double...
And they reside in the "cmath" header.
11-03-2008
matsp

Quote:

Originally Posted by EVOEx

It's sqrt for doubles, sqrtf for a double and sqrtl for a long double...
And they reside in the "cmath" header.

Correction: sqrtf() would be the function for float.

Although in C++ I think there are wrappers to handle float, double and long double under a single name.

--
Mats
11-03-2008
C_ntua

Code:

template<class T) class T mySqrt(T x, double precision) { if (x <= 0) return 0; T r = x; while (x - x/r > precision) r = (r + x/r) / 2; return r; }

or for obvious reasons just use the ready sqrt() in math.h which handles double, long double and float http://www.cplusplus.com/reference/c...math/sqrt.html
11-03-2008
abachler

or...

Code:

double Number; __asm { FLD Number FSQRT Number FSTP Number }

which is far faster than calling the sqrt() function
11-03-2008
master5001

I find the hypot() function useful. By the way, depending on what you are doing, its usually optimal to NOT square root your result and just always work with (x^2). It makes collision detection algorithms work oh so much faster.
11-05-2008
Saimadhav

both the functions are in cmath header file.
to find the squre root

Code:

cout<<sqrt(variable);

to find the square

Code:

cout<<pow(a,2);
11-05-2008
g4j31a5

Sorry, a little OOT. After some thorough browsing, I found this snippet:

Code:

float InvSqrt (float x) { float xhalf = 0.5f*x; int i = *(int*)&x; i = 0x5f3759df - (i>>1); x = *(float*)&i; x = x*(1.5f - xhalf*x*x); return x; }

FYI, it is Quake's inverse square root. But I sure can't understand it clearly. Especially the " i = 0x5f3759df - (i>>1);" part. I think it's much faster than the cmath's version.

EDIT: My mistake. You needed square and square root. Not inverse square root. :D
11-05-2008
BobS0327

Quote:

But I sure can't understand it clearly. Especially the " i = 0x5f3759df - (i>>1);"

0x5F3759DF
11-05-2008
Raigne

That was very insightful, and surprising to me.
11-05-2008
master5001

Very interesting, Bob. Where on earth did you dig that up from?
11-05-2008
BobS0327

Quote:

Originally Posted by master5001

Very interesting, Bob. Where on earth did you dig that up from?

From Chris Lomont's home page. Unfortunately, the link to his papers from his home page has some isues at times. So, I linked directly to the PDF.
11-05-2008
CornedBee

Quote:

Originally Posted by abachler

or...

Code:

double Number; __asm { FLD Number FSQRT Number FSTP Number }

which is far faster than calling the sqrt() function

And far less portable. Not to mention that, with the right switches (e.g. -ffast-math in GCC), your compiler might just optimize the sqrt call to the FSQRT instruction anyway. Which wouldn't break the compiler's register management code like the ASM snippet does. Or, in a loop, the compiler might parallelize a sqrt algorithm into a SIMD implementation with four times the throughput of the x87 (and a non-braindead architecture).
11-05-2008
g4j31a5

Quote:

Originally Posted by BobS0327

0x5F3759DF

Thanks. :)
11-05-2008
A10

Quote:

Originally Posted by CornedBee

Or, in a loop, the compiler might parallelize a sqrt algorithm into a SIMD implementation with four times the throughput of the x87 (and a non-braindead architecture).

What on earth does that sentance mean o_0
11-05-2008
BobS0327

Quote:

Originally Posted by A10

What on earth does that sentance mean o_0

SIMD is an acronym for Single Instruction Multiple Data. This is a "feature" that is found on performance workstations such as the IBM PowerPc G4 and higher. Primarily used for intensive real time graphics stuff etc. I believe Intel has platform extensions to do the same thing.

But anyway, the compiler doesn't "parallelize" anything into a SIMD implementation. The "parallelizing" is part of SIMD. That is, the MD part as in Multiple Data. The OS is executing one instruction as in SI (Single Instruction) but is working on Multiple Data, or working on all the data in parallel simultaneously to accomplish the end task.
11-06-2008
abachler

It is in fact parallelizing the operation, the SIMD opcodes are decoded into multiple instructions. On Intel processors it is executed in parallel, while on AMD prcodessors it performs multiple operations, since the AMD does not have multiple ALU's.

Oh, and its not a feature found on performance workstations. Its a feature found on all modern processors since the Pentium Pro. Unless of course you think the Pentium Pro is a performance processor.
11-06-2008
BobS0327

Quote:

Originally Posted by abachler

It is in fact parallelizing the operation, the SIMD opcodes are decoded into multiple instructions. On Intel processors it is executed in parallel, while on AMD prcodessors it performs multiple operations, since the AMD does not have multiple ALU's.

Oh, and its not a feature found on performance workstations. Its a feature found on all modern processors since the Pentium Pro. Unless of course you think the Pentium Pro is a performance processor.

Well, I have absolutely no experience writing this type of code since this stuff is always, without exception, contracted out to computer scientists for the obvious O&E reasons. So, it's quite possible that I may have misinterpreted our current contract computer scientist's explanation of this issue. After all, I don't have a PHD after my name which makes it all the more difficult to understand these folks.

But anyway, would you have any third party information to support the following statement such as a published paper to support this statement?

Quote:

It is in fact parallelizing the operation, the SIMD opcodes are decoded into multiple instructions.
11-06-2008
CornedBee

Quote:

Originally Posted by abachler

It is in fact parallelizing the operation, the SIMD opcodes are decoded into multiple instructions. On Intel processors it is executed in parallel, while on AMD prcodessors it performs multiple operations, since the AMD does not have multiple ALU's.

The AMD CPUs do work on multiple data in parallel. With the K10, they will match the parallelization of the Core2, i.e. 2 128-bit execution engines.
11-06-2008
matsp

The older types of AMD processors has an FPU that can handle 64 bits per operation, just like the early P4 processors did. However, the older AMD processor also (like the newer ones) has two separate 64-bit Floating Point units, so you can, given the right mix of instructions, execute two instructions at a time. Yes, both instructions will take two clock-cycles to complete. Oh, and there is a separate unit for loading/storing data, so a load or store operation (MOV) will not occupy space in the two FPU's. These FPU's are also used for x87 operations.

I don't have a PhD either, but I did work for AMD in the past, and I have a fair understanding of the Processor architecture, as I spent quite some time optimizing various bits of customer or "partner" code.

--
Mats
11-06-2008
BobS0327

Maybe I can't make the connection between higher level coding and what's happening on the lower chip level.

The topic has intrigued me. So, I started reviewing a lot of parallel computing books on Safari from the perspective of writing POC code.

But getting back to SIMD and parallel computing, the books indicate that one instruction (calculation) is executed for multiple data. For example, let's take a Xeon dual processor and the square root example. In this case, the same one instruction would be executed on both processors but with different data. In other words, I could calculate the square root of 24 and 49 simultaneously. What throws me off is the above reference to multiple instructions being executed. The books reinforce what our "Einstein" explained to me.

I'm not a low level type. So, I just can't make the connection between what's in the books and whats happening at chip level.
11-06-2008
matsp

On a dual core processor, the two cores (generally speaking) execute different threads or processses. With SIMD, you have ONE processor perform the same operation (say add) on two or more pieces of the same type of data. So we could do:

Code:

a += b; c += d;

as four instructions (whcih can be further optimized by keepine result in a register if we continue calculating with a and c later one):

Code:

load a and c into register1 load b and d into register2 add register2 to register1 store a and c

This happens in a "two-lane" ALU - just like with big roads where multiple cars can travel in parallel side by side, the ALU takes the two sets of values in, adds the values, and outputs the sum.

Oh, and modern processor are also "superscalar", whcih essentially means that there are several units that can each execute instructions, independently of each other. So if we take the above C-code and compile it without SIMD instructions, the two additions could be executed on two different ALU's in parallel.

Two cores is more like having two distinct roads that don't necessarily reach the same destination.

Edit: And by the way, if we load 25 (or 24) and 49 into a SSE register as "double" values, then it can use a SQRTPD to calculate squre root of both those values at once.

--
Mats
11-07-2008
BobS0327

Ahhh, I got it. What is being discussed in this thread is referred to as SIMD intrinsics. It's not the same as SIMD as found in today's parallel computing paradigm. Well, on second thought, this could be considered the "Model T" of parallel computing. I found the following description in a footnote:

Quote:

SIMD intrinsincs are primitive functions to parallelize data processing on the CPU register level. For example, the addition of two unsigned char will take the whole register size, although the size of this data type is just 8-bit, leaving 24-bit in the register to be filled with 0 and wasted. Using SIMD (such as MMX), we can load 8 unsigned chars (or 4 shorts or 2 integers) to be executed in parallel on the register level. Available in Visual Studio 2005 using SIMD intrinsics or with Visual C++ Processor Pack with Visual C++ 6.0.

I was trying to make a direct one to one connection between a parallel computing concept and a specific low level technique which I now realize is impossible to do. So, in the context of this thread, there isn't any performance to be gained if only the square root of one number is needed using intrinsic SIMD. Performance is improved if the square root of more than one number is required. SIMD intrinsics is just one instruction with one or more data items.
11-07-2008
CornedBee

Not quite. We're talking about SIMD instructions. SIMD intrinsics are special functions that the compiler recognizes and replaces by the appropriate instructions, so that the programmer doesn't have to resort to inline assembly.
11-07-2008
BobS0327

Quote:

Originally Posted by CornedBee

Not quite. We're talking about SIMD instructions. SIMD intrinsics are special functions that the compiler recognizes and replaces by the appropriate instructions, so that the programmer doesn't have to resort to inline assembly.

Well, I'm intrigued by the performance of SIMD at this low level. This is new territory for me since I just discovered SIMD at this level the other day. So, let me put forth a scenario:

Let us assume that we have 128 scalars, floats for example and we wrote a performance profiling app. This app would time the computation of the square root of each scalar using the SIMD instructions and also time the calculation of the sqare root for each scalar using the "plain vanilla" SISD sqrt method. Now, the key here is that both the sqrt method and the SIMD method will calculate the sqare root of only ONE scalar at a time. The SIMD is NOT allowed to "pipeline" calculate all 128 scalars at once. Under this scenario, which method would offer the optimal performance in terms of minimal processing time, the SIMD instructions or the SISD sqrt method?
11-08-2008
CornedBee

Quote:

Now, the key here is that both the sqrt method and the SIMD method will calculate the sqare root of only ONE scalar at a time.

Why only one? That defeats the purpose of SIMD instructions.
11-08-2008
matsp

The number of cycles PER Square root calculation is slightly lower in SSE2 on AMD processors compared to the FPU version - about 15 cycles or so better than the x87 operations.

On the most recent AMD processors, 2 sqrt in parallel takes exactly the same amount of time as one single sqrt in SSE (24 clock-cycles, double precision and 18 cycles single precision), whilst using the older takes 19, 27 or 35 cycles depending on size of the operands (float, double or long double).

I don't know how this compares to Intels numbers, and I couldn't find their optimization guides with two attempts in google (only really old ones, which is no good for latest versions of processors).

--
Mats
11-08-2008
brewbuck

Quote:

Originally Posted by BobS0327

Under this scenario, which method would offer the optimal performance in terms of minimal processing time, the SIMD instructions or the SISD sqrt method?

So the question is, which component can physically do square roots faster? The answer is a moving target given how the processors are continually evolving.
11-09-2008
BobS0327

And the winner is....

Well, this was mostly a learning experience for me. But I thought that I would post some code that I ran in profiler for a few hours to determine the performance of Abachler's suggestion, a modified version of Carmack's inverse sqare root function that was previoulsy posted, the sqrt function and the SIMD method.

I've found that ABachler's suggestion is the fastest with Carmack and SIMD for the most part tied for second and sqrt coming in dead last. I put the posted code in a loop for a few hours in a profiler. The profiler was checking the total time each option was taking to calculate the square root of four scalars. I used four scalars because this is where SIMD is really effficient, giving the best average calculation time. SIMD is roughly twice as fast as the sqrt method when calculating the square root of four scalars. At three scalars the performance diminishes slightly. At one or two scalars, you're probably better off using sqrt. When SIMD calculates the square root of two scalars, the probabilty that sqrt function would be faster is around 60% Thus, a two scalar calculation by SIMD would be faster 4 out of 10 times over sqrt.

You'll obviously have to modify the code if you want to profile it. At the very minimum, you'll need a loop.

But here's the code for anybody to evaluate by running it thru their own profiler and/or question it.

Code:

#include <xmmintrin.h> #include <stdio.h> #include <math.h> __m128 m128Input, m128Output; float fInput2[4] = {30.3F, 100.0F, 140.1F, 400.65F}; float fOutput2[4] = {0.0F, 0.0F, 0.0F, 0.0F}; float fInput1[1] = {30.3F}; float fOutput1[1] = {0.0F}; float CarmackSqrt(float fInput) { long lTemp; float fX, fY; const float iFloat = 1.5F; fX = fInput * 0.5F; fY = fInput; lTemp = * ( long * ) &fY; lTemp = 0x5f3759df - ( lTemp >> 1 ); fY = * ( float * ) &lTemp; fY = fY * ( iFloat - ( fX * fY * fY ) ); fY = fY * ( iFloat - ( fX * fY * fY ) ); return fInput * fY; } float ABachlerSqrt(float iFloat) { __asm fld iFloat __asm fsqrt __asm fstp iFloat return iFloat; } int main(void) { int iIndex; printf("SQRT %.2f\n", sqrt(fInput1[0])); m128Input = _mm_load_ps(fInput1); m128Output = _mm_sqrt_ps(m128Input); _mm_store_ps(fOutput1,m128Output); printf("SIMD %.2f\n\n",fOutput1[0]); printf("SQRT %.2f\nSIMD ", sqrt(fInput2[0])); m128Input = _mm_load_ps(fInput2); m128Output = _mm_sqrt_ps(m128Input); _mm_store_ps(fOutput2,m128Output); for(iIndex = 0; iIndex <4; iIndex++) printf(" %.2f",fOutput2[iIndex]); printf("\nABachler %f\n",ABachlerSqrt(fInput1[0])); printf("Carmack %f\n",CarmackSqrt(fInput1[0])); return 0; }
11-10-2008
matsp

In my experience, SSE intrinsics in Visual Studio produce pretty poor code compared to inline assembler, so I wouldn't necessarily take that as a good indication of "true SSE performance".

Carmacks approximation seems interesting, as it's basically 2 iterations of the customary loop. I wonder how well it performs on a larger range of numbers. It also "messes up" the floating point & integer units, as it is overlaying FPU data with integer data to do integer subtraction of it. It's a bad idea to do that unless absolutely necessary, since it causes the processor to have to sync the FPU with the integer unit - normally the integer unit will operate independently of the FPU, and both units will "prefer" to work independently.

In general, SIMD operations is only "meaningful" if there is a complete set of data.

--
Mats
11-10-2008
BobS0327

Quote:

Originally Posted by matsp

In my experience, SSE intrinsics in Visual Studio produce pretty poor code compared to inline assembler, so I wouldn't necessarily take that as a good indication of "true SSE performance".

Carmacks approximation seems interesting, as it's basically 2 iterations of the customary loop. I wonder how well it performs on a larger range of numbers. It also "messes up" the floating point & integer units, as it is overlaying FPU data with integer data to do integer subtraction of it. It's a bad idea to do that unless absolutely necessary, since it causes the processor to have to sync the FPU with the integer unit - normally the integer unit will operate independently of the FPU, and both units will "prefer" to work independently.

Being a high level code jockey, the above just flew over my head.

Quote:

In general, SIMD operations is only "meaningful" if there is a complete set of data.

--
Mats

One thing I did notice about SIMD, is that it requires full vectorization (Needs all four inputs) for it to execute efficiently.

For example, using

Code:

float fInput1[3] = {30.3F, 100.0F, 140.1F};

will cause the execution time of _mm_load_ps and _mm_store_ps functions to spike, thus increasing the overall time for computation of the square root.

So, to eliminate this spike, you have to submit the following for square root calculation of three floats:

Code:

float fInput2[4] = {30.3F, 100.0F, 140.1F, 0.0F};
11-10-2008
matsp

Quote:

Originally Posted by BobS0327

Being a high level code jockey, the above just flew over my head.

Basically: If you take the address of a float, and then process the data in that float as an integer, and then put it back into a float, the processor will shout "Hey, FPU, hang on a bit, can you STOP after THIS instruciton, and don't move finger until I say so" (or worse, the two processor units doing integer and float don't realize they are working on the same data until later on, and one of them has to "back up" - a bit like a pit-stop in Formula One or Indy-car where the driver doesn't stop in time, and has to go back a bit - which usually causes a big extra delay).

Code:

lTemp = * ( long * ) &fY; lTemp = 0x5f3759df - ( lTemp >> 1 );

is the relevant code.

On older processors, things didn't happen much in parallel, so there was less of a problem with this style of code. Modern processors definitely execute a lot of instructions in parallel, and there's some pretty complicated logic to prevent one or the other unit from getting it wrong when overlapping work between two units - and one possible scenario is "speculative execution and throwing away the results".

Did I mention that modern processors are quite complicated? ;)

Quote:

One thing I did notice about SIMD, is that it requires full vectorization (Needs all four inputs) for it to execute efficiently.

For example, using

Code:

float fInput1[3] = {30.3F, 100.0F, 140.1F};

will cause the execution time of _mm_load_ps and _mm_store_ps functions to spike, thus increasing the overall time for computation of the square root.

So, to eliminate this spike, you have to submit the following for square root calculation of three floats:

Code:

float fInput2[4] = {30.3F, 100.0F, 140.1F, 0.0F};

There are two potential reasons for this:
1. The [3] array is misaligned, which causes the processor to use a unaligned version of the "load" instructions.
2. The [3] array overlaps the result array by 1 element, which means that the processor gets confused as to the content (and must wait for other operations before it can continue, to make sure it doesn't "get it wrong".

--
Mats
10-29-2010
amerjamil

any 1 has solution for it:, In the triangle Pythagoras theorem is used to find out the length of any in known side
If the base is 4 cm and height is 7cm find out the hypotheses by using the Pythagoras
(Hypotenuse) 2 = (base) 2+ (perpendicular )2

, i need C++ code of it without header file of <math.h> , need to put only conio.h ,iostream.h header file only,??
10-29-2010
CornedBee

Wow!
Violated guidelines: homework, old posts
Bad practices: totally outdated headers