SIMD is an acronym for Single Instruction, Multiple Data. It is a feature found on performance workstations such as the IBM PowerPC G4 and higher, used primarily for intensive real-time graphics work. I believe Intel has platform extensions that do the same thing.
But anyway, the compiler doesn't "parallelize" anything into a SIMD implementation. The "parallelizing" is part of SIMD itself - the MD part, as in Multiple Data. The processor executes one instruction, as in SI (Single Instruction), but operates on Multiple Data - that is, it works on all the data in parallel, simultaneously, to accomplish the task.
It is in fact parallelizing the operation; the SIMD opcodes are decoded into multiple instructions. On Intel processors these are executed in parallel, while on AMD processors they are performed as multiple operations, since those AMD parts do not have multiple ALUs.
Oh, and it's not a feature found only on performance workstations. It's a feature found on all modern processors since the Pentium Pro. Unless, of course, you think the Pentium Pro is a performance processor.
Last edited by abachler; 11-06-2008 at 06:40 AM.
Well, I have absolutely no experience writing this type of code, since this work is always, without exception, contracted out to computer scientists for the obvious O&E reasons. So it's quite possible that I misinterpreted our current contract computer scientist's explanation of this issue. After all, I don't have a PhD after my name, which makes it all the more difficult to understand these folks.
But anyway, would you have any third-party information, such as a published paper, to support the following statement?
"It is in fact parallelizing the operation; the SIMD opcodes are decoded into multiple instructions."
All the buzzt!
CornedBee
"There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
- Flon's Law
The older types of AMD processors have an FPU that can handle 64 bits per operation, just like the early P4 processors did. However, the older AMD processors also (like the newer ones) have two separate 64-bit floating-point units, so given the right mix of instructions, you can execute two instructions at a time. Yes, both instructions will take two clock cycles to complete. Oh, and there is a separate unit for loading/storing data, so a load or store operation (MOV) will not occupy space in the two FPUs. These FPUs are also used for x87 operations.
I don't have a PhD either, but I did work for AMD in the past, and I have a fair understanding of the processor architecture, as I spent quite some time optimizing various bits of customer or "partner" code.
--
Mats
Last edited by matsp; 11-06-2008 at 12:37 PM.
Compilers can produce warnings - make the compiler programmers happy: Use them!
Please don't PM me for help - and no, I don't do help over instant messengers.
Maybe I can't make the connection between higher level coding and what's happening on the lower chip level.
The topic has intrigued me. So, I started reviewing a lot of parallel computing books on Safari from the perspective of writing POC code.
But getting back to SIMD and parallel computing, the books indicate that one instruction (calculation) is executed for multiple data. For example, let's take a Xeon dual processor and the square root example. In this case, the same one instruction would be executed on both processors but with different data. In other words, I could calculate the square root of 24 and 49 simultaneously. What throws me off is the above reference to multiple instructions being executed. The books reinforce what our "Einstein" explained to me.
I'm not a low-level type. So I just can't make the connection between what's in the books and what's happening at chip level.
On a dual core processor, the two cores (generally speaking) execute different threads or processes. With SIMD, you have ONE processor perform the same operation (say, add) on two or more pieces of the same type of data. So we could do:
Code:
a += b;
c += d;
as four instructions (which can be further optimized by keeping the results in registers if we continue calculating with a and c later on):
Code:
load a and c into register1
load b and d into register2
add register2 to register1
store a and c
This happens in a "two-lane" ALU - just like on big roads where multiple cars can travel in parallel side by side, the ALU takes the two sets of values in, adds them, and outputs the sums.
Oh, and modern processors are also "superscalar", which essentially means that there are several units that can each execute instructions independently of each other. So if we take the above C code and compile it without SIMD instructions, the two additions could be executed on two different ALUs in parallel.
Two cores is more like having two distinct roads that don't necessarily reach the same destination.
Edit: And by the way, if we load 25 (or 24) and 49 into an SSE register as "double" values, it can use a single SQRTPD to calculate the square root of both values at once.
--
Mats
Last edited by matsp; 11-06-2008 at 02:20 PM.
Ahhh, I got it. What is being discussed in this thread is referred to as SIMD intrinsics. It's not the same as SIMD as found in today's parallel computing paradigm. Well, on second thought, this could be considered the "Model T" of parallel computing. I found the following description in a footnote:
"SIMD intrinsics are primitive functions to parallelize data processing on the CPU register level. For example, the addition of two unsigned chars will take the whole register, although the size of this data type is just 8 bits, leaving 24 bits of the register to be filled with 0 and wasted. Using SIMD (such as MMX), we can load 8 unsigned chars (or 4 shorts or 2 integers) to be executed in parallel on the register level. Available in Visual Studio 2005 using SIMD intrinsics, or with the Visual C++ Processor Pack for Visual C++ 6.0."
I was trying to make a direct one-to-one connection between a parallel computing concept and a specific low-level technique, which I now realize is impossible to do. So, in the context of this thread, there isn't any performance to be gained if only the square root of one number is needed using intrinsic SIMD. Performance is improved if the square root of more than one number is required. SIMD intrinsics is just one instruction with one or more data items.
Last edited by BobS0327; 11-07-2008 at 12:19 PM.
Not quite. We're talking about SIMD instructions. SIMD intrinsics are special functions that the compiler recognizes and replaces with the appropriate instructions, so that the programmer doesn't have to resort to inline assembly.
All the buzzt!
CornedBee
Well, I'm intrigued by the performance of SIMD at this low level. This is new territory for me since I just discovered SIMD at this level the other day. So, let me put forth a scenario:
Let us assume that we have 128 scalars, floats for example, and we wrote a performance-profiling app. This app would time the computation of the square root of each scalar using the SIMD instructions, and also time the calculation of the square root of each scalar using the "plain vanilla" SISD sqrt method. Now, the key here is that both the sqrt method and the SIMD method will calculate the square root of only ONE scalar at a time. The SIMD version is NOT allowed to "pipeline" and calculate all 128 scalars at once. Under this scenario, which method would offer the optimal performance in terms of minimal processing time: the SIMD instructions or the SISD sqrt method?
"Now, the key here is that both the sqrt method and the SIMD method will calculate the square root of only ONE scalar at a time."
Why only one? That defeats the purpose of SIMD instructions.
All the buzzt!
CornedBee
The number of cycles PER square root calculation is slightly lower with SSE2 on AMD processors than with the FPU version - about 15 cycles better than the x87 operations.
On the most recent AMD processors, 2 sqrts in parallel take exactly the same amount of time as one single sqrt in SSE (24 clock cycles for double precision, 18 cycles for single precision), whilst the older x87 version takes 19, 27 or 35 cycles depending on the size of the operands (float, double or long double).
I don't know how this compares to Intel's numbers, and I couldn't find their optimization guides in two attempts on Google (only really old ones, which are no good for the latest processors).
--
Mats
Well, this was mostly a learning experience for me. But I thought I would post some code that I ran in a profiler for a few hours to determine the performance of Abachler's suggestion, a modified version of Carmack's inverse square root function that was previously posted, the sqrt function, and the SIMD method.
I've found that ABachler's suggestion is the fastest, with Carmack and SIMD for the most part tied for second, and sqrt coming in dead last. I put the posted code in a loop for a few hours in a profiler. The profiler was checking the total time each option took to calculate the square root of four scalars. I used four scalars because this is where SIMD is really efficient, giving the best average calculation time. SIMD is roughly twice as fast as the sqrt method when calculating the square root of four scalars. At three scalars the performance diminishes slightly. At one or two scalars, you're probably better off using sqrt. When SIMD calculates the square root of two scalars, the probability that the sqrt function would be faster is around 60%. Thus, a two-scalar calculation by SIMD would be faster only 4 out of 10 times over sqrt.
You'll obviously have to modify the code if you want to profile it. At the very minimum, you'll need a loop.
But here's the code for anybody to evaluate by running it through their own profiler, and/or to question.
Code:
#include <xmmintrin.h>
#include <stdio.h>
#include <math.h>

__m128 m128Input, m128Output;

float fInput2[4]  = {30.3F, 100.0F, 140.1F, 400.65F};
float fOutput2[4] = {0.0F, 0.0F, 0.0F, 0.0F};
/* padded to four elements: _mm_loadu_ps always reads four floats */
float fInput1[4]  = {30.3F, 0.0F, 0.0F, 0.0F};
float fOutput1[4] = {0.0F, 0.0F, 0.0F, 0.0F};

float CarmackSqrt(float fInput)
{
    long lTemp;
    float fX, fY;
    const float iFloat = 1.5F;

    fX = fInput * 0.5F;
    fY = fInput;
    lTemp = * ( long * ) &fY;
    lTemp = 0x5f3759df - ( lTemp >> 1 );
    fY = * ( float * ) &lTemp;
    fY = fY * ( iFloat - ( fX * fY * fY ) );  /* Newton-Raphson iteration 1 */
    fY = fY * ( iFloat - ( fX * fY * fY ) );  /* Newton-Raphson iteration 2 */
    return fInput * fY;  /* x * (1/sqrt(x)) == sqrt(x) */
}

float ABachlerSqrt(float iFloat)
{
    __asm fld  iFloat
    __asm fsqrt
    __asm fstp iFloat
    return iFloat;
}

int main(void)
{
    int iIndex;

    printf("SQRT %.2f\n", sqrt(fInput1[0]));

    /* _mm_loadu_ps/_mm_storeu_ps used instead of _mm_load_ps/_mm_store_ps:
       these global arrays are not guaranteed to be 16-byte aligned */
    m128Input  = _mm_loadu_ps(fInput1);
    m128Output = _mm_sqrt_ps(m128Input);
    _mm_storeu_ps(fOutput1, m128Output);
    printf("SIMD %.2f\n\n", fOutput1[0]);

    printf("SQRT %.2f\nSIMD ", sqrt(fInput2[0]));
    m128Input  = _mm_loadu_ps(fInput2);
    m128Output = _mm_sqrt_ps(m128Input);
    _mm_storeu_ps(fOutput2, m128Output);
    for (iIndex = 0; iIndex < 4; iIndex++)
        printf(" %.2f", fOutput2[iIndex]);

    printf("\nABachler %f\n", ABachlerSqrt(fInput1[0]));
    printf("Carmack %f\n", CarmackSqrt(fInput1[0]));
    return 0;
}