Thread: code to find the square root of a number

  1. #16
    Registered User
    Join Date
    Nov 2006
    Posts
    85
    Quote Originally Posted by CornedBee View Post
    Or, in a loop, the compiler might parallelize a sqrt algorithm into a SIMD implementation with four times the throughput of the x87 (and a non-braindead architecture).
    What on earth does that sentence mean o_0

  2. #17
    Registered User
    Join Date
    Mar 2005
    Location
    Mountaintop, Pa
    Posts
    1,058
    Quote Originally Posted by A10 View Post
    What on earth does that sentence mean o_0

    SIMD is an acronym for Single Instruction, Multiple Data. It's a "feature" found on performance workstations such as the PowerPC G4 and higher, used primarily for intensive real-time graphics work and the like. I believe Intel has platform extensions to do the same thing.

    But anyway, the compiler doesn't "parallelize" anything into a SIMD implementation. The "parallelizing" is part of SIMD itself. That is, the MD part, as in Multiple Data: the processor executes one instruction (SI, Single Instruction) but works on multiple data items in parallel to accomplish the end task.
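
    As a rough sketch of that idea (a hand-written example, not taken from any compiler's output), one SSE ADDPS instruction adds four packed floats in a single operation:
    Code:
    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void)
    {
        float a[4] = {1.0F, 2.0F, 3.0F, 4.0F};
        float b[4] = {10.0F, 20.0F, 30.0F, 40.0F};
        float r[4];

        /* One ADDPS instruction adds all four lanes in parallel. */
        __m128 va = _mm_loadu_ps(a);
        __m128 vb = _mm_loadu_ps(b);
        _mm_storeu_ps(r, _mm_add_ps(va, vb));

        printf("%.1f %.1f %.1f %.1f\n", r[0], r[1], r[2], r[3]);
        return 0;
    }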

  3. #18
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    It is in fact parallelizing the operation; the SIMD opcodes are decoded into multiple instructions. On Intel processors it is executed in parallel, while on AMD processors it performs multiple operations, since the AMD does not have multiple ALUs.

    Oh, and it's not a feature found on performance workstations. It's a feature found on all modern processors since the Pentium Pro. Unless, of course, you think the Pentium Pro is a performance processor.
    Last edited by abachler; 11-06-2008 at 06:40 AM.

  4. #19
    Registered User
    Join Date
    Mar 2005
    Location
    Mountaintop, Pa
    Posts
    1,058
    Quote Originally Posted by abachler View Post
    It is in fact parallelizing the operation; the SIMD opcodes are decoded into multiple instructions. On Intel processors it is executed in parallel, while on AMD processors it performs multiple operations, since the AMD does not have multiple ALUs.

    Oh, and it's not a feature found on performance workstations. It's a feature found on all modern processors since the Pentium Pro. Unless, of course, you think the Pentium Pro is a performance processor.
    Well, I have absolutely no experience writing this type of code, since this stuff is always, without exception, contracted out to computer scientists for the obvious O&E reasons. So, it's quite possible that I may have misinterpreted our current contract computer scientist's explanation of this issue. After all, I don't have a PhD after my name, which makes it all the more difficult to understand these folks.

    But anyway, would you have any third-party information, such as a published paper, to support the following statement?
    It is in fact parallelizing the operation; the SIMD opcodes are decoded into multiple instructions.

  5. #20
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Quote Originally Posted by abachler View Post
    It is in fact parallelizing the operation; the SIMD opcodes are decoded into multiple instructions. On Intel processors it is executed in parallel, while on AMD processors it performs multiple operations, since the AMD does not have multiple ALUs.
    The AMD CPUs do work on multiple data in parallel. With the K10, they will match the parallelization of the Core 2, i.e., two 128-bit execution engines.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  6. #21
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    The older types of AMD processors have an FPU that can handle 64 bits per operation, just like the early P4 processors did. However, the older AMD processors also (like the newer ones) have two separate 64-bit floating-point units, so given the right mix of instructions you can execute two instructions at a time. Yes, both instructions will take two clock cycles to complete. Oh, and there is a separate unit for loading/storing data, so a load or store operation (MOV) will not occupy space in the two FPUs. These FPUs are also used for x87 operations.

    I don't have a PhD either, but I did work for AMD in the past, and I have a fair understanding of the processor architecture, as I spent quite some time optimizing various bits of customer or "partner" code.

    --
    Mats
    Last edited by matsp; 11-06-2008 at 12:37 PM.
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  7. #22
    Registered User
    Join Date
    Mar 2005
    Location
    Mountaintop, Pa
    Posts
    1,058
    Maybe I can't make the connection between higher-level coding and what's happening at the lower chip level.

    The topic has intrigued me. So, I started reviewing a lot of parallel computing books on Safari from the perspective of writing POC code.

    But getting back to SIMD and parallel computing, the books indicate that one instruction (calculation) is executed for multiple data. For example, let's take a dual-processor Xeon and the square root example. In this case, the same single instruction would be executed on both processors but with different data. In other words, I could calculate the square roots of 24 and 49 simultaneously. What throws me off is the above reference to multiple instructions being executed. The books reinforce what our "Einstein" explained to me.

    I'm not a low-level type, so I just can't make the connection between what's in the books and what's happening at the chip level.

  8. #23
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    On a dual-core processor, the two cores (generally speaking) execute different threads or processes. With SIMD, you have ONE processor perform the same operation (say, add) on two or more pieces of the same type of data. So we could do:
    Code:
    a += b;
    c += d;
    as four instructions (which can be further optimized by keeping the results in registers if we continue calculating with a and c later on):
    Code:
    load a and c into register1
    load b and d into register2
    add register2 to register1
    store a and c
    This happens in a "two-lane" ALU - just as multiple cars can travel side by side on a wide road, the ALU takes the two sets of values in, adds them, and outputs both sums.

    Oh, and modern processors are also "superscalar", which essentially means that there are several units that can each execute instructions, independently of each other. So if we take the above C code and compile it without SIMD instructions, the two additions could be executed on two different ALUs in parallel.

    Two cores are more like two distinct roads that don't necessarily reach the same destination.


    Edit: And by the way, if we load 25 (or 24) and 49 into an SSE register as "double" values, the processor can use a SQRTPD to calculate the square root of both of those values at once.
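
    A minimal sketch of that, using the compiler's SSE2 intrinsics rather than raw assembly (assuming a compiler that provides emmintrin.h):
    Code:
    #include <stdio.h>
    #include <emmintrin.h>   /* SSE2 intrinsics (SQRTPD) */

    int main(void)
    {
        double in[2]  = {25.0, 49.0};
        double out[2];

        /* One SQRTPD instruction computes both square roots at once. */
        __m128d v = _mm_loadu_pd(in);
        _mm_storeu_pd(out, _mm_sqrt_pd(v));

        printf("%.1f %.1f\n", out[0], out[1]);   /* prints 5.0 7.0 */
        return 0;
    }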

    --
    Mats
    Last edited by matsp; 11-06-2008 at 02:20 PM.
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  9. #24
    Registered User
    Join Date
    Mar 2005
    Location
    Mountaintop, Pa
    Posts
    1,058
    Ahhh, I got it. What is being discussed in this thread is referred to as SIMD intrinsics. It's not the same as SIMD as found in today's parallel computing paradigm. Well, on second thought, this could be considered the "Model T" of parallel computing. I found the following description in a footnote:

    SIMD intrinsics are primitive functions to parallelize data processing at the CPU register level. For example, the addition of two unsigned chars will take the whole register size, although the size of this data type is just 8 bits, leaving 24 bits in the register to be filled with 0 and wasted. Using SIMD (such as MMX), we can load 8 unsigned chars (or 4 shorts or 2 integers) to be processed in parallel at the register level. Available in Visual Studio 2005 using SIMD intrinsics, or with the Visual C++ Processor Pack for Visual C++ 6.0.
    I was trying to make a direct one-to-one connection between a parallel computing concept and a specific low-level technique, which I now realize is impossible to do. So, in the context of this thread, there isn't any performance to be gained if only the square root of one number is needed using intrinsic SIMD. Performance is improved if the square root of more than one number is required. A SIMD intrinsic is just one instruction operating on one or more data items.
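    To illustrate the packing idea from that footnote with a sketch (using the 128-bit SSE2 equivalent of MMX, so 16 unsigned chars per register instead of 8):
    Code:
    #include <stdio.h>
    #include <emmintrin.h>   /* SSE2 integer SIMD */

    int main(void)
    {
        unsigned char a[16], b[16], r[16];
        int i;

        for (i = 0; i < 16; i++) { a[i] = (unsigned char)i; b[i] = 100; }

        /* One PADDB instruction adds all 16 bytes in parallel. */
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        _mm_storeu_si128((__m128i *)r, _mm_add_epi8(va, vb));

        for (i = 0; i < 16; i++)
            printf("%d ", r[i]);
        printf("\n");
        return 0;
    }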
    Last edited by BobS0327; 11-07-2008 at 12:19 PM.

  10. #25
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Not quite. We're talking about SIMD instructions. SIMD intrinsics are special functions that the compiler recognizes and replaces by the appropriate instructions, so that the programmer doesn't have to resort to inline assembly.
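    For example, a sketch of what such an intrinsic looks like in C (the compiler turns the call into a single SQRTPS instruction, no inline assembly required):
    Code:
    #include <xmmintrin.h>

    /* _mm_sqrt_ps is not a normal library call: the compiler recognizes it
       and emits the SQRTPS instruction in place. */
    __m128 sqrt_four_floats(__m128 v)
    {
        return _mm_sqrt_ps(v);
    }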
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  11. #26
    Registered User
    Join Date
    Mar 2005
    Location
    Mountaintop, Pa
    Posts
    1,058
    Quote Originally Posted by CornedBee View Post
    Not quite. We're talking about SIMD instructions. SIMD intrinsics are special functions that the compiler recognizes and replaces by the appropriate instructions, so that the programmer doesn't have to resort to inline assembly.
    Well, I'm intrigued by the performance of SIMD at this low level. This is new territory for me since I just discovered SIMD at this level the other day. So, let me put forth a scenario:

    Let us assume that we have 128 scalars, floats for example, and we wrote a performance-profiling app. This app would time the computation of the square root of each scalar using the SIMD instructions and also time the calculation of the square root of each scalar using the "plain vanilla" SISD sqrt method. Now, the key here is that both the sqrt method and the SIMD method will calculate the square root of only ONE scalar at a time. The SIMD is NOT allowed to "pipeline" calculate all 128 scalars at once. Under this scenario, which method would offer the optimal performance in terms of minimal processing time, the SIMD instructions or the SISD sqrt method?
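
    A minimal sketch of the one-scalar-at-a-time SSE variant that the scenario describes (using the scalar intrinsic _mm_sqrt_ss rather than the packed form; the function name is just made up for the example):
    Code:
    #include <stdio.h>
    #include <math.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Square root of a single float via the scalar SSE instruction SQRTSS. */
    static float SseSqrtScalar(float fInput)
    {
        float fResult;
        _mm_store_ss(&fResult, _mm_sqrt_ss(_mm_set_ss(fInput)));
        return fResult;
    }

    int main(void)
    {
        float fValue = 30.3F;
        printf("SSE scalar sqrt: %f\n", SseSqrtScalar(fValue));
        printf("sqrt():          %f\n", sqrt(fValue));
        return 0;
    }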

  12. #27
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Now, the key here is that both the sqrt method and the SIMD method will calculate the square root of only ONE scalar at a time.
    Why only one? That defeats the purpose of SIMD instructions.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  13. #28
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    The number of cycles PER square root calculation is slightly lower with SSE2 on AMD processors than with the FPU version - about 15 cycles or so better than the x87 operations.

    On the most recent AMD processors, 2 sqrts in parallel take exactly the same amount of time as one single sqrt in SSE (24 clock cycles double precision, 18 cycles single precision), whilst the older x87 instruction takes 19, 27 or 35 cycles depending on the size of the operands (float, double or long double).

    I don't know how this compares to Intel's numbers, and I couldn't find their optimization guides after two attempts on Google (only really old ones, which is no good for the latest versions of the processors).

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  14. #29
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by BobS0327 View Post
    Under this scenario, which method would offer the optimal performance in terms of minimal processing time, the SIMD instructions or the SISD sqrt method?
    So the question is, which component can physically do square roots faster? The answer is a moving target given how the processors are continually evolving.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  15. #30
    Registered User
    Join Date
    Mar 2005
    Location
    Mountaintop, Pa
    Posts
    1,058

    And the winner is....

    Well, this was mostly a learning experience for me. But I thought that I would post some code that I ran in a profiler for a few hours to determine the performance of Abachler's suggestion, a modified version of Carmack's inverse square root function that was previously posted, the sqrt function, and the SIMD method.

    I've found that ABachler's suggestion is the fastest, with Carmack and SIMD for the most part tied for second and sqrt coming in dead last. I put the posted code in a loop for a few hours in a profiler. The profiler was checking the total time each option took to calculate the square root of four scalars. I used four scalars because this is where SIMD is really efficient, giving the best average calculation time. SIMD is roughly twice as fast as the sqrt method when calculating the square root of four scalars. At three scalars the performance diminishes slightly. At one or two scalars, you're probably better off using sqrt. When SIMD calculates the square root of two scalars, the probability that the sqrt function would be faster is around 60%. Thus, a two-scalar calculation by SIMD would be faster only about 4 times out of 10 compared to sqrt.

    You'll obviously have to modify the code if you want to profile it. At the very minimum, you'll need a loop.

    But here's the code for anybody to evaluate by running it through their own profiler, and/or to question.

    Code:
    #include <xmmintrin.h>
    #include <stdio.h>
    #include <math.h>
    
    __m128 m128Input, m128Output;
    
    /* _mm_load_ps/_mm_store_ps work on 16-byte-aligned arrays of four floats,
       so even the "single scalar" case uses a four-element array (only the
       first element is meaningful). */
    __declspec(align(16)) float fInput2[4]  = {30.3F, 100.0F, 140.1F, 400.65F};
    __declspec(align(16)) float fOutput2[4] = {0.0F, 0.0F, 0.0F, 0.0F};
    
    __declspec(align(16)) float fInput1[4]  = {30.3F, 0.0F, 0.0F, 0.0F};
    __declspec(align(16)) float fOutput1[4] = {0.0F, 0.0F, 0.0F, 0.0F};
    
    /* Fast inverse square root approximation, multiplied back up to give sqrt. */
    float CarmackSqrt(float fInput) {
        long lTemp;
        float fX, fY;
        const float fThreeHalves = 1.5F;
    
        fX = fInput * 0.5F;
        fY = fInput;
        lTemp = * ( long * ) &fY;               /* reinterpret the float's bits */
        lTemp = 0x5f3759df - ( lTemp >> 1 );    /* magic initial guess */
        fY = * ( float * ) &lTemp;
        fY = fY * ( fThreeHalves - ( fX * fY * fY ) );  /* Newton-Raphson step 1 */
        fY = fY * ( fThreeHalves - ( fX * fY * fY ) );  /* Newton-Raphson step 2 */
        return fInput * fY;                     /* x * (1/sqrt(x)) == sqrt(x) */
    }
    
    /* Plain x87 FSQRT via MSVC inline assembly. */
    float ABachlerSqrt(float fValue)
    {
        __asm fld fValue
        __asm fsqrt
        __asm fstp fValue
        return fValue;
    }
    
    int main(void)
    {
        int iIndex;
    
        /* Single scalar: library sqrt vs. packed SSE square root. */
        printf("SQRT %.2f\n", sqrt(fInput1[0]));
        m128Input  = _mm_load_ps(fInput1);
        m128Output = _mm_sqrt_ps(m128Input);
        _mm_store_ps(fOutput1, m128Output);
        printf("SIMD %.2f\n\n", fOutput1[0]);
    
        /* Four scalars: one SQRTPS handles all four at once. */
        printf("SQRT %.2f\nSIMD", sqrt(fInput2[0]));
        m128Input  = _mm_load_ps(fInput2);
        m128Output = _mm_sqrt_ps(m128Input);
        _mm_store_ps(fOutput2, m128Output);
        for (iIndex = 0; iIndex < 4; iIndex++)
            printf(" %.2f", fOutput2[iIndex]);
    
        printf("\nABachler %f\n", ABachlerSqrt(fInput1[0]));
        printf("Carmack %f\n", CarmackSqrt(fInput1[0]));
        return 0;
    }
