C Board  

11-10-2008, 08:13 AM   #31
Kernel hacker
 
Join Date: Jul 2007
Location: Farncombe, Surrey, England
Posts: 15,686
In my experience, SSE intrinsics in Visual Studio produce pretty poor code compared to inline assembler, so I wouldn't necessarily take that as a good indication of "true SSE performance".

Carmack's approximation seems interesting, as it's basically 2 iterations of the customary loop. I wonder how well it performs on a larger range of numbers. It also "messes up" the floating point and integer units, as it overlays FPU data with integer data in order to do an integer subtraction on it. That's a bad idea unless absolutely necessary, since it forces the processor to sync the FPU with the integer unit - normally the integer unit operates independently of the FPU, and both units "prefer" to work independently.

In general, SIMD operations are only "meaningful" if there is a complete set of data.
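
As an illustration of the "complete set of data" point, here is a minimal sketch of how an array of arbitrary length is often split into full groups of four for the SIMD unit plus a scalar tail. The function name sqrt_array is made up for this example and does not come from the thread; the unaligned load/store intrinsics are used so that nothing is assumed about the caller's buffers.

```cpp
#include <cmath>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper (not from the thread): square roots of n floats,
// four at a time where a complete group of four is available.
void sqrt_array(const float* in, float* out, std::size_t n)
{
    std::size_t i = 0;

    // Complete groups of four go through the SIMD unit. loadu/storeu make
    // no alignment assumption about the buffers.
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(in + i);
        _mm_storeu_ps(out + i, _mm_sqrt_ps(v));
    }

    // Leftover elements (fewer than four) are handled one by one.
    for (; i < n; ++i) {
        out[i] = std::sqrt(in[i]);
    }
}
```

With n = 7, for instance, one SIMD step covers the first four values and the scalar loop covers the remaining three.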

--
Mats
__________________
Compilers can produce warnings - make the compiler programmers happy: Use them!
Please don't PM me for help - and no, I don't do help over instant messengers.
11-10-2008, 12:51 PM   #32
Registered User
 
Join Date: Mar 2005
Location: Mountaintop, Pa
Posts: 1,059
Quote:
Originally Posted by matsp View Post
In my experience, SSE intrinsics in Visual Studio produce pretty poor code compared to inline assembler, so I wouldn't necessarily take that as a good indication of "true SSE performance".

Carmack's approximation seems interesting, as it's basically 2 iterations of the customary loop. I wonder how well it performs on a larger range of numbers. It also "messes up" the floating point and integer units, as it overlays FPU data with integer data in order to do an integer subtraction on it. That's a bad idea unless absolutely necessary, since it forces the processor to sync the FPU with the integer unit - normally the integer unit operates independently of the FPU, and both units "prefer" to work independently.
Being a high level code jockey, the above just flew over my head.
Quote:
In general, SIMD operations are only "meaningful" if there is a complete set of data.

--
Mats
One thing I did notice about SIMD is that it requires full vectorization (it needs all four inputs) to execute efficiently.

For example, using
Code:
 float fInput1[3] = {30.3F, 100.0F, 140.1F};
will cause the execution time of _mm_load_ps and _mm_store_ps functions to spike, thus increasing the overall time for computation of the square root.

So, to eliminate this spike, you have to submit the following for square root calculation of three floats:

Code:
float fInput2[4]  = {30.3F, 100.0F, 140.1F, 0.0F};
11-10-2008, 01:25 PM   #33
Kernel hacker
 
Join Date: Jul 2007
Location: Farncombe, Surrey, England
Posts: 15,686
Quote:
Originally Posted by BobS0327 View Post
Being a high level code jockey, the above just flew over my head.
Basically: if you take the address of a float, process the data in that float as an integer, and then put it back into a float, the processor will shout "Hey, FPU, hang on a bit, can you STOP after THIS instruction, and don't move a finger until I say so". Or worse, the two processor units doing integer and float work don't realize they are working on the same data until later on, and one of them has to "back up" - a bit like a pit-stop in Formula One or Indy-car where the driver doesn't stop in time and has to go back a bit, which usually causes a big extra delay.
Code:
    lTemp  = * ( long * ) &fY;
    lTemp  = 0x5f3759df - ( lTemp >> 1 );
is the relevant code.

On older processors, things didn't happen much in parallel, so there was less of a problem with this style of code. Modern processors definitely execute a lot of instructions in parallel, and there's some pretty complicated logic to prevent one or the other unit from getting it wrong when overlapping work between two units - and one possible scenario is "speculative execution and throwing away the results".

Did I mention that modern processors are quite complicated?

Quote:

One thing I did notice about SIMD is that it requires full vectorization (it needs all four inputs) to execute efficiently.

For example, using
Code:
 float fInput1[3] = {30.3F, 100.0F, 140.1F};
will cause the execution time of _mm_load_ps and _mm_store_ps functions to spike, thus increasing the overall time for computation of the square root.

So, to eliminate this spike, you have to submit the following for square root calculation of three floats:

Code:
float fInput2[4]  = {30.3F, 100.0F, 140.1F, 0.0F};
There are two potential reasons for this:
1. The [3] array is misaligned, which causes the processor to use an unaligned version of the "load" instructions.
2. The [3] array overlaps the result array by 1 element, which means that the processor gets confused as to the content (and must wait for other operations before it can continue, to make sure it doesn't "get it wrong").
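
The first possibility can be checked directly. The sketch below, with made-up helper names is_aligned16 and load4, tests whether an address sits on a 16-byte boundary and picks the aligned or the unaligned SSE load accordingly. It assumes the caller really has four readable floats at that address, which a [3] array does not provide.

```cpp
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper (not from the thread): true if p is on a 16-byte boundary.
bool is_aligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: loads four floats, using the aligned instruction only
// when the address allows it. The caller must guarantee that p points at
// four readable floats.
__m128 load4(const float* p)
{
    return is_aligned16(p) ? _mm_load_ps(p) : _mm_loadu_ps(p);
}
```

Declaring the input as alignas(16) float fInput[4] avoids the question altogether, since the aligned load is then always valid.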

--
Mats
Tags
c++ code, code, square root

All times are GMT -6.

