Thread: Compiling c++ intrinsics commands

  1. #16
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by h3ro View Post
    Thanks again Mats, I'm starting to get the picture now. I even have something working.
    Congratulations.
    One last question:
    >>You can just divide each byte value by 255 to get a 0.0 .. 1.0 value range, that makes all the calculations work better.
    Why would 0.0 to 1.0 be better to work with?
    Because you can use natural math as you expect it to happen, and only at the end of the calculation translate the result back to a 0 .. 255 value.
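    As a minimal scalar sketch of that idea (the helper name blend_channel is made up for illustration and is not from the thread), the blend below does all of its math in the 0.0 .. 1.0 range and converts back to a byte only on the last line:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the thread): blend one 8-bit channel
// using 0.0 .. 1.0 values for all intermediate math.
std::uint8_t blend_channel(std::uint8_t src, std::uint8_t dst, std::uint8_t alpha)
{
    const float s = src   / 255.0f;   // 0 .. 255 -> 0.0 .. 1.0
    const float d = dst   / 255.0f;
    const float a = alpha / 255.0f;

    // Plain linear interpolation, with no scale factors to keep track of.
    const float out = d + (s - d) * a;

    // Only at the very end go back to 0 .. 255, rounding to nearest.
    return static_cast<std::uint8_t>(out * 255.0f + 0.5f);
}
```

    With alpha = 255 the result is the source value, and with alpha = 0 it is the destination value.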
    Also, for the conversions, I assume it's better to use the SSE2 instructions for going from float to byte/int than simply casting in normal C++?
    I would think so, but benchmarking is the only way to actually know what's better [unless you are so experienced you can inspect the code and say from that].
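    For reference, here is a rough sketch of the two conversions being compared, so that both can be benchmarked. The function names are invented for this example. The SSE2 version rounds to nearest (under the default rounding mode) and saturates to 0 .. 255, while the cast truncates toward zero and is only defined for values that fit in a byte:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>   // SSE2 intrinsics

// Hypothetical helper: convert four floats to four bytes with SSE2.
// 'out' must point at 4 writable bytes.
void floats_to_bytes_sse2(const float* in, std::uint8_t* out)
{
    const __m128  f   = _mm_loadu_ps(in);           // 4 floats
    const __m128i i32 = _mm_cvtps_epi32(f);         // round to 4 x int32
    const __m128i i16 = _mm_packs_epi32(i32, i32);  // saturate to int16
    const __m128i u8  = _mm_packus_epi16(i16, i16); // saturate to 0 .. 255
    const int packed  = _mm_cvtsi128_si32(u8);      // low 4 bytes
    std::memcpy(out, &packed, 4);
}

// The plain C++ alternative: one cast per value.
// Input values must already be within 0 .. 255.
void floats_to_bytes_cast(const float* in, std::uint8_t* out)
{
    for (int k = 0; k < 4; ++k)
        out[k] = static_cast<std::uint8_t>(in[k]);  // truncates toward zero
}
```

    Note that the two do not give identical results for values such as 127.6, so a benchmark should also check that the rounding behaviour is acceptable.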

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  2. #17
    Registered User
    Join Date
    Oct 2006
    Location
    UK/Norway
    Posts
    485
    >>[unless you are so experienced you can inspect the code and say from that].
    I think it's fairly safe to say I'm not :P

    Thanks

  3. #18
    Registered User
    Join Date
    Oct 2006
    Location
    UK/Norway
    Posts
    485
    OK, I have a new question again.

    When working with the __m128 data type I found something weird.

    If you stop your program in the debugger and move your mouse over the __m128 variable, you can read it like an array (which makes sense), but there are several ways the array can be read. I'm loading in 4 floats, so m128_f32 is the only one that has useful information in it (which again makes sense).

    But I would very much like to get access to the data in m128_i8 or m128_u8 as the data there is just 8 bit (16 spaces in the array). I have been looking around and there is not much information about it.

    From MSDN:
    Floating-point data loaded or stored as __m128 objects must be generally 16-byte aligned.
    Why make it possible to see the bytes in the __m128 variable if they can't be used?

  4. #19
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Some operations, like byte shuffle, work on bytes.
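    A small sketch of how the bytes can be reached from code rather than from the debugger (the helper name vector_bytes is invented here). It reinterprets the float register as an integer register and stores it to a byte array. Note that this yields the raw bit patterns of the floats, not pixel values; the byte view only holds meaningful 8-bit data when the register was loaded with integer data in the first place:

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>   // SSE2: __m128i, _mm_castps_si128

// Hypothetical helper: copy the 16 raw bytes of a float vector into an array.
void vector_bytes(__m128 v, std::uint8_t out[16])
{
    const __m128i bits = _mm_castps_si128(v);  // reinterpret only, no conversion
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), bits);
}
```

    For example, a register filled with 1.0f (bit pattern 0x3F800000) shows up as the bytes 00 00 80 3F repeated four times on a little-endian x86 machine.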
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  5. #20
    Registered User
    Join Date
    Oct 2006
    Location
    UK/Norway
    Posts
    485
    Sorry for asking all these questions, but what is byte shuffle useful for?

  6. #21
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Endian conversion, planar/interleaved conversion, probably other things.
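    To make the planar/interleaved case concrete, here is a sketch that turns four separate colour planes into interleaved B,G,R,A pixels. It is an illustration under assumptions: the function names are invented, and instead of the byte shuffle instruction itself (which needs SSSE3) it uses the SSE2 byte and word unpack instructions, which are enough for this particular rearrangement:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>   // SSE2 intrinsics

// Load 4 bytes into the low 32 bits of a register (upper bits are zero).
static __m128i load4(const std::uint8_t* p)
{
    int v;
    std::memcpy(&v, p, 4);
    return _mm_cvtsi32_si128(v);
}

// Hypothetical helper: interleave 4 bytes from each of four planes into
// four B,G,R,A pixels. 'out' must point at 16 writable bytes.
void interleave_bgra4(const std::uint8_t* b, const std::uint8_t* g,
                      const std::uint8_t* r, const std::uint8_t* a,
                      std::uint8_t* out)
{
    // b0 g0 b1 g1 b2 g2 b3 g3 ...
    const __m128i bg = _mm_unpacklo_epi8(load4(b), load4(g));
    // r0 a0 r1 a1 r2 a2 r3 a3 ...
    const __m128i ra = _mm_unpacklo_epi8(load4(r), load4(a));
    // b0 g0 r0 a0 b1 g1 r1 a1 b2 g2 r2 a2 b3 g3 r3 a3
    const __m128i bgra = _mm_unpacklo_epi16(bg, ra);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), bgra);
}
```

    Going the other way (interleaved to planar) is the same kind of byte rearrangement in reverse.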

  7. #22
    Registered User
    Join Date
    Oct 2006
    Location
    UK/Norway
    Posts
    485
    Ok, now I have written my blitting function, but it is dead slow.

    My test scene has 63 small sprites with alpha channel.
    With the SSE version I get less than 80 fps.
    With the normal C++ function I get more than 310 fps.

    I was wondering if anyone could give me a hint as to why that is?


    Here is my code:

    Code for aligning the texture data
    Code:
    // Create SSE data
    	// Test sprite is 64*64
    	//64 * 64 = 4096
    	__declspec(align(16)) float blue[4096];
    	__declspec(align(16)) float green[4096];
    	__declspec(align(16)) float red[4096];
    	__declspec(align(16)) float trans[4096];
    
    	// Pointer to the original spritedata
    	BYTE *texture = spriteData;
    
    	// Split the data up so that we have each colour in a separate array
    	for(int i = 0; i < 4096; i++)
    	{
    		blue[i] = *texture;
    		texture++;
    
    		green[i] = *texture;
    		texture++;
    
    		red[i] = *texture;
    		texture++;
    
    		trans[i] = *texture;
    		texture++;	
    	}
    
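    	// Point the variables in the header file to the values here
    	// (if the arrays above are locals, they must outlive this function, e.g. static storage, or these pointers will dangle)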
    	b = blue;
    	g = green;
    	r = red;
    	t = trans;
    And here is the blitting code.
    Code:
    //-------------------------------------------------------
    	// SSE BLITTING CODE!
    	//-------------------------------------------------------
    
    	// Create a pointer to the colours we are using
    	float *cBlue = b;
    	float *cGreen = g;
    	float *cRed = r;
    	float *cAlpha = t;
    
    	// Array for storing the result of each blitting
    	__declspec(align(16)) float blue[4];
    	__declspec(align(16)) float green[4];
    	__declspec(align(16)) float red[4];
    
    	for (int i = 0; i < height; i++)
    	{
    		//dividedWidth is width of the texture / 4
    		for (int j = 0; j < dividedWidth; j++)
    		{
    			__m128 textBlue =  _mm_load_ps( cBlue);
    			__m128 textGreen = _mm_load_ps( cGreen);
    			__m128 textRed =   _mm_load_ps( cRed);
    			__m128 textAlpha = _mm_load_ps( cAlpha);
    
    			// Increment colour pointer
    			cBlue  +=4;
    			cGreen +=4;
    			cRed   +=4;
    			cAlpha +=4;
    
    			__m128 screenBlue =  _mm_setr_ps( *(screenDataPnt) ,     *(screenDataPnt + 4) ,*(screenDataPnt + 8) ,*(screenDataPnt + 12) );
    			__m128 screenGreen = _mm_setr_ps( *(screenDataPnt + 1) , *(screenDataPnt + 5) ,*(screenDataPnt + 9) ,*(screenDataPnt + 13) );
    			__m128 screenRed =   _mm_setr_ps( *(screenDataPnt + 2) , *(screenDataPnt + 6) ,*(screenDataPnt + 10) ,*(screenDataPnt + 14));
    
    			// Load 256 into each of the registers, so that we can use it to make alpha from 0 to 1
    			__m128 temp = _mm_set_ps(256,256,256,256);
    
    			// Make alpha in the range 0 to 1
    			textAlpha = _mm_div_ps(textAlpha, temp);
    
    			// Blue
    			temp = _mm_sub_ps(textBlue, screenBlue);
    			temp = _mm_mul_ps(temp,textAlpha);
    			temp = _mm_add_ps(temp, screenBlue);
    
    			_mm_storeu_ps(blue, temp);
    
    			// Green
    			temp = _mm_sub_ps(textGreen, screenGreen);
    			temp = _mm_mul_ps(temp,textAlpha);
    			temp = _mm_add_ps(temp, screenGreen);
    
    			_mm_storeu_ps(green, temp);
    
    			// Red
    			temp = _mm_sub_ps(textRed, screenRed);
    			temp = _mm_mul_ps(temp,textAlpha);
    			temp = _mm_add_ps(temp, screenRed);
    
    			_mm_storeu_ps(red, temp);
    
    
    			// Copy the result into the screenData pointer
    			for(int p = 0; p < 4; p++)
    			{
    				*screenDataPnt = blue[p];
    				screenDataPnt++;
    				*screenDataPnt = green[p];
    				screenDataPnt++;
    				*screenDataPnt = red[p];
    				screenDataPnt++;
    				screenDataPnt++;
    			}
    		}
    
    		// (screenWidth - textureWidth) * bytes per pixel
    		//	(640         -      64)		*     4 = 2304
    		screenDataPnt += 2304;
    	}

  8. #23
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Could you compile the SSE code with -S on the gcc compile line, and post the code for that (as an attachment, perhaps, since it's probably a bit longish).


  9. #24
    Registered User
    Join Date
    Oct 2006
    Location
    UK/Norway
    Posts
    485
    I'm using VS so I'm not sure what -S is, but I assume it's assembly output?

    The attached file is the result of activating Assembly With Source Code (/FAs) in the project options. Is that the correct thing?

  10. #25
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Like I expected, the SSE code generated from intrinsics is not particularly good. Before I get to that:
    Code:
    			// Load 256 into each of the registers, so that we can use it to make alpha from 0 to 1
    			__m128 temp = _mm_set_ps(256,256,256,256);
    is wrong, it should be 255 (otherwise alpha=255 makes 0.996... instead of 1.0 - you are not the first to make that mistake).
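    A tiny scalar check of that point (function names invented for the example). The second function mirrors the posted store loop, which assigns the float result straight to a byte and therefore truncates:

```cpp
#include <cassert>

// Normalise an 8-bit alpha value with a given divisor.
float normalise_alpha(unsigned char alpha, float divisor)
{
    return alpha / divisor;
}

// Blend fully opaque white (255) over black (0) and truncate to a byte,
// the same way a plain float-to-byte assignment would.
unsigned char blend_opaque_white_over_black(float divisor)
{
    const float a   = normalise_alpha(255, divisor);
    const float out = 0.0f + (255.0f - 0.0f) * a;
    return static_cast<unsigned char>(out);
}
```

    Dividing by 255 gives alpha 1.0 and a result of 255. Dividing by 256 gives alpha 0.99609375 and a result of 254, so a fully opaque sprite pixel never quite replaces the background.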

    Every operation of this SSE code is basically doing a whole heap of SSE operations that aren't needed:
    Code:
    ; 141  : 
    ; 142  : 			// Make alpha in the range 0 to 1
    ; 143  : 			textAlpha = _mm_div_ps(textAlpha, temp);
    
    	movaps	xmm0, XMMWORD PTR _temp$18605[ebp]
    	movaps	xmm1, XMMWORD PTR _textAlpha$18597[ebp]
    	divps	xmm1, xmm0
    	movaps	XMMWORD PTR $T18607[ebp], xmm1
    	movaps	xmm0, XMMWORD PTR $T18607[ebp]
    	movaps	XMMWORD PTR _textAlpha$18597[ebp], xmm0
    // Why not just store xmm1 directly in _textAlpha$18597[ebp]?
    
    ; 144  : 
    ; 145  : 			// Blue
    ; 146  : 			temp = _mm_sub_ps(textBlue, screenBlue);
    
    	movaps	xmm0, XMMWORD PTR _screenBlue$18599[ebp]
    	movaps	xmm1, XMMWORD PTR _textBlue$18591[ebp]
    	subps	xmm1, xmm0
    	movaps	XMMWORD PTR $T18608[ebp], xmm1
    	movaps	xmm0, XMMWORD PTR $T18608[ebp]
    	movaps	XMMWORD PTR _temp$18605[ebp], xmm0
    
    // As above. 
    ; 147  : 			temp = _mm_mul_ps(temp,textAlpha);
    
    	movaps	xmm0, XMMWORD PTR _textAlpha$18597[ebp]
    	movaps	xmm1, XMMWORD PTR _temp$18605[ebp]
    	mulps	xmm1, xmm0
    	movaps	XMMWORD PTR $T18609[ebp], xmm1
    	movaps	xmm0, XMMWORD PTR $T18609[ebp]
    	movaps	XMMWORD PTR _temp$18605[ebp], xmm0
    
    ; 148  : 			temp = _mm_add_ps(temp, screenBlue);
    
    	movaps	xmm0, XMMWORD PTR _screenBlue$18599[ebp]
    	movaps	xmm1, XMMWORD PTR _temp$18605[ebp]
    	addps	xmm1, xmm0
    	movaps	XMMWORD PTR $T18610[ebp], xmm1
    	movaps	xmm0, XMMWORD PTR $T18610[ebp]
    	movaps	XMMWORD PTR _temp$18605[ebp], xmm0
    
    ; 149  : 
    ; 150  : 			_mm_storeu_ps(blue, temp);
    
    	movaps	xmm0, XMMWORD PTR _temp$18605[ebp]
    	movups	XMMWORD PTR _blue$[ebp], xmm0
    
    ; 151  : 
    ; 152  : 			// Green
    ; 153  : 			temp = _mm_sub_ps(textGreen, screenGreen);
    
    	movaps	xmm0, XMMWORD PTR _screenGreen$18601[ebp]
    	movaps	xmm1, XMMWORD PTR _textGreen$18593[ebp]
    	subps	xmm1, xmm0
    	movaps	XMMWORD PTR $T18611[ebp], xmm1
    	movaps	xmm0, XMMWORD PTR $T18611[ebp]
    	movaps	XMMWORD PTR _temp$18605[ebp], xmm0
    
    ; 154  : 			temp = _mm_mul_ps(temp,textAlpha);
    
    	movaps	xmm0, XMMWORD PTR _textAlpha$18597[ebp]
    	movaps	xmm1, XMMWORD PTR _temp$18605[ebp]
    	mulps	xmm1, xmm0
    	movaps	XMMWORD PTR $T18612[ebp], xmm1
    	movaps	xmm0, XMMWORD PTR $T18612[ebp]
    	movaps	XMMWORD PTR _temp$18605[ebp], xmm0
    
    ; 155  : 			temp = _mm_add_ps(temp, screenGreen);
    
    	movaps	xmm0, XMMWORD PTR _screenGreen$18601[ebp]
    	movaps	xmm1, XMMWORD PTR _temp$18605[ebp]
    	addps	xmm1, xmm0
    	movaps	XMMWORD PTR $T18613[ebp], xmm1
    	movaps	xmm0, XMMWORD PTR $T18613[ebp]
    	movaps	XMMWORD PTR _temp$18605[ebp], xmm0
    
    ; 156  : 
    ; 157  : 			_mm_storeu_ps(green, temp);
    
    	movaps	xmm0, XMMWORD PTR _temp$18605[ebp]
    	movups	XMMWORD PTR _green$[ebp], xmm0
    I could keep going with the above comments, but it's essentially just the same.


  11. #26
    Registered User
    Join Date
    Oct 2006
    Location
    UK/Norway
    Posts
    485
    Ok, so my code basically does everything twice?

    If I understand right, the code in red can be deleted?

  12. #27
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by h3ro View Post
    Ok, so my code basically does everything twice?

    If I understand right, the code in red can be deleted?
    No, the red code [the three movaps instructions that follow each arithmetic instruction] could be written as one instruction instead of three (storing the value from the result register immediately, rather than storing it as a temporary value first).

    Further, since all of your code only ever uses xmm0 and xmm1 [in the main calculation paths at least], it reduces the chances of the processor performing multiple operations in parallel [even tho' the processor probably has a register renaming feature, I doubt it will be clever enough to do it really well].
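    Independently of how well the compiler allocates registers, the amount of work per group of four pixels can be reduced at the source level. The sketch below is one possible restructuring and not code from the thread: the function name and the B,G,R,X byte layout are assumptions based on the posted loop. It fetches the four screen pixels with a single load, splits the channels with shifts and masks instead of building vectors from individual bytes, multiplies by 1/255 rather than dividing, and writes all four pixels back with one store. It assumes texture values and alpha are within 0 .. 255.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>   // SSE2 intrinsics

// Hypothetical helper: alpha-blend four planar float texels over four
// interleaved B,G,R,X screen pixels (16 bytes at 'screen').
// texB/texG/texR/texA each point at 4 floats in the range 0 .. 255.
void blend4_bgrx(const float* texB, const float* texG, const float* texR,
                 const float* texA, std::uint8_t* screen)
{
    const __m128  inv255  = _mm_set1_ps(1.0f / 255.0f);
    const __m128i lowByte = _mm_set1_epi32(0xFF);

    // One load fetches all four screen pixels.
    const __m128i px = _mm_loadu_si128(reinterpret_cast<const __m128i*>(screen));

    // Split the channels with shifts and masks, then convert to float.
    const __m128 sB = _mm_cvtepi32_ps(_mm_and_si128(px, lowByte));
    const __m128 sG = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(px, 8), lowByte));
    const __m128 sR = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(px, 16), lowByte));

    // Alpha in 0.0 .. 1.0, using a multiply instead of a divide.
    const __m128 a = _mm_mul_ps(_mm_loadu_ps(texA), inv255);

    // out = screen + (texture - screen) * alpha
    const __m128 oB = _mm_add_ps(sB, _mm_mul_ps(_mm_sub_ps(_mm_loadu_ps(texB), sB), a));
    const __m128 oG = _mm_add_ps(sG, _mm_mul_ps(_mm_sub_ps(_mm_loadu_ps(texG), sG), a));
    const __m128 oR = _mm_add_ps(sR, _mm_mul_ps(_mm_sub_ps(_mm_loadu_ps(texR), sR), a));

    // Back to integers, shifted into their byte positions.
    const __m128i iB = _mm_cvtps_epi32(oB);
    const __m128i iG = _mm_slli_epi32(_mm_cvtps_epi32(oG), 8);
    const __m128i iR = _mm_slli_epi32(_mm_cvtps_epi32(oR), 16);

    // Keep the fourth byte of every screen pixel untouched.
    const __m128i keep = _mm_slli_epi32(_mm_srli_epi32(px, 24), 24);

    const __m128i result = _mm_or_si128(_mm_or_si128(iB, iG), _mm_or_si128(iR, keep));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(screen), result);
}
```

    Listings in which every intermediate value goes through a stack slot are also what compilers typically emit with optimisation turned off, so it may be worth confirming the build settings before drawing conclusions from the fps numbers.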


  13. #28
    Registered User
    Join Date
    Oct 2006
    Location
    UK/Norway
    Posts
    485
    I am not sure I really understand how to change my code to make it better.
    To be honest, I am not even sure my understanding of SSE is correct.

    I know this is a lot to ask, but I am not sure where to go from here. Does anyone have time to write some short pseudocode of what I am trying to do? It does not have to be perfect or work 100%, just something I could use as a guide, because what I thought would work pretty well turned out to be 400% worse than the original C++ code.

  14. #29
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Well, the right solution (I think) is to write the SSE code as inline assembler [assuming you are not using 64-bit compiler - if you are, then you will need to use MASM and make it an external function from the C++ code's perspective]. Whilst I probably could do that pretty quickly, the debugging of it may take some time (depending on what kind of and how many mistakes I make - there may be none, but I doubt it), and just posting something that "almost works" is not a good idea.


  15. #30
    Registered User
    Join Date
    Oct 2006
    Location
    UK/Norway
    Posts
    485
    I started this in assembly, but it was painfully slow to write as I know no assembly at all (I had it working so that I loaded variables into memory and then to the screen, without doing anything with them).

    I was also told that correctly written intrinsics should be almost as fast as inline assembly (a 15-25% difference, if I remember correctly).

    I'm not using a 64-bit compiler now, but as one of my PCs is a 64-bit machine I would like to take advantage of 64-bit at some stage (right now that machine is running 32-bit XP).

    >>Whilst I probably could do that pretty quickly, the debugging of it may take some time (depending on what kind of and how many mistakes I make - there may be none, but I doubt it), and just posting something that "almost works" is not a good idea.
    I would prefer to do it myself as I am doing this as a learning experience. What I meant by help is help with the concept, or pseudocode, because from your reply I get the impression that my code does not really take advantage of the power of the SSE registers.

    I have been fooling with this forever now, so I might put it on hold for a while until I have some more programming experience.

    As a side question, are there any books that I can order that will help me here? I had a look at Amazon, but could not find anything. I tried reading the Intel manuals, but to be honest, I did not understand too much of them.

    Thanks to everyone who has given any input to this

    Last Post: 01-17-2003, 02:51 PM