Compiling c++ intrinsics commands

**matsp** · 07-08-2008

Originally Posted by h3ro

Thanks again Mats, I'm starting to get the picture now. I even have something working.

Congratulations.

One last question:
>>You can just divide each byte value by 255 to get a 0.0 .. 1.0 value range, that makes all the calculations work better.
Why would 0.0 to 1.0 be better to work with?

Because you can use natural math as you expect it to happen, and only at the end of the calculation translate it back to 0 .. 255 value.
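As a rough illustration of the point above, here is a minimal scalar sketch of blending one colour channel in the 0.0 .. 1.0 range and converting back only at the end. The helper names (`toUnit`, `toByte`, `blendChannel`) are made up for this example and do not come from the thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: map a 0..255 byte to a 0.0..1.0 float.
inline float toUnit(std::uint8_t v) { return v / 255.0f; }

// Hypothetical helper: map a 0.0..1.0 float back to a byte, rounding to nearest.
inline std::uint8_t toByte(float v)
{
    assert(v >= 0.0f && v <= 1.0f);
    return static_cast<std::uint8_t>(v * 255.0f + 0.5f);
}

// Blend one channel: result = dst + (src - dst) * alpha, all done in 0.0..1.0.
inline std::uint8_t blendChannel(std::uint8_t src, std::uint8_t dst, std::uint8_t alpha)
{
    const float s = toUnit(src);
    const float d = toUnit(dst);
    const float a = toUnit(alpha);
    return toByte(d + (s - d) * a);
}
```

With alpha at 255 the source value comes through unchanged, and with alpha at 0 the destination value does, which is the "natural math" behaviour you would expect.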

Also, for the conversions, I assume it's better to use the SSE2 commands for going from float to byte/int than simply casting in normal C++?

I would think so, but benchmarking is the only way to actually know what's better [unless you are so experienced you can inspect the code and say from that].
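For anyone who wants to benchmark the two approaches, the following is one possible sketch, not a tuned implementation. It assumes the floats are already in the 0 .. 255 range (out-of-range values are clamped), that `count` is a multiple of 16, and the function names are invented for this example. The intrinsics used (`_mm_cvttps_epi32`, `_mm_packs_epi32`, `_mm_packus_epi16`) are standard SSE2.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h> // SSE2

// Scalar reference: clamp to 0..255, then truncate with a plain cast.
inline void floatsToBytesScalar(const float* in, std::uint8_t* out, int count)
{
    for (int i = 0; i < count; ++i) {
        float v = in[i];
        if (v < 0.0f) v = 0.0f;
        if (v > 255.0f) v = 255.0f;
        out[i] = static_cast<std::uint8_t>(v);
    }
}

// SSE2 version: converts 16 floats to 16 bytes per iteration.
inline void floatsToBytesSSE2(const float* in, std::uint8_t* out, int count)
{
    assert(count % 16 == 0); // assumption of this sketch
    for (int i = 0; i < count; i += 16) {
        // Truncating float -> int32 conversion, four lanes at a time.
        __m128i a = _mm_cvttps_epi32(_mm_loadu_ps(in + i));
        __m128i b = _mm_cvttps_epi32(_mm_loadu_ps(in + i + 4));
        __m128i c = _mm_cvttps_epi32(_mm_loadu_ps(in + i + 8));
        __m128i d = _mm_cvttps_epi32(_mm_loadu_ps(in + i + 12));
        // Narrow int32 -> int16 with signed saturation.
        __m128i ab = _mm_packs_epi32(a, b);
        __m128i cd = _mm_packs_epi32(c, d);
        // Narrow int16 -> uint8 with unsigned saturation (this does the clamping).
        __m128i bytes = _mm_packus_epi16(ab, cd);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), bytes);
    }
}
```

Timing both functions over the same buffer in an optimized build is the only way to know which is faster on a given machine.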

--
Mats

**h3ro** · 07-08-2008

>>[unless you are so experienced you can inspect the code and say from that].
I think it's fairly safe to say I'm not :P

Thanks

**h3ro** · 07-10-2008

OK, I have a new question again.

When working with the __m128 data type I found something weird.

If you stop your program in the debugger and move your mouse over the __m128 variable, you can read it like an array (which makes sense), but there are several ways the array can be read. I'm loading in 4 floats, so m128_f32 is the only one that has useful information in it (which again makes sense).

But I would very much like to get access to the data in m128_i8 or m128_u8 as the data there is just 8 bit (16 spaces in the array). I have been looking around and there is not much information about it.

From MSDN:

Floating-point data loaded or stored as __m128 objects must be generally 16-byte aligned.

Why make it possible to see the bytes in the __m128 variable if they can't be used?

**CornedBee** · 07-11-2008

Some operations, like byte shuffle, work on bytes.
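To make the byte view a bit more concrete: the `m128_u8` / `m128_i8` members are views that Visual C++ exposes on its `__m128` type, and other compilers may not provide them. Integer data is normally held in an `__m128i` instead. The sketch below, with function names invented for this example, loads 16 bytes and widens the first four of them to floats using only SSE2.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h> // SSE2

// Hypothetical helper: widen the lowest 4 bytes of a 16-byte block to 4 floats.
inline __m128 firstFourBytesToFloats(const std::uint8_t* bytes16)
{
    assert(bytes16 != nullptr);
    __m128i v    = _mm_loadu_si128(reinterpret_cast<const __m128i*>(bytes16));
    __m128i zero = _mm_setzero_si128();
    __m128i w16  = _mm_unpacklo_epi8(v, zero);    // low 8 bytes -> 8 x 16-bit
    __m128i w32  = _mm_unpacklo_epi16(w16, zero); // low 4 words -> 4 x 32-bit
    return _mm_cvtepi32_ps(w32);                  // 4 x int32 -> 4 x float
}

// Copy a register out to memory so the lanes can be read on any compiler.
inline void storeFloats(__m128 v, float out[4]) { _mm_storeu_ps(out, v); }
```

Called on BGRA pixel data, this would yield the blue, green, red and alpha values of the first pixel as four floats in one register.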

**h3ro** · 07-11-2008

Sorry for asking all these questions, but what is byte shuffle useful for?

**CornedBee** · 07-11-2008

Endian conversion, planar/interleaved conversion, probably other things.

**h3ro** · 07-11-2008

Ok, now I have written my blitting function, but it is dead slow.

My test scene has 63 small sprites with alpha channel.
With the SSE version I get less than 80 fps
With the normal C++ function I get more than 310 fps

I was wondering if anyone could give me a hint of why that is

Here is my code:

Code for aligning the texture data

Code:

// Create SSE data
	// Test sprite is 64*64
	//64 * 64 = 4096
	__declspec(align(16)) float blue[4096];
	__declspec(align(16)) float green[4096];
	__declspec(align(16)) float red[4096];
	__declspec(align(16)) float trans[4096];

	// Pointer to the original spritedata
	BYTE *texture = spriteData;

	// Split the data up so that we have each colour in a separate array
	for(int i = 0; i < 4096; i++)
	{
		blue[i] = *texture;
		texture++;

		green[i] = *texture;
		texture++;

		red[i] = *texture;
		texture++;

		trans[i] = *texture;
		texture++;	
	}

	// Point the variables in the header file to the values here
	b = blue;
	g = green;
	r = red;
	t = trans;

And here is the blitting code.

Code:

//-------------------------------------------------------
	// SSE BLITTING CODE!
	//-------------------------------------------------------

	// Create a pointer to the colours we are using
	float *cBlue = b;
	float *cGreen = g;
	float *cRed = r;
	float *cAlpha = t;

	// Array for storing the result of each blitting
	__declspec(align(16)) float blue[4];
	__declspec(align(16)) float green[4];
	__declspec(align(16)) float red[4];

	for (int i = 0; i < height; i++)
	{
		//dividedWidth is width of the texture / 4
		for (int j = 0; j < dividedWidth; j++)
		{
			__m128 textBlue =  _mm_load_ps( cBlue);
			__m128 textGreen = _mm_load_ps( cGreen);
			__m128 textRed =   _mm_load_ps( cRed);
			__m128 textAlpha = _mm_load_ps( cAlpha);

			// Increment colour pointer
			cBlue  +=4;
			cGreen +=4;
			cRed   +=4;
			cAlpha +=4;

			__m128 screenBlue =  _mm_setr_ps( *(screenDataPnt) ,     *(screenDataPnt + 4) ,*(screenDataPnt + 8) ,*(screenDataPnt + 12) );
			__m128 screenGreen = _mm_setr_ps( *(screenDataPnt + 1) , *(screenDataPnt + 5) ,*(screenDataPnt + 9) ,*(screenDataPnt + 13) );
			__m128 screenRed =   _mm_setr_ps( *(screenDataPnt + 2) , *(screenDataPnt + 6) ,*(screenDataPnt + 10) ,*(screenDataPnt + 14));

			// Load 256 into each of the registers, so that we can use it to make alpha from 0 to 1
			__m128 temp = _mm_set_ps(256,256,256,256);

			// Make alpha in the range 0 to 1
			textAlpha = _mm_div_ps(textAlpha, temp);

			// Blue
			temp = _mm_sub_ps(textBlue, screenBlue);
			temp = _mm_mul_ps(temp,textAlpha);
			temp = _mm_add_ps(temp, screenBlue);

			_mm_storeu_ps(blue, temp);

			// Green
			temp = _mm_sub_ps(textGreen, screenGreen);
			temp = _mm_mul_ps(temp,textAlpha);
			temp = _mm_add_ps(temp, screenGreen);

			_mm_storeu_ps(green, temp);

			// Red
			temp = _mm_sub_ps(textRed, screenRed);
			temp = _mm_mul_ps(temp,textAlpha);
			temp = _mm_add_ps(temp, screenRed);

			_mm_storeu_ps(red, temp);


			// Copy the result into the screenData pointer
			for(int p = 0; p < 4; p++)
			{
				*screenDataPnt = blue[p];
				screenDataPnt++;
				*screenDataPnt = green[p];
				screenDataPnt++;
				*screenDataPnt = red[p];
				screenDataPnt++;
				screenDataPnt++;
			}
		}

		// Skip to the start of the next screen row:
		// (screenWidth - textureWidth) * bytes per pixel = (640 - 64) * 4 = 2304
		screenDataPnt += 2304;
	}

**matsp** · 07-11-2008

Could you compile the SSE code with -S on the gcc compile line, and post the code for that (as an attachment, perhaps, since it's probably a bit longish).

--
Mats

**h3ro** · 07-11-2008

I'm using VS, so I'm not sure what -S is, but I assume it's assembly?

The attached file is the result of activating Assembly With Source Code (/FAs) in the project option, is that the correct thing?

**matsp** · 07-11-2008

Like I expected, the SSE code generated from intrinsics is not particularly good. Before I get to that:

Code:

			// Load 256 into each of the registers, so that we can use it to make alpha from 0 to 1
			__m128 temp = _mm_set_ps(256,256,256,256);

is wrong, it should be 255 (otherwise alpha=255 makes 0.996... instead of 1.0 - you are not the first to make that mistake).
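A small sketch of the corrected scaling, assuming four alpha values already stored as floats in the 0 .. 255 range. The function name `normalizeAlpha4` is hypothetical, and `_mm_set1_ps` is used simply as a shorter way to fill all four lanes with the same constant.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE

// Scale four alpha values from 0..255 to 0.0..1.0.
inline void normalizeAlpha4(const float* alpha, float* out)
{
    assert(alpha != nullptr && out != nullptr);
    const __m128 divisor = _mm_set1_ps(255.0f); // 255, not 256
    _mm_storeu_ps(out, _mm_div_ps(_mm_loadu_ps(alpha), divisor));
}
```

With a divisor of 255 a fully opaque pixel maps to exactly 1.0, whereas 255 / 256 is 0.99609375, so a fully opaque sprite would still let a little of the background through.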

Every operation of this SSE code is basically doing a whole heap of SSE operations that aren't needed:

Code:

; 141  : 
; 142  : 			// Make alpha in the range 0 to 1
; 143  : 			textAlpha = _mm_div_ps(textAlpha, temp);

	movaps	xmm0, XMMWORD PTR _temp$18605[ebp]
	movaps	xmm1, XMMWORD PTR _textAlpha$18597[ebp]
	divps	xmm1, xmm0
	movaps	XMMWORD PTR $T18607[ebp], xmm1
	movaps	xmm0, XMMWORD PTR $T18607[ebp]
	movaps	XMMWORD PTR _textAlpha$18597[ebp], xmm0
// Why not just store xmm1 in _textAlpha$18597[ebp]?

; 144  : 
; 145  : 			// Blue
; 146  : 			temp = _mm_sub_ps(textBlue, screenBlue);

	movaps	xmm0, XMMWORD PTR _screenBlue$18599[ebp]
	movaps	xmm1, XMMWORD PTR _textBlue$18591[ebp]
	subps	xmm1, xmm0
	movaps	XMMWORD PTR $T18608[ebp], xmm1
	movaps	xmm0, XMMWORD PTR $T18608[ebp]
	movaps	XMMWORD PTR _temp$18605[ebp], xmm0

// As above. 
; 147  : 			temp = _mm_mul_ps(temp,textAlpha);

	movaps	xmm0, XMMWORD PTR _textAlpha$18597[ebp]
	movaps	xmm1, XMMWORD PTR _temp$18605[ebp]
	mulps	xmm1, xmm0
	movaps	XMMWORD PTR $T18609[ebp], xmm1
	movaps	xmm0, XMMWORD PTR $T18609[ebp]
	movaps	XMMWORD PTR _temp$18605[ebp], xmm0

; 148  : 			temp = _mm_add_ps(temp, screenBlue);

	movaps	xmm0, XMMWORD PTR _screenBlue$18599[ebp]
	movaps	xmm1, XMMWORD PTR _temp$18605[ebp]
	addps	xmm1, xmm0
	movaps	XMMWORD PTR $T18610[ebp], xmm1
	movaps	xmm0, XMMWORD PTR $T18610[ebp]
	movaps	XMMWORD PTR _temp$18605[ebp], xmm0

; 149  : 
; 150  : 			_mm_storeu_ps(blue, temp);

	movaps	xmm0, XMMWORD PTR _temp$18605[ebp]
	movups	XMMWORD PTR _blue$[ebp], xmm0

; 151  : 
; 152  : 			// Green
; 153  : 			temp = _mm_sub_ps(textGreen, screenGreen);

	movaps	xmm0, XMMWORD PTR _screenGreen$18601[ebp]
	movaps	xmm1, XMMWORD PTR _textGreen$18593[ebp]
	subps	xmm1, xmm0
	movaps	XMMWORD PTR $T18611[ebp], xmm1
	movaps	xmm0, XMMWORD PTR $T18611[ebp]
	movaps	XMMWORD PTR _temp$18605[ebp], xmm0

; 154  : 			temp = _mm_mul_ps(temp,textAlpha);

	movaps	xmm0, XMMWORD PTR _textAlpha$18597[ebp]
	movaps	xmm1, XMMWORD PTR _temp$18605[ebp]
	mulps	xmm1, xmm0
	movaps	XMMWORD PTR $T18612[ebp], xmm1
	movaps	xmm0, XMMWORD PTR $T18612[ebp]
	movaps	XMMWORD PTR _temp$18605[ebp], xmm0

; 155  : 			temp = _mm_add_ps(temp, screenGreen);

	movaps	xmm0, XMMWORD PTR _screenGreen$18601[ebp]
	movaps	xmm1, XMMWORD PTR _temp$18605[ebp]
	addps	xmm1, xmm0
	movaps	XMMWORD PTR $T18613[ebp], xmm1
	movaps	xmm0, XMMWORD PTR $T18613[ebp]
	movaps	XMMWORD PTR _temp$18605[ebp], xmm0

; 156  : 
; 157  : 			_mm_storeu_ps(green, temp);

	movaps	xmm0, XMMWORD PTR _temp$18605[ebp]
	movups	XMMWORD PTR _green$[ebp], xmm0

I could keep going with the above comments, but it's essentially just the same.

--
Mats

**h3ro** · 07-11-2008

Ok, so my code basically does everything twice?

If I understand right, the code in red can be deleted?

**matsp** · 07-11-2008

Originally Posted by h3ro

Ok, so my code basically does everything twice?

If I understand right, the code in red can be deleted?

No, the red code could be written as one instruction instead of three (storing the value from the result register immediately, rather than storing it as a temporary value first).

Further, since all of your code only ever uses xmm0 and xmm1 [in the main calculation paths at least], it reduces the chances of the processor performing multiple operations in parallel [even tho' the processor probably has a register renaming feature, I doubt it will be clever enough to do it really well].
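For reference, here is one way the same blend could be expressed with intrinsics so that each intermediate is a local value the compiler is free to keep in a register. This is only a sketch under several assumptions that go beyond the thread: the destination is held as a planar float array like the sprite channels (the posted code reads and writes interleaved bytes instead), all pointers are 16-byte aligned, `count` is a multiple of 4, and the name `blendPlane` is invented. It may also be worth checking whether the listing above came from a build with optimization enabled, since storing and reloading every intermediate is typical of unoptimized output; the thread does not say which build was used.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE

// Blend `count` planar samples: dst = dst + (src - dst) * (alpha / 255).
// Assumes src, alpha and dst are 16-byte aligned and count is a multiple of 4.
inline void blendPlane(const float* src, const float* alpha, float* dst, int count)
{
    assert(count % 4 == 0);
    // The constant is built once, outside the loop, and the divide becomes a multiply.
    const __m128 inv255 = _mm_set1_ps(1.0f / 255.0f);
    for (int i = 0; i < count; i += 4) {
        const __m128 s = _mm_load_ps(src + i);
        const __m128 d = _mm_load_ps(dst + i);
        const __m128 a = _mm_mul_ps(_mm_load_ps(alpha + i), inv255);
        // One chained expression: no named temporary is reused between steps.
        const __m128 r = _mm_add_ps(d, _mm_mul_ps(_mm_sub_ps(s, d), a));
        _mm_store_ps(dst + i, r);
    }
}
```

Calling this once per colour channel would replace the three sub/mul/add groups in the posted loop. The per-pixel `_mm_setr_ps` gathers and the scalar copy back to the screen are not covered here and would still need their own treatment.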

--
Mats

**h3ro** · 07-11-2008

I am not sure I really understand how to change my code to make it better.
To be honest I am not even sure my understanding of SSE is correct.

I know this is a lot to ask, and I'm not sure where to go from here, but does anyone have time to write some short pseudocode of what I am trying to do? It does not have to be perfect or work 100%, just something I could use as a guide, because what I thought would work pretty well turned out to be 400% worse than the original C++ code.

**matsp** · 07-11-2008

Well, the right solution (I think) is to write the SSE code as inline assembler [assuming you are not using 64-bit compiler - if you are, then you will need to use MASM and make it an external function from the C++ code's perspective]. Whilst I probably could do that pretty quickly, the debugging of it may take some time (depending on what kind of and how many mistakes I make - there may be none, but I doubt it), and just posting something that "almost works" is not a good idea.

--
Mats

**h3ro** · 07-11-2008

I started this in assembly, but it was painfully slow to write as I know no assembly at all (I had it working so that I loaded variables into memory and then to the screen, without doing anything with them).

I was also told that correctly written intrinsics should be almost as fast as inline assembly (a 15-25% difference, if I remember correctly).

I'm not using a 64-bit compiler now, but as one of my PCs is a 64-bit machine I would like to take advantage of 64-bit at some stage (right now that machine is running XP32).

>>Whilst I probably could do that pretty quickly, the debugging of it may take some time (depending on what kind of and how many mistakes I make - there may be none, but I doubt it), and just posting something that "almost works" is not a good idea

I would prefer to do it myself, as I am doing this as a learning experience. What I meant by help is help with the concept or pseudocode, because from your reply I get the understanding that my code does not really take advantage of the power of the SSE registers.

I have been fooling with this forever now, so I might put it on hold for a while until I have some more programming experience.

As a side question, are there any books I can order that will help me here? I had a look at Amazon, but could not find anything. I tried reading the Intel manuals, but to be honest, I did not understand too much of them.

Thanks to everyone who has given any input to this