1. ## Some optemizing questions

Hallo,

I bought a book recommended to me by someone on this forum (Tricks of the game programming gurus) and it has some interesting pages about optimizing some of the functions.

My first question is about using bit shifting instead of multiplication. I was always under the impression that doing bit shifting was a bit faster then using multiplication. So I made a small program to test how much faster it was. To my surprise I found multiplication to have almost the exact same speed as bit shifting. Here is the program I made, with the different speed results on the top.

Code:
```// Dev C++ 4.9.9.2

// Multiplication
// First computer
// 4531 first
// 4516 second
// 4516 third

// Second computer, more itterations
// 11485
// 11485
// 11469

// Bit shift
// First computer
// 4500 first
// 4469 second
// 4469 third

// Second computer, more itterations
// 11438
// 11437
// 11563

#include <windows.h>
#include <conio.h>
#include <iostream>

using namespace std;

int main()
{
char *buffer = new char[512 * 512 * 4];

cout << "Starting.." << endl;

double startTime = timeGetTime();

for(int i = 0; i < 1000; i++)
{
for (int y = 0; y < 512; y++)
{
for (int x = 0; x < 512; x++)
{
// The two different ways of doing the multiplication,
// comment out one.
//int offset = ((512 * y) + x ) * 4;
//int offset = (((y << 9)) + x) << 2;
}
}
}

double endTime = (timeGetTime() - startTime);

cout << "Ending..." << endl;
cout << endTime << endl;

delete [] buffer;

system("pause");
return 0;
}```
Am I doing something wrong, or is << almost the same as * when it comes to speed?

Also, the book talks about how to draw bitmaps to the screen using logical operators, but it does not really go into any details about how to use it. In the short paragraph it mentions that you can use OR to draw the bitmap. The advantage of using it is that I dont have to use an if statement for each pixel. The disadvantage is that you loose some colour and that it only works on 16bit colours. Does anyone know of where I can read up on this as I dont really understand it and I want to use 32bit colours.

Also, is there a way to get rid of the temp variable in this function:
Code:
```// Colour is a array of 4 values
void setPixel(BYTE &screenBuffer, BYTE &colour, int x, int y)
{
int offset = (( SCREENWIDTH * y) + x) * 4;

BYTE *temp = screnBuffer + offset;

memcpy(temp, colour, 4);
}```
Or im I better of using this:
Code:
```// Colour is a array of 4 values
void setPixel(BYTE &screenBuffer, BYTE &colour, int x, int y)
{
int offset = (( SCREENWIDTH * y) + x) * 4;

screenBuffer[offset++] = colour[0];   // R
screenBuffer[offset++] = colour[0];   // G
screenBuffer[offset] = colour[0];     // B
}```

2. > To my surprise I found multiplication to have almost the exact same speed as bit shifting.
Probably because most modern compilers are perfectly capable of performing these (and many other) optimisations all the time, rather than just when you think about it.

Also, if you fail to use the results (as you do in your simple test), then the compiler may well remove the whole nested loop construct because none of the results are used.

> In the short paragraph it mentions that you can use OR to draw the bitmap
> The disadvantage is that you loose some colour and that it only works on 16bit colours
It works only where you have a paletted display.
For example, if you set all the entries in the pallete which match say 0001XXXX (all 16 combinations to the same colour), then whatever you draw in the other 4 bits will always have that colour.
You end up with 15 colours for foreground objects (one must be transparent), and 16 colours for background objects. You now have an effective palette of 31 colours rather than the full 256.

This was a popular technique on 1980's home micros, but modern accelerated graphics cards can probably redraw the whole lot from scratch much quicker anyway, and in a full colour depth.

3. > To my surprise I found multiplication to have almost the exact same speed as bit shifting.
Probably because most modern compilers are perfectly capable of performing these (and many other) optimisations all the time, rather than just when you think about it.

Also, if you fail to use the results (as you do in your simple test), then the compiler may well remove the whole nested loop construct because none of the results are used.
Do the compiler use do it for "harder" values, where you might need to shift two or three times? How many shifts are ok to do, or should I always use multiplication?

You was right, the compiler optimized it away. Added so that I am using it now, but the difference between the two are still a lot smaller then what I thought it would be.

> In the short paragraph it mentions that you can use OR to draw the bitmap
> The disadvantage is that you loose some colour and that it only works on 16bit colours
It works only where you have a paletted display.
For example, if you set all the entries in the pallete which match say 0001XXXX (all 16 combinations to the same colour), then whatever you draw in the other 4 bits will always have that colour.
You end up with 15 colours for foreground objects (one must be transparent), and 16 colours for background objects. You now have an effective palette of 31 colours rather than the full 256.
So it will not work to good if I try to use that to draw a sprite of a car or something onto something else?

This was a popular technique on 1980's home micros, but modern accelerated graphics cards can probably redraw the whole lot from scratch much quicker anyway, and in a full colour depth.
All my graphic routines are old school. Im just using software. Thats why I am trying to optimize everything as much as I can, as drawing is so incredible expensive right now.

4. As for multiplication - are you using compiler optimizations? These tend to make the largest difference.

It also looks like you are looking for optimization possibilities in the wrong place. Rather than wondering whether one way of doing multiplication is faster than another, you should be looking for ways to do less multiplications in the first place. For example

Code:
```    int max = 512 * 512;
for(int i = 0; i < 1000; i++)
{
for (int y = 0; y < max; ++y) {
int offset = y * 4;
}
}```

5. As for multiplication - are you using compiler optimizations? These tend to make the largest difference.
Well, I am still playing a bit with that. Still not to good with visual studio, last time I tried turning on optimization the compiler optimized it a bit to well. It optimized several of my functions away (Hey, who needs input functions in a game?) Is there any setting or anything you can recommend? So far, the only "optimizing" I am using is to change from debug mode to release.

It also looks like you are looking for optimization possibilities in the wrong place. Rather than wondering whether one way of doing multiplication is faster than another, you should be looking for ways to do less multiplications in the first place. For example
Well, I think I have minimized the multiplication as much as possible. Or at lest as good as I can.

I know I am starting to look for optimization in places where there are almost nothing to be gained, but how expensive is assignment?

In one of my draw routines I create 4 new variables which I dont really really need. (Just make the code easier to read) Is there anything to gain from removing them?
The 4 assignments are done for each pixel.

6. > So it will not work to good if I try to use that to draw a sprite of a car or something onto something else?
Well you could, if you create more "layers" in your palette, but the number of colours goes down in response. A background, and non-overlapping foreground characters is about the limit of the technique.

> I know I am starting to look for optimization in places where there are almost nothing to be gained, but how expensive is assignment?
That's about the simplest thing there is.
What you really need to do is get hold of a profiler to actually tell you where the problem areas are, rather than guessing. Also, you need to look at your algorithms, not micro-optimisations which the compiler will manage perfectly well to do itself.

In fact, if you get too "clever" about it, you might just make it worse rather than better, as the compiler will no longer be able to figure out what you're really trying to do.

The compiler contains the distilled wisdom of programmers with a combined total of hundreds, if not thousands of years of programming experience. If there's a trick, you can be pretty sure they know about it, and hence the compiler probably does as well, if you make your intent clear enough to being with.

For example, no amount of micro-optimisation-inline-assembler-yada-yada would ever make a bubble sort better than a quick sort. But you as the programmer would look at the profile data, see that it was taking far too long, then do some research into better techniques. This is where you'll find all the 10x improvements.

> It optimized several of my functions away (Hey, who needs input functions in a game?)
In a correct program, optimisation should not change the functionality of the program. If the functionality changed, then I'd suspect bugs in your code. Many "problems" such as using uninitialised variables or minor buffer overruns are easy to gloss over in debug mode, but are then exposed when in release mode. Similarly, people sometimes complain it works in release mode but not debug mode for pretty much the same reasons.

> int offset = (( SCREENWIDTH * y) + x) * 4;
If you're calling this for incrementing x and y values all through the 2D array, then you can definitely improve things by moving this outside the function. Making it static and inline may give the compiler enough of a clue to achieve the kind of code motion which would reduce this to just adding 4 each time around the loop.

7. > I know I am starting to look for optimization in places where there are almost nothing to be gained, but how expensive is assignment?
That's about the simplest thing there is.
What you really need to do is get hold of a profiler to actually tell you where the problem areas are, rather than guessing. Also, you need to look at your algorithms, not micro-optimisations which the compiler will manage perfectly well to do itself.
I am not really sure what I can do to improve my algorithm. I have looked online at other blitting functions, and they are almost the same. A few are a bit faster, but they are using evil assembly hacks which I dont understand.

In fact, if you get too "clever" about it, you might just make it worse rather than better, as the compiler will no longer be able to figure out what you're really trying to do.

The compiler contains the distilled wisdom of programmers with a combined total of hundreds, if not thousands of years of programming experience. If there's a trick, you can be pretty sure they know about it, and hence the compiler probably does as well, if you make your intent clear enough to being with.
I have seen that a few times. Makes me feel bad, when the compiler better knows what I want to do then I know myself.

> It optimized several of my functions away (Hey, who needs input functions in a game?)
In a correct program, optimisation should not change the functionality of the program. If the functionality changed, then I'd suspect bugs in your code. Many "problems" such as using uninitialised variables or minor buffer overruns are easy to gloss over in debug mode, but are then exposed when in release mode. Similarly, people sometimes complain it works in release mode but not debug mode for pretty much the same reasons.
I was expecting it to be my fault, I just found it amusing at the time.

> int offset = (( SCREENWIDTH * y) + x) * 4;
If you're calling this for incrementing x and y values all through the 2D array, then you can definitely improve things by moving this outside the function. Making it static and inline may give the compiler enough of a clue to achieve the kind of code motion which would reduce this to just adding 4 each time around the loop.
The function is just for drawing more random like things, say a star field.

I know(or I have read in the game programming book) that if you have a special case happening over and over again, you should make a special blitting function for that case. I am not really sure how to do this, but maybe someone can point me in the right direction?
My player and enemy sprites are always 32x32 and always have alpha channel. Almost all of the center of the sprite if fully coloured(no alpha) while the edges have alpha. Is there any way I can use that to make a faster blitter?

> So it will not work to good if I try to use that to draw a sprite of a car or something onto something else?
Well you could, if you create more "layers" in your palette, but the number of colours goes down in response. A background, and non-overlapping foreground characters is about the limit of the technique.
If I have to use more then one logic operator, it is better to use an if, as one if statement take about the same time as two logic operators.

What you really need to do is get hold of a profiler to actually tell you where the problem areas are, rather than guessing.
I have tried to use one, but I could not really get it to work very well. Can anyone recommend a free one that work on Vista? (I am using Visual studio)

Here is the blitting function for alpha, it is probably not perfect, but it is as good as I can get it without any help
Code:
```void Sprite::drawAlphaSprite(BYTE *screenData, Rectangle2D &screenRec, int xPos, int yPos)
{
// Figure out which part of the texture to draw
Rectangle2D newTextureRec = calculateClipping(screenRec, textureRec, xPos, yPos);

// A pointer to the memory we want to copy to
BYTE *screenDataPnt = screenData + (xPos + yPos * screenRec.getWidth()) * 4;

// A pointer to the memory we want to copy from
BYTE *textureDataPnt = spriteData + (newTextureRec.left + newTextureRec.top * textureRec.right) * 4;

// Calculate the offset after each line
int endScreenOffset = (screenRec.getWidth() - newTextureRec.getWidth()) * 4;
int endTextureOffset = (textureRec.getWidth() - newTextureRec.getWidth()) * 4;

// Do the drawing
for (int y = 0; y < newTextureRec.getHeight(); y++)
{
for (int x = 0; x < newTextureRec.getWidth(); x++)
{
BYTE blue = textureDataPnt[0];
BYTE green = textureDataPnt[1];
BYTE red = textureDataPnt[2];
BYTE alpha = textureDataPnt[3];

// Calculate how much the alpha value is going to effect the pixel
float alphaMod = alpha / 255.0;

if(alphaMod != 0)
{
// Assign the correct colour to the current pixel, the fast way
screenDataPnt[0] = screenDataPnt[0] + (( alpha * (blue - screenDataPnt[0] )) >> 8);
screenDataPnt[1] = screenDataPnt[1] + (( alpha * (green - screenDataPnt[1] )) >> 8);
screenDataPnt[2] = screenDataPnt[2] + (( alpha * (red - screenDataPnt[2] )) >> 8);
}

// Move to the next pixel on the screen
screenDataPnt += 4;

// Move to the next pixel on the sprite
textureDataPnt += 4;
}

// Move to the next line on the screen and the next line of the sprite
screenDataPnt += endScreenOffset;
textureDataPnt += endTextureOffset;
}
}```
And I use this one when there is no alpha. This one is a lot faster
Code:
```void Sprite::drawNoAlphaSprite(BYTE *screenData, Rectangle2D &screenRec,int xPos, int yPos)
{
// Figure out which part of the texture to draw
Rectangle2D newTextureRec = calculateClipping(screenRec, textureRec, xPos, yPos);

// Pointer to the memory we want to copy to
BYTE *screenDataPnt = screenData + (xPos + yPos * screenRec.getWidth()) * 4;

// Pointer to the memory we want to copy from
BYTE *textureDataPnt = spriteData + (newTextureRec.left + newTextureRec.top * textureRec.getWidth()) * 4;

// Copy one line of the sprite to the screen for each loop
for (int y = newTextureRec.top; y < newTextureRec.bottom; y++)
{
// Copy the line
memcpy(screenDataPnt,textureDataPnt, newTextureRec.getWidth() * 4);

// Move to the next line of the texture
textureDataPnt += textureRec.getWidth() * 4;

// Move to the next line of the screen
screenDataPnt += screenRec.getWidth() * 4;
}
}```

8. Here's some suggestions that may gain you a little bit of an improvemnt. Mostly moving stuff out of the loop that can be moved out.

Code:
```                // Assign the correct colour to the current pixel, the fast way
screenDataPnt[0] +=  (( alpha * (blue - screenDataPnt[0] )) >> 8);
screenDataPnt[1] +=  (( alpha * (green - screenDataPnt[1] )) >> 8);
screenDataPnt[2] +=  (( alpha * (red - screenDataPnt[2] )) >> 8);```
This change makes the code easier to read - it won't affect speed at all, but easier to read code is always better.

Code:
`            float alphaMod = (lpha / 255.0);`
alphaMod is only used in the if-statement immediately after, so changing that if-statement to use alpha == 0 instead would do a small improvement.

Code:
```        int height = newTextureRec.getHeight();
int width = newTextureRec.getWidth();
for (int y = 0; y < height ; y++)
{
for (int x = 0; x < width; x++)
{```
This avoids a potential function call for every loop iteration - I'm assuming the texture is not changing size during the drawing.

Code:
```void Sprite::drawNoAlphaSprite(BYTE *screenData, Rectangle2D &screenRec,int xPos, int yPos)
{
int scWidth = screenRec.getWidth();
int texWidth = screenRec.getWidth();
// Figure out which part of the texture to draw
Rectangle2D newTextureRec = calculateClipping(screenRec, textureRec, xPos, yPos);

// Pointer to the memory we want to copy to
BYTE *screenDataPnt = screenData + (xPos + yPos * scWidth) * 4;

// Pointer to the memory we want to copy from
BYTE *textureDataPnt = spriteData + (newTextureRec.left + newTextureRec.top * texWidth) * 4;

scWidth *=4;
texWidth *= 4;

int newTexWidth = newTextureRec.getWidth() * 4;

// Copy one line of the sprite to the screen for each loop
for (int y = newTextureRec.top; y < newTextureRec.bottom; y++)
{
// Copy the line
memcpy(screenDataPnt,textureDataPnt, newTexWidth);

// Move to the next line of the texture
textureDataPnt += texWidth;

// Move to the next line of the screen
screenDataPnt += scWidth;
}
}```
Again, avoiding function (and multiplies) calls inside the for-loop.

--
Mats