The reason for the *pSVB++ is that each quad in the code is 6 vertices (two triangles). I'm not using indexed buffers and I'm not using a hardware buffer.
DrawPrimitiveUP draws straight from a pointer to system memory instead of a hardware vertex buffer, which means there is nothing to lock before you access the vertices. Yes, you take a little bit of a hit with it, but it's not too bad. Locking and unlocking a video RAM vertex buffer takes more of a hit than working on a software system RAM buffer.
So my point in doing 6 *pSVB++ is to change the variables for one vertex, increment the pointer, and repeat that for each of the 6 vertices in the quad.
The actual first version was pSVB[vert].x+=irelScrollX.
But in profiling this was actually slower than *pSVB+=vertexCount which was my second version. The optimized version actually does 6 vertices at a time by incrementing a pointer which means the loop actually is faster. So instead of changing variables 1 at a time in a single increment loop, I change 6 at a time and increment by 6. I just unrolled the loop.
I tried to also unroll the loop on the vertexCount loop inside of the render function, but it was causing issues that I couldn't quite figure out so I changed it to what worked.
I'm trying to process 6 vertices in one loop, or one complete quad in one loop.
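Here is a rough sketch of what that kind of unrolled scroll loop could look like. The SoftVertex layout and the ScrollQuads name are made up for illustration, since the real vertex format and function body aren't shown in this post.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical pre-transformed vertex; the real layout is an assumption.
struct SoftVertex {
    float x, y, z, rhw;
    float u, v;
};

// Shift every vertex by (dx, dy), handling one whole quad (6 vertices) per pass.
void ScrollQuads(SoftVertex* pSVB, std::size_t vertexCount, float dx, float dy)
{
    assert(vertexCount % 6 == 0);   // 6 unshared vertices per quad
    SoftVertex* const pEnd = pSVB + vertexCount;
    while (pSVB != pEnd) {
        // First triangle of the quad
        pSVB->x += dx; pSVB->y += dy; ++pSVB;
        pSVB->x += dx; pSVB->y += dy; ++pSVB;
        pSVB->x += dx; pSVB->y += dy; ++pSVB;
        // Second triangle of the quad
        pSVB->x += dx; pSVB->y += dy; ++pSVB;
        pSVB->x += dx; pSVB->y += dy; ++pSVB;
        pSVB->x += dx; pSVB->y += dy; ++pSVB;
    }
}
```

The loop test runs once per quad instead of once per vertex, and there is no index arithmetic at all, just pointer increments.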
Now the reason for NOT using indexed buffers. First, this is in software RAM so memory is not an issue. Second, it is much easier to do special effects on a per-quad basis when you are not using indexed buffers. So for screen dissolves, wipes, fades, and/or other special effects like screen shattering, etc, it's much easier to use 6 vertices per quad. In this way I can literally tear apart the screen and do some rotational effects on each triangle without having to lock a hardware vertex buffer, without having to de-couple the index buffer from the vertex buffer, etc., etc.
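To give an idea of the per-quad effects I mean, here is a minimal sketch that spins one quad about its own centre. Vtx2 and RotateQuad are hypothetical names for this example only, and it assumes each quad is a rectangle stored as 6 consecutive, unshared vertices.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical 2D position-only vertex, for illustration.
struct Vtx2 { float x, y; };

// Rotate one quad (6 unshared vertices, two triangles) about its own centre.
// Because no vertex is shared with a neighbouring quad, nothing else moves.
void RotateQuad(Vtx2* quad, float radians)
{
    // The two duplicated corners lie on the shared diagonal, so for a
    // rectangle the average of all 6 vertices is still the centre.
    float cx = 0.0f, cy = 0.0f;
    for (std::size_t i = 0; i < 6; ++i) { cx += quad[i].x; cy += quad[i].y; }
    cx /= 6.0f;
    cy /= 6.0f;

    const float c = std::cos(radians);
    const float s = std::sin(radians);
    for (std::size_t i = 0; i < 6; ++i) {
        const float dx = quad[i].x - cx;
        const float dy = quad[i].y - cy;
        quad[i].x = cx + dx * c - dy * s;
        quad[i].y = cy + dx * s + dy * c;
    }
}
```

Calling RotateQuad(pSVB + quadIndex * 6, angle) touches exactly one quad. With a shared index buffer the same corner would belong to up to four quads, so you would have to split them apart first.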
If this was for a 3D rendering system then, yes, what you have said would be correct. In 3D I would never do any of this and don't, but for 2D I just need more control of the screen and of the sprites than Direct3D's and D3DX's API allows for.
Even ID3DXSprite has one glaring oversight in that you cannot access the vertex buffer. This means that bounding boxes cannot be computed post-transformation for sprites and you pretty much have to compute that for yourself. I'm not sure why this is so or why they don't at least let you specify a bounding volume and allow you access to it. The way I do bounding boxes is to first create the bounding volume. Second, after the object is transformed I re-compute the bounding box from the transformed vertices. Just transforming the bounding box does not work correctly, as explained in Real-Time Rendering and Game Coding Complete. Moller and Haines provide an extremely detailed explanation of why this does not work.
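A minimal sketch of that second step, re-computing an axis-aligned box from vertices that have already been transformed. Vec2, Box2, RotateAboutOrigin and ComputeBounds are names invented for this example, and the rotation stands in for whatever transform the sprite really gets.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical types for illustration.
struct Vec2 { float x, y; };
struct Box2 { float minX, minY, maxX, maxY; };

// Stand-in transform: rotate every vertex about the origin.
void RotateAboutOrigin(Vec2* verts, std::size_t count, float radians)
{
    const float c = std::cos(radians);
    const float s = std::sin(radians);
    for (std::size_t i = 0; i < count; ++i) {
        const float x = verts[i].x;
        const float y = verts[i].y;
        verts[i].x = x * c - y * s;
        verts[i].y = x * s + y * c;
    }
}

// Build the axis-aligned box from the vertices as they are now,
// i.e. call this AFTER the transform has been applied.
Box2 ComputeBounds(const Vec2* verts, std::size_t count)
{
    assert(count > 0);
    Box2 b = { verts[0].x, verts[0].y, verts[0].x, verts[0].y };
    for (std::size_t i = 1; i < count; ++i) {
        if (verts[i].x < b.minX) b.minX = verts[i].x;
        if (verts[i].x > b.maxX) b.maxX = verts[i].x;
        if (verts[i].y < b.minY) b.minY = verts[i].y;
        if (verts[i].y > b.maxY) b.maxY = verts[i].y;
    }
    return b;
}
```

This only works if you own the vertex data, which is exactly what ID3DXSprite doesn't give you.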
There is a method to the madness.
What's with all the offsets?
The reason for all of the offsets is so I can linearly traverse the map in memory. I detest doing any multiplies in a loop like that, so to prevent it I pre-compute the offset. But in order to do this properly you must have two variables.
Here is why.
Code to traverse a 2D array (pArray) in linear fashion.
This is using array indexing and not manual pointer addition.
Slower, but it works.
The row and col counters are so I can track the width and height without using (dwOffset % width), which is slower since it is a divide.
[code]
DWORD dwOffset=0;
DWORD dwStartOffset=dwOffset;
int row=0,col=0;
int value=0;
do
{
    pArray[dwOffset]=value;
    value++;
    dwOffset++;
    col++;
    //col now holds the number of elements written on this row,
    //so the row is finished once it reaches ArrayWidth
    if (col>=ArrayWidth)
    {
        //Start offset is the beginning of the current line
        //dwOffset is at the end of the current line
        //Increment dwOffset by ArrayWidth puts dwOffset at the END
        //of the next line, instead of the START of the next line
        //So we increment StartOffset and then set dwOffset to
        //equal that value
        dwStartOffset+=ArrayWidth;
        dwOffset=dwStartOffset;
        //Reset column counter
        col=0;
        //Increment row counter
        row++;
    }
} while (row<ArrayHeight);
[/code]
So for this loop we traverse a 2D array in linear fashion and we have this:
Code:
pArray[dwOffset]=value;
value++;
dwOffset++;
col++;
1 indexed array access //add esi,dwOffset
3 integral additions //inc [ebp+stack_position_of_var]
Code:
if (col>=ArrayWidth)
I'm not sure exactly how the compiler will treat this w/o looking at the asm source. I would do it this way which is prob not the fastest way. I would use ecx as the column counter.
1 comparison with a parameter on the stack.
mov eax,[ebp+stack_pos_of_ArrayWidth]
cmp ecx,eax
jae MOVEDOWNROW
jmp STARTOFLOOP
Code:
dwStartOffset+=ArrayWidth;
dwOffset=dwStartOffset;
//Reset column counter
col=0;
//Increment row counter
row++;
Again we have (in order)
1 Addition
1 Assignment
1 Assignment
1 Addition
Code:
} while (row<ArrayHeight);
Here is 1 comparison with a possible 2 register loads for row and ArrayHeight, again, depending on register/stack usage.
So the entire loop comes down to:
1 indexed array access
1 integral addition
1 integral addition
1 integral addition
1 comparison with a parameter on the stack.
1 integral addition
1 assignment
1 assignment
1 integral addition
1 comparison
So:
1 indexed array access
5 integral additions
2 assignments
2 comparisons
As opposed to something like this:
Code:
for (int i=0;i<ArrayHeight;i++)
{
    for (int j=0;j<ArrayWidth;j++)
    {
        Array[i][j]=value;
    }
}
2 loops.
1 inherent multiply executed once each iteration of the nested loop.
1 Array access inside of nested loop.
2 inherent comparisons, increments
Array[i][j] works out to:
Array[i*width+j]
Which is what I want to avoid as well as the nested loops.
So I unroll the loop and eliminate the multiply.
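Where the two variables really earn their keep is when the area you walk is narrower than the map itself, for example a visible window inside a bigger map. Here is a sketch of that case; FillWindow and its parameters are hypothetical and only meant to show the start-offset/current-offset pair working with nothing but additions.

```cpp
#include <cassert>
#include <cstddef>

// Walk a viewWidth x viewHeight window inside a row-major map that is
// mapWidth elements wide, starting at startOffset. One loop, no multiplies.
void FillWindow(int* pMap, std::size_t mapWidth, std::size_t startOffset,
                std::size_t viewWidth, std::size_t viewHeight)
{
    assert(viewWidth > 0 && viewHeight > 0 && viewWidth <= mapWidth);
    std::size_t rowStart = startOffset;  // beginning of the current row
    std::size_t offset = rowStart;       // current element
    std::size_t col = 0, row = 0;
    int value = 0;
    do {
        pMap[offset] = value++;
        ++offset;
        if (++col >= viewWidth) {
            // offset only points at the end of the window row, so step the
            // saved row start down one full map row and restart from there.
            rowStart += mapWidth;
            offset = rowStart;
            col = 0;
            ++row;
        }
    } while (row < viewHeight);
}
```

With a single offset you would have to add (mapWidth - viewWidth) at the end of each row or fall back to row*mapWidth+col. Keeping the row start in its own variable avoids both.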
Attached is the profile of this code - profiled for function timing.
Here is the render function:
222.486 0.8 500.899 1.8 173 CScreenGrid::RenderSoftware(float) (cscreengrid.obj)
Here is the scroll function:
305 0.0 4.305 0.0 34 CScreenGrid::ScrollSoftware(int,int) (cscreengrid.obj)
This was computed by running the app, selecting new project, and then opening a bitmap and adding several tiles. Then I scrolled the map left and right 3 or 4 times.
Keep in mind this is inside of MFC where most of the time is spent in the msg loop and rendering is not constantly updated. It is only updated when:
Invalidate() is called which calls
CZeldaEditorView::OnDraw() which calls
CEditor3D::RenderMainView() which calls
CScreenGrid::RenderSoftware().
A pure Direct3D app would not be this convoluted, but I'm using my engine DLL inside of an MFC CWnd object so it gets messy.