If you know assembly, you can use it to speed up your blits. This sounds very strange, I know, but in my tile engine and my other 2D ventures the hardware blitter was slower in some instances than pure assembly code. I'm not sure why. My whole reason for using the hardware blitter in the first place was that hardware should be faster than software, but apparently there is more to it than that, because it was not true in these cases, at least in my limited experience.
I would try it both ways. To make it easy, first clear the surface with the hardware blitter by blitting a huge black box to it. Then do the same thing in assembly and profile both. That really is the only way to tell which is faster.
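To make the "try it both ways" idea concrete, here is a rough C sketch of a timing harness. Everything in it is an assumption for illustration: the names clear_fn, clear_memset, clear_loop and time_clear are made up, the buffer is ordinary system memory rather than a locked surface, and the store loop only stands in for a hand-written assembly routine. Video memory can behave very differently, so treat the numbers as a starting point and not a verdict.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature shared by every clear routine under test. */
typedef void (*clear_fn)(uint32_t *pixels, size_t count);

/* Candidate 1: let the C library do the fill. */
static void clear_memset(uint32_t *pixels, size_t count)
{
    memset(pixels, 0, count * sizeof *pixels);
}

/* Candidate 2: a plain 32-bit store loop, standing in for assembly. */
static void clear_loop(uint32_t *pixels, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        pixels[i] = 0;
}

/* Calls fn `runs` times and returns the average time per call in
   seconds, as measured by clock(). */
static double time_clear(clear_fn fn, uint32_t *pixels, size_t count, int runs)
{
    assert(runs > 0);
    clock_t start = clock();
    for (int i = 0; i < runs; ++i)
        fn(pixels, count);
    return (double)(clock() - start) / CLOCKS_PER_SEC / runs;
}
```

You would call time_clear once per candidate with the same buffer and the same run count, then compare the two results. To include the hardware blitter, wrap your blit call in a function with the same signature.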
I think this is the code for the blit. This is not a filled-rectangle blit; it is really just a 32-bit memcpy.
Code:
_Copy32 proc
; Count is the number of DWORDs to copy, not bytes
; (LENGTH is a reserved operator in MASM/TASM, so it cannot be an argument name)
ARG Source:DWORD,Target:DWORD,Count:DWORD
push ebp
mov ebp,esp
push esi              ; the C caller expects esi and edi to be preserved
push edi
mov esi,Source        ; flat 32-bit pointers go straight into esi/edi
mov edi,Target
mov ecx,Count
cld                   ; make sure movsd moves forward through memory
rep movsd             ; copy ecx DWORDs from [esi] to [edi]
pop edi
pop esi
pop ebp
ret
_Copy32 endp
Source and Target will be your surface pointers. DirectX gives you access to these pointers when you lock the surface. You do not load the segment registers here: in 32-bit protected mode with a flat memory model ds and es already cover the whole address space, and putting a pointer into a segment register would fault. I've become a bit rusty on my 32-bit assembly, so I will check this against my engine code (sorry, I'm not on my system right now) and get back to you.
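The assembly routine above copies one contiguous run of DWORDs. On a locked surface the pitch (the number of bytes from the start of one row to the start of the next) can be larger than width * 4, so in practice you usually copy row by row. Here is a minimal C sketch of that idea, assuming 32-bit pixels. The names copy32 and copy_rows are made up for illustration, and copy32 uses memcpy where you would call the assembly version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* C counterpart of _Copy32: copies `count` 32-bit values, not bytes. */
static void copy32(const void *source, void *target, size_t count)
{
    memcpy(target, source, count * sizeof(uint32_t));
}

/* Copies a width x height block of 32-bit pixels one row at a time.
   Both pitches are in bytes and may be larger than width * 4. */
static void copy_rows(const void *source, size_t source_pitch,
                      void *target, size_t target_pitch,
                      size_t width, size_t height)
{
    const unsigned char *src = source;
    unsigned char *dst = target;

    assert(source_pitch >= width * 4 && target_pitch >= width * 4);
    for (size_t y = 0; y < height; ++y) {
        copy32(src, dst, width);   /* one row of pixels */
        src += source_pitch;       /* step by pitch, not by width */
        dst += target_pitch;
    }
}
```

If both pitches happen to equal width * 4, the whole block is contiguous and a single copy32 call with width * height as the count does the same job.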
Also, if you do not use the hardware clipper you will notice a significant speed gain, as the others have stated. The only way to tell where the slowdown is coming from is to profile the code. It might not be slowing down where you think it is, and then again, maybe it is. The profiler will eliminate the guesswork.
There will be an inherent slowdown with the assembly approach, though, since you must lock the target surface before you can write to it directly.
I will do some research for you on this subject and get back to you. Sorry I cannot help you more at the moment.