This should work for memory-to-memory copies, but it won't for bitmaps. For bitmaps you must align to the nearest 4-byte boundary, which is easy to do. You say the code is slow, yet I can't see how any version in assembly could be all that slow. You could use memcpy from C and I still cannot see why it would be as slow as you are indicating.
I've tried both, and although my memcpy is faster, it is less versatile. Most of the time I just use memcpy and it suits me just fine. I've never had a major problem with it being too slow.
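To make the 4-byte rule concrete: in Windows-style bitmaps it applies to each row, so the row length in bytes is rounded up to a multiple of 4. A minimal sketch, where RowStride is a name made up for illustration and not part of any API:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes per padded row for a bitmap whose rows
   must each start on a 4-byte boundary (as in Windows DIB/BMP data). */
uint32_t RowStride(uint32_t width_pixels, uint32_t bits_per_pixel)
{
    assert(bits_per_pixel > 0);
    /* Round the row's bit count up to a whole number of 32-bit words,
       then convert words to bytes. */
    return ((width_pixels * bits_per_pixel + 31u) / 32u) * 4u;
}
```

So a 3-pixel-wide 24-bit image takes 12 bytes per row rather than 9, and the copy has to step by the padded size.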
The fastest way I can find in pure C is something like this:
There are some optimizations you can do in that loop as well. Notice that I'm not re-multiplying to compute the new screen offsets and I'm moving through the bitmap memory in linear fashion. It all comes down to a few simple adds, with one mul per bitmap. You could code the line blitter in assembly, but you would have to set up EDI and ESI prior to the loop or it would be slower than this.
void Blit(WORD x, WORD y,Image *Pic,DWORD Pitch)
//Compute memory offset of x,y
//Store for quick pointer addition later
//Set bitmap memory offset
//Pointer to image data for image
//Tracks width of current line in bitmap
for (DWORD i=0;i<Pic->Size;i++)
//Write pixel to screen from bitmap data
//Increment screen offset
//Increment bitmap offset
//Increment width count
//Are we at max width - or end of line
//Take pre-computed offset and add Pitch -> move down one line
//Reset width count
//Set screen offset to re-computed orig_offset - have moved down one line
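Filled in, the steps in those comments could look roughly like the sketch below. Several things here are assumptions and not from the post: 8-bit pixels, the Image fields other than Size and Width, the variable names, and the extra Screen parameter (added so the example stands on its own).

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t WORD;
typedef uint32_t DWORD;
typedef uint8_t  BYTE;

/* Assumed layout: only Size and Width are named in the post. */
typedef struct {
    DWORD Width;  /* pixels per line */
    DWORD Size;   /* total pixels (width * height) */
    BYTE *Data;   /* 8-bit pixels, rows stored back to back */
} Image;

/* Screen is an added parameter (assumption) so the sketch is self-contained. */
void Blit(BYTE *Screen, WORD x, WORD y, const Image *Pic, DWORD Pitch)
{
    assert(Screen && Pic && Pic->Data && Pic->Width > 0);

    /* Compute memory offset of x,y: the only multiply per bitmap */
    DWORD screen_offset = (DWORD)y * Pitch + x;
    /* Store for quick pointer addition later */
    DWORD orig_offset = screen_offset;
    /* Set bitmap memory offset */
    DWORD bitmap_offset = 0;
    /* Pointer to image data for image */
    const BYTE *data = Pic->Data;
    /* Tracks width of current line in bitmap */
    DWORD width_count = 0;

    for (DWORD i = 0; i < Pic->Size; i++) {
        /* Write pixel to screen from bitmap data */
        Screen[screen_offset] = data[bitmap_offset];
        screen_offset++;   /* Increment screen offset */
        bitmap_offset++;   /* Increment bitmap offset */
        width_count++;     /* Increment width count */
        /* Are we at max width - or end of line */
        if (width_count >= Pic->Width) {
            orig_offset += Pitch;         /* move down one line */
            width_count = 0;              /* reset width count */
            screen_offset = orig_offset;  /* start of the next line */
        }
    }
}
```

Pitch is treated as a count of pixels here, which is the same as bytes only because the sketch assumes one byte per pixel.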
Then in the loop
But this is really two loops. One is the C for loop and the other is the implicit loop using rep and the ECX register.
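In plain C the same two-loop shape can be sketched with memcpy standing in for the rep/ECX copy: the for loop walks the lines and memcpy is the implicit inner loop. This is a stand-in, not the assembly version. BlitRows, the Screen parameter, the Height field, and 8-bit pixels are assumptions for the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t WORD;
typedef uint32_t DWORD;
typedef uint8_t  BYTE;

/* Assumed layout for illustration. */
typedef struct {
    DWORD Width;   /* pixels per line */
    DWORD Height;  /* number of lines */
    BYTE *Data;    /* 8-bit pixels, rows stored back to back */
} Image;

/* Hypothetical row blitter: one memcpy per bitmap line. */
void BlitRows(BYTE *Screen, WORD x, WORD y, const Image *Pic, DWORD Pitch)
{
    assert(Screen && Pic && Pic->Data);

    /* Destination and source pointers are set up once, before the loop,
       much like loading EDI and ESI ahead of a rep copy. */
    BYTE *dst = Screen + (size_t)y * Pitch + x;
    const BYTE *src = Pic->Data;

    for (DWORD row = 0; row < Pic->Height; row++) {
        memcpy(dst, src, Pic->Width);  /* inner loop: copy one line */
        dst += Pitch;                  /* move down one screen line */
        src += Pic->Width;             /* next bitmap line */
    }
}
```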
Of course you could write the whole blitter in pure assembly but I doubt you would get much benefit from doing that. But that blitter in C should be relatively fast since it only increments values and does not do any muls inside of the actual critical loop. Note that it does not do any clipping for the screen boundaries but that is a pretty simple addition if you would like to do it.
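For the clipping, one way to do it is to work out the visible rectangle once, before the loop, so the inner loop stays free of tests. A rough sketch, where ClipRect and ClipToScreen are made-up names and signed coordinates are assumed so the image can hang off the left or top edge:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: where to start reading in the bitmap,
   where to start writing on the screen, and how much to copy. */
typedef struct {
    int src_x, src_y;  /* first visible pixel inside the bitmap */
    int dst_x, dst_y;  /* matching position on the screen */
    int w, h;          /* size of the visible region */
} ClipRect;

/* Returns 1 and fills *out if any part of the image is on screen,
   otherwise returns 0 and leaves *out untouched. */
int ClipToScreen(int x, int y, int img_w, int img_h,
                 int scr_w, int scr_h, ClipRect *out)
{
    assert(out != NULL);

    /* Clamp the image rectangle to the screen rectangle. */
    int x0 = x < 0 ? 0 : x;
    int y0 = y < 0 ? 0 : y;
    int x1 = x + img_w > scr_w ? scr_w : x + img_w;
    int y1 = y + img_h > scr_h ? scr_h : y + img_h;

    if (x0 >= x1 || y0 >= y1)
        return 0;  /* completely off screen */

    out->dst_x = x0;
    out->dst_y = y0;
    out->src_x = x0 - x;  /* pixels skipped on the left */
    out->src_y = y0 - y;  /* lines skipped at the top */
    out->w = x1 - x0;
    out->h = y1 - y0;
    return 1;
}
```

The blitter would then start at (dst_x, dst_y) on the screen and (src_x, src_y) in the bitmap and copy only w by h pixels.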
The key is that you want to move as linearly as possible through both the screen memory and the bitmap memory. The problem comes when you get to the end of one line of the bitmap. You need to re-compute the screen offset, but doing a mul here would be a bit slow. So what I've done is pre-compute the starting offset and save its value in another variable. When you come to the end of a line, simply increment the starting offset by the pitch of the memory and set the current screen offset to that value. This in essence moves down the screen one line, but does it without using a multiplication.
Note that this is a top-left to bottom-right horizontal blit. A faster blit would be to start at top left and increment screen_offset by Pitch and increment bitmap_offset by Pic->Width. Then when you reach the bottom of the image, that is, when your Y counter reaches Pic->Height, you would add one to orig_offset to move over one column. Then the whole process starts over. The reason this is faster is that there are always fewer pixels in the vertical dimension than in the horizontal. In fact this is the blit I would use in a game if I were coding the rasterizer from scratch. You would also have to store an original offset for the bitmap, since you are no longer moving through it linearly. But this is easy. When you reach the bottom of the current column, simply increment bitmap_orig_offset by 1 and set bitmap_offset to that value. Then you are ready for the next column. I think this would be extremely fast.
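The column-by-column version described above can be sketched like this. BlitColumns, the Screen parameter, the Height field, and 8-bit pixels are assumptions for the example, and it is worth timing against the row version on your own hardware before committing to it.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t WORD;
typedef uint32_t DWORD;
typedef uint8_t  BYTE;

/* Assumed layout for illustration. */
typedef struct {
    DWORD Width;   /* pixels per line */
    DWORD Height;  /* number of lines */
    BYTE *Data;    /* 8-bit pixels, rows stored back to back */
} Image;

/* Hypothetical column blitter: walks down each column, then steps right. */
void BlitColumns(BYTE *Screen, WORD x, WORD y, const Image *Pic, DWORD Pitch)
{
    assert(Screen && Pic && Pic->Data && Pic->Height > 0);

    DWORD orig_offset = (DWORD)y * Pitch + x;  /* top of current screen column */
    DWORD screen_offset = orig_offset;
    DWORD bitmap_orig_offset = 0;              /* top of current bitmap column */
    DWORD bitmap_offset = 0;
    DWORD y_count = 0;                         /* rows done in this column */
    DWORD size = Pic->Width * Pic->Height;

    for (DWORD i = 0; i < size; i++) {
        Screen[screen_offset] = Pic->Data[bitmap_offset];
        screen_offset += Pitch;       /* down one screen line */
        bitmap_offset += Pic->Width;  /* down one bitmap line */
        y_count++;
        if (y_count >= Pic->Height) { /* bottom of the column reached */
            orig_offset++;            /* move over one column on screen */
            bitmap_orig_offset++;     /* and one column in the bitmap */
            y_count = 0;
            screen_offset = orig_offset;
            bitmap_offset = bitmap_orig_offset;
        }
    }
}
```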
I'm not quite sure what you mean by unaligned. If the memory is unaligned or aligned incorrectly, at best the bitmap will not display correctly, and at worst you will crash the whole thing.