Fast memcpy for unaligned addresses

**Pea** · 12-14-2004

Hi all,

I have an ap on an ARM based processor that displays a bitmpa graphic, hence it copies chunks of data to the framebuffer memory for the LCD.

However, if the image is moved by an odd number of pixels, so that the resulting memory being copied to the framebuffer is not aligned to a 4-byte boundary, the performance of memcpy drops by 4 times (i.e. from 45 to 11 fps).

I have been doing some reading on this, and it seems if you can write your own optimised memcpy function to deal with this (http://www.embedded.com/showArticle....cleID=19205567).

Several variations of memcpy can be written that work better for certain sized memory chunks (e.g. 8 bytes or less, 32 bytes or less, 128 bytes or less, more than 128 bytes).

But surely, someone has written these before to deal with 4-byte boundaries? Does anyone know where I can find functions such as these?

TIA
Pea

**Prelude** · 12-14-2004

>someone has written these before to deal with 4-byte boundaries?
Yes.

>Does anyone know where I can find functions such as these?
The link you posted seems like a good start.

**~~master5001~~** · 12-14-2004

Actually the memory copy function that is included with the MASM32 compiler moves memory on the 32-bit boundary. You could always start by looking at the source for that.

**Salem** · 12-14-2004

There isn't a single answer, you're just going to have to measure and experiment.

> Several variations of memcpy can be written that work better for certain sized memory chunks
All of which are also sensitive to the hardware which they're running on - your mileage may vary.

> if the image is moved by an odd number of pixels
Consider allocating temporary bitmaps which are always rounded up to an even number of pixels, so that aligned memory copies are possible. Copy remaining odd bytes using a for loop.
This is trading memory occupancy for performance - a classic speed-time tradeoff which is very common.

How are the addresses mis-aligned?
Say for example you had addresses 0x123 and 0x457
As the article explains, the modulo-4 remainder of both addresses is 3, so you can copy just one byte normally, then both addresses become aligned and a regular memcpy becomes possible.

But if you have addresses 0x123 and 0x451, then you can only achieve modulo-2 alignment, which may be better than modulo-1, but not ideal.

I find the aggressive loop-unrolling of Duff's Device to be pretty good in some memcpy routines.

**Jez** · 12-14-2004

I had to do this once on a 68000 processor. In the end the fastest way was to use an unrolled loop of MOVEB (move byte). Horrible.

On the ARM you pretty well get a free shift with every instruction, so perhaps you could make use of this. (Sorry my ARM skills are pretty rusty so I can't be more help with this).

Does the ARM actually allow unaligned accesses at all (like x86) or is it strictly 32bit aligned?

**Pea** · 12-14-2004

Well, the code works, but it is just slow.

Thanks for your help guys. I was looking for a pre-existing solution, but I pretty much have to write my own, don't I

**VirtualAce** · 12-14-2004

Code:

_Copy32 proc
ARG Source:DWORD,Dest:DWORD,Size:DWORD
 
push ebp
mov ebp,esp
 
mov esi,[Source]
mov edi,[Dest]
mov ecx,[Size]
rep stosd
and ecx,03h
rep stosb
 
pop ebp
ret
_Copy32 endp

This should work for memory to memory, but it wont for bitmaps. For bitmaps you must align to the nearest 4 byte boundary which is easy to do. You say the code is slow, yet I can't see how any version in assembly could be all that slow. You could use memcpy from C and I still cannot see why it would be as slow as you are indicating.

I've tried both and althought my memcpy is faster it is less versatile. Most of the time I just use memcpy and it suits me just fine. I've never had a major problem with it being too slow.

The fastest way I can find in pure C is something like this:

Code:

 
void Blit(WORD x, WORD y,Image *Pic,DWORD Pitch)
{
//Compute memory offset of x,y
DWORD screen_offset=y*Pitch+x;
 
//Store for quick pointer addition later
DWORD orig_offset=screen_offset;
 
//Set bitmap memory offset
DWORD bitmap_offset=0;
 
//Pointer to image data for image
DWORD *ptrData=Pic->Data;
 
//Tracks width of current line in bitmap
int width_counter=0;
 
for (DWORD i=0;i<Pic->Size;i++)
{
	//Write pixel to screen from bitmap data
	Pixel[screen_offset]=ptrData[bitmap_offset];
 
	//Increment screen offset
	screen_offset++;
 
	//Increment bitmap offset
	bitmap_offset++;
 
	//Increment width count
	width_counter++;
 
	//Are we at max width - or end of line
	if (width_counter>Pic->Height)
	{
	 //Take pre-computed offset and add Pitch -> move down one line
	 orig_offset+=Pitch;
 
	 //Reset width count
	 width_counter=0;
 
	 //Set screen offset to re-computed orig_offset - have moved down one line
	 screen_offset=orig_offset;
	}
}
}

There are some optimizations you can do in that loop as well. Notice that I'm not re-multiplying to compute the new screen offsets and I'm moving through the bitmap memory in linear fashion. It all comes down to a simple few adds. One mul per bitmap. You could code the line blitter in assembly but you would have to setup EDI and ESI prior to the loop or it would be slower than this.

asm {
mov esi,[ptrData]
mov edi,[Screen]

}

Then in the loop

asm {
mov ecx,[Width]
rep stosd
and ecx,03h
rep stosb
}

But this is really two loops. One is the C for loop and one is the implicit loop using rep and the ECX register

Of course you could write the whole blitter in pure assembly but I doubt you would get much benefit from doing that. But that blitter in C should be relatively fast since it only increments values and does not do any muls inside of the actual critical loop. Note that it does not do any clipping for the screen boundaries but that is a pretty simple addition if you would like to do it.

The key is that you want to move as linearily as possible through both the screen memory and the bitmap memory. The problem comes when you get to the end of one line of the bitmap. You need to re-compute the screenoffset but doing a mul here would be a bit slow. So what I've done is pre-compute the starting offset and saved its value in another variable. When you come to the end of a line, simply increment the starting offset by the pitch of the memory and set the current screen offset to that value. This in essence moves down the screen one line, but does it without using a multiplication.

Note that this is a top-left to bottom right horizontal blit. A faster blit would be to start at top left and increment screen_offset by Pitch and increment bitmap_offset by Image->Width. Then when you reach the bottom of the image or when your Y counter is > Pic->Width you would add one to orig_offset to move over one column. Then the whole process starts over. The reason this is faster is because there are always fewer pixels in the vertical dimension than in the horizontal. In fact this is the blit I would use in a game if I were coding the rasterizer from scratch. You would also have to store an original offset for the bitmap since you are no longer moving through it linearily. But this is easy. When you reach the bottom of the current column simply increment bitmap_orig_offset by 1 and set bitmap_offset to that value. Then you are ready for the next column. I think this would be extremely fast.

I'm not quite sure what you mean by unaligned since if the memory is unaligned or aligned incorrectly the bitmap will not display correctly at best and at worst you will crash the whole thing.

**Jez** · 12-15-2004

Umm that looks suspiciously like x86 code, not ARM. There isn't really an issue on x86, since it allows unaligned accessess anyway, i.e. you just let the CPU's write combining buffers sort it out.

I've just dug out my old Sharp Zaurus, so if I can get C running on it I might try a few things.

**VirtualAce** · 12-15-2004

Yeah it is x86 but even if you cannot do unaligned accesses just align the data to what it needs. Zero pad the data or use NOPs to align the data to a boundary.

**Jez** · 12-15-2004

I'm not sure if that's too helpful. Often unaligned accesses are required, but you can't always just use padding to align them, although I agree that would be a solution if practicable.

From what Pea says, it sounds like the source bitmap is aligned (no reason not to be) and it's the destination screen that requires unaligned writes. It could be that the source + destination are the same region of memory even (i.e. panning or scrolling the screen). For sure you could keep multiple copies of the source bitmap, aligned differently, but that puts a premium on space as Salem pointed out. Typically the types of devices that use ARM's don't have unlimited amounts of RAM.

**Pea** · 01-23-2005

Hi guys,

Thanks for all the debate and help etc!

I clip the bitmap if it moves over the LCD border, so sometimes the source (bitmap) data is unaligned, and sometimes the destination (LCD) address is unaligned.

Bubba - thanks for the blitting code, but my blitting finctions are already very quick, and utilise memcpy to copy complete (or partial) scanlines at one go, and then jump forward through the bitmap and LCD (source and destination) by precomputed amounts before copying the next scanline. The problem is only that the performance of memcpy varies so greatly - from 44 fps to 11 fps on aligned/unaligned addresses.

I HAVE found a solution. Its a memcpy routine written in assembler that compiled for me with no adjustments to the code at all (gcc). It can be found here (this wasn't my original source, but I cant find that anymore): http://www.rtems.com/ml/rtems-users/.../msg00096.html

It copies in a variety of ways - e.g. If the memory is aligned, it copies in large chunks (4,8 or 16 bytes I think). If the LAST byte is aligned, it copies backwards, etc. The end results for me are:

Original memcpy
aligned: 44fps
unaligned: 11fps

Optimised memcpy
aligned: 46fps
unaligned: 44fps

Great results, huh? I think the moral is to not assume that just because a function is in assembler that it is necessarily the fastest or most optimised it can be.

Thread: Fast memcpy for unaligned addresses

Thread Tools

Search Thread

Display

Fast memcpy for unaligned addresses

Similar Threads

Okay, giant issue, I think I'm treading into territory outside of cboards maybe...

Memcpy(); Errors...

memcpy with 128 bit registers

memcpy