Thread: Fast memcpy for unaligned addresses

  1. #1
    Registered User
    Join Date
    Oct 2004
    Posts
    36

    Fast memcpy for unaligned addresses

    Hi all,

    I have an ap on an ARM based processor that displays a bitmpa graphic, hence it copies chunks of data to the framebuffer memory for the LCD.

    However, if the image is moved by an odd number of pixels, so that the resulting memory being copied to the framebuffer is not aligned to a 4-byte boundary, the performance of memcpy drops by 4 times (i.e. from 45 to 11 fps).

    I have been doing some reading on this, and it seems if you can write your own optimised memcpy function to deal with this (http://www.embedded.com/showArticle....cleID=19205567).

    Several variations of memcpy can be written that work better for certain sized memory chunks (e.g. 8 bytes or less, 32 bytes or less, 128 bytes or less, more than 128 bytes).

    But surely, someone has written these before to deal with 4-byte boundaries? Does anyone know where I can find functions such as these?

    TIA
    Pea

  2. #2
    Code Goddess Prelude's Avatar
    Join Date
    Sep 2001
    Posts
    9,897
    >someone has written these before to deal with 4-byte boundaries?
    Yes.

    >Does anyone know where I can find functions such as these?
    The link you posted seems like a good start.
    My best code is written with the delete key.

  3. #3
    Banned master5001's Avatar
    Join Date
    Aug 2001
    Location
    Visalia, CA, USA
    Posts
    3,685
    Actually the memory copy function that is included with the MASM32 compiler moves memory on the 32-bit boundary. You could always start by looking at the source for that.

  4. #4
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    There isn't a single answer, you're just going to have to measure and experiment.

    > Several variations of memcpy can be written that work better for certain sized memory chunks
    All of which are also sensitive to the hardware which they're running on - your mileage may vary.

    > if the image is moved by an odd number of pixels
    Consider allocating temporary bitmaps which are always rounded up to an even number of pixels, so that aligned memory copies are possible. Copy remaining odd bytes using a for loop.
    This is trading memory occupancy for performance - a classic speed-time tradeoff which is very common.

    How are the addresses mis-aligned?
    Say for example you had addresses 0x123 and 0x457
    As the article explains, the modulo-4 remainder of both addresses is 3, so you can copy just one byte normally, then both addresses become aligned and a regular memcpy becomes possible.

    But if you have addresses 0x123 and 0x451, then you can only achieve modulo-2 alignment, which may be better than modulo-1, but not ideal.

    I find the aggressive loop-unrolling of Duff's Device to be pretty good in some memcpy routines.

  5. #5
    The C-er
    Join Date
    Mar 2004
    Posts
    192
    I had to do this once on a 68000 processor. In the end the fastest way was to use an unrolled loop of MOVEB (move byte). Horrible.

    On the ARM you pretty well get a free shift with every instruction, so perhaps you could make use of this. (Sorry my ARM skills are pretty rusty so I can't be more help with this).

    Does the ARM actually allow unaligned accesses at all (like x86) or is it strictly 32bit aligned?

  6. #6
    Registered User
    Join Date
    Oct 2004
    Posts
    36
    Well, the code works, but it is just slow.

    Thanks for your help guys. I was looking for a pre-existing solution, but I pretty much have to write my own, don't I

  7. #7
    Registered User VirtualAce's Avatar
    Join Date
    Aug 2001
    Posts
    9,607
    Code:
    _Copy32 proc
    ARG Source:DWORD,Dest:DWORD,Size:DWORD
     
    push ebp
    mov ebp,esp
     
    mov esi,[Source]
    mov edi,[Dest]
    mov ecx,[Size]
    rep stosd
    and ecx,03h
    rep stosb
     
    pop ebp
    ret
    _Copy32 endp
    This should work for memory to memory, but it wont for bitmaps. For bitmaps you must align to the nearest 4 byte boundary which is easy to do. You say the code is slow, yet I can't see how any version in assembly could be all that slow. You could use memcpy from C and I still cannot see why it would be as slow as you are indicating.

    I've tried both and althought my memcpy is faster it is less versatile. Most of the time I just use memcpy and it suits me just fine. I've never had a major problem with it being too slow.

    The fastest way I can find in pure C is something like this:

    Code:
     
    void Blit(WORD x, WORD y,Image *Pic,DWORD Pitch)
    {
    //Compute memory offset of x,y
    DWORD screen_offset=y*Pitch+x;
     
    //Store for quick pointer addition later
    DWORD orig_offset=screen_offset;
     
    //Set bitmap memory offset
    DWORD bitmap_offset=0;
     
    //Pointer to image data for image
    DWORD *ptrData=Pic->Data;
     
    //Tracks width of current line in bitmap
    int width_counter=0;
     
    for (DWORD i=0;i<Pic->Size;i++)
    {
    	//Write pixel to screen from bitmap data
    	Pixel[screen_offset]=ptrData[bitmap_offset];
     
    	//Increment screen offset
    	screen_offset++;
     
    	//Increment bitmap offset
    	bitmap_offset++;
     
    	//Increment width count
    	width_counter++;
     
    	//Are we at max width - or end of line
    	if (width_counter>Pic->Height)
    	{
    	 //Take pre-computed offset and add Pitch -> move down one line
    	 orig_offset+=Pitch;
     
    	 //Reset width count
    	 width_counter=0;
     
    	 //Set screen offset to re-computed orig_offset - have moved down one line
    	 screen_offset=orig_offset;
    	}
    }
    }
    There are some optimizations you can do in that loop as well. Notice that I'm not re-multiplying to compute the new screen offsets and I'm moving through the bitmap memory in linear fashion. It all comes down to a simple few adds. One mul per bitmap. You could code the line blitter in assembly but you would have to setup EDI and ESI prior to the loop or it would be slower than this.

    asm {
    mov esi,[ptrData]
    mov edi,[Screen]

    }

    Then in the loop

    asm {
    mov ecx,[Width]
    rep stosd
    and ecx,03h
    rep stosb
    }

    But this is really two loops. One is the C for loop and one is the implicit loop using rep and the ECX register

    Of course you could write the whole blitter in pure assembly but I doubt you would get much benefit from doing that. But that blitter in C should be relatively fast since it only increments values and does not do any muls inside of the actual critical loop. Note that it does not do any clipping for the screen boundaries but that is a pretty simple addition if you would like to do it.

    The key is that you want to move as linearily as possible through both the screen memory and the bitmap memory. The problem comes when you get to the end of one line of the bitmap. You need to re-compute the screenoffset but doing a mul here would be a bit slow. So what I've done is pre-compute the starting offset and saved its value in another variable. When you come to the end of a line, simply increment the starting offset by the pitch of the memory and set the current screen offset to that value. This in essence moves down the screen one line, but does it without using a multiplication.

    Note that this is a top-left to bottom right horizontal blit. A faster blit would be to start at top left and increment screen_offset by Pitch and increment bitmap_offset by Image->Width. Then when you reach the bottom of the image or when your Y counter is > Pic->Width you would add one to orig_offset to move over one column. Then the whole process starts over. The reason this is faster is because there are always fewer pixels in the vertical dimension than in the horizontal. In fact this is the blit I would use in a game if I were coding the rasterizer from scratch. You would also have to store an original offset for the bitmap since you are no longer moving through it linearily. But this is easy. When you reach the bottom of the current column simply increment bitmap_orig_offset by 1 and set bitmap_offset to that value. Then you are ready for the next column. I think this would be extremely fast.

    I'm not quite sure what you mean by unaligned since if the memory is unaligned or aligned incorrectly the bitmap will not display correctly at best and at worst you will crash the whole thing.
    Last edited by VirtualAce; 12-14-2004 at 09:58 PM.

  8. #8
    The C-er
    Join Date
    Mar 2004
    Posts
    192
    Umm that looks suspiciously like x86 code, not ARM. There isn't really an issue on x86, since it allows unaligned accessess anyway, i.e. you just let the CPU's write combining buffers sort it out.

    I've just dug out my old Sharp Zaurus, so if I can get C running on it I might try a few things.

  9. #9
    Registered User VirtualAce's Avatar
    Join Date
    Aug 2001
    Posts
    9,607
    Yeah it is x86 but even if you cannot do unaligned accesses just align the data to what it needs. Zero pad the data or use NOPs to align the data to a boundary.

  10. #10
    The C-er
    Join Date
    Mar 2004
    Posts
    192
    I'm not sure if that's too helpful. Often unaligned accesses are required, but you can't always just use padding to align them, although I agree that would be a solution if practicable.

    From what Pea says, it sounds like the source bitmap is aligned (no reason not to be) and it's the destination screen that requires unaligned writes. It could be that the source + destination are the same region of memory even (i.e. panning or scrolling the screen). For sure you could keep multiple copies of the source bitmap, aligned differently, but that puts a premium on space as Salem pointed out. Typically the types of devices that use ARM's don't have unlimited amounts of RAM.

  11. #11
    Registered User
    Join Date
    Oct 2004
    Posts
    36
    Hi guys,

    Thanks for all the debate and help etc!

    I clip the bitmap if it moves over the LCD border, so sometimes the source (bitmap) data is unaligned, and sometimes the destination (LCD) address is unaligned.

    Bubba - thanks for the blitting code, but my blitting finctions are already very quick, and utilise memcpy to copy complete (or partial) scanlines at one go, and then jump forward through the bitmap and LCD (source and destination) by precomputed amounts before copying the next scanline. The problem is only that the performance of memcpy varies so greatly - from 44 fps to 11 fps on aligned/unaligned addresses.

    I HAVE found a solution. Its a memcpy routine written in assembler that compiled for me with no adjustments to the code at all (gcc). It can be found here (this wasn't my original source, but I cant find that anymore): http://www.rtems.com/ml/rtems-users/.../msg00096.html

    It copies in a variety of ways - e.g. If the memory is aligned, it copies in large chunks (4,8 or 16 bytes I think). If the LAST byte is aligned, it copies backwards, etc. The end results for me are:

    Original memcpy
    aligned: 44fps
    unaligned: 11fps

    Optimised memcpy
    aligned: 46fps
    unaligned: 44fps

    Great results, huh? I think the moral is to not assume that just because a function is in assembler that it is necessarily the fastest or most optimised it can be.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Replies: 14
    Last Post: 06-28-2006, 01:58 AM
  2. Memcpy(); Errors...
    By Shamino in forum C++ Programming
    Replies: 4
    Last Post: 03-24-2006, 11:35 AM
  3. memcpy with 128 bit registers
    By grady in forum Linux Programming
    Replies: 2
    Last Post: 01-19-2004, 06:25 AM
  4. memcpy
    By doubleanti in forum C++ Programming
    Replies: 10
    Last Post: 02-28-2002, 04:44 PM