Thread: Fast memcpy() alternative for a 32-bit embedded processor (Posted just FYI and FWIW!)

  1. #1
    Registered User
    Join Date
    Feb 2013
    Posts
    8

    Fast memcpy() alternative for a 32-bit embedded processor (Posted just FYI and FWIW!)

    Code:
    /******************************************************************************
     *
     * Function Name:  manualCopy
     *
     * Description: Manually copies data from memory to memory.  This is used by
     * sysFastMemCopy to copy a few lingering bytes at the beginning and end.
     *
     *****************************************************************************/
    
    inline void manualCopy( uint8 *pDest, uint8 *pSrc, uint32 len )
    {
        uint32 i;
    
        // Manually copy the data
        for ( i = 0; i < len; i++ )
        {
            // Copy data from source to destination
            *pDest++ = *pSrc++;
        }
    }
    
    /******************************************************************************
     *
     * Function Name:  sysFastMemCopy
     *
     * Description: Assuming that your processor can do 32-bit memory accesses
     * and contains a barrel shifter and that you are using an efficient
     * compiler, then this memory-to-memory copy procedure will probably be more
     * efficient than just using the traditional memcpy procedure if the number
     * of bytes to copy is greater than about 20.  It works by doing 32-bit
     * reads and writes instead of using 8-bit memory accesses.
     *
     * NOTE that this procedure assumes a Little Endian processor!  The shift
     * operators ">>" and "<<" should all be reversed for Big Endian.
     *
     * NEVER use this when the number of bytes to be copied is less than about
     * 10, since it may not work for a small number of bytes.  Also, do not use
     * this when the source and destination regions overlap.
     *
     * NOTE that this may NOT be faster than memcpy if your processor supports a
     * really fast cache memory!
     *
     * Timing for this sysFastMemCopy varies some according to which shifts need
     * to be done.  The following results are from one attempt to measure timing
     * on a Cortex M4 processor running at 48 MHz.
     *
     *                           MEMCPY        FAST
     *                  BYTES  bytes/usec   bytes/usec
     *                  -----  ----------  ------------
     *                    50       4.2      6.3 to  6.3
     *                   100       4.5      8.3 to 10.0
     *                   150       4.8     10.0 to 11.5
     *                   200       4.9     10.5 to 12.5
     *                   250       5.1     11.4 to 13.2
     *                   300       5.1     11.5 to 13.6
     *                   350       5.1     12.1 to 14.6
     *                   400       5.1     12.1 to 14.8
     *                   450       5.2     12.2 to 15.5
     *                   500       5.2     12.5 to 15.2
     *
     * The following macro can be used instead of memcpy to automatically select
     * whether to use memcpy or sysFastMemCopy:
     *
     *   #define MEMCOPY(pDst,pSrc,len) \
     *     (len) < 16 ? memcpy(pDst,pSrc,len) : sysFastMemCopy(pDst,pSrc,len);
     *
     *****************************************************************************/
    
    void sysFastMemCopy( uint8 *pDest, uint8 *pSrc, uint32 len )
    {
        uint32 srcCnt;
        uint32 destCnt;
        uint32 newLen;
        uint32 endLen;
        uint32 longLen;
        uint32 *pLongSrc;
        uint32 *pLongDest;
        uint32 longWord1;
        uint32 longWord2;
        uint32 methodSelect;
        
        // Determine the number of bytes in the first word of src and dest
        srcCnt = 4 - ( (uint32) pSrc & 0x03 );
        destCnt = 4 - ( (uint32) pDest & 0x03 );
        
        // Copy the initial bytes to the destination
        manualCopy( pDest, pSrc, destCnt );
        
        // Determine the number of bytes remaining
        newLen = len - destCnt;
        
        // Determine how many full long words to copy to the destination
        longLen = newLen / 4;
        
        // Determine number of lingering bytes to copy at the end
        endLen = newLen & 0x03;
        
        // Pick the initial long destination word to copy to
        pLongDest = (uint32*) ( pDest + destCnt );
        
        // Pick the initial source word to start our algorithm at
        if ( srcCnt <= destCnt )
        {
            // Advance to pSrc at the start of the next full word
            pLongSrc = (uint32*) ( pSrc + srcCnt );
        }
        else // There are still source bytes remaining in the first word
        {
            // Set pSrc to the start of the first full word
            pLongSrc = (uint32*) ( pSrc + srcCnt - 4 );
        }
        
        // There are 4 different longWord copy methods
        methodSelect = ( srcCnt - destCnt ) & 0x03;
        
        // Just copy one-to-one
        if ( methodSelect == 0 )
        {
            // Just copy the specified number of long words
            while ( longLen-- > 0 )
            {
                *pLongDest++ = *pLongSrc++;
            }
        }
        else if ( methodSelect == 1 )
        {
            // Get the first long word
            longWord1 = *pLongSrc++;
            
            // Copy words created by combining 2 adjacent long words
            while ( longLen-- > 0 )
            {
                // Get the next 32-bit word
                longWord2 = *pLongSrc++;
                
                // Write to the destination
                *pLongDest++ = ( longWord1 >> 24 ) | ( longWord2 << 8 );
                
                // Re-use the word just retrieved
                longWord1 = longWord2;
            }
        }
        else if ( methodSelect == 2 )
        {
            // Get the first long word
            longWord1 = *pLongSrc++;
            
            // Copy words created by combining 2 adjacent long words
            while ( longLen-- > 0 )
            {
                // Get the next 32-bit word
                longWord2 = *pLongSrc++;
                
                // Write to the destination
                *pLongDest++ = ( longWord1 >> 16 ) | ( longWord2 << 16 );
                
                // Re-use the word just retrieved
                longWord1 = longWord2;
            }
        }
        else // ( methodSelect == 3 )
        {
            // Get the first long word
            longWord1 = *pLongSrc++;
            
            // Copy words created by combining 2 adjacent long words
            while ( longLen-- > 0 )
            {
                // Get the next 32-bit word
                longWord2 = *pLongSrc++;
    
                // Write to the destination
                *pLongDest++ = ( longWord1 >> 8 ) | ( longWord2 << 24 );
    
                // Re-use the word just retrieved
                longWord1 = longWord2;
            }
        }
        
        // Copy any remaining bytes
        if ( endLen != 0 )
        {
            // The trailing bytes will be copied next
            pDest = (uint8*) pLongDest;
            
            // Determine where the trailing source bytes are located
            pSrc += len - endLen;
            
            // Copy the remaining bytes
            manualCopy( pDest, pSrc, endLen );
        }
    }

  2. #2
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Do you have any tests with a modern CPU? An i5 or i7, perhaps?

    Tests from a Cortex-M4 running at 48 MHz are not too relevant. Of course, newer CPUs have fast and large caches.

  3. #3
    Registered User
    Join Date
    Feb 2013
    Posts
    8
    Sorry, I'm a lowly firmware engineer who uses only embedded microcontrollers. I'm offering this code quite humbly for other embedded programmers who might need a few extra usecs of speed in their copy operations. I've tested it and it seems to work for me. It took me a few hours to figure out the algorithm and to implement it, so I figured it might be of use and save some people considerable design time. I just did this last night, so it would be advisable for a serious user to verify that it works for their application.

  4. #4
    Registered User
    Join Date
    Mar 2010
    Posts
    583
    I think that's pretty neat!

    Generally optimising code for microcontrollers is a trade off between code size and performance. This code takes performance to an extreme at the cost of really rather bulky code. Compiled with Linaro GCC for Cortex-M4 it's over 500 bytes (with manualCopy inlined twice). The library memcpy is 130, and your simplest manualCopy case is just 50 bytes.

    I don't think I've seen this particular approach before, like I said: pretty neat!

    I have some questions....
    • What makes you think it'd be less effective on a core with a big fast cache? It looks pretty ARM specific to me, but I wouldn't necessarily think it's M-class specific. Word sized accesses are always better!
    • What library implementation of memcpy were you benchmarking against?
    • Do you have performance figures for different src/dest alignments? The memcpy in newlib I looked at will do word copies if both addresses are word-aligned, so I'd expect that to be slightly faster than your implementation. Yours should trounce it on unaligned addresses though
    • Cortex-M4 supports unaligned accesses -- but they're inefficient compared to aligned ones. Have you tried implementing a trivial word-copy version that ignores alignment except for dealing with the trailing bytes? As I understand it the same sort of shuffling you're doing then happens in hardware, might actually be quicker (library implementations can't do this as the M4 can be configured to fault unaligned accesses)
    • Or, copy enough bytes to align the source address but leave the destination address unaligned, then do word copies. According to the ARM Information Center, stores can't stall the core because of the write buffer


    And just a couple of thoughts:

    Code:
        // Determine the number of bytes in the first word of src and dest
        srcCnt = 4 - ( (uint32) pSrc & 0x03 );
        destCnt = 4 - ( (uint32) pDest & 0x03 );
    If the addresses are word-aligned, this will set both cnt variables to 4, leading to an unnecessary byte-by-byte copy of the first 4 bytes. Doing the trivial "if 4, then 0" check is bound to cost less than copying those bytes.
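
    For example, a branch-free form of that check would be something like this (an untested sketch, reusing your variable names; the zero counts look like they fall through the rest of the algorithm unchanged, but that would want testing):

    Code:
        // "if 4, then 0": masking with 0x03 maps the aligned case's 4 down to 0,
        // so already-aligned pointers skip the leading byte-by-byte copy
        srcCnt = ( 4 - ( (uint32) pSrc & 0x03 ) ) & 0x03;
        destCnt = ( 4 - ( (uint32) pDest & 0x03 ) ) & 0x03;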

    Code:
     *   #define MEMCOPY(pDst,pSrc,len) \
     *     (len) < 16 ? memcpy(pDst,pSrc,len) : sysFastMemCopy(pDst,pSrc,len);
    You should make your return type and args match memcpy's if you intend it to be a drop in replacement: memcpy - C++ Reference

    Using a #define like this means you can't treat "MEMCOPY" like a proper function -- e.g, you can't take its address.

    It also won't stop the compiler from generating implicit calls to library memcpy for things like struct copies. These can be spotted and rewritten though.
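
    A real function with memcpy's signature avoids both issues. For instance (an untested sketch; "fast_memcpy" is just a placeholder name, and uint8/uint32 are the types used in the code above):

    Code:
    #include <string.h>   /* memcpy, size_t */
    
    void *fast_memcpy( void *pDst, const void *pSrc, size_t len )
    {
        // Small copies go to the library; larger ones to the fast routine
        if ( len < 16 )
            memcpy( pDst, pSrc, len );
        else
            // Casts drop const and match the posted (non-const, uint32) prototype
            sysFastMemCopy( (uint8*) pDst, (uint8*) pSrc, (uint32) len );
    
        // memcpy returns the destination pointer, so do the same
        return pDst;
    }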

  5. #5
    Stoned Witch Barney McGrew
    Join Date
    Oct 2012
    Location
    astaylea
    Posts
    420
    Functions like memcpy are usually implemented in assembly rather than C. Here's the one OpenBSD's libc uses for i386: http://www.openbsd.org/cgi-bin/cvswe...e=text%2Fplain

    Usually it's better to just use whatever your C library provides instead of writing your own stuff.

  6. #6
    Master Apprentice phantomotap
    Join Date
    Jan 2008
    Posts
    5,108
    O_o

    You may want to try combining cases for the sake of code size.

    Also, if you change the way you do your check, you can get the simple path (small ranges), the aligned path, the fast shifted path, or a default to `memcpy' (overlapping regions) with a similar number of operations.

    Using a #define like this means you can't treat "MEMCOPY" like a proper function -- e.g, you can't take its address.
    If he makes it a real replacement, he can just macro the name without parameters.
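
    A sketch of what I mean:

    Code:
    /* Once `sysFastMemCopy' has memcpy's signature, an object-like macro just
       renames it, so &MEMCOPY and function-pointer use still work. */
    #define MEMCOPY sysFastMemCopy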

    Soma

    Code:
    *pLongDest++ = ( longWord1 >> ls ) | ( longWord2 << rs ); // factor this by setting the shifts

  7. #7
    Registered User
    Join Date
    Feb 2013
    Posts
    8

    Thanks for the interesting feedback on this...

    Quote Originally Posted by smokeyangel View Post
    I think that's pretty neat!

    Generally optimising code for microcontrollers is a trade off between code size and performance. This code takes performance to an extreme at the cost of really rather bulky code. Compiled with Linaro GCC for Cortex-M4 it's over 500 bytes (with manualCopy inlined twice). The library memcpy is 130, and your simplest manualCopy case is just 50 bytes.

    ======= REPLY: In my current application (I'm doing this for a time critical packet-based communications application) I've got plenty of both code space and RAM, so the size didn't concern me. I'm just interested in speed. I may be copying data packets that are a few hundred bytes long.

    I don't think I've seen this particular approach before, like I said: pretty neat!

    I have some questions....
    • What makes you think it'd be less effective on a core with a big fast cache? It looks pretty ARM specific to me, but I wouldn't necessarily think it's M-class specific. Word sized accesses are always better!


    ==== REPLY: Heck, it just occurred to me after I wrote it that cache might make a difference, so I included this comment just as an afterthought. I claim no knowledge on this, but wanted to include it so others (like you) might consider the possibility.



    • What library implementation of memcpy were you benchmarking against?


    ==== REPLY: I'm currently using an Atmel processor so I'm using their Atmel Studio 6 development environment. (AS6 is really a treat to use for a free compiler, BTW. Much better than their previous efforts.) I don't know the details of their memcpy, but its timing sure didn't seem to vary regardless of the various alignments I tested. Clearly seemed to be a straight byte-to-byte approach.



    • Do you have performance figures for different src/dest alignments? The memcpy in newlib I looked at will do word copies if both addresses are word-aligned, so I'd expect that to be slightly faster than your implementation. Yours should trounce it on unaligned addresses though


    ==== REPLY: I tested all possible alignments. That's why I showed a range of times for my "fast" code. The shorter times were where it just copied already-aligned words.



    • Cortex-M4 supports unaligned accesses -- but they're inefficient compared to aligned ones. Have you tried implementing a trivial word-copy version that ignores alignment except for dealing with the trailing bytes? As I understand it the same sort of shuffling you're doing then happens in hardware, might actually be quicker (library implementations can't do this as the M4 can be configured to fault unaligned accesses)


    ==== REPLY: I just started using this processor a couple months ago and it didn't even OCCUR to me that the Cortex-M4 might support unaligned word accesses. (The 32-bit uC's I've used in the past didn't support unaligned accesses.) If so, then this code might be unnecessary. But if that's the case, then why didn't they optimize the memcpy function to do word accesses?



    • Or, copy enough bytes to align the source address but leave the destination address unaligned, then do word copies. According to the ARM Information Center, stores can't stall the core because of the write buffer


    And just a couple of thoughts:

    Code:
        // Determine the number of bytes in the first word of src and dest
        srcCnt = 4 - ( (uint32) pSrc & 0x03 );
        destCnt = 4 - ( (uint32) pDest & 0x03 );
    If the addresses are word-aligned, this will set both cnt variables to 4, leading to an unnecessary byte-by-byte copy of the first 4 bytes. Doing the trivial "if 4, then 0" check is bound to cost less than copying those bytes.

    ==== REPLY: Yes, you're absolutely right, and I realized that when I wrote the code. I figured that one extra 4-byte manual copy was a trivial penalty. Doing the manual copy of those 4 bytes allowed that case to be handled naturally along with other cases, and I just thought that was a little more "elegant". (I may be weird, here.)

    Code:
     *   #define MEMCOPY(pDst,pSrc,len) \
     *     (len) < 16 ? memcpy(pDst,pSrc,len) : sysFastMemCopy(pDst,pSrc,len);
    You should make your return type and args match memcpy's if you intend it to be a drop in replacement: memcpy - C++ Reference

    ==== REPLY: You're right, but I'm not a purist! I'd be ashamed if you knew all shortcuts I take and the compiler warnings I've trained myself to ignore!

    Using a #define like this means you can't treat "MEMCOPY" like a proper function -- e.g, you can't take its address.

    It also won't stop the compiler from generating implicit calls to library memcpy for things like struct copies. These can be spotted and rewritten though.
    (Sorry if I didn't do this reply correctly. I'm a first-timer.)

  8. #8
    Registered User
    Join Date
    Mar 2010
    Posts
    583
    Quote Originally Posted by thickster
    ==== REPLY: Heck, it just occurred to me after I wrote it that cache might make a difference, so I included this comment just as an afterthought. I claim no knowledge on this, but wanted to include it so others (like you) might consider the possibility.
    On ARM, natural alignment is always a good thing, and byte-by-byte copies always mean executing more instructions whether or not the data is cached. So I think the approach would be just as valid on a cached core.

    What library implementation of memcpy were you benchmarking against?

    ==== REPLY: I'm currently using an Atmel processor so I'm using their Atmel Studio 6 development environment. (AS6 is really a treat to use for a free compiler, BTW. Much better than their previous efforts.) I don't know the details of their memcpy, but its timing sure didn't seem to vary regardless of the various alignments I tested. Clearly seemed to be a straight byte-to-byte approach.
    It does indeed sound like alignment is ignored, which is a little surprising given how easy it is to word-copy or even LDM/STM aligned pointers.
    I think Atmel do non-ARM chips too, so maybe they have a simple central C library implementation written mostly in portable C.

    ==== REPLY: I tested all possible alignments. That's why I showed a range of times for my "fast" code. The shorter times were where it just copied already-aligned words.
    I'm not a memcpy expert, but having a look around at various implementations, I think most optimised memcpys would beat yours on word aligned src/dest. Most do word-by-word copy plus <some other clever stuff>.

    They don't seem to care about the case where one or more pointer is unaligned though.
    The Linaro memcpy.c source says:

    Code:
      /* If the size is small, or either SRC or DST is unaligned,
         then punt into the byte copy loop.  This should be rare.  */
    "This should be rare" seems a bit sweeping to me. Copying bytestreams around the place doesn't seem that rare a thing to be doing.

    I found this quite nice article about improving memcpy:
    Optimizing Memcpy improves speed | Embedded

    It's from 2004, and does mention the approach you've taken (has a diagram anyway). I wonder why this isn't a commonly optimised case.

    ==== REPLY: I just started using this processor a couple months ago and it didn't even OCCUR to me that the Cortex-M4 might support unaligned word accesses. (The 32-bit uC's I've used in the past didn't support unaligned accesses.) If so, then this code might be unnecessary. But if that's the case, then why didn't they optimize the memcpy function to do word accesses?
    Indeed it does - LDR/STR but not LDM/STM or LDRD/STRD. ARM don't seem too keen on it:

    Quote Originally Posted by Cortex-M4 user guide @ arm.com
    Unaligned accesses are usually slower than aligned accesses. In addition, some memory regions might not support unaligned accesses. Therefore, ARM recommends that programmers ensure that accesses are aligned. To trap accidental generation of unaligned accesses, use the UNALIGN_TRP bit in the Configuration and Control Register, see Configuration and Control Register.
    The slowdown is:
    Quote Originally Posted by more Cortex-m4 @ arm.com
    Unaligned word or halfword loads or stores add penalty cycles. A byte aligned halfword load or store adds one extra cycle to perform the operation as two bytes. A halfword aligned word load or store adds one extra cycle to perform the operation as two halfwords. A byte-aligned word load or store adds two extra cycles to perform the operation as a byte, a halfword, and a byte. These numbers increase if the memory stalls. A STR or STRH cannot delay the processor because of the write buffer.
    Normal loads take 2 cycles.
    So the worst case, a byte-aligned word load, would take 2 + 2 = 4 cycles, whereas loading the same 4 bytes one at a time would take 2 * 4 = 8 cycles.

    I think the reason this isn't exploited in libraries is that a person implementing a C library for Cortex-M4 devices would try to write it to work in the strictest configuration of the core: in this case, a library that uses only aligned accesses. They don't know whether the library functions will be linked into an image that faults unaligned accesses, and they don't know that memcpy won't be asked to copy from/to some memory region that doesn't support unaligned accesses.


    The manual doesn't seem to say what the reset value of the unaligned trap bit is, but I suspect it's 0, meaning unaligned accesses are allowed. Access away!!
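
    If in doubt, the bit is cheap to check at run time. For example, with the CMSIS core definitions (an untested sketch):

    Code:
    /* Untested sketch, assuming a device header that pulls in the CMSIS
       core_cm4.h definitions (SCB and SCB_CCR_UNALIGN_TRP_Msk). */
    #include "core_cm4.h"   /* normally dragged in by your device header */
    
    static inline int unalignedAccessAllowed( void )
    {
        /* UNALIGN_TRP == 0 means unaligned LDR/STR won't fault */
        return ( SCB->CCR & SCB_CCR_UNALIGN_TRP_Msk ) == 0;
    }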

    ==== REPLY: Yes, you're absolutely right, and I realized that when I wrote the code. I figured that one extra 4-byte manual copy was a trivial penalty. Doing the manual copy of those 4 bytes allowed that case to be handled naturally along with other cases, and I just thought that was a little more "elegant". (I may be weird, here.)
    I definitely get what you mean. Letting a special case get swept up successfully by a general case is satisfying.

    You're probably right about it being a trivial penalty, made even less of a problem by your code not being recommended for small memcpys. Though millions of small aligned memcpys would probably, at some point, add up to being slower because of those 4 bytes?

    (Sorry if I didn't do this reply correctly. I'm a first-timer.)
    Welcome! Your reply was nice to read, but I had to copy/paste it into my reply, as "quote" won't quote the whole.... quote. So it'd be good if you put replies outside of quote boxes.

  9. #9
    Registered User
    Join Date
    Feb 2013
    Posts
    8

    Yeah, my warning about not working for small copies is correct....

    It would be best to include a test for small copy sizes in this code so that it works for all cases; I was too quick to publish it. The code assumes that the length will always be at least as large as the number of unaligned bytes in the first word of the destination ("destCnt"), which is of course a bad assumption.

    To handle these small cases, one could test for "len <= 4" (or some larger threshold where a byte-by-byte copy is still more efficient than running the rest of the algorithm) at the start of the procedure. If true, manually copy "len" bytes and exit.

    I'll fix my own version to do this so that I don't have to use a macro to select which copy procedure to use, then do a test to make sure I didn't miss anything else, too, for small copy sizes.... I'll post back here if I find anything.

  10. #10
    Registered User
    Join Date
    Feb 2013
    Posts
    8

    And so, for safety's sake, code similar to the following should be inserted ....

    To ensure this works for all cases of small copy lengths, insert the following code after the variable declarations:

    Code:
        
        // This is necessary to avoid errors caused by carelessly assuming, for
        // small copy lengths, that destCnt (the number of unaligned bytes in
        // the first word of the destination) is <= len (the total number of
        // bytes to copy).  It is also more efficient to copy a small number
        // of bytes one at a time than to run the rest of the algorithm, so
        // for efficiency's sake, this "4" could be increased to perhaps "10".
        
        if ( len <= 4 )
        {
            // For small copy lengths, copy byte-by-byte to the destination
            manualCopy( pDest, pSrc, len );
            
            // Exit now
            return;
        }


    Quote Originally Posted by thickster View Post
    It would be best to include a test for small copy sizes in this code so that it works for all cases; I was too quick to publish it. The code assumes that the length will always be at least as large as the number of unaligned bytes in the first word of the destination ("destCnt"), which is of course a bad assumption.

    To handle these small cases, one could test for "len <= 4" (or some larger threshold where a byte-by-byte copy is still more efficient than running the rest of the algorithm) at the start of the procedure. If true, manually copy "len" bytes and exit.

    I'll fix my own version to do this so that I don't have to use a macro to select which copy procedure to use, then do a test to make sure I didn't miss anything else, too, for small copy sizes.... I'll post back here if I find anything.

  11. #11
    and the hat of int overfl Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,661
    Some code, and some thoughts.
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <sys/time.h>
    #include <string.h>
    
    #define SIZE  (1024*1024*100)
    #define SIZE2 (SIZE+50)
    
    typedef unsigned int uint32;
    typedef unsigned char uint8;
    
    void manualCopy( uint8 *pDest, const uint8 *pSrc, size_t len )
    {
        uint32 i;
    
        // Manually copy the data
        for ( i = 0; i < len; i++ )
        {
            // Copy data from source to destination
            *pDest++ = *pSrc++;
        }
    }
    
    void *sysFastMemCopy( void *d, const void *s, size_t len )
    {
        uint8  *pDest = d;
        const uint8  *pSrc = s;
        uint32 srcCnt;
        uint32 destCnt;
        uint32 newLen;
        uint32 endLen;
        uint32 longLen;
        uint32 *pLongSrc;
        uint32 *pLongDest;
        uint32 longWord1;
        uint32 longWord2;
        uint32 methodSelect;
        
        if ( len <= 32 )
        {
            // For small copy lengths, copy byte-by-byte to the destination
            manualCopy( pDest, pSrc, len );
            // Exit now
            return pDest;
        }
        // Determine the number of bytes in the first word of src and dest
        srcCnt = 4 - ( (intptr_t) pSrc & 0x03 );
        destCnt = 4 - ( (intptr_t) pDest & 0x03 );
        
        // Copy the initial bytes to the destination
        manualCopy( pDest, pSrc, destCnt );
        
        // Determine the number of bytes remaining
        newLen = len - destCnt;
        
        // Determine how many 16-byte groups (4 long words each) to copy to the destination
        longLen = newLen / 16;
        
        // Determine number of lingering bytes to copy at the end
        endLen = newLen & (16-1);
        
        // Pick the initial long destination word to copy to
        pLongDest = (uint32*) ( pDest + destCnt );
        
        // Pick the initial source word to start our algorithm at
        if ( srcCnt <= destCnt )
        {
            // Advance to pSrc at the start of the next full word
            pLongSrc = (uint32*) ( pSrc + srcCnt );
        }
        else // There are still source bytes remaining in the first word
        {
            // Set pSrc to the start of the first full word
            pLongSrc = (uint32*) ( pSrc + srcCnt - 4 );
        }
        
        // There are 4 different longWord copy methods
        methodSelect = ( srcCnt - destCnt ) & 0x03;
        
        // Just copy one-to-one
        if ( methodSelect == 0 )
        {
            // Just copy the specified number of long words
            do {
                *pLongDest++ = *pLongSrc++;
                *pLongDest++ = *pLongSrc++;
                *pLongDest++ = *pLongSrc++;
                *pLongDest++ = *pLongSrc++;
            }
            while ( --longLen > 0 );  // one 16-byte group copied per pass
        } else {
            int left = 0, right = 0;
            switch ( methodSelect ) {
                case 1: left = 8; right = 24; break;
                case 2: left = 16; right = 16; break;
                case 3: left = 24; right = 8; break;
                default: exit(1); // it's all gone wrong
            }
            // Get the first long word
            longWord1 = *pLongSrc++;
            
            // Copy words created by combining 2 adjacent long words
            do {
                longWord2 = *pLongSrc++;
                *pLongDest++ = ( longWord1 >> right ) | ( longWord2 << left );
                longWord1 = longWord2;
    
                longWord2 = *pLongSrc++;
                *pLongDest++ = ( longWord1 >> right ) | ( longWord2 << left );
                longWord1 = longWord2;
    
                longWord2 = *pLongSrc++;
                *pLongDest++ = ( longWord1 >> right ) | ( longWord2 << left );
                longWord1 = longWord2;
    
                longWord2 = *pLongSrc++;
                *pLongDest++ = ( longWord1 >> right ) | ( longWord2 << left );
                longWord1 = longWord2;
            }
            while ( --longLen > 0 );  // one 16-byte group copied per pass
        }
    
        // Copy any remaining bytes
        if ( endLen != 0 )
        {
            // The trailing bytes will be copied next
            pDest = (uint8*) pLongDest;
            
            // Determine where the trailing source bytes are located
            pSrc += len - endLen;
            
            // Copy the remaining bytes
            manualCopy( pDest, pSrc, endLen );
        }
    
        return d;  // match memcpy: return the original destination
    }
    
    void fill ( char *p, size_t s, char f ) {
      while ( s-- ) {
        *p++ = f;
      }
    }
    
    int main ( int argc, char *argv[] ) {
      struct {
        char  *fname;
        void *(*func)(void*,const void*,size_t);
      } tests[] = {
    #define TEST(x) #x, x
        { TEST(memcpy) },
        { TEST(memmove) },
        { TEST(sysFastMemCopy) },
      };
      int fromOffset = (argc>1) ? atoi(argv[1]) : 0;
      int toOffset = (argc>2) ? atoi(argv[2]) : 0;
      char  *from_block = malloc(SIZE2);
      char  *to_block = malloc(SIZE2);
      char  *dummy_block = malloc(SIZE2);
      char  *refblock = NULL;
      printf("Copy offsets are %d and %d\n", fromOffset, toOffset );
      for ( size_t i = 0 ; i < sizeof(tests)/sizeof(tests[0]) ; i++ ) {
        fill(from_block,SIZE2,'A');
        fill(to_block,SIZE2,'@');
        fill(dummy_block,SIZE2,'X');  // make sure from and to are out of cache
        struct timeval start,end;
        gettimeofday(&start,NULL);
        tests[i].func(to_block+toOffset,from_block+fromOffset,SIZE);
        gettimeofday(&end,NULL);
        if ( refblock == NULL ) {
          refblock = malloc(SIZE2);
          memcpy(refblock,to_block,SIZE2);
        }
        int diff = memcmp(to_block,refblock,SIZE2);
        unsigned long elapsed = (end.tv_sec-start.tv_sec)*1000000+(end.tv_usec-start.tv_usec);
        printf("Copying %d bytes using %s took %lu uSec, result=%d\n", SIZE, tests[i].fname, elapsed, diff );
      }
      free(from_block);
      free(to_block);
      free(dummy_block);
      free(refblock);
    }
    
    
    $ gcc -std=c99 -Wall -Wextra -g -O2 foo.c
    $ ./a.out ; ./a.out 1 2 ; ./a.out 2 1 ; ./a.out 2 2
    Copy offsets are 0 and 0
    Copying 104857600 bytes using memcpy took 12326 uSec, result=0
    Copying 104857600 bytes using memmove took 12338 uSec, result=0
    Copying 104857600 bytes using sysFastMemCopy took 16871 uSec, result=0
    Copy offsets are 1 and 2
    Copying 104857600 bytes using memcpy took 12316 uSec, result=0
    Copying 104857600 bytes using memmove took 12654 uSec, result=0
    Copying 104857600 bytes using sysFastMemCopy took 21705 uSec, result=0
    Copy offsets are 2 and 1
    Copying 104857600 bytes using memcpy took 12746 uSec, result=0
    Copying 104857600 bytes using memmove took 12643 uSec, result=0
    Copying 104857600 bytes using sysFastMemCopy took 21832 uSec, result=0
    Copy offsets are 2 and 2
    Copying 104857600 bytes using memcpy took 12162 uSec, result=0
    Copying 104857600 bytes using memmove took 12057 uSec, result=0
    Copying 104857600 bytes using sysFastMemCopy took 16843 uSec, result=0
    1. The non-zero methodSelect cases have been collapsed into a single else, with the shift amounts turned into variables initialised within a switch.
    2. The loops have been unrolled 4 times to cut down on the number of branches (and other code) which doesn't actually do any copying.
    How much should you unroll? Well not so much that you bloat the code and end up with instruction cache misses.
    3. This machine is a fast x86_64, hence the 100MB of data to get some reasonable times.
    4. I changed the function interface to match memmove / memcpy. You want the same interface to ease the drop-in replacement of one with the other.


    If you really want to "go for it", you could code lines 100 to 120 in assembler, using LDM and STM with 4 registers to hold 4 32-bit values at once.
    6 other registers would hold all the other variables and pointers in this loop.
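
    For what it's worth, here's an untested sketch of that idea as GCC extended asm rather than a separate assembly file ("copyAlignedWords" and "words" are just illustrative names, and it handles the aligned case only):

    Code:
    /* Untested sketch: copy "words" aligned 32-bit words with LDM/STM, 4 at a
       time.  r4-r7 are declared as clobbers so the compiler preserves them
       across the statement. */
    static void copyAlignedWords( uint32 *pLongDest, const uint32 *pLongSrc,
                                  uint32 words )
    {
        while ( words >= 4 )
        {
            __asm__ volatile (
                "ldmia %[s]!, {r4-r7} \n\t"   /* load 4 words, post-increment  */
                "stmia %[d]!, {r4-r7}"        /* store 4 words, post-increment */
                : [s] "+r" (pLongSrc), [d] "+r" (pLongDest)
                :
                : "r4", "r5", "r6", "r7", "memory" );
            words -= 4;
        }
    
        // Copy any lingering whole words one at a time
        while ( words-- > 0 )
        {
            *pLongDest++ = *pLongSrc++;
        }
    }
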
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  12. #12
    Registered User
    Join Date
    Feb 2013
    Posts
    8

    And the Cortex-M4 CAN do unaligned 32-bit accesses, so ....

    And just to sum things up ....

    After learning that my own Cortex-M4 processor actually DOES allow unaligned 32-bit accesses, I wrote a MUCH simpler alternative to memcpy that ignores alignment issues. It works, and it is usually faster than the more complicated method previously posted, which tries to respect 32-bit alignment.

    The speed improvement, when using the Cortex-M4's ability to access unaligned 32-bit words, was approximately 10%. The new procedure could, on average, copy 500 bytes in 33 usecs. The original (clumsy) procedure averaged 36.5 usecs to copy 500 bytes.

    One interesting note, however: In 5 of the 16 possible alignment cases, the new method was actually SLOWER. The most notable cases were where both source and destination were misaligned by +1 byte, and by +3 bytes. In these cases, the new method was slower by about 20% (37 usecs vs 31 usecs to copy 500 bytes).

    Bottom line, however, is that on a Cortex-M4 processor, on average it is best to let the processor do its 32-bit accesses without regard to word alignment.


    Code:
    /******************************************************************************
     *
     * Function Name:  sysFastMemCopy2
     *
     * Description: This is a straightforward alternative to "memcpy" that does
     * word memory accesses.  All potential alignment issues are ignored because
     * the Cortex-M4 processor on which this runs can do unaligned 32-bit reads
     * and writes.
     *
     *****************************************************************************/
    
    void sysFastMemCopy2( uint8 *pDest, uint8 *pSrc, uint32 len )
    {
        uint32 i;
        uint32 *pLongSrc;
        uint32 *pLongDest;
        uint32 numLongs = len / 4;
        uint32 endLen = len & 0x03;
    
        // Convert byte addressing to long addressing
        pLongSrc = (uint32*) pSrc;
        pLongDest = (uint32*) pDest;
    
        // Copy long values, disregarding any 32-bit alignment issues
        for ( i = 0; i < numLongs; i++ )
        {
            *pLongDest++ = *pLongSrc++;
        }
        
        // Convert back to byte addressing
        pSrc = (uint8*) pLongSrc;
        pDest = (uint8*) pLongDest;
    
        // Copy trailing bytes byte-by-byte
        while ( endLen-- > 0 )
        {
            *pDest++ = *pSrc++;
        }
    }
    
    /**************************************************************
     *
     *   NOTE: CORTEX-M4        Number of Usecs
     *      PROCESSOR             Required to
     *                          Copy 500 bytes
     *   Byte Offset from      From Src to Dest:
     *    Perfect 32-bit
     *     Alignment:                 Disregarding
     *                        Using     Alignment
     *     Src   Dest         Shifts     Issues
     *     ---   ----         ------     ------
     *      0      0            30         26
     *      0      1            41         32
     *      0      2            33         29
     *      0      3            41         31
     *
     *      1      0            41         32
     *      1      1            30         37  *
     *      1      2            41         34
     *      1      3            34         37  *
     *
     *      2      0            33         29
     *      2      1            41         34
     *      2      2            30         32  *
     *      2      3            41         35
     *
     *      3      0            41         32
     *      3      1            33         37  *
     *      3      2            41         34
     *      3      3            30         37  *
     *
     *                 AVGS:  36.3         33
     *          
     *            (A second trial follows)
     *          
     *      0      0            31         27
     *      0      1            41         32
     *      0      2            34         29
     *      0      3            41         32
     *
     *      1      0            41         31
     *      1      1            31         37  *
     *      1      2            41         35
     *      1      3            33         37  *
     *
     *      2      0            33         29
     *      2      1            41         34
     *      2      2            30         31  *
     *      2      3            41         34
     *
     *      3      0            41         31
     *      3      1            34         37  *
     *      3      2            42         34
     *      3      3            31         37  *
     *
     *                 AVGS:  36.6       32.9
     *
     *  * Asterisks mark cases where the approach of disregarding
     *    memory alignment when doing word accesses was actually
     *    SLOWER than the more complicated approach of accessing
     *    long words on 32-bit boundaries and doing shifts to
     *    align data before writing to the destination address.
     *
     *************************************************************/
