Thread: Faster way to copy memory?

  1. #1
    Registered Abuser
    Join Date
    Jun 2006

    Post Faster way to copy memory?

    So I was thinking, if at the hardware level, most 32-bit CPUs are better able to deal with 4-byte values (i.e. integers), what if I implemented a simple memcpy routine that moves memory 4-bytes at a time (instead of what I would assume to be GCC's 1-byte implementation):
    char *createNewBlock(char const *block, size_t block_size)
    {
        char *start = malloc(block_size + 1);
        char *p = start;
        size_t quotient = block_size / sizeof(int);
        size_t rem = block_size % sizeof(int);
        for(; quotient > 0; --quotient, p += sizeof(int), block += sizeof(int))
            *((int *)p) = *((int *)block);
        for(; rem > 0; --rem, ++p, ++block)
            *p = *block;
        return start; /* p has been advanced, so return the saved start */
    }
    Here, I copy 4-bytes at a time and then switch to 1-byte for any remaining bytes (that is, if it does what I meant it to do...). Is there any immediate gain in this over just copying 1-byte at a time the whole way through?
    My thinking is that, for very small block sizes, it is pointless and probably causes more overhead, but when dealing with larger blocks (perhaps > ~24 bytes) it may show an improvement in speed/efficiency, roughly matching the hardware's advantage when dealing with 4-byte vs 1-byte values.
    Also, what is GCC's implementation of memcpy? Is it more efficient? (Here are some other implementations of memcpy I was looking at.)

  2. #2
    Registered User
    Join Date
    Oct 2001
    why don't you do some benchmarks?
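    A rough harness for such a benchmark might look like the sketch below (the block size, iteration count, and the copy_words helper are all illustrative, not from this thread):

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Word-at-a-time copy in the spirit of the original post,
       to be timed against the library memcpy. */
    static void copy_words(char *dst, const char *src, size_t n)
    {
        size_t words = n / sizeof(int);
        size_t rem = n % sizeof(int);
        int *d = (int *)dst;
        const int *s = (const int *)src;
        while (words--)
            *d++ = *s++;
        dst = (char *)d;
        src = (const char *)s;
        while (rem--)
            *dst++ = *src++;
    }

    int main(void)
    {
        size_t n = 1 << 20;                       /* 1 MiB block */
        char *src = malloc(n), *dst = malloc(n);
        if (!src || !dst)
            return 1;
        memset(src, 'x', n);

        clock_t t0 = clock();
        for (int i = 0; i < 100; ++i)
            copy_words(dst, src, n);
        clock_t t1 = clock();
        for (int i = 0; i < 100; ++i)
            memcpy(dst, src, n);
        clock_t t2 = clock();

        printf("copy_words: %ld ticks, memcpy: %ld ticks\n",
               (long)(t1 - t0), (long)(t2 - t1));
        free(src);
        free(dst);
        return 0;
    }
    ```

    Results will vary with compiler flags and hardware, which is rather the point: measure before concluding anything.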

  3. #3
    Salem (and the hat of int overfl)
    Join Date
    Aug 2001
    The edge of the known universe
    > Is there any immediate gain in this over just copying 1-byte at a time the whole way through?
    The answer is way too dependent on your architecture. You just have to benchmark a bunch of different ideas, and profile the results in actual use.

    My impression at the moment is that if you only have one level of cache, then the memory can reasonably keep up with the processor, so it's in your interest to move as much data as possible with each memory access. In which case using a 'long' type can have benefits.

    But if you have 2 levels of cache, then main memory is typically a lot slower than the processor. Whether you move bytes or longs doesn't matter, because either way the processor is perfectly capable of swamping the available memory bandwidth.
    Of course, if you only move a small amount (less than some fraction of the cache size), then perhaps the cache can absorb the sudden change and moving longs is once again a good idea.

    The function call overhead becomes significant if you're copying many very small blocks as well, so inlining your own copy is perhaps a good idea from that point of view.

    Anyway, the point of all that is that there is no implementation that is guaranteed to win in all cases.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  4. #4
    CornedBee (Cat without Hat)
    Join Date
    Apr 2003
    Quote Originally Posted by @nthony
    (instead of what I would assume to be GCC's 1-byte implementation):
    You know what happens when you assume?

    GCC not only inlines calls to memcpy, it even generates special code. GCC 4.3 will feature further enhancements in optimizing memcpy, such as optimizing for the most common copy size under feedback-driven optimization (i.e. you compile a special version, run it, and it generates a feedback file that the compiler can then use to optimize better).

    It can select a special implementation based on the sub-architecture of the CPU: -mtune=xxx, where xxx is the sub-architecture (core2, k8, ...). It can use multimedia instructions to copy more data in one go than just 4 bytes - how about 8 or even 16? MMX and similar extensions make that possible. And if that turns out to be slow on another CPU, you can optimize for that one instead, using rep movsd and other classic transfer instructions.

    Basically, you probably can't out-optimize the compiler on something as important as memcpy. It has more experience, more knowledge and more calculation power than you.
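    To illustrate the point: a fixed-size memcpy like the one below is typically expanded inline by GCC at -O2 rather than compiled as a library call (a sketch, not thread code; inspect the assembly with gcc -O2 -S to see what your compiler actually emits):

    ```c
    #include <string.h>

    struct pair { int a, b; };

    /* The size is known at compile time, so an optimizing GCC
       usually replaces this memcpy with a few register moves
       instead of calling into libc. */
    struct pair copy_pair(const struct pair *src)
    {
        struct pair dst;
        memcpy(&dst, src, sizeof dst);
        return dst;
    }
    ```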
    All the buzzt!

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  5. #5
    Registered Abuser
    Join Date
    Jun 2006
    Thanks for the input guys. I guess, given all the architecture dependence and GCC's current/future capabilities and whatnot, it is probably more sanity-keeping to rely on memcpy for the "hardware hacks", but I will try that long long implementation just out of curiosity.

    >You know what happens when you assume?
    For the record though:
    >"...what I would assume to be GCC's 1-byte implementation if I did not already know that most of my assumptions on GCC's internals and implementation are probably false and/or skewed."
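    For reference, an 8-byte-at-a-time variant along those lines might look like this sketch (createNewBlock64 is a made-up name; it also assumes unaligned long long access is tolerated by the target, which holds on x86 but not everywhere):

    ```c
    #include <stdlib.h>

    /* Copies in sizeof(long long) chunks, then finishes the
       remainder byte by byte.  Caller frees the returned block. */
    char *createNewBlock64(char const *block, size_t block_size)
    {
        char *start = malloc(block_size);
        if (start == NULL)
            return NULL;

        char *dst = start;
        size_t chunks = block_size / sizeof(long long);
        size_t rem = block_size % sizeof(long long);

        long long *d = (long long *)dst;
        long long const *s = (long long const *)block;
        while (chunks--)
            *d++ = *s++;

        dst = (char *)d;
        block = (char const *)s;
        while (rem--)
            *dst++ = *block++;

        return start;
    }
    ```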
