I don't really know how the CPU goes about copying a number of bytes. Does it take as long to copy a whole word (4 bytes on a 32-bit system) as it takes to copy 4 bytes separately? Which sizes can the CPU copy at a time? Can it, for example, copy exactly 3 or 5 bytes at a time? And does every copy take the same time?

Here's an example: if n and m are known to the compiler at compile time, what is the fastest way of copying at least n bytes, and at most m bytes, from one place in memory to another? In one special case (color channels using SDL), there are 3 bytes that need to be copied. One might think the fastest way is to copy byte by byte. But in this case there's a fourth byte in both the source and the destination (since the video mode is set to 32 bits/pixel), which makes it possible to copy 4 bytes at a time, which I guess should be faster (3 times faster?). Here n = 3 and m = 4.
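
To make the question concrete, here is a minimal sketch of the two variants I'm comparing. The function names are made up for illustration, and I'm assuming the fourth byte on both sides is just padding that may be overwritten.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Variant 1: copy exactly n = 3 bytes, one byte at a time.
   The fourth byte of dst is left untouched. */
void copy_rgb_bytewise(uint8_t *dst, const uint8_t *src)
{
    assert(dst != NULL && src != NULL);
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
}

/* Variant 2: copy m = 4 bytes in one call.
   Only valid if a fourth byte really exists in both buffers
   and overwriting it in dst is harmless (assumption). */
void copy_pixel_word(uint8_t *dst, const uint8_t *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, 4); /* fixed size known at compile time */
}
```

As far as I understand, an optimizing compiler will often turn a fixed-size memcpy like this into a single 32-bit load and store, but I haven't verified that for my setup, so whether variant 2 is actually faster is exactly what I'm asking.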

The compiler won't be able to optimize this for me, will it? It probably only knows n, not m.