So I was thinking: if, at the hardware level, most 32-bit CPUs deal with 4-byte values (i.e. integers) more efficiently, what if I implemented a simple memcpy routine that moves memory 4 bytes at a time (instead of what I assume is GCC's 1-byte implementation):
Code:
char *createNewBlock(char const *block, size_t block_size)
{
    char *p = malloc(block_size + 1);   /* needs <stdlib.h> */
    if (p == NULL)
        return NULL;
    char *dst = p;                      /* walk a copy so we can still return p */
    size_t quotient = block_size / sizeof(int);
    size_t rem = block_size % sizeof(int);
    for (; quotient > 0; --quotient, dst += sizeof(int), block += sizeof(int))
        *(int *)dst = *(const int *)block;   /* assumes int-aligned pointers */
    for (; rem > 0; --rem, ++dst, ++block)
        *dst = *block;
    return p;    /* the original version was missing its return */
}
Here, I copy 4 bytes at a time and then switch to 1 byte at a time for any remaining bytes (that is, if it does what I meant it to do...). Is there any immediate gain in this over just copying 1 byte at a time the whole way through?
My thinking is that, for very small block sizes, it's pointless and probably adds overhead, but for larger blocks (perhaps > ~24 bytes) it might show an improvement in speed/efficiency comparable to the hardware's advantage when dealing with 4-byte vs. 1-byte values.
Also, what is GCC's implementation of memcpy? Is it more efficient? (here's some other implementations of memcpy I was looking at).