I wanted to know whether a memcpy-style function that uses the large SSE registers would be faster than the standard memcpy. It seems that it is, but not by much. My benchmark is probably too unscientific to mean much when the differences are this small; the results are all over the place. At worst, the SSE versions are about 5% slower than standard memcpy, but this is rare. Usually the unaligned SSE memcpy is about 10% faster and the aligned one about 30% faster. At best, the SSE memcpy is ~225% faster, but this is very rare.
Don't put much stock in this; it is mostly a curiosity I wanted to post. All I can say for sure is that, in my runs, the aligned copy is always faster than the unaligned one, as it should be, and both functions are almost always faster than standard memcpy.
The test program takes two command-line arguments: the number of megabytes to allocate, and the number of times to call each function when computing the average time per call. Both of the memcpysse routines listed at the bottom contain an emms instruction that can be removed (emms is only needed after MMX code; routines that use only the XMM registers do not require it).
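For readers who only want the shape of the measurement, here is a minimal sketch of the timing loop in C. It is not the code from testmemcpy.c: the names `copy_fn` and `avg_seconds_per_call` are made up for illustration, and `clock()` is assumed as the timer, which is coarse and is one reason results like these can be noisy.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical type: any function with memcpy's signature can be timed. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Hypothetical helper: returns the average CPU seconds per call,
 * or -1.0 if the arguments are invalid or allocation fails. */
static double avg_seconds_per_call(copy_fn fn, size_t megabytes, int calls)
{
    size_t n = megabytes * 1024 * 1024;
    if (n == 0 || calls <= 0)
        return -1.0;

    unsigned char *src = malloc(n);
    unsigned char *dst = malloc(n);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers first so page faults are not part of the timing. */
    memset(src, 0xA5, n);
    memset(dst, 0, n);

    clock_t start = clock();
    for (int i = 0; i < calls; i++)
        fn(dst, src, n);
    clock_t end = clock();

    free(src);
    free(dst);
    return ((double)(end - start) / CLOCKS_PER_SEC) / calls;
}
```

Calling `avg_seconds_per_call(memcpy, 64, 100)` would time the standard memcpy on a 64 MB buffer over 100 calls; passing an SSE routine with the same signature gives the figure to compare against.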
testmemcpy.c
memcpysseu.s
memcpyssea.s
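The two assembly files are not reproduced here. As a rough illustration of the idea, the following C sketch uses SSE2 intrinsics to copy 16 bytes at a time. It is an assumption about the general technique, not a translation of memcpysseu.s or memcpyssea.s; the function names are hypothetical, and it needs an x86 compiler with SSE2 enabled.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_load_si128 and friends */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name: copies n bytes with unaligned 16-byte loads and stores. */
static void *sse_copy_unaligned(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
        _mm_storeu_si128((__m128i *)(d + i), v);
    }
    memcpy(d + i, s + i, n - i);  /* copy the remaining 0..15 bytes */
    return dst;
}

/* Hypothetical name: uses aligned loads and stores when both pointers are
 * 16-byte aligned, and falls back to the unaligned version otherwise. */
static void *sse_copy_aligned(void *dst, const void *src, size_t n)
{
    if ((((uintptr_t)dst | (uintptr_t)src) & 15u) != 0)
        return sse_copy_unaligned(dst, src, n);

    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_load_si128((const __m128i *)(s + i));
        _mm_store_si128((__m128i *)(d + i), v);
    }
    memcpy(d + i, s + i, n - i);  /* copy the remaining 0..15 bytes */
    return dst;
}
```

Neither function touches the MMX registers, so no emms is needed in this sketch. The alignment check in `sse_copy_aligned` is a design choice for safety: an aligned load or store on a misaligned address faults, so the sketch falls back rather than trusting the caller.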