memcpy with 128 bit registers

**grady** · 01-16-2004

I wanted to know whether using the 128-bit SSE registers in a memcpy-like function would be faster than the standard memcpy. It seems that it is, but not by much. I think my benchmark is too unscientific to mean much with such small differences in performance; the results are all over the place. At worst the SSE versions are about 5% slower than the standard memcpy, but this is rare. Usually the unaligned SSE memcpy is about 10% faster and the aligned one about 30% faster. At best the SSE memcpy is ~225% faster, but this is very rare.

Don't put much stock in this. It is mostly just a curiosity I wanted to post. All I can say for sure is that the aligned copy is always faster than the unaligned one, as it should be, and that both functions are almost always faster than the standard memcpy.
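For readers without the attachments, here is a minimal sketch of the idea in C using SSE2 intrinsics. It is not the attached assembly: the names `copy_sse_unaligned` and `copy_sse_aligned` are made up for illustration, and it assumes an x86 compiler with SSE2 enabled (the default on x86-64, `-msse2` on 32-bit GCC).

```c
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_load_si128, ... */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies n bytes 16 at a time with unaligned
 * 128-bit loads and stores, so any pointer alignment is accepted. */
static void *copy_sse_unaligned(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }
    memcpy(d, s, n);  /* copy the tail of fewer than 16 bytes */
    return dst;
}

/* Hypothetical helper: uses aligned 128-bit loads and stores, which
 * fault on misaligned addresses, so it falls back to the unaligned
 * version unless both pointers are 16-byte aligned. */
static void *copy_sse_aligned(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if ((((uintptr_t)dst | (uintptr_t)src) & 15u) != 0)
        return copy_sse_unaligned(dst, src, n);

    while (n >= 16) {
        __m128i v = _mm_load_si128((const __m128i *)s);
        _mm_store_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }
    memcpy(d, s, n);  /* copy the tail of fewer than 16 bytes */
    return dst;
}
```

Like memcpy, both helpers assume the source and destination do not overlap.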

The test program takes two command-line arguments: the number of megabytes to allocate and the number of times to call each function when computing the average time per call. Both of the memcpysse routines attached below contain an emms instruction that can be removed.
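The timing described above could be sketched roughly like this. It is an illustration only, not the contents of testmemcpy.c: `copy_fn` and `avg_seconds_per_call` are hypothetical names, and `clock()` is a coarse timer that measures CPU time, which is one reason results from a harness like this can be noisy.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Any function with memcpy's signature can be timed. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Hypothetical helper: calls fn `reps` times on two freshly allocated
 * buffers of `bytes` bytes and returns the average CPU seconds per
 * call, or -1.0 on bad arguments, allocation failure, or a bad copy. */
static double avg_seconds_per_call(copy_fn fn, size_t bytes, int reps)
{
    unsigned char *src, *dst;
    clock_t start, end;
    int ok;

    if (bytes == 0 || reps <= 0)
        return -1.0;

    src = malloc(bytes);
    dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Write to both buffers first so page faults are not timed. */
    memset(src, 0xA5, bytes);
    memset(dst, 0, bytes);

    start = clock();
    for (int i = 0; i < reps; i++)
        fn(dst, src, bytes);
    end = clock();

    /* Check the result; this also gives the copies an observable use. */
    ok = (memcmp(dst, src, bytes) == 0);
    free(src);
    free(dst);

    return ok ? (double)(end - start) / CLOCKS_PER_SEC / reps : -1.0;
}
```

A driver in the spirit of the test program would convert the first argument from megabytes to bytes, pass the second as `reps`, and call this once for memcpy and once for each SSE routine.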

testmemcpy.c
memcpysseu.s
memcpyssea.s

**Fordy** · 01-19-2004

One query would be the version of memcpy used. Is it a portable C implementation or an optimised assembler wrapper using the standard REPed memory moving instructions?

I would think that the standard registers and standard instructions would be better optimised for memory movement than the SSE registers, but I don't know for sure, and I suppose it would depend on a lot of things.

**grady** · 01-19-2004

I have been trying to improve the test program since my last post, and I think you are right that the standard registers and standard instructions are better optimized. My function does worse and worse as I find ways to make the test more reasonable.
