Originally Posted by Bubba
AFAIK memcpy is accomplished via a REP MOVSB: it only copies bytes and therefore does not use REP MOVSW or REP MOVSD. It is possible that memcpy is smart enough to figure out when to use the wider opcodes, however, I believe it only uses REP MOVSB. Of course a wider copy works best if the size of your memory chunk is a multiple of the element size (2 bytes for MOVSW, 4 for MOVSD; a power of 2 is not required), so you don't leave any hanging bytes.
This means that any code using REP MOVSW would be, in theory, twice as fast as REP MOVSB, and anything using REP MOVSD would be four times as fast, since each iteration moves 2 or 4 bytes instead of 1. It is unfortunate that 32-bit code does not have a REP MOVSQ, which would move 64 bits at a time; that instruction only exists in 64-bit (x86-64) mode.
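To make the "hanging bytes" point concrete, here is a rough C sketch of the idea: move as many 4-byte chunks as possible (the part a REP MOVSD would handle), then mop up the 0 to 3 leftover bytes one at a time (the part a REP MOVSB would handle). The function name copy_dwords_then_bytes is made up for illustration; this is not how any particular library's memcpy is written, and real implementations may pick a different strategy entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, NOT the real memcpy. Copies n bytes in 4-byte
   chunks, then finishes the leftover bytes singly. The two regions
   must not overlap. */
static void *copy_dwords_then_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t dwords = n / 4;  /* iteration count for the MOVSD-style loop */
    size_t tail   = n % 4;  /* the "hanging" bytes, 0 to 3 of them */

    assert(n == 0 || (dst != NULL && src != NULL));

    while (dwords--) {
        uint32_t tmp;
        /* Fixed-size memcpy calls keep unaligned access well defined;
           compilers commonly reduce each to one 32-bit load or store. */
        memcpy(&tmp, s, sizeof tmp);
        memcpy(d, &tmp, sizeof tmp);
        d += 4;
        s += 4;
    }

    while (tail--)          /* MOVSB-style cleanup */
        *d++ = *s++;

    return dst;
}
```

With n = 11, for example, the first loop runs twice (8 bytes) and the second loop copies the remaining 3 bytes. When n is a multiple of 4 the second loop does nothing, which is the case the post describes as leaving no hanging bytes.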