My program has a loop that runs 1,000,000 times, and each iteration calls memcpy(). For various reasons the loop itself can't be restructured or eliminated, so I want to replace the standard memcpy() with an optimized version. I read some articles about memcpy optimization years ago, but I can't remember the details. I have searched for sample code, but almost everything I found either doesn't fit my program or is tuned only for AMD CPUs. Can anyone offer advice or sample code, whether plain C or inline assembly? My CPU is an Intel Pentium 4 (P4). Thanks!
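
For reference, the situation can be sketched roughly as below. This is only a minimal illustration, not the actual program: the `copy_fn` type, the `copy_repeatedly` name, and the buffer handling are all hypothetical, and the real copy size is not stated above. Taking the copy routine as a function pointer means the standard memcpy() and any candidate replacement can be run through the same loop and compared.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical function-pointer type with the same shape as memcpy(),
   so a replacement routine can be dropped in without changing the loop. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Calls `copy` `iterations` times on the same pair of buffers.
   One byte of dst is read back after every copy and summed, so the
   copies have an observable effect. Assumes n > 0. */
unsigned long copy_repeatedly(copy_fn copy,
                              unsigned char *dst,
                              const unsigned char *src,
                              size_t n,
                              unsigned long iterations)
{
    unsigned long checksum = 0;
    unsigned long i;

    for (i = 0; i < iterations; ++i) {
        copy(dst, src, n);
        checksum += dst[i % n];
    }
    return checksum;
}
```

Calling `copy_repeatedly(memcpy, dst, src, size, 1000000UL)` gives a baseline with the standard routine; passing a different function with the same signature runs the candidate under identical conditions. Timing the two calls with the real buffer size and alignment would show whether a replacement actually helps.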