Originally Posted by
iMalc
I wouldn't count on it. Just because the code is still there doesn't mean it isn't optimised out. When you stop using the result of code that otherwise has no side effects, then the compiler will sometimes sneakily remove all the code that it sees has no effect. Optimisers are good like that.
That's probably why you were misled into thinking it was the copy that took a long time. In reality the time taken to copy 16 bytes will be much less than the work that is done 64 times, which includes function calls and lots of probably unpredictable branching.
I wouldn't begin to optimise this without seeing what happens in those function calls.
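If you want to time code like this without the optimiser quietly deleting it, one common trick is to make the result observable, for example by storing it in a volatile variable. Here is a rough sketch in C. Everything in it is made up for illustration: checksum16 is a hypothetical stand-in for the real work, and sink and time_checksum are just names I picked. It is not the code from this thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the work being timed. It has no side
   effects, so a call whose result is unused may be removed entirely. */
static uint32_t checksum16(const unsigned char *buf)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < 16; ++i)
        sum = sum * 31u + buf[i];
    return sum;
}

/* Stores to a volatile object are observable behaviour, so the
   compiler has to perform every one of them. */
static volatile uint32_t sink;

/* Returns the CPU time in seconds spent on `iterations` calls. */
static double time_checksum(unsigned char *buf, long iterations)
{
    clock_t start = clock();
    for (long i = 0; i < iterations; ++i) {
        /* Change the input each pass so the call is not trivially
           loop-invariant and hoisted out of the loop. */
        buf[0] = (unsigned char)i;
        sink = checksum16(buf);
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Call time_checksum with a 16-byte buffer and a large iteration count, then compare against a version without the store to sink. If the second one reports close to zero, the work was most likely optimised out. Even with the sink, the compiler is still free to inline and simplify the work, so check the generated assembly if the numbers look suspicious.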