Eh, no, that's not any better. There is no performance difference between a small constant and a register when adding - it's better to let the compiler use those registers for addressing data, etc.
But dwCount _IS_ the number of items to do, right? So if you count down to zero/negative, you don't need a compare at all - just a "jump not zero" (JNZ) or "jump not signed" (JNS) - which saves one instruction. Also, try to move the subtract up a few instructions, so that it has time to finish before the branch is taken.
You can also improve the speed by unrolling the loop [calculating more than one set of values each time].
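To show what I mean by counting down and unrolling, here is a rough sketch in plain C. I'm assuming the loop computes a weighted sum of pInput and pWeight, which is only my guess at what your code does, and weighted_sum is a made-up name. Treat it as an illustration of the loop shape, not as a drop-in replacement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the loop being discussed:
   the sum of pInput[i] * pWeight[i] (an assumption about the real code). */
double weighted_sum(const double *pInput, const double *pWeight, size_t dwCount)
{
    double sum0 = 0.0, sum1 = 0.0;   /* one accumulator per unrolled lane */
    size_t pairs;

    assert(dwCount == 0 || (pInput != NULL && pWeight != NULL));

    /* Count down to zero: the decrement itself sets the zero flag, so the
       compiler is free to end the loop with a plain "jump not zero". */
    for (pairs = dwCount / 2; pairs != 0; --pairs) {
        sum0 += pInput[0] * pWeight[0];   /* unrolled: two items per pass */
        sum1 += pInput[1] * pWeight[1];
        pInput  += 2;
        pWeight += 2;
    }

    /* An odd dwCount leaves one item over. */
    if (dwCount & 1)
        sum0 += pInput[0] * pWeight[0];

    return sum0 + sum1;
}
```

Using two accumulators keeps the two multiply-adds independent of each other. Note that this changes the order of the additions, so the result can differ in the last bits from a strictly sequential sum.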
My comment on xmm registers being 16 bytes:
Are you actually intending to calculate the following?
Code:
{ pInput[0], pInput[1] }
{ pInput[1], pInput[2] }
{ pInput[2], pInput[3] }
{ pInput[3], pInput[4] }
The braces indicate the pair being calculated each iteration.
I would have thought you wanted:
Code:
{ pInput[0], pInput[1] }
{ pInput[2], pInput[3] }
{ pInput[4], pInput[5] }
{ pInput[6], pInput[7] }
Although I expect that MOVAPD would actually fault if you add only 8 bytes to the pointer, because it requires a 16-byte aligned address and the next read is now unaligned.
And the same applies to pWeight, obviously.
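If you want to convince yourself about the alignment, something like the following plain C check would do. It only looks at addresses and never reads the data; is_aligned16 and count_unaligned_loads are names I made up for the illustration, and the sketch assumes the array itself starts on a 16-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p sits on a 16-byte boundary, which MOVAPD requires. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: walk nLoads 16-byte loads starting at pInput,
   stepping strideBytes each time, and count how many of the load
   addresses are not 16-byte aligned. Nothing is dereferenced. */
size_t count_unaligned_loads(const double *pInput, size_t nLoads, size_t strideBytes)
{
    const unsigned char *p = (const unsigned char *)pInput;
    size_t i, bad = 0;

    assert(strideBytes != 0);

    for (i = 0; i < nLoads; ++i) {
        if (!is_aligned16(p + i * strideBytes))
            ++bad;
    }
    return bad;
}
```

With an aligned array, a stride of 16 (two doubles, the non-overlapping pairs) gives no unaligned loads, while a stride of 8 (one double, the overlapping pairs) makes every second load unaligned.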
Oh, and you still don't need to push and pop eax - the compiler will figure that out itself.
--
Mats