To Salem: Yeah, I figured as much. Hell, even loop unrolling would help.
Quzah.
Hope is the first step on the road to disappointment.
To xErath: I didn't think the test results would differ _that_ much on different machines... well, whatever, at least my func looked cool.
To Salem: Not really, it was just "too big" to swallow at first. Anyway, I have played with Duff's device for a whole afternoon and came up with something more promising than my previous pathetic code. Here: http://www.ratol.fi/~ibaldine/my_memcpy.c
Quite big (152 lines), but who cares? Speed is the only thing that matters. Oh, and I'm using MMX registers too...
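For anyone who hasn't met Duff's device: here is a minimal sketch of the idea in plain C. It is not the code from my_memcpy.c (that one uses MMX registers); `duff_copy` is a made-up name, and the 8x unroll factor is just an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Byte-wise copy, unrolled 8x with Duff's device.
   duff_copy is a hypothetical name used only for this sketch. */
void *duff_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t passes;

    assert(n == 0 || (dst != NULL && src != NULL));
    if (n == 0)
        return dst;

    passes = (n + 7) / 8;   /* trips through the do-while, rounded up */
    switch (n % 8) {        /* jump into the loop body to copy the remainder first */
    case 0: do { *d++ = *s++;
    case 7:      *d++ = *s++;
    case 6:      *d++ = *s++;
    case 5:      *d++ = *s++;
    case 4:      *d++ = *s++;
    case 3:      *d++ = *s++;
    case 2:      *d++ = *s++;
    case 1:      *d++ = *s++;
            } while (--passes > 0);
    }
    return dst;
}
```

Call it like memcpy, e.g. `duff_copy(d, s, 500);`. Whether it beats a compiler's own memcpy depends on the machine and compiler, so measure before trusting it.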
Test results on my machine (XP 3000+, W2k SP4):
Code:
           | 1 M loops, 500 bytes | 4 M loops, 50 bytes
my_memcpy  | 7.354 s              | 7.545 s
VC6 memcpy | 10.132 s             | 11.646 s

And I've tested it like so:
Code:
oldT = timeGetTime();
for(dwLoop = 0; dwLoop < 100000000; dwLoop++)
    my_memcpy(d, s, 500);
    //memcpy(d, s, 500);
newT = timeGetTime();
printf("Time used: %f\n", (double)(newT - oldT)/1000.0f);
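The loop above relies on timeGetTime(), which is Windows-only. A rough portable sketch of the same kind of test, using the standard clock() instead, might look like this. `time_copy` and `copy_fn` are hypothetical names, and reading one byte back through a volatile variable is only an assumption about how to keep an optimizer from dropping the copies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>   /* memcpy, used as the baseline when calling time_copy */
#include <time.h>

/* Any function with memcpy's signature can be timed. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Returns processor time in seconds spent on `reps` calls of fn(dst, src, n).
   time_copy is a hypothetical helper, not part of the original test. */
double time_copy(copy_fn fn, void *dst, const void *src, size_t n, unsigned long reps)
{
    volatile unsigned char sink = 0;   /* forces the copied data to be read */
    clock_t start, stop;
    unsigned long i;

    assert(fn != NULL && n > 0 && reps > 0);

    start = clock();
    for (i = 0; i < reps; i++) {
        fn(dst, src, n);
        sink ^= ((unsigned char *)dst)[i % n];
    }
    stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Usage would be something like `printf("Time used: %f\n", time_copy(memcpy, d, s, 500, 1000000UL));`, then the same call with the function under test. Note that clock() counts processor time rather than wall-clock time, so the numbers are not directly comparable with timeGetTime() results.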
You used qwords, nice! I didn't know there was an mmo 64-bit register...
Only the 32-bit ones.
It's not mmo, it's mm0 (zero), one of the MMX registers (mm0-mm7).
Your memcpy will show next to no advantage over the VC6 version in the long run. Waste of time if you ask me.
OK. Maybe my function isn't _that_ much faster... then how about this: ftp://ftp.amd.com/pub/devconn/vcc/memcpy_amd.zip
It's AMD's version, optimized for Athlon/Duron processors. About 2.5 times faster (on my machine) than the original memcpy. THAT'S got to be worth something. Of course you'll have to identify the processor and all the crap that follows, but hey! Everything comes with a price tag!
Here's a PIII version: http://www.stereopsis.com/memcpy.html
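As for "identify the processor": one way to sketch it is to pick a copy routine once through a function pointer. This is only an illustration. `tuned_copy` is a stand-in that just forwards to memcpy (it is not AMD's routine), `select_copy` and `cpu_has_mmx` are made-up names, and the check uses the GCC/Clang builtin `__builtin_cpu_supports`, falling back to plain memcpy on other compilers or non-x86 targets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Hypothetical stand-in for a processor-specific routine;
   here it simply forwards to memcpy. */
static void *tuned_copy(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Reports MMX support where the compiler offers a builtin for it;
   elsewhere it conservatively answers "no". */
static int cpu_has_mmx(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("mmx") ? 1 : 0;
#else
    return 0;
#endif
}

/* Chooses a routine on the first call and returns the same one afterwards. */
copy_fn select_copy(void)
{
    static copy_fn chosen = NULL;

    if (chosen == NULL)
        chosen = cpu_has_mmx() ? tuned_copy : memcpy;
    assert(chosen != NULL);
    return chosen;
}
```

A caller would do `copy_fn fast = select_copy();` once at startup and then use `fast(d, s, 500);` wherever memcpy was used. The detection cost is paid once, but every call now goes through a pointer, which is part of the price tag.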
A bit slower, but still almost 2 times faster than the original.
Why are you still not focusing on optimizing the rest of your loop/program which has to run a million times, instead of focusing on memcpy? People keep asking, and you apparently just keep ignoring it, hoping some miracle memcpy is going to fix the rest of your program or something.
Quzah.
To quzah: No, I am not. I'm optimizing memcpy because the name of the topic is "Help: About memcpy()". I saw it and thought, "Hey, these guys are looking for the truth." Hell, it's not even my program, man! The loop was there only to test my function.
To iwabee: My bad, I was thinking you were the original poster, to whom it's been suggested that they optimize the rest of their program first.
Quzah.
To iwabee: I used your tweaked code (MMX instructions) in my program, but something may be wrong. When I ran the program, nothing was displayed (it should display the time). Could you tell me which compiler switches you used?
To quzah: Thank you for your suggestion. I have tried to modify and optimize the rest of the program. I think this question is no longer just my own problem; it has gone beyond its original scope. I've learned a lot that I didn't know before.
I'm not a native English speaker, so I can't express my feelings and ideas perfectly. Anyway, thanks to everyone who has participated in this topic.
Nothing special, just the default ones. What compiler are you using? 'Cause I wrote that code using VC6+SP5. Anyway, you'll probably get better results using AMD's or SGI's memcpy (I posted links earlier).
Even 2.5 times isn't worth my time.
I don't use memcpy all that much anyways.