Thanks a lot!
I didn't notice your assembly code post, I see it now
Another thing I'm curious about: if I use VC6.0 to compile and link the program, the time is 13X days, but if I use Borland C++ 5.5, the time is 38 days.
A couple of points
1. longword copies generally work best if
- the source and destination memory addresses are longword aligned
- the amount of memory copied is a multiple of a longword in length
If this doesn't hold for you, then you may not see an obvious benefit.
2. Maybe you should try a memcpy based on Duff's device
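To make the two conditions in point 1 concrete, here is a minimal sketch of a check you could run before choosing a longword copy. The name `longword_copy_applies` is made up for illustration, and the 4-byte longword size is an assumption (true for 32-bit x86, not guaranteed elsewhere).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns nonzero when both addresses are
 * 4-byte aligned AND the length is a multiple of 4 bytes.
 * Assumes a 4-byte longword. */
int longword_copy_applies(const void *dest, const void *src, size_t n)
{
    const uintptr_t mask = 3u;  /* low two bits must be zero */

    return (((uintptr_t)dest | (uintptr_t)src) & mask) == 0
        && (n & mask) == 0;
}
```

If this returns 0 for the buffers in your loop, a longword copy may not give you an obvious benefit.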
If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
If at first you don't succeed, try writing your phone number on the exam paper.
Gee! I thought that Borland compilers were better than Microsoft's... Those 13 days were with the assembly memcpy??
Who said something about 20 days???
And you're welcome
//edit
About Duff's device
Code:
    send(to, from, count)
    register short *to, *from;
    register count;
    {
        do
            *to = *from++;
        while (--count > 0);
    }
He copies 2 bytes at a time.
And CPUs work better with 32-bit registers than with 16- or 8-bit ones, so moving 4 bytes at a time would be better.
Last edited by xErath; 06-22-2004 at 11:33 AM.
A version a bit more tweaked
Code:
    void *memcpy(void *_dest, const void *_src, int _size){
        __asm{
            push eax
            push ecx
            push edi
            push esi
            mov edi, _dest      ;edi destination, esi source
            mov esi, _src
            mov ecx, _size
        _cicle:                 ;copy 4 bytes at a time
            cmp ecx, 0          ;stop once nothing is left (ecx <= 0)
            jle _lbl_end
            mov eax, dword ptr [esi]
            mov dword ptr [edi], eax    ;may copy up to 3 bytes past _size
            add edi, 4
            add esi, 4
            sub ecx, 4
            jmp _cicle
        _lbl_end:
            pop esi
            pop edi
            pop ecx
            pop eax
        }
        return _dest;
    }
As for Salem's points about longword alignment and length: I think they hold for everything. If you allocate an 11-byte buffer, it will in fact occupy 12 bytes, a longword multiple, so there is no strong reason to copy the last 3 bytes one at a time. Just copy 4 bytes again, and 1, 2 or 3 extra bytes get transferred, depending on the size. That way the first loop runs only one extra time and the second loop is skipped. If you really want to copy exactly the number of bytes specified in the size variable, use my first asm implementation; if not, use this one... a few milliseconds better.
> //edit
> About Duff's device
Except what you wrote isn't Duff's device - read again
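For reference, a minimal sketch of what a memcpy-style copy based on Duff's device could look like. The name `duff_copy` is made up, and this version increments the destination pointer, which is an adaptation for memory-to-memory copying; it is a byte-wise illustration of the unrolling trick (a `switch` jumping into the middle of a `do`/`while`), not a tuned implementation.

```c
#include <stddef.h>

/* Hypothetical name. Copies count bytes with the loop body unrolled
 * 8 times; the switch jumps into the middle of the loop to handle
 * the count % 8 leftover bytes on the first pass. */
void *duff_copy(void *dest, const void *src, size_t count)
{
    unsigned char *to = dest;
    const unsigned char *from = src;
    size_t n;

    if (count == 0)
        return dest;            /* the do/while below runs at least once */

    n = (count + 7) / 8;        /* number of passes through the loop */
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
    return dest;
}
```

Unlike the "just copy another 4 bytes" approach, this writes exactly `count` bytes and nothing more.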
> If you have a buffer which you alloc 11 bytes to it, in fact it'll have 12 bytes
You don't know that for sure
What you can "get away with" on one machine translates into "segfault" on another.
> Gee! I thought that Borland compilers were better than microsoft... Those 13 days were with the assembly memcpy??
> Who said something about 20 days???
> And you're welcome
No, 13X means 130-139 days, not 13 days.
Maybe our notation conventions are different.
Glupp.... Yuppy Microsoft.... Wish they compiled Windows with Borland's stuff...
Still, try merging the assembly into your C code for C Builder. I had some very good results going from C to asm.
I've added your first assembly version of memcpy to my collection of memcpy functions and built a DLL from it. The DLL is compiled and linked with VC6.0. Then I used the DLL in my program (which is built with BC5.5), but the running time went up to 1XXX days... I wonder whether the reason is that a BC-built program is calling a VC-built DLL?
I have also added your first assembly code directly to my program. With the program built by VC, the time dropped to 11X days.
And you're absolutely sure the only way you can speed up your program is by hacking memcpy? I doubt this very much.
Quzah.
Hope is the first step on the road to disappointment.
Actually, I'm not absolutely sure it is the only way. Optimizing the code of this project is not my job; I just want to try. I know the biggest bottleneck of the program is that very large loop, but the number of iterations can't be reduced because of the requirements.
As for me, optimizing this code is not the main aim; I think I can improve my programming skills through this research.
I could mention Duff again, and perhaps use a profiler
Personally, I think both will go ignored....
Oh well,
If you want to optimize code you have to tweak all functions, and that includes memcpy. Using an asm version of memcpy like mine would be an improvement, but that's not enough. After all, you don't want to wait 38, or 43, or 1xxx days...
You call that a tweak? THAT'S a tweak (no offence of course):
Code:
    __declspec(naked) void *my_memcpy(void *_dest, const void *_src, int _size)
    {
        __asm
        {
            ; Prologue
            push ebp
            mov ebp, esp
            push edi
            push esi
            ; Initializations
            mov edi, _dest
            mov esi, _src
            cld
            ; Copying first n*4 bytes first...
            mov ecx, _size
            shr ecx, 2
            rep movsd
            ; ...and the remaining 0-3 bytes last
            mov ecx, _size
            and ecx, 0x03
            rep movsb
            ; Return value, loaded while ebp still points at this frame
            mov eax, _dest
            ; Epilogue
            pop esi
            pop edi
            mov esp, ebp
            pop ebp
            ret
        }
    }
Code will compile with VC++6.
On my machine, the execution time of your test program (500 bytes, 1000000 times) dropped from 26 seconds to 10 seconds! Certainly the processor _is_ optimizing that, because I'm copying the same 500 bytes, but the point is that my version is much faster.
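For anyone not on VC++, the same split (whole longwords first, then the 0-3 leftover bytes) can be sketched in portable C. The name `split_copy` is made up, and this is only an illustration of the `shr ecx, 2` / `and ecx, 0x03` idea, not a claim about what any compiler's memcpy does. The fixed-size `memcpy` calls are there to move 4 bytes at a time without aliasing or alignment problems; compilers commonly turn them into plain 32-bit loads and stores.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name. Copies exactly size bytes: size / 4 longwords,
 * then the size % 4 remaining bytes. Buffers must not overlap. */
void *split_copy(void *dest, const void *src, size_t size)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    size_t words = size >> 2;    /* like "shr ecx, 2"    */
    size_t tail  = size & 0x03;  /* like "and ecx, 0x03" */

    while (words-- > 0) {        /* like "rep movsd" */
        uint32_t w;
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
        d += sizeof w;
        s += sizeof w;
    }
    while (tail-- > 0)           /* like "rep movsb" */
        *d++ = *s++;

    return dest;
}
```

With an 11-byte source this touches bytes 0 through 10 only, so it does not rely on the allocator having rounded the buffer up to 12.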
...In fact this
mov ecx, _size
shr ecx, 2
rep movsd
was a great tweak!!!
I'm not used to these operators, nor to VC++'s full assembly syntax and preprocessor macros...
Still, I tried both my assembly function and yours, and surprisingly I get about 25% more time with your function... I don't get why... yours should be better.
Hum, I tested several memcpys:
TIME: 12 //VC++ memcpy
TIME: 17 //my memcpy
TIME: 23 //iwabee memcpy
like this:
Code:
    t = time(0);
    for (i = 0; i < 200000000; i++)
        memcpy_(d, s, 50);
    printf("TIME: %ld\n", (long)(time(0) - t));