Thread: Help: About memcpy()

  1. #16
    Registered User
    Join Date
    Jun 2004
    Posts
    42
    Thanks a lot!

    I didn't notice your assembly code post, I see it now

    Another thing I'm curious about: if I use VC6.0 to compile and link the program, the time is 13X days, but with Borland C++ 5.5 the time is 38 days.

  2. #17
    Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    A couple of points
    1. longword copies generally work best if
    - the source and destination memory addresses are longword aligned
    - the amount of memory copied is a multiple of a longword in length
    If this doesn't hold for you, then you may not see an obvious benefit.

    2. Maybe you should try a memcpy based on Duff's device
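    The two conditions in point 1 can be checked at run time before choosing a longword-at-a-time path. A minimal sketch in C, assuming a 4-byte longword as on the 32-bit x86 targets discussed in this thread; the helper name `is_longword_friendly` is hypothetical and was not posted by anyone here:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when a longword-at-a-time copy applies
   cleanly, i.e. both addresses are 4-byte aligned and the length is a
   multiple of 4; returns 0 otherwise. Assumes a 4-byte longword. */
int is_longword_friendly(const void *dest, const void *src, size_t size)
{
    const uintptr_t mask = sizeof(uint32_t) - 1;   /* low two bits: 0x3 */

    /* OR the three values together: if any of them has a low bit set,
       at least one of the conditions fails. */
    return (((uintptr_t)dest | (uintptr_t)src | (uintptr_t)size) & mask) == 0;
}
```

    When this returns 0, a copy routine would need a byte-wise head or tail (or a plain byte loop) to stay correct.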
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #18
    Registered User
    Join Date
    Jun 2004
    Posts
    722
    Quote Originally Posted by naruto
    Thanks a lot!

    I didn't notice your assembly code post, I see it now

    Another thing I'm curious about: if I use VC6.0 to compile and link the program, the time is 13X days, but with Borland C++ 5.5 the time is 38 days.
    Gee! I thought that Borland compilers were better than Microsoft's... Were those 13 days with the assembly memcpy??

    Who said something about 20 days???
    And you're welcome

    //edit
    About Duff's device
    Code:
    send(to, from, count)
    register short *to, *from;
    register count;
    {
        do
            *to = *from++;
        while(--count>0);
    }
    It copies 2 bytes at a time.
    CPUs work better with 32-bit registers than with 16- or 8-bit ones, so moving 4 bytes at a time would be better.
    Last edited by xErath; 06-22-2004 at 11:33 AM.

  4. #19
    Registered User
    Join Date
    Jun 2004
    Posts
    722
    A version a bit more tweaked

    Code:
    void *memcpy(void *_dest, const void *_src, int _size){
    __asm{
    	push eax
    	push ecx
    	push edi
    	push esi
    	mov edi, _dest	;edi = destination, esi = source
    	mov esi, _src
    	mov ecx, _size
    
    _cicle:		;copy 4 bytes per iteration
    		cmp ecx, 0	;done once no bytes remain (size <= 0)
    		jle _lbl_end
    		mov eax, dword ptr [esi]
    		mov dword ptr [edi], eax
    		add edi, 4
    		add esi, 4
    		sub ecx, 4
    		jmp _cicle
    
    _lbl_end:
    	pop esi
    	pop edi
    	pop ecx
    	pop eax
    	}
    	return _dest;
    }
    Like Salem said,
    Quote Originally Posted by Salem
    1. longword copies generally work best if
    - the source and destination memory addresses are longword aligned
    - the amount of memory copied is a multiple of a longword in length
    But I think this holds for everything. If you allocate an 11-byte buffer, it will in fact occupy 12 bytes, a longword multiple, so there's no strong reason to copy those last 3 bytes one at a time... just copy 4 bytes again, and 1, 2 or 3 extra bytes (depending on the size) get transferred at the end. This way the first loop runs just one more time and the second loop is skipped. If you really want to copy exactly the number of bytes specified in the size variable, use my first asm implementation; if not, use this one... it's a few milliseconds faster.

  5. #20
    Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > //edit
    > About Duff's device
    Except what you wrote isn't Duff's device - read again
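    For comparison, here is a sketch of the construct usually meant by "Duff's device": an eight-way unrolled loop whose `switch` jumps into the middle of the `do`/`while` body to deal with the leftover count. This variant increments `to`, so it copies memory to memory (Tom Duff's original wrote to a single output register). The name `duff_copy` and the `size_t` count are assumptions made for illustration.

```c
#include <stddef.h>

/* Copies `count` shorts from `from` to `to` using Duff's device.
   The switch enters the unrolled body at the case matching count % 8,
   so the first pass handles the remainder and every later pass copies
   a full group of eight. */
void duff_copy(short *to, const short *from, size_t count)
{
    size_t passes;

    if (count == 0)
        return;                      /* the device itself assumes count > 0 */

    passes = (count + 7) / 8;        /* number of trips through the loop */
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--passes > 0);
    }
}
```

    The point of the unrolling is to pay for the loop test once per eight copies instead of once per copy; whether that beats the compiler's own memcpy is something only a measurement can tell.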

    > If you allocate an 11-byte buffer, it will in fact occupy 12 bytes
    You don't know that for sure
    What you can "get away with" on one machine translates into "segfault" on another.
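    A copy can still move most of the data a longword at a time without touching memory past the requested size: copy `size / 4` whole longwords, then finish the remaining 0 to 3 bytes one by one. A minimal C sketch, assuming non-overlapping buffers; `copy_exact` is a hypothetical name, and the fixed-size `memcpy` calls stand in for a single 32-bit load and store so that the sketch stays valid for unaligned addresses:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copies exactly `size` bytes, never more.
   Whole 4-byte words go first, then the 0-3 byte tail. */
void *copy_exact(void *dest, const void *src, size_t size)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    size_t words = size / sizeof(uint32_t);
    size_t tail = size % sizeof(uint32_t);

    while (words-- > 0) {
        uint32_t w;
        memcpy(&w, s, sizeof w);     /* one 32-bit load  */
        memcpy(d, &w, sizeof w);     /* one 32-bit store */
        d += sizeof w;
        s += sizeof w;
    }
    while (tail-- > 0)               /* remaining 0-3 bytes */
        *d++ = *s++;

    return dest;
}
```

    With an 11-byte request this performs two word copies and three byte copies, so a byte sitting right after the destination is left untouched.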

  6. #21
    Registered User
    Join Date
    Jun 2004
    Posts
    42
    Quote Originally Posted by xErath
    Gee! I thought that Borland compilers were better than Microsoft's... Were those 13 days with the assembly memcpy??

    Who said something about 20 days???
    And you're welcome


    No, 13X means 130 - 139, not 13 days

    Maybe we just use different notation conventions.

  7. #22
    Registered User
    Join Date
    Jun 2004
    Posts
    722
    Quote Originally Posted by naruto
    No, 13X means 130 - 139, not 13 days
    Glupp.... Yuppy Microsoft.... Wish they compiled Windows with Borland's stuff...

    Still, try merging the assembly into your C code for the Borland build. I got some very good results going from C to asm.

  8. #23
    Registered User
    Join Date
    Jun 2004
    Posts
    42
    I've added your first assembly version of memcpy to my collection of memcpy functions and built a DLL from it. The DLL was compiled and linked with VC6.0. Then I used the DLL in my program (which is built with BC5.5), but the running time went up to 1XXX days.... I wonder whether the cause is a BC-built program calling a VC-built DLL?

    I also added your first assembly code directly to my program. Built with VC, the time dropped to 11X days.

  9. #24
    quzah
    Join Date
    Oct 2001
    Posts
    14,826
    And you're absolutely sure the only way you can speed up your program is by hacking memcpy? I doubt this very much.

    Quzah.
    Hope is the first step on the road to disappointment.

  10. #25
    Registered User
    Join Date
    Jun 2004
    Posts
    42

    Cool

    Quote Originally Posted by quzah
    And you're absolutely sure the only way you can speed up your program is by hacking memcpy? I doubt this very much.

    Quzah.
    Actually, I'm not absolutely sure it is the only way. Optimizing this project's code is not my responsibility; I just want to try. I know the biggest bottleneck in the program is that very large loop, but the number of iterations can't be reduced because of the requirements.

    For me, optimizing this code is not the main goal; I think I can improve my programming skills through this research.

  11. #26
    Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    I could mention Duff again, and perhaps suggest using a profiler
    Personally, I think both will go ignored....
    Oh well,

  12. #27
    Registered User
    Join Date
    Jun 2004
    Posts
    722
    If you wanted to optimize the code you'd have to tweak all the functions, and that includes memcpy. Using an asm version of memcpy like mine would be an improvement, but that's not enough. After all, you don't want to wait 38, or 43, or 1xxx days...

  13. #28
    Registered User
    Join Date
    Jun 2004
    Posts
    84
    Quote Originally Posted by xErath
    A version a bit more tweaked

    Code:
    void *memcpy(void *_dest, const void *_src, int _size){
    __asm{
    	push eax
    	push ecx
    	push edi
    	push esi
    	mov edi, _dest	;edi = destination, esi = source
    	mov esi, _src
    	mov ecx, _size
    
    _cicle:		;copy 4 bytes per iteration
    		cmp ecx, 0	;done once no bytes remain (size <= 0)
    		jle _lbl_end
    		mov eax, dword ptr [esi]
    		mov dword ptr [edi], eax
    		add edi, 4
    		add esi, 4
    		sub ecx, 4
    		jmp _cicle
    
    _lbl_end:
    	pop esi
    	pop edi
    	pop ecx
    	pop eax
    	}
    	return _dest;
    }
    You call that a tweak? THAT'S a tweak (no offence of course):
    Code:
    __declspec(naked) void *my_memcpy(void *_dest, const void *_src, int _size)
    {
    	__asm
    	{
    		; Prologue
    		push ebp
    		mov ebp, esp
    		push edi
    		push esi
    
    		; Initializations
    		mov edi, _dest
    		mov esi, _src
    		cld
    
    		; Copying first n*4 bytes first...
    		mov ecx, _size
    		shr ecx, 2
    		rep movsd
    		
    		; ...and the remaining 0-3 bytes last
    		mov ecx, _size
    		and ecx, 0x03
    		rep movsb
    
    		; Return value: load it while ebp still addresses the arguments
    		mov eax, _dest
    
    		; Epilogue
    		pop esi
    		pop edi
    		mov esp, ebp
    		pop ebp
    		ret
    	}
    }
    The code compiles with VC++6.
    On my machine, the execution time of your test program (500 bytes, 1000000 times) dropped from 26 seconds to 10 seconds! Certainly the processor _is_ optimizing that, because I'm copying the same 500 bytes each time, but the point is that my version is much faster.

  14. #29
    Registered User
    Join Date
    Jun 2004
    Posts
    722
    ...In fact this
    mov ecx, _size
    shr ecx, 2
    rep movsd
    was a great tweak!!!
    I'm not used to these instructions, nor to VC++'s full assembly syntax and preprocessor macros...
    Still, I tried both my assembly function and yours, and surprisingly your function takes about 25% more time... I don't get why... yours should be better.

  15. #30
    Registered User
    Join Date
    Jun 2004
    Posts
    722
    Hum, I tested several memcpys:
    TIME: 12 //VC++ memcpy
    TIME: 17 //my memcpy
    TIME: 23 //iwabee memcpy

    like this:
    Code:
    	t=time(0);	/* start time, in seconds */
    	for(i=0;i<200000000;i++)
    		memcpy_(d, s, 50);
    	printf("TIME: %ld\n", (long)(time(0)-t));	/* elapsed seconds */
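
    `time()` typically counts whole seconds, so shorter runs are hard to compare with it. Below is a sketch of the same kind of loop timed with the standard `clock()` function instead, taking the routine under test as a function pointer. `time_copy` and `copy_fn` are hypothetical names, and the buffer size and iteration count are parameters rather than the values used above.

```c
#include <stddef.h>
#include <string.h>   /* memcpy, usable as the baseline routine */
#include <time.h>

/* Signature shared by memcpy and any replacement being compared. */
typedef void *(*copy_fn)(void *dest, const void *src, size_t size);

/* Returns the processor time, in seconds, spent calling `fn`
   `iterations` times on the same pair of buffers. */
double time_copy(copy_fn fn, void *dest, const void *src,
                 size_t size, unsigned long iterations)
{
    unsigned long i;
    clock_t start = clock();

    for (i = 0; i < iterations; i++)
        fn(dest, src, size);

    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

    A call such as `time_copy(memcpy, d, s, 50, 200000000UL)` would give the baseline figure. Copying the same small buffers over and over keeps everything in cache, so the numbers say little about large or scattered copies.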
