Help: About memcpy()

**naruto** · 06-22-2004

Thanks a lot!

I didn't notice your assembly code post, I see it now

Another thing I'm curious about: if I use VC6.0 to compile and link the program, the time is 13X days, but if I use Borland C++ 5.5, the time is 38 days.

**Salem** · 06-22-2004

A couple of points
1. longword copies generally work best if
- the source and destination memory addresses are longword aligned
- the amount of memory copied is a multiple of a longword in length
If this doesn't hold for you, then you may not see an obvious benefit.

2. Maybe you should try a memcpy based on Duff's device
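
For reference, a minimal sketch of what a Duff's-device copy could look like in C. Unlike Duff's original, which wrote every value to a single output register, this variant advances both pointers so it behaves like memcpy. The name `duff_copy` and the 8-way unrolling are assumptions for illustration, not code from this thread.

```c
#include <stddef.h>

/* Hypothetical helper: byte copy unrolled 8 times with Duff's device. */
void *duff_copy(void *dest, const void *src, size_t count)
{
    unsigned char *to = dest;
    const unsigned char *from = src;
    size_t n;

    if (count == 0)             /* the do/while below always runs once */
        return dest;

    n = (count + 7) / 8;        /* number of passes through the loop */
    switch (count % 8) {        /* jump into the loop to handle the remainder */
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
    return dest;
}
```

Whether this beats the compiler's own memcpy depends on the compiler and CPU, so it is worth measuring before relying on it.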

**xErath** · 06-22-2004

Originally Posted by naruto

Thanks a lot!

I didn't notice your assembly code post, I see it now

Another thing I'm curious about: if I use VC6.0 to compile and link the program, the time is 13X days, but if I use Borland C++ 5.5, the time is 38 days.

Gee! I thought that Borland's compilers were better than Microsoft's... Were those 13 days with the assembly memcpy??

Who said something about 20 days???

And you're welcome

//edit
About Duff's device

Code:

send(to, from, count)
register short *to, *from;
register count;
{
    do
        *to = *from++;
    while(--count>0);
}

It copies 2 bytes at a time.
CPUs work better with 32-bit registers than with 16- or 8-bit ones, so moving 4 bytes at a time would be better.

**xErath** · 06-22-2004

A version a bit more tweaked

Code:

void *memcpy(void *_dest, const void *_src, int _size){
__asm{
	push eax
	push ecx
	push edi
	push esi
	mov edi, _dest	;edi = destination, esi = source
	mov esi, _src
	mov ecx, _size

_cicle:		;copy 4 bytes at a time
		cmp ecx, 0	;stop once the remaining size is <= 0
		jle _lbl_end
		mov eax, dword ptr [esi]
		mov dword ptr [edi], eax
		add edi, 4
		add esi, 4
		sub ecx, 4
		jmp _cicle

_lbl_end:
	pop esi
	pop edi
	pop ecx
	pop eax
	}
	return _dest;
}

Like Salem said,

Originally Posted by Salem

1. longword copies generally work best if
- the source and destination memory addresses are longword aligned
- the amount of memory copied is a multiple of a longword in length

But I think that this holds for everything. If you allocate an 11-byte buffer, it will in fact have 12 bytes, a longword multiple, so there isn't any strong reason to copy those last 3 bytes one at a time... just copy 4 bytes again, and one extra byte at the end gets transferred (or 2 or 3, depending on the size). This way the first loop only runs one more time, and the second loop is skipped. If you really want to copy the exact number of bytes specified in the size variable, use my first asm implementation; if not, use this one... it's a few milliseconds better.

**Salem** · 06-22-2004

> //edit
> About Duff's device
Except what you wrote isn't Duff's device - read again

> If you have a buffer which you alloc 11 bytes to it, in fact it'll have 12 bytes
You don't know that for sure
What you can "get away with" on one machine translates into "segfault" on another.
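
One way to keep the 4-bytes-per-step loop without touching memory past `size` is to split the copy into whole blocks plus a 0-3 byte tail. A minimal sketch, assuming a hypothetical helper name `copy_exact`; each block is written as four byte moves so the code makes no alignment assumptions, and whether a compiler merges them into a single 32-bit move is not guaranteed.

```c
#include <stddef.h>

/* Hypothetical helper (not from the thread): copies exactly `size` bytes. */
void *copy_exact(void *dest, const void *src, size_t size)
{
    unsigned char *to = dest;
    const unsigned char *from = src;
    size_t blocks = size / 4;   /* whole 4-byte blocks */
    size_t tail = size % 4;     /* 0-3 leftover bytes */

    while (blocks-- > 0) {
        to[0] = from[0];
        to[1] = from[1];
        to[2] = from[2];
        to[3] = from[3];
        to += 4;
        from += 4;
    }
    while (tail-- > 0)          /* finish byte by byte, never past size */
        *to++ = *from++;
    return dest;
}
```

With an 11-byte request this writes bytes 0-10 only, so a 12th byte that may or may not belong to the buffer is never read or written.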

**naruto** · 06-22-2004

Originally Posted by xErath

Gee! I thought that Borland compilers were better than microsoft... Those 13 days were with the assembly memcpy??

Who said something about 20 days???

And you're welcome

No, 13X means 130 - 139, not 13 days

Maybe our conventions for writing this are different.

**xErath** · 06-22-2004

Originally Posted by naruto

No, 13X means 130 - 139, not 13 days

Glupp.... Yuppy Microsoft.... Wish they compiled Windows with Borland's stuff...

Still, try to merge the assembly with your C for the C builder. I had some very good results going from C to asm.

**naruto** · 06-23-2004

I've added your first assembly version of memcpy to my collection of memcpy functions and generated a DLL file. The DLL is compiled and linked with VC6.0. Then I used the DLL in my program (the program is built with BC5.5). But the running time became 1XXX days.... I wonder whether the cause is a BC-built program using a VC-built DLL?

I have also added your first assembly code directly into my program. With the program built by VC, the time decreased to 11X days.

**quzah** · 06-23-2004

And you're absolutely sure the only way you can speed up your program is by hacking memcpy? I doubt this very much.

Quzah.

**naruto** · 06-23-2004

Originally Posted by quzah

And you're absolutely sure the only way you can speed up your program is by hacking memcpy? I doubt this very much.

Quzah.

Actually, I'm not absolutely sure it is the only way. Optimizing this project's code is not my job; I just want to try. I know the biggest bottleneck in the program is that very large loop, but its iteration count can't be reduced because of the requirements.

For me, optimizing this code is not the main aim; I think I can improve my programming skills through this research.

**Salem** · 06-23-2004

I could mention Duff again, and perhaps use a profiler
Personally, I think both will go ignored....
Oh well,

**xErath** · 06-23-2004

If you wanted to optimize the code you'd have to tweak all functions, and that includes memcpy. Using an asm version of memcpy, like mine, would be an improvement, but that's not enough. After all, you don't want to wait 38, or 43, or 1xxx days...

**iwabee** · 06-23-2004

Originally Posted by xErath

A version a bit more tweaked

Code:

void *memcpy(void *_dest, const void *_src, int _size){
__asm{
	push eax
	push ecx
	push edi
	push esi
	mov edi, _dest	;edi = destination, esi = source
	mov esi, _src
	mov ecx, _size

_cicle:		;copy 4 bytes at a time
		cmp ecx, 0	;stop once the remaining size is <= 0
		jle _lbl_end
		mov eax, dword ptr [esi]
		mov dword ptr [edi], eax
		add edi, 4
		add esi, 4
		sub ecx, 4
		jmp _cicle

_lbl_end:
	pop esi
	pop edi
	pop ecx
	pop eax
	}
	return _dest;
}

You call that a tweak? THAT'S a tweak (no offence of course):

Code:

__declspec(naked) void *my_memcpy(void *_dest, const void *_src, int _size)
{
	__asm
	{
		; Prologue
		push ebp
		mov ebp, esp
		push edi
		push esi

		; Initializations
		mov edi, _dest
		mov esi, _src
		cld

		; Copying first n*4 bytes first...
		mov ecx, _size
		shr ecx, 2
		rep movsd
		
		; ...and the remaining 0-3 bytes last
		mov ecx, _size
		and ecx, 0x03
		rep movsb

		; Return value: load it while ebp still addresses the arguments
		mov eax, _dest

		; Epilogue
		pop esi
		pop edi
		mov esp, ebp
		pop ebp
		ret
	}
}

Code will compile with VC++6.
On my machine, the execution time of your test program (500 bytes, 1000000 times) dropped from 26 seconds to 10 seconds! Certainly the processor _is_ optimizing that, because I'm copying the same 500 bytes, but the point is that my version is much faster.

**xErath** · 06-23-2004

...In fact this
mov ecx, _size
shr ecx, 2
rep movsd
was a great tweak!!!
I'm not used to these operators, nor to VC++'s full assembly syntax and precompiler macros...
Still, I tried both my assembly function and yours, and surprisingly I get about 25% more time with your function... I don't get why... yours should be better.

**xErath** · 06-23-2004

Hum, I tested several memcpys:
TIME: 12 //VC++ memcpy
TIME: 17 //my memcpy
TIME: 23 //iwabee memcpy

like this:

Code:

	t=time(0);
	for(i=0;i<200000000;i++)
		memcpy_(d, s, 50);
	printf("TIME: %ld\n", (long)(time(0)-t));	/* cast: time_t is not guaranteed to be int */
