Thread: My mem_n_cpy() in Assembly

  1. #1 siavoshkc (System Novice) | Join Date: Jan 2006 | Location: Tehran | Posts: 1,246

    My mem_n_cpy() in Assembly

    I wrote it in assembly:
    Code:
    .486
    .model flat, c
    .data
    ;No data defined
    .code
    
    
    mem_n_cpy PROC
    ; edx	remainder
    ; eax	quotient
    ; esi	src
    ; edi	dst
    ; ecx	acts as a copy buffer
    ; ebx	temp
    	
    	push	ebp
    	mov		ebp, esp
    	push	edi
    	push	esi
    	push	ebx				;ebx is callee-saved in cdecl, so preserve it
    
    	mov		eax, [ebp+16]	;Count parameter
    	
    	cmp		eax, 0	;
    	jle		RETURN	;Count is zero or negative, invalid parameter
    	
    	mov		edi, [ebp+8]	;dst parameter
    	mov		esi, [ebp+12]	;src parameter
    	
    	xor		edx, edx
    	mov		ebx, 4
    	div		ebx
    	
    	;Check to see if src is overlapped on dst or not
    	lea		ebx, [edx + 4*eax]
    	mov		ecx, edi 
    	sub		ecx, esi	
    	add		ecx, ebx
    	cmp		ecx, ebx
    	jg		REVERSE		;dst is above src - copy backwards to be safe
    	
    
    	test	eax, eax	;No whole dwords to copy?
    	je		QDONE
    QLOOP:
    	mov		ecx, [esi]
    	mov		[edi], ecx
    	add		esi, 4
    	add		edi, 4
    	dec		eax
    	jne		QLOOP
    	
    QDONE:
    	test	edx, edx
    	je		RETURN
    	
    RLOOP:
    	dec		edx
    	mov		cl, BYTE PTR [esi]
    	mov		BYTE PTR [edi], cl
    	inc		esi
    	inc		edi
    	test	edx, edx
    	jne		RLOOP	
    	jmp		RETURN
    	
    REVERSE:
    	lea		ebx, [edx + 4*eax]
    	add		esi, ebx
    	add		edi, ebx
    	
    	test	eax, eax	;No whole dwords to copy?
    	je		RQDONE
    RQLOOP:
    	sub		esi, 4
    	sub		edi, 4
    	mov		ecx, [esi]
    	mov		[edi], ecx
    	dec		eax
    	jne		RQLOOP
    	
    RQDONE:
    	test	edx, edx
    	je		RETURN
    
    RRLOOP:
    	dec		edx
    	dec		esi
    	dec		edi
    	mov		cl, BYTE PTR [esi]
    	mov		BYTE PTR [edi], cl
    	test	edx, edx
    	jne		RRLOOP
    	
    RETURN:
    	pop		ebx
    	pop		esi
    	pop		edi
    	mov		esp, ebp
    	pop		ebp
    	ret
    
    mem_n_cpy ENDP
    END
    I'd appreciate your ideas about it.
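    For anyone who wants to try it, a minimal cdecl test call might look like the sketch below (the buffer names and the dst, src, count argument order are just assumptions based on the stack offsets above):
    Code:
    .data
    src_buf	BYTE 64 DUP(0AAh)	;hypothetical source buffer
    dst_buf	BYTE 64 DUP(0)		;hypothetical destination buffer
    
    .code
    test_copy PROC
    	push	64				;count (arguments pushed right to left)
    	push	OFFSET src_buf	;src
    	push	OFFSET dst_buf	;dst
    	call	mem_n_cpy
    	add		esp, 12			;caller removes the arguments (cdecl)
    	ret
    test_copy ENDP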

  2. #2 CornedBee (Cat without Hat) | Join Date: Apr 2003 | Posts: 8,895
    I imagine that even a simple REP STOSD would be faster than your main loop.

    Memory copy routines are subject to extreme optimization, and to peculiar differences between individual CPUs. Generally, a good memcpy detects the CPU and the alignment of the source and destination addresses, and chooses a copy method accordingly.

    For example, the Core2 CPUs have a 128-bit data channel to the XMM registers, so for 16-byte-aligned storage the fastest copying method is to use the SSE load/store instructions. On the other hand, if the storage is not aligned, those methods don't work. If the misalignment is the same for source and destination, you can copy the start and end by slower methods and push the middle through the fast way.
    The AMD K8 series, however, has only a 64-bit data channel to the XMM registers, making the SSE copy a lot slower - slower than other methods, anyway.
    In general, you should consult the vendor's optimization manual in order to find out how to implement memcpy.
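    For illustration only, the aligned Core2 case might look roughly like this - a sketch, not tuned code, assuming ESI = src, EDI = dst, ECX = byte count, both pointers 16-byte aligned, ECX a non-zero multiple of 64, and .686/.xmm enabled:
    Code:
    SSE_ALIGNED_LOOP:
    	movdqa	xmm0, [esi]			; load 64 bytes from the source...
    	movdqa	xmm1, [esi+16]
    	movdqa	xmm2, [esi+32]
    	movdqa	xmm3, [esi+48]
    	movdqa	[edi], xmm0			; ...and store them to the destination
    	movdqa	[edi+16], xmm1
    	movdqa	[edi+32], xmm2
    	movdqa	[edi+48], xmm3
    	add		esi, 64
    	add		edi, 64
    	sub		ecx, 64
    	jnz		SSE_ALIGNED_LOOP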

    As for CPU detection, remember that you only need to do that once.
    Code:
    .data
    cpu_memcpy DWORD switch_memcpy	; pointer to the routine to use; starts as the detection stub
    .code
    
    my_memcpy PROC
      ; prologue stuff here
      mov eax, [cpu_memcpy]
      ; jump to the specific routine
      jmp eax
    EPILOGUE::            ; double colon so the label is visible outside the PROC
      ; function epilogue here
      ret
    my_memcpy ENDP
    
    switch_memcpy:
      ; detect CPU here
      ; e.g. with the cpuid instruction
      ; detected 128-bit data bus CPU:
      mov [cpu_memcpy], OFFSET sse_memcpy
      ; etc.
      mov eax, [cpu_memcpy]
      jmp eax
    
    sse_memcpy:
      ; test alignment
      ; jump to different routine if it's wrong
      ; do the SSE stuff here
      jmp EPILOGUE
    
    stosd_memcpy:
      ; a possible fallback
    The thing about your version is that I don't see why you would write it in assembly. It looks like a direct translation of some very straightforward C code, except that the compiler would probably have some scheduling tricks up its sleeve.

  3. #3 siavoshkc (System Novice) | Join Date: Jan 2006 | Location: Tehran | Posts: 1,246
    As you may know, MSVC uses assembly to implement memcpy().

    [edit] What are the 'U' and 'V' in the comments of the assembly code?

  4. #4 CornedBee (Cat without Hat) | Join Date: Apr 2003 | Posts: 8,895
    Quote Originally Posted by siavoshkc View Post
    As you may know, MSVC uses assembly to implement memcpy().
    Yes. You'll typically implement memcpy in assembly, but it will typically also be quite complex.

    What are the 'U' and 'V' in the comments of the assembly code?
    Huh?

  5. #5 Kernel hacker | Join Date: Jul 2007 | Location: Farncombe, Surrey, England | Posts: 15,677
    Quote Originally Posted by siavoshkc View Post
    As you may know, MSVC uses assembly to implement memcpy().
    Yes, in fact it is an "intrinsic" in most circumstances, meaning that not only is it implemented in assembler, but the compiler actually knows by itself how to form a memcpy() from its arguments, without calling a function at all. There are probably conditions where the C library memcpy() is called, and yes, that would be written in highly optimized assembler, since memcpy() is one of several functions that need to be as fast as they can be - applications often use memcpy() in performance-critical sections of code [including benchmarks].
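    For example, for a small copy whose size is known at compile time, the intrinsic form typically turns into a couple of plain moves rather than a call - something like this sketch (not actual MSVC output; the register choices are arbitrary):
    Code:
    	; memcpy(dst, src, 8) with dst in EDI and src in ESI
    	mov		eax, [esi]		; copy the first four bytes
    	mov		[edi], eax
    	mov		eax, [esi+4]	; copy the last four bytes
    	mov		[edi+4], eax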

    I have no idea what U and V refer to in the comments - I might know if you post an example.

    --
    Mats

  6. #6 siavoshkc (System Novice) | Join Date: Jan 2006 | Location: Tehran | Posts: 1,246
    Of course, as CornedBee points out, a plain set of REP STOSD + REP STOSB would most likely beat your function by a bit. With a little bit of math + the STD instruction, you can do the backwards version too.
    Can you give me an example using STOSD, please?

    I have no idea what U and V are referring to in comments - I may know if you post an example.
    A part of the VC memcpy():
    Code:
    LeadUp1:
            and     edx,ecx         ;U - trailing byte count
            mov     al,[esi]        ;V - get first byte from source
    
            mov     [edi],al        ;U - write first byte to destination
            mov     al,[esi+1]      ;V - get second byte from source
    
            mov     [edi+1],al      ;U - write second byte to destination
            mov     al,[esi+2]      ;V - get third byte from source
    
            shr     ecx,2           ;U - shift down to dword count
            mov     [edi+2],al      ;V - write third byte to destination
    
            add     esi,3           ;U - advance source pointer
            add     edi,3           ;V - advance destination pointer
    
            cmp     ecx,8           ;U - test if small enough for unwind copy
            jb      short CopyUnwindUp ;V - if so, then jump
    
            rep     movsd           ;N - move all of our dwords
    
            jmp     dword ptr TrailUpVec[edx*4] ;N - process trailing bytes
    
            align   @WordSize

  7. #7 Kernel hacker | Join Date: Jul 2007 | Location: Farncombe, Surrey, England | Posts: 15,677
    I _think_ those refer to an Intel design - I'm not sure which one. It had U and V "units" in the processor, and the comments are there to say which unit would execute those instructions - alternating between them would improve performance by eliminating stalls. More modern processor designs have out-of-order and speculative execution, so it's not necessarily meaningful on the latest models of processors.

    When I said STOSD/STOSB, I actually meant MOVSD/MOVSB, which is what the memcpy segment you copied uses. STOS[B,D] is for storing constant values (so memset and similar functions).
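    To answer the example request with the corrected instruction, a minimal forward copy with REP MOVSD/MOVSB might look like this sketch (assuming ESI = src, EDI = dst, ECX = byte count, and no overlap handling):
    Code:
    	cld						; make the string instructions count upwards
    	mov		edx, ecx		; remember the byte count
    	shr		ecx, 2			; number of whole dwords
    	rep		movsd			; copy the dwords
    	mov		ecx, edx
    	and		ecx, 3			; 0..3 trailing bytes
    	rep		movsb			; copy the remainder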

    --
    Mats

  8. #8 VirtualAce (Registered User) | Join Date: Aug 2001 | Posts: 9,607
    U and V are basically integer pipelines inside the CPU. There was a huge site on the internet devoted to how you could gain performance by optimizing your assembly so that it could use the U and V pipelines at the same time. Not sure if it is still up or not.

    You can write code that actually forces the CPU to only use U or V, which means, in theory, you are getting half the performance. Most compilers should be optimized by now to take advantage of the U and V pipelines.

    Quite surprised to see this, as the U and V pipelines were introduced quite a while ago. Don't know if AMD is identical to Intel in this regard.
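    To illustrate the pairing idea, here is a tiny sketch (the real pairing rules are more involved than this):
    Code:
    	; These two instructions are independent, so they can pair:
    	; one issues in the U pipe, the other in the V pipe, in the same cycle.
    	mov		eax, [esi]
    	mov		ebx, [esi+4]
    
    	; These two cannot pair, because the second needs the result of the first.
    	mov		eax, [esi]
    	add		eax, ecx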

  9. #9 CornedBee (Cat without Hat) | Join Date: Apr 2003 | Posts: 8,895
    Yes, AMD has a superscalar design, too.
    (Eh, I think that's what superscalar means, anyway.)

  10. #10 Kernel hacker | Join Date: Jul 2007 | Location: Farncombe, Surrey, England | Posts: 15,677
    Quote Originally Posted by CornedBee View Post
    Yes, AMD has a superscalar design, too.
    (Eh, I think that's what superscalar means, anyway.)
    AMD and Intel processors have quite different internal designs, however. In general, code that runs OK on Intel runs better on AMD processors [up until Core2 at least], whilst Intel often requires a "good compiler" to solve the problems. Perhaps this is coupled with the fact that Intel produces a compiler and AMD does not - so AMD fixes the processor to do a better job with "average" code.

    I believe both processors are now capable of using any pipeline for any instruction, with a few exceptions (some instructions take a lot of chip real estate to implement, and if they are also relatively unusual instructions, it makes sense to have only one copy of that piece of hardware - a bit like a compiler only inlining a function if it is small enough).

    --
    Mats

  11. #11 abachler (Malum in se) | Join Date: Apr 2007 | Posts: 3,195
    Quote Originally Posted by CornedBee View Post
    Yes, AMD has a superscalar design, too.
    (Eh, I think that's what superscalar means, anyway.)
    http://en.wikipedia.org/wiki/Superscalar

    Yes, all modern processors are superscalar, but to a greater or lesser degree.

    I can't believe we are still stuck on 64-bit external data buses; we should be up to 512 by now, especially with the large pin-count packaging now available.

  12. #12 Kernel hacker | Join Date: Jul 2007 | Location: Farncombe, Surrey, England | Posts: 15,677
    Quote Originally Posted by abachler View Post
    http://en.wikipedia.org/wiki/Superscalar

    Yes, all modern processors are superscalar, but to a greater or lesser degree.

    I can't believe we are still stuck on 64-bit external data buses; we should be up to 512 by now, especially with the large pin-count packaging now available.
    First of all, Athlon64[1] and Pentium4/Core2 processors have a 128-bit wide bus. Making the bus wider improves the throughput for sequential access, but unfortunately a lot of applications don't spend that much time doing sequential access; they read and write all over the place. With modern CPUs having large caches and clever prefetching, it is actually quite hard to write an application that stalls due to memory throughput - that is, the processor can't fill the 128-bit memory bus today [for sequential access].

    The most likely cause of memory throughput problems is stupid programs that do memcpy()/memset() (or similar) operations on large chunks of memory and don't use non-temporal stores - this means that the processor keeps switching between writing and reading for a given piece of memory.

    Edit: [1] The first few models of Athlon64 did not have a 128-bit bus, but all socket-939 and newer processors do.
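    For illustration, a non-temporal copy loop might look like this sketch (assuming SSE2, ESI = src, EDI = dst, ECX = byte count, 16-byte-aligned pointers, and ECX a non-zero multiple of 16):
    Code:
    NT_LOOP:
    	movdqa	xmm0, [esi]		; normal (cached) load from the source
    	movntdq	[edi], xmm0		; non-temporal store - bypasses the cache
    	add		esi, 16
    	add		edi, 16
    	sub		ecx, 16
    	jnz		NT_LOOP
    	sfence					; make the non-temporal stores globally visible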

    --
    Mats

  13. #13 siavoshkc (System Novice) | Join Date: Jan 2006 | Location: Tehran | Posts: 1,246
    In dual-core processors, the 128-bit bus is divided into two 64-bit buses, one for each core.

  14. #14 Kernel hacker | Join Date: Jul 2007 | Location: Farncombe, Surrey, England | Posts: 15,677
    Quote Originally Posted by siavoshkc View Post
    In dual-core processors, the 128-bit bus is divided into two 64-bit buses, one for each core.
    No, it's not. The bus is still 128 bits wide, but it is shared via an arbiter to allow both cores to access memory. Any given memory access is EITHER core 0 or core 1, but it's ONE SINGLE bus.

    --
    Mats

  15. #15 siavoshkc (System Novice) | Join Date: Jan 2006 | Location: Tehran | Posts: 1,246
    No, it's not. The bus is still 128 bits wide, but it is shared via an arbiter to allow both cores to access memory. Any given memory access is EITHER core 0 or core 1, but it's ONE SINGLE bus.
    Maybe that's true for the new ones, but in the early AMD dual cores I am sure it was 2x64-bit. It wasn't true for the FX ones, though.

    [edit]
    Here is a thread about it: http://www.pcguide.com/vb/showthread.php?t=40148
