My mem_n_cpy() in Assembly


  1. #1
    System Novice siavoshkc's Avatar
    Join Date
    Jan 2006
    Location
    Tehran
    Posts
    1,231

    My mem_n_cpy() in Assembly

    I wrote it in assembly:
    Code:
    .486
    .model flat, c
    .data
    ;No data defined
    .code
    
    
    mem_n_cpy PROC
    ; edx	remainder
    ; eax	quotient
    ; esi	src
    ; edi	dst
    ; ecx	acts as a copy buffer
    ; ebx	temp
    	
    	push	ebp
    	mov		ebp, esp
    	push	edi
    	push	esi
    
    	mov		eax, [ebp+16]	;Count parameter
    	
    	cmp		eax, 0	;
    	jle		RETURN	;Count is zero or negative, invalid parameter
    	
    	mov		edi, [ebp+8]	;dst parameter
    	mov		esi, [ebp+12]	;src parameter
    	
    	xor		edx, edx
    	mov		ebx, 4
    	div		ebx
    	
    	;Check to see if src is overlapped on dst or not
    	lea		ebx, [edx + 4*eax]
    	mov		ecx, edi 
    	sub		ecx, esi	
    	add		ecx, ebx
    	cmp		ecx, ebx
    	jg		REVERSE		;Overlapped
    	
    
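    	test	eax, eax	;Count < 4 means no whole dwords: skip to the byte loop
    	jz		RLOOP		;(edx is non-zero here because count > 0)
    QLOOP: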
    	dec		eax
    	mov		ecx, [esi]
    	mov		[edi], ecx
    	add		esi, 4
    	add		edi, 4
    	test	eax, eax
    	jne		QLOOP
    	
    	test	edx, edx
    	je		RETURN
    	
    RLOOP:
    	dec		edx
    	mov		BYTE PTR cl, [esi]
    	mov		BYTE PTR [edi], cl
    	inc		esi
    	inc		edi
    	test	edx, edx
    	jne		RLOOP	
    	jmp		RETURN
    	
    REVERSE:
    	lea		ebx, [edx + 4*eax]
    	add		esi, ebx
    	add		edi, ebx
    	
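    	test	eax, eax	;Count < 4 means no whole dwords: skip to the byte loop
    	jz		RRLOOP		;(edx is non-zero here because count > 0)
    RQLOOP: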
    	dec		eax
    	sub		esi, 4
    	sub		edi, 4
    	mov		ecx, [esi]
    	mov		[edi], ecx
    	test	eax, eax
    	jne		RQLOOP
    	
    	cmp		edx, 0
    	je		RETURN
    
    RRLOOP:
    	dec		edx
    	dec		esi
    	dec		edi
    	mov		BYTE PTR cl, [esi]
    	mov		BYTE PTR [edi], cl
    	test	edx, edx
    	jne		RRLOOP
    	
    RETURN:
    	pop		esi
    	pop		edi
    	mov		esp, ebp
    	pop		ebp
    	ret
    
    mem_n_cpy ENDP
    END
    I appreciate your ideas about it.
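For comparison, the overlap handling above can be sketched in C. This is only a rough byte-wise reference, not a translation of the assembly: the name mem_n_cpy_ref is made up for illustration, and it copies backward only when dst starts inside the source range.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C reference: copies count bytes from src to dst and picks
   the copy direction so that overlapping buffers are handled correctly. */
void *mem_n_cpy_ref(void *dst, const void *src, size_t count)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Compare the addresses as integers. With unsigned wraparound the
       difference is below count exactly when dst lies in [src, src + count),
       which is the case where a forward copy would overwrite unread bytes. */
    if ((uintptr_t)d - (uintptr_t)s < count) {
        /* Copy from the last byte down to the first. */
        while (count--)
            d[count] = s[count];
    } else {
        /* Copy from the first byte up to the last. */
        for (size_t i = 0; i < count; i++)
            d[i] = s[i];
    }
    return dst;
}
```

Something like this is handy as a known-good baseline when testing the assembly version against overlapping buffers.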
    Learn C++ (C++ Books, C Books, FAQ, Forum Search)
    Code painter latest version on sourceforge DOWNLOAD NOW!
    Download FSB Data Integrity Tester.
    Siavosh K C

  2. #2
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,893
    I imagine that even a simple REP STOSD would be faster than your main loop.

    Memory copy routines are subject to extreme optimization, and to peculiar differences between individual CPUs. Generally, a good memcpy detects the CPU and the alignment of the source and destination addresses, then chooses a copy method accordingly.

    For example, the Core2 CPUs have a 128-bit data channel for the XMM registers, so for 16-byte aligned storage the fastest copying method is to use the SSE load/store instructions. On the other hand, if the storage is not aligned, the aligned load/store instructions don't work. If the misalignment is the same for source and destination, you can copy the start and end by slower methods and push the middle the fast way.
    The AMD K8 series, however, has only 64-bit data channels for the XMM registers, making the SSE copy a lot slower - slower than other methods, anyway.
    In general, you should consult the vendor's optimization manual in order to find out how to implement memcpy.
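The "slow start and end, fast middle" idea can be sketched in C as a planning step. This is a sketch under assumptions: the function name plan_aligned_copy and the 16-byte ALIGN constant are chosen for illustration, and the code only computes the three sizes rather than doing any SSE copying.

```c
#include <stddef.h>
#include <stdint.h>

#define ALIGN 16u  /* assumed width of the fast path, in bytes */

/* Returns 1 and fills head/middle/tail when src and dst share the same
   misalignment, so that after copying `head` bytes both pointers are
   ALIGN-aligned. Returns 0 when the misalignments differ and the aligned
   fast path cannot be used for both sides. */
int plan_aligned_copy(const void *dst, const void *src, size_t n,
                      size_t *head, size_t *middle, size_t *tail)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;

    if (d % ALIGN != s % ALIGN)
        return 0;

    /* Bytes needed to reach the next aligned address (0 if already aligned). */
    size_t h = (size_t)((ALIGN - d % ALIGN) % ALIGN);
    if (h > n)
        h = n;

    *head = h;                            /* copied byte by byte       */
    *middle = (n - h) / ALIGN * ALIGN;    /* copied in aligned blocks  */
    *tail = n - h - *middle;              /* leftover bytes at the end */
    return 1;
}
```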

    As for CPU detection, remember that you only need to do that once.
    Code:
    .data
    cpu_memcpy DWORD OFFSET switch_memcpy  ; starts at the detection stub
    
    .code
    my_memcpy PROC
      ; prologue stuff here
      mov eax, [cpu_memcpy]
      ; jump to the specific routine
      jmp eax
    EPILOGUE:
      ; function epilogue here
      ret
    my_memcpy ENDP
    
    switch_memcpy:
      ; detect CPU here
      ; e.g. with the cpuid instruction
      ; detected 128-bit data bus CPU:
      mov [cpu_memcpy], OFFSET sse_memcpy
      ; etc.
      mov eax, [cpu_memcpy]
      jmp eax
    
    sse_memcpy:
      ; test alignment
      ; jump to different routine if it's wrong
      ; do the SSE stuff here
      jmp EPILOGUE
    
    stosd_memcpy:
      ; a possible fallback
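The same detect-once pattern can be sketched in C with a function pointer. Everything here is a stand-in chosen for illustration: cpu_has_wide_sse() does no real detection, and both copy routines simply forward to the library memcpy so the sketch stays runnable.

```c
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

static void *switch_memcpy(void *dst, const void *src, size_t n);

/* Starts out pointing at the detection stub, like cpu_memcpy above. */
static copy_fn cpu_memcpy = switch_memcpy;

/* Placeholder: a real version would query the CPU (e.g. via cpuid). */
static int cpu_has_wide_sse(void) { return 0; }

/* Stand-ins for the specialised routines. */
static void *sse_memcpy(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

static void *fallback_memcpy(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Runs on the first call only: pick a routine, remember it, then use it. */
static void *switch_memcpy(void *dst, const void *src, size_t n)
{
    cpu_memcpy = cpu_has_wide_sse() ? sse_memcpy : fallback_memcpy;
    return cpu_memcpy(dst, src, n);
}

/* Public entry point: always goes through the pointer. */
void *my_memcpy(void *dst, const void *src, size_t n)
{
    return cpu_memcpy(dst, src, n);
}
```

After the first call the pointer no longer refers to the detection stub, so later calls pay only for one indirect call.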
    The thing about your version is that I don't see why you would write that in Assembly. It looks like a direct translation of some very straight-forward C code, except that the compiler would probably have some scheduling tricks up its sleeve.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  3. #3
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Code:
    .data
    ;No data defined
    So don't put a .data in there - if it's empty, why do you need one?

    Code:
    	mov		ebx, 4
    	div		ebx
    Absolutely don't use DIV to divide by 4. Use a shift right by 2 instead. The DIV sequence will take longer than about 20 ordinary instructions.

    Use an and with 3 to figure out the remainder.
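In C terms the shift and the mask look like this (a small sketch; the helper names are made up for illustration). For an unsigned count, shifting right by 2 gives the same result as dividing by 4, and masking with 3 gives the same result as taking the remainder.

```c
#include <stddef.h>

/* Number of whole 4-byte words in count: same as count / 4. */
size_t dword_count(size_t count)
{
    return count >> 2;
}

/* Bytes left over after the whole words: same as count % 4. */
size_t trailing_bytes(size_t count)
{
    return count & 3;
}
```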

    Of course, as CornedBee points out, a plain set of REP STOSD + REP STOSB would most likely beat your function by a bit. With a little bit of math + the STD instruction, you can do the backwards version too.

    If you insist on not using the processor instructions above, perhaps you'd want to unroll the loop so that you do more bytes per loop iteration.
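As a rough illustration of unrolling, here is a C sketch that moves four 32-bit words per iteration. The function name and the unroll factor of four are arbitrary choices for the example, and the buffers are assumed not to overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Copies n 32-bit words from src to dst, four words per loop iteration.
   The buffers are assumed not to overlap. */
void copy_dwords_unrolled(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;

    /* Main loop: 16 bytes per iteration, so the loop overhead
       (counter update, compare, branch) is paid once per four words. */
    for (; i + 4 <= n; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }

    /* Up to three words are left over. */
    for (; i < n; i++)
        dst[i] = src[i];
}
```

Whether this actually beats the simple loop depends on the CPU and on what the compiler already does, so it is worth measuring.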

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  4. #4
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,893
    Quote Originally Posted by matsp
    If you insist on not using the processor instructions above, perhaps you'd want to unroll the loop so that you do more bytes per loop iteration.
    Another thing the compiler is probably better at.

  5. #5
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by CornedBee
    Another thing the compiler is probably better at.
    Agreed. Although in all my attempts to write memcpy()-type functions, I have beaten the compiler - but usually because I know more about what I'm doing than C allows me to tell the compiler [and I spent quite some time on it, of course].

    --
    Mats

  6. #6
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,893
    Quote Originally Posted by matsp
    functions, I have beaten the compiler
    Out of curiosity, which compiler? I've heard recently that GCC is supposed to be bad at memcpy prior to 4.3.

  7. #7
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by CornedBee
    Out of curiosity, which compiler? I've heard recently that GCC is supposed to be bad at memcpy prior to 4.3.
    Several compilers: gcc, Visual Studio/earlier MS compilers, embedded compilers, etc. [I also worked on the memcpy() for glibc on x86_64, although I think my improvements never made it into the official release for some reason - I'd have to ask AJ for the reasons, if he remembers that far back].

    --
    Mats

  8. #8
    System Novice siavoshkc's Avatar
    Join Date
    Jan 2006
    Location
    Tehran
    Posts
    1,231
    As you may know MSVC uses assembly to do memcpy().

    [edit] What is the 'U' and 'V' in comments of assembly code?

  9. #9
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,893
    Quote Originally Posted by siavoshkc
    As you may know MSVC uses assembly to do memcpy().
    Yes. You'll typically implement memcpy in assembly, but it will also typically be quite complex.

    Quote Originally Posted by siavoshkc
    What is the 'U' and 'V' in comments of assembly code?
    Huh?

  10. #10
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by siavoshkc
    As you may know MSVC uses assembly to do memcpy().
    Yes, in fact it is an "intrinsic" in most circumstances, meaning that not only is it implemented in assembler, but the compiler itself knows how to generate a memcpy() from its arguments without calling a function at all. There are possibly conditions where the C library memcpy() is called, and yes, that would be written in highly optimized assembler: memcpy() is one of several functions that need to be as fast as they can be, since applications often use memcpy() in performance-critical sections of code [including benchmarks].

    I have no idea what U and V are referring to in comments - I may know if you post an example.

    --
    Mats

  11. #11
    System Novice siavoshkc's Avatar
    Join Date
    Jan 2006
    Location
    Tehran
    Posts
    1,231
    Quote Originally Posted by matsp
    Of course, as CornedBee points out, a plain set of REP STOSD + REP STOSB would most likely beat your function by a bit. With a little bit of math + the STD instruction, you can do the backwards version too.
    Can you give me an example using STOSD, please?

    Quote Originally Posted by matsp
    I have no idea what U and V are referring to in comments - I may know if you post an example.
    A part of the VC memcpy():
    Code:
    LeadUp1:
            and     edx,ecx         ;U - trailing byte count
            mov     al,[esi]        ;V - get first byte from source
    
            mov     [edi],al        ;U - write first byte to destination
            mov     al,[esi+1]      ;V - get second byte from source
    
            mov     [edi+1],al      ;U - write second byte to destination
            mov     al,[esi+2]      ;V - get third byte from source
    
            shr     ecx,2           ;U - shift down to dword count
            mov     [edi+2],al      ;V - write third byte to destination
    
            add     esi,3           ;U - advance source pointer
            add     edi,3           ;V - advance destination pointer
    
            cmp     ecx,8           ;U - test if small enough for unwind copy
            jb      short CopyUnwindUp ;V - if so, then jump
    
            rep     movsd           ;N - move all of our dwords
    
            jmp     dword ptr TrailUpVec[edx*4] ;N - process trailing bytes
    
            align   @WordSize

  12. #12
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    I _think_ those refer to an Intel design - not sure which one. It had V and U "units" in the processor, and the comments are there to say which unit would execute those instructions - alternating between them would improve performance by eliminating stalls. More modern processor designs have out-of-order and speculative execution, so it's not necessarily meaningful on the latest models of processors.

    When I said STOSD/STOSB, I actually meant MOVSD/MOVSB, which is what the memcpy segment you copied uses. STOS[B,D] is for storing constant values (so memset and similar functions).

    --
    Mats

  13. #13
    Super Moderator VirtualAce's Avatar
    Join Date
    Aug 2001
    Posts
    9,590
    U and V are basically integer pipelines inside of the CPU. There was a huge site on the internet devoted to how you could gain performance by optimizing your assembly so that it could use U and V pipelines at the same time. Not sure if it is still up or not.

    You can write code that actually forces the CPU to only use U or V which means, in theory, you are getting half the performance. Most compilers should be optimized now to take advantage of the U and V pipelines.

    Quite surprised to see this, as the U and V pipelines were introduced quite a while ago. Don't know if AMD is identical to Intel in this regard.

  14. #14
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,893
    Yes, AMD has a superscalar design, too.
    (Eh, I think that's what superscalar means, anyway.)

  15. #15
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by CornedBee
    Yes, AMD has a superscalar design, too.
    (Eh, I think that's what superscalar means, anyway.)
    AMD and Intel processors have quite different internal designs, however. In general, code that runs OK on Intel runs better on AMD processors [up until Core2 at least], whilst Intel often requires a "good compiler" to solve the problems. Perhaps this is coupled with the fact that Intel produces a compiler and AMD does not - so AMD fixes the processor to do a better job with "average" code.

    I believe both processors are now capable of using any pipeline for any instruction with a few exceptions (some instructions take a lot of chip real-estate to implement, and if they are also relatively unusual instructions, it makes sense to only have one copy of that piece of hardware - a bit like a compiler would only inline two copies of a function if it's small enough).

    --
    Mats
