Thread: My mem_n_cpy() in Assembly

  1. #1 siavoshkc (System Novice) | Join Date: Jan 2006 | Location: Tehran | Posts: 1,246

    My mem_n_cpy() in Assembly

    I wrote it in assembly:
    Code:
    .486
    .model flat, c
    .data
    ;No data defined
    .code
    
    
    mem_n_cpy PROC
    ; edx	remainder
    ; eax	quotient
    ; esi	src
    ; edi	dst
    ; ecx	acts as a copy buffer
    ; ebx	temp
    	
    	push	ebp
    	mov		ebp, esp
    	push	edi
    	push	esi
    	push	ebx				;ebx is callee-saved in cdecl, so preserve it
    
    	mov		eax, [ebp+16]	;Count parameter
    	
    	cmp		eax, 0	;
    	jle		RETURN	;Count is zero or negative, invalid parameter
    	
    	mov		edi, [ebp+8]	;dst parameter
    	mov		esi, [ebp+12]	;src parameter
    	
    	xor		edx, edx
    	mov		ebx, 4
    	div		ebx
    	
    	;Check to see if src is overlapped on dst or not
    	lea		ebx, [edx + 4*eax]
    	mov		ecx, edi 
    	sub		ecx, esi	
    	add		ecx, ebx
    	cmp		ecx, ebx
    	jg		REVERSE		;dst is above src - copy backwards to be safe
    	
    
    	test	eax, eax	;No whole dwords to copy?
    	je		QDONE
    QLOOP:
    	mov		ecx, [esi]
    	mov		[edi], ecx
    	add		esi, 4
    	add		edi, 4
    	dec		eax
    	jne		QLOOP
    	
    QDONE:
    	test	edx, edx
    	je		RETURN
    	
    RLOOP:
    	dec		edx
    	mov		cl, BYTE PTR [esi]
    	mov		BYTE PTR [edi], cl
    	inc		esi
    	inc		edi
    	test	edx, edx
    	jne		RLOOP	
    	jmp		RETURN
    	
    REVERSE:
    	lea		ebx, [edx + 4*eax]
    	add		esi, ebx
    	add		edi, ebx
    	
    	test	eax, eax	;No whole dwords to copy?
    	je		RQDONE
    RQLOOP:
    	sub		esi, 4
    	sub		edi, 4
    	mov		ecx, [esi]
    	mov		[edi], ecx
    	dec		eax
    	jne		RQLOOP
    	
    RQDONE:
    	test	edx, edx
    	je		RETURN
    
    RRLOOP:
    	dec		edx
    	dec		esi
    	dec		edi
    	mov		cl, BYTE PTR [esi]
    	mov		BYTE PTR [edi], cl
    	test	edx, edx
    	jne		RRLOOP
    	
    RETURN:
    	pop		ebx
    	pop		esi
    	pop		edi
    	mov		esp, ebp
    	pop		ebp
    	ret
    
    mem_n_cpy ENDP
    END
    I'd appreciate your ideas about it.
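    For anyone who wants to try it, a minimal cdecl test call might look like the sketch below (the buffer names and the dst, src, count argument order are just assumptions based on the stack offsets above):
    Code:
    .data
    src_buf	BYTE 64 DUP(0AAh)	;hypothetical source buffer
    dst_buf	BYTE 64 DUP(0)		;hypothetical destination buffer
    
    .code
    test_copy PROC
    	push	64				;count (arguments pushed right to left)
    	push	OFFSET src_buf	;src
    	push	OFFSET dst_buf	;dst
    	call	mem_n_cpy
    	add		esp, 12			;caller removes the arguments (cdecl)
    	ret
    test_copy ENDP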

  2. #2 CornedBee (Cat without Hat) | Join Date: Apr 2003 | Posts: 8,895
    I imagine that even a simple REP STOSD would be faster than your main loop.

    Memory copy routines are subject to extreme optimization, and to peculiar differences between individual CPUs. Generally, a good memcpy detects the CPU and the alignment of the source and destination addresses, and chooses a copy method accordingly.

    For example, the Core2 CPUs have a 128-bit data channel to the XMM registers, so for 16-byte-aligned storage the fastest copying method is to use the SSE load/store instructions. On the other hand, if the storage is not aligned, those methods don't work. If the misalignment is the same for source and destination, you can copy the start and end by slower methods and push the middle through the fast way.
    The AMD K8 series, however, has only a 64-bit data channel to the XMM registers, making the SSE copy a lot slower - slower than other methods, anyway.
    In general, you should consult the vendor's optimization manual in order to find out how to implement memcpy.
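    For illustration only, the aligned Core2 case might look roughly like this - a sketch, not tuned code, assuming ESI = src, EDI = dst, ECX = byte count, both pointers 16-byte aligned, ECX a non-zero multiple of 64, and .686/.xmm enabled:
    Code:
    SSE_ALIGNED_LOOP:
    	movdqa	xmm0, [esi]			; load 64 bytes from the source...
    	movdqa	xmm1, [esi+16]
    	movdqa	xmm2, [esi+32]
    	movdqa	xmm3, [esi+48]
    	movdqa	[edi], xmm0			; ...and store them to the destination
    	movdqa	[edi+16], xmm1
    	movdqa	[edi+32], xmm2
    	movdqa	[edi+48], xmm3
    	add		esi, 64
    	add		edi, 64
    	sub		ecx, 64
    	jnz		SSE_ALIGNED_LOOP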

    As for CPU detection, remember that you only need to do that once.
    Code:
    .data
    cpu_memcpy DWORD switch_memcpy	; pointer to the routine to use; starts as the detection stub
    .code
    
    my_memcpy PROC
      ; prologue stuff here
      mov eax, [cpu_memcpy]
      ; jump to the specific routine
      jmp eax
    EPILOGUE::            ; double colon so the label is visible outside the PROC
      ; function epilogue here
      ret
    my_memcpy ENDP
    
    switch_memcpy:
      ; detect CPU here
      ; e.g. with the cpuid instruction
      ; detected 128-bit data bus CPU:
      mov [cpu_memcpy], OFFSET sse_memcpy
      ; etc.
      mov eax, [cpu_memcpy]
      jmp eax
    
    sse_memcpy:
      ; test alignment
      ; jump to different routine if it's wrong
      ; do the SSE stuff here
      jmp EPILOGUE
    
    stosd_memcpy:
      ; a possible fallback
    The thing about your version is that I don't see why you would write it in assembly. It looks like a direct translation of some very straightforward C code, except that the compiler would probably have some scheduling tricks up its sleeve.

  3. #3 siavoshkc (System Novice) | Join Date: Jan 2006 | Location: Tehran | Posts: 1,246
    As you may know, MSVC uses assembly to implement memcpy().

    [edit] What are the 'U' and 'V' in the comments of the assembly code?

  4. #4 CornedBee (Cat without Hat) | Join Date: Apr 2003 | Posts: 8,895
    Quote Originally Posted by siavoshkc View Post
    As you may know, MSVC uses assembly to implement memcpy().
    Yes. You'll typically implement memcpy in assembly, but it will typically also be quite complex.

    What are the 'U' and 'V' in the comments of the assembly code?
    Huh?

  5. #5 Kernel hacker | Join Date: Jul 2007 | Location: Farncombe, Surrey, England | Posts: 15,677
    Quote Originally Posted by siavoshkc View Post
    As you may know, MSVC uses assembly to implement memcpy().
    Yes, in fact it is an "intrinsic" in most circumstances, meaning that not only is it implemented in assembler, but the compiler actually knows by itself how to form a memcpy() from its arguments, without calling a function at all. There are probably conditions where the C library memcpy() is called, and yes, that would be written in highly optimized assembler, since memcpy() is one of several functions that need to be as fast as they can be - applications often use memcpy() in performance-critical sections of code [including benchmarks].
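    For example, for a small copy whose size is known at compile time, the intrinsic form typically turns into a couple of plain moves rather than a call - something like this sketch (not actual MSVC output; the register choices are arbitrary):
    Code:
    	; memcpy(dst, src, 8) with dst in EDI and src in ESI
    	mov		eax, [esi]		; copy the first four bytes
    	mov		[edi], eax
    	mov		eax, [esi+4]	; copy the last four bytes
    	mov		[edi+4], eax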

    I have no idea what U and V refer to in the comments - I might know if you post an example.

    --
    Mats

  6. #6 siavoshkc (System Novice) | Join Date: Jan 2006 | Location: Tehran | Posts: 1,246
    Of course, as CornedBee points out, a plain set of REP STOSD + REP STOSB would most likely beat your function by a bit. With a little bit of math + the STD instruction, you can do the backwards version too.
    Can you give me an example using STOSD, please?

    I have no idea what U and V are referring to in comments - I may know if you post an example.
    A part of the VC memcpy():
    Code:
    LeadUp1:
            and     edx,ecx         ;U - trailing byte count
            mov     al,[esi]        ;V - get first byte from source
    
            mov     [edi],al        ;U - write first byte to destination
            mov     al,[esi+1]      ;V - get second byte from source
    
            mov     [edi+1],al      ;U - write second byte to destination
            mov     al,[esi+2]      ;V - get third byte from source
    
            shr     ecx,2           ;U - shift down to dword count
            mov     [edi+2],al      ;V - write third byte to destination
    
            add     esi,3           ;U - advance source pointer
            add     edi,3           ;V - advance destination pointer
    
            cmp     ecx,8           ;U - test if small enough for unwind copy
            jb      short CopyUnwindUp ;V - if so, then jump
    
            rep     movsd           ;N - move all of our dwords
    
            jmp     dword ptr TrailUpVec[edx*4] ;N - process trailing bytes
    
            align   @WordSize

  7. #7 Kernel hacker | Join Date: Jul 2007 | Location: Farncombe, Surrey, England | Posts: 15,677
    I _think_ those refer to an Intel design - I'm not sure which one. It had U and V "units" in the processor, and the comments are there to say which unit would execute those instructions - alternating between them would improve performance by eliminating stalls. More modern processor designs have out-of-order and speculative execution, so it's not necessarily meaningful on the latest models of processors.

    When I said STOSD/STOSB, I actually meant MOVSD/MOVSB, which is what the memcpy segment you copied uses. STOS[B,D] is for storing constant values (so memset and similar functions).
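    To answer the example request with the corrected instruction, a minimal forward copy with REP MOVSD/MOVSB might look like this sketch (assuming ESI = src, EDI = dst, ECX = byte count, and no overlap handling):
    Code:
    	cld						; make the string instructions count upwards
    	mov		edx, ecx		; remember the byte count
    	shr		ecx, 2			; number of whole dwords
    	rep		movsd			; copy the dwords
    	mov		ecx, edx
    	and		ecx, 3			; 0..3 trailing bytes
    	rep		movsb			; copy the remainder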

    --
    Mats

  8. #8 VirtualAce (Registered User) | Join Date: Aug 2001 | Posts: 9,607
    U and V are basically integer pipelines inside the CPU. There was a huge site on the internet devoted to how you could gain performance by optimizing your assembly so that it could use the U and V pipelines at the same time. Not sure if it is still up or not.

    You can write code that actually forces the CPU to only use U or V, which means, in theory, you are getting half the performance. Most compilers should be optimized by now to take advantage of the U and V pipelines.

    Quite surprised to see this, as the U and V pipelines were introduced quite a while ago. Don't know if AMD is identical to Intel in this regard.
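    To illustrate the pairing idea, here is a tiny sketch (the real pairing rules are more involved than this):
    Code:
    	; These two instructions are independent, so they can pair:
    	; one issues in the U pipe, the other in the V pipe, in the same cycle.
    	mov		eax, [esi]
    	mov		ebx, [esi+4]
    
    	; These two cannot pair, because the second needs the result of the first.
    	mov		eax, [esi]
    	add		eax, ecx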

  9. #9 CornedBee (Cat without Hat) | Join Date: Apr 2003 | Posts: 8,895
    Yes, AMD has a superscalar design, too.
    (Eh, I think that's what superscalar means, anyway.)

  10. #10 Kernel hacker | Join Date: Jul 2007 | Location: Farncombe, Surrey, England | Posts: 15,677
    Quote Originally Posted by CornedBee View Post
    Yes, AMD has a superscalar design, too.
    (Eh, I think that's what superscalar means, anyway.)
    AMD and Intel processors have quite different internal designs, however. In general, code that runs OK on Intel runs better on AMD processors [up until Core2 at least], whilst Intel often requires a "good compiler" to solve the problems. Perhaps this is coupled with the fact that Intel produces a compiler and AMD does not - so AMD fixes the processor to do a better job with "average" code.

    I believe both processors are now capable of using any pipeline for any instruction, with a few exceptions (some instructions take a lot of chip real estate to implement, and if they are also relatively unusual instructions, it makes sense to have only one copy of that piece of hardware - a bit like a compiler only inlining a function if it is small enough).

    --
    Mats

  11. #11 abachler (Malum in se) | Join Date: Apr 2007 | Posts: 3,195
    Quote Originally Posted by CornedBee View Post
    Yes, AMD has a superscalar design, too.
    (Eh, I think that's what superscalar means, anyway.)
    http://en.wikipedia.org/wiki/Superscalar

    Yes, all modern processors are superscalar, but to a greater or lesser degree.

    I can't believe we are still stuck on 64-bit external data buses; we should be up to 512 by now, especially with the large pin-count packaging now available.

  12. #12 Kernel hacker | Join Date: Jul 2007 | Location: Farncombe, Surrey, England | Posts: 15,677
    Quote Originally Posted by abachler View Post
    http://en.wikipedia.org/wiki/Superscalar

    Yes, all modern processors are superscalar, but to a greater or lesser degree.

    I can't believe we are still stuck on 64-bit external data buses; we should be up to 512 by now, especially with the large pin-count packaging now available.
    First of all, Athlon64[1] and Pentium4/Core2 processors have a 128-bit wide bus. Making the bus wider improves the throughput for sequential access, but unfortunately a lot of applications don't spend that much time doing sequential access; they read and write all over the place. With modern CPUs having large caches and clever prefetching, it is actually quite hard to write an application that stalls due to memory throughput - that is, the processor can't fill the 128-bit memory bus today [for sequential access].

    The most likely cause of memory throughput problems is stupid programs that do memcpy()/memset() (or similar) operations on large chunks of memory and don't use non-temporal stores - this means that the processor keeps switching between writing and reading for a given piece of memory.

    Edit: [1] The first few models of Athlon64 did not have a 128-bit bus, but all socket-939 and newer processors do.
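    For illustration, a non-temporal copy loop might look like this sketch (assuming SSE2, ESI = src, EDI = dst, ECX = byte count, 16-byte-aligned pointers, and ECX a non-zero multiple of 16):
    Code:
    NT_LOOP:
    	movdqa	xmm0, [esi]		; normal (cached) load from the source
    	movntdq	[edi], xmm0		; non-temporal store - bypasses the cache
    	add		esi, 16
    	add		edi, 16
    	sub		ecx, 16
    	jnz		NT_LOOP
    	sfence					; make the non-temporal stores globally visible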

    --
    Mats

  13. #13 siavoshkc (System Novice) | Join Date: Jan 2006 | Location: Tehran | Posts: 1,246
    In dual-core processors, the 128-bit bus is divided into two 64-bit buses, one for each core.

  14. #14 Kernel hacker | Join Date: Jul 2007 | Location: Farncombe, Surrey, England | Posts: 15,677
    Quote Originally Posted by siavoshkc View Post
    In dual-core processors, the 128-bit bus is divided into two 64-bit buses, one for each core.
    No, it's not. The bus is still 128 bits wide, but it is shared via an arbiter to allow both cores to access memory. Any given memory access is EITHER core 0 or core 1, but it's ONE SINGLE bus.

    --
    Mats

  15. #15 siavoshkc (System Novice) | Join Date: Jan 2006 | Location: Tehran | Posts: 1,246
    No, it's not. The bus is still 128 bits wide, but it is shared via an arbiter to allow both cores to access memory. Any given memory access is EITHER core 0 or core 1, but it's ONE SINGLE bus.
    Maybe that's true for the new ones, but in the early AMD dual cores I am sure it was 2x64-bit. It wasn't true for the FX ones, though.

    [edit]
    Here is a thread about it: http://www.pcguide.com/vb/showthread.php?t=40148
