Thread: Inline ASM help? (output variables / instruction speed)

  1. #1
    Registered User
    Join Date
    Dec 2011
    Posts
    795

    My code:

    Code:
    void str_replace(char *output, const char *input, char find, char rep)
    {
        asm volatile (
                      "str_rep_start:"
                      "lodsb\n"
                      "cmpb %%al, %%bl\n"
                      "jne str_rep_sto\n"
                      "movb %%cl, %%al\n"
                      
                      "str_rep_sto: "
                      "stosb\n"
                      "cmpb $0, %%al\n"
                      "jnz str_rep_start"
                      
                      : : "S" (input), "D" (output), "b" (find), "c" (rep)
                      : "memory"
                      );
    }
    Strcpy:

    Code:
    char *strcpy(char *dest, const char *src)
    {
        int d0, d1, d2;
        asm volatile("1:\tlodsb\n\t"
            "stosb\n\t"
            "testb %%al, %%al\n\t"
            "jne 1b"
            : "=&S" (d0), "=&D" (d1), "=&a" (d2)
            : "0" (src), "1" (dest) : "memory");
        return dest;
    }
    First, why is testb used instead of cmp to zero? Is it faster?

    Also, why do they have three extra ints that accept registers? What does this do, given that my code works just fine without them.

  2. #2
    Registered User
    Join Date
    Dec 2011
    Posts
    795
    Already second page with no replies? -.- bumped

  3. #3
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    After 238 posts, you ought to know that bumping is against the forum guidelines. I read your thread this morning, but didn't get around to investigating or answering until just now (stupid working for a living). Also, this is probably better suited for the tech board or Linux programming, since it's specific to a particular compiler's (GCC's?) inline assembler feature, not standard C. That aside...

    > First, why is testb used instead of cmp to zero? Is it faster?
    It's probably faster. According to this site, test uses bitwise-AND while cmp uses subtraction. AND can be done very quickly, on all bits in parallel. Subtraction has to account for borrowing, which gives it a sense of bitwise serial-ness.

    > Also, why do they have three extra ints that accept registers? What does this do, given that my code works just fine without them.
    No clue really, since I've never really messed with GCC's inline assembly. Another inline ASM thread cropped up last week, and it used the extra int variables as well. I saw it in many online examples, but I'm not sure if that's an instance of everybody copying a bad example to begin with.
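
To make the testb/cmpb comparison concrete, here is a minimal sketch, assuming GCC or Clang on x86 (the helper names zero_via_test and zero_via_cmp are made up for illustration). It captures the zero flag after each instruction. For a zero check both set the flag on exactly the same inputs; the sketch says nothing about which one is faster.

```c
#include <stdbool.h>

/* ZF after `testb v, v`: set exactly when (v & v) == 0. */
static bool zero_via_test(unsigned char v) {
    unsigned char zf;
    __asm__ ("testb %1, %1\n\t"
             "sete %0"            /* copy ZF into a byte register */
             : "=q" (zf)
             : "q" (v)
             : "cc");
    return zf != 0;
}

/* ZF after `cmpb $0, v`: set exactly when (v - 0) == 0. */
static bool zero_via_cmp(unsigned char v) {
    unsigned char zf;
    __asm__ ("cmpb $0, %1\n\t"
             "sete %0"
             : "=q" (zf)
             : "q" (v)
             : "cc");
    return zf != 0;
}
```

Calling both helpers on every value from 0 to 255 gives the same answer each time, so the choice between the two instructions is about encoding and timing rather than behavior.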

  4. #4
    Registered User
    Join Date
    Dec 2011
    Posts
    795
    > After 238 posts, you ought to know that bumping is against the forum guidelines
    Right, sorry.

    > According to this site, test uses bitwise-AND while cmp uses subtraction.
    Bookmarked that site, thanks.

    @register thingy:
    I just did a rdtscl() test, and turns out that using registers yields 288 cycles, whereas omitting them takes 360. Any idea why?

  5. #5
    Registered User
    Join Date
    Oct 2008
    Location
    TX
    Posts
    2,059
    Originally posted by memcpy:
    > I just did a rdtscl() test, and turns out that using registers yields 288 cycles, whereas omitting them takes 360. Any idea why?
    No idea what rdtscl() does but omitting registers means reading 'n writing to memory which is way slower than register access, hence the extra clock cycles.

  6. #6
    oogabooga
    Join Date
    Jan 2008
    Posts
    2,808
    > No idea what rdtscl() does
    rdtsc => read timestamp counter
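
As a rough illustration of what reading the timestamp counter involves: rdtsc leaves the low 32 bits of the counter in EAX and the high 32 bits in EDX. The sketch below assumes GCC or Clang on x86; read_tsc and combine_tsc are hypothetical helper names, not part of any library.

```c
#include <stdint.h>

/* Join the two 32-bit halves that rdtsc returns into one 64-bit count. */
static inline uint64_t combine_tsc(uint32_t low, uint32_t high) {
    return ((uint64_t)high << 32) | low;
}

/* Read the timestamp counter: EAX receives the low half, EDX the high half. */
static inline uint64_t read_tsc(void) {
    uint32_t low, high;
    __asm__ volatile ("rdtsc" : "=a" (low), "=d" (high));
    return combine_tsc(low, high);
}
```

Returning the full 64-bit value avoids the wraparound that a 32-bit return type would hit after 2^32 ticks.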

    > I just did a rdtscl() test, and turns out that using registers yields 288 cycles, whereas omitting them takes 360. Any idea why?
    What exactly do you mean by "using registers"? I'm assuming that you know registers are faster than memory, so by "using registers" do you mean this:
    Code:
    char *strcpy(char *dest, const char *src) {
        int d0, d1, d2;
        asm volatile("1:\tlodsb\n\t"
            "stosb\n\t"
            "testb %%al, %%al\n\t"
            "jne 1b"
            : "=&S" (d0), "=&D" (d1), "=&a" (d2)
            : "0" (src), "1" (dest)
            : "memory");
        return dest;
    }
    as opposed to this:
    Code:
    char *strcpy(char *dest, const char *src) {
        asm volatile("1:\tlodsb\n\t"
            "stosb\n\t"
            "testb %%al, %%al\n\t"
            "jne 1b"
            :
            : "S" (src), "D" (dest)
            : "eax", "memory");
        return dest;
    }
    because it would seem strange if the first was the faster one, but then again, that would explain the ints (if the "copying a bad example" explanation is incorrect).

    At any rate, a disassembly of the relevant code should show what's up.
    The cost of software maintenance increases with the square of the programmer's creativity. - Robert D. Bliss

  7. #7
    Registered User
    Join Date
    Dec 2011
    Posts
    795
    Yes, I was vague, but I was referring to the temporary-variable example. It truly is faster, and consistently so. Here's a disassembly of both strcpy()s:

    No temp variable:
    Code:
    0x0000000100000eb9 <nstr+0>:	push   %rbp
    0x0000000100000eba <nstr+1>:	mov    %rsp,%rbp
    0x0000000100000ebd <nstr+4>:	mov    %rdi,-0x8(%rbp)
    0x0000000100000ec1 <nstr+8>:	mov    %rsi,-0x10(%rbp)
    0x0000000100000ec5 <nstr+12>:	mov    -0x10(%rbp),%rsi
    0x0000000100000ec9 <nstr+16>:	mov    -0x8(%rbp),%rdi
    0x0000000100000ecd <nstr+20>:	lods   %ds:(%rsi),%al
    0x0000000100000ece <nstr+21>:	stos   %al,%es:(%rdi)
    0x0000000100000ecf <nstr+22>:	test   %al,%al
    0x0000000100000ed1 <nstr+24>:	jne    0x100000ecd <nstr+20>
    0x0000000100000ed3 <nstr+26>:	mov    -0x8(%rbp),%rax
    0x0000000100000ed7 <nstr+30>:	leaveq 
    0x0000000100000ed8 <nstr+31>:	retq
    Temp variable:
    Code:
    0x0000000100000e90 <rstr+0>:	push   %rbp
    0x0000000100000e91 <rstr+1>:	mov    %rsp,%rbp
    0x0000000100000e94 <rstr+4>:	mov    %rdi,-0x18(%rbp)
    0x0000000100000e98 <rstr+8>:	mov    %rsi,-0x20(%rbp)
    0x0000000100000e9c <rstr+12>:	mov    -0x20(%rbp),%rsi
    0x0000000100000ea0 <rstr+16>:	mov    -0x18(%rbp),%rdi
    0x0000000100000ea4 <rstr+20>:	lods   %ds:(%rsi),%al
    0x0000000100000ea5 <rstr+21>:	stos   %al,%es:(%rdi)
    0x0000000100000ea6 <rstr+22>:	test   %al,%al
    0x0000000100000ea8 <rstr+24>:	jne    0x100000ea4 <rstr+20>
    0x0000000100000eaa <rstr+26>:	mov    %esi,-0x4(%rbp)
    0x0000000100000ead <rstr+29>:	mov    %edi,-0x8(%rbp)
    0x0000000100000eb0 <rstr+32>:	mov    %eax,-0xc(%rbp)
    0x0000000100000eb3 <rstr+35>:	mov    -0x18(%rbp),%rax
    0x0000000100000eb7 <rstr+39>:	leaveq 
    0x0000000100000eb8 <rstr+40>:	retq
    What? More instructions? Anyone wanna enlighten me here?

  8. #8
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    Originally posted by memcpy:
    > What? More instructions? Anyone wanna enlighten me here?
    This is all speculative, but I think your answer lies here:
    Code:
            : "=&S" (d0), "=&D" (d1), "=&a" (d2)
            : "0" (src), "1" (dest)
            : "memory");
    That first line sets d0 to %esi, d1 to %edi and d2 to %eax. The second line says src is also in %esi, and dest is in %edi (the "0" and "1" refer to the 0th and 1st operands from the line above). The last line says that your assembly may clobber memory in an unpredictable way. I think that means that d0, d1 and d2 (the values in those spots in memory) are not guaranteed to be correct after executing the instructions in your assembler template. Thus, we must restore them at the end. That produces the 3 extra lines below:
    Code:
    0x0000000100000eaa <rstr+26>:    mov    %esi,-0x4(%rbp)    // restore d0 from %esi
    0x0000000100000ead <rstr+29>:    mov    %edi,-0x8(%rbp)    // restore d1 from %edi
    0x0000000100000eb0 <rstr+32>:    mov    %eax,-0xc(%rbp)    // restore d2 from %eax
    Everything else is the same in those two versions (except the relative offsets from the base pointer %rbp). Try taking "memory" out of the clobber list and see what happens.

    On a related note, the following baffles me (it happens in both versions):
    Code:
    0x0000000100000e94 <rstr+4>:    mov    %rdi,-0x18(%rbp)
    0x0000000100000e98 <rstr+8>:    mov    %rsi,-0x20(%rbp)
    0x0000000100000e9c <rstr+12>:    mov    -0x20(%rbp),%rsi
    0x0000000100000ea0 <rstr+16>:    mov    -0x18(%rbp),%rdi
    It seems that moves the values in %rdi and %rsi into dest and src, then moves them right back. This seems backwards (shouldn't it move them from src/dest into registers first?) and pointless. Am I missing something obvious here?
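
The "0" and "1" constraints discussed above are GCC's matching constraints: a digit tells the compiler to place that input in the same location as the output operand with that number. A minimal sketch, assuming GCC or Clang on x86 (add_five is a made-up example function):

```c
/* "0" ties the input x to operand 0 (result), so both share one register.
 * The add then updates that register in place. */
static int add_five(int x) {
    int result;
    __asm__ ("addl $5, %0"
             : "=r" (result)   /* operand 0: output, any general register */
             : "0" (x)         /* input: same register as operand 0 */
             : "cc");          /* addl changes the flags */
    return result;
}
```

In the strcpy() code the same mechanism puts src into the register bound to d0 (%esi) and dest into the register bound to d1 (%edi).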

  9. #9
    oogabooga
    Join Date
    Jan 2008
    Posts
    2,808
    gcc knows that the locals are being affected because they're mentioned in the output list. "memory" is in the clobber list because the strcpy operation is writing to memory pointed to by dest. That of course does not invalidate your idea as to why they're written back at the end.
    > It seems that moves the values in %rdi and %rsi into dest and src, then moves them right back. This seems backwards (shouldn't it move them from src/dest into registers first?)
    Is it possible that the input params are passed in rdi and rsi? We need to see the disassembly of the call. It would be interesting to see an optimized version as well.
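
On why the dummy outputs exist at all: GCC's documentation says an asm must not modify input-only operands, and lodsb/stosb advance the S and D registers, so those registers have to be declared as outputs somehow. The three ints are one way to do that. A sketch of an alternative, assuming GCC or Clang on x86 or x86-64, uses read-write ("+") pointer operands instead; copy_string is a hypothetical name chosen so it does not clash with the library strcpy().

```c
/* Byte-by-byte copy with lodsb/stosb. The local pointer copies are declared
 * read-write because the instructions advance them; "=&a" reserves the
 * accumulator as a scratch register that is written before the inputs are
 * finished with. */
static char *copy_string(char *dest, const char *src) {
    char *d = dest;
    const char *s = src;
    int scratch;
    __asm__ volatile ("1:\n\t"
                      "lodsb\n\t"              /* al = *s++ */
                      "stosb\n\t"              /* *d++ = al */
                      "testb %%al, %%al\n\t"   /* copied the terminator? */
                      "jne 1b"
                      : "+S" (s), "+D" (d), "=&a" (scratch)
                      :
                      : "memory", "cc");
    return dest;
}
```

Because s and d are full-width pointers rather than ints, any write-back the compiler emits for them is pointer-sized, unlike the 32-bit stores of %esi and %edi seen in the disassembly.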

  10. #10
    brewbuck (Officially An Architect)
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    %rdi and %rsi are naturally used for the first two arguments in the AMD64 calling convention used by Linux. So if you specify that src goes in "S" and dest goes in "D", it should be able to eliminate all the memory variables.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}
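
To illustrate the calling-convention point: under the System V AMD64 ABI the first integer argument arrives in %rdi and the second in %rsi, which are exactly the registers the "D" and "S" constraints name. The sketch below assumes GCC or Clang on x86-64 only; first_plus_second is a made-up function.

```c
/* a arrives in %rdi and b in %rsi under the System V AMD64 ABI, and the
 * "D"/"S" constraints ask for those same registers, so an optimizing build
 * has no reason to shuffle the arguments before the asm runs. */
static long first_plus_second(long a, long b) {
    long sum;
    __asm__ ("leaq (%1,%2), %0"   /* sum = a + b; lea leaves the flags alone */
             : "=r" (sum)
             : "D" (a), "S" (b));
    return sum;
}
```

The store-then-reload pattern in the disassembly earlier in the thread looks like what an unoptimized build produces, where every parameter is first spilled to the stack; compiling with optimization enabled would be the way to check that.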

  11. #11
    Registered User
    Join Date
    Dec 2011
    Posts
    795
    Another bit of oddness:

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    
    unsigned rdtscl(void)
    {
    	unsigned low, high;
    	asm volatile("rdtsc" : "=a" (low), "=d" (high));
    	
    	return (unsigned)((low) | ((unsigned long long)(high) << 32));
    }
    
    
    char *rstr(char *dest, const char *src) {
    	int d0, d1, d2;
    	asm volatile("1:\tlodsb\n\t"
    		     "stosb\n\t"
    		     "testb %%al, %%al\n\t"
    		     "jne 1b"
    		     : "=&S" (d0), "=&D" (d1), "=&a" (d2)
    		     : "0" (src), "1" (dest)
    		     : "memory");
    	return dest;
    }
    
    
    char *nstr(char *dest, const char *src) {
    	asm volatile("1:\tlodsb\n\t"
    		     "stosb\n\t"
    		     "testb %%al, %%al\n\t"
    		     "jne 1b"
    		     :
    		     : "S" (src), "D" (dest)
    		     : );
    	return dest;
    }
    
    
    int main(void)
    { 
    	char *out; 
    	unsigned time;
    	
    	out = malloc(40);
    	
    	time = rdtscl();
    	rstr(out, "abcdefg");
    	printf("rs=%u\n", rdtscl() - time);
    	
    	time = rdtscl();
    	nstr(out, "abcdefg");
    	printf("ns=%u\n", rdtscl() - time);
    	
    	free(out);
    }
    For some reason, the results are inconsistent: sometimes the no-int version is actually faster, sometimes it's slightly slower, and sometimes it's slower by a couple hundred cycles. I don't see how emptying the clobber list changes this, especially because the "gcc -S" output looks relatively similar...

  12. #12
    oogabooga
    Join Date
    Jan 2008
    Posts
    2,808
    Here's a rewrite of your program using some ideas from here about the difficulties of using rdtsc. Basically, they involve the effects of data and code caching (and also "out of order execution", which I didn't compensate for).

    To alleviate the caching effects, each routine is called 3 times, taking the timing only from the last call (if USE_CACHE_WARMING is non-zero).

    I usually get "r172 n168" using cache warming and "r236 n232" without, although occasionally something like "n664" crops up.

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    
    #define USE_CACHE_WARMING  1   /* 0 don't, 1 do */
    
    typedef unsigned long long ull;
    
    unsigned before_low, before_high, after_low, after_high;
    
    
    char *rstr(char *dest, const char *src) {
        int d0, d1, d2;
    
        asm volatile("rdtsc" : "=a" (before_low), "=d" (before_high));
    
        /* volatile: d0-d2 are never read, so a plain asm could be optimized away */
        asm volatile(
            "1: \n"
            "lodsb \n"
            "stosb \n"
            "testb %%al, %%al \n"
            "jne 1b"
    
            : "=&S" (d0), "=&D" (d1), "=&a" (d2)
            : "0" (src), "1" (dest)
            : "memory"
        );
    
        asm volatile("rdtsc" : "=a" (after_low), "=d" (after_high));
    
        return dest;
    }
    
    
    char *nstr(char *dest, const char *src) {
    
        asm volatile("rdtsc" : "=a" (before_low), "=d" (before_high));
    
        asm volatile(
            "1: \n"
            "lodsb \n"
            "stosb \n"
            "testb %%al, %%al \n"
            "jne 1b"
    
            :
            : "S" (src), "D" (dest)
            : "eax", "memory"   /* lodsb overwrites %al */
        );
    
        asm volatile("rdtsc" : "=a" (after_low), "=d" (after_high));
    
        return dest;
    }
    
    
    ull cycles(void) {
        return (ull)(after_low  - before_low)
            + ((ull)(after_high - before_high) << 32);
    }
    
    
    int main(void) {
        char out[40], *str = "abcdefg";
        ull rcycles, ncycles;
    
    #if USE_CACHE_WARMING
        nstr(out, str);
        nstr(out, str);
        nstr(out, str);
        ncycles = cycles();
    
        rstr(out, str);
        rstr(out, str);
        rstr(out, str);
        rcycles = cycles();
    
    #else
        nstr(out, str);
        ncycles = cycles();
    
        rstr(out, str);
        rcycles = cycles();
    
    #endif
    
        printf("r%llu\t", rcycles);
        printf("n%llu\n", ncycles);
    
        return 0;
    }
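
A sketch of a slightly more defensive way to take such timings, offered as one possible approach and not as what the program above does: fence each rdtsc with lfence (a commonly suggested way to limit the out-of-order effects mentioned above) and keep the minimum over many runs so that cold-cache and interrupted runs drop out. It assumes GCC or Clang on x86-64; tsc_fenced, min_cycles, sample_work and sink are hypothetical names.

```c
#include <stdint.h>

/* rdtsc bracketed by lfence so earlier instructions finish before the read
 * and later ones do not start ahead of it. */
static inline uint64_t tsc_fenced(void) {
    uint32_t low, high;
    __asm__ volatile ("lfence\n\t"
                      "rdtsc\n\t"
                      "lfence"
                      : "=a" (low), "=d" (high)
                      :
                      : "memory");
    return ((uint64_t)high << 32) | low;
}

/* Time fn() `runs` times and return the smallest cycle count observed.
 * Returns UINT64_MAX if runs is not positive. */
static uint64_t min_cycles(void (*fn)(void), int runs) {
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t start = tsc_fenced();
        fn();
        uint64_t elapsed = tsc_fenced() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Stand-in workload: copy an 8-byte string (terminator included). */
static volatile char sink[8];
static void sample_work(void) {
    const char *src = "abcdefg";
    for (int i = 0; i < 8; i++)
        sink[i] = src[i];
}
```

A call such as min_cycles(sample_work, 1000) then gives one number per routine. The minimum still includes the overhead of the two fenced reads, so it is best used to compare routines against each other.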

  13. #13
    Registered User
    Join Date
    Oct 2008
    Location
    TX
    Posts
    2,059
    Originally posted by memcpy:
    Yes, I was vague, but I was referring to the temporary-variable example. It truly is faster, and consistently so. Here's a disassembly of both strcpy()s:

    No temp variable:
    Code:
    0x0000000100000eb9 <nstr+0>:	push   %rbp
    0x0000000100000eba <nstr+1>:	mov    %rsp,%rbp
    0x0000000100000ebd <nstr+4>:	mov    %rdi,-0x8(%rbp)
    0x0000000100000ec1 <nstr+8>:	mov    %rsi,-0x10(%rbp)
    0x0000000100000ec5 <nstr+12>:	mov    -0x10(%rbp),%rsi
    0x0000000100000ec9 <nstr+16>:	mov    -0x8(%rbp),%rdi
    0x0000000100000ecd <nstr+20>:	lods   %ds:(%rsi),%al
    0x0000000100000ece <nstr+21>:	stos   %al,%es:(%rdi)
    0x0000000100000ecf <nstr+22>:	test   %al,%al
    0x0000000100000ed1 <nstr+24>:	jne    0x100000ecd <nstr+20>
    0x0000000100000ed3 <nstr+26>:	mov    -0x8(%rbp),%rax
    0x0000000100000ed7 <nstr+30>:	leaveq 
    0x0000000100000ed8 <nstr+31>:	retq
    Temp variable:
    Code:
    0x0000000100000e90 <rstr+0>:	push   %rbp
    0x0000000100000e91 <rstr+1>:	mov    %rsp,%rbp
    0x0000000100000e94 <rstr+4>:	mov    %rdi,-0x18(%rbp)
    0x0000000100000e98 <rstr+8>:	mov    %rsi,-0x20(%rbp)
    0x0000000100000e9c <rstr+12>:	mov    -0x20(%rbp),%rsi
    0x0000000100000ea0 <rstr+16>:	mov    -0x18(%rbp),%rdi
    0x0000000100000ea4 <rstr+20>:	lods   %ds:(%rsi),%al
    0x0000000100000ea5 <rstr+21>:	stos   %al,%es:(%rdi)
    0x0000000100000ea6 <rstr+22>:	test   %al,%al
    0x0000000100000ea8 <rstr+24>:	jne    0x100000ea4 <rstr+20>
    0x0000000100000eaa <rstr+26>:	mov    %esi,-0x4(%rbp)
    0x0000000100000ead <rstr+29>:	mov    %edi,-0x8(%rbp)
    0x0000000100000eb0 <rstr+32>:	mov    %eax,-0xc(%rbp)
    0x0000000100000eb3 <rstr+35>:	mov    -0x18(%rbp),%rax
    0x0000000100000eb7 <rstr+39>:	leaveq 
    0x0000000100000eb8 <rstr+40>:	retq
    What? More instructions? Anyone wanna enlighten me here?
    Don't know which source you disassembled, but if it is the code from your first post, then strcpy() has fewer instructions than str_replace(), even though the former has 3 unused ints.
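
For comparison with the strcpy() variants, here is a sketch of how the str_replace() from the first post could declare its operands in the same style. It assumes GCC or Clang on x86 or x86-64. The name str_replace_sketch and the returned pointer are additions for illustration, and the numeric local labels replace the named ones so the asm can be emitted more than once without duplicate symbols.

```c
/* Copy input to output, replacing every `find` byte with `rep`.
 * Assumes find != '\0' (otherwise the terminator itself is replaced and the
 * loop runs past the end) and rep != '\0' (otherwise the copy stops at the
 * first replacement). */
static char *str_replace_sketch(char *output, const char *input,
                                char find, char rep) {
    char *out = output;
    int scratch;
    __asm__ volatile ("1:\n\t"
                      "lodsb\n\t"              /* al = *input++ */
                      "cmpb %%al, %%bl\n\t"    /* is it the byte to replace? */
                      "jne 2f\n\t"
                      "movb %%cl, %%al\n"      /* yes: substitute rep */
                      "2:\n\t"
                      "stosb\n\t"              /* *out++ = al */
                      "testb %%al, %%al\n\t"   /* stop after the terminator */
                      "jne 1b"
                      : "+S" (input), "+D" (out), "=&a" (scratch)
                      : "b" (find), "c" (rep)
                      : "memory", "cc");
    return output;
}
```

For example, copying "a-b-c" with find '-' and rep '+' leaves "a+b+c" in the output buffer.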
