Inline ASM help? (output variables / instruction speed)

**memcpy** · 02-08-2012

My code:

Code:

void str_replace(char *output, const char *input, char find, char rep)
{
    asm volatile (
                  "str_rep_start:"
                  "lodsb\n"
                  "cmpb %%al, %%bl\n"
                  "jne str_rep_sto\n"
                  "movb %%cl, %%al\n"
                  
                  "str_rep_sto: "
                  "stosb\n"
                  "cmpb $0, %%al\n"
                  "jnz str_rep_start"
                  
                  : : "S" (input), "D" (output), "b" (find), "c" (rep)
                  : "memory"
                  );
}

Strcpy:

Code:

char *strcpy(char *dest, const char *src)
{
    int d0, d1, d2;
    asm volatile("1:\tlodsb\n\t"
        "stosb\n\t"
        "testb %%al, %%al\n\t"
        "jne 1b"
        : "=&S" (d0), "=&D" (d1), "=&a" (d2)
        : "0" (src), "1" (dest) : "memory");
    return dest;
}

First, why is testb used instead of cmp to zero? Is it faster?

Also, why do they have three extra ints that accept registers? What does this do, given that my code works just fine without them.

**memcpy** · 02-08-2012

Already second page with no replies? -.- bumped

**anduril462** · 02-08-2012

After 238 posts, you ought to know that bumping is against the forum guidelines. I read your thread this morning, but didn't get around to investigating or answering until just now (stupid working for a living). Also, this is probably better suited for the tech board or Linux programming, since it's specific to whatever compiler's (GCC's?) inline assembler feature, not standard C. That aside...

> First, why is testb used instead of cmp to zero? Is it faster?
It's probably faster. According to this site, test uses bitwise-AND while cmp uses subtraction. AND can be done very quickly, on all bits in parallel. Subtraction has to account for borrowing, which gives it a sense of bitwise serial-ness.
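Whichever one is faster on a given CPU, the two forms are interchangeable for a zero check: both leave the zero flag set exactly when the byte is zero. Here is a minimal sketch that confirms this, assuming GCC or Clang on x86 (other targets fall back to plain C). The helper names `zf_after_test` and `zf_after_cmp` are made up for illustration.

```c
/* Hypothetical helpers, not from the thread: each returns 1 when the
   zero flag is set after the instruction under test, else 0. */

static int zf_after_test(unsigned char v)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned char zf;
    /* testb ANDs v with itself, discards the result and sets the flags. */
    __asm__("testb %1, %1\n\t"
            "sete %0"
            : "=q" (zf)
            : "q" (v)
            : "cc");
    return zf;
#else
    return v == 0;   /* portable stand-in for non-x86 targets */
#endif
}

static int zf_after_cmp(unsigned char v)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned char zf;
    /* cmpb subtracts 0 from v, discards the result and sets the flags. */
    __asm__("cmpb $0, %1\n\t"
            "sete %0"
            : "=q" (zf)
            : "q" (v)
            : "cc");
    return zf;
#else
    return v == 0;   /* portable stand-in for non-x86 targets */
#endif
}
```

Calling both helpers for every value from 0 to 255 should give matching results, with 1 returned only for 0.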

> Also, why do they have three extra ints that accept registers? What does this do, given that my code works just fine without them.
No clue really, since I've never really messed with GCC's inline assembly. Another inline ASM thread cropped up last week, and it used the extra int variables as well. I saw it in many online examples, but I'm not sure if that's an instance of everybody copying a bad example to begin with.

**memcpy** · 02-08-2012

> After 238 posts, you ought to know that bumping is against the forum guidelines
Right, sorry.

> According to this site, test uses bitwise-AND while cmp uses subtraction.
Bookmarked that site, thanks.

@register thingy:
I just did a rdtscl() test, and turns out that using registers yields 288 cycles, whereas omitting them takes 360. Any idea why?

**itCbitC** · 02-09-2012

> I just did a rdtscl() test, and turns out that using registers yields 288 cycles, whereas omitting them takes 360. Any idea why?

No idea what rdtscl() does but omitting registers means reading 'n writing to memory which is way slower than register access, hence the extra clock cycles.

**oogabooga** · 02-09-2012

> No idea what rdtscl() does
rdtsc => read timestamp counter

> I just did a rdtscl() test, and turns out that using registers yields 288 cycles, whereas omitting them takes 360. Any idea why?

What exactly do you mean by "using registers"? I'm assuming that you know registers are faster than memory, so by "using registers" do you mean this:

Code:

char *strcpy(char *dest, const char *src) {
    int d0, d1, d2;
    asm volatile("1:\tlodsb\n\t"
        "stosb\n\t"
        "testb %%al, %%al\n\t"
        "jne 1b"
        : "=&S" (d0), "=&D" (d1), "=&a" (d2)
        : "0" (src), "1" (dest)
        : "memory");
    return dest;
}

as opposed to this:

Code:

char *strcpy(char *dest, const char *src) {
    asm volatile("1:\tlodsb\n\t"
        "stosb\n\t"
        "testb %%al, %%al\n\t"
        "jne 1b"
        :
        : "S" (src), "D" (dest)
        : "eax", "memory");
    return dest;
}

because it would seem strange if the first was the faster one, but then again, that would explain the ints (if the "copying a bad example" explanation is incorrect).

At any rate, a disassembly of the relevant code should show what's up.

**memcpy** · 02-09-2012

Yes, I was vague, but I was referring to the version with the temporary variables. It truly is faster, and consistently so. Here's a disassembly of both strcpy()s:

No temp variable:

Code:

0x0000000100000eb9 <nstr+0>:	push   %rbp
0x0000000100000eba <nstr+1>:	mov    %rsp,%rbp
0x0000000100000ebd <nstr+4>:	mov    %rdi,-0x8(%rbp)
0x0000000100000ec1 <nstr+8>:	mov    %rsi,-0x10(%rbp)
0x0000000100000ec5 <nstr+12>:	mov    -0x10(%rbp),%rsi
0x0000000100000ec9 <nstr+16>:	mov    -0x8(%rbp),%rdi
0x0000000100000ecd <nstr+20>:	lods   %ds:(%rsi),%al
0x0000000100000ece <nstr+21>:	stos   %al,%es:(%rdi)
0x0000000100000ecf <nstr+22>:	test   %al,%al
0x0000000100000ed1 <nstr+24>:	jne    0x100000ecd <nstr+20>
0x0000000100000ed3 <nstr+26>:	mov    -0x8(%rbp),%rax
0x0000000100000ed7 <nstr+30>:	leaveq 
0x0000000100000ed8 <nstr+31>:	retq

Temp variable:

Code:

0x0000000100000e90 <rstr+0>:	push   %rbp
0x0000000100000e91 <rstr+1>:	mov    %rsp,%rbp
0x0000000100000e94 <rstr+4>:	mov    %rdi,-0x18(%rbp)
0x0000000100000e98 <rstr+8>:	mov    %rsi,-0x20(%rbp)
0x0000000100000e9c <rstr+12>:	mov    -0x20(%rbp),%rsi
0x0000000100000ea0 <rstr+16>:	mov    -0x18(%rbp),%rdi
0x0000000100000ea4 <rstr+20>:	lods   %ds:(%rsi),%al
0x0000000100000ea5 <rstr+21>:	stos   %al,%es:(%rdi)
0x0000000100000ea6 <rstr+22>:	test   %al,%al
0x0000000100000ea8 <rstr+24>:	jne    0x100000ea4 <rstr+20>
0x0000000100000eaa <rstr+26>:	mov    %esi,-0x4(%rbp)
0x0000000100000ead <rstr+29>:	mov    %edi,-0x8(%rbp)
0x0000000100000eb0 <rstr+32>:	mov    %eax,-0xc(%rbp)
0x0000000100000eb3 <rstr+35>:	mov    -0x18(%rbp),%rax
0x0000000100000eb7 <rstr+39>:	leaveq 
0x0000000100000eb8 <rstr+40>:	retq

What? More instructions? Anyone wanna enlighten me here?

**anduril462** · 02-09-2012

> What? More instructions? Anyone wanna enlighten me here?

This is all speculative, but I think your answer lies here:

Code:

        : "=&S" (d0), "=&D" (d1), "=&a" (d2)
        : "0" (src), "1" (dest)
        : "memory");

That first line binds d0 to %esi, d1 to %edi and d2 to %eax. The second line says src is also in %esi, and dest is in %edi (they are tied to the 0th and 1st operands from the line above). The last line says that your assembly may clobber memory in an unpredictable way. I think that means that d0, d1 and d2 (the values in those spots in memory) are not guaranteed to be correct after executing the instructions in your assembler template. Thus, we must restore them at the end. That produces the 3 extra lines below:

Code:

0x0000000100000eaa <rstr+26>:    mov    %esi,-0x4(%rbp)    // restore d0 from %esi
0x0000000100000ead <rstr+29>:    mov    %edi,-0x8(%rbp)    // restore d1 from %edi
0x0000000100000eb0 <rstr+32>:    mov    %eax,-0xc(%rbp)    // restore d2 from %eax

Everything else is the same in those two versions (except the relative offsets from the base pointer %rbp). Try taking "memory" out of the clobber list and see what happens.

On a related note, the following baffles me (it happens in both versions):

Code:

0x0000000100000e94 <rstr+4>:    mov    %rdi,-0x18(%rbp)
0x0000000100000e98 <rstr+8>:    mov    %rsi,-0x20(%rbp)
0x0000000100000e9c <rstr+12>:    mov    -0x20(%rbp),%rsi
0x0000000100000ea0 <rstr+16>:    mov    -0x18(%rbp),%rdi

It seems that moves the values in %rdi and %rsi into dest and src, then moves them right back. This seems backwards (shouldn't it move them from src/dest into registers first?) and pointless. Am I missing something obvious here?

**oogabooga** · 02-09-2012

gcc knows that the locals are being affected because they're mentioned in the output list. "memory" is in the clobber list because the strcpy operation is writing to memory pointed to by dest. That of course does not invalidate your idea as to why they're written back at the end.
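For what it's worth, the dummy outputs are there because lodsb and stosb change %esi, %edi and %al, and GCC assumes an input-only operand still holds its original value after the asm. Tying the inputs to outputs is how the asm admits it changes them. A sketch of the same loop using read-write operands ("+S", "+D") instead of the tied "0"/"1" form follows. It assumes GCC or Clang on x86 and falls back to plain C elsewhere, and the name `copy_str` is made up for illustration.

```c
/* Hypothetical name: a sketch of the strcpy loop discussed in this thread. */
char *copy_str(char *dest, const char *src)
{
#if defined(__x86_64__) || defined(__i386__)
    char *d = dest;        /* working copies: the asm advances both */
    const char *s = src;
    int tmp;               /* receives whatever is left in the accumulator */

    __asm__ volatile(
        "1:\n\t"
        "lodsb\n\t"                 /* al = *s++ */
        "stosb\n\t"                 /* *d++ = al */
        "testb %%al, %%al\n\t"      /* was that the terminator? */
        "jne 1b"
        : "+S" (s), "+D" (d), "=&a" (tmp)   /* all three are modified */
        :
        : "memory", "cc");          /* writes through d, changes flags */
    (void)tmp;
#else
    /* Portable stand-in so the sketch builds on non-x86 targets. */
    char *d = dest;
    while ((*d++ = *src++) != '\0') { }
#endif
    return dest;
}
```

Because dest itself is never handed to the asm, the compiler knows it is still intact for the return. In an optimized build the working copies should live only in registers, so the stores at the end of the unoptimized disassembly would be expected to disappear.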

> It seems that moves the values in %rdi and %rsi into dest and src, then moves them right back. This seems backwards (shouldn't it move them from src/dest into registers first?)

Is it possible that the input params are passed in rdi and rsi? We need to see the disassembly of the call. It would be interesting to see an optimized version as well.

**brewbuck** · 02-09-2012

%rdi and %rsi are naturally used for the first two arguments in the AMD64 calling convention used by Linux. So if you specify that src goes in "S" and dest goes in "D", an optimized build should be able to eliminate all the memory variables.

**memcpy** · 02-09-2012

Another bit of oddness:

Code:

#include <stdio.h>
#include <stdlib.h>

unsigned rdtscl()
{
	unsigned low, high;
	asm volatile("rdtsc" : "=a" (low), "=d" (high));
	
	return (unsigned)((low) | ((unsigned long long)(high) << 32));
}


char *rstr(char *dest, const char *src) {
	int d0, d1, d2;
	asm volatile("1:\tlodsb\n\t"
		     "stosb\n\t"
		     "testb %%al, %%al\n\t"
		     "jne 1b"
		     : "=&S" (d0), "=&D" (d1), "=&a" (d2)
		     : "0" (src), "1" (dest)
		     : "memory");
	return dest;
}


char *nstr(char *dest, const char *src) {
	asm volatile("1:\tlodsb\n\t"
		     "stosb\n\t"
		     "testb %%al, %%al\n\t"
		     "jne 1b"
		     :
		     : "S" (src), "D" (dest)
		     : );
	return dest;
}


int main(void)
{ 
	char *out; 
	unsigned time;
	
	out = malloc(40);
	
	time = rdtscl();
	rstr(out, "abcdefg");
	printf("rs=%u\n",rdtscl()-time);
	
	time = rdtscl();
	nstr(out, "abcdefg");
	printf("ns=%u\n",rdtscl()-time);
	
	free(out);
}

For some reason, the no-int version is sometimes actually faster, sometimes slightly slower, and sometimes slower by a couple hundred cycles. I don't see how taking out the clobber list changes this, especially because the "gcc -S" output looks relatively similar...
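A side note on the timer itself: rdtscl() above returns unsigned, so the cast throws away the high 32 bits. That is harmless for short intervals like these, but here is a sketch that returns the full 64-bit counter, assuming GCC or Clang on x86. The name `rdtsc64` is made up, and the clock() branch is only a stand-in so it builds on other targets.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: returns the full 64-bit timestamp counter. */
static uint64_t rdtsc64(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t low, high;
    /* rdtsc puts the low half in EAX and the high half in EDX.
       volatile stops the compiler from merging or dropping reads. */
    __asm__ volatile("rdtsc" : "=a" (low), "=d" (high));
    return ((uint64_t)high << 32) | low;
#else
    return (uint64_t)clock();   /* stand-in for non-x86 targets */
#endif
}
```

Differences between two readings should then be printed with "%llu" after a cast to unsigned long long.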

**oogabooga** · 02-10-2012

Here's a rewrite of your program using some ideas from here about the difficulties of using rdtsc. Basically, they involve the effects of data and code caching (and also "out of order execution", which I didn't compensate for).

To alleviate the caching effects, each routine is called 3 times, taking the timing only from the last call (if USE_CACHE_WARMING is non-zero).

I usually get "r172 n168" using cache warming and "r236 n232" without, although occasionally something like "n664" crops up.

Code:

#include <stdio.h>
#include <stdlib.h>

#define USE_CACHE_WARMING  1   /* 0 don't, 1 do */

typedef unsigned long long ull;

unsigned before_low, before_high, after_low, after_high;


char *rstr(char *dest, const char *src) {
    int d0, d1, d2;

    asm volatile("rdtsc" : "=a" (before_low), "=d" (before_high));

    asm volatile(
        "1: \n"
        "lodsb \n"
        "stosb \n"
        "testb %%al, %%al \n"
        "jne 1b"

        : "=&S" (d0), "=&D" (d1), "=&a" (d2)
        : "0" (src), "1" (dest)
        : "memory"
    );

    asm volatile("rdtsc" : "=a" (after_low), "=d" (after_high));

    return dest;
}


char *nstr(char *dest, const char *src) {

    asm volatile("rdtsc" : "=a" (before_low), "=d" (before_high));

    asm(
        "1: \n"
        "lodsb \n"
        "stosb \n"
        "testb %%al, %%al \n"
        "jne 1b"

        :
        : "S" (src), "D" (dest)
        : "memory"
    );

    asm volatile("rdtsc" : "=a" (after_low), "=d" (after_high));

    return dest;
}


ull cycles(void) {
    return (ull)(after_low  - before_low)
        + ((ull)(after_high - before_high) << 32);
}


int main(void) {
    char out[40], *str = "abcdefg";
    ull rcycles, ncycles;

#if USE_CACHE_WARMING
    nstr(out, str);
    nstr(out, str);
    nstr(out, str);
    ncycles = cycles();

    rstr(out, str);
    rstr(out, str);
    rstr(out, str);
    rcycles = cycles();

#else
    nstr(out, str);
    ncycles = cycles();

    rstr(out, str);
    rcycles = cycles();

#endif

    printf("r%llu\t", rcycles);
    printf("n%llu\n", ncycles);

    return 0;
}
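One way to cut down the occasional outlier mentioned above is to time many calls and keep the smallest reading, since interrupts and cache misses only ever add cycles. Below is a rough sketch of that idea, not a rigorous benchmark. It assumes GCC or Clang, uses the `__rdtsc()` and `_mm_lfence()` intrinsics on x86-64 as a partial answer to out-of-order execution, and falls back to clock() elsewhere. The names `ticks`, `min_ticks`, `copy_fn` and `byte_copy` are all made up for illustration.

```c
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__)
#include <x86intrin.h>   /* __rdtsc(), _mm_lfence() */
#endif

/* Hypothetical type: any strcpy-shaped routine can be timed. */
typedef char *(*copy_fn)(char *dest, const char *src);

static uint64_t ticks(void)
{
#if defined(__x86_64__)
    _mm_lfence();              /* let earlier instructions finish first */
    uint64_t t = __rdtsc();
    _mm_lfence();              /* keep later instructions from starting early */
    return t;
#else
    return (uint64_t)clock();  /* coarse stand-in for other targets */
#endif
}

/* Plain C copy loop, used here only as something to time. */
static char *byte_copy(char *dest, const char *src)
{
    char *d = dest;
    while ((*d++ = *src++) != '\0') { }
    return dest;
}

/* Returns the smallest tick count seen over `runs` timed calls. */
static uint64_t min_ticks(copy_fn fn, char *dest, const char *src, int runs)
{
    uint64_t best = UINT64_MAX;

    fn(dest, src);             /* untimed warm-up call to fill the caches */
    for (int i = 0; i < runs; i++) {
        uint64_t start = ticks();
        fn(dest, src);
        uint64_t stop = ticks();
        if (stop - start < best)
            best = stop - start;
    }
    return best;
}
```

Something like `min_ticks(byte_copy, out, "abcdefg", 1000)` could then be compared against the same call with rstr or nstr. The reading still includes the cost of the two fences and the call itself, so only the differences between routines mean much.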

**itCbitC** · 02-10-2012

> Yes, I was vague, but I was referring to the temporary example. It truly is faster, and consistently so, here's a disassembly of both the strcpy()s:

> What? More instructions? Anyone wanna enlighten me here?

Don't know which source you disassembled, but if it is the ones from your first post, then strcpy() has fewer instructions than str_replace(), even tho' the former has 3 unused ints.
