Optimized 'strlen'

**GReaper** · 01-10-2011

Just yesterday i completed an optimized version of 'strlen' by using inline assembly. The problem is that although it's much faster than the typical approach:

Code:

int i;

while (*str++ != '\0') ++i;

return i;

when compared to the standard library's function it sucks!!!! I'm comparing bytes as fast as i can, while maintaining legacy support of course. The only way that the library could overrun mine would be by using MMX or better yet SSE-SSE2 instructions. But wouldn't that make the library incompatible with old x86 CPUs? It doesn't make sense!!!

EDIT: Even CodeBlocks at -O3 and with Core2 instructions enabled hesitates to use them!!

**EVOEx** · 01-10-2011

I'm using mac, myself, and I think the actual function doing the strlen is "__dyld_strlen". It has the following assembly dump:

Code:

0x00007fff5fc22e90 <__dyld_strlen+0>:	pxor   %xmm0,%xmm0
0x00007fff5fc22e94 <__dyld_strlen+4>:	mov    %edi,%ecx
0x00007fff5fc22e96 <__dyld_strlen+6>:	mov    %rdi,%rdx
0x00007fff5fc22e99 <__dyld_strlen+9>:	and    $0xfffffffffffffff0,%rdi
0x00007fff5fc22e9d <__dyld_strlen+13>:	or     $0xffffffffffffffff,%eax
0x00007fff5fc22ea0 <__dyld_strlen+16>:	pcmpeqb (%rdi),%xmm0
0x00007fff5fc22ea4 <__dyld_strlen+20>:	and    $0xf,%ecx
0x00007fff5fc22ea7 <__dyld_strlen+23>:	shl    %cl,%eax
0x00007fff5fc22ea9 <__dyld_strlen+25>:	pmovmskb %xmm0,%ecx
0x00007fff5fc22ead <__dyld_strlen+29>:	and    %eax,%ecx
0x00007fff5fc22eaf <__dyld_strlen+31>:	je     0x7fff5fc22ebb <__dyld_strlen+43>
0x00007fff5fc22eb1 <__dyld_strlen+33>:	bsf    %ecx,%eax
0x00007fff5fc22eb4 <__dyld_strlen+36>:	sub    %rdx,%rdi
0x00007fff5fc22eb7 <__dyld_strlen+39>:	add    %rdi,%rax
0x00007fff5fc22eba <__dyld_strlen+42>:	retq   
0x00007fff5fc22ebb <__dyld_strlen+43>:	pxor   %xmm0,%xmm0
0x00007fff5fc22ebf <__dyld_strlen+47>:	add    $0x10,%rdi
0x00007fff5fc22ec3 <__dyld_strlen+51>:	movdqa (%rdi),%xmm1
0x00007fff5fc22ec7 <__dyld_strlen+55>:	add    $0x10,%rdi
0x00007fff5fc22ecb <__dyld_strlen+59>:	pcmpeqb %xmm0,%xmm1
0x00007fff5fc22ecf <__dyld_strlen+63>:	pmovmskb %xmm1,%ecx
0x00007fff5fc22ed3 <__dyld_strlen+67>:	test   %ecx,%ecx
0x00007fff5fc22ed5 <__dyld_strlen+69>:	je     0x7fff5fc22ec3 <__dyld_strlen+51>
0x00007fff5fc22ed7 <__dyld_strlen+71>:	sub    $0x10,%rdi
0x00007fff5fc22edb <__dyld_strlen+75>:	jmp    0x7fff5fc22eb1 <__dyld_strlen+33>

I have to admit I didn't read all of it, but it's obvious it uses SSE2 (I think, I have to admit I've never used it myself).

About it being unportable: it's probably determined at compile time if your processor supports it. If not, it uses the "slower but more portable" versions. If it does support them, it uses them. That's how most of those applications that require a maximum speed work. Big integers, for instance, usually have dozens of implementations, one for each architecture supported.

**anduril462** · 01-10-2011

EVOEx is right, many lower level library routines have dozens of implementations that make the optimized for one specific processor type. This way they can leverage certain assembly instructions, registers or extensions. Also, though strlen effectively counts individual bytes, some implementations may only count individual bytes until they find themselves on a word boundary, then use word instructions to speed things up, counting in chunks of 4 bytes instead of 1.

What specific processor are you on? I'd be interested in seeing your implementation, to see if anything more could be squeezed out of it.

**nonoob** · 01-10-2011

If you wanted to emulate the assembler, this:

Code:

while (*str++ != '\0') ++i;

is not an optimal example to work from. You don't need to maintain a count in i. Just increment the pointer. At the end, take the difference between the final pointer and the original (having saved the original at the start).

Code:

while (*str++);

**GReaper** · 01-11-2011

What i posted in code quotes is the typical approach, not what I made.

EVOEx: I can see that you're running 64-bit, right? Well, it's a little different in 32-bit.

anduril462: I've run a series of tests without optimization, and it seems that my versions run asymptotically faster in most case.

Check the attached *.zip for details and for two versions of my optimized StrLen. Enjoy!

( Change its file extension to zip before opening

)

**EVOEx** · 01-11-2011

Originally Posted by Sipher

What i posted in code quotes is the typical approach, not what I made.

EVOEx: I can see that you're running 64-bit, right? Well, it's a little different in 32-bit.

anduril462: I've run a series of tests without optimization, and it seems that my versions run asymptotically faster in most case.

Check the attached *.zip for details and for two versions of my optimized StrLen. Enjoy!

( Change its file extension to zip before opening

)

Yes, 64-bit is different as in that the code is different. But the concepts are still completely the same. Maybe the 32-bit programmers didn't bother to implement SSE2, but if they did for 64 bits, I am quite sure they did.
Also, it was just to show you that yes, sometimes, SSE2 will be used if available, and if not available a suitable alternative will be used.

So in effect, 32 bits and 64 bits are not different at all. In all cases it will have one or more option, and if more options are available, usually the best option will be used.

**GReaper** · 01-11-2011

Originally Posted by EVOEx

EDIT: What kind of file format is that? It's not a simple txt. "PK" first two bytes - isn't that a windows executable?

No, it's a *.zip! Didn't i mention that?

**EVOEx** · 01-11-2011

Originally Posted by Sipher

No, it's a *.zip! Didn't i mention that?

Yeah, I rememberd after writing that ;-). But I was too late removing it. And yes, you did write it, but I didn't actually read that bit :P. (MZ was windows executable wasn't it?)

**GReaper** · 01-11-2011

Yes. MZ stands for Mark Zbikowski (an early DOS programmer)

Edit: It's the least Windows could have made to honor his excessive tribute to the system

Thread: Optimized 'strlen'

Thread Tools

Search Thread

Display

Optimized 'strlen'

Similar Threads

strlen()

strlen in expressions

strlen()

Just say NO to strlen.

Why O why, strlen?