Thread: Optimized 'strlen'

  1. #1
    Programming Wraith GReaper's Avatar
    Join Date
    Apr 2009
    Location
    Greece
    Posts
    2,738

    Optimized 'strlen'

    Just yesterday i completed an optimized version of 'strlen' by using inline assembly. The problem is that although it's much faster than the typical approach:
    Code:
    int i;
    
    while (*str++ != '\0') ++i;
    
    return i;
    when compared to the standard library's function it sucks!!!! I'm comparing bytes as fast as i can, while maintaining legacy support of course. The only way that the library could overrun mine would be by using MMX or better yet SSE-SSE2 instructions. But wouldn't that make the library incompatible with old x86 CPUs? It doesn't make sense!!!

    EDIT: Even CodeBlocks at -O3 and with Core2 instructions enabled hesitates to use them!!
    Last edited by GReaper; 01-10-2011 at 05:28 AM.
    Devoted my life to programming...

  2. #2
    Registered User
    Join Date
    Oct 2008
    Posts
    1,262
    I'm using mac, myself, and I think the actual function doing the strlen is "__dyld_strlen". It has the following assembly dump:
    Code:
    0x00007fff5fc22e90 <__dyld_strlen+0>:	pxor   %xmm0,%xmm0
    0x00007fff5fc22e94 <__dyld_strlen+4>:	mov    %edi,%ecx
    0x00007fff5fc22e96 <__dyld_strlen+6>:	mov    %rdi,%rdx
    0x00007fff5fc22e99 <__dyld_strlen+9>:	and    $0xfffffffffffffff0,%rdi
    0x00007fff5fc22e9d <__dyld_strlen+13>:	or     $0xffffffffffffffff,%eax
    0x00007fff5fc22ea0 <__dyld_strlen+16>:	pcmpeqb (%rdi),%xmm0
    0x00007fff5fc22ea4 <__dyld_strlen+20>:	and    $0xf,%ecx
    0x00007fff5fc22ea7 <__dyld_strlen+23>:	shl    %cl,%eax
    0x00007fff5fc22ea9 <__dyld_strlen+25>:	pmovmskb %xmm0,%ecx
    0x00007fff5fc22ead <__dyld_strlen+29>:	and    %eax,%ecx
    0x00007fff5fc22eaf <__dyld_strlen+31>:	je     0x7fff5fc22ebb <__dyld_strlen+43>
    0x00007fff5fc22eb1 <__dyld_strlen+33>:	bsf    %ecx,%eax
    0x00007fff5fc22eb4 <__dyld_strlen+36>:	sub    %rdx,%rdi
    0x00007fff5fc22eb7 <__dyld_strlen+39>:	add    %rdi,%rax
    0x00007fff5fc22eba <__dyld_strlen+42>:	retq   
    0x00007fff5fc22ebb <__dyld_strlen+43>:	pxor   %xmm0,%xmm0
    0x00007fff5fc22ebf <__dyld_strlen+47>:	add    $0x10,%rdi
    0x00007fff5fc22ec3 <__dyld_strlen+51>:	movdqa (%rdi),%xmm1
    0x00007fff5fc22ec7 <__dyld_strlen+55>:	add    $0x10,%rdi
    0x00007fff5fc22ecb <__dyld_strlen+59>:	pcmpeqb %xmm0,%xmm1
    0x00007fff5fc22ecf <__dyld_strlen+63>:	pmovmskb %xmm1,%ecx
    0x00007fff5fc22ed3 <__dyld_strlen+67>:	test   %ecx,%ecx
    0x00007fff5fc22ed5 <__dyld_strlen+69>:	je     0x7fff5fc22ec3 <__dyld_strlen+51>
    0x00007fff5fc22ed7 <__dyld_strlen+71>:	sub    $0x10,%rdi
    0x00007fff5fc22edb <__dyld_strlen+75>:	jmp    0x7fff5fc22eb1 <__dyld_strlen+33>
    I have to admit I didn't read all of it, but it's obvious it uses SSE2 (I think, I have to admit I've never used it myself).

    About it being unportable: it's probably determined at compile time if your processor supports it. If not, it uses the "slower but more portable" versions. If it does support them, it uses them. That's how most of those applications that require a maximum speed work. Big integers, for instance, usually have dozens of implementations, one for each architecture supported.

  3. #3
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    EVOEx is right, many lower level library routines have dozens of implementations that make the optimized for one specific processor type. This way they can leverage certain assembly instructions, registers or extensions. Also, though strlen effectively counts individual bytes, some implementations may only count individual bytes until they find themselves on a word boundary, then use word instructions to speed things up, counting in chunks of 4 bytes instead of 1.

    What specific processor are you on? I'd be interested in seeing your implementation, to see if anything more could be squeezed out of it.

  4. #4
    Registered User
    Join Date
    Sep 2008
    Location
    Toronto, Canada
    Posts
    1,834
    If you wanted to emulate the assembler, this:
    Code:
    while (*str++ != '\0') ++i;
    is not an optimal example to work from. You don't need to maintain a count in i. Just increment the pointer. At the end, take the difference between the final pointer and the original (having saved the original at the start).
    Code:
    while (*str++);

  5. #5
    Programming Wraith GReaper's Avatar
    Join Date
    Apr 2009
    Location
    Greece
    Posts
    2,738
    What i posted in code quotes is the typical approach, not what I made.

    EVOEx: I can see that you're running 64-bit, right? Well, it's a little different in 32-bit.

    anduril462: I've run a series of tests without optimization, and it seems that my versions run asymptotically faster in most case.

    Check the attached *.zip for details and for two versions of my optimized StrLen. Enjoy!
    ( Change its file extension to zip before opening )
    Devoted my life to programming...

  6. #6
    Registered User
    Join Date
    Oct 2008
    Posts
    1,262
    Quote Originally Posted by Sipher View Post
    What i posted in code quotes is the typical approach, not what I made.

    EVOEx: I can see that you're running 64-bit, right? Well, it's a little different in 32-bit.

    anduril462: I've run a series of tests without optimization, and it seems that my versions run asymptotically faster in most case.

    Check the attached *.zip for details and for two versions of my optimized StrLen. Enjoy!
    ( Change its file extension to zip before opening )
    Yes, 64-bit is different as in that the code is different. But the concepts are still completely the same. Maybe the 32-bit programmers didn't bother to implement SSE2, but if they did for 64 bits, I am quite sure they did.
    Also, it was just to show you that yes, sometimes, SSE2 will be used if available, and if not available a suitable alternative will be used.

    So in effect, 32 bits and 64 bits are not different at all. In all cases it will have one or more option, and if more options are available, usually the best option will be used.
    Last edited by EVOEx; 01-11-2011 at 09:43 AM.

  7. #7
    Programming Wraith GReaper's Avatar
    Join Date
    Apr 2009
    Location
    Greece
    Posts
    2,738
    Quote Originally Posted by EVOEx View Post
    EDIT: What kind of file format is that? It's not a simple txt. "PK" first two bytes - isn't that a windows executable?
    No, it's a *.zip! Didn't i mention that?
    Devoted my life to programming...

  8. #8
    Registered User
    Join Date
    Oct 2008
    Posts
    1,262
    Quote Originally Posted by Sipher View Post
    No, it's a *.zip! Didn't i mention that?
    Yeah, I rememberd after writing that ;-). But I was too late removing it. And yes, you did write it, but I didn't actually read that bit :P. (MZ was windows executable wasn't it?)

  9. #9
    Programming Wraith GReaper's Avatar
    Join Date
    Apr 2009
    Location
    Greece
    Posts
    2,738
    Yes. MZ stands for Mark Zbikowski (an early DOS programmer)

    Edit: It's the least Windows could have made to honor his excessive tribute to the system
    Last edited by GReaper; 01-11-2011 at 09:51 AM.
    Devoted my life to programming...

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. strlen()
    By BoneXXX in forum C Programming
    Replies: 6
    Last Post: 06-17-2007, 01:32 AM
  2. strlen in expressions
    By justforthis1 in forum C++ Programming
    Replies: 4
    Last Post: 10-24-2006, 10:28 AM
  3. strlen()
    By exoeight in forum C Programming
    Replies: 9
    Last Post: 04-01-2005, 10:18 AM
  4. Just say NO to strlen.
    By anonytmouse in forum A Brief History of Cprogramming.com
    Replies: 24
    Last Post: 02-11-2005, 01:34 PM
  5. Why O why, strlen?
    By Sebastiani in forum C Programming
    Replies: 11
    Last Post: 08-24-2001, 01:41 PM