PDA

View Full Version : Just say NO to strlen.



anonytmouse
02-10-2005, 07:32 AM
The strlen function is expensive. For every call it must loop over every character until a nul terminator is found. Therefore, it is very unwise to call it more often than needed. This is bad code:


for ( int ix = 0; ix < strlen(a_str); ix++)
{
a_str[ix] = tolower( (unsigned char) a_str[ix] );
}

Lets consider a string 1000 characters in length. strlen will be called 1001 times and loop over 1000 characters each time. That is over one million wasted iterations. If the tolower call and assignment takes the equivalent of 10 iterations we can calculate that the operation takes a massive 10000% of the time it would take if it was written correctly.

Computers may be much faster than they were ten years ago (although sometimes I doubt that), but a hundred times performance hit is still unacceptable. Do we really want C/C++ code running slower than the VB, PHP, Perl, Python, Javascript, punch card and little old lady with typewriter versions combined?

With the frequent use of strlen and the resulting performance wipe out, one would assume that it is very hard to write code that doesn't use strlen in the loop condition. To the contrary, Dr Watson! Writing a loop condition without strlen is actually quite achievable, even for the most experienced coder.

All that is required is to replace the use of strlen with:


for ( int ix = 0; a_str[ix] != '\0'; ix++)
{
a_str[ix] = tolower( (unsigned char) a_str[ix] );
}

or the slightly less efficient:


int len = strlen(a_str);
for ( int ix = 0; ix < len; ix++)
{
a_str[ix] = tolower( (unsigned char) a_str[ix] );
}

See, it is possible! If you can't remember, feel free to bookmark this page(typically ctrl+d) and copy(ctrl+c) and paste(ctrl+v).

If peer pressure is pushing you towards excessive use of strlen, JUST SAY NO*.


*strlen may be permitted for medical, artistic, electoral, hunting or driving purposes in some states. Please consult your local strlen provider.

Govtcheez
02-10-2005, 07:35 AM
> Writing a loop condition without strlen is actually quite achievable, even for the most experienced coder.

Well, I would hope an experienced coder could write it ;)

SMurf
02-10-2005, 07:52 AM
I think it's reasonably well-known if you've been round the block a few times not to use functions in loop conditions.

Might be one for the FAQ though, just in case someone wonders why their simple program takes minutes instead of seconds to work. ;)

Fordy
02-10-2005, 07:55 AM
It's good practice to take notice of what's placed as the "test" for a loop like that. The same can be said of testing a container's length() function each time (though the actual overhead might be less according to how the container is written).


If peer pressure is pushing you towards excessive use of strlen, JUST SAY NO*.


*strlen may be permitted for medical, artistic, electoral, hunting or driving purposes in some states. Please consult your local strlen provider.


:)

Thantos
02-10-2005, 09:26 AM
for ( int ix = 0; ix < strlen(a_str); ix++)
{
a_str[ix] = tolower( (unsigned char) a_str[ix] );
}

Depending on the compiler this may not be as bad as you think. Since the body of the loop doesn't change the length its very possible for the compiler to modify the code so that the function is called once and it saves the return value and uses that in the condition.



for ( int ix = 0; a_str[ix] != '\0'; ix++)
{
a_str[ix] = tolower( (unsigned char) a_str[ix] );
}

This is inefficent also. Everytime through the loop it has to add x *sizeof(char) to a_str to get the address. Now since sizeof(char)==1 it might be optimized in that regard. The compiler might optimize it to:


for ( char *tempptr = a_str; *tempptr != '\0'; tempptr++)
{
*tempptr = tolower( (unsigned char) *tempptr );
}

CornedBee
02-10-2005, 09:38 AM
Depending on the compiler this may not be as bad as you think. Since the body of the loop doesn't change the length its very possible for the compiler to modify the code so that the function is called once and it saves the return value and uses that in the condition.
Or not. Given that the compiler might not to what toupper does, it can't perform this optimization. What if toupper return '\0'? Suddenly the string is a lot shorter.
(Mind you, the '\0' test method fails just the same, but that isn't relevant.)



This is inefficent also. Everytime through the loop it has to add x *sizeof(char) to a_str to get the address.
Questionable. On x86, at least, such offset-addressing can be done in a single instruction.

Thantos
02-10-2005, 09:51 AM
Considering that you are calling toupper() on the same location as you are putting it into there is no way for toupper() to return '\0' prior to reaching the actual null terminator. I would expect your better compilers to know this and be able to optimize it.


Questionable. On x86, at least, such offset-addressing can be done in a single instruction.
On a char array it should be possible, but for other arrays I don't think so but its been awhile since I looked at all the op codes.

Edit: Yeah I remember one now that let you use a scaler for the offset. While it is one instruction its not nearly as fast of an instruction as say adding a fixed value.

Hunter2
02-10-2005, 10:31 AM
>>I would expect your better compilers to know this and be able to optimize it.
Possibly, but it's hardly a 'good' assumption to make :) I really don't know how smart compilers are these days, but this assumes that the compiler will (a) recognize the function is named toupper(), (b) find out whether it's a user-defined function or not, (c) recognize the function strlen(), (d) find out whether it's a user-defined function or not, (e) recognize the fact that given the above conditions it is safe to optimize the construct, and (f) know how to optimize it.

Knowing that compilers are smarter than we are these days, this is probably a reasonable assumption to make, but it's always safest just to write your code properly in the first place. After all, not every compiler is omniscient :)

Sang-drax
02-10-2005, 10:34 AM
On my system strlen is not exactly implemented as a for-loop. It is implemented with a special string instruction.
Nevertheless, your point is valid.

CornedBee
02-10-2005, 10:37 AM
And the compiler would also need to know that it's not our intention to break the moment toupper returns '\0', which would happen with repeated strlen calls. In other words, the optimization would change the observable behaviour of the program beyond speed changes, and this is not tolerable.

Thantos
02-10-2005, 10:39 AM
CornedBee please tell me in what case toupper() returns '\0' when the input isn't '\0'. Because well I looked at the man page and it made no mention of that return value.

CornedBee
02-10-2005, 11:36 AM
It doesn't. But unless the compiler has been programmed to know this (why should it?) or is capable to look into every single one of the locale-dependent translation lookup tables, the compiler can't know that.

Brian
02-10-2005, 11:46 AM
strlen killed my cpu and also my family :(

Thantos
02-10-2005, 12:06 PM
But unless the compiler has been programmed to know this (why should it?)
Maybe because toupper() and strlen() are defined by the implentation. Also regardless of the locale there is only one way for toupper() to return '\0' so it really wouldn't be that much of a strecth for them to know the length of the string doesn't change during that loop.


Possibly, but it's hardly a 'good' assumption to make I really don't know how smart compilers are these days, but this assumes that the compiler will (a) recognize the function is named toupper(), (b) find out whether it's a user-defined function or not, (c) recognize the function strlen(), (d) find out whether it's a user-defined function or not, (e) recognize the fact that given the above conditions it is safe to optimize the construct, and (f) know how to optimize it.

Well you could argue that it doesn't really have to check for b and d because you aren't allowed to create user defined functions with those names. Doing so would be undefined behaviour.

major_small
02-10-2005, 03:45 PM
for ( int ix = 0; ix < strlen(a_str); ix++)
{
a_str[ix] = tolower( (unsigned char) a_str[ix] );
}
I myself am guilty of this, but only in code written in haste as an example to show somebody something... the only reason I write it like that in the first place is because I had it pounded into my head by some bad teacher in high school...

hi, my name is Major_Small, and I'm an addict that has started the short path to recovery.


anonytmouse: have you written a tip (http://www.cprogramming.com/tips/add.php) about it?

Hunter2
02-10-2005, 04:22 PM
because you aren't allowed to create user defined functions with those names.Why not? Just change the parameter type to char* and return type to int, and you're all good - no ambiguity. Besides, the library functions are supposed to be in the std:: namespace.



#include <iostream>
#include <string>
#include <cstring>
#include <cstdlib>

int strlen(char* x)
{ return 200; }

int main()
{
char* y = "1234";
for(int i = 0; i < strlen(y); ++i)
std::cout << y[i];

std::cin.get();
return 0;
}
Output:
1234 4I B lI OB @I B T I B I B B [B B B IB B B B dB oB 3B {B $I lB B B B jB B B B B B B B B

Magos
02-10-2005, 05:54 PM
std::string

Problem solved!

Hunter2
02-10-2005, 05:57 PM
Spoilsport. :o

Thantos
02-10-2005, 06:00 PM
Why not? Just change the parameter type to char* and return type to int, and you're all good - no ambiguity. Besides, the library functions are supposed to be in the std:: namespace.
Who said this was c++? I might be wrong but I believe that C99 lets you declare variables inside a for loop, which is the only indication that the code might be C++.

Also changing the return type has no barring on overloading. Changing the parameter char * instead of const char * will cause ambiguity for the user. They might expect to get your function but instead will get the standard function if they use a const pointer or const array.

CornedBee
02-10-2005, 06:04 PM
Maybe because toupper() and strlen() are defined by the implentation.
But they aren't built into the compiler, are they? And even if they are, are their specific characteristics built-in too? In some implementations (GNU libc is an example, I think), it is in fact impossible for the compiler to deduce the behaviour of toupper from the code, because the lookup tables are loaded at locale setting time from files on the disk. In other words, the exact behaviour of toupper is not known at compile time.

Which means that the knowledge, "toupper only returns 0 when passed 0" must be built into the compiler.
(On a side note, the _toupper macro that appears in the MS CRT ctype.h header doesn't fulfill this guarantee.)

In other words:

1) The compiler must have been explicitely told that toupper returns 0 only when being passed 0.
2) The compiler must either have been explicitely told, or able to deduce from the (suddenly available) code of strlen (which very often is implemented in assembly), that a loop over a sequence passed to strlen and controlled by its return value can never hit a 0 character in its body.
3) The compiler must either have been explicitely told, or able to deduce from strlen's source, that its return value only changes if a change to the sequence results in a new location of the first 0 character.
4) The compiler must either have been explicitely told, or able to deduce from strlen's source, that strlen has no additional side effects.

Given these three pieces of information, the compiler must follow this line of logic:
Since the loop is controlled by strlen, only non-0 characters will appear in it. The only change to the sequence is the assignment of a value that comes from the same position in the sequence and has been passed through toupper. toupper will only ever return 0 if 0 was passed in, which means that it will never return 0 in this loop, and thus no new 0 characters are introduced inside the loop. Since strlen's return value can never change under these circumstances, strlen can be considered a loop-time constant and can safely be fetched only once.

I think you're asking a lot, even from modern compilers.

Hunter2
02-10-2005, 06:08 PM
Changing the parameter char * instead of const char * will cause ambiguity for the user. They might expect to get your function but instead will get the standard function if they use a const pointer or const array.Indeed, but it's a possibility that the compiler must take into account.

Hunter2
02-10-2005, 06:24 PM
Ok, well I just ran a test (MSVC 2005 release build):


#include <iostream>
#include <cstring>
#include <cctype>
#include <windows.h>

int main()
{
char a_str[1000];
for(int i = 0; i < 1000; ++i)
a_str[i] = 'x';
a_str[999] = '\0';

DWORD j = GetTickCount();
for(int p = 0; p < 10000; ++p)
{
for ( int ix = 0; ix < strlen(a_str); ix++)
{
a_str[ix] = tolower( (unsigned char) a_str[ix] );
}
}
DWORD end = GetTickCount() - j;
std::cout << end << std::endl;

j = GetTickCount();
for(int p = 0; p < 10000; ++p)
{
int n = strlen(a_str);
for ( int ix = 0; ix < n; ix++)
{
a_str[ix] = tolower( (unsigned char) a_str[ix] );
}
}
end = GetTickCount() - j;
std::cout << end << std::endl;

std::cin.get();
return 0;
}
Output on my machine:


7031
47
Strangely enough, under Debug build:


3235
93
Apparently, the compiler anti-optimized the strlen version for Release build, and optimized the 'proper' version. However, either way the Release build of the strlen version took 149.6 times as long, and the Debug build took 34.8 times as long.

okinrus
02-10-2005, 06:34 PM
Depending on the compiler this may not be as bad as you think. Since the body of the loop doesn't change the length its very possible for the compiler to modify the code so that the function is called once and it saves the return value and uses that in the condition.

I don't think the compiler can optimize here. Iit doesn't know the side effects strlen has. For instance, strlen could modify errno or some other static variable. Without special code to detect strlen, and if the C library is dynamic or static linked and strlen isn't an inline function, the complier is going to have to assume the worst.

sean
02-10-2005, 09:11 PM
strlen killed my cpu and also my family

You killed them yourself with poor usage :p

Zach L.
02-11-2005, 01:34 PM
Yeah, I always see strlen hanging around in back alleys in the shady parts of town.