Originally Posted by
Roaring_Tiger
I am using Borland C++ 4.0 to build an embedded module. The module is required to perform very time-critical tasks. The code has already been developed using functions like sscanf and sprintf, and I heard from a colleague that these functions are comparatively slow. Is this true? And if so, what alternative routines can I use?
I faced this issue myself recently. My program has to process a string consisting of two columns: words and the ids of the documents where those words occur. The string has the following format:
word1 docID1
word1 docID1
word1 docID2
word2 docID1
........................
wordi docIDj
In my program I had to create an inverted index: for each word, a list of ints representing all documents where that word is present (in the example above, the index for word1 would contain docID1 and docID2). Also, when the same word occurs more than once in the same document, its frequency must be updated.
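To give an idea of the structure (a rough sketch only; PostingEntry and addPosting are illustrative names, not my actual code), each word's entry in the index could be a small list of (docID, frequency) pairs:
Code:
#include <stdlib.h>

// one posting: a document id and how often the word occurs in it
typedef struct PostingEntry {
    int docID;
    int frequency;
    struct PostingEntry *next;
} PostingEntry;

// record an occurrence of a word in docID, bumping the frequency
// if that document is already in the word's posting list
PostingEntry *addPosting(PostingEntry *head, int docID)
{
    PostingEntry *p;
    for ( p = head; p != NULL; p = p->next )
        if ( p->docID == docID ) { p->frequency++; return head; }

    p = malloc(sizeof(PostingEntry)); // first occurrence of this word/doc pair
    p->docID = docID;
    p->frequency = 1;
    p->next = head;
    return p;
}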
I decided to use the sscanf function on the string in the following way:
Code:
while ( sscanf(bigString, "%s %d", tempWord, &tempDocID) == 2 )
{
    // if the size of the current line is 0, the end of the string is reached
    if ( (lineSize = currentLineLength(bigString)) == 0 ) break;
    // < process the obtained word and docID >
    bigString = bigString + lineSize; // advance to the next row
}
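The currentLineLength helper used above (and in the modified code below) isn't shown. A minimal version, assuming it returns the length of the current line including its trailing newline and 0 at the end of the string, might look like this:
Code:
// length of the current line including its trailing newline;
// returns 0 when s already points at the end of the string
int currentLineLength(const char *s)
{
    int n = 0;
    while ( s[n] != '\0' && s[n] != '\n' )
        n++;
    return ( s[n] == '\n' ) ? n + 1 : n;
}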
The size of bigString was more than a million characters, and processing it took 3 minutes on average. I then modified the code to split bigString into small ~20K chunks and process each of them in turn. The following modifications were made:
Code:
while ( bigString[bigStringCntr] != '\0' )
{
    smallString = malloc(20001*sizeof(char)); // +1 byte for the terminator
    chunkPtr = smallString;                   // keep the original pointer for free()
    smallStringCntr = 0;                      // reset for each new chunk

    // copy the data from bigString to smallString, breaking at a newline
    // once past 15K so that no line is split across two chunks
    while ( smallStringCntr < 20000 )
    {
        smallString[smallStringCntr] = bigString[bigStringCntr + smallStringCntr];
        if ( smallString[smallStringCntr] == '\0' ) break;
        if ( smallStringCntr > 15000 && smallString[smallStringCntr] == '\n' )
        {
            smallStringCntr++; // keep the newline inside this chunk
            break;
        }
        smallStringCntr++;
    }
    smallString[smallStringCntr] = '\0';

    while ( sscanf(chunkPtr, "%s %d", tempWord, &tempDocID) == 2 )
    {
        // if the size of the current line is 0, the end of the chunk is reached
        if ( (lineSize = currentLineLength(chunkPtr)) == 0 ) break;
        // < process the obtained word and docID >
        chunkPtr = chunkPtr + lineSize; // advance to the next row within the chunk
    }

    bigStringCntr = bigStringCntr + smallStringCntr; // move on to the next ~20K chunk
    free(smallString); // the original version leaked this allocation
}
The code has many details missing, but that's not the point. The point is that running sscanf over many small strings rather than over one large string is far more efficient: processing the same 1-2 MB of data went from about 3 minutes down to less than a second once it was fed in as smaller substrings. The likely explanation is that some sscanf implementations scan the entire input string (effectively a strlen) on every call, so calling it repeatedly on a multi-megabyte buffer becomes quadratic, while short chunks keep each call cheap. This has been a great discovery for me and a great leap forward for my project.
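Back to the original question: if sscanf is still too slow even on small chunks, an alternative worth trying is parsing each line by hand with strtol and a simple character loop. Here is a minimal sketch (parseLine is an illustrative name of my own, and the caller is assumed to pass a word buffer large enough for the longest word):
Code:
#include <stdlib.h>
#include <ctype.h>

// parse one "word docID" line from s without sscanf; returns a pointer
// just past the parsed number, or NULL when nothing is left to parse
const char *parseLine(const char *s, char *word, int *docID)
{
    char *end;
    int i = 0;

    while ( isspace((unsigned char)*s) ) s++;          // skip whitespace and newlines
    while ( *s != '\0' && !isspace((unsigned char)*s) )
        word[i++] = *s++;                              // copy the word
    word[i] = '\0';
    if ( i == 0 ) return NULL;                         // end of input

    *docID = (int)strtol(s, &end, 10);                 // parse the document id
    if ( end == s ) return NULL;                       // no number after the word
    return end;                                        // the next line starts here
}
Calling this in a loop and advancing the pointer it returns avoids sscanf's per-call overhead entirely; the trade-off is that you have to handle malformed input yourself.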