# Thread: Is there a standard function?

1. Alright, I had to do it...
Code:
```int iMalcsFormat(long snum, char s[])
{
unsigned long num = snum;
unsigned int len = 0;
if (snum < 0) {
s[len++] = '-';
num = -snum;
}
if (num < 10) len += 1;
else if (num < 100) len += 2;
else if (num < 1000) len += 3;
else if (num < 10000) len += 5;
else if (num < 100000) len += 6;
else if (num < 1000000) len += 7;
else if (num < 10000000) len += 9;
else if (num < 100000000) len += 10;
else if (num < 1000000000) len += 11;
else len += 13;
unsigned int x = len;
s[x] = '\0';
do {
s[--x] = '0' + (char)(num % 10); num /= 10; if (!num) break;
s[--x] = '0' + (char)(num % 10); num /= 10; if (!num) break;
s[--x] = '0' + (char)(num % 10); num /= 10; if (!num) break;
s[--x] = ',';
s[--x] = '0' + (char)(num % 10); num /= 10; if (!num) break;
s[--x] = '0' + (char)(num % 10); num /= 10; if (!num) break;
s[--x] = '0' + (char)(num % 10); num /= 10; if (!num) break;
s[--x] = ',';
s[--x] = '0' + (char)(num % 10); num /= 10; if (!num) break;
s[--x] = '0' + (char)(num % 10); num /= 10; if (!num) break;
s[--x] = '0' + (char)(num % 10); num /= 10; if (!num) break;
s[--x] = ',';
s[--x] = '0' + (char)(num);
} while (0);
return len;
}```
There's still the possibility of using multiplicative inverses for div/mod by ten for some of the above, but other than that probably the only faster option is hand-rolled assembly.

Code:
```void format_int(int num)
{
static int initialized = 0;

char group[1000][4];

if (!initialized)
{
int i;
for (i = 0; i < 1000; ++i)
{
group[i][0] = ',';
group[i][1] = '0' + i / 100;
group[i][2] = '0' + (i / 10) % 10;
group[i][3] = '0' + i % 10;
}

initialized = 1;
}

char *pos = buffer + BUF_LEN - 4;

while (num)
{
int this_group = num % 1000;
num /= 1000;

*((int *) pos) = *((int *) &group[this_group][0]);

pos -= 4;
}

char *start = pos + 4;

while ((*start) == '0' || (*start) == ',')
{
++start;
}

size_t len = BUF_LEN - (start - buffer);

memmove(buffer, start, len);

buffer[len] = '\0';
}```
This is the fastest I can think of. I'm hoping the compiler is smart enough to just do one div and use both the quotient and remainder.

Uses 4KB RAM. Should fit in L1 cache.

Oh and it assumes 32-bit ints. I believe it's endian safe, although it may not look like it.

3. I get 0.279s for 10000000 iterations of 1234567890.

But of course, that's not a fair comparison, because my compiler is vastly superior .

4. Ah! and there is a subtle bug in my code.

It breaks the strict aliasing rule by accessing chars through an int pointer.

I'm not sure how to fix that without incurring overhead.

With gcc you'll need -fno-strict-aliasing.

5. Alright, I succummed to using gotos (unfortunately we can't really use duff's device for this). I blame this on the fact that I'm at home sick today with nothing better to do. Actually me using a goto already says that I'm sick.
This version reduces the number of branches considerably (3 - 6 now instead of 3 - 19):
Code:
```int iMalcsInsaneFormat(long snum, char s[])
{
char *ps = s;
unsigned long num = snum, div;
if (snum < 0) {
*ps++ = '-';
num = -snum;
}
if (num < 100000) {
if (num < 10) goto L1;
if (num < 100) goto L2;
if (num < 1000) goto L3;
if (num < 10000) goto L4;
goto L5;
} else {
if (num < 1000000) goto L6;
if (num < 10000000) goto L7;
if (num < 100000000) goto L8;
if (num < 1000000000) goto L9;
}
*ps++ = '0' + (char)(num / 1000000000); num %= 1000000000;
*ps++ = ',';
L9:
*ps++ = '0' + (char)(num / 100000000); num %= 100000000;
L8:
*ps++ = '0' + (char)(num / 10000000); num %= 10000000;
L7:
*ps++ = '0' + (char)(num / 1000000); num %= 1000000;
*ps++ = ',';
L6:
*ps++ = '0' + (char)(num / 100000); num %= 100000;
L5:
*ps++ = '0' + (char)(num / 10000); num %= 10000;
L4:
*ps++ = '0' + (char)(div = (num*8389UL)>>23); num -= div*1000;
*ps++ = ',';
L3:
*ps++ = '0' + (char)(div = (num*5243UL)>>19); num -= div*100;
L2:
*ps++ = '0' + (char)(div = (num*6554UL)>>16); num -= div*10;
L1:
*ps++ = '0' + (char)(num);
*ps = '\0';
return ps - s;
}```
Edit: Fixed! Okay, one last change... Now using some multiplicative inverses and I've actually tested the code this time!

6. Now I see why the language doesn't have this as a standard function - there's too many ways to go about it. They couldn't decide on which one to use.

We'll have to ask Frktons to catalog all these on his system, and post up the results. I expected something like what iMalc posted up, would be a very fast choice. Then I thought "bit fields would be even faster", but it seems surreal to write a bit field program for such a mundane task as putting comma's in a string of numbers.

@Cyberfish, would you give us a description of how your program works, and what system you were timing this on?

@iMalc - laughing as he succumbs to the Dark Side of the Force.

I'm not on a POSIX system. That is the MSVC (Windows) and Turbo C (DOS) version.
respectively.

You are the one who keeps posting these assertions, which you know are incorrect and/or irrelevant.
You are blaming me for code that you used?
I merely compiled your code, and I got that warning.
I cannot be held responsible for your use of stoneage compilers and usage of non-standard compliant code.

8. Okay final version. Pulled out all the stops now... Only two divs or mods in the whole thing. I even got rid of two gotos! Time to retire this one:
Code:
```int iMalcsUltimateFormat(long snum, char s[])
{
char *ps = s;
unsigned long num1 = snum, num2, num3, div;
if (snum < 0) {
*ps++ = '-';
num1 = -snum;
}
if (num1 < 10000) {
if (num1 < 10) goto L1;
if (num1 < 100) goto L2;
if (num1 < 1000) goto L3;
} else {
num2 = num1 / 10000;
num1 -= num2 * 10000;
if (num2 < 10000) {
if (num2 < 10) goto L5;
if (num2 < 100) goto L6;
if (num2 < 1000) goto L7;
} else {
num3 = num2 / 10000;
num2 -= num3 * 10000;
if (num3 >= 10) {
*ps++ = '0' + (char)(div = (num3*6554UL)>>16); num3 -= div*10;
*ps++ = ',';
}
*ps++ = '0' + (char)(num3);
}
*ps++ = '0' + (char)(div = (num2*8389UL)>>23); num2 -= div*1000;
L7:
*ps++ = '0' + (char)(div = (num2*5243UL)>>19); num2 -= div*100;
*ps++ = ',';
L6:
*ps++ = '0' + (char)(div = (num2*6554UL)>>16); num2 -= div*10;
L5:
*ps++ = '0' + (char)(num2);
}
*ps++ = '0' + (char)(div = (num1*8389UL)>>23); num1 -= div*1000;
*ps++ = ',';
L3:
*ps++ = '0' + (char)(div = (num1*5243UL)>>19); num1 -= div*100;
L2:
*ps++ = '0' + (char)(div = (num1*6554UL)>>16); num1 -= div*10;
L1:
*ps++ = '0' + (char)(num1);
*ps = '\0';
return ps - s;
}```

9. @Cyberfish, would you give us a description of how your program works, and what system you were timing this on?
I precompute substrings for all numbers from 0 to 999, including a leading comma.

eg, group[123] = ",123"

Then, I just build the string part by part.

Conveniently, it's 4 bytes per part, so 32-bit wide integer copy works.

That's pretty much it.

Then I scan for the first non-zero and non-comma character. That would be the start of the actual number. Then I move it to the front.

10. @iMalc:

A switch will work, too, no? Should be the same thing if it's compiled into a jump table.

Code:
```int length;
f (num1 < 10000) {
if (num1 < 10) length = 2;
if (num1 < 100) length = 3;
if (num1 < 1000) length = 4;
..

switch (length)```

11. Originally Posted by cyberfish
@iMalc:

A switch will work, too, no? Should be the same thing if it's compiled into a jump table.

Code:
```int length;
f (num1 < 10000) {
if (num1 < 10) length = 2;
if (num1 < 100) length = 3;
if (num1 < 1000) length = 4;
..

switch (length)```
Yup, that's the way I considered doing it, then having a bunch of cases with no break between them.
If there were a fast integer log10 function then it'd be sweet, but otherwise since the code was already going for speed over readability I got down and dirty with gotos.

In case anyone wants more information on how I've performed divisions using multiply and shift. The theory is that instead of doing answer = x / y you are doing answer = ((x << n) / y) >> n (the shifts logically cancel out). Rewritting the shift as a multiply by a power of two answer = (x * 2^n / y) >> n and all I'm doing is pre-calculating 2^n / y. From there it's simply a matter of picking a large enough n whilst avoiding overflow, and making sure you round up.
Take this (num2*5243)>>19 it performs a divide by 100 for any number up to 10000...

To pick those magic numbers I first picked a power of two that was large enough that when divided by the divisor it was at least half as big as the number I want to support and that when multiplied by answer, doesn't exceed 2^32
2^19 = 524288
divide by divisor -> 524288 / 100 = 5242.88 -> 5243 (rounding up) (at least half as big as 10000 - check)
multiplied by quotient -> 5243 * 10000 = 52430000 (less than 2^32 - check)

I could instead have picked 2^20 giving (num2*10486)>>20
2^20 = 1048576
divide by divisor -> 1048576/ 100 = 10485.76 -> 10486 (rounding up) (at least half as big as 10000 - check)
multiplied by quotient -> 10486 * 10000 = 104860000 (less than 2^32 - check)

12. There's supposed to be a log10 version here that takes ... maybe 20-40ish cycles?
SynthMaker Forum &bull; View topic - fast log10
I don't know how to use it, though, but if anyone else knows, it might be of help.

13. Originally Posted by Elysia
You are blaming me for code that you used?
I merely compiled your code, and I got that warning.
I cannot be held responsible for your use of stoneage compilers and usage of non-standard compliant code.
No, I'm blaming you for whining about the fact that different compilers, have different functions included in them. What was your first clue on that marvelous discovery?

If you don't want to run my programs/code, don't run them, but stop your whining.

14. How about you try writing standard compliant code instead? Then it will work regardless of what compiler you use. Clever, huh?

15. Originally Posted by cyberfish
I get 0.279s for 10000000 iterations of 1234567890.

But of course, that's not a fair comparison, because my compiler is vastly superior .
Hi cyberfish,
in order to have some realistic performance tests, try to test your
code with random generated numbers, so the compiler has to
really find a way to otimize, without some nasty trick about using
always the same number. :-)

Actually your test is about 5 times faster than the one I used last.
It is however faster than assembly tests somebody did, so it is quite
improbable that it is just the better compiler, something else should be
the reason.

With random and different numbers for each formatting we'll have more
accurate and reliable results.

Let me know if these miracolous results are still there after this kind of test.