Here's a good link with an explanation of the technique. Hopefully not too math heavy.
My homepage
Advice: Take only as directed - If symptoms persist, please see your debugger
Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"
Here's an updated version which includes some small lookup tables for a futher slight improvement.
I haven't updated the testbed at this stage though.
Thanks iMalc.
Can you explain why you are declaring:
Code:
int itoa(long snum, char s[])
and using:
Code:
return ps - s;
?
What are these for? I rewrote them this way:
Code:
void itoa(long snum, char s[])
and
Code:
return;
because we are not using a return value, so pushing a return value onto the stack uses CPU cycles for no purpose, at least in this context.
By the way, the performance I get is not any better than the previous
one, even 2-3 ms slower. (?)
One last question, if you'd like to elaborate:
You said the algorithm you are applying here is not the best you
can get, because compilers already do that even better.
What would be the difference between your version and the version
a smart compiler produces?
Thanks again for your collaboration. The idea behind this optimization
is very clever and I hope I'll be able to grasp it one day. :-)
Last edited by frktons; 07-10-2010 at 02:23 AM.
Originally Posted by frktons:
What are these for?

To return the length of the string.
Look up a C++ Reference and learn How To Ask Questions The Smart Way
Originally Posted by Bjarne Stroustrup (2000-10-14)
Originally Posted by frktons:
By the way, the performance I get is not any better than the previous one, even 2-3 ms slower. (?)
One last question, if you'd like to elaborate:
You said the algorithm you are applying here is not the best you can get, because compilers already do that even better.
What would be the difference between your version and the version a smart compiler produces?

Not to answer in iMalc's stead, but I did try to warn you that you've hit a performance plateau.
And the compiler being better at picking instructions is a positive thing.
The fastest version is not meant to be a nearly unreadable set of instructions like the ones you've been getting. Normally you would use math to do this, as iMalc and several other people have, and minimize complicated logic. Then you turn on the compiler's optimization switch (provided you understand what your options are), and the compiler will do things in the underlying assembly to gain speed.
iMalc has gone into detail about how said optimization applies to his code.
If you attempt to do such optimizations in plain code, without knowing more about what you are doing than anyone usually does, you will just convince the compiler to use different assembly instructions for the operation, which may or may not be faster.
I ask because I have some doubts. When somebody posted a zero ms result versus 100-200 ms, my logic freaked out. So I asked some smart Assembly guys, who just replied: "NO WAY, show us the Assembly produced by the compiler and we'll show you what nasty trick the compiler is playing".
Now, with random numbers, the miracle of "0 ms performance" is over. Those guys explained to me more or less what the compiler was doing.
Performance is not the main subject of this thread; we are testing the various posted solutions for correctness, so if we have reached a performance plateau it doesn't really matter.
I'm here to learn, or to have a look at what to learn in the future.
Curiosity is not an unforgivable sin. I just asked iMalc why he is not happy about applying the math that he showed us.
I can understand your warnings; they are pretty much correct. Unfortunately, the endless debate between Assembly guys and compiler adepts is not going to stop. I'm not in that debate, and I don't want to be involved.
Try to think of it this way: some 30 years ago I was playing with COBOL on big old mainframes. Now that I'm free from that hardware and can play with new toys, I really appreciate the freedom to make all the mistakes I wasn't allowed to make back then.
Bear with me, my friend, and let me learn it the hard way. :-)
Last edited by frktons; 07-10-2010 at 04:27 AM.
You are being paranoid. The time taken there is negligible compared to the rest of the code. It probably won't impact the result at all.
As for the 0 ms thingy, it's like I said. The compiler simply removes all of the code because it detects that it actually does nothing. You calculate a result and then throw it away; since you're not storing it anywhere, the compiler can simply remove the code and gain speed.
So the code itself isn't super hyper ultra fast: it's gone. That's what I meant. iMalc's code allowed the compiler to figure out enough about what it was doing to deem it unnecessary.
Thanks Elysia. If I remember correctly, GCC was playing the same nasty trick with my code as well, not only with iMalc's, as cyberfish revealed.
About the first part of your comment I agree almost completely, except for
the very first part. :-) Don't worry about my mental sanity, I can make it. ;-)
Yup, string length, as already noted by laserlight.

Originally Posted by frktons:
By the way, the performance I get is not any better than the previous one, even 2-3 ms slower. (?)

I got a slight improvement using Visual Studio, but it's certainly small enough that it's likely to vary from slightly better to slightly worse depending on compiler, platform, etc. It trades a multiplication for a table lookup, which isn't always a win, I guess.
Originally Posted by frktons:
One last question, if you'd like to elaborate:
You said the algorithm you are applying here is not the best you can get, because compilers already do that even better.
What would be the difference between your version and the version a smart compiler produces?

Because the multiplicative inverse I was using only allowed me to perform the simulated division on numbers up to 10000, not 100000, you'll notice that I had to split the number up into three variables, num1, num2, and num3, each being at most 9999. When the compiler performs the same multiplicative-inverse trick, it can use instructions that calculate the full 64-bit multiplication result and keep only the highest 32 bits. This means it doesn't need the extra step of breaking the number up into num1, num2, and num3 first, possibly freeing up some registers as well as avoiding the two divides I had.
However, on the plus side, the code I wrote may end up around the same size overall: the constants used in the machine code are smaller than the ones the compiler would use, since the compiler can't assume the numbers are below 10000. The compiler may also generate slightly more code when doing this itself, to make sure it doesn't overflow a 64-bit multiplication, as it can in some cases; see the article I linked earlier.
Originally Posted by frktons:
Thanks again for your collaboration. The idea behind this optimization is very clever and I hope I'll be able to grasp it one day. :-)

It probably helps to think of it in several parts:
- Loop unrolling.
- Multiplicative inverse to perform division.
- Reducing total branch count by reordering tests.
- Pointer arithmetic instead of array indexing.
- Lookup tables.
I did make a version with the same branching scheme that reduced the number of gotos to two, just to see how low I could get it, but I actually have to admit that the version with gotos was clearer, as the code wasn't nested about ten levels deep.
Which is pretty impressive of the compiler, if you ask me! It presumably would have had to determine that there were
no calls to other functions,
no accessing globals,
only writing to an array that was never read from,
and not using the return value.
Violating perhaps any one of those would probably have foiled its efforts.
Last edited by iMalc; 07-10-2010 at 02:34 PM.
@frktons:
Your machine is 64-bit. There is no "partial" 64-bit mode. If you are running a 64-bit OS, the CPU is running in 64-bit mode.
The C2D is 64-bit capable.
And I've decided to scrap my version. It's nothing special, just good old lookup tables.
Also, the C standard only guarantees a minimum RAND_MAX of 32767 (2^15 - 1). So it's possible (and very likely) that it's only generating numbers between about -32000 and 32000. To make sure you have the full range, do something like:
Code:
int random_number = (rand() << 16) ^ rand();
And you should really exclude the random number generation time from the timing, since it depends on the implementation (different compilers implement rand() differently: some go for higher-quality random numbers, some go for speed).
srand(0) does NOT make sure we are using the same random numbers. Using the same seed only guarantees you will get the same sequence FROM THE SAME IMPLEMENTATION. But that probably doesn't matter if the iteration count is big enough.
Random numbers are not very good for measuring performance, since we get different numbers much of the time.
But they are better for testing and making sure we try a greater range of numbers, and they stop the compiler from assuming the input will only ever be one kind of number.
Using srand(time(NULL)) is better than not using random numbers.
Why don't you just use a set distribution of numbers that covers the whole range?
Code:
for (i = 0; i < maximum; i += maximum/num_iterations) format(i);
sort of thing.
Not only that, but if maximum is 0x7FFFFFFF then you could easily overflow.
How about a random stepsize that gradually increases over time?
Code:
long stepsize = 0;
for (long i = 0x80000000; i <= 0x7FFFFFFF - stepsize; i += stepsize) {
    itoa(i, buffer);
    stepsize = 1 + rand() % ((((unsigned long)i) >> 7) + 1);
}
Last edited by iMalc; 07-11-2010 at 01:37 PM.