Here's a good link with an explanation of the technique. Hopefully not too math heavy.
My homepage
Advice: Take only as directed - If symptoms persist, please see your debugger
Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"
Here's an updated version which includes some small lookup tables for a futher slight improvement.
I haven't updated the testbed at this stage though.
Thanks iMalc.
Can you explain why you are declaring:
Code:
int itoa(long snum, char s[])
and using:
Code:
return ps - s;
?
What are these for? I rewrote them this way:
Code:
void itoa(long snum, char s[])
and
Code:
return;
because we are not using a return value, so pushing a return value onto the stack uses CPU cycles for no purpose, at least in this context.
By the way, the performance I get is not any better than the previous
one, even 2-3 ms slower. (?)
One last question, if you'd like to elaborate:
You said the algorithm you are applying here is not the best you
can get, because compilers already do that even better.
What would be the difference between your version and the version
a smart compiler produces?
Thanks again for your collaboration. The idea behind this optimization
is very clever and I hope I'll be able to grasp it one day. :-)
Last edited by frktons; 07-10-2010 at 02:23 AM.
Originally Posted by frktons:
What are these for?

To return the length of the string.
Look up a C++ Reference and learn How To Ask Questions The Smart Way
Originally Posted by Bjarne Stroustrup (2000-10-14)
Originally Posted by frktons:
By the way, the performance I get is not any better than the previous one, even 2-3 ms slower. (?)
One last question, if you'd like to elaborate:
You said the algorithm you are applying here is not the best you can get, because compilers already do that even better.
What would be the difference between your version and the version a smart compiler produces?

Not to answer in iMalc's stead, but I did try to warn you that you've hit a performance plateau.
And the compiler being better at picking instructions is a positive thing.
The fastest version is not meant to be a nearly unreadable set of instructions like the ones you've been getting. Normally you would use math to do this, as iMalc and several other people have, and minimize complicated logic. Then you turn on the compiler's optimization switch (provided you understand what your options are), and the compiler will do things in the underlying assembly to gain speed.
iMalc has gone into detail about how said optimization applies to his code.
If you attempt to do such optimizations in plain code, without knowing more about what you are doing than anyone usually does, you will just convince the compiler to use different assembly instructions for the operation, which may or may not be faster.
I ask because I have some doubts. When somebody posted a zero ms result versus 100-200 ms, my logic freaked out. So I asked some smart Assembly guys, who just replied: "NO WAY, show us the Assembly produced by the compiler and we'll show you what nasty trick the compiler is playing".
Now, with random numbers, the miracle of "0 ms performance" is over. Those guys explained to me more or less what the compiler was doing.
Performance is not the main subject of this thread; we are testing the various posted solutions for correctness, so if we have reached a performance plateau it doesn't really matter.
I'm here to learn, or to have a look at what to learn in the future.
Curiosity is not an unforgivable sin. I just asked iMalc why he is not happy about applying the math that he showed us.
I can understand your warnings; they are pretty much correct. Unfortunately, the endless debate between Assembly guys and compiler adepts is not going to stop. I'm not in that debate, and I don't want to be involved.
Try to think of it this way: some 30 years ago I was playing with COBOL on big old mainframes. Now that I'm free from that hardware and can play with new toys, I really appreciate the freedom to make all the mistakes I wasn't allowed to make back then.
Bear with me, my friend, and let me learn it the hard way. :-)
Last edited by frktons; 07-10-2010 at 04:27 AM.
You are being paranoid. The time taken there is negligible compared to the rest of the code. It probably won't impact the result at all.
As for the 0 ms thingy, it's like I said. The compiler simply removes all of the code because it detects that it actually does nothing. You calculate a result and then throw it away; since you're not storing it anywhere, the compiler can simply remove the code and gain speed.
So the code itself isn't super hyper ultra fast: it's gone. That's what I meant. iMalc's code allowed the compiler to figure out enough about what it was doing to deem it unnecessary.
Thanks Elysia. If I remember correctly, GCC was playing the same nasty trick with my code as well, not only with iMalc's, as cyberfish revealed.
About the first part of your comment I agree almost completely, except for
the very first part. :-) Don't worry about my mental sanity, I can make it. ;-)
Yup, string length, as already noted by laserlight.

Originally Posted by frktons:
By the way, the performance I get is not any better than the previous one, even 2-3 ms slower. (?)

I got a slight improvement using Visual Studio, but it's certainly small enough that it's likely to vary from slightly better to slightly worse depending on compiler, platform, etc. It trades a multiplication for a table lookup, which isn't always a win, I guess.
Originally Posted by frktons:
One last question, if you'd like to elaborate:
You said the algorithm you are applying here is not the best you can get, because compilers already do that even better.
What would be the difference between your version and the version a smart compiler produces?

Because the multiplicative inverse I was using only allowed me to perform the simulated division on numbers up to 10000, not 100000, you'll notice that I had to split the number up into three variables, num1, num2, and num3, each being at most 9999. When the compiler performs the same multiplicative-inverse trick, it can use instructions that calculate the full 64-bit multiplication result and keep only the highest 32 bits. This means it doesn't need the extra step of breaking the number up into num1, num2, and num3 first, possibly freeing up some registers as well as avoiding the two divides I had.
However, on the plus side, the code I wrote may end up around the same size overall: the constants used in the machine code are smaller than the ones the compiler would use, since the compiler can't assume the numbers are below 10000. The compiler may also generate slightly more code when doing this itself, to make sure it doesn't overflow a 64-bit multiplication, as it can in some cases; see the article I linked earlier.
Originally Posted by frktons:
Thanks again for your collaboration. The idea behind this optimization is very clever and I hope I'll be able to grasp it one day. :-)

It probably helps to think of it in several parts:
- Loop unrolling.
- Multiplicative inverse to perform division.
- Reducing total branch count by reordering tests.
- Pointer arithmetic instead of array indexing.
- Lookup tables.
I did make a version with the same branching scheme that reduced the number of gotos to two, just to see how low I could get it, but I actually have to admit that the version with gotos was clearer, as the code wasn't nested about ten levels deep.
Which is pretty impressive of the compiler, if you ask me! It presumably would have had to determine that there were
no calls to other functions,
no accessing globals,
only writing to an array that was never read from,
and not using the return value.
Violating perhaps any one of those would probably have foiled its efforts.
Last edited by iMalc; 07-10-2010 at 02:34 PM.
@frktons:
Your machine is 64-bit. There is no "partial" 64-bit mode. If you are running a 64-bit OS, the CPU is running in 64-bit mode.
The C2D is 64-bit capable.
And I've decided to scrap my version. It's nothing special, just good old lookup tables.
Also, the C standard only guarantees a minimum RAND_MAX of 32767 (2^15 - 1). So it's possible (and very likely) that it's only generating numbers between about -32000 and 32000. To make sure you have the full range, do something like:
Code:
int random_number = (rand() << 16) ^ rand();
And you should really exclude the random number generation time from the timing, since it depends on the implementation (different compilers implement rand() differently: some go for higher-quality random numbers, some go for speed).
srand(0) does NOT make sure we are using the same random numbers. Using the same seed only guarantees you will get the same sequence FROM THE SAME IMPLEMENTATION. But that probably doesn't matter if the iteration count is big enough.
Random numbers are not very good for measuring performance, since we get different numbers much of the time.
But they are better for testing and making sure we try a greater range of numbers, and they stop the compiler from assuming the input will only ever be one kind of number.
Using srand(time(NULL)) is better than not using random numbers.
Why don't you just use a set distribution of numbers that covers the whole range?
Code:
for (i = 0; i < maximum; i += maximum/num_iterations) format(i);
sort of thing.
Not only that, but if maximum is 0x7FFFFFFF then you could easily overflow.
How about a random stepsize that gradually increases over time?
Code:
long stepsize = 0;
for (long i = 0x80000000; i <= 0x7FFFFFFF - stepsize; i += stepsize) {
    itoa(i, buffer);
    stepsize = 1 + rand() % ((((unsigned long)i) >> 7) + 1);
}
Last edited by iMalc; 07-11-2010 at 01:37 PM.