Is there a standard function?

**kermit** · 07-04-2010

Originally Posted by cyberfish

Core 2 Duo at 3GHz, 64-bit Linux, GCC 4.4.1.

Un-optimized version runs in 150 seconds, so your compiler is most certainly optimizing, just not nearly as much.

Did you make the fixes to the code in order to get rid of the compiler warnings, or did you run it 'as is'?

**Elysia** · 07-04-2010

Passing the length of the number to the function as an argument allowed the compiler to inline it again. The new optimization, removing the log10 call, shaved another second off that time (9 seconds now).
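For readers following along, here is one way the digit count could be obtained without `log10`. This is only a sketch, not the code Elysia actually used: the name `count_digits` is made up, and it assumes an unsigned 32-bit value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the thread): number of decimal digits
 * in n, found by comparing against powers of ten instead of calling
 * log10(). No floating point and no division involved. */
static int count_digits(uint32_t n)
{
    static const uint32_t pow10_table[9] = {
        10u, 100u, 1000u, 10000u, 100000u,
        1000000u, 10000000u, 100000000u, 1000000000u
    };
    int digits = 1;

    /* Each power of ten that n reaches adds one digit. */
    for (int i = 0; i < 9 && n >= pow10_table[i]; i++)
        digits++;

    assert(digits >= 1 && digits <= 10); /* uint32_t has at most 10 digits */
    return digits;
}
```

The caller can compute this once and pass the result in, which matches the idea of handing the length to the formatting function as an argument.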

**cyberfish** · 07-04-2010

I only got warnings for using %d for time_t, so I ran it as is.
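The `%d` warning appears because `time_t` is not guaranteed to be an `int`. A minimal sketch of one common way around it, assuming `time_t` is an integer type that fits in `long long` (true on mainstream platforms, but not required by the C standard). The helper name `format_elapsed` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper (not from the thread): formats two timestamps and
 * the elapsed seconds without passing a time_t to %d.
 * Assumption: time_t is an integer type representable as long long. */
static int format_elapsed(char *out, size_t cap, time_t start, time_t end)
{
    assert(out != NULL && cap > 0);

    /* difftime() returns the difference in seconds as a double,
     * whatever the underlying type of time_t is. */
    double elapsed = difftime(end, start);

    return snprintf(out, cap,
                    "init_time = %lld, end_time = %lld, elapsed = %.0f s",
                    (long long)start, (long long)end, elapsed);
}
```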

**cyberfish** · 07-04-2010

I'm not sure about MSVC, but gcc has a compiler flag that sets inline aggressiveness.

**Elysia** · 07-04-2010

Yes, there is __forceinline to try to force it to inline. So far I've managed to get it to inline. Now trying a division table.
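A small sketch of how a forced-inline request can be written so that it builds with both compilers mentioned here. The macro name `FORCE_INLINE` and the function `last_digit` are made up for illustration; `__forceinline` is the MSVC keyword and `__attribute__((always_inline))` is the GCC counterpart. Both are strong requests that the compiler can still decline in some situations.

```c
#include <assert.h>

/* Hypothetical macro (not from the thread): pick the compiler-specific
 * spelling of "please really inline this". */
#if defined(_MSC_VER)
#  define FORCE_INLINE static __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE static inline __attribute__((always_inline))
#else
#  define FORCE_INLINE static inline   /* plain hint on other compilers */
#endif

/* Illustrative function only: last decimal digit of a non-negative int. */
FORCE_INLINE int last_digit(int num)
{
    assert(num >= 0); /* this sketch does not handle negative input */
    return num % 10;
}
```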

**frktons** · 07-04-2010

Originally Posted by cyberfish

Core 2 Duo at 3GHz, 64-bit Linux, GCC 4.4.1.

Un-optimized version runs in 150 seconds, so your compiler is most certainly optimizing, just not nearly as much.

Yeah I was talking to frktons.

Well, mine is 2.4 GHz, about 20% slower, and now I'm curious about the
difference GCC and 64-bit mode are making. When I have enough time I'll
give them a look.

**Elysia** · 07-04-2010

Funny how my tests all hate lookup tables. They always turn out slower,
both in filling and in accessing. It seems it's faster to do the division twice.

**cyberfish** · 07-04-2010

64 bit doesn't make any difference. I got similar results in 32 bit. It's the compiler.

**Elysia** · 07-04-2010

Compilers are picky things. I know I got it down to around 9 seconds, but I cannot seem to reproduce it now.

**frktons** · 07-04-2010

Originally Posted by Elysia

Compilers are picky things. I know I got it down to around 9 seconds, but I cannot seem to reproduce it now.

What HW/SW are you using? Win7 64-bit/MSVC 2010 and what else?
What are your timings with the last code I posted?

**Elysia** · 07-04-2010

AMD Athlon II X2 250 (3 GHz).
I will try yours shortly.

EDIT: Your code: 39 seconds.
EDIT2: My code with your integer: 29 seconds.

Code:

 Address      Line  Source              Code Bytes  Timer samples
 0x13f3f11e1  181   remain = num % 10;              48.99

o_O

**frktons** · 07-04-2010

Originally Posted by Elysia

AMD Athlon II X2 250 (3 GHz).
I will try yours shortly.

EDIT: Your code: 39 seconds.
EDIT2: My code with your integer: 29 seconds.

Code:

 Address      Line  Source              Code Bytes  Timer samples
 0x13f3f11e1  181   remain = num % 10;              48.99

o_O

OK. Thanks Elysia. Enough for today. As time permits I'll be back with new
ideas. Have a nice time. :-)

**cyberfish** · 07-04-2010

One optimization I would do is to precompute a mapping of the integers 0-999 to their 3-character strings.

That would take 1000*3 = ~3KB of memory. No null terminators needed because you know they are exactly 3 characters long.

Then you can do % 1000 instead of % 10. I'm guessing that will make it at least twice as fast.

It will also make the logic simpler (since you want a comma every 3 digits) and eliminate a bunch of branches, which are also slow on modern processors.

And switching to a modern compiler will make this all not matter.
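The table idea described in this post could look roughly like the following. It is a sketch under assumptions, not code from the thread: the names `digit_table`, `init_digit_table` and `format_grouped` are invented, the input is assumed to be a 32-bit signed integer, and the leading group is written without zero padding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout (not from the thread).
 * 1000 entries of exactly 3 characters, no null terminators: 3000 bytes. */
static char digit_table[1000][3];

static void init_digit_table(void)
{
    for (int i = 0; i < 1000; i++) {
        digit_table[i][0] = (char)('0' + i / 100);
        digit_table[i][1] = (char)('0' + (i / 10) % 10);
        digit_table[i][2] = (char)('0' + i % 10);
    }
}

/* Writes num with a comma every 3 digits into out and returns the length.
 * out must hold at least 15 bytes ("-2,147,483,648" plus the terminator). */
static size_t format_grouped(int32_t num, char *out)
{
    assert(out != NULL);

    char tmp[16];
    char *p = tmp + sizeof tmp;  /* fill the buffer from the right */
    /* Unsigned negation so that INT32_MIN does not overflow. */
    uint32_t n = num < 0 ? 0u - (uint32_t)num : (uint32_t)num;

    /* One % 1000 per group instead of three % 10 steps plus a counter. */
    while (n >= 1000u) {
        uint32_t group = n % 1000u;
        n /= 1000u;
        p -= 3;
        memcpy(p, digit_table[group], 3);
        *--p = ',';
    }

    /* Leading group: 1 to 3 digits, skipping the table's zero padding. */
    const char *lead = digit_table[n];
    if (n >= 100u)     { p -= 3; memcpy(p, lead, 3); }
    else if (n >= 10u) { p -= 2; memcpy(p, lead + 1, 2); }
    else               { *--p = lead[2]; }

    if (num < 0)
        *--p = '-';

    size_t len = (size_t)(tmp + sizeof tmp - p);
    memcpy(out, p, len);
    out[len] = '\0';
    return len;
}
```

Call `init_digit_table()` once at startup, then `format_grouped(num, buf)` with a buffer of at least 15 bytes. Whether this is actually faster than the per-digit loop depends on the compiler and CPU, as the timings in this thread show.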

**Elysia** · 07-04-2010

Branching? Slow? I don't buy that argument so much.
Today's processors are pretty clever, and since the loop is pretty deterministic, it should easily be able to avoid branch misprediction.
I'm just speculating, though.

But that idea would probably see some speed gains.
I'm going to try it.

EDIT: 14 seconds vs 29 on x64.
EDIT2: With a slightly larger table and x64, it is possible to reduce the runtime to ~8 seconds (if you are willing to sacrifice ~8 MB memory).
The only problem is that you have to correct the numbers:

Code:

 The value of num is: -1234567890

 init_time = 1278283073

 The formatted value of num is: -001,234,567,890

 end_time  = 1278283081


 The routine test has taken about 8 seconds

 to perform +000,500,000,000 cycles of the formatting function

**cyberfish** · 07-04-2010

Hmm, when you go to 8MB of memory, I don't see how it can be faster.

8MB is larger than the L2 of most/all CPUs, and the values are pretty much random. Cache misses should be much more expensive than doing the calculations.

The loop is fine. The compiler will probably at least partially unroll it anyways.

I was talking about this one

Code:

if (count == 3)

It's true once every 3 iterations in each function call (not globally). I'm not sure if the branch predictor is that smart.

But the biggest difference would be the elimination of division (%).