PGO is amazing

**cyberfish** · 02-05-2009

I just tried out PGO (profile-guided optimization) on my program using GCC 4.3 on Linux. The result was quite amazing.

Sped up my program (a chess engine, 100% CPU) by almost 18%-20% (repeated a few times).

Code:

g++ -O3 -march=native -pg -fprofile-generate ...
//run my program's benchmark
g++ -O3 -march=native -fprofile-use ...

vs

Code:

g++ -O3 -march=native ...

I guess it could be due to the number of hard-to-predict branches in my code (I don't expect GCC to know there are usually no more than 2 queens on the board...).

I remember trying it out in the earlier (GCC 4.0?), and it didn't really help. Surprised to see such a huge gain this time.

I'm beginning to fall in love with GCC.

So my suggestion: try it out if you haven't, it's quite a bit of performance gain for little cost (especially if your program already has a benchmark function, then everything can be coded into Makefile, and the only cost then would be that compilation takes longer). Of course, this only applies to CPU-intensive programs.

I suspect MSVC has something like this, too, but I haven't looked into that.

**matsp** · 02-05-2009

I'm not sure msvc has such an option, but I know some of the other commercial compilers do (Intel, PGI).

And yes, it can be very effective, not only for hard facts on branches, but also for stuff like "do we inline function X or function Y", as well as using registers in the places that count.

--
Mats

**brewbuck** · 02-05-2009

If only a 20% speed improvement meant that you can search 20% deeper. Of course, it doesn't...

For years I've hacked on programs which play the game of Amazons. This game has an astounding branching factor of 2176 on the first move. Even assuming perfect alpha-beta, that's a tree growth factor per half-ply of 47 in the opening game.

What this means, depressingly, is that to search only a single ply deeper (remembering that a ply is a move followed by a counter move) in a given time, I must speed up my program by 2176 times.

Counterintuitively, it is better to instead make your evaluation heuristic more "intelligent" even if it makes it slower. You aren't going to lose much (maybe half a ply) but your estimates are more robust and so you can perform better at those shallower search depths. Or to think of it another way, even if I made my evaluation function 2176 times more complicated, I'd only lose a single ply.

Chess is somewhat different in that the branching factor is MUCH lower, and it can vary significantly from ply to ply. A 20% speed increase is nothing to sneeze at, but I have to wonder how this has actually affected your game play performance.

**tabstop** · 02-05-2009

I don't know if the fish has implemented it, but most chess games are played on the clock, so even taking 20% less time on every move means more moves you can make before going into "panic mode".

**brewbuck** · 02-05-2009

Originally Posted by tabstop

I don't know if the fish has implemented it, but most chess games are played on the clock, so even taking 20% less time on every move means more moves you can make before going into "panic mode".

Yes, but if you could have gotten another ply deeper if you had only searched for 0.3 seconds longer, it would be sad to miss out on the better evaluation (and possibly better move). The problem is being able to gauge how deep you can search from your current position in a given amount of time, and weigh that against how much time you have left on the clock.

In Amazons for instance, it's pointless to search more than one ply on the first move, because you just ain't going to get it in any realistic amount of time. So say you can make a move within 0.5 seconds. A dozen moves later, you might be able to get an extra ply in 15 seconds -- much longer than 0.5 seconds, but hey, it's an extra ply.

The complexity is deciding how to make the tradeoff, at what phase of the game, etc.

**mike_g** · 02-05-2009

I agree that a good heuristic is far more important than bit poking optimizations for this kind of stuff. But I guess a 20% speed increase is worth having if its no extra effort.

**cyberfish** · 02-05-2009

If only a 20% speed improvement meant that you can search 20% deeper. Of course, it doesn't...

For years I've hacked on programs which play the game of Amazons. This game has an astounding branching factor of 2176 on the first move. Even assuming perfect alpha-beta, that's a tree growth factor per half-ply of 47 in the opening game.

What this means, depressingly, is that to search only a single ply deeper (remembering that a ply is a move followed by a counter move) in a given time, I must speed up my program by 2176 times.

Counterintuitively, it is better to instead make your evaluation heuristic more "intelligent" even if it makes it slower. You aren't going to lose much (maybe half a ply) but your estimates are more robust and so you can perform better at those shallower search depths. Or to think of it another way, even if I made my evaluation function 2176 times more complicated, I'd only lose a single ply.

Chess is somewhat different in that the branching factor is MUCH lower, and it can vary significantly from ply to ply. A 20% speed increase is nothing to sneeze at, but I have to wonder how this has actually affected your game play performance.

In chess, for a naive implementation, the branching factor is about 30 on average, which means, assuming 2Mnps (million nodes per second), which is about right for programs running on a modern PC, a program can search about 5 plies (log(20,000,000)/log(30)) in 10 seconds (or 10 plies maximum using alpha-beta with perfect move ordering, realistically 6-7 plies). That is obviously no good. Human masters can often look much further ahead than that.

Therefore, people made chess engines much more selective about what moves to search further (extensions on captures, checks, promotions, for example), and what move to search shallower (reductions on moves that "fail low" in alpha-beta after a shallow search even after the moving side skips a move). And then there are things like transposition table. Since a position is likely to be searched multiple times in the tree through different paths (transpositions), results from a previous search is used if the depth is sufficient.

With all kinds of clever tricks, that cuts down the "effective branching factor" in modern chess engines to about 2-4, and that allows programs to search to about 13-15 plies in mid game (>20 on supercomputers).

With an effective branching factor that small, a 20% speed increase is quite substantial. It's estimated that every doubling of computing power (or speed, or time) gives ~50 Elo points, so 20% would bring about 50*log2(1.2), about 13 Elo points, which is definitely measurable (the one stronger by 13 Elo has a winning probability of ~52%).

My program has a website if anyone's interested
http://cyberfish.wecheer.com/Brainless/

**cyberfish** · 02-05-2009

I just tested PGO with ICC 11 (free Linux version),

Code:

Compiler - Mnps (best of 3 runs)
GCC          - 2.736
GCC+PGO - 3.045
ICC           - 2.794
ICC+PGO  - 2.813

Seems like ICC produces slightly faster binary by default, and GCC produces A LOT faster binary with PGO.

Google says that MSVC 2005 and up also has PGO support, and is available as a dll (?) that is only included in the Pro version, so I can't test that.

For ICC, I used

Code:

icc -fast -march=core2 -prof-gen ...
//benchmark run
icc -fast -march=core2 -prof-use ...

**Thaidog** · 02-13-2009

Originally Posted by cyberfish

I just tested PGO with ICC 11 (free Linux version),

Code:

Compiler - Mnps (best of 3 runs)
GCC          - 2.736
GCC+PGO - 3.045
ICC           - 2.794
ICC+PGO  - 2.813

Seems like ICC produces slightly faster binary by default, and GCC produces A LOT faster binary with PGO.

Google says that MSVC 2005 and up also has PGO support, and is available as a dll (?) that is only included in the Pro version, so I can't test that.

For ICC, I used

Code:

icc -fast -march=core2 -prof-gen ...
//benchmark run
icc -fast -march=core2 -prof-use ...

I am working on a project were we are compiling the kernel with ICC: www.linuxdna.com

We are working on getting PGO working for that... perhaps you are interested?

Thread: PGO is amazing

Thread Tools

Search Thread

Display

PGO is amazing

Similar Threads

Amazing Pool Tricks

Hiking in the Rockies..Amazing!

fastest sort algorithm ever??? it's just been found!!! simply amazing!

It's amazing...

Its...Its...Its AMAZING!!!