I just tried out PGO (profile-guided optimization) on my program using GCC 4.3 on Linux. The result was quite amazing.

It sped up my program (a chess engine, 100% CPU-bound) by about 18-20%, and the gain held up across several repeated runs.

Code:
g++ -O3 -march=native -pg -fprofile-generate ...
# run my program's benchmark to collect the profile data
g++ -O3 -march=native -fprofile-use ...
vs
Code:
g++ -O3 -march=native ...
I guess it could be due to the number of hard-to-predict branches in my code (I don't expect GCC to know there are usually no more than 2 queens on the board...).
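To make the queen example concrete, here is a minimal sketch of the kind of branch I mean. It is not taken from my engine: the `Piece` enum, `count_queens`, `queen_material` and the piece values are all made up for illustration.

```cpp
#include <array>
#include <cstdint>

// Hypothetical piece codes, invented for this example.
enum Piece : std::uint8_t { Empty, Pawn, Knight, Bishop, Rook, Queen, King };

// Hypothetical queen value in centipawns.
constexpr int kQueenValue = 900;

// Counts the queens on a 64-square board.
// The comparison is false for almost every square in a real position,
// but nothing in the source tells the compiler that.
int count_queens(const std::array<Piece, 64>& board) {
    int queens = 0;
    for (Piece p : board) {
        if (p == Queen) {
            ++queens;
        }
    }
    return queens;
}

// Material contribution of the queens.
// More than two queens requires a promotion, so the first branch is rare
// in practice. The discount for extra queens is an assumption made up
// for this sketch.
int queen_material(int queens) {
    if (queens > 2) {
        return 2 * kQueenValue + (queens - 2) * (kQueenValue - 50);
    }
    return queens * kQueenValue;
}
```

When this is built with -fprofile-generate and run on the benchmark, the recorded branch counts should show that the `queens > 2` path is almost never taken, and -fprofile-use gives GCC that information when it optimizes. How much that helps will depend on the program.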

I remember trying it out with an earlier version (GCC 4.0?), and it didn't really help. I was surprised to see such a huge gain this time.

I'm beginning to fall in love with GCC.

So my suggestion: try it out if you haven't. It's a sizable performance gain for little cost, especially if your program already has a benchmark function. In that case the whole process can be scripted in the Makefile, and the only cost is that compilation takes longer. Of course, this only applies to CPU-intensive programs.

I suspect MSVC has something like this, too, but I haven't looked into that.