Best speed switches for gcc
So far I've noticed that:
-O3 and -fomit-frame-pointer significantly speed up my proggies.
Are there other goodies I missed for maxing speed?
Barring for now an assembly rewrite of the choke points. I'll get there later.
Yes, I can RTFM, but what really works? vs. what's available.
Agreed!
But I'm trying to milk an O(n^2) algo for all it's worth. The O(n^2) is written in stone & I know what the hot spot is; in this rare case I knew it before I started coding. Of course there's plenty of stuff a newbie like me isn't doing efficiently, but I'm trying to coast as long as I can before I have to pedal hard.
OK, when I do this I get buried in the output:
-m elf_i386
Sooo... given that I'm working with an athlon I should reinstall gcc for my target processor? Giving it a better chance to optimize for the chip when I do the -O3? It should also allow it to use all those fancy MMX registers &c. Or is there a better way?
Thanks! I'm currently using gcc 3.2 so it has the Athlon switches. I'll experiment & time against my prog.
A few extra ticks thanks. Given how heavily this program will be hit in the next few months the difference *IS* significant.
When I use "-O3 -fomit-frame-pointer" the times cluster near:
real 1m10.404s
user 1m10.220s
sys 0m0.170s
By then adding "-mcpu=athlon-xp" the following is fairly typical:
real 1m5.608s
user 1m5.500s
sys 0m0.100s
Oh yea, without any of the above switches it takes over 4 times as long.
This is good to have before getting into hard to maintain coding tricks.
Thanks again.
I know that program optimization can be fickle. But I'm trying to learn what I can about gcc switches & I figured I'd share my findings. It is odd to see how many of the switches slow things down, but of course they may do the reverse in other situations.
I was able to lose another second with -fdelete-null-pointer-checks & -fschedule-insns2. But I would be skeptical about using them in other situations & my testing shows little or no gain as a rule.
On the other hand, the -O3, -fomit-frame-pointer & -mcpu=whatever seem to be winners on all of the compute bound programs I tested. Still a small N tho & limited to integer math.
Easy-to-read source code optimizations (only in the problem area) have also been big winners.
No. While reading the man pages on gcc I read:
-mcpu=cpu-type
This is identical to specifying both -march and -mtune.
So, your hypothesis is that -mtune could have a negative effect on speed? I didn't think of that & seeing how the other options can have a negative effect I'll give it a try.
Truthfully, my biggest surprise was that the option for pretouching the memory to load the cache line didn't have a net gain. This suggests to me that I'd better make sure that my malloc() structs are properly aligned; I thought that malloc() did that for me by default. This is of course a source code tweak.
Thanks for all the help. So much for me to learn, so few brain cells to do it with.
Nix that! Whoa! User time dropped from 0m48.910s to 0m37.400s.
Great catch! :D
A few more source code tweaks & the user time is now at 0m31.610s. I'm only posting this because those same tweaks, built with -mcpu instead of -march, were actually *slowing down* the times. This was causing me much confusion because they should have been stuffing more data into the cache for the inner loop & speeding things up. Now I know that I wasn't giving the compiler the correct info. An important lesson.
> Well they will be data aligned for sure -
> meaning you can store a data type with the
> most restrictive type (usually a double) at
> the address returned to you.
Yup, my tests confirm that I was writing nonsense about malloc() alignments. Everything is aligned to the 8s.
>It might be worth looking at the -fbranch-probabilities option
Unfortunately, -fbranch-probabilities actually slows the times down by almost exactly one second. Another switch that seems to be context sensitive. The problem area is a nested for loop so the switch seems to me to be a logical choice, but I've hoisted about as much as I can out of it. It's pretty lean at this point.
FYI: I'm doing a variant on the old edit distance problem, AKA dynamic programming. This algo searches for a best score in a 2D array. Each cell's score depends upon its neighbors to the North, East, and Northeast. The matrix cell structs are down to 2 ints, a nearly ideal size for caching. With some other tricks I've been able to reduce the "matrix" to 2 rows (well, there is one more for the 1st iteration) which I toggle between. And because I'm not using an actual matrix and the rows are a few hundred cells on average, it tends to be quite cacheable.
Hm, that gives me another idea... Anyway, with your help I'm learning a lot going thru this exercise. I'm going to have to pull the plug on this sooner or later, but given that you've helped me take a 2 week task down to under one, I think that the time has been well spent so far & deserves a few more edits. Thanks again.