Best speed switches for gcc [Archive] - C Board



rafe
11-01-2002, 03:28 PM
So far I've noticed that:

-O3 and -fomit-frame-pointer significantly speed up my proggies.

Are there other goodies I missed for maxing speed?

Barring for now an assembly rewrite of the choke points. I'll get there later.


Yes, I can RTFM, but what really works, vs. what's merely available?

rafe
11-01-2002, 04:05 PM
Agreed!

But I'm trying to milk an O(n^2) algo for all it's worth. The O(n^2) is written in stone & I know what the hot spot is; in this rare case I knew it before I started coding. Of course there's plenty of stuff a newbie like me isn't doing efficiently, but I'm trying to coast as long as I can before I have to pedal hard.

rafe
11-04-2002, 09:39 AM
OK, when I do this I get, buried in the output:
-m elf_i386

Sooo... given that I'm working with an Athlon, should I reinstall gcc for my target processor? Giving it a better chance to optimize for the chip when I do the -O3? It should also allow it to use all those fancy MMX registers &c. Or is there a better way?

rafe
11-04-2002, 01:18 PM
Thanks! I'm currently using gcc 3.2 so it has the Athlon switches. I'll experiment & time against my prog.

rafe
11-04-2002, 02:46 PM
A few extra ticks, thanks. Given how heavily this program will be hit in the next few months, the difference *IS* significant.

When I use "-O3 -fomit-frame-pointer" the times cluster near:
real 1m10.404s
user 1m10.220s
sys 0m0.170s

After then adding "-mcpu=athlon-xp" the following is fairly typical:
real 1m5.608s
user 1m5.500s
sys 0m0.100s

Oh yeah, without any of the above switches it takes over 4 times as long.

This is good to have before getting into hard to maintain coding tricks.

Thanks again.

rafe
11-05-2002, 01:25 PM
I know that program optimization can be fickle. But I'm trying to learn what I can about gcc switches & I figured I'd share my findings. It is odd to see how many of the switches slow things down, but of course they may do the reverse in other situations.

I was able to lose another second with -fdelete-null-pointer-checks & -fschedule-insns2. But I would be skeptical about using them in other situations; my testing shows little or no gain as a rule.

On the other hand, the -O3, -fomit-frame-pointer & -mcpu=whatever seem to be winners on all of the compute bound programs I tested. Still a small N tho & limited to integer math.

Easy-to-read source code optimizations (only in the problem area) have also been big winners.

rafe
11-06-2002, 10:16 AM
No. While reading the man pages on gcc I read:
-mcpu=cpu-type
This is identical to specifying both -march and -mtune.
So, your hypothesis is that -mtune could have a negative effect on speed? I didn't think of that & seeing how the other options can have a negative effect I'll give it a try.

Truthfully, my biggest surprise was that the option for pretouching the memory to load the cache line didn't have a net gain. This suggests to me that I'd better make sure that my malloc() structs are properly aligned, I thought that malloc() did that for me by default. This is of course a source code tweak.

Thanks for all the help. So much for me to learn, so few brain cells to do it with.

rafe
11-06-2002, 10:28 AM
Nix that! Whoa! User time dropped from 0m48.910s to 0m37.400s.

Great catch! :D :D :D :D :D

rafe
11-06-2002, 11:26 AM
A few more source code tweaks & the user time is now at 0m31.610s. I'm only posting this because those same tweaks, when built with -mcpu instead of -march, were actually *slowing down* the times. This was causing me much confusion because they should have been stuffing more data into the cache for the inner loop & speeding things up. Now I know that I wasn't giving the compiler the correct info. An important lesson.

rafe
11-06-2002, 12:43 PM
> Well they will be data aligned for sure -
> meaning you can store a data type with the
> most restrictive type (usually a double) at
> the address returned to you.
Yup, my tests confirm that I was writing nonsense about malloc() alignments. Everything is aligned to the 8s.
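For anyone who wants to repeat the alignment check, here is a minimal sketch of one way to do it. The struct layout and the names (struct cell, aux, is_aligned, cells_are_8_aligned) are made up for illustration and are not the actual code from my program; it only assumes a cell of two ints and a C99 compiler.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical cell layout: two ints, as described in the thread. */
struct cell {
    int score;
    int aux;   /* placeholder name for the second int */
};

/* Returns 1 if the address p is a multiple of `alignment` bytes, else 0. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Allocates a row of n cells and reports whether malloc() handed back
   an 8-byte-aligned block. Returns -1 if the allocation fails. */
int cells_are_8_aligned(size_t n)
{
    struct cell *row = malloc(n * sizeof *row);
    if (row == NULL)
        return -1;

    int ok = is_aligned(row, 8);
    free(row);
    return ok;
}
```

Calling cells_are_8_aligned(300) for a row of a few hundred cells should report 1 on typical desktop platforms, since malloc() has to return memory suitably aligned for any standard type.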

>It might be worth looking at the -fbranch-probabilities option
Unfortunately, -fbranch-probabilities actually slows the times down by almost exactly one second. Another switch that seems to be context sensitive. The problem area is a nested for loop, so the switch seems to be a logical choice, but I've hoisted about as much as I can out of it. It's pretty lean at this point.
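To show what "hoisting" means here, a small sketch on a made-up example (summing a flat 2D array), not my actual inner loop. The function names are hypothetical. An optimizing compiler will often do this particular transformation on its own, so treat it as an illustration of the idea rather than a guaranteed win.

```c
#include <stddef.h>

/* Before: the row offset i * cols is recomputed on every inner iteration. */
long sum_naive(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            total += m[i * cols + j];
    return total;
}

/* After: the row pointer does not change inside the inner loop,
   so it is computed once per row and hoisted out. */
long sum_hoisted(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    for (size_t i = 0; i < rows; i++) {
        const int *row = m + i * cols;   /* loop-invariant for the j loop */
        for (size_t j = 0; j < cols; j++)
            total += row[j];
    }
    return total;
}
```

Both versions return the same sum; only the amount of work inside the inner loop differs.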

FYI: I'm doing a variant on the old edit distance problem, AKA dynamic programming. This algo searches for a best score in a 2D array. Each cell's score depends upon its neighbors to the North, East, and Northeast. The matrix cell structs are down to 2 ints. A nearly ideal size for caching. With some other tricks I've been able to reduce the "matrix" to 2 rows (well there is one more for the 1st iteration) which I toggle between. And because I'm not using an actual matrix and the rows are a few hundred cells on average it tends to be quite cachable.
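The two-row trick can be sketched on the textbook problem. What follows is plain Levenshtein distance, not my variant: it keeps one int per cell instead of a 2-int struct, and it reads the neighbors above, to the left, and diagonally above-left. The function name edit_distance is just for illustration, and it assumes a C99 compiler.

```c
#include <stdlib.h>
#include <string.h>

/* Classic Levenshtein distance between a and b, using two rolling rows
   instead of a full (la+1) x (lb+1) matrix.
   Returns -1 if memory allocation fails. */
int edit_distance(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    int *prev = malloc((lb + 1) * sizeof *prev);
    int *curr = malloc((lb + 1) * sizeof *curr);
    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return -1;
    }

    /* Row 0: turning the empty prefix of a into b[0..j) costs j inserts. */
    for (size_t j = 0; j <= lb; j++)
        prev[j] = (int)j;

    for (size_t i = 1; i <= la; i++) {
        curr[0] = (int)i;   /* deleting i characters of a */
        for (size_t j = 1; j <= lb; j++) {
            int subst = prev[j - 1] + (a[i - 1] != b[j - 1]); /* diagonal */
            int del   = prev[j] + 1;                          /* above */
            int ins   = curr[j - 1] + 1;                      /* left */
            int best  = subst < del ? subst : del;
            curr[j]   = best < ins ? best : ins;
        }
        /* Toggle the rows: the finished row becomes "previous". */
        int *tmp = prev;
        prev = curr;
        curr = tmp;
    }

    int result = prev[lb];
    free(prev);
    free(curr);
    return result;
}
```

Memory use is two rows of lb + 1 ints no matter how long a is, which is why rows of a few hundred cells stay comfortably in cache.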

Hm, that gives me another idea... Anyway, with your help I'm learning a lot going thru this exercise. I'm going to have to pull the plug on this sooner or later, but given that you've helped me take a 2 week task down to under one, I think that the time has been well spent so far & deserves a few more edits. Thanks again.