Weird Performance

10-12-2007
khdani

Weird Performance

Hello,
I've run into the following issue and can't figure out why it is happening.
I've written a particle engine in C/OpenGL. My hardware is:
-----------------------------------
Intel DUO 1.6 GHz
1 GB RAM
ATI Radeon x1050 PCI Express 512MB RAM (HyperMemory)
-----------------------------------
I ran exactly the same program on:
----------------------------------
Pentium 4 2.6 GHz
512MB RAM
Intel OnBoard Video Card 64MB
----------------------------------
and the program ran much faster on the Pentium 4 computer. I could even
use 4 times as many particles, each 9 times larger, and the program still ran faster than it initially did on my PC.
Can anyone please explain why my PC is so much slower?
10-12-2007
VirtualAce

Without code it's anyone's guess.
10-12-2007
brewbuck

The 2.6 GHz Pentium may have some instructions (which your program uses) that are a LOT faster than the simple ratio of clock speeds would suggest. Your compiler might be producing SIMD code as well as emulated-SIMD code, and the emulated code is what ends up running on the slower system, making the faster one look MUCH faster. Or you could be having cache issues. Perhaps the cache on the 2.6 GHz Pentium is better behaved in your case.

I've seen lots of problems like this. Just this week I've been dealing with two very similar versions of the same program, one of which runs 3x slower than the other -- but only on certain systems. On a few systems, it runs FASTER. And on some others, it runs in the same time. The problem is variations in the amount and quality of CPU cache on various systems. Small tweaks to algorithms have very little impact on certain machines; on others they absolutely kill performance.
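To make the cache point concrete, here is a minimal sketch in C, not taken from anyone's actual program. The function names and the matrix layout are assumptions for illustration. Both functions compute the same sum; the only difference is the order in which memory is touched. On a large enough matrix the second one tends to be slower, and by how much depends heavily on the machine's cache.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Sum a rows x cols matrix stored row by row, reading memory sequentially. */
long sum_row_major(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    assert(m != NULL);
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += m[r * cols + c];
    return total;
}

/* Same sum, but consecutive reads are `cols` elements apart in memory. */
long sum_col_major(const int *m, size_t rows, size_t cols)
{
    long total = 0;
    assert(m != NULL);
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            total += m[r * cols + c];
    return total;
}

/* Time one call of `fn` with clock(); the sum itself is stored in *out. */
double time_sum(long (*fn)(const int *, size_t, size_t),
                const int *m, size_t rows, size_t cols, long *out)
{
    clock_t start = clock();
    *out = fn(m, rows, cols);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_sum() with each function on the same matrix (several thousand rows and columns, so it does not fit in cache) and comparing the two times on both machines gives a rough idea of how differently their memory systems behave.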
10-12-2007
BobMcGee123

aliens
10-13-2007
khdani

Thanks brewbuck :)
If I understand you correctly, I need to rewrite my program for multithreading if I want it to run faster on the DUO, right?
10-13-2007
CornedBee

That's one option, but it probably won't help you in the graphics case.

Also, a Core 1 1.6 GHz CPU should be able to keep up with a P4 2.6 GHz. Perhaps your compiler is tuned to the P4 and generates code that is considerably more efficient there.

But you'd need detailed profiling to find out.
10-16-2007
khdani

Yes, the compiler does indeed generate code for the P4, but the OpenGL libraries are 32-bit,
so I can't recompile for 64-bit. Are there any other options?
10-16-2007
CornedBee

A Core1 doesn't have 64-bit mode. It's not what I meant, anyway. I'm saying the compiler is tuned to the P4. That means that the way it generates instructions is optimized for the P4's endless instruction pipe, instead of the Core's far shorter one.
10-16-2007
matsp

Quote:

Originally Posted by CornedBee

A Core1 doesn't have 64-bit mode. It's not what I meant, anyway. I'm saying the compiler is tuned to the P4. That means that the way it generates instructions is optimized for the P4's endless instruction pipe, instead of the Core's far shorter one.

Somehow I find this rather unlikely. In my experience, code written for long pipelines will run well on short pipeline processors (e.g. Athlon-64 vs. P4 using the same code), but the other way around isn't always true.

--
Mats
10-16-2007
CornedBee

*shrug* Just a suggestion.

Anyway, what you really should do is profile the program and find out if the performance difference comes from a general slowdown or if there is a specific bit of code that is so much slower.
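One low-tech way to tell a general slowdown from a single slow spot is to time the main phases of each frame by hand. The sketch below is only an illustration: the phase names (update, draw) are hypothetical and would need to match whatever the particle engine actually does per frame.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical phases of one frame; rename to match the real program. */
enum phase { PHASE_UPDATE, PHASE_DRAW, PHASE_COUNT };

struct phase_timer {
    clock_t started[PHASE_COUNT]; /* clock() value at the last phase_begin */
    double  total[PHASE_COUNT];   /* accumulated seconds per phase */
};

/* Mark the start of phase p. */
void phase_begin(struct phase_timer *t, enum phase p)
{
    assert(p < PHASE_COUNT);
    t->started[p] = clock();
}

/* Mark the end of phase p and add the elapsed time to its total. */
void phase_end(struct phase_timer *t, enum phase p)
{
    assert(p < PHASE_COUNT);
    t->total[p] += (double)(clock() - t->started[p]) / CLOCKS_PER_SEC;
}

/* Fraction of all measured time spent in phase p (0 if nothing measured). */
double phase_share(const struct phase_timer *t, enum phase p)
{
    double sum = 0.0;
    for (int i = 0; i < PHASE_COUNT; i++)
        sum += t->total[i];
    return sum > 0.0 ? t->total[p] / sum : 0.0;
}
```

Wrap the particle update and the OpenGL drawing calls in phase_begin()/phase_end() pairs, run a few hundred frames on each machine, and compare phase_share() for each phase. If the shares are similar on both machines, the slowdown is general; if one phase dominates only on the slow machine, that is where to look. Note that clock() reports processor time on many systems, so time spent waiting on the graphics driver may not show up; a wall-clock timer may be a better fit in that case.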
10-16-2007
matsp

That is, indeed, a good suggestion. May I suggest a trial version of VTune (or oprofile, if you're on Linux)?

By the way, a built-in graphics card with a small amount of memory may actually be faster than one with a large amount of memory when it comes to WRITING the display-list in OpenGL. So if you are spending a lot of time writing the display-list, and the drawing of the object is trivial (so much so that the basic Intel graphics card isn't "pushed" to its limits), then the basic model graphics card may have a big advantage.

--
Mats
10-16-2007
Frobozz

Quote:

Originally Posted by khdani

-----------------------------------
Intel DUO 1.6 GHz
1 GB RAM
ATI Radeon x1050 PCI Express 512MB RAM (HyperMemory)
-----------------------------------
I ran exactly the same program on:
----------------------------------
Pentium 4 2.6 GHz
512MB RAM
Intel OnBoard Video Card 64MB
----------------------------------

Could we have a little more detail on the computers? It could very well be the Intel video is faster somehow. The only information I can find on the X1050 indicates it uses DDR1/2 memory.
10-16-2007
abachler

Quote:

Originally Posted by matsp

Somehow I find this rather unlikely. In my experience, code written for long pipelines will run well on short pipeline processors (e.g. Athlon-64 vs. P4 using the same code), but the other way around isn't always true.

--
Mats

Actually, you have that backwards. The longer a pipeline (assuming it's an efficiently implemented one), the more effective it is at minimizing small inefficiencies in the instruction stream. Shorter pipelines are statistically less likely to catch every possible chance to perform OOE. They had to shorten the pipelines in the Core 2 because of going multicore, where pipeline refills due to failed branch prediction have a larger impact on cache performance. It's a trade-off that won't always benefit every program.

When you start measuring AMD vs Intel, all bets are off, because they implement their ALUs in completely different ways. They both execute similar instruction sets, but internally they are like apples and oranges.

8 cores is going to be the limit as far as multicore, since that is the point at which one or more cores will almost always be stalled due to a cache miss. The only way to improve performance at that point will be to start using more advanced memory architectures.
10-16-2007
brewbuck

Quote:

Originally Posted by CornedBee

*shrug* Just a suggestion.

Anyway, what you really should do is profile the program and find out if the performance difference comes from a general slowdown or if there is a specific bit of code that is so much slower.

Unfortunately, profiling is often not helpful in these situations. Getting back to the example I was dealing with last week, when the code was compiled in profile mode, NO DIFFERENCE WAS OBSERVED in run time. How's that for frustration?

The reason is that the performance issue was cache-based. The sampling profiler absolutely blows the cache efficiency out of the water, because it is constantly stopping the program, checking where it is, and incrementing a counter somewhere. So if the performance problem is caused by cache, running the program under a profiler will MASK IT.
10-16-2007
matsp

Quote:

Originally Posted by abachler

Actually, you have that backwards. The longer a pipeline (assuming it's an efficiently implemented one), the more effective it is at minimizing small inefficiencies in the instruction stream. Shorter pipelines are statistically less likely to catch every possible chance to perform OOE. They had to shorten the pipelines in the Core 2 because of going multicore, where pipeline refills due to failed branch prediction have a larger impact on cache performance. It's a trade-off that won't always benefit every program.

When you start measuring AMD vs Intel, all bets are off, because they implement their ALUs in completely different ways. They both execute similar instruction sets, but internally they are like apples and oranges.

A long pipeline has nothing to do with OOE (Out Of Order execution). That is all about having a large number of potential instructions to choose from "ready to execute". But once an instruction is actually in the pipeline, it has to either be skipped or executed. A longer pipeline doesn't make that any better.

The length of the pipeline determines [indirectly] how fast you can clock the processor. By increasing the length of the pipeline, the amount of work done in each pipeline step is less, which means that the time it takes to pass a certain pipeline stage is shorter, meaning that the clock-cycles can be shorter without causing problems with the timing of each pipeline stage.

I would still say that a long pipeline is not as good as a short pipeline, unless the code is specifically written to fit a long pipeline [no unpredictable branches, long sequences of instructions with no branches, that sort of thing]. And even then, a short pipeline will not show any drawback in itself - it may not clock as high, but with correct design, this can be overcome by doing more work per cycle, as proven by both AMD and more recent Intel processors.
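As a rough illustration of "no unpredictable branches", here is a small C sketch with two ways of counting the values at or above a limit. The function names are made up for the example. The first version has a data-dependent branch in the loop; the second adds the comparison result directly, so there is nothing for the predictor to get wrong. An optimizing compiler may well emit the same machine code for both, so treat this as a sketch of the idea rather than a guaranteed speedup.

```c
#include <assert.h>
#include <stddef.h>

/* Branching form: the `if` is hard to predict when the data is random. */
size_t count_at_least_branchy(const unsigned char *v, size_t n,
                              unsigned char limit)
{
    size_t count = 0;
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (v[i] >= limit)
            count++;
    }
    return count;
}

/* Branch-free form: the comparison yields 0 or 1, which is added directly. */
size_t count_at_least_branchless(const unsigned char *v, size_t n,
                                 unsigned char limit)
{
    size_t count = 0;
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        count += (size_t)(v[i] >= limit);
    return count;
}
```

Timing both on random data and again on sorted data, on a long-pipeline and a short-pipeline CPU, is one way to see how much a mispredicted branch costs on each.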

--
Mats
