Thread: Weird Performance

  1. #1
    Registered User khdani's Avatar
    Join Date
    Oct 2007
    Posts
    42

    Weird Performance

    Hello,
    I've run into the following issue and can't figure out why it's happening....
    I've written a particle engine in C/OpenGL. My hardware is:
    -----------------------------------
    Intel Core Duo 1.6 GHz
    1 GB RAM
    ATI Radeon X1050, PCI Express, 512 MB (HyperMemory)
    -----------------------------------
    I ran exactly the same program on:
    ----------------------------------
    Pentium 4 2.6 GHz
    512 MB RAM
    Intel onboard video, 64 MB
    ----------------------------------
    and the program ran much faster on the Pentium 4 machine. I could even use
    four times as many particles, each nine times larger, and it still ran faster than the original did on my PC...
    Can anyone please explain why my PC is so much slower?

  2. #2
    Registered User VirtualAce's Avatar
    Join Date
    Aug 2001
    Posts
    9,607
    Without code it's anyone's guess.

  3. #3
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    The 2.6 GHz Pentium may have some instructions (which your program uses) that are a LOT faster than the simple ratio of clock speeds would suggest. Your compiler might be producing both SIMD code and an emulated-SIMD fallback, and the emulated path could be what ends up running on the slower system, making the other one look MUCH faster. Or you could be having cache issues; perhaps the cache on the 2.6 GHz Pentium is better behaved in your case.

    I've seen lots of problems like this. Just this week I've been dealing with two very similar versions of the same program, one of which runs 3x slower than the other -- but only on certain systems. On a few systems it runs FASTER, and on some others it runs in the same time. The cause is variation in the amount and quality of CPU cache across systems. Small tweaks to an algorithm have barely any impact on some machines; on others they absolutely kill performance.
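    Here's a minimal sketch of the kind of cache sensitivity I mean (array name and sizes made up for illustration). Both loops do exactly the same arithmetic; only the traversal order differs:

    Code:
    #include <stdio.h>

    #define N 1024
    static float a[N][N];   /* contents don't matter for the effect */

    int main(void)
    {
        float sum = 0.0f;
        int i, j;

        /* row-major traversal: walks memory sequentially, so the cache
           prefetches nicely on virtually any CPU */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];

        /* column-major traversal: jumps N*sizeof(float) bytes between
           accesses; on a machine with a small or slow cache nearly
           every load misses, and this loop can run several times slower */
        for (j = 0; j < N; j++)
            for (i = 0; i < N; i++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }

    Time the two loops separately on both machines and you'll usually see wildly different ratios, even though the instruction counts are identical.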

  4. #4

    Join Date
    May 2005
    Posts
    1,042
    aliens
    I'm not immature, I'm refined in the opposite direction.

  5. #5
    Registered User khdani's Avatar
    Join Date
    Oct 2007
    Posts
    42
    Thanks brewbuck
    If I understand you correctly, I need to rewrite my program for multithreading if I want it to run faster on the Duo, right?

  6. #6
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    That's one option, but it probably won't help you in the graphics case.

    Also, a 1.6 GHz Core 1 CPU should be able to keep up with a 2.6 GHz P4. Perhaps your compiler is tuned to the P4 and generates code that is considerably more efficient there.

    But you'd need detailed profiling to find out.
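    For a first pass you don't even need a real profiler. A minimal sketch, with made-up function names standing in for whatever your engine actually does per frame, to see whether the update or the draw phase dominates:

    Code:
    #include <stdio.h>
    #include <time.h>

    /* hypothetical stand-ins for the engine's real phases */
    static void update_particles(void) { /* ... particle update code ... */ }
    static void draw_particles(void)   { /* ... OpenGL draw calls ... */ }

    int main(void)
    {
        clock_t t0, t1, t2;

        t0 = clock();
        update_particles();
        t1 = clock();
        draw_particles();
        /* note: clock() measures CPU time on most systems; to include
           the GPU's share of the draw, call glFinish() before reading t2 */
        t2 = clock();

        printf("update %.3fs  draw %.3fs\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }

    If both phases scale the same way on both machines, it's a general slowdown; if one of them blows up only on the Duo box, that's where to dig.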
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  7. #7
    Registered User khdani's Avatar
    Join Date
    Oct 2007
    Posts
    42
    Yes, the compiler indeed generates code for P4...but OpenGL libraries are 32bit,
    so I can't recompile it for 64bit, is there any other options ?

  8. #8
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    A Core 1 doesn't have a 64-bit mode. That's not what I meant, anyway. I'm saying the compiler is tuned to the P4: the way it schedules instructions is optimized for the P4's endless instruction pipeline, instead of the Core's far shorter one.
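    With GCC, for example, that's just a tuning flag: building with -mtune=pentium4 schedules for the P4's pipeline, while something like -mtune=pentium-m (the Core 1's close architectural relative) targets the shorter one. Exact flag spellings vary by compiler and version, so check your own compiler's docs; other compilers have equivalent switches.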
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  9. #9
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by CornedBee View Post
    A Core 1 doesn't have a 64-bit mode. That's not what I meant, anyway. I'm saying the compiler is tuned to the P4: the way it schedules instructions is optimized for the P4's endless instruction pipeline, instead of the Core's far shorter one.
    Somehow I find this rather unlikely. In my experience, code written for long pipelines runs well on short-pipeline processors (e.g. an Athlon 64 vs. a P4 running the same code), but the other way around isn't always true.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  10. #10
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    *shrug* Just a suggestion.

    Anyway, what you really should do is profile the program and find out whether the performance difference comes from a general slowdown or from one specific bit of code that is much slower.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  11. #11
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    That is, indeed, a good suggestion. May I suggest a trial version of VTune? (Or oprofile, if it's Linux.)

    By the way, a built-in graphics card with a small amount of memory may actually be faster than one with a large amount when it comes to WRITING the display list in OpenGL, since the list lives in system memory and nothing has to cross the bus. So if you are spending a lot of time writing the display list, and the drawing of the objects is trivial (so much so that the basic Intel graphics card isn't pushed to its limits), then the basic graphics card may have a big advantage.
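    For reference, the pattern I mean looks roughly like this (a sketch; it assumes you already have a GL context, and the function names are made up):

    Code:
    #include <GL/gl.h>

    static GLuint particle_list;

    /* the WRITING part: compile the geometry into a display list once */
    void build_particle_list(void)
    {
        particle_list = glGenLists(1);
        glNewList(particle_list, GL_COMPILE);
        /* ... glBegin()/glVertex*()/glEnd() calls for the particles ... */
        glEndList();
    }

    /* the DRAWING part: per frame, just replay the compiled list */
    void draw_particles(void)
    {
        glCallList(particle_list);
    }

    If your particles move every frame, you end up back inside glNewList()/glEndList() every frame, which is exactly the write-heavy case where the integrated card's direct path to system memory can win.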

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  12. #12
    Registered User Frobozz's Avatar
    Join Date
    Dec 2002
    Posts
    546
    Quote Originally Posted by khdani View Post
    -----------------------------------
    Intel Core Duo 1.6 GHz
    1 GB RAM
    ATI Radeon X1050, PCI Express, 512 MB (HyperMemory)
    -----------------------------------
    I ran exactly the same program on:
    ----------------------------------
    Pentium 4 2.6 GHz
    512 MB RAM
    Intel onboard video, 64 MB
    ----------------------------------
    Could we have a little more detail on the computers? It could very well be that the Intel video is faster somehow. The only information I can find on the X1050 indicates it uses DDR1/DDR2 memory.

  13. #13
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Quote Originally Posted by matsp View Post
    Somehow I find this rather unlikely. In my experience, code written for long pipelines runs well on short-pipeline processors (e.g. an Athlon 64 vs. a P4 running the same code), but the other way around isn't always true.

    --
    Mats
    Actually, you have that backwards. The longer a pipeline (assuming it's an efficiently implemented one), the more effective it is at smoothing over small inefficiencies in the instruction stream. Shorter pipelines are statistically less likely to catch every possible chance to perform OOE. They had to shorten the pipeline in the Core 2 because of going multicore, where pipeline refills due to failed branch prediction have a larger impact on cache performance. It's a trade-off that won't always benefit every program.

    When you start measuring AMD vs. Intel, all bets are off, because they implement their ALUs in completely different ways. They both execute similar instruction sets, but internally they are like apples and oranges.

    Eight cores is going to be the limit as far as multicore goes, since that is the point at which one or more cores will almost always be stalled on a cache miss. The only way to improve performance beyond that will be to start using more advanced memory architectures.
    Last edited by abachler; 10-16-2007 at 04:17 PM.

  14. #14
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by CornedBee View Post
    *shrug* Just a suggestion.

    Anyway, what you really should do is profile the program and find out whether the performance difference comes from a general slowdown or from one specific bit of code that is much slower.
    Unfortunately, profiling is often not helpful in these situations. Getting back to the example I was dealing with last week: when the code was compiled in profiling mode, NO DIFFERENCE WAS OBSERVED in run time. How's that for frustration?

    The reason is that the performance issue was cache-based. A sampling profiler absolutely blows the cache efficiency out of the water, because it is constantly stopping the program, checking where it is, and incrementing a counter somewhere. So if the performance problem is caused by the cache, running the program under a profiler will MASK IT.
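    What you can do instead is instrument the code yourself with something far lighter than a sampling profiler. A sketch for x86 with GCC (rdtsc is the real instruction; suspect_function is a made-up stand-in for the code under suspicion):

    Code:
    #include <stdio.h>

    /* read the x86 time-stamp counter via GCC inline assembly */
    static inline unsigned long long rdtsc(void)
    {
        unsigned int lo, hi;
        __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
        return ((unsigned long long)hi << 32) | lo;
    }

    /* hypothetical stand-in for the region under suspicion */
    static void suspect_function(void)
    {
        volatile int i, x = 0;
        for (i = 0; i < 1000000; i++)
            x += i;
    }

    int main(void)
    {
        unsigned long long t0 = rdtsc();
        suspect_function();
        printf("%llu cycles\n", rdtsc() - t0);
        return 0;
    }

    Two reads of the counter and one printf after the fact barely touch the cache, so the measurement doesn't destroy the very effect you're trying to observe.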

  15. #15
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by abachler View Post
    Actually, you have that backwards. The longer a pipeline (assuming it's an efficiently implemented one), the more effective it is at smoothing over small inefficiencies in the instruction stream. Shorter pipelines are statistically less likely to catch every possible chance to perform OOE. They had to shorten the pipeline in the Core 2 because of going multicore, where pipeline refills due to failed branch prediction have a larger impact on cache performance. It's a trade-off that won't always benefit every program.

    When you start measuring AMD vs. Intel, all bets are off, because they implement their ALUs in completely different ways. They both execute similar instruction sets, but internally they are like apples and oranges.
    A long pipeline has nothing to do with OOE (out-of-order execution); OOE is all about having a large number of potential instructions "ready to execute" to choose from. But once an instruction is actually in the pipeline, it has to be either skipped or executed, and a longer pipeline doesn't make that any better.

    The length of the pipeline [indirectly] determines how fast you can clock the processor: the longer the pipeline, the less work is done in each stage, so each stage completes sooner and the clock cycle can shrink without breaking any one stage's timing.

    I would still say that a long pipeline is not as good as a short one, unless the code is specifically written to fit a long pipeline [no unpredictable branches, long runs of instructions between branches, that sort of thing]. And even then, a short pipeline shows no drawback in itself: it may not clock as high, but with the right design that can be overcome by doing more work per cycle, as both AMD and more recent Intel processors have proven.
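    To make the "unpredictable branches" point concrete, a small made-up example (the stage counts in the comments are ballpark figures):

    Code:
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000

    int main(void)
    {
        static unsigned char data[N];
        long sum = 0;
        int i;

        for (i = 0; i < N; i++)
            data[i] = (unsigned char)(rand() & 0xFF);

        /* with random data this branch is taken essentially at random,
           so the predictor fails about half the time; every miss
           flushes the pipeline -- 20+ stages on a P4, roughly a dozen
           on a Core */
        for (i = 0; i < N; i++)
            if (data[i] >= 128)
                sum += data[i];

        /* sort data[] first and the same branch becomes almost
           perfectly predictable -- that's the "written to fit a long
           pipeline" case */
        printf("%ld\n", sum);
        return 0;
    }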

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.
