Thread: Weird Performance

  1. #1
    Registered User khdani's Avatar
    Join Date
    Oct 2007
    Posts
    42

    Weird Performance

    Hello,
    I've run into the following issue and can't figure out why it's happening....
    I've written a particle engine in C/OpenGL. My hardware is:
    -----------------------------------
    Intel Core Duo 1.6 GHz
    1 GB RAM
    ATI Radeon X1050, PCI Express, 512 MB (HyperMemory)
    -----------------------------------
    I ran exactly the same program on:
    ----------------------------------
    Pentium 4 2.6 GHz
    512 MB RAM
    Intel onboard video, 64 MB
    ----------------------------------
    and the program ran much faster on the Pentium 4 machine. I could even use
    four times as many particles, each nine times larger, and it still ran faster than the original did on my PC...
    Can anyone please explain why my PC is so much slower?

  2. #2
    Registered User VirtualAce's Avatar
    Join Date
    Aug 2001
    Posts
    9,607
    Without code it's anyone's guess.

  3. #3
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    The 2.6 GHz Pentium may have some instructions (which your program uses) that are a LOT faster than the simple ratio of clock speeds would suggest. Your compiler might be producing both SIMD code and an emulated-SIMD fallback, and the emulated path could be what ends up running on the slower system, making the other one look MUCH faster. Or you could be having cache issues; perhaps the cache on the 2.6 GHz Pentium is better behaved in your case.

    I've seen lots of problems like this. Just this week I've been dealing with two very similar versions of the same program, one of which runs 3x slower than the other -- but only on certain systems. On a few systems it runs FASTER, and on some others it runs in the same time. The cause is variation in the amount and quality of CPU cache across systems. Small tweaks to an algorithm have barely any impact on some machines; on others they absolutely kill performance.
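    Here's a minimal sketch of the kind of cache sensitivity I mean (array name and sizes made up for illustration). Both loops do exactly the same arithmetic; only the traversal order differs:

    Code:
    #include <stdio.h>

    #define N 1024
    static float a[N][N];   /* contents don't matter for the effect */

    int main(void)
    {
        float sum = 0.0f;
        int i, j;

        /* row-major traversal: walks memory sequentially, so the cache
           prefetches nicely on virtually any CPU */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];

        /* column-major traversal: jumps N*sizeof(float) bytes between
           accesses; on a machine with a small or slow cache nearly
           every load misses, and this loop can run several times slower */
        for (j = 0; j < N; j++)
            for (i = 0; i < N; i++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }

    Time the two loops separately on both machines and you'll usually see wildly different ratios, even though the instruction counts are identical.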

  4. #4

    Join Date
    May 2005
    Posts
    1,042
    aliens
    I'm not immature, I'm refined in the opposite direction.

  5. #5
    Registered User khdani's Avatar
    Join Date
    Oct 2007
    Posts
    42
    Thanks brewbuck
    If I understand you correctly, I need to rewrite my program for multithreading if I want it to run faster on the Duo, right?

  6. #6
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    That's one option, but it probably won't help you in the graphics case.

    Also, a 1.6 GHz Core 1 CPU should be able to keep up with a 2.6 GHz P4. Perhaps your compiler is tuned to the P4 and generates code that is considerably more efficient there.

    But you'd need detailed profiling to find out.
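    For a first pass you don't even need a real profiler. A minimal sketch, with made-up function names standing in for whatever your engine actually does per frame, to see whether the update or the draw phase dominates:

    Code:
    #include <stdio.h>
    #include <time.h>

    /* hypothetical stand-ins for the engine's real phases */
    static void update_particles(void) { /* ... particle update code ... */ }
    static void draw_particles(void)   { /* ... OpenGL draw calls ... */ }

    int main(void)
    {
        clock_t t0, t1, t2;

        t0 = clock();
        update_particles();
        t1 = clock();
        draw_particles();
        /* note: clock() measures CPU time on most systems; to include
           the GPU's share of the draw, call glFinish() before reading t2 */
        t2 = clock();

        printf("update %.3fs  draw %.3fs\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }

    If both phases scale the same way on both machines, it's a general slowdown; if one of them blows up only on the Duo box, that's where to dig.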
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  7. #7
    Registered User khdani's Avatar
    Join Date
    Oct 2007
    Posts
    42
    Yes, the compiler indeed generates code for P4...but OpenGL libraries are 32bit,
    so I can't recompile it for 64bit, is there any other options ?

  8. #8
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    A Core 1 doesn't have a 64-bit mode. That's not what I meant, anyway. I'm saying the compiler is tuned to the P4: the way it schedules instructions is optimized for the P4's endless instruction pipeline, instead of the Core's far shorter one.
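    With GCC, for example, that's just a tuning flag: building with -mtune=pentium4 schedules for the P4's pipeline, while something like -mtune=pentium-m (the Core 1's close architectural relative) targets the shorter one. Exact flag spellings vary by compiler and version, so check your own compiler's docs; other compilers have equivalent switches.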
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  9. #9
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by CornedBee View Post
    A Core 1 doesn't have a 64-bit mode. That's not what I meant, anyway. I'm saying the compiler is tuned to the P4: the way it schedules instructions is optimized for the P4's endless instruction pipeline, instead of the Core's far shorter one.
    Somehow I find this rather unlikely. In my experience, code written for long pipelines runs well on short-pipeline processors (e.g. an Athlon 64 vs. a P4 running the same code), but the other way around isn't always true.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  10. #10
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    *shrug* Just a suggestion.

    Anyway, what you really should do is profile the program and find out whether the performance difference comes from a general slowdown or from one specific bit of code that is much slower.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  11. #11
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    That is, indeed, a good suggestion. May I suggest a trial version of VTune? (Or oprofile, if it's Linux.)

    By the way, a built-in graphics card with a small amount of memory may actually be faster than one with a large amount when it comes to WRITING the display list in OpenGL, since the list lives in system memory and nothing has to cross the bus. So if you are spending a lot of time writing the display list, and the drawing of the objects is trivial (so much so that the basic Intel graphics card isn't pushed to its limits), then the basic graphics card may have a big advantage.
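    For reference, the pattern I mean looks roughly like this (a sketch; it assumes you already have a GL context, and the function names are made up):

    Code:
    #include <GL/gl.h>

    static GLuint particle_list;

    /* the WRITING part: compile the geometry into a display list once */
    void build_particle_list(void)
    {
        particle_list = glGenLists(1);
        glNewList(particle_list, GL_COMPILE);
        /* ... glBegin()/glVertex*()/glEnd() calls for the particles ... */
        glEndList();
    }

    /* the DRAWING part: per frame, just replay the compiled list */
    void draw_particles(void)
    {
        glCallList(particle_list);
    }

    If your particles move every frame, you end up back inside glNewList()/glEndList() every frame, which is exactly the write-heavy case where the integrated card's direct path to system memory can win.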

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  12. #12
    Registered User Frobozz's Avatar
    Join Date
    Dec 2002
    Posts
    546
    Quote Originally Posted by khdani View Post
    -----------------------------------
    Intel Core Duo 1.6 GHz
    1 GB RAM
    ATI Radeon X1050, PCI Express, 512 MB (HyperMemory)
    -----------------------------------
    I ran exactly the same program on:
    ----------------------------------
    Pentium 4 2.6 GHz
    512 MB RAM
    Intel onboard video, 64 MB
    ----------------------------------
    Could we have a little more detail on the computers? It could very well be that the Intel video is faster somehow. The only information I can find on the X1050 indicates it uses DDR1/DDR2 memory.

  13. #13
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Quote Originally Posted by matsp View Post
    Somehow I find this rather unlikely. In my experience, code written for long pipelines runs well on short-pipeline processors (e.g. an Athlon 64 vs. a P4 running the same code), but the other way around isn't always true.

    --
    Mats
    Actually, you have that backwards. The longer a pipeline (assuming it's an efficiently implemented one), the more effective it is at smoothing over small inefficiencies in the instruction stream. Shorter pipelines are statistically less likely to catch every possible chance to perform OOE. They had to shorten the pipeline in the Core 2 because of going multicore, where pipeline refills due to failed branch prediction have a larger impact on cache performance. It's a trade-off that won't always benefit every program.

    When you start measuring AMD vs. Intel, all bets are off, because they implement their ALUs in completely different ways. They both execute similar instruction sets, but internally they are like apples and oranges.

    Eight cores is going to be the limit as far as multicore goes, since that is the point at which one or more cores will almost always be stalled on a cache miss. The only way to improve performance beyond that will be to start using more advanced memory architectures.
    Last edited by abachler; 10-16-2007 at 04:17 PM.

  14. #14
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by CornedBee View Post
    *shrug* Just a suggestion.

    Anyway, what you really should do is profile the program and find out whether the performance difference comes from a general slowdown or from one specific bit of code that is much slower.
    Unfortunately, profiling is often not helpful in these situations. Getting back to the example I was dealing with last week: when the code was compiled in profiling mode, NO DIFFERENCE WAS OBSERVED in run time. How's that for frustration?

    The reason is that the performance issue was cache-based. A sampling profiler absolutely blows the cache efficiency out of the water, because it is constantly stopping the program, checking where it is, and incrementing a counter somewhere. So if the performance problem is caused by the cache, running the program under a profiler will MASK IT.
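    What you can do instead is instrument the code yourself with something far lighter than a sampling profiler. A sketch for x86 with GCC (rdtsc is the real instruction; suspect_function is a made-up stand-in for the code under suspicion):

    Code:
    #include <stdio.h>

    /* read the x86 time-stamp counter via GCC inline assembly */
    static inline unsigned long long rdtsc(void)
    {
        unsigned int lo, hi;
        __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
        return ((unsigned long long)hi << 32) | lo;
    }

    /* hypothetical stand-in for the region under suspicion */
    static void suspect_function(void)
    {
        volatile int i, x = 0;
        for (i = 0; i < 1000000; i++)
            x += i;
    }

    int main(void)
    {
        unsigned long long t0 = rdtsc();
        suspect_function();
        printf("%llu cycles\n", rdtsc() - t0);
        return 0;
    }

    Two reads of the counter and one printf after the fact barely touch the cache, so the measurement doesn't destroy the very effect you're trying to observe.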

  15. #15
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by abachler View Post
    Actually, you have that backwards. The longer a pipeline (assuming it's an efficiently implemented one), the more effective it is at smoothing over small inefficiencies in the instruction stream. Shorter pipelines are statistically less likely to catch every possible chance to perform OOE. They had to shorten the pipeline in the Core 2 because of going multicore, where pipeline refills due to failed branch prediction have a larger impact on cache performance. It's a trade-off that won't always benefit every program.

    When you start measuring AMD vs. Intel, all bets are off, because they implement their ALUs in completely different ways. They both execute similar instruction sets, but internally they are like apples and oranges.
    A long pipeline has nothing to do with OOE (out-of-order execution); OOE is all about having a large number of potential instructions "ready to execute" to choose from. But once an instruction is actually in the pipeline, it has to be either skipped or executed, and a longer pipeline doesn't make that any better.

    The length of the pipeline [indirectly] determines how fast you can clock the processor: the longer the pipeline, the less work is done in each stage, so each stage completes sooner and the clock cycle can shrink without breaking any one stage's timing.

    I would still say that a long pipeline is not as good as a short one, unless the code is specifically written to fit a long pipeline [no unpredictable branches, long runs of instructions between branches, that sort of thing]. And even then, a short pipeline shows no drawback in itself: it may not clock as high, but with the right design that can be overcome by doing more work per cycle, as both AMD and more recent Intel processors have proven.
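    To make the "unpredictable branches" point concrete, a small made-up example (the stage counts in the comments are ballpark figures):

    Code:
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000

    int main(void)
    {
        static unsigned char data[N];
        long sum = 0;
        int i;

        for (i = 0; i < N; i++)
            data[i] = (unsigned char)(rand() & 0xFF);

        /* with random data this branch is taken essentially at random,
           so the predictor fails about half the time; every miss
           flushes the pipeline -- 20+ stages on a P4, roughly a dozen
           on a Core */
        for (i = 0; i < N; i++)
            if (data[i] >= 128)
                sum += data[i];

        /* sort data[] first and the same branch becomes almost
           perfectly predictable -- that's the "written to fit a long
           pipeline" case */
        printf("%ld\n", sum);
        return 0;
    }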

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.
