Thread: Odd assembly problem

  1. #61
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by CornedBee
    Actually, most developers who need to process small amounts of data are going over to SSE, too. GCC generates SSE code by default for x64 targets. The reason is that the x87 execution model was always broken by design, and compiler writers detested the thing.
    Yes, and combined with 16 registers for SSE, it makes it really easy to generate decent SSE code, and it's never been particularly easy for any compiler to do REALLY good x87 FPU code (there's a quick sketch of the difference at the end of this post).
    It's true that the Core2 is the first CPU to have a 128-bit execution bandwidth and thus to be able to do SSE operations in one step, whereas all previous CPUs, including AMD's, need to process them in two steps, thus taking twice as many cycles for all operations.
    What is not true is that AMD's x87 unit is slower than Intel's.
    Yes, and it's fairly well publicized that AMD's new quad-core processor has 128-bit units capable of doing SSE as fast as a Core2.

    And if memory bandwidth is really the limiting factor, the AMD processors were definitely better before Core2 came out [which is a major change to Intel's top-end architecture, as it's slower [in MHz] than its predecessors, has a shorter pipeline, no hyperthreading, and better memory bandwidth].
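
    Here is the quick sketch of the codegen point, assuming GCC (the file name is made up; -m64/-m32 and -S are standard GCC switches, and a 32-bit build needs multilib support installed):
    Code:
    /* fpmath_demo.c - same scalar FP source, two code generation models.
     *
     *   gcc -O2 -m64 -S fpmath_demo.c   -> mulsd/addsd on xmm registers (scalar SSE2, the x86-64 default)
     *   gcc -O2 -m32 -S fpmath_demo.c   -> fmul/faddp on the x87 register stack (the i386 default)
     */
    double axpy(double a, double x, double y)
    {
        return a * x + y;   /* one multiply, one add - easy to spot in the .s output */
    }
    Comparing the two .s files side by side shows why the flat xmm register file is much easier for a compiler to schedule than the x87 register stack.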

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  2. #62
    abachler
    Malum in se
    Join Date
    Apr 2007
    Posts
    3,195
    Quote Originally Posted by matsp
    And if memory bandwidth is really the limiting factor, the AMD processors were definitely better before Core2 came out [which is a major change to Intel's top-end architecture, as it's slower [in MHz] than its predecessors, has a shorter pipeline, no hyperthreading, and better memory bandwidth].

    --
    Mats
    Lower MHz is a detractor, and so is no hyperthreading. The shorter pipeline is a tradeoff, as discussed in that other thread, to reduce cache usage on mispredictions.

    The memory bandwidth limitation is purely about getting the data from main memory into the L2 or L1 cache. Out-of-order and speculative execution, as well as hyperthreading (on Intel), hide the latency of the actual calculations, and judicious use of software prefetching keeps the cache working at nearly 100% efficiency (sketch below). The fact that the calculations are taking place on multiple threads spreads/hides the latency even more, to the point that the physical rate at which data can be brought into the chip is the limiting factor. On the Core2 it reaches a point where it's almost the latency of the processor, but that's with a dual Core2 and 800 MHz memory; a quad core would again keep it saturated even at 1333 MHz.
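
    On the prefetching point, a minimal sketch using GCC's __builtin_prefetch (the function, the array, and the 64-element look-ahead distance are just made up for illustration; the real distance has to be tuned to the memory latency and the work done per element):
    Code:
    /* prefetch_sum.c - overlap main-memory fetches with the computation. */
    #include <stddef.h>

    #define AHEAD 64   /* elements to prefetch ahead of the current index - a tuning knob */

    double sum(const double *data, size_t n)
    {
        double total = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + AHEAD < n)
                __builtin_prefetch(&data[i + AHEAD], 0, 0);  /* 0 = read, 0 = no temporal reuse */
            total += data[i];
        }
        return total;
    }
    Whether this actually helps depends on whether the hardware prefetcher already keeps up; for a simple sequential sum it usually does, so treat this purely as a demonstration of the mechanism.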
    Last edited by abachler; 12-07-2007 at 08:17 AM.

