Thread: Multithreading no performance gain?

  1. #16
    Registered User (Join Date: Dec 2006, Location: Canada, Posts: 3,229)
    That's possible. It's also possible that the cache coherence protocol is totally different on the 64-bit architecture, in a way that makes it less of a problem. Regardless of architecture though, you should try to avoid having multiple threads working on the same piece of memory.
    Of course.

    In this case, I believe it is because on Core 2 Duos the two cores actually share the L2 cache.
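    Something along these lines (just an illustrative sketch, not code from this thread; the 64-byte cache-line size and all the names are assumptions) is one way to keep each thread's result on its own cache line, so the two cores never write to the same line:
    Code:
        #include <thread>

        // Hypothetical per-thread result slot, padded out to a full cache line
        // (64 bytes is the usual x86 line size) so the two writers never share one.
        struct alignas(64) Slot { int value; };

        static void work(int n, Slot *out)
        {
            int acc = 0;
            for (int i = 0; i < n; ++i)
                acc += i % 7;          // stand-in for real work
            out->value = acc;          // each thread writes only its own line
        }

        int main()
        {
            Slot r[2];                 // adjacent slots, but separate cache lines
            std::thread t1(work, 1000000, &r[0]);
            std::thread t2(work, 1000000, &r[1]);
            t1.join();
            t2.join();
            return r[0].value + r[1].value;
        }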

  2. #17
    Registered User (Join Date: Dec 2006, Location: Canada, Posts: 3,229)
    It makes no sense to create two additional threads and then leave the one that started them both idle! Only create one extra thread and just make a direct call from the current thread. You will have two threads doing work at the same time without the need to set up and tear down two more threads.
    True. I did it this way because in my original code, where I saw the problem (not posted), the main thread is busy doing something else (not CPU-intensive).

    I assume that you do know about the binary power method?
    Yeap, O(log n). In fact, I created a thread asking about it about 2 years back, and I think you replied.

    Here I just need some work for the CPUs to do that the compiler won't be smart enough to optimize out.
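    For anyone following along, this is the binary power method being referred to (the standard textbook version, written from memory); it is deliberately not the slow_exp used in the benchmark, which is meant to burn CPU time:
    Code:
        #include <cstdint>

        // Binary power method (exponentiation by squaring): O(log exp) multiplies,
        // walking the bits of the exponent from least to most significant.
        std::uint64_t fast_exp(std::uint64_t base, std::uint64_t exp)
        {
            std::uint64_t result = 1;
            while (exp > 0)
            {
                if (exp & 1)           // this bit is set: fold the current power in
                    result *= base;
                base *= base;          // square for the next bit
                exp >>= 1;
            }
            return result;             // wraps modulo 2^64 for large inputs
        }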

  3. #18
    Registered User (Join Date: Dec 2006, Location: Canada, Posts: 3,229)
    I tried it with different optimization levels, and interestingly enough, the result is as expected for -O0 and -O1 (multithreaded version twice as fast). -O2 behaves the same as -O3 (no gain from the second thread). I will try disabling individual optimizations to see which one causes it.

  4. #19
    Registered User (Join Date: Dec 2006, Location: Canada, Posts: 3,229)
    I tried starting with -O2 and disabling the additional optimizations (over -O1) one by one.

    with just -O2 it's
    2.10/2.25

    with all of -O2's additional optimizations disabled
    2.16/2.20

    Not sure what is going on now... undocumented optimizations?
    -O2 -fno-thread-jumps -fno-align-functions -fno-align-jumps -fno-align-loops -fno-align-labels -fno-caller-saves -fno-crossjumping -fno-cse-follow-jumps -fno-cse-skip-blocks -fno-delete-null-pointer-checks -fno-expensive-optimizations -fno-gcse -fno-gcse-lm -fno-indirect-inlining -fno-optimize-sibling-calls -fno-peephole2 -fno-regmove -fno-reorder-blocks -fno-reorder-functions -fno-rerun-cse-after-loop -fno-sched-interblock -fno-sched-spec -fno-schedule-insns -fno-schedule-insns2 -fno-strict-aliasing -fno-strict-overflow -fno-tree-switch-conversion -fno-tree-pre -fno-tree-vrp
    That's from Optimize Options - Using the GNU Compiler Collection (GCC)

  5. #20
    brewbuck, Officially An Architect (Join Date: Mar 2007, Location: Portland, OR, Posts: 7,396)
    It might be quicker to just examine the assembly code for -O1 and -O2.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  6. #21
    Registered User (Join Date: Dec 2006, Location: Canada, Posts: 3,229)
    It might be quicker to just examine the assembly code for -O1 and -O2.
    Yeah, I was trying to avoid that because the compiler-generated assembly didn't make much sense to me. I only know the basics of x86 asm.

  7. #22
    Registered User (Join Date: Dec 2006, Location: Canada, Posts: 3,229)
    I tried making the multithreaded version use a single thread... and the result is terrifying.
    Code:
        int r[1024*1024]; // result slots ~4 MB (1M ints) apart
        std::thread thread1(slow_exp, base, exp, &r[0]);
        //std::thread thread2(slow_exp, base, exp - exp / 2, &r[1024*1024-1]);
        thread1.join();
        //thread2.join();
        return r[0];
    That took twice as long as the "real" single-threaded version!
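    A rough harness along these lines (just a sketch, not the actual benchmark; slow_exp_stub and the constants are stand-ins I made up) is how the direct call can be compared against the same work wrapped in one std::thread:
    Code:
        #include <chrono>
        #include <cstdio>
        #include <thread>

        // Stand-in for slow_exp: deliberately naive repeated multiplication.
        static void slow_exp_stub(unsigned base, unsigned exp, unsigned *out)
        {
            unsigned acc = 1;
            for (unsigned i = 0; i < exp; ++i)
                acc *= base;               // unsigned, so wrap-around is well defined
            *out = acc;
        }

        int main(int argc, char **)
        {
            using clock = std::chrono::steady_clock;
            using std::chrono::duration_cast;
            using std::chrono::milliseconds;

            unsigned base = 2u + (unsigned)argc;   // runtime value so the loop isn't folded away
            unsigned exp  = 200000000u;
            unsigned r    = 0;

            auto t0 = clock::now();
            slow_exp_stub(base, exp, &r);                  // plain direct call
            auto t1 = clock::now();

            std::thread t(slow_exp_stub, base, exp, &r);   // same work via one thread
            t.join();
            auto t2 = clock::now();

            std::printf("direct: %lld ms, via std::thread: %lld ms\n",
                        (long long)duration_cast<milliseconds>(t1 - t0).count(),
                        (long long)duration_cast<milliseconds>(t2 - t1).count());
            return (int)r;
        }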

  8. #23
    iMalc, Algorithm Dissector (Join Date: Dec 2005, Location: New Zealand, Posts: 6,318)
    Quote Originally Posted by cyberfish View Post
    I tried making the multithreaded version use a single thread... and the result is terrifying.
    Code:
        int r[1024*1024]; // result slots ~4 MB (1M ints) apart
        std::thread thread1(slow_exp, base, exp, &r[0]);
        //std::thread thread2(slow_exp, base, exp - exp / 2, &r[1024*1024-1]);
        thread1.join();
        //thread2.join();
        return r[0];
    That took twice as long as the "real" single-threaded version!
    Creating and destroying threads isn't exactly dirt cheap, which is why I suggested earlier not to create more than you have to.

    I don't suppose the first time you create a std::thread there is a lot of one-time setup done that isn't repeated when you create more threads? On Windows, loading a DLL (plus its symbols, if you're in the debugger), for example, would account for this kind of thing.
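    A quick way to test that (a sketch off the top of my head, nothing from the original code): time the creation and joining of several near-empty threads in a row and see whether the first one is an outlier.
    Code:
        #include <chrono>
        #include <cstdio>
        #include <thread>

        // Create and join several (almost) empty threads, timing each one.
        // If the first is far slower than the rest, some one-time setup is
        // being paid on the first creation.
        int main()
        {
            using clock = std::chrono::steady_clock;
            for (int i = 0; i < 5; ++i)
            {
                auto t0 = clock::now();
                std::thread t([]{});       // the thread itself does nothing
                t.join();
                auto t1 = clock::now();
                std::printf("thread %d: %lld us\n", i, (long long)
                    std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
            }
            return 0;
        }
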
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

