Thread: Multithreading no performance gain?

  1. #1
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229

    Multithreading no performance gain?

    I am faced with a strange performance problem in my code (made it multithreaded, with almost no extra work and no data sharing at all, and while the result was still right, it became slightly slower), and thought I would try to reproduce it in simpler code.

    Much to my surprise, I could actually reproduce it.

    Here is my test code, using std::thread from C++0x -
    Code:
    #include <iostream>
    #include <cstdlib>
    
    #ifdef MT
    #include <thread>
    #endif
    
    void slow_exp(int base, long long int exp, int *r) {
        *r = 1;
        for (long long int i = 0; i < exp; ++i) {
            *r *= base;
        }
    }
    
    int f(int base, long long int exp) {
    #ifndef MT
        int r;
        slow_exp(base, exp, &r);
        return r;
    #else
        int r1, r2;
        std::thread thread1(slow_exp, base, exp / 2, &r1);
        std::thread thread2(slow_exp, base, exp - exp / 2, &r2);
        thread1.join();
        thread2.join();
        return r1*r2;
    #endif
    }
    
    int main() {
        std::cout << f(3, 4000000000LL) << std::endl;
    }
    Defining MT makes it multithreaded.

    Just needed something that runs O(n) and the compiler is too dumb to optimize out completely.

    The result -
    Code:
    cyberfish@cyberfish-desktop:/tmp/tt$ g++ -O3 -std=gnu++0x tt.cpp -pthread
    cyberfish@cyberfish-desktop:/tmp/tt$ time ./a.out
    1818370049
    
    real	0m2.116s
    user	0m2.110s
    sys	0m0.000s
    cyberfish@cyberfish-desktop:/tmp/tt$ g++ -O3 -DMT -std=gnu++0x tt.cpp -pthread
    cyberfish@cyberfish-desktop:/tmp/tt$ time ./a.out
    1818370049
    
    real	0m2.252s
    user	0m4.360s
    sys	0m0.000s
    The result (taking into account overflows and such) was the same, but the multithreaded version is slightly slower! user time is almost twice, suggesting it's doing twice as much work. How can this be? There isn't even any data sharing.

    This is run on a Core 2 Duo (dual core) machine, and the single threaded version uses 50% CPU, while the 2 threaded version uses 100%.

    Any guesses?

    I'm hoping it's something fundamentally wrong with how I am doing it, and can transport the "fix" back to my real program.

    Thanks

  2. #2
    Registered User
    Join Date
    Dec 2008
    Location
    Black River
    Posts
    128
    It's possible that the compiler is inlining the call to slow_exp (and maybe even f too), which doesn't seem possible in the multi-threaded case as it needs its address.

    I've run into similar problems before, when trying to optimize code by using multiple threads. Most problems arise when it's an inner loop that you want to run in parallel, and the threads must wait for each other at the end of each iteration of the outer loop, and since the inner loop ought to be pretty tight, even this simple sync cost might still outweigh the gain.

  3. #3
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    slow_exp is only called once for single threaded, and twice for multi threaded, so I don't think inlining has much effect.

    There is only 1 synchronization here, at the end of the 2 seconds.

  4. #4
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    The problem is the long long int. If you reduce your exp slightly so that it fits in a regular int, and change all the long longs to just int, it behaves more like you would expect.

    I'll let you have the pleasure of figuring out why THAT is...
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  5. #5
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Well, there's actually two problems. I realized that I changed your code slightly on an earlier hunch and didn't change it back.

    The first, and truly major, problem, is that you are calculating directly into *r. This produces a memory access on every iteration of the loop. Because r1 and r2 are adjacent in memory, they reside on the same cache line, and the IPC cache snooping totally SCREWS your performance. You should calculate into a local variable, and then store into *r at the very end.

    Before I fixed that, the multi-threaded version ran *50 times* slower than the single threaded. But even after fixing it, it still didn't make sense. It wasn't until I switched to regular int for exp that things started to look normal.

    I believe the CPU is doing something really, really weird but can't figure it out quite yet.

    (BTW, I do not have c++0x so I rewrote the code to use pthreads, so the problem is definitely not the std::thread class)
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  6. #6
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    no the problem is that thread.join is a blocking call that blocks until the thread terminates.

  7. #7
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by abachler View Post
    no the problem is that thread.join is a blocking call that blocks until the thread terminates.
    No. join() waits for a thread to finish. The thread launches and begins executing the moment it is declared. Moving the joins around in the way you suggest would CAUSE the problem you are imagining. If what you said was true, then the execution would effective single-thread itself and the user and real times would be equal to each other (or at least close). The fact that user == 2*real is proof that the two threads are executing in parallel.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  8. #8
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Well, the OP is claiming that the operation is taking the same run-time regardless of whether they run it with threads or not. So unless they are mistakenly assuming user time == run time, there is a problem, and using posix or windows threads does not produce this issue.

    Yeah I saw the *r thing, which is definitely bad, but using a long long shouldn't effect performance enough to matter.
    Last edited by abachler; 12-30-2009 at 12:53 AM.

  9. #9
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by abachler View Post
    Well, the OP is claiming that the operation is taking the same run-time regardless of whether they run it with threads or not. So unless they are mistakenly assuming user time == run time, there is a problem, and using posix or windows threads does not produce this issue.
    I don't know what cyberfish may or may not be assuming, but *I* understand that if user time == num_threads * real_time, then the threading is working. There is obviously some kind of inter-CPU effect occurring behind the scenes which is slowing both threads down.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  10. #10
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Quote Originally Posted by brewbuck View Post
    I don't know what cyberfish may or may not be assuming, but *I* understand that if user time == num_threads * real_time, then the threading is working. There is obviously some kind of inter-CPU effect occurring behind the scenes which is slowing both threads down.
    Well, the *r thing will definately slow it to a crawl, but other than that I don't see any issues, well, other than teh fact that the slow_exp function is highly unoptimized to begin with but I assume that's why it is called slow_exp.

  11. #11
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by abachler View Post
    Well, the *r thing will definately slow it to a crawl, but other than that I don't see any issues, well, other than teh fact that the slow_exp function is highly unoptimized to begin with but I assume that's why it is called slow_exp.
    I don't see any issues either, but that doesn't change the fact that when I try it myself, it does exactly what cyberfish describes. Maybe on your system it doesn't, but that's yet more evidence that it's somehow related to the interaction between the two cores.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  12. #12
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    The problem is the long long int. If you reduce your exp slightly so that it fits in a regular int, and change all the long longs to just int, it behaves more like you would expect.

    I'll let you have the pleasure of figuring out why THAT is...
    But this is running on 64-bit Linux . Sorry I didn't mention it.

    I changed it to unsigned int and the results are virtually the same (same real time, user time 2x for 2 threads version).

    The first, and truly major, problem, is that you are calculating directly into *r. This produces a memory access on every iteration of the loop. Because r1 and r2 are adjacent in memory, they reside on the same cache line, and the IPC cache snooping totally SCREWS your performance. You should calculate into a local variable, and then store into *r at the very end.
    That definitely makes sense, but I tried it and it didn't change the result. Perhaps the compiler is too smart for its own good and decided to "optimize" it to eliminate the variable? I will check the asm.

  13. #13
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    That's not it.

    I changed it to
    Code:
    int r[1024*1024]; //~1MB apart
    std::thread thread1(slow_exp, base, exp / 2, &r[0]);
    std::thread thread2(slow_exp, base, exp - exp / 2, &r[1024*1024-1]);
    thread1.join();
    thread2.join();
    return r[0]*r[1024*1024-1];
    and it behaves the same way.

  14. #14
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by cyberfish View Post
    But this is running on 64-bit Linux . Sorry I didn't mention it.
    The problem just gets more and more fascinating...

    That definitely makes sense, but I tried it and it didn't change the result. Perhaps the compiler is too smart for its own good and decided to "optimize" it to eliminate the variable? I will check the asm.
    That's possible. It's also possible that the cache coherence protocol is totally different on the 64-bit architecture, in a way that makes it less of a problem. Regardless of architecture though, you should try to avoid having multiple threads working on the same piece of memory.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  15. #15
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    It makes no sense to create two additional threads and then leave the one that started them both idle! Only create one extra thread and just make a direct call from the current thread. You will have two threads doing work at the same time without the need to set up and tear down two more threads.
    Code:
    void slow_exp(int base, int exp, int *r) {
        int retVal = 1;
        for (int i = 0; i < exp; ++i) {
            retVal *= base;
        }
        *r = retVal;
    }
    
    int f(int base, int exp) {
    #ifndef MT
        int r;
        slow_exp(base, exp, &r);
        return r;
    #else
        int r1, r2;
        std::thread thread1(slow_exp, base, exp / 2, &r1);
        slow_exp(base, exp - exp / 2, &r2);
        thread1.join();
        return r1*r2;
    #endif
    }
    There's no use using long long int for 'exp' when the result is only an int.

    I assume that you do know about the binary power method?
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Multithreading (flag stopping a thread, ring buffer) volatile
    By ShwangShwing in forum C Programming
    Replies: 3
    Last Post: 05-19-2009, 07:27 AM
  2. Performance and footprint of virtual function
    By George2 in forum C++ Programming
    Replies: 8
    Last Post: 01-31-2008, 07:34 PM
  3. File map performance
    By George2 in forum C++ Programming
    Replies: 8
    Last Post: 01-04-2008, 04:18 AM
  4. inheritance and performance
    By kuhnmi in forum C++ Programming
    Replies: 5
    Last Post: 08-04-2004, 12:46 PM
  5. Multithreading, time gain?
    By Magos in forum Windows Programming
    Replies: 7
    Last Post: 05-08-2003, 05:35 PM