Multithreading no performance gain?

**cyberfish** · 12-29-2009

I am faced with a strange performance problem in my code (made it multithreaded, with almost no extra work and no data sharing at all, and while the result was still right, it became slightly slower), and thought I would try to reproduce it in simpler code.

Much to my surprise, I could actually reproduce it.

Here is my test code, using std::thread from C++0x -

Code:

#include <iostream>
#include <cstdlib>

#ifdef MT
#include <thread>
#endif

void slow_exp(int base, long long int exp, int *r) {
    *r = 1;
    for (long long int i = 0; i < exp; ++i) {
        *r *= base;
    }
}

int f(int base, long long int exp) {
#ifndef MT
    int r;
    slow_exp(base, exp, &r);
    return r;
#else
    int r1, r2;
    std::thread thread1(slow_exp, base, exp / 2, &r1);
    std::thread thread2(slow_exp, base, exp - exp / 2, &r2);
    thread1.join();
    thread2.join();
    return r1*r2;
#endif
}

int main() {
    std::cout << f(3, 4000000000LL) << std::endl;
}

Defining MT makes it multithreaded.

Just needed something that runs O(n) and the compiler is too dumb to optimize out completely.

The result -

Code:

cyberfish@cyberfish-desktop:/tmp/tt$ g++ -O3 -std=gnu++0x tt.cpp -pthread
cyberfish@cyberfish-desktop:/tmp/tt$ time ./a.out
1818370049

real	0m2.116s
user	0m2.110s
sys	0m0.000s
cyberfish@cyberfish-desktop:/tmp/tt$ g++ -O3 -DMT -std=gnu++0x tt.cpp -pthread
cyberfish@cyberfish-desktop:/tmp/tt$ time ./a.out
1818370049

real	0m2.252s
user	0m4.360s
sys	0m0.000s

The result (taking into account overflows and such) was the same, but the multithreaded version is slightly slower! user time is almost twice, suggesting it's doing twice as much work. How can this be? There isn't even any data sharing.

This is run on a Core 2 Duo (dual core) machine, and the single threaded version uses 50% CPU, while the 2 threaded version uses 100%.

Any guesses?

I'm hoping it's something fundamentally wrong with how I am doing it, and can transport the "fix" back to my real program.

Thanks

**Ronix** · 12-30-2009

It's possible that the compiler is inlining the call to slow_exp (and maybe even f too), which doesn't seem possible in the multi-threaded case as it needs its address.

I've run into similar problems before, when trying to optimize code by using multiple threads. Most problems arise when it's an inner loop that you want to run in parallel, and the threads must wait for each other at the end of each iteration of the outer loop, and since the inner loop ought to be pretty tight, even this simple sync cost might still outweigh the gain.

**cyberfish** · 12-30-2009

slow_exp is only called once for single threaded, and twice for multi threaded, so I don't think inlining has much effect.

There is only 1 synchronization here, at the end of the 2 seconds.

**brewbuck** · 12-30-2009

The problem is the long long int. If you reduce your exp slightly so that it fits in a regular int, and change all the long longs to just int, it behaves more like you would expect.

I'll let you have the pleasure of figuring out why THAT is...

**brewbuck** · 12-30-2009

Well, there's actually two problems. I realized that I changed your code slightly on an earlier hunch and didn't change it back.

The first, and truly major, problem, is that you are calculating directly into *r. This produces a memory access on every iteration of the loop. Because r1 and r2 are adjacent in memory, they reside on the same cache line, and the IPC cache snooping totally SCREWS your performance. You should calculate into a local variable, and then store into *r at the very end.

Before I fixed that, the multi-threaded version ran *50 times* slower than the single threaded. But even after fixing it, it still didn't make sense. It wasn't until I switched to regular int for exp that things started to look normal.

I believe the CPU is doing something really, really weird but can't figure it out quite yet.

(BTW, I do not have c++0x so I rewrote the code to use pthreads, so the problem is definitely not the std::thread class)

**abachler** · 12-30-2009

no the problem is that thread.join is a blocking call that blocks until the thread terminates.

**brewbuck** · 12-30-2009

Originally Posted by abachler

no the problem is that thread.join is a blocking call that blocks until the thread terminates.

No. join() waits for a thread to finish. The thread launches and begins executing the moment it is declared. Moving the joins around in the way you suggest would CAUSE the problem you are imagining. If what you said was true, then the execution would effective single-thread itself and the user and real times would be equal to each other (or at least close). The fact that user == 2*real is proof that the two threads are executing in parallel.

**abachler** · 12-30-2009

Well, the OP is claiming that the operation is taking the same run-time regardless of whether they run it with threads or not. So unless they are mistakenly assuming user time == run time, there is a problem, and using posix or windows threads does not produce this issue.

Yeah I saw the *r thing, which is definitely bad, but using a long long shouldn't effect performance enough to matter.

**brewbuck** · 12-30-2009

Originally Posted by abachler

Well, the OP is claiming that the operation is taking the same run-time regardless of whether they run it with threads or not. So unless they are mistakenly assuming user time == run time, there is a problem, and using posix or windows threads does not produce this issue.

I don't know what cyberfish may or may not be assuming, but *I* understand that if user time == num_threads * real_time, then the threading is working. There is obviously some kind of inter-CPU effect occurring behind the scenes which is slowing both threads down.

**abachler** · 12-30-2009

Originally Posted by brewbuck

I don't know what cyberfish may or may not be assuming, but *I* understand that if user time == num_threads * real_time, then the threading is working. There is obviously some kind of inter-CPU effect occurring behind the scenes which is slowing both threads down.

Well, the *r thing will definately slow it to a crawl, but other than that I don't see any issues, well, other than teh fact that the slow_exp function is highly unoptimized to begin with but I assume that's why it is called slow_exp.

**brewbuck** · 12-30-2009

Originally Posted by abachler

Well, the *r thing will definately slow it to a crawl, but other than that I don't see any issues, well, other than teh fact that the slow_exp function is highly unoptimized to begin with but I assume that's why it is called slow_exp.

I don't see any issues either, but that doesn't change the fact that when I try it myself, it does exactly what cyberfish describes. Maybe on your system it doesn't, but that's yet more evidence that it's somehow related to the interaction between the two cores.

**cyberfish** · 12-30-2009

The problem is the long long int. If you reduce your exp slightly so that it fits in a regular int, and change all the long longs to just int, it behaves more like you would expect.

I'll let you have the pleasure of figuring out why THAT is...

But this is running on 64-bit Linux

. Sorry I didn't mention it.

I changed it to unsigned int and the results are virtually the same (same real time, user time 2x for 2 threads version).

The first, and truly major, problem, is that you are calculating directly into *r. This produces a memory access on every iteration of the loop. Because r1 and r2 are adjacent in memory, they reside on the same cache line, and the IPC cache snooping totally SCREWS your performance. You should calculate into a local variable, and then store into *r at the very end.

That definitely makes sense, but I tried it and it didn't change the result. Perhaps the compiler is too smart for its own good and decided to "optimize" it to eliminate the variable? I will check the asm.

**cyberfish** · 12-30-2009

That's not it.

I changed it to

Code:

int r[1024*1024]; //~1MB apart
std::thread thread1(slow_exp, base, exp / 2, &r[0]);
std::thread thread2(slow_exp, base, exp - exp / 2, &r[1024*1024-1]);
thread1.join();
thread2.join();
return r[0]*r[1024*1024-1];

and it behaves the same way.

**brewbuck** · 12-30-2009

Originally Posted by cyberfish

But this is running on 64-bit Linux

. Sorry I didn't mention it.

The problem just gets more and more fascinating...

That definitely makes sense, but I tried it and it didn't change the result. Perhaps the compiler is too smart for its own good and decided to "optimize" it to eliminate the variable? I will check the asm.

That's possible. It's also possible that the cache coherence protocol is totally different on the 64-bit architecture, in a way that makes it less of a problem. Regardless of architecture though, you should try to avoid having multiple threads working on the same piece of memory.

**iMalc** · 12-30-2009

It makes no sense to create two additional threads and then leave the one that started them both idle! Only create one extra thread and just make a direct call from the current thread. You will have two threads doing work at the same time without the need to set up and tear down two more threads.

Code:

void slow_exp(int base, int exp, int *r) {
    int retVal = 1;
    for (int i = 0; i < exp; ++i) {
        retVal *= base;
    }
    *r = retVal;
}

int f(int base, int exp) {
#ifndef MT
    int r;
    slow_exp(base, exp, &r);
    return r;
#else
    int r1, r2;
    std::thread thread1(slow_exp, base, exp / 2, &r1);
    slow_exp(base, exp - exp / 2, &r2);
    thread1.join();
    return r1*r2;
#endif
}

There's no use using long long int for 'exp' when the result is only an int.

I assume that you do know about the binary power method?

Thread: Multithreading no performance gain?

Thread Tools

Search Thread

Display

Multithreading no performance gain?

Similar Threads

Multithreading (flag stopping a thread, ring buffer) volatile

Performance and footprint of virtual function

File map performance

inheritance and performance

Multithreading, time gain?