I am faced with a strange performance problem in my code (made it multithreaded, with almost no extra work and no data sharing at all, and while the result was still right, it became slightly slower), and thought I would try to reproduce it in simpler code.

Much to my surprise, I could actually reproduce it.

Here is my test code, using std::thread from C++0x -
Code:
#include <iostream>
#include <cstdlib>

#ifdef MT
#include <thread>
#endif

void slow_exp(int base, long long int exp, int *r) {
    *r = 1;
    for (long long int i = 0; i < exp; ++i) {
        *r *= base;
    }
}

int f(int base, long long int exp) {
#ifndef MT
    int r;
    slow_exp(base, exp, &r);
    return r;
#else
    int r1, r2;
    std::thread thread1(slow_exp, base, exp / 2, &r1);
    std::thread thread2(slow_exp, base, exp - exp / 2, &r2);
    thread1.join();
    thread2.join();
    return r1*r2;
#endif
}

int main() {
    std::cout << f(3, 4000000000LL) << std::endl;
}
Defining MT makes it multithreaded.

Just needed something that runs O(n) and the compiler is too dumb to optimize out completely.

The result -
Code:
cyberfish@cyberfish-desktop:/tmp/tt$ g++ -O3 -std=gnu++0x tt.cpp -pthread
cyberfish@cyberfish-desktop:/tmp/tt$ time ./a.out
1818370049

real	0m2.116s
user	0m2.110s
sys	0m0.000s
cyberfish@cyberfish-desktop:/tmp/tt$ g++ -O3 -DMT -std=gnu++0x tt.cpp -pthread
cyberfish@cyberfish-desktop:/tmp/tt$ time ./a.out
1818370049

real	0m2.252s
user	0m4.360s
sys	0m0.000s
The result (taking into account overflows and such) was the same, but the multithreaded version is slightly slower! user time is almost twice, suggesting it's doing twice as much work. How can this be? There isn't even any data sharing.

This is run on a Core 2 Duo (dual core) machine, and the single threaded version uses 50% CPU, while the 2 threaded version uses 100%.

Any guesses?

I'm hoping it's something fundamentally wrong with how I am doing it, and can transport the "fix" back to my real program.

Thanks