What are the specs of the machine you are running Linux on? Multi-threading only gives an advantage if there is spare processing capacity on more than one CPU.
--
Mats
Compilers can produce warnings - make the compiler programmers happy: Use them!
Please don't PM me for help - and no, I don't do help over instant messengers.
Both systems are on a notebook with an Intel Pentium Dual-Core (2 cores clocked at 1.6 GHz each). I notice that both cores max out with two threads, while only one maxes out with one thread (this is true on both Vista and Ubuntu). Yet on Vista this translates to a performance gain when a lot of processing has to be done; on Linux, I find it does not. I was just wondering what the most probable cause is. I'm thinking either the library builds or the compilers (MSVC++ vs. GCC).
"If you tell the truth, you don't have to remember anything"
-Mark Twain
Sounds rather strange - I'd expect to see the same factor of improvement on Linux as on Windows. But without knowing what the code does and what is taking up its time, it's pretty hard to say for sure.
--
Mats
I think only a profiler can answer this question.
All the buzzt!
CornedBee
"There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
- Flon's Law
I should warn that this is ugly code. This is a "toy" program.
The calls to barrier::wait() are the most expensive part. One can see how the size of the array would matter more than the number of iterations, since wait() is called twice per iteration regardless. Still, no matter how large the array is, I see no time gained on Linux.
Code:
//Applies the method of relaxation to an array of floating point values
//in order to approximate the solution to the wave equation, via a diffusion
//equation, with boundary condition of zero at the edges.
// i.e. Applies a function to every member of an array (v), stores it in another
// array (lap), and then does v += lap
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/thread/barrier.hpp>
#include <cmath>   //sin(), asin()
#include <cassert>
#include <cstdlib> //rand(), div()
#include <ctime>   //clock()

#define DO_FOR(i_,beg_,end_) for(int (i_) = (beg_); (i_) < (end_); ++(i_))

const double pi = 2.0*std::asin(1.0); //at file scope so print_it() can see it

template<class T1, class T2>
std::ostream & operator << (std::ostream & os, std::pair<T1,T2> & p)
{
    os << p.first << ',' << p.second;
    return os;
}

inline void randomize(double * arg, unsigned int n)
{
    for(unsigned int i = 0; i < n; ++i)
        arg[i] = std::rand();
}

void print_it(double * v, unsigned int sz)
{
    if(sz < 150)
    {
        DO_FOR(i,1,sz-1)
            std::cout << ( v[i] / sin(pi/(sz-1)*i) ) << ' ';
    }
}

typedef std::pair<int,int> interval;

struct task
{
    void operator()() //This is run per thread
    {
        for(unsigned long int t = 0; t < ntimes; ++t)
        {
            for(int i = iv.first; i < iv.second; ++i)
                lap[i] = (v[i-1] + v[i+1] - 2.0*v[i]) / 2.0;
            bar.wait(); //Important! Don't comment out!
            for(int i = iv.first; i < iv.second; ++i)
                v[i] += dt*lap[i];
            bar.wait();
        }
    }
    task(double * v_arg, double * lap_arg, unsigned long int ntimes_arg,
         double dt_arg, boost::barrier & bar_arg, interval iv_arg)
        : v(v_arg), lap(lap_arg), ntimes(ntimes_arg), dt(dt_arg),
          bar(bar_arg), iv(iv_arg) {}
private:
    double * v;
    double * lap;
    unsigned long int ntimes;
    double dt;
    boost::barrier & bar;
    interval iv;
};

int main()
{
    using namespace std;
    double dt = 0.1;
    cout << "Size: ";
    unsigned long int sz;
    cin >> sz;
    cout << "Iterations: ";
    unsigned long int count;
    cin >> count;
    cout << "Number of threads: ";
    unsigned int nthreads;
    cin >> nthreads;

    //randomize v
    double * v = new double[sz];
    randomize(v,sz);
    v[0] = v[sz-1] = 0.0;

    //splits v into nthreads intervals
    interval * thread_intervals = new interval[nthreads];
    ldiv_t d = div(static_cast<long>(sz-2), static_cast<long>(nthreads));
    unsigned int offset = 0;
    DO_FOR(i, 0, nthreads)
    {
        unsigned int x = 1 + i*d.quot + offset;
        if(d.rem-- > 0) ++offset;
        thread_intervals[i] = interval(x, 1 + (i+1)*d.quot + offset);
        if(i == nthreads-1) assert(1 + (i+1)*d.quot + offset == sz - 1);
    }

    //Prints out intervals
    DO_FOR(i,0,nthreads) cout << thread_intervals[i] << endl;
    cout << endl;

    //uses multiple threads; stopwatch
    double * lap = new double[sz];
    boost::barrier bar(nthreads);
    boost::thread_group threads;
    clock_t time_with_threads;
    clock_t beg = clock();
    DO_FOR(i,0,nthreads)
        threads.create_thread( task(v,lap,count,dt,bar,thread_intervals[i]) );
    threads.join_all();
    clock_t end = clock();
    cout << "Below are the results for " << nthreads << " threads, which took "
         << (time_with_threads = (end-beg)) << " ticks." << endl;
    print_it(v, sz);
    cout << '\n' << endl;

    //Randomize v again
    randomize(v,sz);
    v[0] = v[sz-1] = 0.0;

    //Now do with just one thread, for comparison; stopwatch
    clock_t time_without_threads;
    beg = clock();
    DO_FOR(i,0,count)
    {
        DO_FOR(k,1,sz-1) lap[k] = (v[k+1] + v[k-1] - 2.0*v[k]) / 2.0;
        DO_FOR(j,1,sz-1) v[j] += dt*lap[j];
    }
    end = clock();
    cout << "Below are the results for one thread, which took "
         << (time_without_threads = (end-beg)) << " ticks." << endl;
    print_it(v, sz);
    cout << endl;

    delete [] lap;
    delete [] v;
    delete [] thread_intervals;
    cout << "\nDone. Press enter to return." << endl;
    cin.ignore();
    cin.get();
}
Last edited by CodeMonkey; 12-16-2008 at 10:30 AM.
Making our own version of the C language, by introducing new constructs for existing language features, are we?
Code:
#define DO_FOR(i_,beg_,end_) for(int (i_) = (beg_); (i_) < (end_); ++(i_))
barrier::wait() does what, exactly? Is it possible that this is implemented differently on Windows and Linux?
Do I take it that each thread operates on every nthread-th element of the lap array? If so, you'd run into cache-coherence complications if the CPUs do not have fully shared caches [I know of no multicore CPU that has fully shared caches].
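To illustrate the cache-coherence point, here is a hypothetical demo of false sharing, not taken from the code in this thread (the struct names and the 64-byte line size are my assumptions, and std::thread stands in for boost::thread): two threads update adjacent counters, and whether the counters share a cache line or each get their own line only affects speed, never the results.
Code:
// Hypothetical false-sharing demo: two threads increment separate
// counters. In SharedLine the counters sit in the same cache line,
// so the line ping-pongs between cores; in PaddedLines each counter
// gets its own (assumed) 64-byte line. Both layouts produce the same
// sums - only the speed differs, so timing is not asserted here.
#include <thread>
#include <cassert>
#include <iostream>

struct SharedLine  { long a; long b; };                          // a and b share a line
struct PaddedLines { alignas(64) long a; alignas(64) long b; };  // one line each

template <class Counters>
void run(Counters & c, long iters)
{
    std::thread t1([&]{ for (long i = 0; i < iters; ++i) ++c.a; });
    std::thread t2([&]{ for (long i = 0; i < iters; ++i) ++c.b; });
    t1.join();
    t2.join();
}

int main()
{
    const long iters = 1000000;
    SharedLine  s = {0, 0};
    PaddedLines p = {0, 0};
    run(s, iters);
    run(p, iters);
    assert(s.a == iters && s.b == iters);
    assert(p.a == iters && p.b == iters);
    std::cout << "both layouts give identical results\n";
}
Only the padded layout avoids the coherence traffic; on a dual-core machine that difference can dominate a loop this tight.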
Finally, clock() behaves differently on Windows and Linux - on Linux, if you have two threads running at the same time, it will report twice the amount of CPU time, because CPU time is what clock() returns. So you probably need to measure actual wall-clock time in seconds (with gettimeofday(), for example) to see how long it takes with one thread versus many.
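A small sketch of the difference (my own example; the helper name wall_usecs is made up): clock() returns CPU time consumed by the whole process, while gettimeofday() returns wall-clock time, so even a simple sleep makes the two diverge immediately.
Code:
// clock() measures CPU time (summed across threads on Linux);
// gettimeofday() measures wall-clock time. A sleep consumes wall
// time but almost no CPU time, so the two readings diverge.
#include <ctime>
#include <cassert>
#include <iostream>
#include <sys/time.h>
#include <unistd.h>   // usleep()

long long wall_usecs()  // hypothetical helper: wall-clock time in microseconds
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1000000LL + tv.tv_usec;
}

int main()
{
    clock_t   cpu_beg  = clock();
    long long wall_beg = wall_usecs();

    usleep(200000); // sleep 200 ms: wall time passes, the CPU is idle

    double cpu_secs  = double(clock() - cpu_beg) / CLOCKS_PER_SEC;
    double wall_secs = double(wall_usecs() - wall_beg) / 1e6;

    std::cout << "cpu: " << cpu_secs << "s, wall: " << wall_secs << "s\n";
    assert(wall_secs >= 0.15);      // roughly the sleep duration
    assert(cpu_secs  <  wall_secs); // sleeping used (almost) no CPU
    // With two busy threads the relation flips the other way:
    // clock() can report up to twice the wall-clock time.
}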
--
Mats
clock() works differently on Windows and Linux. I blame that, but I first need to check this.
All the buzzt!
CornedBee
Hm ok.
1. barrier::wait() uses a scoped lock on a mutex to halt a thread until a certain number of threads (nthreads) have called it. I don't know the system-level implementation.
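I don't know what Boost actually does under the hood either, but a minimal sketch of the usual counting-barrier technique - a counter and a condition variable protected by a mutex, plus a generation number so the barrier can be reused across iterations - might look like this (the name simple_barrier is my own, and std::thread/std::mutex stand in for the boost equivalents):
Code:
// Minimal reusable barrier sketch: each arriving thread decrements a
// counter under a mutex; the last one resets the counter, bumps the
// generation, and wakes everybody. The generation number prevents a
// fast thread from racing through the barrier twice while slower
// threads are still waking up.
#include <mutex>
#include <thread>
#include <vector>
#include <cassert>
#include <condition_variable>

class simple_barrier
{
public:
    explicit simple_barrier(unsigned count)
        : threshold(count), remaining(count), generation(0) {}

    void wait()
    {
        std::unique_lock<std::mutex> lock(m);
        unsigned gen = generation;
        if (--remaining == 0)   // last thread to arrive
        {
            ++generation;       // start the next round
            remaining = threshold;
            cv.notify_all();
        }
        else
        {
            cv.wait(lock, [&]{ return gen != generation; });
        }
    }

private:
    std::mutex m;
    std::condition_variable cv;
    const unsigned threshold;
    unsigned remaining;
    unsigned generation;
};

int main()
{
    const unsigned nthreads = 4;
    simple_barrier bar(nthreads);
    std::mutex cm;
    unsigned arrived = 0;

    std::vector<std::thread> ts;
    for (unsigned i = 0; i < nthreads; ++i)
        ts.emplace_back([&]{
            { std::lock_guard<std::mutex> g(cm); ++arrived; }
            bar.wait();  // nobody passes until all nthreads arrive
            std::lock_guard<std::mutex> g(cm);
            assert(arrived == nthreads);
        });
    for (unsigned i = 0; i < nthreads; ++i) ts[i].join();
    assert(arrived == nthreads);
}
The per-wait cost is a mutex lock plus, for all but the last thread, a sleep on the condition variable and a wakeup - which is why calling wait() twice per iteration of a tight loop is expensive.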
2. Yes, each thread operates on a unique portion of the array. I'll look into cache-coherence (never heard of it).
3. clock() is called before and after the threads run. Still, I'll try another function as you suggested.
Stupid me. I can't validate this on a single-core machine. For me, two threads actually take about twice as long as one.
Here's the modified code, try it out.
Code:
//Applies the method of relaxation to an array of floating point values
//in order to approximate the solution to the wave equation, via a diffusion
//equation, with boundary condition of zero at the edges.
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/thread/barrier.hpp>
#include <cmath>   //sin()
#include <cassert>
#include <cstdlib> //rand(), div()
#include <sys/time.h> //gettimeofday()

#define DO_FOR(i_,beg_,end_) for(int (i_) = (beg_); (i_) < (end_); ++(i_))

const double pi = M_PI;

template<class T1, class T2>
std::ostream & operator << (std::ostream & os, std::pair<T1,T2> & p)
{
    os << p.first << ',' << p.second;
    return os;
}

inline void randomize(double * arg, unsigned int n)
{
    for(unsigned int i = 0; i < n; ++i)
        arg[i] = std::rand();
}

void print_it(double * v, unsigned int sz)
{
    if(sz < 150)
    {
        DO_FOR(i,1,sz-1)
            std::cout << ( v[i] / sin(pi/(sz-1)*i) ) << ' ';
    }
}

typedef std::pair<int,int> interval;

struct task
{
    void operator()() //This is run per thread
    {
        for(unsigned long int t = 0; t < ntimes; ++t)
        {
            for(int i = iv.first; i < iv.second; ++i)
                lap[i] = (v[i-1] + v[i+1] - 2.0*v[i]) / 2.0;
            bar.wait(); //Important! Don't comment out!
            for(int i = iv.first; i < iv.second; ++i)
                v[i] += dt*lap[i];
            bar.wait();
        }
    }
    task(double * v_arg, double * lap_arg, unsigned long int ntimes_arg,
         double dt_arg, boost::barrier & bar_arg, interval iv_arg)
        : v(v_arg), lap(lap_arg), ntimes(ntimes_arg), dt(dt_arg),
          bar(bar_arg), iv(iv_arg) {}
private:
    double * v;
    double * lap;
    unsigned long int ntimes;
    double dt;
    boost::barrier & bar;
    interval iv;
};

int main()
{
    using namespace std;
    double dt = 0.1;
    cout << "Size: ";
    unsigned long int sz;
    cin >> sz;
    cout << "Iterations: ";
    unsigned long int count;
    cin >> count;
    cout << "Number of threads: ";
    unsigned int nthreads;
    cin >> nthreads;

    //randomize v
    double * v = new double[sz];
    randomize(v,sz);
    v[0] = v[sz-1] = 0.0;

    //splits v into nthreads intervals
    interval * thread_intervals = new interval[nthreads];
    ldiv_t d = div(static_cast<long>(sz-2), static_cast<long>(nthreads));
    unsigned int offset = 0;
    DO_FOR(i, 0, nthreads)
    {
        unsigned int x = 1 + i*d.quot + offset;
        if(d.rem-- > 0) ++offset;
        thread_intervals[i] = interval(x, 1 + (i+1)*d.quot + offset);
        if(i == nthreads-1) assert(1 + (i+1)*d.quot + offset == sz - 1);
    }

    //Prints out intervals
    DO_FOR(i,0,nthreads) cout << thread_intervals[i] << endl;
    cout << endl;

    //uses multiple threads; stopwatch
    double * lap = new double[sz];
    boost::barrier bar(nthreads);
    boost::thread_group threads;
    //clock_t time_with_threads;
    //clock_t beg = clock();
    struct timeval beg;
    gettimeofday(&beg, 0);
    DO_FOR(i,0,nthreads)
        threads.create_thread( task(v,lap,count,dt,bar,thread_intervals[i]) );
    threads.join_all();
    //clock_t end = clock();
    struct timeval end;
    gettimeofday(&end, 0);
    long long llbeg = (beg.tv_sec * 1000000LL) + beg.tv_usec;
    long long llend = (end.tv_sec * 1000000LL) + end.tv_usec;
    long long time_with_threads;
    cout << "Below are the results for " << nthreads << " threads, which took "
         << (time_with_threads = (llend-llbeg)) << " usecs." << endl;
    print_it(v, sz);
    cout << '\n' << endl;

    //Randomize v again
    randomize(v,sz);
    v[0] = v[sz-1] = 0.0;

    //Now do with just one thread, for comparison; stopwatch
    //beg = clock();
    gettimeofday(&beg, 0);
    DO_FOR(i,0,count)
    {
        DO_FOR(k,1,sz-1) lap[k] = (v[k+1] + v[k-1] - 2.0*v[k]) / 2.0;
        DO_FOR(j,1,sz-1) v[j] += dt*lap[j];
    }
    //end = clock();
    gettimeofday(&end, 0);
    llbeg = (beg.tv_sec * 1000000LL) + beg.tv_usec;
    llend = (end.tv_sec * 1000000LL) + end.tv_usec;
    long long time_without_threads;
    cout << "Below are the results for one thread, which took "
         << (time_without_threads = (llend-llbeg)) << " usecs." << endl;
    print_it(v, sz);
    cout << endl;

    delete [] lap;
    delete [] v;
    delete [] thread_intervals;
    cout << "\nDone. Press enter to return." << endl;
    cin.ignore();
    cin.get();
}
All the buzzt!
CornedBee
I had modified my code exactly as you did. Thank you for taking the time. And yes, now I get a decidedly smaller number for two threads. Thanks.
Actually, the gain from two threads is rather more dramatic than on Vista. I think maybe program performance is like the stock market -- a lot of people know a lot about it, but nobody really has a clue. Or maybe just I don't have a clue. Thanks again for all of your help (everyone).
I _THINK_ I know a whole lot more about performance than I _THINK_ I know about the stock market. But you are right, the concept of both is pretty simple - in performance, we essentially want as few cycles as possible, and as many of the cycles as possible should "do something useful" - knowing how to get that out of the code can be trickier than the principle says, of course. The stock-market principle is also simple: Buy when prices are low, and sell when prices are high. Knowing when prices are low and when they are high is the tricky part (because you either buy before they bottom out, or on the up, and sell before the peak or after the peak - very hard to hit the peak EXACTLY).
--
Mats