What are the specs of the machine you are running Linux on? Multi-threading only gives an advantage if there is spare processing capacity on more than one CPU.
--
Mats
Compilers can produce warnings - make the compiler programmers happy: Use them!
Please don't PM me for help - and no, I don't do help over instant messengers.
Both systems are on a notebook with an Intel Pentium Dual-Core (2 cores clocked at 1.6 GHz each). I notice that both cores max out with two threads, while only one maxes out with one thread (this is true on both Vista and Ubuntu). Yet on Vista this translates to a performance gain when a lot of processing has to be done; on Linux, I find it does not. I was just wondering what the most probable cause is. I'm thinking either the library builds or the compilers (MSVC++ vs. GCC).
"If you tell the truth, you don't have to remember anything"
-Mark Twain
Sounds rather strange - I'd expect to see the same factor of improvement on Linux as on Windows. But without knowing what the code does and what is taking up its time, it's pretty hard to say for sure.
--
Mats
I think only a profiler can answer this question.
All the buzzt!
CornedBee
"There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
- Flon's Law
I should warn that this is ugly code. This is a "toy" program.
The calls to barrier::wait() are the most expensive part. One can see how the size of the array would matter more than the number of iterations, since wait() is called twice per iteration regardless. Still, no matter how large the array is, I see no time gained on Linux.
Code:
//Applies the method of relaxation to an array of floating point values
//in order to approximate the solution to the wave equation, via a diffusion
//equation, with boundary condition of zero at the edges.
// i.e. Applies a function to every member of an array (v), stores it in another
// array (lap), and then does v += lap
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/thread/barrier.hpp>
#include <cmath>   //sin(), asin()
#include <cassert>
#include <cstdlib> //rand(), div()
#include <ctime>   //clock()

#define DO_FOR(i_,beg_,end_) for(int (i_) = (beg_); (i_) < (end_); ++(i_))

const double pi = 2.0*std::asin(1.0); //at file scope so print_it() can see it

template<class T1, class T2>
std::ostream & operator << (std::ostream & os, std::pair<T1,T2> & p)
{
    os << p.first << ',' << p.second;
    return os;
}

inline void randomize(double * arg, unsigned int n)
{
    for(unsigned int i = 0; i < n; ++i)
        arg[i] = std::rand();
}

void print_it(double * v, unsigned int sz)
{
    if(sz < 150)
    {
        DO_FOR(i,1,sz-1)
            std::cout << ( v[i] / sin(pi/(sz-1)*i) ) << ' ';
    }
}

typedef std::pair<int,int> interval;

struct task
{
    void operator()() //This is run per thread
    {
        for(unsigned long int t = 0; t < ntimes; ++t)
        {
            for(int i = iv.first; i < iv.second; ++i)
                lap[i] = (v[i-1] + v[i+1] - 2.0*v[i]) / 2.0;
            bar.wait(); //Important! Don't comment out!
            for(int i = iv.first; i < iv.second; ++i)
                v[i] += dt*lap[i];
            bar.wait();
        }
    }
    task(double * v_arg, double * lap_arg, unsigned long int ntimes_arg,
         double dt_arg, boost::barrier & bar_arg, interval iv_arg)
        : v(v_arg), lap(lap_arg), ntimes(ntimes_arg), dt(dt_arg),
          bar(bar_arg), iv(iv_arg) {}
private:
    double * v;
    double * lap;
    unsigned long int ntimes;
    double dt;
    boost::barrier & bar;
    interval iv;
};

int main()
{
    using namespace std;
    double dt = 0.1;
    cout << "Size: ";
    unsigned long int sz;
    cin >> sz;
    cout << "Iterations: ";
    unsigned long int count;
    cin >> count;
    cout << "Number of threads: ";
    unsigned int nthreads;
    cin >> nthreads;

    //randomize v
    double * v = new double[sz];
    randomize(v,sz);
    v[0] = v[sz-1] = 0.0;

    //splits v into nthreads intervals
    interval * thread_intervals = new interval[nthreads];
    ldiv_t d = div(static_cast<long>(sz-2), static_cast<long>(nthreads));
    unsigned int offset = 0;
    DO_FOR(i, 0, nthreads)
    {
        unsigned int x = 1 + i*d.quot + offset;
        if(d.rem-- > 0) ++offset;
        thread_intervals[i] = interval(x, 1 + (i+1)*d.quot + offset);
        if(i == nthreads-1) assert(1 + (i+1)*d.quot + offset == sz - 1);
    }

    //Prints out intervals
    DO_FOR(i,0,nthreads) cout << thread_intervals[i] << endl;
    cout << endl;

    //uses multiple threads; stopwatch
    double * lap = new double[sz];
    boost::barrier bar(nthreads);
    boost::thread_group threads;
    clock_t time_with_threads;
    clock_t beg = clock();
    DO_FOR(i,0,nthreads)
        threads.create_thread( task(v,lap,count,dt,bar,thread_intervals[i]) );
    threads.join_all();
    clock_t end = clock();
    cout << "Below are the results for " << nthreads << " threads, which took "
         << (time_with_threads = (end-beg)) << " ticks." << endl;
    print_it(v, sz);
    cout << '\n' << endl;

    //Randomize v again
    randomize(v,sz);
    v[0] = v[sz-1] = 0.0;

    //Now do with just one thread, for comparison; stopwatch
    clock_t time_without_threads;
    beg = clock();
    DO_FOR(i,0,count)
    {
        DO_FOR(k,1,sz-1) lap[k] = (v[k+1] + v[k-1] - 2.0*v[k]) / 2.0;
        DO_FOR(j,1,sz-1) v[j] += dt*lap[j];
    }
    end = clock();
    cout << "Below are the results for one thread, which took "
         << (time_without_threads = (end-beg)) << " ticks." << endl;
    print_it(v, sz);
    cout << endl;

    delete [] lap;
    delete [] v;
    delete [] thread_intervals;
    cout << "\nDone. Press enter to return." << endl;
    cin.ignore();
    cin.get();
}
Last edited by CodeMonkey; 12-16-2008 at 10:30 AM.
Making our own version of the C language, by introducing new constructs for existing language features, are we?
Code:
#define DO_FOR(i_,beg_,end_) for(int (i_) = (beg_); (i_) < (end_); ++(i_))
barrier::wait() does what, exactly? Is it possible that this is implemented differently on Windows and Linux?
Do I take it that each thread operates on every nthread-th element of the lap array? If so, you'd run into cache-coherence complications if the CPUs do not have fully shared caches [I know of no multicore CPU that has fully shared caches].
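To illustrate the cache-coherence point, here is a hypothetical demo of false sharing, not taken from the code in this thread (the struct names and the 64-byte line size are my assumptions, and std::thread stands in for boost::thread): two threads update adjacent counters, and whether the counters share a cache line or each get their own line only affects speed, never the results.
Code:
// Hypothetical false-sharing demo: two threads increment separate
// counters. In SharedLine the counters sit in the same cache line,
// so the line ping-pongs between cores; in PaddedLines each counter
// gets its own (assumed) 64-byte line. Both layouts produce the same
// sums - only the speed differs, so timing is not asserted here.
#include <thread>
#include <cassert>
#include <iostream>

struct SharedLine  { long a; long b; };                          // a and b share a line
struct PaddedLines { alignas(64) long a; alignas(64) long b; };  // one line each

template <class Counters>
void run(Counters & c, long iters)
{
    std::thread t1([&]{ for (long i = 0; i < iters; ++i) ++c.a; });
    std::thread t2([&]{ for (long i = 0; i < iters; ++i) ++c.b; });
    t1.join();
    t2.join();
}

int main()
{
    const long iters = 1000000;
    SharedLine  s = {0, 0};
    PaddedLines p = {0, 0};
    run(s, iters);
    run(p, iters);
    assert(s.a == iters && s.b == iters);
    assert(p.a == iters && p.b == iters);
    std::cout << "both layouts give identical results\n";
}
Only the padded layout avoids the coherence traffic; on a dual-core machine that difference can dominate a loop this tight.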
Finally, clock() behaves differently on Windows and Linux - on Linux, if you have two threads running at the same time, it will report twice the amount of CPU time, because CPU time is what clock() returns. So you probably need to measure actual wall-clock time in seconds (with gettimeofday(), for example) to see how long it takes with one thread versus many.
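A small sketch of the difference (my own example; the helper name wall_usecs is made up): clock() returns CPU time consumed by the whole process, while gettimeofday() returns wall-clock time, so even a simple sleep makes the two diverge immediately.
Code:
// clock() measures CPU time (summed across threads on Linux);
// gettimeofday() measures wall-clock time. A sleep consumes wall
// time but almost no CPU time, so the two readings diverge.
#include <ctime>
#include <cassert>
#include <iostream>
#include <sys/time.h>
#include <unistd.h>   // usleep()

long long wall_usecs()  // hypothetical helper: wall-clock time in microseconds
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1000000LL + tv.tv_usec;
}

int main()
{
    clock_t   cpu_beg  = clock();
    long long wall_beg = wall_usecs();

    usleep(200000); // sleep 200 ms: wall time passes, the CPU is idle

    double cpu_secs  = double(clock() - cpu_beg) / CLOCKS_PER_SEC;
    double wall_secs = double(wall_usecs() - wall_beg) / 1e6;

    std::cout << "cpu: " << cpu_secs << "s, wall: " << wall_secs << "s\n";
    assert(wall_secs >= 0.15);      // roughly the sleep duration
    assert(cpu_secs  <  wall_secs); // sleeping used (almost) no CPU
    // With two busy threads the relation flips the other way:
    // clock() can report up to twice the wall-clock time.
}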
--
Mats
clock() works differently on Windows and Linux. I blame that, but I first need to check this.
All the buzzt!
CornedBee
Hm ok.
1. barrier::wait() uses a scoped lock on a mutex to halt a thread until a certain number of threads (nthreads) have called it. I don't know the system-level implementation.
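I don't know what Boost actually does under the hood either, but a minimal sketch of the usual counting-barrier technique - a counter and a condition variable protected by a mutex, plus a generation number so the barrier can be reused across iterations - might look like this (the name simple_barrier is my own, and std::thread/std::mutex stand in for the boost equivalents):
Code:
// Minimal reusable barrier sketch: each arriving thread decrements a
// counter under a mutex; the last one resets the counter, bumps the
// generation, and wakes everybody. The generation number prevents a
// fast thread from racing through the barrier twice while slower
// threads are still waking up.
#include <mutex>
#include <thread>
#include <vector>
#include <cassert>
#include <condition_variable>

class simple_barrier
{
public:
    explicit simple_barrier(unsigned count)
        : threshold(count), remaining(count), generation(0) {}

    void wait()
    {
        std::unique_lock<std::mutex> lock(m);
        unsigned gen = generation;
        if (--remaining == 0)   // last thread to arrive
        {
            ++generation;       // start the next round
            remaining = threshold;
            cv.notify_all();
        }
        else
        {
            cv.wait(lock, [&]{ return gen != generation; });
        }
    }

private:
    std::mutex m;
    std::condition_variable cv;
    const unsigned threshold;
    unsigned remaining;
    unsigned generation;
};

int main()
{
    const unsigned nthreads = 4;
    simple_barrier bar(nthreads);
    std::mutex cm;
    unsigned arrived = 0;

    std::vector<std::thread> ts;
    for (unsigned i = 0; i < nthreads; ++i)
        ts.emplace_back([&]{
            { std::lock_guard<std::mutex> g(cm); ++arrived; }
            bar.wait();  // nobody passes until all nthreads arrive
            std::lock_guard<std::mutex> g(cm);
            assert(arrived == nthreads);
        });
    for (unsigned i = 0; i < nthreads; ++i) ts[i].join();
    assert(arrived == nthreads);
}
The per-wait cost is a mutex lock plus, for all but the last thread, a sleep on the condition variable and a wakeup - which is why calling wait() twice per iteration of a tight loop is expensive.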
2. Yes, each thread operates on a unique portion of the array. I'll look into cache-coherence (never heard of it).
3. clock() is called before and after the threads run. Still, I'll try another function as you suggested.
Stupid me. I can't validate this on a single-core machine. For me, two threads actually take about twice as long as one.
Here's the modified code, try it out.
Code:
//Applies the method of relaxation to an array of floating point values
//in order to approximate the solution to the wave equation, via a diffusion
//equation, with boundary condition of zero at the edges.
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/thread/barrier.hpp>
#include <cmath>   //sin()
#include <cassert>
#include <cstdlib> //rand(), div()
#include <sys/time.h> //gettimeofday()

#define DO_FOR(i_,beg_,end_) for(int (i_) = (beg_); (i_) < (end_); ++(i_))

const double pi = M_PI;

template<class T1, class T2>
std::ostream & operator << (std::ostream & os, std::pair<T1,T2> & p)
{
    os << p.first << ',' << p.second;
    return os;
}

inline void randomize(double * arg, unsigned int n)
{
    for(unsigned int i = 0; i < n; ++i)
        arg[i] = std::rand();
}

void print_it(double * v, unsigned int sz)
{
    if(sz < 150)
    {
        DO_FOR(i,1,sz-1)
            std::cout << ( v[i] / sin(pi/(sz-1)*i) ) << ' ';
    }
}

typedef std::pair<int,int> interval;

struct task
{
    void operator()() //This is run per thread
    {
        for(unsigned long int t = 0; t < ntimes; ++t)
        {
            for(int i = iv.first; i < iv.second; ++i)
                lap[i] = (v[i-1] + v[i+1] - 2.0*v[i]) / 2.0;
            bar.wait(); //Important! Don't comment out!
            for(int i = iv.first; i < iv.second; ++i)
                v[i] += dt*lap[i];
            bar.wait();
        }
    }
    task(double * v_arg, double * lap_arg, unsigned long int ntimes_arg,
         double dt_arg, boost::barrier & bar_arg, interval iv_arg)
        : v(v_arg), lap(lap_arg), ntimes(ntimes_arg), dt(dt_arg),
          bar(bar_arg), iv(iv_arg) {}
private:
    double * v;
    double * lap;
    unsigned long int ntimes;
    double dt;
    boost::barrier & bar;
    interval iv;
};

int main()
{
    using namespace std;
    double dt = 0.1;
    cout << "Size: ";
    unsigned long int sz;
    cin >> sz;
    cout << "Iterations: ";
    unsigned long int count;
    cin >> count;
    cout << "Number of threads: ";
    unsigned int nthreads;
    cin >> nthreads;

    //randomize v
    double * v = new double[sz];
    randomize(v,sz);
    v[0] = v[sz-1] = 0.0;

    //splits v into nthreads intervals
    interval * thread_intervals = new interval[nthreads];
    ldiv_t d = div(static_cast<long>(sz-2), static_cast<long>(nthreads));
    unsigned int offset = 0;
    DO_FOR(i, 0, nthreads)
    {
        unsigned int x = 1 + i*d.quot + offset;
        if(d.rem-- > 0) ++offset;
        thread_intervals[i] = interval(x, 1 + (i+1)*d.quot + offset);
        if(i == nthreads-1) assert(1 + (i+1)*d.quot + offset == sz - 1);
    }

    //Prints out intervals
    DO_FOR(i,0,nthreads) cout << thread_intervals[i] << endl;
    cout << endl;

    //uses multiple threads; stopwatch
    double * lap = new double[sz];
    boost::barrier bar(nthreads);
    boost::thread_group threads;
    //clock_t time_with_threads;
    //clock_t beg = clock();
    struct timeval beg;
    gettimeofday(&beg, 0);
    DO_FOR(i,0,nthreads)
        threads.create_thread( task(v,lap,count,dt,bar,thread_intervals[i]) );
    threads.join_all();
    //clock_t end = clock();
    struct timeval end;
    gettimeofday(&end, 0);
    long long llbeg = (beg.tv_sec * 1000000LL) + beg.tv_usec;
    long long llend = (end.tv_sec * 1000000LL) + end.tv_usec;
    long long time_with_threads;
    cout << "Below are the results for " << nthreads << " threads, which took "
         << (time_with_threads = (llend-llbeg)) << " usecs." << endl;
    print_it(v, sz);
    cout << '\n' << endl;

    //Randomize v again
    randomize(v,sz);
    v[0] = v[sz-1] = 0.0;

    //Now do with just one thread, for comparison; stopwatch
    //beg = clock();
    gettimeofday(&beg, 0);
    DO_FOR(i,0,count)
    {
        DO_FOR(k,1,sz-1) lap[k] = (v[k+1] + v[k-1] - 2.0*v[k]) / 2.0;
        DO_FOR(j,1,sz-1) v[j] += dt*lap[j];
    }
    //end = clock();
    gettimeofday(&end, 0);
    llbeg = (beg.tv_sec * 1000000LL) + beg.tv_usec;
    llend = (end.tv_sec * 1000000LL) + end.tv_usec;
    long long time_without_threads;
    cout << "Below are the results for one thread, which took "
         << (time_without_threads = (llend-llbeg)) << " usecs." << endl;
    print_it(v, sz);
    cout << endl;

    delete [] lap;
    delete [] v;
    delete [] thread_intervals;
    cout << "\nDone. Press enter to return." << endl;
    cin.ignore();
    cin.get();
}
All the buzzt!
CornedBee
I had modified my code exactly as you did. Thank you for taking the time. And yes, now I get a decidedly smaller number for two threads. Thanks.
Actually, the gain from two threads is rather more dramatic than on Vista. I think maybe program performance is like the stock market -- a lot of people know a lot about it, but nobody really has a clue. Or maybe just I don't have a clue. Thanks again for all of your help (everyone).
I _THINK_ I know a whole lot more about performance than I _THINK_ I know about the stock market. But you are right, the concept of both is pretty simple - in performance, we essentially want as few cycles as possible, and as many of the cycles as possible should "do something useful" - knowing how to get that out of the code can be trickier than the principle says, of course. The stock-market principle is also simple: Buy when prices are low, and sell when prices are high. Knowing when prices are low and when they are high is the tricky part (because you either buy before they bottom out, or on the up, and sell before the peak or after the peak - very hard to hit the peak EXACTLY).
--
Mats