Thread: Splitting the task

  1. #1
    Kiss the monkey. CodeMonkey's Avatar
    Join Date
    Sep 2001
    Posts
    937

    Splitting the task

    Hello there,
    For a while now I've been working on a program that, at its core, uses the method of relaxation to find the transverse electric field in a cross section of some metal cylindrical cavity with arbitrary boundary conditions (shape). That's quite a mouthful.

    The algorithm, as you may know, involves calculating the laplacian of a large 2D array and subtracting it, along with a scale factor, from each point. This must be done in two sweeps, since the new value of each point depends on its four adjacent points. A running calculation gives the wrong result. And, of course, you do this millions of times.

    My code is terribly slow, but I don't see how I could expect much better than O(n) per sweep for this algorithm. So I'm wondering how this algorithm can be split up, so as to take advantage of multi-core processors. Any ideas? Here's some code:
    Please don't mind the code dump; the details aren't important.
    Code:
    //The high-level look at the relaxation
    //g is of type 'grid_t' -- essentially a 2D array
    //reg is of type 'region', which specifies a subset of a grid
    std::vector<double> lap(grid_t::num_points(reg));  //laplacian field (as a 1D array)
    for(unsigned int t = 0; t < time_steps; ++t)
    {
    	functions::calc_lap cl(lap);	//named objects: a temporary can't bind to func2d&,
    	functions::relax rx(lap,dt);	//and constructing each step resets the count index
    	g.iterate_region(reg,cl);	//feeds laplacian into lap
    	g.iterate_region(reg,rx);	//adds dt*lap[] to each point
    }
    Code:
    //The grid object type
    class grid_t
    {
    public:
    	grid_t(void) : w(0), h(0) {}
    	grid_t(int w_in, int h_in, double fill = 0.0);
    	~grid_t(void);
    
    	struct span {
    		span(unsigned int x_in,unsigned int y_in,int dy_in) : x(x_in), y(y_in), dy(dy_in) {}
    		unsigned int x, y;
    		int dy;
    		};
    	typedef std::vector<span> region;
    
    	double & operator()(int x, int y);
    	bool iterate_grid(func2d & f);
    	bool iterate_region(region & reg, func2d & f);
    	region find_regions(double boundary = 0.0);
    	bool write_file(const std::string & path);
    
    	unsigned int width() const { return w; }
    	unsigned int height() const { return h; }
    	unsigned long int num_points() const { return w*h; } //number of points in grid
    	static unsigned long int num_points(const region & r); //number of POINTS in r (not members, which are spans).
    private:
    	int w, h;
    	std::vector<std::vector<double> > g;
    };
    Code:
    //The virtual function object
    //Surprisingly, the virtual calls are optimized out (they don't depend on input)
    class func2d
    {
    public:
    	func2d() {}
    	virtual ~func2d() {}
    	virtual double operator()(int x, int y, grid_t& g) = 0;
    };
    
    namespace functions
    {
    
    class relax : public func2d
    {
    public:
    	relax(std::vector<double> & lap_ref, double dt_in) : func2d(), lap(lap_ref), dt(dt_in), count(0) {}
    	~relax() {}
    	double operator()(int x, int y, grid_t & g)
    	{
    		g(x,y) = g(x,y) + dt*lap[count++];
    		return g(x,y);
    	}
    private:
    	std::vector<double> & lap;
    	const double dt;
    	unsigned long int count;
    };
    
    class calc_lap : public func2d
    {
    public:
    	calc_lap(std::vector<double> & lap_ref) : func2d(), lap(lap_ref), count(0) {}
    	~calc_lap() {}
    	double operator()(int x, int y, grid_t & g)
    	{
    		lap[count++] = ( g(x-1,y) + g(x+1,y) + g(x,y-1) + g(x,y+1) - 4*g(x,y) );
    		return lap[count-1];
    	}
    private:
    	std::vector<double> & lap;
    	unsigned long int count;
    };
    
    //......... More in namespace........
    Code:
    bool grid_t::iterate_region(region & reg, func2d & f)
    {
    	for(unsigned int i = 0; i < reg.size(); ++i)	//over all spans
    	{
    		for(int j = 0; j < reg[i].dy; ++j)	//for each span
    		{
    			f(reg[i].x, reg[i].y + j, *this);
    		}
    	}
    	return true;
    }
    "If you tell the truth, you don't have to remember anything"
    -Mark Twain

  2. #2
    Registered User C_ntua's Avatar
    Join Date
    Jun 2008
    Posts
    1,853
    So you can split the loops. A classic approach. Well, without looking at your loop in great detail, I guess reg.size() is constant in the algorithm? Meaning it doesn't change (from calling f(), for example). So you could split the outer loop. Like

    Code:
    for(unsigned int i = start; i < finish; ++i)
    where each thread will calculate its part.

    I noticed that in func2d you have count++. So you need to change f() a bit: it has to take count as a parameter, so that thread 2, for example, will not start from count = 0 but from a bigger number.

    You get the general idea. The good thing is that you don't need any synchronization.

    edit: for calc_lap you will need synchronization. A simple barrier between the two sweeps will do. So in the first sweep you calculate the even points, splitting the outer loop. Every thread synchronizes on a barrier. Then the second sweep, and you are done.
    edit2: Post the results if you don't mind. I am interested in knowing the speedup you gain.
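    To make the recipe concrete, here is a minimal 1D sketch of the split (my own function names, using C++11's std::thread for brevity where this thread uses boost; the idea is identical). Each worker owns a disjoint slice of the interior points, and joining the workers between the two sweeps acts as the barrier:
    Code:
    ```cpp
    #include <thread>
    #include <vector>

    // Sketch of the split: each worker owns the slice [begin, end) of the
    // interior points, so no two threads ever write the same element.
    void parallel_relax(std::vector<double>& g, double dt, int num_threads)
    {
        const int n = static_cast<int>(g.size());
        std::vector<double> lap(n, 0.0);

        // Run body(begin, end) on every slice, then join (the barrier).
        auto for_each_slice = [&](auto&& body) {
            std::vector<std::thread> pool;
            for (int t = 0; t < num_threads; ++t) {
                int begin = 1 + t * (n - 2) / num_threads;        // skip boundary 0
                int end   = 1 + (t + 1) * (n - 2) / num_threads;  // skip boundary n-1
                pool.emplace_back([&body, begin, end] { body(begin, end); });
            }
            for (auto& worker : pool) worker.join();
        };

        // Sweep 1: read only OLD values of g, write disjoint slices of lap.
        for_each_slice([&](int begin, int end) {
            for (int i = begin; i < end; ++i)
                lap[i] = g[i - 1] + g[i + 1] - 2.0 * g[i];
        });
        // Sweep 2: apply the update; g is now safe to modify in place.
        for_each_slice([&](int begin, int end) {
            for (int i = begin; i < end; ++i)
                g[i] += dt * lap[i];
        });
    }
    ```
    Each call to for_each_slice joins all workers before returning, which is exactly the barrier described above; real code would reuse a thread pool rather than spawning fresh threads every sweep.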
    Last edited by C_ntua; 12-07-2008 at 05:17 PM.

  3. #3
    Kiss the monkey. CodeMonkey's Avatar
    Join Date
    Sep 2001
    Posts
    937
    Thank you, but it is not yet clear to me. You say to split the outer loop -- but every iteration of the loop depends on the result of everything prior. It would seem to me that I'd be forking the loops back and forth between (two) processors. How would I gain from this?

    Edit: In two dimensions, the Laplacian at any point is effectively a function of every other point, since the point's surroundings depend on their respective surroundings, and so on. How can it be split?
    Last edited by CodeMonkey; 12-07-2008 at 06:15 PM. Reason: Addition
    "If you tell the truth, you don't have to remember anything"
    -Mark Twain

  4. #4
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Quote Originally Posted by CodeMonkey View Post
    Thank you, but it is not yet clear to me. You say to split the outer loop -- but every iteration of the loop depends on the result of everything prior. It would seem to me that I'd be forking the loops back and forth between (two) processors. How would I gain from this?

    Edit: In two dimensions, the Laplacian at any point is effectively a function of every other point, since the point's surroundings depend on their respective surroundings, and so on. How can it be split?
    But it depends on the previous value that was there, does it not? You could run every single calculation in parallel, since none of the calculations requires the new value. It's one sweep to calculate the new values, and then one sweep to replace the old values with the new ones, right?
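    Put differently, a tiny serial 1D sketch of the two sweeps (the function name is mine, not from the program above): all new values are computed from old values into a scratch buffer, so the first loop can run in any order, including split across threads, before the swap publishes the results.
    Code:
    ```cpp
    #include <vector>

    // One relaxation step: sweep 1 computes all new values from OLD values
    // only; sweep 2 (the swap) replaces old values with new ones.
    void relax_step(std::vector<double>& g, double dt)
    {
        std::vector<double> next(g);                 // boundaries copied unchanged
        for (std::size_t i = 1; i + 1 < g.size(); ++i)
            next[i] = g[i] + dt * (g[i - 1] + g[i + 1] - 2.0 * g[i]);  // reads g only
        g.swap(next);
    }
    ```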

  5. #5
    Kiss the monkey. CodeMonkey's Avatar
    Join Date
    Sep 2001
    Posts
    937
    Right -- so there is nothing to gain by having one core calculate new values, and then the other apply them.

    You're suggesting that I split both sweeps in half, with a prescribed barrier to separate the two? This makes sense to me. So by specifying a sufficiently wide barrier, I can separate the influence of each sector from the other. But then I'd have to deal with how the barrier gets applied afterwards.

    Perhaps my understanding of the basic principles behind multithreading is deeply lacking.

    Is this the recipe you are giving me?
    Code:
    for()
    {
    Processor 1            parallel         Processor 2
    Calc. lap. for half                     Calc. lap. for other half
    Apply to half                           Apply to other half
    Do barrier                              wait for sync
    }
    Edit: I don't see how that could work either.

    Edit: Now I've got it. Off to boost::Thread!
    Last edited by CodeMonkey; 12-08-2008 at 03:01 PM.
    "If you tell the truth, you don't have to remember anything"
    -Mark Twain

  6. #6
    Kiss the monkey. CodeMonkey's Avatar
    Join Date
    Sep 2001
    Posts
    937
    Preliminary stuff indicates that in the limit of large data sets, things will go twice as fast with two threads. If you have two processors, C_ntua, see what this yields:
    Code:
    #include <boost/thread/thread.hpp>
    #include <iostream>
    #include <ctime>
    
    //int count = 0;
    //boost::mutex mutex;
    double line[100];
    
    class task
    {
    public:
    	task(const std::pair<int,int> & range_in, int times_in) : r(range_in), t(times_in) {}
    	void operator()()	//will range over [r.first,r.second)
    	{
    		for(int i = 0; i < t; ++i)
    		{
    			for(int x = r.first; x < r.second; ++x) 
    				line[x] += 0.0001*(line[x-1] + line[x+1] - 2.0*line[x]);
    		}
    	}
    	const std::pair<int,int> & range() const { return r; }
    	int times() const { return t; }
    private:
    	std::pair<int,int> r;
    	int t;
    };
    
    int main(int argc, char* argv[])
    {
    	enum { TIMES = 10000000 };
    	for(int i = 0; i < 100; ++i) line[i] = std::rand();
    	line[0] = 0;
    	line[50] = 0;
    	line[99] = 0;
    	
    	boost::thread_group threads;
    	std::clock_t beg = std::clock();
    	for (int j = 0; j < 2; ++j) threads.create_thread( task(std::make_pair(50*j+1,100/(2-j)-1),TIMES) );
    	threads.join_all();
    	std::clock_t end = std::clock();
    	std::cout << "That took " << (end-beg) << " ticks. Result:" << std::endl;
    	for(int k = 0; k < 100; ++k) std::cout << line[k] << ' ';
    
    	std::cout << std::endl;
    
    	for(int i = 0; i < 100; ++i) line[i] = std::rand();
    	line[0] = 0;
    	line[50] = 0;
    	line[99] = 0;
    	beg = std::clock();
    	for(int p = 0; p < TIMES; ++p)
    	{
    		for(int q = 1; q < 50; ++q) line[q] += 0.0001*(line[q-1] + line[q+1] - 2.0*line[q]);
    		for(int s = 51; s < 99; ++s) line[s] += 0.0001*(line[s-1] + line[s+1] - 2.0*line[s]);
    	}
    	end = std::clock();
    	std::cout << "That took " << (end-beg) << " ticks. Result:" << std::endl;
    	for(int k = 0; k < 100; ++k) std::cout << line[k] << ' ';
    
    	//std::cin.get();
    }
    Thanks for the help.

    EDIT: Of course, this is making the "running calculation" error I referred to first. But it still demonstrates that the overhead of boost::thread is insignificant for a big enough task.
    Last edited by CodeMonkey; 12-09-2008 at 12:11 PM.
    "If you tell the truth, you don't have to remember anything"
    -Mark Twain

  7. #7
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> Of course, this is making the "running calculation" error I referred to first. But it still demonstrates ...
    You mean "isn't" I assume. Since the calculation at [51] is using a value of 0 at [50] - as apposed to iterating a single [1, 99] range.

    Comments:
    Putting single statement for loops all on one line is not as readable as using 2 lines.

    Your range calculations for splitting up work don't really work for anything other than 2 threads. Something like this should split things correctly for a given number of threads and array size:
    Code:
    #include <iostream>
    using namespace std;
    
    int main()
    {
        cout << "Original calc:" << endl;
        for (int j = 0; j < 2; ++j)
            cout << 50*j+1 << ", " << 100/(2-j)-1 << endl;
        cout << endl;
    
    
        const int array_size = 100;
        const int num_t = 2;
        const int work_split = array_size / num_t;
    
        cout << "New calc: array_size=" << array_size 
             << ", num_threads=" << num_t << endl;
    
        int work_begin, 
            work_end = -1;
        for (int j = 0; j < num_t; ++j) 
        {
            work_begin = work_end + 2;
            
            // last thread always ends on array_size-1, in case num_t isn't evenly
            // divisible into array_size
            if (j == (num_t - 1))
                work_end = array_size - 1;
            else
                work_end = work_begin + work_split - 2;
            
            cout << work_begin << ", " << work_end << endl;
        }//for
        
        return 0;
    }//main
    Each thread would also need to do "line[r.first-1] = 0". Or do it in the thread creation loop.

    gg

  8. #8
    Kiss the monkey. CodeMonkey's Avatar
    Join Date
    Sep 2001
    Posts
    937
    I see. Thanks. I'm just excited that I've discovered this new toy known as multithreading!

    And it is making the mistake of adding the Laplacian to each member in sequence, rather than calculating the Laplacian everywhere and then applying it. I was referring to a mistake of concept, not of code.
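    For anyone following along, the difference is easy to demonstrate in one dimension (a sketch with my own function names, not code from the program above):
    Code:
    ```cpp
    #include <vector>

    // The "running calculation" mistake: updating in place lets later
    // points see already-updated neighbours.
    double running_update(std::vector<double> g, double dt)  // takes a copy
    {
        for (std::size_t i = 1; i + 1 < g.size(); ++i)
            g[i] += dt * (g[i - 1] + g[i + 1] - 2.0 * g[i]);  // g[i-1] is already NEW
        return g[2];
    }

    // The correct order: compute the Laplacian everywhere first, then apply it.
    double two_sweep_update(std::vector<double> g, double dt)  // takes a copy
    {
        std::vector<double> lap(g.size(), 0.0);
        for (std::size_t i = 1; i + 1 < g.size(); ++i)
            lap[i] = g[i - 1] + g[i + 1] - 2.0 * g[i];
        for (std::size_t i = 1; i + 1 < g.size(); ++i)
            g[i] += dt * lap[i];
        return g[2];
    }
    ```
    With g = {0, 1, 1, 1, 0} and dt = 0.5, the two-sweep version leaves the middle point at 1.0 (its Laplacian is zero), while the in-place version drags it down to 0.75 because it sees the already-relaxed left neighbour.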
    "If you tell the truth, you don't have to remember anything"
    -Mark Twain

  9. #9
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> Of course, this is ...
    Ah yes - I misread

    gg

  10. #10
    Kung Fu Kitty Angus's Avatar
    Join Date
    Oct 2008
    Location
    Montreal, Canada
    Posts
    115
    Do you know much about the kernel you are running? You seem to use the terms multi-process and multi-thread interchangeably. They are in fact different, and some kernels do not spread threads across cores, but will assign all the threads in a process to one core.

  11. #11
    Kiss the monkey. CodeMonkey's Avatar
    Join Date
    Sep 2001
    Posts
    937
    Yes, I had wondered about this as well. I didn't look much into it, but the geniuses at the Tech Board seem to think that any major modern system would run the threads on different cores, if nothing else was going on. I've seen this is true for my little tests on Ubuntu 8.04 and Vista 32-bit.

    But I see your point, Angus; I wouldn't want to make the mistake of assuming the behavior of the kernel.
    "If you tell the truth, you don't have to remember anything"
    -Mark Twain

  12. #12
    Kung Fu Kitty Angus's Avatar
    Join Date
    Oct 2008
    Location
    Montreal, Canada
    Posts
    115
    Ubuntu? Which Ubuntu? The free one? I had the idea that the stable free Linux kernel did one process-one processor, while certain Enterprise versions of the kernel (namely Suse) could put threads on different processors. And that such a feature in the free kernel was only experimental.

  13. #13
    Registered User
    Join Date
    Nov 2006
    Posts
    519
    Of course Linux and Windows native threads can be scheduled on different cores, as long as you don't explicitly restrict them to a subset of cores (which is not useful in most cases).
    If you set the thread priority higher than normal and the load is significant, then the probability is high that your threads are scheduled on different cores most of the time.

    Thread systems which only use a single core are called green threads and are often found in VMs.
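    One caveat on the portability point: you can at least query how many hardware threads the implementation reports (shown here with C++11's std::thread, which postdates this thread; boost::thread::hardware_concurrency() is the same idea). Where the kernel actually places your threads is still its decision unless you pin them with a platform-specific affinity API.
    Code:
    ```cpp
    #include <thread>

    // Number of concurrent hardware threads the implementation reports.
    // The standard allows a return of 0 when the value is not computable.
    unsigned hardware_threads()
    {
        return std::thread::hardware_concurrency();
    }
    ```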

  14. #14
    Registered User C_ntua's Avatar
    Join Date
    Jun 2008
    Posts
    1,853
    I have tested lots of multicore programs on Ubuntu, and the threads are definitely spread across cores.
