Hello there,
For a while now I've been working on a program that, at its core, uses the method of relaxation to find the transverse electric field in a cross section of some metal cylindrical cavity with arbitrary boundary conditions (shape). That's quite a mouthful.

The algorithm, as you may know, involves calculating the Laplacian of a large 2D array and applying it, scaled by a small factor, to each point. This must be done in two sweeps, since the new value of each point depends on its four adjacent points; a running calculation gives the wrong result. And, of course, you do this millions of times.

My code is terribly slow, but I don't see how I could expect much better than O(n) for this algorithm. So I'm wondering how this algorithm can be split, so as to take advantage of multi-core processors. Any ideas? Here's some code:
Please, don't mind the code dump. It's not important.
Code:
```//The high-level look at the relaxation
//g is of type 'grid_t' -- essentially a 2D array
//reg is of type 'region', which specifies a subset of a grid
std::vector<double> lap(grid_t::num_points(reg));  //Laplacian field (as a 1D array)
for(unsigned int t = 0; t < time_steps; ++t)
{
    functions::calc_lap cl(lap);    //named objects: a temporary can't bind to func2d&
    functions::relax rx(lap, dt);   //(and constructing here restarts count at 0 each step)
    g.iterate_region(reg, cl);      //feeds the Laplacian into lap
    g.iterate_region(reg, rx);      //adds dt*lap[] to each point
}```
Code:
```//The grid object type
class grid_t
{
public:
    grid_t(void) : w(0), h(0) {}
    grid_t(int w_in, int h_in, double fill = 0.0);
    ~grid_t(void);

    struct span {
        span(unsigned int x_in, unsigned int y_in, int dy_in) : x(x_in), y(y_in), dy(dy_in) {}
        unsigned int x, y;
        int dy;
    };
    typedef std::vector<span> region;

    double & operator()(int x, int y);
    bool iterate_grid(func2d & f);
    bool iterate_region(region & reg, func2d & f);
    region find_regions(double boundary = 0.0);
    bool write_file(const std::string & path);

    unsigned int width() const { return w; }
    unsigned int height() const { return h; }
    unsigned long int num_points() const { return w*h; } //number of points in grid
    static unsigned long int num_points(const region & r); //number of POINTS in r (not members, which are spans)
private:
    int w, h;
    std::vector<std::vector<double> > g;
};```
Code:
```//The virtual function object
//Surprisingly, the virtual calls are optimized out (they don't depend on input)
class func2d
{
public:
    func2d() {}
    virtual ~func2d() {}    //virtual, since the class is used polymorphically
    virtual double operator()(int x, int y, grid_t & g) = 0;
};

namespace functions
{

class relax : public func2d
{
public:
    relax(std::vector<double> & lap_ref, double dt_in) : func2d(), lap(lap_ref), dt(dt_in), count(0) {}
    ~relax() {}
    double operator()(int x, int y, grid_t & g)
    {
        g(x,y) = g(x,y) + dt*lap[count++];
        return g(x,y);
    }
private:
    std::vector<double> & lap;
    const double dt;
    unsigned long int count;
};

class calc_lap : public func2d
{
public:
    calc_lap(std::vector<double> & lap_ref) : func2d(), lap(lap_ref), count(0) {}
    ~calc_lap() {}
    double operator()(int x, int y, grid_t & g)
    {
        lap[count++] = g(x-1,y) + g(x+1,y) + g(x,y-1) + g(x,y+1) - 4*g(x,y);
        return lap[count-1];
    }
private:
    std::vector<double> & lap;
    unsigned long int count;
};

//......... More in namespace........```
Code:
```bool grid_t::iterate_region(region & reg, func2d & f)
{
    for(unsigned int i = 0; i < reg.size(); ++i)    //over all spans
    {
        for(int j = 0; j < reg[i].dy; ++j)          //walk down each span
        {
            f(reg[i].x, reg[i].y + j, *this);
        }
    }
    return true;
}```

2. So you can split the loops. A classic approach. Well, without looking at your loop in great detail, I guess reg.size() is constant in the algorithm? Meaning it doesn't change (from calling f(), for example). So you could split the outer loop, like this:

Code:
`for(unsigned int i = start; i < finish; ++i)`
where each thread will calculate its part.

I noticed that in func2d you have count++. So you need to change f() a bit: it has to take count as a parameter, so thread 2, for example, will not start from count = 0 but from a bigger number.

You get the general idea. The good thing is that you don't need any synchronization.

edit: for calc_lap you will need synchronization. A simple barrier between the two sweeps will do. So in the first sweep you calculate the even points, splitting the outer loop. Every thread synchronizes on a barrier. Then the second sweep, and you are done.
edit2: Post the results if you don't mind. I am interested in knowing the speedup you gain.

3. Thank you, but it is not yet clear to me. You say to split the outer loop -- but every iteration of the loop depends on the result of everything prior. It would seem to me that I'd be forking the loops back and forth between (two) processors. How would I gain from this?

Edit: In two dimensions, the laplacian at any point is effectively a function of every other point, since the point's surroundings depend on their respective surrounding, etc. How can it be split?

4. Originally Posted by CodeMonkey
Thank you, but it is not yet clear to me. You say to split the outer loop -- but every iteration of the loop depends on the result of everything prior. It would seem to me that I'd be forking the loops back and forth between (two) processors. How would I gain from this?

Edit: In two dimensions, the laplacian at any point is effectively a function of every other point, since the point's surroundings depend on their respective surrounding, etc. How can it be split?
But it depends on the previous value that was there, does it not? You could run every single calculation in parallel, since none of the calculations require the new values. It's one sweep to calculate the new values, and then one sweep to replace the old values with the new ones, right?

5. Right -- so there is nothing to gain by having one core calculate new values, and then the other apply them.

You're suggesting that I split both sweeps in half, with a prescribed barrier to separate the two? This makes sense to me. So by specifying a sufficiently wide barrier, I can separate the influence of each sector from the other. But then I'd have to deal with how the barrier gets applied afterwards.

Perhaps my understanding of the basic principles behind multithreading is deeply lacking.

Is this the recipe you are giving me?
Code:
```for()
{
    Processor 1              parallel          Processor 2
    Calc. lap. for half                        Calc. lap. for other half
    Apply to half                              Apply to other half
    Do barrier                                 Wait for sync
}```
Edit: I don't see how that could work either.

Edit: Now I've got it. Off to boost::Thread!

6. Preliminary stuff indicates that in the limit of large data sets, things will go twice as fast with two threads. If you have two processors, C_ntua, see what this yields:
Code:
```#include <boost/thread/thread.hpp>
#include <iostream>
#include <utility>
#include <cstdlib>
#include <ctime>

//int count = 0;
//boost::mutex mutex;
double line[100];

class task
{
public:
    task(const std::pair<int,int> & range_in, int times_in) : r(range_in), t(times_in) {}
    void operator()()   //will range over [r.first,r.second)
    {
        for(int i = 0; i < t; ++i)
        {
            for(int x = r.first; x < r.second; ++x)
                line[x] += 0.0001*(line[x-1] + line[x] - 2.0*line[x]);
        }
    }
    const std::pair<int,int> & range() { return r; }
    int times() const { return t; }
private:
    std::pair<int,int> r;
    int t;
};

int main(int argc, char* argv[])
{
    enum { TIMES = 10000000 };
    for(int i = 0; i < 100; ++i) line[i] = std::rand();
    line[0] = 0;
    line[50] = 0;
    line[99] = 0;

    std::clock_t beg = std::clock();
    boost::thread t1(task(std::make_pair(1, 50), TIMES));   //two threads, each
    boost::thread t2(task(std::make_pair(51, 99), TIMES));  //relaxing half the line
    t1.join();
    t2.join();
    std::clock_t end = std::clock();
    std::cout << "That took " << (end-beg) << " ticks. Result:" << std::endl;
    for(int k = 0; k < 100; ++k) std::cout << line[k] << ' ';

    std::cout << std::endl;

    for(int i = 0; i < 100; ++i) line[i] = std::rand();
    line[0] = 0;
    line[50] = 0;
    line[99] = 0;
    beg = std::clock();
    for(int p = 0; p < TIMES; ++p)
    {
        for(int q = 1; q < 50; ++q) line[q] += 0.0001*(line[q-1] + line[q] - 2.0*line[q]);
        for(int s = 51; s < 99; ++s) line[s] += 0.0001*(line[s-1] + line[s] - 2.0*line[s]);
    }
    end = std::clock();
    std::cout << "That took " << (end-beg) << " ticks. Result:" << std::endl;
    for(int k = 0; k < 100; ++k) std::cout << line[k] << ' ';

    //std::cin.get();
}```
Thanks for the help.

EDIT: Of course, this is making the "running calculation" error I referred to first. But it still demonstrates that the overhead of boost::thread is insignificant for a big enough task.

7. >> Of course, this is making the "running calculation" error I referred to first. But it still demonstrates ...
You mean "isn't" I assume. Since the calculation at [51] is using a value of 0 at [50] - as opposed to iterating a single [1, 99] range.

Putting single statement for loops all on one line is not as readable as using 2 lines.

Your range calculations for splitting up work don't really work for anything other than 2 threads. Something like this should split things correctly for a given number of threads and array size:
Code:
```#include <iostream>
using namespace std;

int main()
{
    cout << "Original calc:" << endl;
    for (int j = 0; j < 2; ++j)
        cout << 50*j+1 << ", " << 100/(2-j)-1 << endl;
    cout << endl;

    const int array_size = 100;
    const int num_t = 2;
    const int work_split = array_size / num_t;

    cout << "New calc: array_size=" << array_size
         << ", num_threads=" << num_t << endl;

    int work_begin,
        work_end = -1;
    for (int j = 0; j < num_t; ++j)
    {
        work_begin = work_end + 2;

        // last thread always ends on array_size-1, in case num_t isn't evenly
        // divisible into array_size
        if (j == (num_t - 1))
            work_end = array_size - 1;
        else
            work_end = work_begin + work_split - 2;

        cout << work_begin << ", " << work_end << endl;
    }//for

    return 0;
}//main```
Each thread would also need to do "line[r.first-1] = 0". Or do it in the thread creation loop.
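Wrapped as a function (my own hypothetical helper, `split_range` -- the name and signature are not from Codeplug's post), the same splitting logic can be queried per thread and unit-tested:

```cpp
#include <utility>

//Range of interior indices [begin, end] owned by thread j out of num_t,
//mirroring the loop above: a one-element gap is left between neighbouring
//ranges (the zeroed "barrier" points), and the last thread absorbs any
//remainder when num_t doesn't divide array_size evenly.
std::pair<int,int> split_range(int array_size, int num_t, int j)
{
    int work_split = array_size / num_t;
    int work_begin = 0, work_end = -1;
    for (int k = 0; k <= j; ++k)
    {
        work_begin = work_end + 2;
        work_end = (k == num_t - 1) ? array_size - 1
                                    : work_begin + work_split - 2;
    }
    return std::make_pair(work_begin, work_end);
}
```

For array_size=100 and two threads this reproduces the [1, 49] and [51, 99] ranges printed by the program above.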

gg

8. I see. Thanks. I'm just excited that I've discovered this new toy known as multithreading!

And it is making the mistake of adding the Laplacian to each member in sequence, rather than calculating the Laplacian everywhere and then applying it. I was referring to a mistake of concept, not of code.

9. >> Of course, this is ...

gg

10. Do you know much about the kernel you are running? You seem to use the terms multi-process and multi-thread interchangeably. They are in fact different, and some kernels do not spread threads across cores, but will assign all the threads in a process to one core.

11. Yes, I had wondered about this as well. I didn't look much into it, but the geniuses at the Tech Board seem to think that any major modern system would run the threads on different cores, if nothing else was going on. I've seen this to be true in my little tests on Ubuntu 8.04 and Vista 32-bit.

But I see your point, Angus; I wouldn't want to make the mistake of assuming the behavior of the kernel.

12. Ubuntu? Which Ubuntu? The free one? I had the idea that the stable free Linux kernel did one process-one processor, while certain Enterprise versions of the kernel (namely Suse) could put threads on different processors. And that such a feature in the free kernel was only experimental.

13. Of course Linux and Windows native threads can be scheduled on different cores, provided you don't explicitly restrict them to a subset of cores (which is rarely useful). If you set the thread priority higher than normal and the load is significant, then the probability is high that your threads will be scheduled on different cores most of the time.

Thread systems which only use a single core are called green threads and are often found in VMs.

14. I have tested lots of multicore programs on Ubuntu, and the threads are definitely split across cores.