Conway's game of life in parallel

**std10093** · 06-27-2013

Hello to all,

I am writing the code for the serial implementation for Conway's Game of Life.

After done with that, I must parallelize it with

MPI
OpenMP
Cuda

So my questions are:
1) Do I have to use a next stage array, or can I do it with just one?
2) I can select from having the array as periodical or not. What would be easier to code in a parallel environment (It's my first time for parallel programming).
By periodical, I mean, that position [0][0], will have as left neighbor, position [0][4]. If it is not periodical, then the bounds of the array are strict.

**Malcolm McLean** · 06-28-2013

Originally Posted by std10093

Hello to all,

I am writing the code for the serial implementation for Conway's Game of Life.

After done with that, I must parallelize it with

MPI
OpenMP
Cuda

So my questions are:
1) Do I have to use a next stage array, or can I do it with just one?
2) I can select from having the array as periodical or not. What would be easier to code in a parallel environment (It's my first time for parallel programming).
By periodical, I mean, that position [0][0], will have as left neighbor, position [0][4]. If it is not periodical, then the bounds of the array are strict.

I'd make it periodic, so that all the processors are essentially the same.
You need to store the 8 neighbouring processors, and on each step you need the adjacent line from four of them and (irritatingly) the neighbouring corner pixels from the other four. But you don't need a supervisor.
Watch out for chaining conditions, whereby if you have a line of processors going in the x direction, the leftmost one asks for a line, the the next leftmost waits for the request, fulfils it, then asks the next for a line, and so on, reducing your speed to a serial algorithm.

One way to solve it is to have a chessboard, where processors are black or white. White goes first and receives, and black sends. Then white sends and black receives. You then need the same system for the diagonals, but that's easy - odd lines ask, even lines receive, and so on.

**std10093** · 06-28-2013

Thanks for the answer.

**megafiddle** · 06-28-2013

I am also working on a version (win32) of Conway's game of life.

Just curious, what is a periodic array?
Is that a cell grid that wraps around at the left and right borders and the top and bottom borders?
Or something else?

**std10093** · 06-29-2013

From the way I understood it, if you get the position [0][0] of an array, then it has three neighbors , right, down and rightDown.
In a periodic array, it would have 8 neighbors, the three I mentioned above and as left neighbor, the position [0][M-1], leftDown the position [1][M-1], leftUp, the position [N-1][M-1], Up the position [N-1][0] and finally, upRight the position [N-1][1].

As I said I think that is the point. Every cell must have 8 neighbors (even the ones at the borders). I decided not to make the array periodic, but if you do, let me know

**megafiddle** · 06-29-2013

I believe my array (or grid) is periodic, then.

My leftmost column includes the rightmost column as it's neighbors. And the same for the rightmost column,
the top and bottom borders, and the four corners (which are a little more complicated).

This effectively makes the grid into a toroid. A glider exiting the right side, for example, re-enters on the left.

I evaluate the cells sequentially, not in parallel. But for maximum speed, I handle each corner, each border,
and the interior cells separately. so there is no testing for where I am in the grid.

**megafiddle** · 06-30-2013

I should add that I am interpreting a periodic array as being cyclical or repeating. It is one way of dealing with the borders.
The alternative is a bounded array with the neighbors just outside the array missing. These cells are usually implied as being
always dead, ie, they are simply not counted. There are basically two ways to do this. You can simply use a different process
for each of the four borders and corners, or you can evaluate only the inner cells, leaving a one cell border around the grid
that always contain dead cells. Then you can use the same process for each cell. For a 100 by 100 array, [0][0] to [99][99],
only [1][1] to [98][98] would be evaluated.
But you may already know this.

Also, I think that is what was meant here:

Originally Posted by Malcolm McLean

I'd make it periodic, so that all the processors are essentially the same.

My approach is different than any I have seen so far. I first used it many years ago in a DOS program. One of the common and
most straightforward methods is to examine each cell to determine it's state and count up the number of neighbors. This information
is then used to determine the next state of that cell. An initial cell population can be arbitratrily dense. But after a few generations,
the population becomes relatively sparse. Because so many cells are dead, there are various methods for recording regions of dead
cells and skipping over them.

What I have done instead, is to treat each cell as not having neighbors, but as being a neighbor. In other words, in a basic, typical
implimentation, each cell asks, "am I alive, and how many neighbors do I have?". In my implimentation, each cell asks, "am I alive,
and if so, which cells am I a neighbor to?". The advantage here is that if a cell is dead, the code does nothing; it moves on to the
next cell. In the basic, more common method, all eight neighbors for every cell must be examined and a conditional increment is
made to the neighbor count. The new state of the cell is then determined and stored in a second destination array.
I don't do anything unless a cell is alive. In that case, I increment the neighbor count in each of it's eight neighbors. Each cell is
represented by a char type, bit 4 stores the state of the cell, and bits 0-3 store the neighbor count. This 5 bit value is used as an
index into my "fate" table, which gives me the new state of the cell.
I then make a second pass through all the cells, to replace the current state/count value with the new state/(zero count) value.

The first time I ever saw saw Life was on a TRS80 model one (the Radio Shack 8085 computer). It was interesting as an example
of cellular automata, but the generation rate was very slow. When faster versions appeared on newer computers, the animation
effect gave Life an additional point of interest.
My first implimentation was on a Commodore64, written in assembly. I got a blazing 9 generations per second for a 25 x 40 cell
grid. I hadn't yet thought of my current algorithm, else it would have been pretty noteworthy back then.
Currently, with a Pentium 4, I am getting about 45 generations per second for a 1000 x 1000 grid (1,000,000 cells). That is with
a randomly generated seed population of about 8000 cells. The seed is 200 x 200 cells with a 0.2 density.
My goal is to keep it above 30 generations per second so I can limit it down to the video standard of 30 frames per second.

I would be interested in how fast your parallel implimentation runs.

I would also be interested if anyone knows whether anything similar to mine has ever been done.

**std10093** · 06-30-2013

Originally Posted by megafiddle

I would be interested in how fast your parallel implimentation runs.

I suggest you send me a pm and I will leave it pending, so that I will not forget.

**Nominal Animal** · 06-30-2013

Perhaps a bitmap would give better performance?

There are a few tricks involved. You use the widest word you can. The least significant and most significant bit are copied from the neighbouring words, so you actually get two fewer cells per word than the word has bits. If the words are horizontal, then the state is updated in word-wide vertical runs; these obviously are well suited for parallelization.

The actual work needed to compute a new state for the center word using three words, is based on splitting the bits into groups of four, so it is easy to sum the number of alive neighbours. If you then OR the state of the actual cell with the least significant bit in each group of four, the bit pattern will be 0011 if and only if the new state is alive; for every other value the new state is dead. Here's why:

Code:

      Alive       Bit pattern, cell
    neighbours     Dead     Alive
     0   0000      0000     0001
     1   0001      0001     0001
     2   0010      0010     0011 <
     3   0011      0011 <   0011 <
     4   0100      0100     0101
     5   0101      0101     0101
     6   0110      0110     0111
     7   0111      0111     0111
     8   1000      1000     1001
     (if you count more than 8 neighbours, you have a bug.)

As you can see, if the cell was dead, but has three neighbours alive, it will become alive. If the cell is alive, it will only stay alive if it has two or three neighbours.

You don't need any tests, since you can XOR the two highest bits (so 1111 is the only pattern where the cell will be/stay alive), and do some bit shifts and ANDs to get a 1 bit for alive, and 0 for dead.

Here is an example you can play with:

Code:

#include <stdint.h>

#define MASK_0001 ((uint64_t)0x1111111111111111ULL)
#define MASK_0010 ((uint64_t)0x2222222222222222ULL)
#define MASK_0011 ((uint64_t)0x3333333333333333ULL)
#define MASK_0100 ((uint64_t)0x4444444444444444ULL)
#define MASK_0101 ((uint64_t)0x5555555555555555ULL)
#define MASK_1000 ((uint64_t)0x8888888888888888ULL)
#define MASK_1100 ((uint64_t)0xCCCCCCCCCCCCCCCCULL)

/* Each 64-bit word contains 62 cells.
 * The highest bit, and the lowest bit, is the next lowest
 * and next highest bit from the neighbour.
 *
 * Summary of the logic:
 *
 * In each set of four bits, calculate the number of alive neighbours.
 * This is a value between 0 and 8, inclusive.
 * OR the least significant bit with the state of the cell itself.
 *
 * The value is now one of
 *      Cell dead    Cell alive
 *         0000        0001         0 no alive neighbours
 *         0001        0001         1 neighbour alive
 *         0010        0011 !       2 neighbours alive
 *         0011 !      0011 !       3 neighbours alive
 *         0100        0101         4 neighbours alive
 *         0101        0101         5 neighbours alive
 *         0110        0111         6 neighbours alive
 *         0111        0111         7 neighbours alive
 *         1000        1001         8 neighbours alive
 * Therefore, the next state for the cell
 * is alive, if the bit pattern is 0011,
 * and dead otherwise.
*/

uint64_t step(const uint64_t prev, const uint64_t curr, const uint64_t next)
{
    /* Sums of cells alive in prev and next */
    const uint64_t  two1 = ( prev       & MASK_0001) + ( next       & MASK_0001),
                    two2 = ((prev >> 1) & MASK_0001) + ((next >> 1) & MASK_0001),
                    two3 = ((prev >> 2) & MASK_0001) + ((next >> 2) & MASK_0001),
                    two4 = ((prev >> 3) & MASK_0001) + ((next >> 3) & MASK_0001);

    /* Sums of cells alive in prev, curr, and next */
    const uint64_t  three1 = two1 + ( curr       & MASK_0001),
                    three2 = two2 + ((curr >> 1) & MASK_0001),
                    three3 = two3 + ((curr >> 2) & MASK_0001),
                    three4 = two4 + ((curr >> 3) & MASK_0001);

    /* Four bit patterns */
    const uint64_t  q1 = ( (three1 + two2 + three3)
                         | ( curr       & MASK_0001) ) ^ MASK_1100,
                    q2 = ( (three2 + two3 + three4)
                         | ((curr >> 1) & MASK_0001) ) ^ MASK_1100,
                    q3 = ( (three3 + two4 + (three1 >> 4))
                         | ((curr >> 2) & MASK_0001) ) ^ MASK_1100,
                    q4 = ( (three4 + (two1 >> 4) + (three2 >> 4))
                         | ((curr >> 3) & MASK_0001) ) ^ MASK_1100;

    /* Two bit patterns */
    const uint64_t  p1 = (q1 >> 2) & q1 & MASK_0011,
                    p2 = (q2 >> 2) & q2 & MASK_0011,
                    p3 = (q3 << 2) & q3 & MASK_1100,
                    p4 = (q4 << 2) & q4 & MASK_1100;

    /* Result. */
    return ( (p1 & (p1 >> 1) & MASK_0001)
           | (p2 & (p2 << 1) & MASK_0010)
           | (p3 & (p3 >> 1) & MASK_0100)
           | (p4 & (p4 << 1) & MASK_1000)
           ) & (uint64_t)0x7FFFFFFFFFFFFFFEULL;
}

That last mask clears the highest and lowest bits, making it easy to copy the bits from adjacent words. two means prev+next, three means prev+curr+next; I couldn't think of better names just now that wouldn't be too long.

To update the boundary bits for left and right, assuming most significant bit is to the right (!), use

Code:

    right |= (left >> 62) & 1;
    left  |= (right & 2) << 62;

If the word itself is periodic, you can use e.g.

Code:

    uint64_t w = step(prev, curr, next);
    w |= ((w >> 62) & 1) | ((w & 2) << 62);

Remember, the SECOND highest or lowest bit is copied to the highest or lowest bit position.

If you want, you can obviously implement this as 32-bit (30 cells per 32-bit word) instead: just change the types to e.g. uint32_t, and drop the excess digits from the constants.

There is plenty of room for improvement, too. The order shown in the function is probably not the most efficient; I used this for the hopes it would be readable, and that it'd help the compiler to pick a good order for the operations. On my machine, using a map of 64 words (rows) with 62 cells in each 64-bit word, it reaches about 0.7 cycles per cell (2800 cycles per full map update).

Perhaps a squeezed chess board, say eight words per vertical run, might be a good optimization? If the words in a run OR together to a zero, then that run does not need to be updated, since all cells in it are dead.

**loserone+_+** · 07-01-2013

looks confusing

**std10093** · 07-01-2013

I got the pm.
Nomimal, that's clever. However, I have to code into MPI, openMP and cuda until sunday(included) and write a big report also. Moreover, one person I would like to have a team with, could not accept my offer (despite the fact that he wanted too), thus I need to do it all by myself. You know I appreciate your answer.
PS - Still waiting for the results on Architecture II

loserone, at first, it is very usual for something to look confusing.

**Nominal Animal** · 07-01-2013

Originally Posted by loserone+_+

looks confusing

Can't be helped much, I'm afraid. It's quite a complicated procedure done in very few lines of code.

The layout of the expressions is a bit weird, now that I look at it, and that may also make it harder to read the code. You can copy the code to your favourite IDE or editor, and see if editing the newlines and whitespace makes it easier to read.

Originally Posted by std10093

Nomimal, that's clever. However, I have to code into MPI, openMP and cuda until sunday(included) and write a big report also.

Ouch; I definitely recommend not going this route in that case.

Any implementation errors are *very* hard to track down, in this bit-packed method. I use a slow, robust, easily corrected cell-by-cell evaluating version for comparison, myself. So, this approach really doubles your work load, only for some performance gain.

The reason I know about this method is because it is also very well suited for SSE2, using 128-bit words. AVX could do 256-bit words, but I don't have a processor that supports AVX extensions.

When written for SSE2, consuming consecutive (vertical) word runs, it runs extremely fast. But boy, is the code difficult to debug! I use GCC SSE2 intrinsics; probably should port it to "mmintrin.h" intrinsics to make it more portable across compilers.

**megafiddle** · 07-23-2013

Interesting algorithm.
I imagine that is well suited to monochrome bitmap displays where the words translate directly into display?

I was thinking of adapting to my own evaluation algorithm, which doesn't count up live neighbors;
it increments a neighbor count in the neighboring cells if the cell being evaluted is alive. It does
require though that all the neighbors of a cell have been evaluated before the new state of the
cell can be determined. Also, intermediate (accumulated) neighbor counts have to be stored
until at least the next line of cells have been evaluated. Line N counts are not complete until
line N+1 has been evaluated.

**Nominal Animal** · 07-24-2013

Originally Posted by megafiddle

Interesting algorithm.
I imagine that is well suited to monochrome bitmap displays where the words translate directly into display?

Well, no. It has two major benefits: near-minimal RAM usage, and evaluating many cells at a time. (The RAM overhead is just two bits per word; my example code uses 64 bits per word, but the same logic works with 128-bit words using SSE2, and 256-bit words using AVX.)

For large worlds, large enough to not fit in the processor caches, the RAM bandwidth becomes the bottleneck. Storing the world in as little RAM as possible makes it as fast as is possible. So, it's all about speed and efficiency.

To display it, I use a function (one for each output array type) that decodes all cells in a word to the display array. If the display array only displays a part of the world, then I reserve one extra words worth of cells in it horizontally, so that the function can always decode full words. (That is, as opposed to having the function do partial decoding, I only display part of the display buffer.)

Originally Posted by megafiddle

I was thinking of adapting to my own evaluation algorithm, which doesn't count up live neighbors; it increments a neighbor count in the neighboring cells if the cell being evaluted is alive.

Unfortunately, that requires at least four bits per cell, which means that it will be slower for very large worlds (because the evaluation of the largest worlds is RAM bandwidth limited).

It sounds quite interesting for smaller worlds, though -- those that fit in (some level of) the CPU cache. (I bet it also allows you to do some nice visualizations very easily, too; like using different colors for the cause the cell died, stayed alive, or was born.)

**megafiddle** · 07-25-2013

Possibly I can still take advantage of evaluating 30 cells at a time (32 is my word size).

I would have to start by partially completing my first 'prev' word, by using it's prev word.
But I have to handle that separately anyway, as it's on the top boundary.

Then the primary function step() could use 'curr' to complete 'prev', and start on 'next'.
step() would return 'prev' each time.

I would also need to store the intermediate count values; 4 words each for 'prev_acc',
'curr_acc', and 'next_acc'. But the counting could be done somewhat like in yours, with
shifting and masks.

Plus I still have the speed advantage of completely skipping empty (all dead) words. I
would just need to shift prev_acc <-- curr_acc <-- next_acc <-- 0, for those.

I think.

Thread: Conway's game of life in parallel

Thread Tools

Search Thread

Display

Conway's game of life in parallel

Similar Threads

Conway's Game of Life using pointers!

Conway's Game of Life

Text Based Conway's game of Life in C

John Conway's Game Of Life

Conway's Life