Perhaps a bitmap would give better performance?

There are a few tricks involved. You use the widest word you can. The least significant and most significant bit are copied from the neighbouring words, so you actually get two fewer cells per word than the word has bits. If the words are horizontal, then the state is updated in word-wide vertical runs; these obviously are well suited for parallelization.

The actual work needed to compute a new state for the center word using three words, is based on splitting the bits into groups of four, so it is easy to sum the number of alive neighbours. If you then OR the state of the actual cell with the least significant bit in each group of four, the bit pattern will be 0011 if and only if the new state is alive; for every other value the new state is dead. Here's why:

Code:

Alive Bit pattern, cell
neighbours Dead Alive
0 0000 0000 0001
1 0001 0001 0001
2 0010 0010 0011 <
3 0011 0011 < 0011 <
4 0100 0100 0101
5 0101 0101 0101
6 0110 0110 0111
7 0111 0111 0111
8 1000 1000 1001
(if you count more than 8 neighbours, you have a bug.)

As you can see, if the cell was dead, but has three neighbours alive, it will become alive. If the cell is alive, it will only stay alive if it has two or three neighbours.

You don't need any tests, since you can XOR the two highest bits (so 1111 is the only pattern where the cell will be/stay alive), and do some bit shifts and ANDs to get a 1 bit for alive, and 0 for dead.

Here is an example you can play with:

Code:

#include <stdint.h>
#define MASK_0001 ((uint64_t)0x1111111111111111ULL)
#define MASK_0010 ((uint64_t)0x2222222222222222ULL)
#define MASK_0011 ((uint64_t)0x3333333333333333ULL)
#define MASK_0100 ((uint64_t)0x4444444444444444ULL)
#define MASK_0101 ((uint64_t)0x5555555555555555ULL)
#define MASK_1000 ((uint64_t)0x8888888888888888ULL)
#define MASK_1100 ((uint64_t)0xCCCCCCCCCCCCCCCCULL)
/* Each 64-bit word contains 62 cells.
* The highest bit, and the lowest bit, is the next lowest
* and next highest bit from the neighbour.
*
* Summary of the logic:
*
* In each set of four bits, calculate the number of alive neighbours.
* This is a value between 0 and 8, inclusive.
* OR the least significant bit with the state of the cell itself.
*
* The value is now one of
* Cell dead Cell alive
* 0000 0001 0 no alive neighbours
* 0001 0001 1 neighbour alive
* 0010 0011 ! 2 neighbours alive
* 0011 ! 0011 ! 3 neighbours alive
* 0100 0101 4 neighbours alive
* 0101 0101 5 neighbours alive
* 0110 0111 6 neighbours alive
* 0111 0111 7 neighbours alive
* 1000 1001 8 neighbours alive
* Therefore, the next state for the cell
* is alive, if the bit pattern is 0011,
* and dead otherwise.
*/
uint64_t step(const uint64_t prev, const uint64_t curr, const uint64_t next)
{
/* Sums of cells alive in prev and next */
const uint64_t two1 = ( prev & MASK_0001) + ( next & MASK_0001),
two2 = ((prev >> 1) & MASK_0001) + ((next >> 1) & MASK_0001),
two3 = ((prev >> 2) & MASK_0001) + ((next >> 2) & MASK_0001),
two4 = ((prev >> 3) & MASK_0001) + ((next >> 3) & MASK_0001);
/* Sums of cells alive in prev, curr, and next */
const uint64_t three1 = two1 + ( curr & MASK_0001),
three2 = two2 + ((curr >> 1) & MASK_0001),
three3 = two3 + ((curr >> 2) & MASK_0001),
three4 = two4 + ((curr >> 3) & MASK_0001);
/* Four bit patterns */
const uint64_t q1 = ( (three1 + two2 + three3)
| ( curr & MASK_0001) ) ^ MASK_1100,
q2 = ( (three2 + two3 + three4)
| ((curr >> 1) & MASK_0001) ) ^ MASK_1100,
q3 = ( (three3 + two4 + (three1 >> 4))
| ((curr >> 2) & MASK_0001) ) ^ MASK_1100,
q4 = ( (three4 + (two1 >> 4) + (three2 >> 4))
| ((curr >> 3) & MASK_0001) ) ^ MASK_1100;
/* Two bit patterns */
const uint64_t p1 = (q1 >> 2) & q1 & MASK_0011,
p2 = (q2 >> 2) & q2 & MASK_0011,
p3 = (q3 << 2) & q3 & MASK_1100,
p4 = (q4 << 2) & q4 & MASK_1100;
/* Result. */
return ( (p1 & (p1 >> 1) & MASK_0001)
| (p2 & (p2 << 1) & MASK_0010)
| (p3 & (p3 >> 1) & MASK_0100)
| (p4 & (p4 << 1) & MASK_1000)
) & (uint64_t)0x7FFFFFFFFFFFFFFEULL;
}

That last mask clears the highest and lowest bits, making it easy to copy the bits from adjacent words. *two* means prev+next, *three* means prev+curr+next; I couldn't think of better names just now that wouldn't be too long.

To update the boundary bits for *left* and *right*, assuming most significant bit is to the right (!), use

Code:

right |= (left >> 62) & 1;
left |= (right & 2) << 62;

If the word itself is periodic, you can use e.g.

Code:

uint64_t w = step(prev, curr, next);
w |= ((w >> 62) & 1) | ((w & 2) << 62);

Remember, the SECOND highest or lowest bit is copied to the highest or lowest bit position.

If you want, you can obviously implement this as 32-bit (30 cells per 32-bit word) instead: just change the types to e.g. *uint32_t*, and drop the excess digits from the constants.

There is plenty of room for improvement, too. The order shown in the function is probably not the most efficient; I used this for the hopes it would be readable, and that it'd help the compiler to pick a good order for the operations. On my machine, using a map of 64 words (rows) with 62 cells in each 64-bit word, it reaches about 0.7 cycles per cell (2800 cycles per full map update).

Perhaps a squeezed chess board, say eight words per vertical run, might be a good optimization? If the words in a run OR together to a zero, then that run does not need to be updated, since all cells in it are dead.