I wrote this small piece of code yesterday for the fun of it, and afterwards set myself a challenge: to do everything in my power to bring down the initialization time, which was initially 40 seconds.

That I have done, and it's now around 18 seconds, but I think in theory it can be made even faster using threads, though I'm somewhat doubtful about how that works out in practice.

Anyway, first the code:

Code:
#include <stdint.h>

// For convenience, the width of the desired type.
#define WORD_BITS ( 8 * sizeof( uint_fast32_t ) )
#define ARRAY_SIZE ( 0x10000000 / sizeof( uint_fast32_t ) )
#define ARRAY_SIZE_BITS ( 0x80000000 )

// allocate 2^31 bits of memory, one for each odd number starting
// at 3 (bit n stands for the value 2n+3)
static uint_fast32_t b[ ARRAY_SIZE ];

// Should be self-explanatory: build a mask, flip it, and AND-assign.
static inline void resetBit( uint_fast32_t idx )
{
    b[ idx / WORD_BITS ] &= ~( ( uint_fast32_t ) 1 << ( idx % WORD_BITS ) );
}

// This is a bit more complicated, as I don't really need to return 0 or 1,
// but will be perfectly happy with 0 and !0. Thus there are two ways:
// 1&(x>>n) or x&(1<<n) and I would very much like to know which is more
// efficient.
// edit: after testing seems there's no difference.
static inline uint_fast32_t getBit( uint_fast32_t idx )
{
    return ( ( uint_fast32_t ) 1 & ( b[ idx / WORD_BITS ] >> ( idx % WORD_BITS ) ) );
}

void initialize()
{
    // set all bits to 1.
    for( uint_fast32_t n = 0; n < ARRAY_SIZE; n++ )
        b[ n ] = UINT_FAST32_MAX;
    uint_fast32_t p;
    // loop over indices n whose values p = 2n+3 run up to ~2^16, i.e. sqrt(2^32).
    for( uint_fast32_t n = 0; n < 0x8000; n++ )
    {
        // If prime...
        if( getBit( n ) )
        {
            // calculate the actual value.
            // Also, would the compiler automatically optimize, if p
            // were to be eliminated in the for loop?
            p = 2 * n + 3;
             // loop through the bitset and reset required bits.
            for( uint_fast32_t i = p+n; i < ARRAY_SIZE_BITS; i += p )
                resetBit( i );
        }
    }
}
So we basically have one rather large bit array, which is iterated over many times, and on each pass some bits get cleared. The larger the prime 'p' (or 'n'), the quicker a pass completes.
I suspect there isn't a huge amount to be gained, but depending on how the OS handles the workload, having two or more threads working in parallel should still make a difference.
The only real problem I see is threads working on the same data at the same time, though I can see ways around that. You also have to make sure the threads don't end up striding by non-primes, but that shouldn't be hard.

However, as I've never worked with threads before, and as it is a complicated subject, I thought it best to ask before I start pulling my hair out over an impossible task I've set myself.

There is also a bonus question or two, relating to LLP64, i.e. the 64-bit Windows data model and its types.
On my system I can count on a long being 64 bits, and in fact uint_fast32_t is a synonym for unsigned long, or unsigned long int if you wish:

From stdint.h
Code:
#if __WORDSIZE == 64
typedef unsigned long int    uint_fast32_t;
#else
typedef unsigned int         uint_fast32_t;
#endif
but as I understand it, things work a bit differently on Windows, though I've been unable to find anything definitive. So basically, if someone could run sizeof(int_fast32_t) and sizeof(long) on a Win64 system for me, I should have my answer.
I don't mind using stdint.h, but it seems rather silly if long does the same job.

Also, the cast I have to do in getBit() and resetBit() is not exactly pretty. I could simply use 1LL, or 1ULL I suppose, but how would that work out on non-64-bit architectures? Would it simply be ignored in cases where uint_fast32_t is in fact 32-bit?

Other than that, ideas are welcome and I bid you farewell for now.