1. ## Prime Sieve Optimization

I'm working on some old acm programming problems and one of them requires you to find all the primes upto 2^31 - 1. I've coded a pretty literal interpretation of the sieve of atkin and it takes about 41 seconds on my computer, however the time limit the acm site gives when you submit your code is 10 seconds so I must be doing something wrong. I was wondering if some of you could look at it and offer some tips on how to speed it up.

Code:
```#include <iostream>
#include <vector>
#include <cmath>
#include <limits>
#include <sys/time.h>
using namespace std;

void primes(int limit)
{
vector<bool> bits(limit + 1, false);
int sqrt_limit = sqrt(static_cast<double>(limit));

for(int x = 1; x <= sqrt_limit; ++x)
{
for(int y = 1; y <= sqrt_limit; ++y)
{
int x_squared = x * x, y_squared = y * y;
int r = 4 * x_squared + y_squared;

if(r <= limit && r % 12 == 1 && r % 12 == 5)
bits[r].flip();

int x_squared_times_3 = x_squared * 3;
int s = x_squared_times_3 + y_squared;

if(s <= limit && (s % 12) == 7)
bits[s].flip();

int t = x_squared_times_3 - y_squared;
if(x > y && t <= limit && (t % 12) == 11)
bits[t].flip();

}
}

for(int n = 5; n <= sqrt_limit; ++n)
{
if(bits[n])
{
int n_squared = n * n;
int m_max = limit / n_squared;
for(int m = 1; m <= m_max; ++m)
bits[n_squared * m] = false;
}
}

bits[2] = true;
bits[3] = true;
}

int main()
{
struct timeval tv1, tv2;
gettimeofday(&tv1, NULL);

primes(numeric_limits<int>::max());

gettimeofday(&tv2, NULL);
long seconds  = tv2.tv_sec  - tv1.tv_sec;
long useconds = tv2.tv_usec - tv1.tv_usec;

cout << "Elapsed time: " << seconds << " seconds + " <<  useconds << " microseconds" << endl;

return 0;
}```
Thanks

2. Your code is a near replica of the wikipedia pseudocode. Note the following from wikipedia:
This pseudocode is written for clarity. Repeated and wasteful calculations mean that it would run slower than the sieve of Eratosthenes. To improve its efficiency, faster methods must be used to find solutions to the three quadratics. At the least, separate loops could have tighter limits than [1, √limit].
Try writing your own version of the Seive of Eratosthenes, as it is simpler to implement. A well written implementation of it will run circles around the above "clarity targeted" code, while the above code requires some major changes to effectively implement the more complex Sieve of Atkin speed-wise.

3. x_squared and x_squared_times_3 do not change inside the inner y loop, so there's no sense in recalculating them again and again and again in the y loop. Do them once in the outer x loop.

Start with the above and see what difference that makes. I suspect it will make a some difference, but probably not enough.

I wouldn't use a vector for keeping track of 2^31 bits. Just get a hunk of heap and index into it.

But, as with any performance problem, running a profiler is always a better tactic than eyeballs or guessing.

Todd

4. Code:
`            if(r <= limit && r % 12 == 1 && r % 12 == 5)`
How can r % 12 ever equal both 1 and 5 simultaneously?

5. @CrazyNorman:
Thanks for the idea, I tried that and I got the time down to 34 seconds.
@Todd:
The assembly output stayed the same when I removed them from the inner loop so it looks like gcc was optimizing it away. I also tried using bool *bits = new bool[limit + 1] and it was actually slower then the vector. Can you tell me how to analyze a function in gprof? I've tried searching google but I must be using the wrong terms because I come up with results about analyzing time spent in different functions.
@iMalc:
Your right of course, I misread the or as an and. Unfortunately, this more then doubled the time it takes.

Thanks for all the suggestion so far.

6. When you reserve an array that large, if every "bool" takes up one byte, then the array of bools will take up 4 GB. Most machines don't have this much RAM, so you end up with swapping, and a general slow down. std::vector<bool> on the other hand implements an optimization, where eight bools are crammed into one byte (because you actually only need a bit for each one), thereby making the entire array only require 512 mb, which most computers will fit in RAM, or atleast have an easier time attempting to do. Chances are, std::vector<bool> will therefore provide a significant performance boost.

7. I think std::vector<bool> has some problems (I heard), so consider std::bitset too.

8. >> I think std::vector<bool> has some problems (I heard)
The problems are that it doesn't work like other vectors. If you don't try to use it like other vectors, then you'll be ok.

9. Originally Posted by rt454
@iMalc:
Your right of course, I misread the or as an and. Unfortunately, this more then doubled the time it takes.
Just make sure when you fix it that you also put in parenthesis around the OR statement as AND has higher precedence.
You could use the cool optimisation trick, that checks for being equal to 1 or 5 in one check:
Code:
`if(r <= limit && 1<<(r % 12) & ((1<<1)|(1<<5)))`
I doubt it'll shave off much time, but it's the first time I've found an appropriate place for that trick, so I'd do it for the coolness factor.

10. Originally Posted by rt454
@Todd:
...Can you tell me how to analyze a function in gprof? I've tried searching google but I must be using the wrong terms because I come up with results about analyzing time spent in different functions.
No, I can't. Not yet anyway. Still on my list to learn too.