@KCfromKC, @iMalc: On the exact same machine, just looking for the which of the first thousand million (1000000000) nonnegative integers are a sum of two nonnegative integers squared takes about 0.6 seconds. And that includes clearing the 125-million byte bitmap, and setting each bit for such sums:

Code:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
int main(void)
{
unsigned char *map;
size_t size;
unsigned int limit, sum1, add1, sum2, add2;
limit = 1000000000U;
size = (size_t)(limit / CHAR_BIT) + 1; /* At least one extra bit, aligns to a byte boundary */
map = malloc(size);
if (!map)
return 1;
memset(map, 0, size);
sum1 = 1U;
add1 = 3U;
while (sum1 <= limit) {
sum2 = sum1;
add2 = 1U;
while (sum2 <= limit) {
map[sum2 / CHAR_BIT] |= 1U << (sum2 % CHAR_BIT);
sum2 += add2;
add2 += 2U;
}
sum1 += add1;
add1 += 2U;
}
for (sum1 = 0; sum1 <= limit; sum1++)
if (map[sum1 / CHAR_BIT] & (1U << (sum1 % CHAR_BIT)))
printf("%u\n", sum1);
return 0;
}

Simply printing out the 173 million numbers that are the sum of two squared nonnegative integers -- just uncomment the last loop -- takes over 54 seconds.

The above uses the fact that the sequence of squared integers increases by the next odd integer, because (x+1)^2 - x^2 = 2x + 1. In other words, 1 = 1, 4 = 1 + 3, 9 = 1 + 3 + 5, 16 = 1 + 3 + 5 + 7, 25 = 1 + 3 + 5 + 7 + 9, and so on. sum1 is this sequence. sum2 is the same sequence, except starting at zero, and offset by sum1. I believe this was what iMalc vaguely alluded to.

My original code saves the nonzero term of the sum, and outputs not only the sum, but the two nonzero terms, too. The additional forty seconds my code takes is due to the extra output, and the integer square roots needed to find out the other term, as only one of the terms is saved. The total time is about ninety seconds.

My background is in physics, not in mathematics, so I might have used the term "sieve" wrong here.

In this case, I have an ordered set of numbers. The position indicates the value of the sum, and the number itself is one of the terms (the nonzero one if one is zero). I call that a sieve, for lack of better term -- and prime number sieves are populated in very similar fashion in practice. I would also consider a bit map with only the sums that have been found to match a sum of two nonnegative squares a sieve. In this case, all the entries in the sieve are known to be valid, so perhaps "sieve" is not the correct term. Let me know if you know so; I'd be happy to use the proper terms.

My code uses a fixed size window, and can apply the same calculation to any consecutive set of nonnegative integers that can be represented by an unsigned 64-bit integer. Because doubles have only up to 52 bits of precision, I use a hand-written version of the integer square root algorithm (an unoptimized one at that). Here is my original code. When run on the command line, it takes one or two parameters: either the lower and upper bounds, inclusive, of the desired region, or just the upper bound (in which case the lower bound is 1). The interesting part is in the u64sieve() function, which I designed to be modular to be used later of if a need arises. You never know.

Code:

#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>
/* 64-bit unsigned integer square root.
*
* Since IEEE-754 doubles have only 52 bits of mantissa,
* sqrt() provided by math.h may produce incorrect results due to rounding.
* For example:
* u64sqrt(15826587771568349189) = 3978264416
* 3978264416^2 = 15826587763611821056
* but
* sqrt(15826587771568349189.0) = 3978264417.0
* and
* 3978264417^2 = 15826587771568349889 > 15826587771568349189
* In other words, math.h sqrt() can return an integer value,
* that when squared, is greater than the original value.
*
*/
static inline uint64_t u64sqrt(uint64_t value)
{
uint64_t result = 0UL;
uint64_t bit = 4UL;
while (bit <= value >> 2U) {
if (!(bit << 2U))
break;
bit <<= 2U;
}
while (bit)
if (value >= result + bit) {
value -= result + bit;
result = (result >> 1U) + bit;
bit >>= 2U;
} else {
result >>= 1U;
bit >>= 2U;
}
return result;
}
/* Sieve for sums of two squared integers.
* For sum: start <= sum < start + length,
* if factor[sum - start] > 0,
* then i1 = factor[sum - start], i1 > 0,
* i2 = u64sqrt(sum - i1*i1), i2 >= 0,
* and sum = i1*i1 + i2*i2.
*/
static inline void u64sieve(uint64_t *const factor,
uint64_t const start,
size_t const length)
{
const uint64_t finish = start + (uint64_t)length;
uint64_t factor1 = (uint64_t)1;
/* Clear all factors to zero first. */
memset(factor, 0, length * sizeof (uint64_t));
while (1) {
/* Find out the smallest sensible factor to test. */
const uint64_t factor2 = (factor1*factor1 < start) ? u64sqrt(start - factor1*factor1) : (uint64_t)0;
/* Instead of computing the sum of the squares for each iteration,
* we utilize the fact that (x)*(x) + 2*(x) + 1 = (x+1)*(x+1). */
uint64_t sum = factor1 * factor1 + factor2 * factor2;
uint64_t add = (uint64_t)2 * factor2 + (uint64_t)1;
/* Sieve completed? */
if (sum >= finish)
return;
/* Skip to the start of the sieve; same as if we increased factor2 and recalculated sum. */
while (sum < start) {
sum += add;
add += (uint64_t)2;
}
while (sum < finish) {
/* New factor? */
if (!factor[sum - start])
factor[sum - start] = factor1; /* Note: factor1 is always nonzero. */
sum += add;
add += (uint64_t)2;
}
/* Next factor. */
factor1++;
}
}
/* Maximum number of uint64_t's in a sieve.
* Let's use up to 32 MB for the sieve.
*/
#define SIEVE_SIZE 4194304
int main(int argc, char *argv[])
{
size_t factors, n, i;
uint64_t *factor;
uint64_t imin, imax;
unsigned long value;
char dummy;
if (argc < 2 || argc > 3 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
fprintf(stderr, "\n");
fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
fprintf(stderr, " %s [ from ] to\n", argv[0]);
fprintf(stderr, "\n");
fprintf(stderr, "This program will output lines of form\n");
fprintf(stderr, "\tsum = factor1^2 + factor^2\n");
fprintf(stderr, "for all integer sum, from <= sum <= to,\n");
fprintf(stderr, "with nonnegative integer factor1 and factor2.\n");
fprintf(stderr, "\n");
return 0;
}
if (argc > 2) {
if (sscanf(argv[argc - 2], "%lu %c", &value, &dummy) != 1) {
fprintf(stderr, "%s: Invalid lower limit.\n", argv[argc - 2]);
return 1;
}
imin = (uint64_t)value;
if (imin < (uint64_t)1 || (unsigned long)imin != value) {
fprintf(stderr, "%s: Invalid lower limit.\n", argv[argc - 2]);
return 1;
}
} else
imin = (uint64_t)1;
if (sscanf(argv[argc - 1], "%lu %c", &value, &dummy) != 1) {
fprintf(stderr, "%s: Invalid upper limit.\n", argv[argc - 1]);
return 1;
}
imax = (uint64_t)value;
if (imax < imin || (unsigned long)imax != value) {
fprintf(stderr, "%s: Invalid upper limit.\n", argv[argc - 1]);
return 1;
}
/* Maximum sieve size to allocate... */
factors = (size_t)SIEVE_SIZE;
/* ... except don't allocate more than we need. */
if ((uint64_t)factors > imax - imin + 1)
factors = (size_t)(imax - imin + 1);
factor = malloc(factors * sizeof (uint64_t));
if (!factor) {
fprintf(stderr, "Out of memory.\n");
return 1;
}
/* Sieve through the requested window. */
while (imin <= imax) {
/* Did imin wrap around? */
if (imin < (uint64_t)1)
break;
/* Do not exceed the given region. */
if (imax - imin >= (uint64_t)factors)
n = factors;
else
n = (size_t)(imax - imin + 1);
/* Apply the sieve. */
u64sieve(factor, imin, n);
/* Report the findings. */
for (i = 0; i < n; i++)
if (factor[i]) {
const uint64_t sum = imin + (uint64_t)i;
const uint64_t factor1 = factor[i];
const uint64_t factor2 = u64sqrt(sum - factor1 * factor1);
printf("%lu = %lu^2 + %lu^2\n", (unsigned long)sum, (unsigned long)factor1, (unsigned long)factor2);
}
/* Move the window. */
imin += n;
}
/* Done. Discard sieve. */
free(factor);
return 0;
}

The u64sieve() function above can be optimized further by using the integer square root -- which, unlike math.h sqrt(), returns an 64-bit unsigned integer that does not exceed the original argument if squared -- to eliminate large swathes of the inner loop. However, because it takes only a very small fraction of the total run time, it is not worth optimizing. Using a custom unsigned integer printing routine and aggressive caching should at least halve the total run time, I think.

I must say, I don't think I like the atmosphere in this discussion board.