## Find the sum

1. Hi, I came across this problem on another forum and have been trying to solve it for a very long time, without success. Any help would be highly appreciated.

Problem: find the sum of all numbers less than or equal to a billion that can be expressed as a sum of squares of two non-negative integers, e.g. 25 = 5^2 + 0^2, 10 = 3^2 + 1^2, etc.

The problem I'm facing is how to eliminate numbers that can be expressed as a sum of squares in two or more different ways, e.g. 50 = 5^2 + 5^2 and 50 = 7^2 + 1^2.

2. Basically, you could keep a counter of how many sums of squares can be used to produce each number. It's a little bit hard to determine the best way, though, without knowing your current implementation.

3. EDIT: I'm a bit confused. Do you only want to print numbers that have exactly one sum-of-squares breakdown, so that those with two or more are not in the list at all? Or do you simply want to avoid duplicate numbers in the list? Given your example, where 50 has two solutions, should it show up zero times or one time in your list?

4. Try this:
generate a vector of all the squares up to whatever limit you want;
generate all 2-out-of-N combinations of that vector (Generating Combinations: 1 « Computer programming);
for each combination, add the two values together and, as anduril said, keep a count for each and eliminate those that have a count greater than 1.

5. Edit to my last comment:

My number theory is rusty, but you might only need a vector of the squares of prime numbers.

6. I would try using arrays to keep track. Have an array that is long enough to hold the already "found" values until the numbers have increased to a point where the old entries can be replaced.

7. @anduril462: 50 must be included once in our list.
@dmh2000: I haven't learned the concept of vectors yet. I think there exists a more straightforward approach too.
@codegeek892: I have tried that, but the max array size allowed on my compiler (Dev-C++) is around a hundred thousand.

8. I would guess you wouldn't need more than a thousand elements or so. By "max array size", do you mean the maximum number of elements, or the maximum size of an element? If it's the former, then you can try this.

Create the array so that it's initially filled with all zeroes. Then fill it every time there is a number that fits your criteria. Once the array is full, when the program finds the next number, if it is larger than the largest element in the array, have it replace the smallest element in the array.

Is there any way you can post your code?

9. Originally Posted by jonvonneumann
@codegeek892 i have tried that, but the max array size allowed on my compiler (Dev-C++) is around a hundred thousand..
You don't need a count of how many times each sum is found; you just need to know whether the sum is found zero or more than zero times. Encode the found/not-found status as a bitmask. That should reduce the storage size from a billion bytes down to 125MB or so (going from one flag per char to eight). You lose some speed here but gain a bunch of memory.

ETA - even so, you should be able to malloc a billion bytes on any sort of modern system.

10. I'm with KCfromNC, a bitmask is probably your best bet here.

Also note, 1 is not a prime number. That means 50 really only has one breakdown, 5^2 + 5^2.

If you haven't yet, I would look into the Sieve of Eratosthenes for generating prime numbers to square and sum, it will speed up your code greatly.

11. Originally Posted by anduril462
Also note, 1 is not a prime number.
That is irrelevant. The task at hand:
Originally Posted by jonvonneumann
expressed as a sum of squares of two non negative integers
However, a sieve is the efficient answer. There are a couple of useful things to remember, though.

First, there is no need for the sieve to cover the entire region at once. You can use a limited-size sieve, say a few million entries, by offsetting the sieve: instead of 1..length, you cover start .. start+length-1, and cover the larger range in chunks.

Second, it is very useful to also save the terms, if for nothing else then for verification. If you use full-size integers instead of single bits as the sieve elements, you can simply save one (the nonzero) term. Because the index in the sieve array determines the sum and the value determines one term, you can always compute the second term using sqrt(sum - term*term).

Third, you can use sqrt() from math.h to bracket the terms in the sieve loops, but only up to about 2^50 or so. Above that, the limited precision can cause the result to be rounded up to the next greater integer. For example, sqrt(15826587771568349189.0) returns 3978264417.0, but 3978264417^2 > 15826587771568349189.

Fourth, you can greatly improve the speed of the innermost loop in the sieve by realizing that the sequence of positive integers squared is 1, 4=1+3, 9=1+3+5, 16=1+3+5+7, 25=1+3+5+7+9.

As a coding exercise, I wrote an example sieve that uses 4194304 elements per sieve window, using 64-bit unsigned integer math, including a custom integer square root (which is obviously much slower than the math.h sqrt()). On an AMD Athlon II X4 640 processor, using a single core, it finds all the 173229058 unique sums of two nonnegative integers squared between 1 and 1000000000 in 94 seconds. A 32-bit version should be much, much faster, but limited to the 32-bit range, which I don't think is that interesting. (Besides, I'm in Europe; a long billion is 1000000000000 = 10^12 ≈ 233×2^32; my code can cover that in a day or two.)

Note: the sums are unique. I only look for the first pair of terms; there may be any number of term pairs that when squared, sum up to the same value. I only count the unique sums, not term pairs.

In addition to the sum, my code also provides a valid term pair for each sum. The first sum is of course 1 = 1^2 + 0^2, the penultimate sum is 999999986 = 6319^2 + 30985^2, and the last one is 1000000000 = 1200^2 + 31600^2.

If this is NOT homework, and you're interested, I could post the C99 sources here. It could do with quite a bit of optimization and enhancement; I'd be interested to see in which direction others would take the code. It should be correct as it is, however: I verified a few smaller ranges using a nested-loop brute-force version.

One interesting option would be to simply parallelize the code. Each CPU core could easily cover a separate sieve window. Aside from output (which I guess should be sequential/sorted), it should parallelize perfectly. (And output can be done by a separate thread, of course.)

12. 90+ seconds seems like a long time. On a 2.5GHz Core 2 Duo CPU, the answer to the OP's question, single core only:
Code:
```C:\Source>\cygwin\bin\time ./sumsquare.exe
sum = 85447795676690937
found = 219496984, calls = 392726043, hit percent = 0.558906
23.50user 0.04system 0:26.13elapsed 90%CPU (0avgtext+0avgdata 8257536maxresident)k
0inputs+0outputs (32392major+0minor)pagefaults 0swaps```
And I've still got debugging code in there to see how often sums are duplicated (the ~56% number above). I'm sure different approaches are more efficient as the search space grows, but here a simple approach seems better.

I'm struggling to figure out what a sieve will accomplish here. It's not like integer multiplication is that hard for modern CPUs, so recalculating squares seems like a net win to me, considering the sieve size is going to be larger than the L2. Basically I'm just doing:
Code:
```for a = 0 to sqrt(N)
    for b = a while a^2 + b^2 <= N
        if a^2 + b^2 not seen already
            mark it as seen and add it to the running total

print running total```
Where does the sieve fit in?

13. Originally Posted by KCfromNC
I'm struggling to figure out what a sieve will accomplish here. It's not like integer multiplication is that hard for modern CPUs, so recalculating squares seems like a net win to me considering the sieve size is going to be larger than the L2.
Indeed.

In fact the multiplications can be removed by doing quadratic interpolation to evaluate the series used for a^2 and b^2 anyway.

14. @KCfromNC, @iMalc: On the exact same machine, just finding which of the first thousand million (1000000000) nonnegative integers are a sum of two nonnegative integers squared takes about 0.6 seconds. And that includes clearing the 125-million-byte bitmap and setting each bit for such sums:
Code:
```#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>

int main(void)
{
unsigned char   *map;
size_t           size;
unsigned long    limit, sum1, add1, sum2, add2;

limit = 1000000000UL;
size = (size_t)(limit / CHAR_BIT) + 1; /* At least one extra bit, aligns to a byte boundary */

map = malloc(size);
if (!map)
return 1;
memset(map, 0, size);

sum1 = 1UL; /* a^2, starting from a = 1 */
add1 = 3UL; /* (a+1)^2 - a^2 = 2a + 1, the next odd number */

while (sum1 <= limit) {

sum2 = sum1; /* a^2 + b^2, with b starting from 0 */
add2 = 1UL;

while (sum2 <= limit) {
map[sum2 / CHAR_BIT] |= 1U << (sum2 % CHAR_BIT);
sum2 += add2;
add2 += 2UL;
}

sum1 += add1;
add1 += 2UL;
}

/* Uncomment to print each sum found:
for (sum1 = 0; sum1 <= limit; sum1++)
if (map[sum1 / CHAR_BIT] & (1U << (sum1 % CHAR_BIT)))
printf("%lu\n", sum1);
*/

free(map);
return 0;
}```
Simply printing out the 173 million numbers that are the sum of two squared nonnegative integers -- just uncomment the last loop -- takes over 54 seconds.

The above uses the fact that the sequence of squared integers increases by the next odd integer, because (x+1)^2 - x^2 = 2x + 1. In other words, 1 = 1, 4 = 1 + 3, 9 = 1 + 3 + 5, 16 = 1 + 3 + 5 + 7, 25 = 1 + 3 + 5 + 7 + 9, and so on. sum1 is this sequence. sum2 is the same sequence, except starting at zero, and offset by sum1. I believe this was what iMalc vaguely alluded to.

My original code saves the nonzero term of the sum, and outputs not only the sum, but the two nonzero terms, too. The additional forty seconds my code takes is due to the extra output, and the integer square roots needed to find out the other term, as only one of the terms is saved. The total time is about ninety seconds.

My background is in physics, not in mathematics, so I might have used the term "sieve" wrong here.

In this case, I have an ordered set of numbers. The position indicates the value of the sum, and the number itself is one of the terms (the nonzero one if one is zero). I call that a sieve, for lack of a better term -- and prime number sieves are populated in very similar fashion in practice. I would also consider a bit map with only the sums that have been found to match a sum of two nonnegative squares a sieve. In this case, all the entries in the sieve are known to be valid, so perhaps "sieve" is not the correct term. Let me know if you know better; I'd be happy to use the proper terms.

My code uses a fixed-size window, and can apply the same calculation to any consecutive set of nonnegative integers that can be represented by an unsigned 64-bit integer. Because doubles have only up to 52 bits of precision, I use a hand-written version of the integer square root algorithm (an unoptimized one at that). Here is my original code. When run on the command line, it takes one or two parameters: either the lower and upper bounds, inclusive, of the desired region, or just the upper bound (in which case the lower bound is 1). The interesting part is in the u64sieve() function, which I designed to be modular so it can be reused later if a need arises. You never know.

Code:
```#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* 64-bit unsigned integer square root.
*
* Since IEEE-754 doubles have only 52 bits of mantissa,
* sqrt() provided by math.h may produce incorrect results due to rounding.
* For example:
*          u64sqrt(15826587771568349189) = 3978264416
*   3978264416^2 = 15826587763611821056
* but
*             sqrt(15826587771568349189.0) = 3978264417.0
* and
*   3978264417^2 = 15826587771568349889 > 15826587771568349189
* In other words, math.h sqrt() can return an integer value,
* that when squared, is greater than the original value.
*
*/
static inline uint64_t u64sqrt(uint64_t value)
{
uint64_t   result = 0UL;
uint64_t   bit = 4UL;

while (bit <= value >> 2U) {
if (!(bit << 2U))
break;
bit <<= 2U;
}

while (bit)
if (value >= result + bit) {
value -= result + bit;
result = (result >> 1U) + bit;
bit >>= 2U;
} else {
result >>= 1U;
bit >>= 2U;
}

return result;
}

/* Sieve for sums of two squared integers.
* For sum: start <= sum < start + length,
* if       factor[sum - start] > 0,
* then     i1 = factor[sum - start], i1 > 0,
*          i2 = u64sqrt(sum - i1*i1), i2 >= 0,
* and      sum = i1*i1 + i2*i2.
*/
static inline void u64sieve(uint64_t *const factor,
uint64_t const  start,
size_t const    length)
{
const uint64_t  finish = start + (uint64_t)length;
uint64_t        factor1 = (uint64_t)1;

/* Clear all factors to zero first. */
memset(factor, 0, length * sizeof (uint64_t));

while (1) {
/* Find out the smallest sensible factor to test. */
const uint64_t  factor2 = (factor1*factor1 < start) ? u64sqrt(start - factor1*factor1) : (uint64_t)0;

/* Instead of computing the sum of the squares for each iteration,
* we utilize the fact that (x)*(x) + 2*(x) + 1 = (x+1)*(x+1). */
uint64_t        sum = factor1 * factor1 + factor2 * factor2;
uint64_t        add = (uint64_t)2 * factor2 + (uint64_t)1;

/* Sieve completed? */
if (sum >= finish)
return;

/* Skip to the start of the sieve; same as if we increased factor2 and recalculated sum. */
while (sum < start) {
sum += add;
add += (uint64_t)2;
}

while (sum < finish) {

/* New factor? */
if (!factor[sum - start])
factor[sum - start] = factor1; /* Note: factor1 is always nonzero. */

sum += add;
add += (uint64_t)2;
}

/* Next factor. */
factor1++;
}
}

/* Maximum number of uint64_t's in a sieve.
* Let's use up to 32 MB for the sieve.
*/
#define SIEVE_SIZE  4194304

int main(int argc, char *argv[])
{
size_t           factors, n, i;
uint64_t        *factor;
uint64_t         imin, imax;
unsigned long    value;
char             dummy;

if (argc < 2 || argc > 3 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
fprintf(stderr, "\n");
fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
fprintf(stderr, "       %s [ from ] to\n", argv[0]);
fprintf(stderr, "\n");
fprintf(stderr, "This program will output lines of form\n");
fprintf(stderr, "\tsum = factor1^2 + factor2^2\n");
fprintf(stderr, "for all integer sum, from <= sum <= to,\n");
fprintf(stderr, "with nonnegative integer factor1 and factor2.\n");
fprintf(stderr, "\n");
return 0;
}

if (argc > 2) {
if (sscanf(argv[argc - 2], "%lu %c", &value, &dummy) != 1) {
fprintf(stderr, "%s: Invalid lower limit.\n", argv[argc - 2]);
return 1;
}
imin = (uint64_t)value;
if (imin < (uint64_t)1 || (unsigned long)imin != value) {
fprintf(stderr, "%s: Invalid lower limit.\n", argv[argc - 2]);
return 1;
}
} else
imin = (uint64_t)1;

if (sscanf(argv[argc - 1], "%lu %c", &value, &dummy) != 1) {
fprintf(stderr, "%s: Invalid upper limit.\n", argv[argc - 1]);
return 1;
}
imax = (uint64_t)value;
if (imax < imin || (unsigned long)imax != value) {
fprintf(stderr, "%s: Invalid upper limit.\n", argv[argc - 1]);
return 1;
}

/* Maximum sieve size to allocate... */
factors = (size_t)SIEVE_SIZE;

/* ... except don't allocate more than we need. */
if ((uint64_t)factors > imax - imin + 1)
factors = (size_t)(imax - imin + 1);

factor = malloc(factors * sizeof (uint64_t));
if (!factor) {
fprintf(stderr, "Out of memory.\n");
return 1;
}

/* Sieve through the requested window. */
while (imin <= imax) {

/* Did imin wrap around? */
if (imin < (uint64_t)1)
break;

/* Do not exceed the given region. */
if (imax - imin >= (uint64_t)factors)
n = factors;
else
n = (size_t)(imax - imin + 1);

/* Apply the sieve. */
u64sieve(factor, imin, n);

/* Report the findings. */
for (i = 0; i < n; i++)
if (factor[i]) {
const uint64_t  sum = imin + (uint64_t)i;
const uint64_t  factor1 = factor[i];
const uint64_t  factor2 = u64sqrt(sum - factor1 * factor1);
printf("%lu = %lu^2 + %lu^2\n", (unsigned long)sum, (unsigned long)factor1, (unsigned long)factor2);
}

/* Move the window. */
imin += n;
}

free(factor);

return 0;
}```
The u64sieve() function above can be optimized further by using the integer square root -- which, unlike the math.h sqrt(), returns a 64-bit unsigned integer that does not exceed the original argument when squared -- to eliminate large swathes of the inner loop. However, because it takes only a very small fraction of the total run time, it is not worth optimizing. Using a custom unsigned integer printing routine and aggressive caching should at least halve the total run time, I think.

I must say, I don't think I like the atmosphere in this discussion board.

15. OK, that post helps a lot. I can see where printing out everything would add a lot of time - that certainly answers some of my questions.

Your first implementation is basically the same as mine. I added in the optimization you used, incrementing by odd values rather than multiplying, and it bought me a few seconds, as did removing my debugging code. And I'm on a 32-bit system, so the final sum had to be an unsigned long long, but overall it's basically the same approach.

Code:
```#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <limits.h>

#define MAX_SEARCH 1000000000UL
//#define MAX_SEARCH 10UL

//static unsigned long counter_calls;
//static unsigned long counter_found;

static unsigned long check_sum(unsigned long a,
unsigned long b,
unsigned char *sums_found)
{
unsigned long val = a + b;
//counter_calls += 1;
//printf ("Checking %lu + %lu = %lu ... ", a, b, val);
if (sums_found[val / CHAR_BIT] & (1 << (val % CHAR_BIT)))
{
//  counter_found += 1;
return 0;
}
sums_found[val / CHAR_BIT] |= 1 << (val % CHAR_BIT);
return val;
}

int main(void)
{
unsigned char *sums_found;
unsigned long  a_sqr;
unsigned long  a_inc;
unsigned long  b_sqr;
unsigned long  b_inc;
unsigned long long sum = 0;

sums_found = calloc(MAX_SEARCH / CHAR_BIT + 1, 1);
if (!sums_found)
return 1;

a_inc = 1;
for (a_sqr = 0; a_sqr <= MAX_SEARCH;)
{
b_inc = a_inc;
for (b_sqr = a_sqr; a_sqr + b_sqr <= MAX_SEARCH; )
{
sum += check_sum(a_sqr, b_sqr, sums_found);
b_sqr += b_inc;
b_inc += 2;
}
a_sqr += a_inc;
a_inc += 2;
}

printf ("sum = %llu\n", sum);
//  printf ("found = %lu, calls = %lu, hit percent = %f\n", counter_found, counter_calls, (double)counter_found/ counter_calls);

free(sums_found);
return 0;
}```
This runs in ~15 seconds on my system. Modifying your basic code to calculate the overall total and print it runs in ~32 seconds for me, which is really strange since it's doing basically the same thing. My guess is that I get some benefit from cache hits by accessing my found-sums bitmask back to back, rather than writing it once and then later calculating the sum.

Which is where your second approach helps? I can see why you call it a sieve -- it's applying the ideas of something like a wheel sieve to make your array of found numbers smaller. In any case, it's a good approach, since the limiting factor is memory access time for a search of this size.

Just to see I understood the approach, I changed SIEVE_SIZE (after hacking the code to just sum up the unique sums of squares) :

32MB : 43 sec
16MB : 42 sec
8MB  : 39 sec
6MB  : 25 sec
4MB  : 24 sec
2MB  : 30 sec
1MB  : 44 sec
512KB: 74 sec
256KB: 133 sec

These are all run on a 32-bit system so the absolute numbers don't matter, just the trend.

You can see this is most efficient when you fill up the 6MB L2 on my system. Too big and you end up paying for memory latency; too small and the overhead of figuring out whether you're in the window kills performance. I was curious to see if I'd get a boost from fitting in L1, but no luck.

As for the atmosphere here, not sure if you meant me, but my questions were genuine. I figured there was something more complex going on but honestly couldn't figure it out. In general, keep in mind that the normal level of question here is "do my CS101 homework for me", so people are hesitant to give too much detail, especially when the OP doesn't post anything they've tried first.