Something like this (just typed in; I was too lazy to build a prototype and let a compiler check it):
Code:
double array[8][512];
int sp;
// initialize every element
memcpy(array[0], ... <input array>, 512 * sizeof(double));
memcpy(array[1], ... <input array>, 512 * sizeof(double));
memcpy(array[2], ... <input array>, 512 * sizeof(double));
memcpy(array[3], ... <input array>, 512 * sizeof(double));
...
memcpy(array[7], ... <input array>, 512 * sizeof(double));
sp = 0;
for (;;) {
    // do whatever processing on the entire 4096 element array
    memcpy(array[sp], ... <input new chunk>, 512 * sizeof(double));
    sp = (sp + 1) % 8; // advance to the next slot, wrapping 7 -> 0
}
Each new set of 512 elements is slotted into the next "stack" position. There's never any physical shifting of old data. The eight 512-element slots are simply overwritten in turn, so after each write sp points at the oldest chunk still in the array.
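For what it's worth, here is the same idea as a compilable sketch, in case it helps to see it without the "..." placeholders. The names chunk_ring, ring_init, ring_push and ring_chunk are made up for illustration, and I'm assuming the initial fill comes from one contiguous 4096-element input, which may not match your setup.

```c
#include <assert.h>
#include <string.h>

#define SLOTS 8     /* number of 512-element chunks kept */
#define CHUNK 512   /* doubles per chunk */

/* Hypothetical wrapper around the 8 x 512 array and its slot index. */
typedef struct {
    double data[SLOTS][CHUNK];
    int sp;         /* next slot to overwrite, i.e. the oldest chunk */
} chunk_ring;

/* Assumption: `initial` holds SLOTS * CHUNK contiguous doubles. */
void ring_init(chunk_ring *r, const double *initial)
{
    assert(r != NULL && initial != NULL);
    for (int i = 0; i < SLOTS; i++)
        memcpy(r->data[i], initial + (size_t)i * CHUNK, sizeof r->data[i]);
    r->sp = 0;
}

/* Overwrite the oldest slot with a new chunk; no other data moves. */
void ring_push(chunk_ring *r, const double *chunk)
{
    assert(r != NULL && chunk != NULL);
    memcpy(r->data[r->sp], chunk, sizeof r->data[r->sp]);
    r->sp = (r->sp + 1) % SLOTS;   /* wrap 7 -> 0 */
}

/* Chunk k in age order: k = 0 is the oldest, k = SLOTS - 1 the newest. */
const double *ring_chunk(const chunk_ring *r, int k)
{
    assert(r != NULL && k >= 0 && k < SLOTS);
    return r->data[(r->sp + k) % SLOTS];
}
```

If the processing needs the chunks oldest to newest, loop k from 0 to SLOTS - 1 over ring_chunk(&r, k). If order doesn't matter, just walk r.data directly as one 4096-element block.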
Sorry, my previous post said 4 x 512 when it's really 8 x 512 = 4096.
If you can elaborate on which algorithm is doing the actual processing, that may influence how the data should be organized.