Originally Posted by

**Dave_Sinkula**
Code:

sol += C[i][j] * (xpwr * ypwr);

Investigate whether or not the resulting assembly code generates successive [re]calculations of

C[i][j] within the loop. If so, your code may benefit from using pointers calculated once at the start of the loop.

First, this advice sucks, so I'm glad you didn't pursue it. I did investigate, and found that it can make things worse. And such micro-optimization should generally be avoided.

Your attempts so far seem very reasonable and you seem to know quite well what it is you are doing, so my further advice on your problem would be questionable.

I am not familiar with ACPE, but I've tried to look into it a bit. I was wondering if you've got some tool that you use that might be helpful to share with the rest of us?

Also, such as in the case of your precomputed version, is the amortization over a single call to the function, or is it amortized over all calls? The point being, can you do something like this?

Code:

val_t poly2d_dave(val_t C[M][N])
{
static val_t coeffs[M][N] = {0};
static int init = 0;
int i, j;
val_t sol = 0.0;
if ( init == 0 )
{
val_t xpwr = 1.0;
val_t ypwr = 1.0;
val_t x = *x_p;
val_t y = *y_p;
for ( i = 0; i < M; i++ )
{
ypwr = 1.0;
for ( j = 0; j < N; j++ )
{
coeffs[i][j] = xpwr * ypwr;
ypwr *= y;
}
xpwr *= x;
}
init = 1;
}
for ( i = 0; i < M; i++ )
{
for ( j = 0; j < N; j++ )
{
sol += C[i][j] * coeffs[i][j];
}
}
return sol;
}

Again, my advice on this stinks. I come from the land of micro-optimization and I know those techniques better. Perhaps some of them can combine with better algorithms, but I'm venturing outside of my normal sphere when I look at improving the algorithm.

[edit]I shouldn't. But oh, well.

Code:

val_t poly2d_dave(val_t C[M][N])
{
static val_t coeffs[M][N] = {0};
static int init = 0;
val_t *mptr = &C[0][0], *cptr = &coeffs[0][0];
int i, j;
val_t sol = 0.0;
if ( init == 0 )
{
val_t xpwr = 1.0;
val_t ypwr = 1.0;
val_t x = *x_p;
val_t y = *y_p;
for ( i = 0; i < M; i++ )
{
ypwr = 1.0;
for ( j = 0; j < N; j++ )
{
coeffs[i][j] = xpwr * ypwr;
ypwr *= y;
}
xpwr *= x;
}
init = 1;
}
for ( i = 0; i < M * N; i++ )
{
sol += *mptr++ * *cptr++;
}
return sol;
}

After the first time, the loop should be tighter.

But, of course, that depends on the amortization question I asked. And just ignore the stupid micro-optimization attempts here please.