Thanks for your response.
I was able to reduce the run time to about 10% of the original by manually unrolling the loops, in steps of three for the middle loop and steps of 16 for the inner loop. I also set gcc to optimization level 3 (-O3), but I haven't tried just using -funroll-loops yet (I don't know gcc that well).
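For anyone who hasn't unrolled a loop by hand, here is a rough sketch of the idea. It is not the exact code from the run above: the function name matmul_unrolled, the flat row-major layout, and the unroll factor of 4 (instead of 3 and 16) are all assumptions chosen to keep the example short.

```c
#include <stddef.h>

/* Hypothetical helper: c += a * b for n x n row-major matrices.
   The k loop is unrolled by 4 (illustrative factor only). */
void matmul_unrolled(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            /* four independent partial sums, one per unrolled step */
            double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
            size_t k = 0;

            /* main body: handles four values of k per iteration */
            for (; k + 4 <= n; k += 4) {
                s0 += a[i * n + k]     * b[k * n + j];
                s1 += a[i * n + k + 1] * b[(k + 1) * n + j];
                s2 += a[i * n + k + 2] * b[(k + 2) * n + j];
                s3 += a[i * n + k + 3] * b[(k + 3) * n + j];
            }

            /* cleanup loop for when n is not a multiple of 4 */
            for (; k < n; k++)
                s0 += a[i * n + k] * b[k * n + j];

            c[i * n + j] += (s0 + s1) + (s2 + s3);
        }
    }
}
```

The cleanup loop matters whenever the trip count is not a multiple of the unroll factor (512 is not divisible by 3, for example). To let the compiler try the same thing on the plain loops, something like gcc -O3 -funroll-loops should do it, though whether it beats the hand-written version depends on the machine.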
I know of another algorithm (Strassen's, or something like that) that is supposed to be really fast, but I don't know how to implement it.
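For reference, the core trick in Strassen's algorithm is multiplying two 2x2 matrices with 7 multiplications instead of 8. The sketch below shows only that single step on plain numbers; the function name strassen_2x2 is made up for illustration. A full implementation would treat each entry as a sub-block of a larger matrix and apply the same formulas recursively, which is not shown here.

```c
/* Hypothetical helper: c = a * b for 2x2 row-major matrices,
   using Strassen's seven products instead of the usual eight. */
void strassen_2x2(const double a[4], const double b[4], double c[4])
{
    /* a = [a11 a12; a21 a22], b = [b11 b12; b21 b22] */
    double a11 = a[0], a12 = a[1], a21 = a[2], a22 = a[3];
    double b11 = b[0], b12 = b[1], b21 = b[2], b22 = b[3];

    /* the seven Strassen products */
    double m1 = (a11 + a22) * (b11 + b22);
    double m2 = (a21 + a22) * b11;
    double m3 = a11 * (b12 - b22);
    double m4 = a22 * (b21 - b11);
    double m5 = (a11 + a12) * b22;
    double m6 = (a21 - a11) * (b11 + b12);
    double m7 = (a12 - a22) * (b21 + b22);

    /* recombine into the four entries of the result */
    c[0] = m1 + m4 - m5 + m7;  /* c11 */
    c[1] = m3 + m5;            /* c12 */
    c[2] = m2 + m4;            /* c21 */
    c[3] = m1 - m2 + m3 + m6;  /* c22 */
}
```

The saving only shows up when the entries are large blocks, since each multiplication then stands for a whole sub-matrix product while the extra additions stay comparatively cheap.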
There are also a few different ways to order the loops that give fewer cache misses, but they need an extra store instruction in the inner loop; depending on the machine, this can make it faster or slower. I did some simple tests and found that it seems to be faster on the Pentium III.
Code:
for (i = 0; i < 512; i++) {
    for (k = 0; k < 512; k++) {
        r = a[i][k];
        for (j = 0; j < 512; j++)
            c[i][j] += r * b[k][j];
    }
}
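To make the trade-off concrete, here is a sketch that puts the usual i-j-k order next to the i-k-j order from the snippet above, so the two can be timed against each other. The function names, the size parameter n, and the flat row-major arrays are assumptions for illustration, not part of the original test.

```c
#include <stddef.h>

/* Usual order: the inner loop walks down a column of b (stride n),
   but accumulates in a register and stores c[i][j] only once. */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] += sum;
        }
    }
}

/* Reordered: the inner loop walks along rows of b and c (stride 1),
   at the cost of one store to c per inner iteration. */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            double r = a[i * n + k];  /* constant over the inner loop */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += r * b[k * n + j];
        }
    }
}
```

Both functions compute the same products; only the memory access pattern differs, so which one wins depends on the cache and on how expensive the extra store is on a given machine.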