1. ## Matrix product

Hello!

I searched the forum on this topic. My problem is as follows: I wrote a C++ algorithm which involved segment trees and matrix product. I had a timelimit of 1.1 seconds.

In the "matrix product" function I reversed the order of the 'for'-s as follows:

Code:
```for (k = 0; k<n; ++k)
for (i=0; i<n; ++i)
for (j=0; j<n; ++j)
C[i][j] += A[i][k] * B[k][j];```
It was implemented in a matrix class and A, B and C are in fact instances (didn't write all the code here)... The normal (and logical) order of the for-s was (i,j,k), but putting them this way (k,i,j), my code ran faster (worst case, from 1 sec -> 0.8 sec)...

I would like to ask you why is this happening? Any permutation of (i,j,k) I would use, one array would be accessed "normally", one on one row at a time and one on one column at a time... so the cache is not working in some parts, but does in others.

I hope my English is not bad enough to make this post impossible to be followed!

Thank you!

2. You are using a loop like this and wondering and then asking questions as to why the order causes performance differences? My main question to you is why you even care. My second question is that this type of loop (with 3 for's) is not going to get good performance no matter what you do with the order. Unroll some of the loops and you will get better performance.

There has to be a better way.

3. Yes, my concern is why the order of the loops makes any difference. I don't know anything about "unrolling loops" (or maybe I do but I don't know it's called like that, could you please explain me a bit?), but the complexity needed for this operation was O(N^3).

There's no doubt it would be very nice to find a better way, but as far as I know, doing a matrix product requires 3 for loops.

Edit:
is not going to get good performance no matter what you do with the order.
It is not, but the difference was enough for me to finish that code and not search for further optimization. But, I would like to find a solution - if there is any.

4. Unroll some of the loops and you will get better performance.
It should be tested with profiler. Some optimizers have problems with optimizing unrolled loops

5. Ok, found out more about loop unrolling on:

http://en.wikipedia.org/wiki/Loop_unrolling

Is it enough? Is it the thing you said I could do (in the deep-most for to compute several steps ahead)?

If it is, I will give it a try!

Edit: found out even more: http://en.wikipedia.org/wiki/Loop_nest_optimization