# efficient implementation of strassen's algorithm for matrix multiplication

• 01-23-2010
koodoo
Hi,

I am trying to implement strassen's algorithm in C++ that does matrix multiplication.
I have implemented it, but the results are not much better than those of the traditional approach of O(n^3) complexity.

In fact, as is already known, for smaller matrices the traditional approach outperforms Strassen's algorithm.

I tried multiplying two matrices A and B. The performance that I have obtained is:

Code:

```
Input: A(512 x 512) and B(512 x 512)
Time (traditional): 7 seconds
Time (strassen's):  7 seconds

Input: A(1024 x 1024) and B(1024 x 1024)
Time (traditional): 64 seconds
Time (strassen's):  46 seconds

Input: A(2048 x 2048) and B(2048 x 2048)
Time (traditional): 529 seconds
Time (strassen's):  527 seconds
```
The multiplication for A(4096 x 4096) and B(4096 x 4096) is still running on my machine :(
What I want to ask is: why is there no significant improvement from Strassen's algorithm? I believe this is because my implementation of Strassen's method performs a lot of memory accesses. One particular thing that I am doing is:

I have a matrix A of type vector< vector<double> >, and I have to partition this matrix into 4 parts: A11, A12, A21, A22. For now I am copying the original matrix A into 4 smaller matrices. Is there any way to avoid this copying, perhaps by using references in some way? I tried many ways but could not get it to work.

The Strassen function is called recursively and takes as input three matrices A, B, C, such that it calculates C = A x B.

Now I have to partition A into 4 parts A11, A12, A21, A22, each of which is itself of type matrix so that it can be passed to the Strassen function again, and I do not want to create them by copying.
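For reference, here is the identity the recursion exploits: seven half-size products instead of eight. This is just an illustrative sketch, not my actual implementation; it checks the recombination formulas on plain scalars, whereas in the real algorithm each Aij/Bij is an n/2 x n/2 block and each product M1..M7 is a recursive call.

```
// Strassen's identity, checked on the 2 x 2 scalar case. In the
// recursive algorithm each Aij/Bij is an n/2 x n/2 block and each
// product M1..M7 is a recursive call, so only 7 multiplications
// are needed instead of 8.
bool strassen_identity_holds(double A11, double A12, double A21, double A22,
                             double B11, double B12, double B21, double B22)
{
    double M1 = (A11 + A22) * (B11 + B22);
    double M2 = (A21 + A22) * B11;
    double M3 = A11 * (B12 - B22);
    double M4 = A22 * (B21 - B11);
    double M5 = (A11 + A12) * B22;
    double M6 = (A21 - A11) * (B11 + B12);
    double M7 = (A12 - A22) * (B21 + B22);

    double C11 = M1 + M4 - M5 + M7;
    double C12 = M3 + M5;
    double C21 = M2 + M4;
    double C22 = M1 - M2 + M3 + M6;

    // Compare against the conventional 8-multiplication result.
    return C11 == A11 * B11 + A12 * B21
        && C12 == A11 * B12 + A12 * B22
        && C21 == A21 * B11 + A22 * B21
        && C22 == A21 * B12 + A22 * B22;
}
```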

To rephrase: given a matrix of type vector< vector<double> >, can I create a smaller matrix from it without copying (by using references in some way)?
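Here is the kind of thing I have been trying (a rough sketch only; `MatrixView` is just a name I made up, and I have not verified its performance): store a pointer to the full matrix plus a row/column offset and a size, so that "partitioning" only creates new offsets and never copies elements.

```
#include <cstddef>
#include <vector>

// Non-owning view of an n x n square block of a larger matrix.
// "Partitioning" just creates new offsets; no elements are copied.
struct MatrixView {
    std::vector<std::vector<double> > *data;  // backing storage
    std::size_t row0, col0, n;                // top-left corner and size

    double &at(std::size_t i, std::size_t j) {
        return (*data)[row0 + i][col0 + j];
    }

    // Quadrants for the recursive step: 0 = A11, 1 = A12, 2 = A21, 3 = A22.
    MatrixView quadrant(int q) const {
        std::size_t h = n / 2;
        MatrixView v = { data, row0 + (q / 2) * h, col0 + (q % 2) * h, h };
        return v;
    }
};
```
The recursive Strassen function would then take three MatrixViews instead of three vectors.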

I may have made this post somewhat confusing. I would be glad to explain more if someone is willing to help.

Thanks,
koodoo
• 01-23-2010
grumpy
You can try extracting a vector<double *> from a vector<vector<double> >. For example, assuming the vector<vector<double> > is 2n by 2n, the upper left n x n submatrix can be obtained as follows:

Code:

```
std::vector<double *> upperleft(std::vector<std::vector<double> > &matrix)
{
    std::size_t i, newsize = matrix.size() / 2;
    std::vector<double *> retval(newsize);
    for (i = 0; i < newsize; ++i)
        retval[i] = &(matrix[i][0]);
    return retval;
}
```
and the bottom right can be obtained by:
Code:

```
std::vector<double *> bottomright(std::vector<std::vector<double> > &matrix)
{
    std::size_t i, newsize = matrix.size() / 2;
    std::vector<double *> retval(newsize);
    for (i = 0; i < newsize; ++i)
        retval[i] = &(matrix[i + newsize][newsize]);
    return retval;
}
```
Of course, that leaves the problem of logically multiplying them together ..... which still means you need to provide additional memory space to hold the result.
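To make that multiplication concrete, here is a sketch (my own addition, not part of the original reply; the function name is made up). Note that a vector<double *> no longer knows its own width, so the submatrix dimension n has to be passed alongside it.

```
#include <cstddef>
#include <vector>

// C += A * B for n x n submatrices addressed through row pointers
// (e.g. as returned by upperleft()/bottomright()). Each element of
// A, B, C points at the first entry of one row inside the original
// storage.
void multiply_add(const std::vector<double *> &A,
                  const std::vector<double *> &B,
                  const std::vector<double *> &C,
                  std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            double aik = A[i][k];               // hoisting A[i][k] keeps the
            for (std::size_t j = 0; j < n; ++j) // inner loop cache-friendly
                C[i][j] += aik * B[k][j];
        }
}
```
Ordering the loops i-k-j (rather than i-j-k) means both C and B are traversed row by row in the inner loop, which tends to matter at these matrix sizes.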

What this will give you, I don't know.

If your algorithm implementation is not particularly sophisticated, the dimension of matrices will need to be pretty large before your algorithm becomes more efficient than conventional matrix multiplication. I remember reading a report somewhere about a very rudimentary implementation of Strassen's algorithm, and it was less efficient than conventional matrix multiplication for dimensions up to 40000 or so.

Incidentally, you are also aware that Strassen's algorithm is less stable numerically than the conventional matrix multiplication? Errors due to finite numerical precision build up more rapidly, and the effects will be more noticeable for larger matrices. Fast algorithms are rarely a free lunch .....
• 02-18-2010
koodoo
Hi Grumpy,

Thanks a lot for your reply, and many apologies for the extremely late response. I tried vector<double *> but did not get any significant speedup. And you were right: Strassen's method does not give significant speedups with a naive implementation like mine.

Thanks again,
koodoo