For N=21, I got 497 cycles with the code above.
I replaced line 14 with this:
Code:
dsll r9, r3, 3
which resulted in 391 cycles! The reason is that a shift takes only one stage in the ALU to execute, while mul takes 7.
Now, if only I could replace the array[i]*2147484 with some other instruction(s), other than a division, that would help me a lot.
EDIT: 2147484 = 2^21 + 2^15 + 2^14 + 2^10 + 2^7 + 2^4 + 2^3 + 2^2, but I do not know if this helps. It has 8 terms, so 8 shifts (plus 7 additions to sum them)...
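
To check the decomposition, here is a small C sketch of the same arithmetic. It is not the assembly itself: the function names are made up for illustration, and I am assuming 64-bit unsigned values so that overflow wraps the same way as the multiply. Each `<<` would be one dsll and each `+` or `-` one add or subtract. The second function uses 2^4 + 2^3 + 2^2 = 2^5 - 2^2, which brings it down to 7 terms.

```c
#include <stdint.h>

/* Multiply by 2147484 using only shifts and adds.
 * 2147484 = 2^21 + 2^15 + 2^14 + 2^10 + 2^7 + 2^4 + 2^3 + 2^2
 * (8 shifts, 7 additions). Name is hypothetical. */
uint64_t mul_2147484_shift_add(uint64_t x)
{
    return (x << 21) + (x << 15) + (x << 14) + (x << 10)
         + (x << 7)  + (x << 4)  + (x << 3)  + (x << 2);
}

/* Same product with one term fewer, since
 * 2^4 + 2^3 + 2^2 = 2^5 - 2^2
 * (7 shifts, 5 additions, 1 subtraction). Name is hypothetical. */
uint64_t mul_2147484_shift_sub(uint64_t x)
{
    return (x << 21) + (x << 15) + (x << 14) + (x << 10)
         + (x << 7)  + (x << 5)  - (x << 2);
}
```

Whether either version actually beats the single mul depends on how many stall cycles the multiply really costs in the loop, so it would need to be measured in the simulator.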