unrolling strategies

Printable View

06-21-2011
m37h0d

unrolling strategies

i have to deal with some very large matrices; approximately 2^14 x 100.

a huge amount of the computational power involved is simply in managing the for loops to carry out the calculations, so it would be great to optimize these.

looping with template chains works wonderfully for smaller matrices, but these are obviously far too large for the depth limits of 512 & 2000 for inheritance and specialization, respectively.

if anyone has ideas for how to partition the work so that the compiler won't error out using template chains, or something more clever than simply explicit unrolling statements, i'd love to hear it.

thanks.
06-21-2011
brewbuck

You can do more than one iteration in a template instance... You could have a version that unrolls 10 copies inline, for instance, then multiply that out to get a lower template depth.
06-21-2011
m37h0d

that sounds like the strategy i planned to pursue next, seems like the simplest way to go.
06-21-2011
King Mir

Loop unrolling is done to some degree by both by the compiler, and at run time by the processor, so make sure to profile to see if your changes actually make a difference.
06-21-2011
m37h0d

for polynomials, the boost in speed increased with the order of the equation. at 12th order, the difference was ~4.5x
06-21-2011
Salem

Beware that if you unroll too much, the result won't fit in L1 cache any more. A branch in cache is likely to be less expensive than cache misses.

Also bear in mind that in some architectures with branch prediction, the cost of a successful guess is close to zero.

As ever, it is probably a good idea to start with a baseline of "simple" code which you can profile and benchmark. For one thing, optimisers are the most effective when the programmer isn't being too "clever" with the code.

Ad-hoc hackery without a consistent approach to measuring the result won't get you anywhere.

It is also important that you understand all the options a modern compiler has, before starting on your own hatchet jobs on the code.
Optimize Options - Using the GNU Compiler Collection (GCC)
Eg.

Quote:

-fno-guess-branch-probability
Do not guess branch probabilities using heuristics. GCC will use heuristics to guess branch probabilities if they are not provided by profiling feedback (-fprofile-arcs). These heuristics are based on the control flow graph. If some branch probabilities are specified by `__builtin_expect', then the heuristics will be used to guess branch probabilities for the rest of the control flow graph, taking the `__builtin_expect' info into account. The interactions between the heuristics and `__builtin_expect' can be complex, and in some cases, it may be useful to disable the heuristics so that the effects of `__builtin_expect' are easier to understand.
The default is -fguess-branch-probability at levels -O, -O2, -O3, -Os.
06-21-2011
m37h0d

yes, i had begun to consider that, although its not a subject i know a lot about. if i am off base (good chance of that; i'm an engineer by trade, not a computer scientist...), please disabuse me of my ignorance:

if i am if i assume a cache size of 32k (64k is de-facto standard now?); i should be able to get ~4k 64-bit flops in before worrying about cache coherence

if i use specialization, i could partition this into 2 chains of 2k operations (or possibly 512 __m128d operations) each.
06-21-2011
m37h0d

hmm....

it seems you can "cheat" a little bit using multiple inheritance; evidently the depth rules only apply to a single chain.

Code:

enum Branch { Left=0, Right=1 }; template<unsigned int N,Branch b>class Node: Node<N-1,b> { public: enum{Order=N}; }; template<Branch b>class Node<0,b> { public: enum{Order=0}; }; template<unsigned int N>class Tree: Node<N/2,Left>, Node<N-Node<N/2,Left>::Order,Right> { }; void foo() { Tree<800>u; }
06-22-2011
iMalc

When it comes to dealing with matricies you should make your code work smarter, not harder. It depends on what you're doing with those matricies, but there are algorithms out there that can perform multiplications quicker for example.
Loop unrolling gives you diminishing returns. The more you unroll, the less it is worth it. I suspect that in your polynomial case that the only reason you might have gotten anywhere near such an improvement is that you were unrolling multiple levels of nested loops, so the improvements compound. For such a large matrix though, I don't expect such massive unrolling to make that much improvement.
06-22-2011
Elkvis

if iterations of the loop do not depend on each other, you could try running them in parallel using something like OpenMP, or explicitly spawn threads for ranges of the loop.
06-22-2011
m37h0d

hmm...the polynomials were definitely singly-looped.

the difference was essentially between:

Code:

double evaluate(double x) { double result=0; for(int i=0;i<N;++i) { result+=pow(x,i)*coeff[i]; } return result; }

and

Code:

double eval(doublex) { return x*x*x+polynomial<3>::coefficient+x*x+polynomial<2>::coefficient+x*polynomial<1>::coefficient+polynomial<0>::coefficient; }
06-24-2011
m37h0d

i don't know why it took me so long to see this, but there's no reason to use inheritance at all. static class methods + specialization are the way to go. there are apparently no limits (or much larger limits) on the number of class specializations you can have.
06-24-2011
iMalc

Quote:

Originally Posted by m37h0d

hmm...the polynomials were definitely singly-looped.

the difference was essentially between:

Code:

double evaluate(double x) { double result=0; for(int i=0;i<N;++i) { result+=pow(x,i)*coeff[i]; } return result; }

and

Code:

double eval(doublex) { return x*x*x+polynomial<3>::coefficient+x*x+polynomial<2>::coefficient+x*polynomial<1>::coefficient+polynomial<0>::coefficient; }

Oh that explains it! I wasn't going to say it but more than a 2x improvement through loop unrolling is next to impossible, so yeah I figured something was up. It turns out that probably 99% of the speed improvement you got there was from not using the very slow pow function! (well the speed of pow with an integer exponent does somewhat vary amoung compilers)
Aside from the obvious bug there with a few + instead of * in that code, you could make that even faster by avoiding a lot of multiplications, if you do it like this:

Code:

double eval(doublex) { return x*(x*(x*polynomial<3>::coefficient+polynomial<2>::coefficient)+polynomial<1>::coefficient)+polynomial<0>::coefficient; }

Only 3 multiplies!

Now you know that the speed improvement had not so much to do with loop unrolling, perhaps you'll reconsider loop unrolling at all with your matrix stuff huh!
06-24-2011
King Mir

Moving multiplies and adds around like that needs to be done with care, because it can yield very different results when the order of magnitude of the operands varies significantly.
06-27-2011
m37h0d

Quote:

Originally Posted by iMalc

Oh that explains it! I wasn't going to say it but more than a 2x improvement through loop unrolling is next to impossible, so yeah I figured something was up. It turns out that probably 99% of the speed improvement you got there was from not using the very slow pow function! (well the speed of pow with an integer exponent does somewhat vary amoung compilers)
Aside from the obvious bug there with a few + instead of * in that code, you could make that even faster by avoiding a lot of multiplications, if you do it like this:

Code:

double eval(doublex) { return x*(x*(x*polynomial<3>::coefficient+polynomial<2>::coefficient)+polynomial<1>::coefficient)+polynomial<0>::coefficient; }

Only 3 multiplies!

Now you know that the speed improvement had not so much to do with loop unrolling, perhaps you'll reconsider loop unrolling at all with your matrix stuff huh!

possibly, but i don't think so. i'm getting really good performance with dot-products. i'm still tweaking the benchmarking setup to ensure the optimizer doesn't give me an unfair baseline, but so far the results are highly compelling.