Thread: unrolling strategies

  1. #1
    3735928559
    Join Date
    Mar 2008
    Location
    RTP
    Posts
    838

    unrolling strategies

    i have to deal with some very large matrices; approximately 2^14 x 100.

    a huge amount of the computational power involved is simply in managing the for loops to carry out the calculations, so it would be great to optimize these.

    looping with template chains works wonderfully for smaller matrices, but these are obviously far too large for the depth limits of 512 & 2000 for inheritance and specialization, respectively.

    if anyone has ideas for how to partition the work so that the compiler won't error out using template chains, or something more clever than simply explicit unrolling statements, i'd love to hear it.

    thanks.

  2. #2
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    You can do more than one iteration in a template instance... You could have a version that unrolls 10 copies inline, for instance, then multiply that out to get a lower template depth.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  3. #3
    3735928559
    Join Date
    Mar 2008
    Location
    RTP
    Posts
    838
that sounds like the strategy i planned to pursue next; it seems like the simplest way to go.

  4. #4
    Registered User
    Join Date
    Apr 2006
    Posts
    2,149
Loop unrolling is already done to some degree both by the compiler and, at run time, by the processor, so make sure to profile to see if your changes actually make a difference.
    It is too clear and so it is hard to see.
    A dunce once searched for fire with a lighted lantern.
    Had he known what fire was,
    He could have cooked his rice much sooner.

  5. #5
    3735928559
    Join Date
    Mar 2008
    Location
    RTP
    Posts
    838
    for polynomials, the boost in speed increased with the order of the equation. at 12th order, the difference was ~4.5x

  6. #6
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Beware that if you unroll too much, the result won't fit in L1 cache any more. A branch in cache is likely to be less expensive than cache misses.

    Also bear in mind that in some architectures with branch prediction, the cost of a successful guess is close to zero.

    As ever, it is probably a good idea to start with a baseline of "simple" code which you can profile and benchmark. For one thing, optimisers are the most effective when the programmer isn't being too "clever" with the code.

    Ad-hoc hackery without a consistent approach to measuring the result won't get you anywhere.

    It is also important that you understand all the options a modern compiler has, before starting on your own hatchet jobs on the code.
    Optimize Options - Using the GNU Compiler Collection (GCC)
    Eg.
    -fno-guess-branch-probability
    Do not guess branch probabilities using heuristics. GCC will use heuristics to guess branch probabilities if they are not provided by profiling feedback (-fprofile-arcs). These heuristics are based on the control flow graph. If some branch probabilities are specified by `__builtin_expect', then the heuristics will be used to guess branch probabilities for the rest of the control flow graph, taking the `__builtin_expect' info into account. The interactions between the heuristics and `__builtin_expect' can be complex, and in some cases, it may be useful to disable the heuristics so that the effects of `__builtin_expect' are easier to understand.
    The default is -fguess-branch-probability at levels -O, -O2, -O3, -Os.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  7. #7
    3735928559
    Join Date
    Mar 2008
    Location
    RTP
    Posts
    838
yes, i had begun to consider that, although it's not a subject i know a lot about. if i am off base (good chance of that; i'm an engineer by trade, not a computer scientist...), please disabuse me of my ignorance:

if i assume a cache size of 32k (is 64k the de-facto standard now?), i should be able to fit ~4k 64-bit values before worrying about spilling out of L1.

    if i use specialization, i could partition this into 2 chains of 2k operations (or possibly 512 __m128d operations) each.

  8. #8
    3735928559
    Join Date
    Mar 2008
    Location
    RTP
    Posts
    838
    hmm....

    it seems you can "cheat" a little bit using multiple inheritance; evidently the depth rules only apply to a single chain.

    Code:
enum Branch
{
    Left  = 0,
    Right = 1
};

template<unsigned int N, Branch b> class Node :
    Node<N-1, b>
{
    public:
    enum { Order = N };
};

template<Branch b> class Node<0, b>
{
    public:
    enum { Order = 0 };
};

template<unsigned int N> class Tree :
    Node<N/2, Left>,
    Node<N - Node<N/2, Left>::Order, Right>
{
};

void foo()
{
    Tree<800> u;
}

  9. #9
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
When it comes to dealing with matrices you should make your code work smarter, not harder. It depends on what you're doing with those matrices, but there are algorithms out there that can perform multiplication more quickly, for example.
Loop unrolling gives you diminishing returns: the more you unroll, the less it is worth. I suspect that in your polynomial case the only reason you got anywhere near such an improvement is that you were unrolling multiple levels of nested loops, so the improvements compound. For such a large matrix, though, I don't expect massive unrolling to make that much of an improvement.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  10. #10
    Registered User
    Join Date
    Oct 2006
    Posts
    3,445
    if iterations of the loop do not depend on each other, you could try running them in parallel using something like OpenMP, or explicitly spawn threads for ranges of the loop.

  11. #11
    3735928559
    Join Date
    Mar 2008
    Location
    RTP
    Posts
    838
    hmm...the polynomials were definitely singly-looped.

    the difference was essentially between:

    Code:
    double evaluate(double x)
    {
        double result=0;
        for(int i=0;i<N;++i)
        {
            result+=pow(x,i)*coeff[i];
        }
        return result;
    }
    and

    Code:
double eval(double x)
    {
        return x*x*x+polynomial<3>::coefficient+x*x+polynomial<2>::coefficient+x*polynomial<1>::coefficient+polynomial<0>::coefficient;
    }

  12. #12
    3735928559
    Join Date
    Mar 2008
    Location
    RTP
    Posts
    838
    i don't know why it took me so long to see this, but there's no reason to use inheritance at all. static class methods + specialization are the way to go. there are apparently no limits (or much larger limits) on the number of class specializations you can have.

  13. #13
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by m37h0d View Post
    hmm...the polynomials were definitely singly-looped.

    the difference was essentially between:

    Code:
    double evaluate(double x)
    {
        double result=0;
        for(int i=0;i<N;++i)
        {
            result+=pow(x,i)*coeff[i];
        }
        return result;
    }
    and

    Code:
double eval(double x)
    {
        return x*x*x+polynomial<3>::coefficient+x*x+polynomial<2>::coefficient+x*polynomial<1>::coefficient+polynomial<0>::coefficient;
    }
Oh that explains it! I wasn't going to say it but more than a 2x improvement through loop unrolling is next to impossible, so yeah I figured something was up. It turns out that probably 99% of the speed improvement you got there was from not using the very slow pow function! (well, the speed of pow with an integer exponent does somewhat vary among compilers)
Aside from the obvious bug there with a few + instead of * in that code, you could make that even faster by avoiding a lot of multiplications, if you do it like this:
    Code:
double eval(double x)
    {
        return x*(x*(x*polynomial<3>::coefficient+polynomial<2>::coefficient)+polynomial<1>::coefficient)+polynomial<0>::coefficient;
    }
    Only 3 multiplies!

    Now you know that the speed improvement had not so much to do with loop unrolling, perhaps you'll reconsider loop unrolling at all with your matrix stuff huh!
    Last edited by iMalc; 06-24-2011 at 03:44 PM.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  14. #14
    Registered User
    Join Date
    Apr 2006
    Posts
    2,149
    Moving multiplies and adds around like that needs to be done with care, because it can yield very different results when the order of magnitude of the operands varies significantly.
    It is too clear and so it is hard to see.
    A dunce once searched for fire with a lighted lantern.
    Had he known what fire was,
    He could have cooked his rice much sooner.

  15. #15
    3735928559
    Join Date
    Mar 2008
    Location
    RTP
    Posts
    838
    Quote Originally Posted by iMalc View Post
Oh that explains it! I wasn't going to say it but more than a 2x improvement through loop unrolling is next to impossible, so yeah I figured something was up. It turns out that probably 99% of the speed improvement you got there was from not using the very slow pow function! (well, the speed of pow with an integer exponent does somewhat vary among compilers)
Aside from the obvious bug there with a few + instead of * in that code, you could make that even faster by avoiding a lot of multiplications, if you do it like this:
    Code:
double eval(double x)
    {
        return x*(x*(x*polynomial<3>::coefficient+polynomial<2>::coefficient)+polynomial<1>::coefficient)+polynomial<0>::coefficient;
    }
    Only 3 multiplies!

    Now you know that the speed improvement had not so much to do with loop unrolling, perhaps you'll reconsider loop unrolling at all with your matrix stuff huh!
    possibly, but i don't think so. i'm getting really good performance with dot-products. i'm still tweaking the benchmarking setup to ensure the optimizer doesn't give me an unfair baseline, but so far the results are highly compelling.
