Thread: Lack of compiler loop optimization (Loop unswitching) ?

  1. #1
    Registered User
    Join Date
    Apr 2008
    Posts
    19

    Lack of compiler loop optimization (Loop unswitching) ?

    Using Visual C++ 2008 (express), the following program will finish on my CPU in either 1 second or around 4 seconds according to whether the scanf("%i",&tt) line is quoted or not.

    If it is enabled, I enter 0 (representing false), and one would then also expect the program to finish in 1 second, but unfortunately, it takes four.

    I did a little research and it turns out that there's a type of loop optimization called "loop unswitching". This means that at compile time, the loop is essentially duplicated, and the if(tt) u=sin((double)n); is removed in one of the versions.

    Unfortunately, Visual C++ isn't doing this optimization. Is there a way to force it to?

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/timeb.h>
    #include <math.h>
    
    
    void test(bool tt) {
    	double d=0;	double u=0;
    	timeb ti; ftime( &ti );	double l = ti.time*1000.0 + ti.millitm;		// Start the clock.
    	for(int n=0; n<1000000000; n++) {
    		d += u;
    
    		// The sin math part will never execute, but the 'If' will be looked at every time
    		// if the "scanf("%i",&tt)" line was used further below, and this will waste CPU cycles.
    		// I want the compiler optimization called "loop unswitching" to
    		// duplicate the entire For loop, with and without this line below.
    		if(tt) 	u=sin((double)n);		
    	}	
    	ftime( &ti );	double l2 = ti.time*1000.0 + ti.millitm;	// End the clock!
    	printf("Time taken: %f\n",l2-l);
    	printf("\n%f\n",d);
    }
    
    int main()
    {
    	bool tt=0;
    	scanf("%i",&tt);	// User enters 0 if this line is unquoted.
    	test(tt);
    	system("PAUSE");
    }

  2. #2
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Optimization Switches

    Note particularly the first few words: "Feature Only in Professional and Enterprise Editions"

  3. #3
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    How do you expect the compiler to unroll a loop that loops 1 000 000 000 times? What do you think the size of the executable would be?
    Also, if you remove the scanf, then tt will guaranteed to be 0 all the time, and therefore the compiler optimizes it, by basically removing all your code and loops.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  4. #4
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by Elysia View Post
    How do you expect the compiler to unroll a loop that loops 1 000 000 000 times?
    By unrolling it to some fixed degree, say ten iterations, and then looping around that.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  5. #5
    Registered User
    Join Date
    Apr 2008
    Posts
    19
    How do you expect the compiler to unroll a loop that loops 1 000 000 000 times?
    There's a difference between loop unrolling and loop unswitching. Loop unswitching will only duplicate the function, one with the conditional section and one without. First though it has to prove that the condition will always be constant during the loop, otherwise the optimization can't be done. See Wikipedia here:
    Loop unswitching - Wikipedia, the free encyclopedia

    Note particularly the first few words: "Feature Only in Professional and Enterprise Editions"
    I'm a little bit confused because I'm able to obtain /O2 versus /Od, even with the Express edition. Also, it doesn't mention loop unswitching amongst that lot. Can you clarify further? Also bear in mind that page references to Visual Studio 6.0 (1998), whilst I'm using 2008. From a look at Microsoft's site, I can't seem to find a comparison for 2008 - Express vs Pro.

  6. #6
    Registered User
    Join Date
    Oct 2008
    Location
    TX
    Posts
    2,059
    And what prevents you from simulating the loop unswitching compiler optimization?

  7. #7
    Registered User
    Join Date
    Apr 2008
    Posts
    19
    itCbitC, if you mean manually coding the two loops myself, then it's impractical because maintenance would be a nightmare (my real program isn't the example above, but hundreds of lines). Any change to the loop would mean a change to its 'mirror'. It gets worse if there are say, 3 conditions (which would produce 8 'mirrors' because 2^3=8).

  8. #8
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by twinbee View Post
    itCbitC, if you mean manually coding the two loops myself, then it's impractical because maintenance would be a nightmare (my real program isn't the example above, but hundreds of lines). Any change to the loop would mean a change to its 'mirror'. It gets worse if there are say, 3 conditions (which would produce 8 'mirrors' because 2^3=8).
    If it were C++, I'd say templates are the solution. In C, macros may be a possibility.

    Code:
    #define UNSWITCHED_LOOP( statement ) \
    for(...) \
    { \
        blah blah \
        statement; \
        blah blah \
    }
    
    ...
    
    if( condition1 )
    {
        UNSWITCHED_LOOP( statement1 );
    }
    else if( condition2 )
    {
        UNSWITCHED_LOOP( statement2 );
    }
    else if(...)
    {
        ...
    }
    Ugly, and a pain in the ass in the debugger, but that's C for you.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  9. #9
    Registered User
    Join Date
    Apr 2008
    Posts
    19
    Internal automatic loop unswitching would be even nicer though, and I thought C++ 2008 Express would've supported that. Thanks anyway, I may end up using that technique even if only as a last resort.

    A somewhat related question; Is it possible with VC++ to see the modified source code after the compiler has tried to unroll, and inline, and other stuff with it? (But obviously before it's been turned into object code). I want to see the intermediate stage so it's still legible.

  10. #10
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by twinbee View Post
    itCbitC, if you mean manually coding the two loops myself, then it's impractical because maintenance would be a nightmare (my real program isn't the example above, but hundreds of lines). Any change to the loop would mean a change to its 'mirror'. It gets worse if there are say, 3 conditions (which would produce 8 'mirrors' because 2^3=8).
    If that's true then the effect of performing the loop unswitching optimisation would be a negligible speed increase and a large increase of the code size. In fact that could even cause it to slow down due to code bloat.

    Visual Studio is probably doing the right thing. It's not stupid enough to duplicate hundreds of instructions for next to no benefit. In general, the loop unswitching optimisation would only be used in the kinds of places where you would almost do it yourself.

    What's more is that this is not a loop unswitching issue! Most likely when the scanf is commented out, the compiler is able to determine that tt does not change and is perhaps inlining the function and optimising out the 'if' branch! You can confirm this by setting tt to true, and commenting out the scanf. It should be barely quicker than with the scanf uncommented. Or better yet, you take a look at the assembly output.
    So it's probably not that it isn't doing an optimisation that you think it should, it's probably that its doing more optimistion than you expected!
    Your "little research" is probably a case of a little knowledge being more dangerous than no knowledge at all.

    Incidentally you should be using true and false with a bool, never 0 or 1, and for that matter you shouldn't be reading into a bool using %i, and even better, you should be using C++ I/O functionality anyway.
    Last edited by iMalc; 05-11-2010 at 01:10 AM.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  11. #11
    Registered User
    Join Date
    Apr 2008
    Posts
    19
    Thanks for trying to help.

    Most likely when the scanf is commented out, the compiler is able to determine that tt does not change and is perhaps inlining the function and optimising out the 'if' branch!
    You may well be right that the compiler is simply deleting the math line when I comment out the scanf line - I did consider that. But my point is that I still want to force the compiler to use loop unswitching for the case when the scanf line is being used. It would drastically speed my code up.

    Visual Studio is probably doing the right thing. It's not stupid enough to duplicate hundreds of instructions for next to no benefit.
    If I gain an extra 20% or even 5% speed at the cost of an extra few hundred lines in a multi-thousand line project, I would consider that a definite gain, at least for my purposes. Especially when these lines are invisible to me since only the compiler creates them on top during compilation.

    This case is more exaggerated of course. 4x speed is not to be sniffed at.

    Consider the following code. I have manually duplicated the loop myself, and called the appropriate one according to the initial boolean value. And even with the scanf function, I get the fast 1 second time, rather than the slow 4 second time. *This* is pretty much what I want the compiler to do, and it's not doing it.

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/timeb.h>
    #include <math.h>
    
    
    void test(bool tt) {
    	double d=0;	double u=0;
    	timeb ti; ftime( &ti );	double l = ti.time*1000.0 + ti.millitm;		// Start the clock.
    	
    	if(tt==false) {
    		for(int n=0; n<1000000000; n++) {
    			d += u;
    		}
    	} else {
    		for(int n=0; n<1000000000; n++) {
    			d += u;
    			u=sin((double)n);		
    		}
    	}
    	ftime( &ti );
    	double l2 = ti.time*1000.0 + ti.millitm;	// End the clock!
    	printf("Time taken: %f\n",l2-l);
    	printf("\n%f\n",d);
    }
    
    
    
    
    int main()
    {
    	bool tt=0;
    	scanf("%b",&tt);	// User enters 0 if this line is unquoted.
    	test(tt);
    	system("PAUSE");
    }
    Does anyone use the Intel compiler? I would love to know if that will use loop unswitching.
    Last edited by twinbee; 05-11-2010 at 10:10 AM.

  12. #12
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Just to be clear, your 4 seconds vs 1 second is for the code you've been posting, not the real code you were initially concerned with right?
    Just because manual loop unswitching improved the speed of this code does not mean that in your real code that it is advantageous. Sometimes keeping the code size smaller is more important. Optimisation always varies depending on the actual code.

    Thinking more about this, what's probably stopping loop-unswitching from happening here (if that's the case) is that it can't prove that tt is not modified by other code. You've taken the address of tt and passed this into some function, and it has no way of proving that that function doesn't hold onto the address of tt and modify tt asychonously whilst the loop is executing. Even marking the tt parameter as const wouldn't help there.
    In your real code you might never be taking the address of tt and so it might not be so pessimistic, and it might actually do loop-unswitching. But maybe something else you are doing there is stopping it.
    The point being that no matter what you're proven the compiler does with this artifical example, it doesn't do anything to help improve the speed of your actual code. It is very likely that the compiler will do loop-unswitching in several cases. You just need to figure out what makes your code not one of those cases.

    If you want to speed up your actual code, you'd have to post it.
    Last edited by iMalc; 05-11-2010 at 01:08 PM.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  13. #13
    Registered User
    Join Date
    Apr 2008
    Posts
    19
    Just to be clear, your 4 seconds vs 1 second is for the code you've been posting, not the real code you were initially concerned with right?
    Correct.
    Sometimes keeping the code size smaller is more important.
    Within reason, I agree. Optimization if done at all should come at the end of the project. However, we're talking about code we aren't even going to see, because the compiler keeps it invisible from the programmer. Thus there won't really be any code bloat, apart from in the final exe obviously (which I'm not too bothered about here).

    Thinking more about this, what's probably stopping loop-unswitching from happening here (if that's the case) is that it can't prove that tt is not modified by other code.
    This is where I'm confused. The 'tt' variable isn't altered at any point throughout the test() function, let alone the loop itself. Also, it isn't being passed by reference to the function, so the function isn't going to worry about it getting changed by another thread or something. In theory, it should be trivial to prove as far as I can see. Can you clarify?

    If you have any suggestions I can try out to see what's going on, I'd be delighted to test them. However, I think this is simply a case of the Microsoft compiler being stupid. Perhaps the Intel compiler doesn't suffer in this way (even GCC/Mingw supposedly supports loop unswitching).

    In your real code you might never be taking the address of tt and so it might not be so pessimistic, and it might actually do loop-unswitching. But maybe something else you are doing there is stopping it.
    I'm pretty sure that the thing that's happening here is happening in my large project too. Just for the record, it's affecting my large project by 5-10% at the moment, but that will grow to around 20%-40% as I add more conditions inside the core inner loop, unless I solve this.

    The point being that no matter what you're proven the compiler does with this artifical example,
    To try and break down a problem, it seems best to simplify it as much as possible, hence this short code. My project is massive and (currently) messy - I wouldn't want to subject you to that yet Chances are, if the compiler fails on an example as simple as this, it's more than likely to fail on a large project.
    Last edited by twinbee; 05-11-2010 at 03:28 PM.

  14. #14
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by twinbee View Post
    Correct.
    This is where I'm confused. The 'tt' variable isn't altered at any point throughout the test() function, let alone the loop itself. Also, it isn't being passed by reference to the function, so the function isn't going to worry about it getting changed by another thread or something. In theory, it should be trivial to prove as far as I can see. Can you clarify?
    Sorry, you're right there. I was mistakenly thinking that tt was passed by reference.

    I am however not the least bit surprised that in your actual project the difference is more like 5-10% rather than 400%.
    Although this may increase as more conditions are added, the benefit of loop-unswitching when there are multiple unchanging variables decreases dramatically. As you've already noted, having three of them results in eight possible loop variations.

    What you didn't catch onto earlier was that I wasn't talking about source code size when I said "code size", I was talking about machine code size. I.e. say your code stays the same and the compiler produces machine code that actually does the loop-unswitching. You need to realise that loop-unswitching is only beneficial up to a point. Code that repeatedly evaulates an unchanging condition can end up faster than code that has many copies of some large loop after loop-unswitching, simply because of what does or does not fit in the CPU instruction caches. Any optimisation that makes your exe larger has the potential to decrease performance. More often than not it probably wont, but it amost definitely can!

    Thus this feared increase in time taken of 20% could well be unavoidable even if you were to do the loop-unswitching yourself. However, remember that you probably aren't just adding more conditions later, but you're probably adding more code as well. The proportion of code which is simply evaluating "constant" conditions may not change very much.
    More than likely though this can be made up for (and then some) by doing other optimisations yourself. Those of us who have spent years both implementing algorithms with better Big-Oh notation running times and at the other end of things, tinkering on projects such as software 3D engines have plenty of optimisation experience and could probably speed up your program considerably if only we could see some of the actual code.

    Just when you think you've squeezed the last 3% extra performance out of something that you can, someone else comes along and improves it by a factor of 300%.
    As strange as this sounds, I've once done some optimisation that improved performance by 10%, then after doing another optimisation it increased by a further 30%, then upon taking out the first optimisation, it increased a furthur 20%. I went back and forth several times, and it was indeed consistent. Very little is clear-cut when it comes to optimisation.

    Finally, have you heard of the poilcy-based-design technique? It can be used to effectively perform loop-unswitching, whilst not actually duplicating the source code of your loops. It's much easier in C++ though.
    Edit: Oh wait brewbuck has already shown that.
    Last edited by iMalc; 05-12-2010 at 02:30 AM.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  15. #15
    Registered User
    Join Date
    Apr 2008
    Posts
    19
    I am however not the least bit surprised that in your actual project the difference is more like 5-10% rather than 400%.
    Although this may increase as more conditions are added, the benefit of loop-unswitching when there are multiple unchanging variables decreases dramatically. As you've already noted, having three of them results in eight possible loop variations.
    The section of code in question I could squeeze down to only around 1-2k. I'm more than happy to use 64 or even 32 duplicates. In any case, the example code I posted in this thread could easily be optimized by VC++, and it's not, so there's something screwy going on anyway.

    Code that repeatedly evaulates an unchanging condition can end up faster than code that has many copies of some large loop after loop-unswitching, simply because of what does or does not fit in the CPU instruction caches.
    Correct me if I'm wrong, but the 'chosen' loop out of the say, 32 'mirrors' would be the one that's being constantly accessed. Thus, I believe the CPU would intelligently keep these in the fastest level cache possible, by finding out that these instructions are being used the most. In other words, even if there 100 meg of program instructions, as long as only 1k of them being accessed during the program's run, then these will be prioritized massively in the CPU's instruction cache.

    tinkering on projects such as software 3D engines have plenty of optimisation experience and could probably speed up your program considerably
    I doubt it - I have spent a while Just kidding. In any case, I still have a some optimizations to make myself, so I'd do those first anyway. Other than that, it's all quite messy at the moment.

    Finally, have you heard of the poilcy-based-design technique? It can be used to effectively perform loop-unswitching, whilst not actually duplicating the source code of your loops. It's much easier in C++ though.
    Edit: Oh wait brewbuck has already shown that.
    I'm a little concerned about 'fluffing' the code up, but I will consider that. When he said "Ugly, and a pain in the ass in the debugger", did he mean the C++ template solution as well as the c macro solution, or just the c macro solution?
    Last edited by twinbee; 05-17-2010 at 02:06 PM.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. syntax question
    By cyph1e in forum C Programming
    Replies: 19
    Last Post: 03-31-2006, 12:59 AM
  2. return to start coding?
    By talnoy in forum C++ Programming
    Replies: 1
    Last Post: 01-26-2006, 03:48 AM
  3. Scope And Parameter Passing
    By djwicks in forum C Programming
    Replies: 6
    Last Post: 03-28-2005, 08:26 PM
  4. Loop Optimization & String Comparison.
    By Lithorien in forum C++ Programming
    Replies: 8
    Last Post: 08-09-2004, 06:00 PM
  5. for loop or while loop
    By slamit93 in forum C++ Programming
    Replies: 3
    Last Post: 05-07-2002, 04:13 AM

Tags for this Thread