It seems highly doubtful that a 10x speedup was profiled on a release build, which is what you should always be profiling.
It seems perfectly likely to me.
Even the one temporary, assuming "RVO" here, would be more expensive than creating none simply because the normal constructor was designed to be expensive.
The question would be: how often does a class have partial value state that can't be written so the expensive state is created only when necessary?
I'm still thinking pretty much never.
The last and final trick I'll mention, that is quite difficult to do and is increasingly unnecessary, is to use template-meta-programming to build up expression trees and evaluate them in an optimal manner at compile-time, creating zero unnecessary temporaries.
That would be a bad fit here according to post #3.
The question that started the discussion here relates to an expensive member variable that represents state unnecessarily for certain operations.
Using expression templates would fit the case where the expensive member is crucial to such operations.
That out of the way, you are completely wrong about "increasingly unnecessary", but then here it is absolutely unnecessary which leads me to believe that you don't know where expression templates shine. Even "rvalue references" do not reduce the utility of expression templates when combined with chaining evaluators.
[Edit]
Of course, this is only where they shine with respect to optimizing code without sacrificing any readability.
That's actually a rather small part of expression templates.
And yes, I do realize iMalc was not actually referencing such stuff as embeddable languages when he said "increasingly unnecessary". This edit is for those who don't know this.
[/Edit]
Consider the following code with respect to the intent of saving copies and allocations.
Code:
VEC f1,f2,f3,f4;
// ...
f1 = f2 * (f3 - f4); // f1[?] = f2[?] * (f3[?] - f4[?]);
Here expression templates are simply unnecessary. Yes, they are obviously unnecessary with C++11 because of "rvalue references", but they aren't even needed just to that end, saving copies and allocations, as both of the examples Elysia and I posted show. (And of course, there are still more possibilities.)
That goal only transforms the code to eliminate the overhead of the relevant temporaries.
Code:
VEC f1,f2,f3,f4;
// ...
f1 = f3;
f1 -= f4;
f1 *= f2;
// ...
Now, that alone is a fine goal, but it really is, as you said, "increasingly unnecessary" to implement such facilities by hand only for that purpose.
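To make the "rvalue references" route concrete, here is a minimal sketch. It assumes a hypothetical `Vec` standing in for `VEC`; the struct, its members, and `demo_moves` are invented for illustration and are not anything posted in this thread. The binary operators reuse a buffer they already own, so the whole expression costs one copy of `f3` and still makes three passes over the data.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the thread's VEC; only the interface is assumed.
struct Vec {
    std::vector<double> data;

    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}

    std::size_t size() const { return data.size(); }
    double& operator[](std::size_t i) { return data[i]; }
    double operator[](std::size_t i) const { return data[i]; }

    // Each compound operator is one full pass over the array.
    Vec& operator-=(const Vec& rhs) {
        assert(size() == rhs.size());
        for (std::size_t i = 0; i < size(); ++i) data[i] -= rhs[i];
        return *this;
    }
    Vec& operator*=(const Vec& rhs) {
        assert(size() == rhs.size());
        for (std::size_t i = 0; i < size(); ++i) data[i] *= rhs[i];
        return *this;
    }
};

// lhs is taken by value: copied from an lvalue, moved from a temporary.
// Returning the by-value parameter moves it out, so no second copy is made.
inline Vec operator-(Vec lhs, const Vec& rhs) {
    lhs -= rhs;
    return lhs;
}

// When the right operand is a temporary, reuse its buffer.
// (Element-wise multiplication commutes, so the operand order is harmless.)
inline Vec operator*(const Vec& lhs, Vec&& rhs) {
    rhs *= lhs;
    return std::move(rhs);
}

// Illustrative driver: every element ends up as 2 * (5 - 3).
inline Vec demo_moves() {
    Vec f1(4), f2(4, 2.0), f3(4, 5.0), f4(4, 3.0);
    f1 = f2 * (f3 - f4);  // one copy (of f3), then moves; three passes
    return f1;
}
```

Only the two overloads this one expression needs are shown; a real class would want the full set.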
That isn't where expression templates stop.
The above code exhibits multiple transforms that are as large as the relevant object. (Here being the length of the implied array.)
You basically get something like the following where each element of the target is mutated multiple times.
Code:
VEC f1,f2,f3,f4;
// ...
for(int cB(0), cE(f1.size()); cB < cE; ++cB)
{
    f1[cB] = f3[cB];
}
// ...
for(int cB(0), cE(f1.size()); cB < cE; ++cB)
{
    f1[cB] -= f4[cB];
}
// ...
for(int cB(0), cE(f1.size()); cB < cE; ++cB)
{
    f1[cB] *= f2[cB];
}
// ...
The code below is likely to be quite a bit faster for multiple reasons. (If the array is large enough you'll get fewer cache misses if nothing else came into play.)
Code:
VEC f1,f2,f3,f4;
// ...
for(int cB(0), cE(f1.size()); cB < cE; ++cB)
{
    f1[cB] = f2[cB] * (f3[cB] - f4[cB]);
}
// ...
However, while many modern compilers do implement at least "RVO", they don't really make that transformation. (If the loops had been written directly with fixed values, the transformation is somewhat likely at higher optimization levels with some compilers. However, the loops are hidden behind methods where the compiler is already trying to work out "RVO" and "NRVO" as well as whatever other optimizations may be appropriate within those methods. Of course, making them `inline` could very likely help, but even then the transformation is far less likely, because far more must be proven about the code before the compiler can reason that the transformation is always valid.)
This transform is exactly what expression templates can buy you without even considering what optimizations the compiler supports.
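For anyone who hasn't seen the technique, here is a minimal sketch of how that single loop can fall out of `f1 = f2 * (f3 - f4)`. All names (`Expr`, `Sub`, `Mul`, `Vec`, `demo_fused`) are hypothetical and chosen for illustration; this is not the `VEC` from the thread and not how any particular library does it. The operators build a small tree of references instead of computing anything, and the assignment walks that tree once per element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CRTP base: tags a type as usable inside an expression.
template <class E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Node for "l - r"; nothing is computed until an element is requested.
template <class L, class R>
struct Sub : Expr<Sub<L, R>> {
    const L& l;
    const R& r;
    Sub(const L& l_, const R& r_) : l(l_), r(r_) { assert(l.size() == r.size()); }
    double operator[](std::size_t i) const { return l[i] - r[i]; }
    std::size_t size() const { return l.size(); }
};

// Node for element-wise "l * r".
template <class L, class R>
struct Mul : Expr<Mul<L, R>> {
    const L& l;
    const R& r;
    Mul(const L& l_, const R& r_) : l(l_), r(r_) { assert(l.size() == r.size()); }
    double operator[](std::size_t i) const { return l[i] * r[i]; }
    std::size_t size() const { return l.size(); }
};

class Vec : public Expr<Vec> {
    std::vector<double> data_;
public:
    explicit Vec(std::size_t n, double v = 0.0) : data_(n, v) {}
    std::size_t size() const { return data_.size(); }
    double operator[](std::size_t i) const { return data_[i]; }
    double& operator[](std::size_t i) { return data_[i]; }

    // The only loop: evaluates the whole tree element by element.
    template <class E>
    Vec& operator=(const Expr<E>& e) {
        const E& expr = e.self();
        assert(size() == expr.size());
        for (std::size_t i = 0; i < size(); ++i) data_[i] = expr[i];
        return *this;
    }
};

template <class L, class R>
Sub<L, R> operator-(const Expr<L>& l, const Expr<R>& r) {
    return Sub<L, R>(l.self(), r.self());
}

template <class L, class R>
Mul<L, R> operator*(const Expr<L>& l, const Expr<R>& r) {
    return Mul<L, R>(l.self(), r.self());
}

// Illustrative driver: every element ends up as 2 * (5 - 3).
inline Vec demo_fused() {
    Vec f1(4), f2(4, 2.0), f3(4, 5.0), f4(4, 3.0);
    f1 = f2 * (f3 - f4);  // one pass, no temporary Vec
    return f1;
}
```

The nodes hold references, so the tree is only valid within the full expression that built it; storing it in an `auto` variable for later would leave dangling references in this sketch.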
Again, here it isn't necessary, but then, we still aren't at the end of expression templates.
The single loop code from above might very well look more like what follows where more meta-programming may live.
Code:
VEC f1,f2,f3,f4;
// ...
for(int cB(0), cE(f1.size()); cB < cE; ++cB)
{
    eval<???expression>(f1[cB],eval<???expression>(f2[cB], eval<???expression>(f3[cB], f4[cB])));
}
// ...
Here some real magic (*) may happen. Why?
This could result, as part of expansion, in a completely different expression tree with far more complexity... which can also be evaluated as part of evaluating the original expression. ^_^
Soma
(*) #3: Any sufficiently advanced technology is indistinguishable from magic.