Right, my example isn't saved by loop unrolling. Each iteration of the first loop depends on the result of the previous one, so the CPU has to wait for each product to finish before it can start the next. That makes for bad throughput and leaves the ALUs underutilized. But if the two loops are merged, the sum and the product don't depend on each other, so the CPU can, via pipelining, compute the sum for the second loop at the same time as it computes the product for the first.
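To make that concrete, here's a minimal sketch of the kind of code I mean. The names (`Result`, `two_loops`, `one_loop`) and the choice of a running product and a running sum over one array are my assumptions for illustration, not the exact code from earlier in the thread.

```c
#include <stddef.h>

/* Hypothetical stand-in for the example under discussion:
   a running product and a running sum over the same array. */
typedef struct {
    double product;
    double sum;
} Result;

/* Two separate loops: each loop is a single dependency chain,
   so every multiply (or add) waits on the one before it. */
Result two_loops(const double *a, size_t n) {
    Result r = {1.0, 0.0};
    for (size_t i = 0; i < n; i++) {
        r.product *= a[i];
    }
    for (size_t i = 0; i < n; i++) {
        r.sum += a[i];
    }
    return r;
}

/* Merged loop: the multiply chain and the add chain are independent
   of each other, so the CPU is free to overlap them. */
Result one_loop(const double *a, size_t n) {
    Result r = {1.0, 0.0};
    for (size_t i = 0; i < n; i++) {
        r.product *= a[i];
        r.sum += a[i];
    }
    return r;
}
```

Both versions return the same values. Whether the merged one is actually faster depends on the compiler and the hardware (an optimizing compiler may fuse the loops on its own), so measure before relying on it.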
It's a contrived example, to be sure. And I agree that you should only optimize once you've established a need for it. But one loop does perform better here.