With the new code I get the same results as you: the second example needs to be linked against the math library. So it does not look like a version issue.
Yes, the call pow in the assembly is proof of that: the compiler is emitting a real call to pow(), which the linker then has to resolve.
While pow() is part of the standard C specification, on Linux it does not live in the default C library (libc) that gets linked automatically. It lives in a separate math library (libm), which you request with -lm. I believe this is a holdover from days long past when an FPU (floating point unit) was not integrated into the CPU. Depending on which FPU your system had (if any; some lower-end systems omitted it and did FP calculations on the regular CPU, which was slower but cheaper), different instructions had to be generated for pow() and other floating point operations. That is why the FP functions ended up in a separate "math" library.
Now, as to GCC's optimization, it boils down to constant folding. You are compiling with the default optimization level (-O0, i.e. no optimization). That level does not constant-fold the more complex expression in pow2.c. Some quick scripting produces assembly code for pow1 and pow2 at each optimization level:
Code:
$ for i in {1..2};
  do
      for opt in {0..3};
      do
          gcc -Wall -S -O${opt} -o pow${i}.${opt}.s pow${i}.c;
      done;
  done
That produces files like pow1.0.s for level 0 optimization, pow1.1.s for level 1 optimization, etc. Now, using diff to compare pow1 and pow2 assembly code at equal optimization levels:
Code:
$ for opt in {0..3};
  do
      echo "Optimization level ${opt}";
      diff pow{1,2}.${opt}.s;
      echo;
  done
Optimization level 0
1c1
< .file "pow1.c"
---
> .file "pow2.c"
3c3
< .LC1:
---
> .LC2:
16c16
< subq $16, %rsp
---
> subq $32, %rsp
21a22,31
> movsd %xmm0, -24(%rbp)
> movss -8(%rbp), %xmm0
> cvtps2pd %xmm0, %xmm0
> movsd .LC1(%rip), %xmm1
> call pow
> movsd %xmm0, -32(%rbp)
> movq -32(%rbp), %rax
> movq %rax, -32(%rbp)
> movsd -32(%rbp), %xmm0
> addsd -24(%rbp), %xmm0
27c37
< movl $.LC1, %eax
---
> movl $.LC2, %eax
37a48,52
> .section .rodata
> .align 8
> .LC1:
> .long 0
> .long 1074266112
Optimization level 1
1c1
< .file "pow1.c"
---
> .file "pow2.c"
28,29c28,29
< .long 3758096384
< .long 1074897878
---
> .long 536870912
> .long 1076582285
Optimization level 2
1c1
< .file "pow1.c"
---
> .file "pow2.c"
29,30c29,30
< .long 3758096384
< .long 1074897878
---
> .long 536870912
> .long 1076582285
Optimization level 3
1c1
< .file "pow1.c"
---
> .file "pow2.c"
29,30c29,30
< .long 3758096384
< .long 1074897878
---
> .long 536870912
> .long 1076582285
The lines added at 21a22,31 are the key difference between pow1 and pow2 at optimization level 0: they are all the extra code for actually computing the expression at run time instead of pre-computing it at compile time. Note that they include an actual call to pow(). At the other optimization levels, the difference between pow1 and pow2 boils down to different pre-computed values.
Note: when I changed the exponent in the second call from 3 to 2, the call to pow() went away, though pow2 still required more code:
Code:
Optimization level 0
1c1
< .file "pow1.c"
---
> .file "pow2.c"
21a22,26
> movapd %xmm0, %xmm1
> movss -8(%rbp), %xmm0
> cvtps2pd %xmm0, %xmm0
> mulsd %xmm0, %xmm0
> addsd %xmm1, %xmm0
It looks like GCC converts calls to pow() with an exponent of 2 into a simple floating point multiplication of the number by itself. You can imagine that for higher powers, a long series of multiplies might not be so efficient, so the pow() implementation probably uses some special tricks.
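As an illustration of both points, here is a small sketch. The first function is the multiply that the mulsd instruction above corresponds to. The second is exponentiation by squaring, one well-known trick for integer exponents. Both names are made up, and I am not claiming this is what the library's pow() does internally; pow() has to handle arbitrary real exponents.

```c
/* What the exponent-2 case reduces to: a single multiply. */
double square(double x)
{
    return x * x;
}

/* Hypothetical helper: exponentiation by squaring for a non-negative
 * integer exponent. It needs O(log n) multiplies instead of the n - 1
 * a naive loop would use. Not a description of glibc's pow(). */
double ipow(double base, unsigned int n)
{
    double result = 1.0;
    while (n > 0) {
        if (n & 1u)        /* lowest bit set: fold the current power in */
            result *= base;
        base *= base;      /* base, base^2, base^4, ... */
        n >>= 1;
    }
    return result;
}
```

For example, ipow(2.0, 10) needs only a handful of multiplies to reach 1024.0, and ipow(x, 2) gives the same value as square(x).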