Assuming you do not enable IPA (interprocedural analysis) or IPO (interprocedural optimization) at link time: if you compile the target function in a separate compilation unit, the compiler cannot optimize away the function call, even if the function is declared void and takes no parameters.
I do a lot of math-related work, and end up benchmarking all sorts of functions under various optimization flags. In complex cases, inlining and data reuse tend to affect timings quite a bit, but full programs are difficult to test and benchmark exhaustively. Running these microbenchmarks under different conditions lets me make pretty good estimates of how each implementation behaves in different situations, without doing the actual, full tests.
I usually have one dummy function that takes the exact same parameters but does nothing in its body, e.g.
Code:
void compute_dummy(double *const out, const double *const in, const size_t size);
void compute_candidate1(double *const out, const double *const in, const size_t size);
void compute_candidate2(double *const out, const double *const in, const size_t size);
so that I can estimate the function call overhead. (On x86-64, it is typically 66-67 clock cycles as reported by rdtsc, with optimizations enabled; it varies a bit with the number and complexity of the function arguments.)
The test functions and the main benchmark are compiled separately, into separate object files, and only linked together -- being careful not to enable IPA or IPO -- to produce the actual executable.
When benchmarking, things like cache hotness (whether the data is already in cache or not) affect the timing results a lot, but measuring the runtime of e.g. 1000 or 10000 calls and looking at the timing distribution (via rdtsc on x86 and x86-64, or clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tspec) in the general case) is quite informative. I am personally very careful not to look at just the minimum timings; the median or mode is much more reliable. You can always expect a few surprisingly long measurements, due to kernel/processor scheduling.