# Thread: Micro-optimizing Fibonacci n-th term generator: a journey to perfection

1. ## Micro-optimizing Fibonacci n-th term generator: a journey to perfection

Hi,

Going through MIT's Introduction to Algorithms, I'm developing a Fibonacci n-th term generator and want to make it as fast and as compact memory-wise as possible. The story:

1. Developed a naive, dumb version that does recursive calls:
Code:
```c
return (NaiveFibonacciNthTermGenerator (the_xth_fibonacci_num - 1) +
        NaiveFibonacciNthTermGenerator (the_xth_fibonacci_num - 2));
```
2. Then I remembered how recursive calls work, and how a binary-tree-like structure of recursive calls unfolds,

so I switched the terms in the recursive call (my intuition was that it would help a memoized solution, as the base case is reached sooner in the switched scenario, but it turned out to help even the naive one). I have tested many times with different combinations, and I gain up to 1 second for n > 45. The number of times the function is executed is the same in both cases (checked by adding a counter):
Code:
```c
return (NaiveFibonacciNthTermGenerator_switchedTerms (the_xth_fibonacci_num - 2) +
        NaiveFibonacciNthTermGenerator_switchedTerms (the_xth_fibonacci_num - 1));
```
3. Then I started working on the memoization, but stumbled over the array that will contain the memoized solutions to the subproblems. Trying to optimize it for space, I declared it like this:

Code:
```c
unsigned long long int cache[the_xth_fibonacci_num - 3];
```
The "-3" is supposed to save me some bytes (it's an array of long longs, after all, so 24 bytes (3*8)); this is allowed by the fact that the base cases don't need to be in cache[]. But now I'm not sure how to handle it, as this is a variable-length array (VLA).

Questions:
1. Why does the naive version with switched terms run faster than the one with non-switched terms? I have swapped the places of the function calls (thinking that some caching might be involved, though I'm not sure how), and it is still consistently faster when n > 15. The same number of function calls is made (I checked it), so why is it faster?
2. What is the correct solution for the cache[] problem? I can't declare it as a global constant-sized array, because its size is only assigned in main(), where the user enters "the_xth_fibonacci_num" as the term he wants. Should I malloc it? Is that the best solution for minimizing time and space?
3. Any other performance tips for saving time and space? Any tips for saving time by sacrificing space, and vice versa? I would appreciate them. I have placed the base case for n=2 above the one for n=1, since it will be reached more often and so should be checked first (so you see what level of obsession we are talking about here).

Anything like "check the assembly generated for x", "change the terms", "reorder", "use different data types" - anything would help me a lot.

2. If you only want the nth term, not the terms before it, then consider using the closed-form formula to approximate the result: with appropriate rounding you should get the correct integer result in constant space and constant time (okay, not really, because you have to compute powers).

3. Optimization should be done only as much as necessary and only when it is determined to be needed in the first place. You should also determine where the best places to optimize are (longest running code sections). In this case your algorithm in general needs optimization (as laserlight suggests) far more than the expression of that algorithm in code.

4. If you really want n Fibonacci numbers, and not just the n-th, then you can easily implement an iterative algorithm: just make a loop that takes the last two numbers of your array, adds them together, and saves the result in the array.
The first two elements I would initialize by hand, since this is easier to implement.

This has linear time and linear space.

5. Originally Posted by fendor
If you really want n Fibonacci numbers, and not just the n-th, then you can easily implement an iterative algorithm: just make a loop that takes the last two numbers of your array, adds them together, and saves the result in the array.
The first two elements I would initialize by hand, since this is easier to implement.

This has linear time and linear space.
But can we make the time logarithmic? Can we?!

6. There is a fairly simple mathematical expression - known since about the 17th century - that can be used. It is easily expressed using three standard functions since C99, albeit making use of floating point in intermediate steps.

No recursion, no need for additional storage (except for some intermediate results).

The root is √5.

7. Originally Posted by MutantJohn
But can we make the time logarithmic? Can we?!
No, not as far as I know.
If you want an array with every Fibonacci number up to the n-th, you have to iterate through the whole list, which is what causes the linear time. If you want only the n-th number, then I'd second grumpy and nonpuz: there is a mathematical expression which is pretty accurate.

If you want to prove the mentioned formula, you could have some fun with difference equations - it takes only about... 2 pages, if you have large handwriting ^^

8. Ok, thanks - but I'm doing this not to solve the problem of "getting the N-th term in a Fibonacci sequence" - this is just a problem that I tackle to learn optimisation and the thinking that is involved in it. The explicit problem "compute the n-th Fibonacci number" is secondary to the implicit problem of learning optimisation techniques...

Therefore I will stick with micro-optimisations for the time being (hence the name of the thread); this means that the algorithm will be recursive.
As a follow-up problem I might code up the approximation approach (i) or the array approach (ii) - but that is for later.

Any answers on the questions I enumerated? Thanks

9. You ever heard of premature optimisation?

10. Originally Posted by grumpy
You ever heard of premature optimisation?
Yeah, I used to have that problem with my girlfriend when we first started dating. Oh wait... That was a different thing...

11. Originally Posted by Mindaugas
Ok, thanks - but I'm doing this not to solve the problem of "getting the N-th term in a Fibonacci sequence" - this is just a problem that I tackle to learn optimisation and the thinking that is involved in it. The explicit problem "compute the n-th Fibonacci number" is secondary to the implicit problem of learning optimisation techniques...

Therefore I will stick with micro-optimisations for the time being (hence the name of the thread); this means that the algorithm will be recursive.
As a follow-up problem I might code up the approximation approach (i) or the array approach (ii) - but that is for later.

Any answers on the questions I enumerated? Thanks
You miss the point entirely. What you're doing has absolutely no value now or in the future. You are not learning any valuable skill from doing this, you are not learning anything other than how to waste time. That is why we keep telling you about premature optimization. Also this "micro-optimization" is made up and pointless. If you really want to get into low-level optimization then you should be examining the generated assembly from this program and working from there.

12. Originally Posted by Mindaugas
Therefore I will stick with micro-optimisations for the time being (hence the name of the thread); this means that the algorithm will be recursive.
You don't seem to have any idea how bad the exponential time taken by the naive recursive approach is.
Doubly-recursively calculating the Fib of N+1 takes roughly 1.6 times the effort of calculating the Fib of N, so the work grows exponentially.
It's so bad that we're talking around a billion times slower to calculate even Fib(50) recursively rather than iteratively. Should you want to go to, say, Fib(100), it would take so long that you would literally grow old and die before it finished.
But if done using a simple loop with two variables, one can easily calculate Fib(100) in less than a millisecond.

Knowing that, I will also tell you that an iterative loop can be translated to a simple form of single-recursion, sort of the opposite of doing the tail-recursion elimination.
Tail recursion elimination is a useful skill to learn, but going the other way - not so much.

13. Laserlight, Grumpy, and fendor refer to Binet's formula, which was actually found by Abraham de Moivre:
F(n) = ϕ^n/√5 - (-ϕ)^(-n)/√5
where ϕ is the golden ratio,
ϕ = (1 + √5) / 2
Since |(-ϕ)^(-n)/√5| is always less than one half for any nonnegative n, you can use
F(n) = ⌊ ϕ^n/√5 + 0.5 ⌋
where ⌊ ⌋ denotes the floor function (which, for these positive arguments, is the same as rounding towards zero).

Using the standard types, both the recursive and the closed form (Binet's formula) are limited to a rather small part of the sequence, as
F(46) < 2^31, F(47) < 2^32, but F(48) > 2^32
F(78) < 2^53, but F(79) > 2^53
F(92) < 2^63, F(93) < 2^64, but F(94) > 2^64
In other words, unsigned 32-bit integers can represent up to and including F(47), and signed 32-bit integers up to and including F(46). Unsigned 64-bit integers can represent up to and including F(93), and signed 64-bit integers up to and including F(92). You should use the C99 uint32_t, int32_t, uint64_t, or int64_t types provided by inttypes.h or stdint.h for these, as the sizes of int, long, and long long are up to the compiler to decide.

Most architectures nowadays implement the C99 double type using the IEEE-754 Binary64 format, which has a 53-bit mantissa (bits of precision), so you cannot expect Binet's formula (or the rounded version derived from it) to be accurate beyond F(78) or so. The x86 and x86-64 architectures' 387 FPU also provides an 80-bit floating-point format with 64 bits of precision, which should allow accurate evaluation up to F(93) or so. Some architectures may provide an even more precise long double. In C99, you can use pow(FLT_RADIX, DBL_MANT_DIG) or powl(FLT_RADIX, LDBL_MANT_DIG) to find out how large integers the double or long double type, respectively, can represent exactly. (On all machines I know of, except for the IBM 360, FLT_RADIX is 2, and therefore DBL_MANT_DIG and LDBL_MANT_DIG directly give the number of bits of precision in the mantissa.)

For larger n, you need arbitrary-precision integers. For C, there are lots of libraries that provide all the functionality needed -- I recommend GSL -- but implementing one yourself for nonnegative integers (supporting only the addition, subtraction, and multiplication operations, and conversions to/from decimal strings) in C99 is straightforward -- particularly using uint32_t as the base "word" type and uint64_t as a high-precision intermediate type, to make carrying trivial and the code portable.

If you write your own arbitrary-precision integer stuff, but wish to avoid having to write a divide-by-ten function for conversion to a string, you can pack nine decimal digits (000000000 to 999999999) in a 30-bit integer: your "words" are then groups of nine decimal digits. Although this typically wastes 2 out of every 32 bits, or 6.25% of memory used (and is slower time-wise too, because more "words" are needed to represent a number), it makes the string conversion code -- as well as any other operation that deals with specific decimal digits -- very fast and simple to implement. For addition, subtraction, and multiplication operations, only the numerical limits/constants are affected.

In practice, the relation
F(n) = F(n-1) + F(n-2)
is rarely used. Instead, the recurrence relations
F(m+n-1) = F(m) F(n) + F(m-1) F(n-1)
F(m+n) = F(m) F(n+1) + F(m-1) F(n) = F(m) (F(n) + F(n-1)) + F(m-1) F(n)
and
F(2n-1) = F(n)^2 + F(n-1)^2
F(2n) = F(n) (F(n) + 2 F(n-1))
are used to calculate the desired Fibonacci number; the simple relation is used only for the initial numbers of continuous sequences. Using the approach outlined below, this takes O(log n) arithmetic operations -- but note that each arithmetic operation on arbitrary-precision integers is now at least O(N) operations, where N is the number of digits (or "words") in the number.

The key is to realize you can write n as a (binary) sum,
n = b0*2^0 + b1*2^1 + ... + bk*2^k
which means you can write
F(n) = F(b0*2^0 + b1*2^1 + ... + bk*2^k)
In other words, you use the binary representation of n, and the powers-of-two and powers-of-two-less-one Fibonacci numbers (F(0), F(1), F(2), F(3), F(4), F(7), F(8), F(15), F(16), F(31), F(32), and so on), with the F(m+n) and F(m+n-1) formulas to construct the desired Fibonacci number (and the number preceding it for the intermediate terms).

Here is the pseudocode for a generic Fibonacci number function:
Code:
```
function Fibonacci(n):

    if (n == 0) then
        return 0 /* fibonacci[0] */
    end if

    if (n < 0) then
        n = -n
        negate = (n is even)
    else
        negate = (false)
    end if

    if (n == 1) then
        return 1 /* fibonacci[1] == fibonacci[-1] */
    end if

    factor_prev = 0 /* fibonacci[0] */
    factor_curr = 1 /* fibonacci[1] */

    if (n is odd) then
        result_prev = 0 /* fibonacci[0] */
        result_curr = 1 /* fibonacci[1] */
    else
        result_prev = 1 /* fibonacci[-1] */
        result_curr = 0 /* fibonacci[0] */
    end if

    n = n / 2   /* This must be integer division. */

    while (n > 0)

        /* F(2*n-1) = F(n)*F(n) + F(n-1)*F(n-1) */
        helper_prev = factor_curr * factor_curr + factor_prev * factor_prev

        /* F(2*n) = F(n) * ( F(n) + F(n-1) + F(n-1) ) */
        helper_curr = factor_curr * (factor_curr + factor_prev + factor_prev)

        /* Double the factor */
        factor_prev = helper_prev
        factor_curr = helper_curr

        /* If n has LSB set, combine factor and result: F(result) = F(factor+result) */
        if (n is odd) then
            /* F(m+n-1) = F(m)*F(n) + F(m-1)*F(n-1) */
            helper_prev = factor_curr * result_curr + factor_prev * result_prev

            /* F(m+n) = F(m)*F(n+1) + F(m-1)*F(n)
                      = F(m)*(F(n-1) + F(n)) + F(m-1)*F(n) */
            helper_curr = factor_curr * (result_curr + result_prev) + factor_prev * result_curr

            /* Combine */
            result_prev = helper_prev
            result_curr = helper_curr
        end if

        n = n / 2   /* Integer division here too. */

    end while

    if (negate) then
        result_curr = -result_curr
    end if

    return result_curr
end function
```
You can precalculate F(2^k) and F(2^k - 1) for some range of k, but the memory use is an issue: F(2^k) has about 0.694242*2^k - 1.160964 binary digits, and requires that many bits of memory. Precalculating larger k means you reserve a lot of memory for data that ends up being used only once per Fibonacci number calculated.

The slowest operation for large n is multiplication. This means that if each multiplication takes O(m(n)) time, an arbitrary-precision integer implementation calculates F(n) in O(m(n) log n) time. Simple naive multiplication -- normal long multiplication, as done by hand -- is O(N^2), so a very straightforward implementation in C99 should be able to calculate an arbitrary Fibonacci number in O(n^2 log n) time.

In practice, most current libraries and applications use naive multiplication for small arbitrary-precision integers, Toom-Cook multiplication for intermediate-size ones, and the Schönhage-Strassen algorithm for the largest ones: a discrete Fourier transform is taken of each multiplicand's digits ("words"), the transforms are multiplied together digit-wise ("word"-wise) and converted back using the inverse transform, followed by a carrying pass over the digits ("words"). This has only O(N log N log log N) time complexity (where N is the number of digits in each multiplicand, the multiplicands assumed to be of the same size), and is noticeably faster than the O(N^2) of naive multiplication for larger N. The cutoff points depend on the implementation, and hardware cache architecture tends to add another wrinkle to serious optimization efforts.

Calculating and printing a sequence of N Fibonacci numbers starting with
previous = F(k-1)
current = F(k)
is trivial; in pseudocode,
Code:
```
repeat N times:
    print current
    temporary = previous
    previous = current
    current = current + temporary
```
In my opinion, the above pseudocode should be even more obvious than the recursive function.

My point: the journey to "perfection" mentioned in the thread title is pretty misleading. Each algorithm and each implementation is a balance between multiple factors, many of which are unknown beforehand. I'm hoping the OP and any others reading this thread realize that full understanding of the strengths and limitations of the implementation (and of the algorithm implemented) is much more important, and much more valuable than any warm-and-fuzzy feelings about perfection.

(In a perfect world, those insights about strengths and limitations would be documented in comments in the implementation, but I am myself still struggling to learn to do that...)

14. Originally Posted by MutantJohn
Yeah, I used to have that problem with my girlfriend when we first started dating. Oh wait... That was a different thing...
That depends on the girlfriend, and how early she starts trying to improve you.

15. Eh, I'd like to add one other thing to what Nominal Animal said. Multiplication by a single "digit" (as in a digit in the base you're using... pow(10,9) in his example) is faster and easier in many cases than when both numbers have multiple "digits" (for one, you need fewer intermediate arbitrary-precision numbers), provided the numbers you're working with fit within the sane range of O(N^2) multiplication. (That's up to about 10 to 20 thousand base-10 digits.)