Micro-optimizing Fibonacci n-th term generator: a journey to perfection

**Nominal Animal** · 01-01-2015

Originally Posted by the_jackass

Multiplication by single "digit" (as in a digit with the base you're using...pow(10,9) in his example) is faster and easier in many cases than when both numbers have multiple "digits" (you would need less intermediate arbritary precision numbers for one) if the numbers youre working with fit within sane range of O(n^2) multiplication.

Quite right. If N is the number of digit groups (digit group = "word") in a number,

Adding or subtracting a small (single-word) integer from an arbitrary precision integer takes O(N) operations.

The addition/subtraction is O(1) unless a carry pass is needed. If I remember correctly, you need less than 1 carry step on average (using uniform random additions/subtractions), and less than 2 carry steps on average even if adding/subtracting the largest small integer.
Multipliying an arbitrary precision integer by a small (single-word) integer takes O(N) operations.

The trick is that you need doubly-wide temporary result. Adding a small integer to the result (making the operation essentially a fused multiply-add by small integers) is essentially free, as it is easy to achieve by initializing the carry to that value (instead of zero).

The fused multiply-add by a small integer is especially useful when parsing decimal number strings. I recommend reading the decimal digits in groups (the largest power of ten that fits in a single arbitrary-sized integer word), from left to right if using a non-power-of-ten-sized words, and from right to left if using a power-of-ten-sized words, for maximum efficiency. You get a significant speed boost for large values, no practical slowdown for small values, for a very small increase in code complexity -- if you design that in; it is very hard to correctly add it afterwards.
Dividing an arbitrary precision integer by a small (single-word) integer takes O(N) operations.

The trick is that the dividend (numerator) can be as large as the size of the arbitrary word size plus the small integer size.

If you need to convert arbitrary precision integers to decimal number strings, and your words are not power-of-ten sized, you'll need this function. Using the largest power of ten that fits in the digit group ("word") integers, makes for a pretty efficient conversion. Starting at the least significant digit group, and rightmost position in the character buffer (knowing the size of the arbitrary integer means you know how large the buffer might have to be; remember to reserve for the trailing \0 and optionally for the sign), you build the decimal number string leftwards. Remove any leading zeroes, add the sign if needed, and move the resulting string to the start of the buffer.

If your words are power-of-ten sized, you can convert arbitrary precision integers to decimal number strings by directly converting the highest digit group (skipping leading zeroes), then converting the following digit groups but keeping the leading zeroes.

In many cases the calculations done on the number strings are actually much faster than the I/O needed to convert from and to human readable decimal number strings. Even with the overhead (6.25% for 9-decimal-digit groups in 32-bit storage units), an implementation that uses power-of-ten -based words can therefore be both simpler to implement, and faster to execute, because the I/O is so much more efficient.
Naive integer division between two arbitrary precision integer takes O(N²) operations.

The trick is that the dividend (numerator) can be twice as large as the arbitrary digit group ("word") size.

All C99 compilers support uint_leastN_t and uint_fastN_t types for N = 8, 16, 32, and 64 (but note that the exact-width types uintN_t are optional; see §7.18.1.2 and §7.18.1.3 in the C99 standard for details).

Integer multiplication and division with these types might use compiler-provided helper functions, but you can use the 32-bit types for the digit groups, and 64-bit types for the intermediate results, to write quite portable and reasonably efficient C99 arbitrary-precision integer code without much hassle. For signed integers, using a separate sign bit is much simpler than two's complement or one's complement format.

Unfortunately, none of the libraries I know of use these, and instead use various configuration-time/compilation-time magic to determine which basic types (and their sizes) to use. Okay, that might give theoretically better performance, but I for one would prefer to also have a clean, portable, C99 bigint library for learning purposes.

Apologies for the two excessively vebose posts. I just find Fibonacci numbers very interesting for learning purposes. First, writing a bigint library is a good learning exercise; it involves memory management and understanding the properties of the integer operations. Second, the radical increase in efficiency/speed by applying math and logic on the algorithm -- basically, optimizing the algorithm instead of code -- is pretty much a perfect example of proper optimization. Third, it is quite straightforward to write all that with portable C99 code, with exact known properties. Fourth, such a library (and the Fibonacci function itself) is just about the perfect size and complexity for showing programmers interested in numerical computation (say, scientists doing computational physics, chemistry, etc.), on how to interface such libraries to higher-level code like Python, so that end-users (non-programmer scientists) can avoid writing awful code in low-level language, and instead write the frontends in a high-level language, relying on libraries and functions provided by skilled programmers. Each computational science group should have a code specialist, in my biased opinion.

**the_jackass** · 01-03-2015

It was a very informative post, thank you.

Originally Posted by Nominal Animal

Naive integer division between two arbitrary precision integer takes O(N²) operations.

The trick is that the dividend (numerator) can be twice as large as the arbitrary digit group ("word") size.

I didnt quite understand what you were saying here. Currently the only method I know when denominator is multi-word is of Newton's iteration. It'll be nice to know what you had in mind, because I find Newton's method very complicated to implement.

**Mindaugas** · 01-03-2015

Originally Posted by iMalc

You don't seem to have any idea how bad the exponential time taken by the naive recursive approach is.
Doubly-recursively calculating the Fib of N+1 takes more than twice the effort of calculating the Fib of N.
It's so bad that we're talking over a billion times slower to calculate even Fib(50) recursive rather than iteratively. Should you want to go to say Fib(100), it takes so long that you would literally grow old and die before it finished.
But if done using a simple loop with two variables, one can easily calculate Fib(100) in less than a millisecond.

Knowing that, I will also tell you that an iterative loop can be translated to a simple form of single-recursion, sort of the opposite of doing the tail-recursion elimination.
Tail recursion elimination is a useful skill to learn, but going the other way - not so much.

Thanks, for introducing the "tail recursion elimination" - will learn that as well. I did the naive implementation only to see how slow it is compared to the memoized one

... also I will take a look at the iterative way of calculating it. Another thing to my to-do list ....

P.S. The memoized recursive implementation takes 0ms to calculate the 90th Fib - fascinating.

**Nominal Animal** · 01-03-2015

Originally Posted by the_jackass

Currently the only method I know when denominator is multi-word is of Newton's iteration.

To me, long division == naïve division. There are examples in e.g. Hacker's Delight (using 32-bit, 64-bit division). It is also described in Donald E. Knuth's The Art of Computer Programming as Algorithm D, I believe. Looking at the multi-digit divisor example (animation included) in the Wikipedia article might help in understanding it. Just remember to go at it a step at a time, and not merge steps prematurely.

It's not useful other than as a learning experience -- as opposed to the division by a single word, which is required for conversions to other radixes, i.e. for output.

**Mindaugas** · 01-04-2015

Originally Posted by Nominal Animal

Quite right. If N is the number of digit groups (digit group = "word") in a number,

Adding or subtracting a small (single-word) integer from an arbitrary precision integer takes O(N) operations.

The addition/subtraction is O(1) unless a carry pass is needed. If I remember correctly, you need less than 1 carry step on average (using uniform random additions/subtractions), and less than 2 carry steps on average even if adding/subtracting the largest small integer.
Multipliying an arbitrary precision integer by a small (single-word) integer takes O(N) operations.

The trick is that you need doubly-wide temporary result. Adding a small integer to the result (making the operation essentially a fused multiply-add by small integers) is essentially free, as it is easy to achieve by initializing the carry to that value (instead of zero).

The fused multiply-add by a small integer is especially useful when parsing decimal number strings. I recommend reading the decimal digits in groups (the largest power of ten that fits in a single arbitrary-sized integer word), from left to right if using a non-power-of-ten-sized words, and from right to left if using a power-of-ten-sized words, for maximum efficiency. You get a significant speed boost for large values, no practical slowdown for small values, for a very small increase in code complexity -- if you design that in; it is very hard to correctly add it afterwards.
Dividing an arbitrary precision integer by a small (single-word) integer takes O(N) operations.

The trick is that the dividend (numerator) can be as large as the size of the arbitrary word size plus the small integer size.

If you need to convert arbitrary precision integers to decimal number strings, and your words are not power-of-ten sized, you'll need this function. Using the largest power of ten that fits in the digit group ("word") integers, makes for a pretty efficient conversion. Starting at the least significant digit group, and rightmost position in the character buffer (knowing the size of the arbitrary integer means you know how large the buffer might have to be; remember to reserve for the trailing \0 and optionally for the sign), you build the decimal number string leftwards. Remove any leading zeroes, add the sign if needed, and move the resulting string to the start of the buffer.

If your words are power-of-ten sized, you can convert arbitrary precision integers to decimal number strings by directly converting the highest digit group (skipping leading zeroes), then converting the following digit groups but keeping the leading zeroes.

In many cases the calculations done on the number strings are actually much faster than the I/O needed to convert from and to human readable decimal number strings. Even with the overhead (6.25% for 9-decimal-digit groups in 32-bit storage units), an implementation that uses power-of-ten -based words can therefore be both simpler to implement, and faster to execute, because the I/O is so much more efficient.
Naive integer division between two arbitrary precision integer takes O(N²) operations.

The trick is that the dividend (numerator) can be twice as large as the arbitrary digit group ("word") size.

All C99 compilers support uint_leastN_t and uint_fastN_t types for N = 8, 16, 32, and 64 (but note that the exact-width types uintN_t are optional; see §7.18.1.2 and §7.18.1.3 in the C99 standard for details).

Integer multiplication and division with these types might use compiler-provided helper functions, but you can use the 32-bit types for the digit groups, and 64-bit types for the intermediate results, to write quite portable and reasonably efficient C99 arbitrary-precision integer code without much hassle. For signed integers, using a separate sign bit is much simpler than two's complement or one's complement format.

Unfortunately, none of the libraries I know of use these, and instead use various configuration-time/compilation-time magic to determine which basic types (and their sizes) to use. Okay, that might give theoretically better performance, but I for one would prefer to also have a clean, portable, C99 bigint library for learning purposes.

Apologies for the two excessively vebose posts. I just find Fibonacci numbers very interesting for learning purposes. First, writing a bigint library is a good learning exercise; it involves memory management and understanding the properties of the integer operations. Second, the radical increase in efficiency/speed by applying math and logic on the algorithm -- basically, optimizing the algorithm instead of code -- is pretty much a perfect example of proper optimization. Third, it is quite straightforward to write all that with portable C99 code, with exact known properties. Fourth, such a library (and the Fibonacci function itself) is just about the perfect size and complexity for showing programmers interested in numerical computation (say, scientists doing computational physics, chemistry, etc.), on how to interface such libraries to higher-level code like Python, so that end-users (non-programmer scientists) can avoid writing awful code in low-level language, and instead write the frontends in a high-level language, relying on libraries and functions provided by skilled programmers. Each computational science group should have a code specialist, in my biased opinion.

No need to apologize as your posts expressed in a clear maner what I had only a vague intuition about: solving Fibonacci sequence problems is a great tool to learn programming in C. I have your answers printed and will try to implement most of what you outlined when I revisit Fibonacci - the 2nd iterration.

Thanks very much!

Thread: Micro-optimizing Fibonacci n-th term generator: a journey to perfection

Thread Tools

Search Thread

Display

Similar Threads

Lagged Fibonacci generator

Looking for source code of Lagged Fibonacci generator

A journey to nowhere

Magneto's journey to becoming a C++ expert.

Fibonacci How should I make a generator

Tags for this Thread