# Thread: Best approach for extracting bits from a certain kind of pseudorandom generator?

1. ## Best approach for extracting bits from a certain kind of pseudorandom generator?

I have a pseudorandom generator that works in some arbitrary state space S. Obviously, the complexity (apparent entropy) of the output depends on the "size" of S. Unfortunately, for this particular generator, the larger the state space, the larger the range of the integers generated. So I need to extract 64-bit numbers from a, say, 4096-bit result R. Now, I could just cycle the generator, take R modulo 2^64 as my random number, rinse, repeat. But with that approach I'd potentially be discarding a ton of bits on each run (the actual size of each number is unpredictable, however). I'd really like to use at least some of those unused bits, but then again I don't want to risk biasing the output in the process either (suppose I were simply to take the modulus and divide out 2^64 repeatedly until R was exhausted, for instance). Is there an easy way to do this safely...or should I just fuggeddaboudit (i.e., not worth the trouble)?
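For concreteness, the word-splitting being described might look like the following Python sketch (Python's arbitrary-precision integers stand in for the 4096-bit R; the generator itself is out of scope here, so `r` is just a parameter):

```python
def extract_words(r, total_bits=4096, word_bits=64):
    """Split one large generator result r into fixed-width words.

    This is the "modulus/divide until exhausted" idea: repeatedly take
    r mod 2**64, then shift the consumed bits away. Note the word count
    is fixed by the state size, not by the numeric magnitude of r, so a
    numerically small r yields mostly zero words.
    """
    mask = (1 << word_bits) - 1
    words = []
    for _ in range(total_bits // word_bits):
        words.append(r & mask)
        r >>= word_bits
    return words

# If R happened to be 2, the first word is 2 and the other 63 are zero --
# exactly the "long run of zeroes" discussed later in the thread.
ws = extract_words(2)
```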

2. Is there a reason why you can't set S to the 64-bit range when generating these numbers and then set it to something else when you need a bigger space?

3. When you're selecting a subset of a larger set based on some criterion (such as using modulo), you're reducing its apparent entropy anyway. Setting the state space, and hence the range, to suit (as Mario suggested) is the simplest approach, assuming the implementation of the algorithm supports that.

4. Originally Posted by Mario F.
Is there a reason why you can't set S to the 64 bit range when generating these numbers and then set it to something else when you need a bigger space?
Well yeah, because I need the quality of the corresponding output to be that of, say, "Mersenne Twister", not "RANDU"! The only way to achieve that is with a larger state space, but simply grabbing the lower 64 bits and tossing out what could be 4032 useful bits of pseudorandomness in the process just seems so wasteful. Of course, I could simply extract every contiguous "word" out of R (in this case, 64 64-bit integers), but I have a feeling that would somehow lead to a bias in the output. And if I simply extracted as many 64-bit numbers as possible using the modulus/division exhaustion technique I hinted at earlier, the output would most definitely be biased.

5. Of course, I could simply extract every contiguous "word" out of R (in this case, 64 64-bit integers), but I have a feeling that would somehow lead to a bias in the output.
O_o

If the entropy is sufficiently distributed the bias will be reasonable.

If you are really worried, consider "whitening" the output.
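One common whitening recipe (not necessarily what was meant here; this is an illustrative choice) is to pass each raw word through an invertible bit mixer. The sketch below uses the well-known splitmix64 finalizer constants, with the word's position folded in so that a run of identical raw words doesn't whiten to identical outputs:

```python
MASK64 = (1 << 64) - 1

def mix64(x):
    """splitmix64-style finalizer: an invertible mixer on 64-bit ints."""
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return (x ^ (x >> 31)) & MASK64

def whiten(word, position):
    """Whiten one extracted word. Folding in the word's position breaks
    up runs of equal raw words (e.g. a long run of zero words), since
    mix64 is a bijection and the XORed inputs all differ."""
    return mix64(word ^ position)
```

Note that the raw mixer alone maps zero to zero, which is exactly why folding in a counter (or some other per-word value) matters for the zero-run concern.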

Soma

6. Originally Posted by phantomotap
If the entropy is sufficiently distributed the bias will be reasonable.
Right, but the potential long runs of zeroes just seem so unnatural. I mean, when we run a random generator we don't even expect a single zero-valued word, much less a whole slew of them in a row (imagine the value of R was simply 2: it would be followed by 63 zeros!). I suppose I could simply skip all zero words, but then would that lead to some sort of bias in the generator?

Originally Posted by phantomotap
If you are really worried, consider "whitening" the output.
How's that?

7. Right, but the potential long runs of zeroes just seem so unnatural. I mean, when we run a random generator we don't even expect a single zero-valued word, much less a whole slew of them in a row (imagine the value of R was simply 2: it would be followed by 63 zeros!).
O_o

I'm not sure you fully appreciate the science of "seeding" a "PRNG" or how a "PRNG" actually behaves, but you are certainly incorrect. You should read more about the value of "PRNG" state, the relationship between state and period, and using the entire state by crafting an algorithm which eventually mutates the state into every possible configuration.

You can easily initialize the state for the common "MT19937" such that you'll get a lot of zeroes. The less common "CMWC4096" has a massive run of zeroes if you initialize the state to the correct values. Even if you begin sampling these excellent algorithms with a good initial "seed", you will eventually mutate the state sufficiently to pull a large sequence of zero values because the algorithm eventually mutates into the appropriate state.

Essentially, every "PRNG" having limited state merely pulls a few bits from a massive number. I would not only expect such a massive number to have long repeated sequences; I'd be truly shocked if it didn't have long repeated sequences. There is nothing "unnatural" about a "PRNG" that returns the same value multiple times in series.

You could toss zeros, but you would not actually change anything in the direction you seem to care about. You could just as easily, with the same probability even depending on your algorithm, get a sequence (63 results) of ones or twos. You'd also have runs that follow other patterns, like linear or doubling. If the nature of these results is significant, your algorithm has problems in any event.

Soma

8. Originally Posted by phantomotap
O_o
I'm not sure you fully appreciate the science of "seeding" a "PRNG" or how "PRNG" actually behave, but you are certainly incorrect. You should read more about the value of "PRNG" state, the relationship between state and period, and using the entire state by crafting an algorithm which eventually mutates the state into every possible configuration.
I can assure you, the confusion is all yours. The seed has not even been discussed anywhere here. I said that R was the RESULT from each cycle of the generator (which is of maximal period in any case, by the way).

Originally Posted by phantomotap
You can easily initialize the state for the common "MT19937" such that you'll get a lot of zeroes. The less common "CMWC4096" has a massive run of zeroes if you initialize the state to the correct values. Even if you begin sampling these excellent algorithms with a good initial "seed", you will eventually mutate the state sufficiently to pull a large sequence of zero values because the algorithm eventually mutates into the appropriate state. Essentially, every "PRNG" having limited state merely pulls a few bits from a massive number. I would not only expect such a massive number to have long repeated sequences; I'd be truly shocked if it didn't have long repeated sequences. There is nothing "unnatural" about a "PRNG" that returns the same value multiple times in series. You could toss zeros, but you would not actually change anything in the direction you seem to care about. You could just as easily, with the same probability even depending on your algorithm, get a sequence (63 results) of ones or twos. You'd also have runs that follow other patterns, like linear or doubling. If the nature of these results is significant, your algorithm has problems in any event.
Internal state aside, it absolutely IS a problem if a generator starts spitting out a huge run of zeroes - that could easily ruin a simulation. The problem here is nothing more than the fact that the larger state space is "encroaching" on the smaller range of numbers that we're pulling out, so we really just need a way to "shrink" the state space somehow. Neither lopping off zeroes nor the proposed modulus/divide method seems to be the right solution, though, so what is?

9. I can assure you, the confusion is all yours. The seed has not even been discussed anywhere here. I said that R was the RESULT from each cycle of the generator.
o_O

You simply don't know what you are talking about.

This is really very simple: the RESULT from each cycle of a "PRNG" is a function of the state.

You clearly don't understand that very simple relationship so I'll elaborate now: the "seed" is a value or values used in the (re)initialization of the state.

Put any "PRNG" algorithm in the correct state and you will see a sequence of RESULTS which are all zeroes or some such pattern.

If the period is long enough, you'll see pretty much any other finite pattern you can imagine.

Internal state aside, it absolutely IS a problem if a generator starts spitting out a huge run of zeroes - that could easily ruin a simulation.
*sigh*

Why don't you do yourself a favor and search the internet for "seeding MT19937 simulation"?

Did you see all of those posters who got wonky simulations because the strategy they used caused "MT19937" to yield long seemingly patterned sequences?

Despite the absolute fact that "MT19937" will produce runs of zeros, people successfully use "MT19937" for simulations.

If your algorithm isn't up to the task, use an algorithm which is appropriate for simulations.

Soma

10. Most large-state PRNGs use the same approach for this. You split the state into word-sized units and return each word as a separate pseudorandom number in turn. When you run out, you regenerate a new state.

For example, the original MT19937 Mersenne Twister variant has 19968 bits of state, arranged as 624 words of 32 bits each. It also contains an index variable that tracks which words have already been returned as pseudorandom numbers; when all of them have been consumed, it regenerates the full state.

(The "give me a new pseudorandom number" operation is then just a lookup, which runs the full generator step only when there are no precalculated numbers available.)
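That buffering scheme can be sketched as follows (in Python, with the actual generator abstracted into a hypothetical `step` callable that maps a state to a new big result plus successor state -- names here are illustrative, not from any particular library):

```python
class BufferedPRNG:
    """Hand out one word of the big result per call, regenerating the
    state only when every word has been consumed -- the same buffering
    scheme MT19937 uses with its index variable."""

    def __init__(self, step, state, total_bits=4096, word_bits=64):
        self.step = step                      # state -> (R, new_state)
        self.state = state
        self.nwords = total_bits // word_bits
        self.word_bits = word_bits
        self.mask = (1 << word_bits) - 1
        self.index = self.nwords              # force a refill on first call
        self.r = 0

    def next_word(self):
        if self.index == self.nwords:         # buffer exhausted: cycle the generator
            self.r, self.state = self.step(self.state)
            self.index = 0
        w = (self.r >> (self.index * self.word_bits)) & self.mask
        self.index += 1
        return w

# Toy step function for demonstration only: the "result" is the state itself.
gen = BufferedPRNG(lambda s: (s, s + 1), (2 << 64) | 1, total_bits=128)
```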

Some cryptographic and other security-sensitive PRNGs only reveal a fraction of their internal state, to avoid the case -- mentioned in the Wikipedia article on MT19937 -- where a full copy of the PRNG state (and therefore all future numbers) can be obtained simply by observing enough consecutive generated numbers.

When you implement a new PRNG, it is always a good idea to verify that the generated numbers pass the Diehard or Dieharder tests.

For molecular dynamics and Metropolis Monte Carlo simulations, I like the Xorshift generators. They're simple, extremely fast, and many pass the Diehard tests. They're typically only sensitive to an all-zeros initial state, which is trivial to avoid (and even to detect at run time).
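For reference, a single step of one of Marsaglia's xorshift generators looks like this (the (13, 7, 17) shift triple from his paper; the all-zero state is a fixed point, which is the sensitivity mentioned above):

```python
MASK64 = (1 << 64) - 1

def xorshift64(x):
    """One step of Marsaglia's 64-bit xorshift with triple (13, 7, 17).
    The state doubles as the output. x == 0 is a fixed point, so the
    state must be seeded nonzero -- easy to check at run time."""
    x ^= (x << 13) & MASK64
    x ^= x >> 7
    x ^= (x << 17) & MASK64
    return x
```

Since the nonzero states form a single cycle of period 2^64 - 1, a nonzero seed can never walk into the zero state.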

11. Most large-state PRNGs use the same approach for this. You split the state into word-sized units and return each word as a separate pseudorandom number in turn. When you run out, you regenerate a new state.
O_o

A small note: many alter the relevant bits with a separate function before providing the value to the client.

Actually, I'm not sure my original comment was correct. I know that many algorithms designed by Marsaglia would not conform, so I've modified the comment.

For example, the common "MT19937" you reference uses "XOR", "LSHIFT", "RSHIFT", and a couple of magic numbers to alter the sampled state.
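Those tempering operations are public knowledge (they appear in the MT19937 reference code); in Python they look like this, applied to a 32-bit word `y` pulled from the state array:

```python
def mt19937_temper(y):
    """MT19937's output tempering: shifts, XORs, and two magic masks,
    applied to each 32-bit state word before it reaches the caller.
    Each step is invertible, so tempering is a bijection on 32-bit
    words -- it reshuffles bits without losing any entropy."""
    y ^= y >> 11
    y ^= (y << 7) & 0x9D2C5680
    y ^= (y << 15) & 0xEFC60000
    y ^= y >> 18
    return y & 0xFFFFFFFF
```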

Soma

12. Originally Posted by Sebastiani
Internal state aside, it absolutely IS a problem if a generator starts spitting out a huge run of zeroes - that could easily ruin a simulation.
That is a problem with the design of the simulation, or expectations of it, not with the random number generator.

Real distributions of (almost) anything do permit a non-zero chance of unlikely sequences. If you're doing something to prevent particular sequences (such as runs) then you're introducing a bias - regardless of the quality of the random number generator.

Originally Posted by Sebastiani
The problem here is nothing more than the fact that the larger state space is "encroaching" on the smaller range of numbers that we're pulling out, so we really just need a way to "shrink" the state space somehow. Neither lopping off zeroes nor using the proposed modulus/divide method seem to be the right solution though, so what?
As Mario said, it is easier to shrink the state space that the random number generator works with than to use a larger state space and then adjust the output to a smaller range (by lopping off zeros, modulo/divide, etc.). The two are distinct, but you are treating one as the other. They are not (usually) interchangeable either.
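One note on adjusting output to a smaller range without bias: plain modulo over-represents small values whenever the target range doesn't divide 2^64, but classic rejection sampling fixes that, at the cost of occasionally discarding a draw. A sketch (the `draw64` callable is a hypothetical stand-in for any generator of uniform 64-bit integers):

```python
def unbiased_below(draw64, n):
    """Return a uniform integer in [0, n) without modulo bias.

    draw64: callable returning a uniform integer in [0, 2**64).
    Draws falling in the incomplete top block are rejected, so every
    accepted draw maps to [0, n) with exactly equal probability.
    """
    limit = (1 << 64) - ((1 << 64) % n)  # largest multiple of n <= 2**64
    while True:
        x = draw64()
        if x < limit:
            return x % n
```

When n is a power of two, `limit` equals 2^64 and nothing is ever rejected; in the worst case fewer than half of all draws are discarded.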

13. I think what everyone's missing here is that the maximum length of runs of zeroes grows with the state space. If this space is large enough in comparison to the target range, it WILL produce an inordinate lack of useful data. There has got to be a better way!

14. I think what everyone's missing here is that the maximum length of runs of zeroes grows with the state space.
O_o

I don't think anyone is missing that fact.

I admit to not addressing the issue with specific code because I don't know the details of your algorithm, but I did give you a hint about "whitening".

Sure, the notion of "whitening" is a very general bias-correction mechanic.

However, I believe you will find some useful papers if you look for "PRNG whitening" possibly with a few keywords relevant to your algorithm.

To be clear though, my comments as well as the comments of others apply to "pattern growth". (I understand your algorithm is, let us say, configurable in period simply by increasing the available state. I am though assuming this is still the same algorithm you've commented on in the past.) If a given client needs such an extreme period, they are responsible for understanding the values produced by sampling or adjusting their expectations--as grumpy implies. You aren't responsible for the misuse of the implementation if someone doesn't read the manual.

I also understand that you feel that the "pattern growth" is a weakness in the algorithm. (I can't speculate because I don't know the probability of the state of your algorithm being composed of "patterned" values.) If the "pattern growth" is not borne of bias, simply "mixing" the value provided to the client--possibly with an "LFSR", "CMWC", or "Xorshift" layer--is sufficient to disguise such patterns. Such a layer is common with many large-state "PRNG"--again, see "MT19937".

Soma

15. Originally Posted by phantomotap
O_o

I don't think anyone is missing that fact.

I admit to not addressing the issue with specific code because I don't know the details of your algorithm, but I did give you a hint about "whitening".

Sure, the notion of "whitening" is a very general bias mechanic.

However, I believe you will find some useful papers if you look for "PRNG whitening" possibly with a few keywords relevant to your algorithm.

Soma
Yes, whitening may be sufficient. I'm still holding out for some mathematically robust way to do this, though...