# Thread: Drawing multiple samples from a discrete probability distribution

1. Given a list of pairs (value, probability), what's a fast way to draw many samples from the distribution?

For example, the input could be something like:

<"bob", 0.25>
<"cindy", 0.5>
<"david", 0.25>

meaning that for each sample, I want a 25% chance of picking Bob, a 50% chance of picking Cindy, and a 25% chance of picking David.

I can calculate the cumulative probability at each point and do a binary search for each sample, but that's O(m log n), where m is the number of samples I want and n is the number of states.
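A minimal sketch of that baseline approach (Python here; the variable names and the toy three-entry distribution are illustrative, not from the thread):

```python
# Baseline O(m log n) sampling: build the cumulative distribution once,
# then binary-search it for each sample.
import bisect
import itertools
import random

values = ["bob", "cindy", "david"]
probs = [0.25, 0.5, 0.25]

# Running cumulative sums: [0.25, 0.75, 1.0]
cdf = list(itertools.accumulate(probs))

def draw(rng=random):
    # bisect_right finds the first index i with cdf[i] > u,
    # which is exactly the sampled value's index.
    u = rng.random()
    return values[bisect.bisect_right(cdf, u)]

samples = [draw() for _ in range(10)]
```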

In my case, both m and n are on the order of 10 million, so it would take quite a while.

Is there a faster way?

The probabilities are mostly quite similar.

2. Let's recap first:
Code:
```
0      K/4     K/2    3K/4      K
╪═══════╪═══════╪═══════╪═══════╪
│ B 25% │     C 50%     │ D 25% │
```
To generate pseudorandom numbers with a desired discrete distribution, you generate a pseudorandom integer x in [0, K) (not including K, i.e. 0 <= x <= K-1). You keep an array filled with the cumulative probability of the preceding results, scaled so that 100% = K, starting with a zero for the first option and with K (100%) added at the end, to make the search as simple and as fast as possible.

Here, it would be
probability = [ 0, K/4, 3*K/4, K ]
To find the result our pseudorandom x refers to, we find the index i in probability for which
probability[i] <= x < probability[i+1]
Here, i=0 would be Bob, i=1 would be Cindy, and i=2 would be David.
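That selection rule can be sketched as follows (the value of K and the helper function are illustrative, not from the thread):

```python
# Integer cumulative array as described above, scaled so 100% == K.
# probability[i] <= x < probability[i+1] selects result i.
import bisect
import random

K = 2 ** 20
names = ["bob", "cindy", "david"]          # i = 0, 1, 2 as in the post
probability = [0, K // 4, 3 * K // 4, K]   # [0, K/4, 3K/4, K]

def sample(rng=random):
    x = rng.randrange(K)                   # pseudorandom integer in [0, K)
    # bisect_right(probability, x) - 1 is the i with
    # probability[i] <= x < probability[i+1].
    return names[bisect.bisect_right(probability, x) - 1]
```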

(If we had a continuous distribution, we'd have a nondecreasing cumulative probability function 0 <= P(y) <= 1, giving the cumulative probability of all results x <= y. In that case, we could construct the inverse cumulative probability function P⁻¹(p) = x, generate a pseudorandom number 0 <= p <= 1, and feed it to the inverse function to get the resulting value x. This is how you generate values according to a desired Gaussian distribution, for example.)
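The Gaussian inverse CDF has no closed form, but the inverse-transform idea itself is easy to sketch with an exponential distribution, where P(y) = 1 - exp(-lam*y) inverts to P⁻¹(p) = -ln(1 - p)/lam (an illustrative example, not from the thread):

```python
# Inverse-transform sampling for a continuous distribution:
# feed a uniform p in [0, 1) through the inverse CDF.
import math
import random

def sample_exponential(lam, rng=random):
    p = rng.random()                 # uniform in [0, 1)
    return -math.log(1.0 - p) / lam  # P^-1(p) for the exponential
```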

In the discrete case here, a binary search for the index i in probability takes log2(N) steps, where N is the number of possible results. If N is ten million, you need about 24 iterations of the search loop to find the result. While that looks very acceptable on the face of it, the cache behaviour is likely to make it slow: you end up accessing your forty-to-eighty-megabyte array here and there, it's not going to be all in cache all the time, and so the CPU ends up having to read parts of it from RAM.

Cyberfish is asking for any ideas to make that faster. Three options immediately come to my mind.

If all the probabilities can be approximated to sufficient precision as p_i/K, where p_i is an integer proportional to the probability of result i, you could use an array of length K, with each entry holding a result index i. After generating a pseudorandom integer x, 0 <= x < K, you'd get the result simply as result[x]. Obviously, this only works when the number of results N <= K, and for any kind of precision in the probabilities you need a high K/N ratio. If N is already ten million, this is unlikely to work for Cyberfish.
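A sketch of this direct-lookup option with toy sizes (the counts and helper names are made up for illustration):

```python
# Direct lookup table: each result i occupies p_i of the K slots, so
# sampling is a single O(1) table read.
import random

K = 8
counts = {"bob": 2, "cindy": 4, "david": 2}   # p_i values summing to K

result = []
for name, c in counts.items():
    result.extend([name] * c)                 # result i fills p_i entries
assert len(result) == K

def sample(rng=random):
    return result[rng.randrange(K)]           # one array read per sample
```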

When N is already large, as in this particular case, we only need to get rid of the initial steps of the binary search (those are the ones that cause the nasty cache behaviour) to get reasonable performance. Fortunately, we can use a very simple indexing approach to limit our binary search to a relatively short section of the probability array, depending on the exact values filled into it. (It turns out that, if you can reorder the results, interleaving improbable and probable ones gives near-minimum-length binary search ranges, and thus near-optimum performance.)

The idea is that you create a much smaller lookup array of size 1 + K/2^L, using a power of two so that you can get the index into this array as x >> L. Again, the first entry in the array is zero, and the final (additional) entry is the index of the last (additional) entry in our probability array (the entry holding K, i.e. 100% probability). You want this array to be small enough to stay reasonably hot in your caches; something like a few thousand entries, typically. Benchmarking on the target machines is definitely a good idea here.

When you generate the pseudorandom number x, 0 <= x < K, the section of the probability array you need to search is between lookup[x >> L] and lookup[(x >> L) + 1], excluding the upper limit. This is a single (two-entry) probe into the lookup array, and should in most cases reduce the number of iterations needed in the binary search. The number of iterations is, however, less important than the fact that we shrink the memory area involved in the binary search, hopefully getting better performance. Unfortunately, since x is uniform over the whole range [0, K), over many samples you will end up touching the entire probability array anyway.
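A sketch of this two-level lookup (K, L, and the helper names are illustrative; real values would be far larger):

```python
# Small lookup table maps x >> L to a short slice of the big cumulative
# array, so the binary search only touches that slice.
import bisect
import random

K = 1 << 10          # total probability mass, scaled so 100% == K
L = 6                # lookup table has 1 + K/2**L = 17 entries
names = ["bob", "cindy", "david"]
probability = [0, K // 4, 3 * K // 4, K]

# lookup[b] = largest index i with probability[i] <= b << L,
# i.e. the result that the first x of bucket b falls into.
buckets = K >> L
lookup = [bisect.bisect_right(probability, b << L) - 1 for b in range(buckets)]
lookup.append(len(probability) - 1)   # final entry covers the top bucket

def sample(rng=random):
    x = rng.randrange(K)
    lo = lookup[x >> L]
    hi = lookup[(x >> L) + 1] + 1     # restrict the binary search to [lo, hi)
    return names[bisect.bisect_right(probability, x, lo, hi) - 1]
```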

The third option is, surprisingly, the most likely to get you the best performance, and needs ridiculously little added code:

Use a cache. Generate K values -- not exactly K, but of the same order, or at least a few million values -- at a time. This way, the probability array becomes cache-hot, reducing the amortized cost of generating each value. Remember, the "slowness" is caused by cache behaviour; otherwise the algorithm is pretty darn fast already (we're talking about one to three clock cycles per binary search step on current Intel/AMD processors). Pick the numbers off the cache as-is, and only when it's empty, generate a new batch.

3. Awesome, thanks! Yeah, I was mostly worried about the cache behaviour, and using buckets does solve that! Will give it a try.