Ok. Took a good look at your posts. It turns out your analysis didn't differ much from how I was doing things, although I was working more intuitively and less experimentally. But that 1% nagged me, especially because the frequency curves I was getting from my own experimentation didn't suggest much of a change between the two methods. So I decided to follow your lead and, instead of being lazy, get my hands dirty with real experimentation.
Code
The following is my code for testing both methods. I call the traditional min + int(interval * rng()) approach the Int Cast method, and the OP's approach the Partition method.
Code:
import math
import random

SAMPLES = 100000  # 100 thousand samples per run
RUNS = 1000

def gencast(lbound, interval):
    # Int Cast method: value = lbound + int(interval * rng()).
    # Only the offset within the interval is tallied, so lbound is left
    # out of the index (indexing lbound + offset only works for lbound == 0).
    results = [0] * interval
    for _ in range(SAMPLES):
        results[int(interval * random.random())] += 1
    return results

def genpartition(lbound, interval, factor):
    # Partition method: scale the random float by a power of 10,
    # then fold it back into the interval with a modulus.
    results = [0] * interval
    for _ in range(SAMPLES):
        results[int(random.random() * factor) % interval] += 1
    return results

def gen(lbound, ubound):
    interval = ubound - lbound + 1
    # Scaling factor: a power of 10 roughly two orders of magnitude
    # larger than the interval size.
    factor = 10 ** math.floor(math.log10(interval) + 2)
    final = {}
    for run in range(RUNS):
        rescast = gencast(lbound, interval)
        respartition = genpartition(lbound, interval, factor)
        final[run + 1] = min(rescast), max(rescast), min(respartition), max(respartition)
    # Amplitude of each run: (max - min) count, per method.
    return [(vals[1] - vals[0], vals[3] - vals[2]) for vals in final.values()]
If you don't know Python, I trust you will still be able to derive a C program from it. Python is expressive enough that you should have little trouble converting. My apologies for not following your language of choice, but I lost most of my interest in C++ about 3 years ago. Only C is of some interest to me, and only as a glue language I can use to read from optimized libraries, or build my own.
The test case follows:
Code:
a = gen(0, 10)
b = gen(0, 1000)

# Separate the amplitudes of each method.
acasts = [amps[0] for amps in a]
aparts = [amps[1] for amps in a]
bcasts = [amps[0] for amps in b]
bparts = [amps[1] for amps in b]

# Each line prints: min amplitude, max amplitude, average amplitude.
print('gen(0, 10):')
print(min(acasts), max(acasts), sum(acasts) / len(acasts))
print(min(aparts), max(aparts), sum(aparts) / len(aparts))
print('gen(0, 1000):')
print(min(bcasts), max(bcasts), sum(bcasts) / len(bcasts))
print(min(bparts), max(bparts), sum(bparts) / len(bparts))
One test case is composed of two individual tests: one on a comparatively small interval of [0, 10], and another on an interval of [0, 1000]. Each test is run one thousand times. Each run tests both methods (Int Cast and Partition), and each method generates one hundred thousand samples. That's a grand total of 2 * 1,000 * 2 * 100,000 = 400,000,000, four hundred million samples generated.
Methodology
What I am mostly interested in is recording and then analysing the amplitude of a discrete run in the typical white-noise graph of a uniform distribution. This is a fancy way of saying I'm interested in recording the maximum and minimum number of times values were generated, and from these calculating the maximum amplitude of the uniform distribution signal for that run.
Consider this: each run produces 100,000 samples. In an interval of [0, 10] (11 possible values), each number is expected to be generated around 9,090.9 times. I collect the number of times each value was generated and then record the maximum and minimum. The maximum amplitude of the run is maximum - minimum.
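For illustration, here is a toy example of the measure. The counts below are made up, not from a real run:
Code:
# Hypothetical counts for the 11 values of [0, 10] after 100,000 samples.
counts = [9102, 9056, 9187, 9003, 9121, 9078, 9150, 9044, 9095, 9089, 9075]
amplitude = max(counts) - min(counts)  # 9187 - 9003 = 184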
This amplitude is revealing of the distribution's properties. For a uniform distribution we want it as close to 0 as possible, so a higher amplitude means a weaker uniform distribution.
The code can probably convey better what is being done. Feel free to analyse and critique the methodology.
Results
Interval [0, 10]     Min. Amplitude   Max. Amplitude   Average Amplitude
Int Cast                        121              612             301.775
Partition                       109              733             321.219

Interval [0, 1000]   Min. Amplitude   Max. Amplitude   Average Amplitude
Int Cast                         54               83              64.699
Partition                        54               85              64.515
The results clearly reveal a weakness in the Partition method compared to the old Int Cast method.
The difference, however, doesn't seem as high as your earlier suggestion. It's in fact 0.02% in the [0, 10] interval and 0.001% in the [0, 1000] interval. A far cry from your 1%.
I think this also helps explain why I decided to include two different intervals in the test case. Anyone who has ever tried to generate signals from a random uniform distribution knows how weak the signal tends to become as you increase the interval: in [0, 1000], each value is only expected around 99.9 times out of 100,000 samples, so absolute amplitudes naturally shrink. It was important to also test a larger interval, with its increased uniformity, and see if the Partition method could keep up with the expected behavior. It can, as you can see above. But while it performs better than before, it is still behind the Int Cast method.
A small detail is worth noting. I ran this test case several times as I was building graphs and testing the code. On almost all of them, something was revealing about the Partition method: it almost always tends to generate the lowest amplitude in the range. You can see that above. It may perform below the Int Cast method when all runs are considered, but it sure tends to produce the better individual runs. We will soon see, however, why it underperforms overall.
The above numbers are the final collected results. But what about individual runs, and what do both methods really look like? What follows are a few histograms showing the properties of both methods.
Graphs
The first histogram displays, side by side, the amplitude frequencies (how many times we see each amplitude) of both methods.
They are both mountain-shaped histograms, as one would expect. But if we look closely, we can see that while the Partition method tends towards the left side (lower amplitudes, which is good), it misbehaves by being more spread out. It occupies a larger spectrum of the amplitude range.
I observed this on all my runs; it seems to be constant behavior. The Partition method often displays even higher bins on the left side of the histogram than the Int Cast method, which on its own would make it the better method.
But there lies the weakness revealed. It is more spread out across the amplitude range, losing any advantage it could have gained from its tendency to produce a higher number of lower amplitudes.
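One way to put a number on that spread is the standard deviation of the amplitude lists collected by the test case above. A minimal sketch (exact values will vary from run to run; if the histograms are representative, the Partition stdev should come out larger):
Code:
import statistics

# Mean and standard deviation of the [0, 10] amplitude lists, per method.
print('int cast :', statistics.mean(acasts), statistics.stdev(acasts))
print('partition:', statistics.mean(aparts), statistics.stdev(aparts))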
A side-by-side comparison can't match a superimposed histogram. So here it is, for a better appreciation of the differences between the two methods.
Here you can see the spread. These histograms match all my previous tests except in one case: the green histogram (Partition method) does tend to show higher bars than the Int Cast method in the lower amplitude ranges, before the median (the top of the mountain). But it is still a good representation of almost all my tests. It is noticeable how the Partition method tends not only to stick out in the higher amplitude ranges, but also to spread itself across almost the entire spectrum.
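A rough sketch of how a superimposed plot like this can be produced, assuming matplotlib (which is not part of the test code above):
Code:
import matplotlib.pyplot as plt

# Overlay the amplitude histograms of both methods for [0, 10].
plt.hist(acasts, bins=50, alpha=0.5, label='int cast')
plt.hist(aparts, bins=50, alpha=0.5, color='green', label='partition')
plt.xlabel('amplitude')
plt.ylabel('frequency')
plt.legend()
plt.show()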
The interval [0, 1000] isn't worth saying much about, because it would be mostly redundant at this point. Here are both plots, with just a couple of notes at the end.
The jagged look of these histograms is also typical of a large interval. As the interval size approaches the number of samples (here a 1:100 relation), these histograms move from a bell curve, to a mountain, and then into an ever more jagged shape.
Here we can see more clearly the tendency of the Partition method to occupy the lower amplitude range. But it can't help being eager to also conquer the higher amplitude ranges on the right side, thus losing any advantage it could have gained.
Conclusion
The Partition method is a poor substitute for the Int Cast method. It is less uniform by only a very small margin, not worth discussing unless one works in a casino. But it is also slower.
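If you want to check the speed claim yourself, a quick-and-dirty comparison with timeit looks something like this (a sketch only; absolute numbers depend on your machine and Python build):
Code:
import timeit

# Time 10 runs of each generator over the [0, 10] interval
# (interval = 11, factor = 1000, matching gen(0, 10)).
print('int cast :', timeit.timeit(lambda: gencast(0, 11), number=10))
print('partition:', timeit.timeit(lambda: genpartition(0, 11, 1000), number=10))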
The added multiplication and modulus operations haven't brought any advantage whatsoever. I thought that by consuming the digits to the right of the decimal point, multiplying by powers of 10, I could strengthen the distribution properties weakened by a cast to int (the Partition method is also based on a cast). But evidently the modulus operation that follows is just too destructive.
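A small pigeonhole count shows one reason why. With interval = 11 and factor = 1000 (the gen(0, 10) setup), the modulus cannot fold 1000 roughly equally likely scaled values evenly onto 11 buckets:
Code:
from collections import Counter

# How the 1000 possible values of int(random() * 1000) fold onto 11 residues.
print(Counter(n % 11 for n in range(1000)))
# Residues 0..9 each occur 91 times; residue 10 occurs only 90 times.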
Final advice: not worth it.
This post brought to you by:
Mario "Can also do long posts" Figueiredo