Thread: Preserving RNG distribution on int cast

  1. #16
    Mario F. ("(?<!re)tired")
    Join Date: May 2006 | Location: Ireland | Posts: 8,446
    Ok. I took a good look at your posts. It turns out your analysis didn't differ much from how I was doing things, although I was working more intuitively and less experimentally. But that 1% nagged me, especially because the frequency curves from my own experimentation didn't suggest much of a difference between the two methods. So I decided to follow your lead and, instead of being lazy, get my hands dirty with real experimentation.


    Code

    The following is my code for testing both methods. I call the traditional min + int(interval * rng()) the Int Cast method, and the method from the OP I call the Partition method.

    Code:
    import math
    import random
    
    SAMPLES = 100000    # 100 thousand samples per run
    RUNS = 1000
    
    def gencast(lbound, interval):
        # Tally how many times each value in the interval is generated.
        # Index by offset within the interval (value - lbound), so a
        # nonzero lbound cannot overflow the list.
        results = [0] * interval
        for _ in range(SAMPLES):
            results[int(interval * random.random())] += 1
    
        return results
    
    
    def genpartition(lbound, interval, factor):
        results = [0] * interval
        for _ in range(SAMPLES):
            results[int(random.random() * factor) % interval] += 1
    
        return results
    
    
    def gen(lbound, ubound):
        interval = ubound - lbound + 1
        # factor is 10 raised to (digit count of interval + 1).
        factor = 10 ** math.floor(math.log10(interval) + 2)
    
        final = {}
        for run in range(RUNS):
            rescast = gencast(lbound, interval)
            respartition = genpartition(lbound, interval, factor)
            final[run + 1] = min(rescast), max(rescast), min(respartition), max(respartition)
    
        # Amplitude of a run: highest tally minus lowest tally, per method.
        amplitudes = []
        for vals in final.values():
            amplitudes.append((vals[1] - vals[0], vals[3] - vals[2]))
    
        return amplitudes
    If you don't know Python, I trust you will still be able to derive a C program from it. Python is expressive enough that you should have little trouble converting. My apologies for not following your language of choice, but I lost most of my interest in C++ three years ago. Only C is of some interest to me, and only as a glue language I can use to read from, or build, my own optimized libraries.

    The test case follows:

    Code:
    a = gen(0, 10)
    b = gen(0, 1000)
    
    acasts = [amps[0] for amps in a]
    aparts = [amps[1] for amps in a]
    bcasts = [amps[0] for amps in b]
    bparts = [amps[1] for amps in b]
    
    amin_ampcasts = min(acasts)
    amax_ampcasts = max(acasts)
    amin_ampparts = min(aparts)
    amax_ampparts = max(aparts)
    bmin_ampcasts = min(bcasts)
    bmax_ampcasts = max(bcasts)
    bmin_ampparts = min(bparts)
    bmax_ampparts = max(bparts)
    
    aavg_ampcasts = sum(acasts) / len(acasts)
    aavg_ampparts = sum(aparts) / len(aparts)
    bavg_ampcasts = sum(bcasts) / len(bcasts)
    bavg_ampparts = sum(bparts) / len(bparts)
    
    print('gen(0, 10):')
    print(amin_ampcasts, amax_ampcasts, aavg_ampcasts)
    print(amin_ampparts, amax_ampparts, aavg_ampparts)
    print('gen(0, 1000):')
    print(bmin_ampcasts, bmax_ampcasts, bavg_ampcasts)
    print(bmin_ampparts, bmax_ampparts, bavg_ampparts)
    One test case is composed of two individual tests: one on a comparatively small interval of [0, 10], and another on an interval of [0, 1000]. Each test is run one thousand times. Each run tests both methods (Int Cast and Partition), and each method generates one hundred thousand samples. That's a grand total of 2 * 1,000 * 2 * 100,000 = 400,000,000, four hundred million samples generated.


    Methodology

    What I am mostly interested in is recording and then analysing the amplitude of a discrete run in the typical white-noise graph of a uniform distribution. This is a fancy way of saying that I'm interested in recording the maximum and minimum number of times values were generated, and from these calculating the maximum amplitude of the uniform distribution signal for that run.

    Consider this: each run produces 100,000 samples. In an interval of [0, 10] each of the 11 values is expected to be generated around 9,090.9 times. I collect the number of times each value was generated and then record the maximum and minimum. The maximum amplitude of the run is maximum - minimum.

    This amplitude is revealing of the distribution's properties. We want it as close to 0 as possible for a uniform distribution, so a higher amplitude means a weaker uniform distribution.

    The full test code above can probably convey better what is being done. Feel free to analyse and critique the methodology.
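    To make the measure concrete, here is a stripped-down sketch of a single run (Int Cast only; an illustrative snippet of mine, not part of the test bench above):

    Code:
    import random
    
    SAMPLES = 100000
    counts = [0] * 11                      # tallies for the interval [0, 10]
    for _ in range(SAMPLES):
        counts[int(11 * random.random())] += 1
    
    # Amplitude of the run: gap between the most- and least-generated
    # values. A perfectly uniform run would score 0; noise keeps it above.
    print(max(counts) - min(counts))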


    Results

    Interval [0, 10]    Min. Amplitude    Max. Amplitude    Average Amplitude
    Int Cast            121               612               301.775
    Partition           109               733               321.219


    Interval [0, 1000]  Min. Amplitude    Max. Amplitude    Average Amplitude
    Int Cast            54                83                64.699
    Partition           54                85                64.515

    The results clearly reveal a weakness in the Partition method compared to the old Int Cast method.

    The difference, however, isn't as large as your earlier suggestion. It's in fact 0.02% in the [0, 10] interval and 0.001% in the [0, 1000] interval. A far cry from your 1%.

    I think this also helps explain why I decided to include two different intervals in the test case. Anyone who has ever tried to generate signals from a random uniform distribution knows how weak the signal tends to become as you increase the interval. It was important to also test a larger interval, with its increased uniformity, and see if the Partition method could keep up with the expected behaviour. It can, as you can see above. But while it performs better than before, it still trails the Int Cast method.

    A small detail is worth noting. I ran this test case several times while building the graphs and testing the code, and on almost all of them something was revealing about the Partition method: it almost always tends to generate the lowest amplitude in the range. You can see that above. It may perform below the Int Cast when all runs are considered, but it sure tends to produce the better individual runs. We will soon see, however, why it under-performs.

    The above numbers are the final collected results. But what about individual runs, and what do both methods really look like? What follows is a few histograms showing the properties of both methods.


    Graphs

    The first histogram displays, side by side, the amplitude frequencies (how many times we see each amplitude) of both methods.

    They are both mountain histograms, as one would expect. But if we look closely, we can see that while the Partition method tends towards the left side (lower amplitudes; that's good), it misbehaves by being more spread out. It occupies a larger spectrum of the amplitude range.

    I observed this on all my runs; it seems to be constant behaviour. It often displays even higher bins on the left side of the histogram than the Int Cast, which on its own would make it the better method.

    But there the weakness is revealed: it is more spread out across the amplitude range, losing any advantage it could have gained from its tendency to produce a higher number of lower amplitudes.

    [Attachment download.png: side-by-side amplitude histograms, interval [0, 10]]

    A side-by-side comparison can't match a superimposed histogram, so here it is, for a better appreciation of the differences between the two methods.

    [Attachment download-1.png: superimposed amplitude histograms, interval [0, 10]]

    Here you can see the spread. These histograms match almost all my previous tests, with one caveat: the green histogram (Partition method) does tend to show higher bars than the Int Cast method in the lower amplitude ranges, before the median (the top of the mountain). But it is still a good representation of almost all my tests. It is noticeable how the Partition method not only sticks out in the higher amplitude ranges, but also tends to spread itself across almost the entire spectrum.

    The interval [0, 1000] isn't worth saying much about, since it would be mostly redundant at this point. Here are both plots, with just a couple of notes at the end.

    [Attachment download-2.png: side-by-side amplitude histograms, interval [0, 1000]]
    [Attachment download-3.png: superimposed amplitude histograms, interval [0, 1000]]

    The jagged look of these histograms is also typical of a large interval. As the interval approaches the number of samples (here a 1:100 relation), the histograms move from a bell curve, to a mountain, and then into an ever more jagged shape.

    Here we can better see the tendency of the Partition method to occupy the lower amplitude range. But it can't help being eager to also conquer the right side, the higher amplitude ranges, and thus loses any advantage it could have gained.

    Conclusion

    The Partition method is a poor substitute for the Int Cast method. It is less uniform by only a very small margin, not worth discussing unless one works at a casino. But it is also slower.

    The added multiplication and modulus operations haven't brought any advantage whatsoever. I thought that by consuming the digits to the right of the decimal point, multiplying by factors of 10, I could strengthen the distribution properties weakened by a cast to int (the Partition method is also based on a cast). But obviously the modulus operation that follows is just too destructive.
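    To make the destruction concrete, here is a back-of-the-envelope sketch (mine, not part of the test bench): whenever factor % interval != 0, the residue classes cannot all receive the same share of the factor equally likely integer outcomes.

    Code:
    # For the [0, 10] test above: interval = 11, factor = 10 ** 3 = 1000,
    # and 1000 = 90 * 11 + 10, so the slots cannot divide evenly.
    interval, factor = 11, 1000
    slots = [0] * interval
    for n in range(factor):
        slots[n % interval] += 1
    print(slots)   # values 0..9 get 91 slots each, value 10 gets only 90
    The relative imbalance is roughly interval / factor (here about 1.1%), which is presumably why this interval loses only a small margin; intervals that force a comparatively small factor fare much worse.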

    Final advice: not worth it.

    This post brought to you by:
    Mario "Can also do long posts" Figueiredo
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  2. #17
    Mario F. ("(?<!re)tired")
    Join Date: May 2006 | Location: Ireland | Posts: 8,446
    Quote Originally Posted by Nominal Animal
    What I fail to see, is what benefit there is for the extra operations (power of ten and modulus)?
    I don't like it when code does work "just because"; I'd prefer there to be a reason.
    A more politically correct reason I give in the Conclusion section of the previous post. But the main reason is probably going to annoy you.

    I never intended for any of this to be discussed in the context of a computer algorithm. This thread started because I explained logarithms to my math students this week. Every Monday (a 4-hour math class), for the first 2 hours, I like to give my students real-life applications of the concepts learned the previous week. I also encourage them to research these applications on their own during the weekend and bring them to class. But other than a couple of students, they rarely do.

    For some reason, I always liked the ingenuity of the log10-based method for finding the number of digits in any integer. And the modulus operation also popped into my head as I was trying to think of interesting ways of putting it into practice.
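    (For anyone who hasn't seen that trick: for a positive integer n, the number of decimal digits is floor(log10(n)) + 1. A minimal sketch, my own aside:)

    Code:
    import math
    
    def digits(n):
        # Number of decimal digits of a positive integer n.
        return math.floor(math.log10(n)) + 1
    
    print(digits(7), digits(42), digits(1000))   # 1 2 4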

    And eventually this thing was born, more as a math concept to bring to high school students, hoping that one day all this makes them hate math just a little less, than as a computer algorithm. I just wanted to test it out before bringing it to class, and that's when the questions of uniformity and such were born. Along with a small bug in my calculations that I could no longer reproduce, which was eventually solved by phantomotap in post #4.

    But then you came in with your "this algorithm sucks" and my pride was shaken.
    If it's war he wants, he'll get war!

    EDIT:
    I've attached the IPython notebook used to test and generate the graphs, in case you are interested. I forgot to do it in the previous post.

    It's zipped because:
    a) Cboard sucks and doesn't recognize a .ipynb json file format
    b) CBoard sucks even more because it doesn't recognize the 7zip format

    Untitled.zip

  3. #18
    Nominal Animal ("Ticked and off")
    Join Date: Oct 2011 | Location: La-la land | Posts: 1,728
    Quote Originally Posted by Mario F.
    The test case follows:

    Code:
    a = gen(0, 10)
    b = gen(0, 1000)
    The most problematic intervals are those one less than a power of ten, i.e. 9, 99, 999, and so on. You should be looking at
    Code:
    a = gen(0, 8)
    b = gen(0, 98)
    and so on.

    Here is my test bench using Python:
    Code:
    import math
    import random
    
    def int_cast_dist(lbound, ubound, samples):
        interval = ubound - lbound + 1
        result = {}
    
        for k in xrange(lbound, ubound + 1):
            result[k] = 0
    
        for n in xrange(samples):
            k = int(random.random() * interval) + lbound
            result[k] += 1
    
        for k in xrange(lbound, ubound + 1):
            result[k] /= float(samples)
    
        return result
    
    def partition_dist(lbound, ubound, samples):
        interval = ubound - lbound + 1
        factor = 10 ** math.floor(math.log10(interval) + 2)
        result = {}
    
        for k in xrange(lbound, ubound + 1):
            result[k] = 0
    
        for n in xrange(samples):
            k = (int(random.random() * factor) % interval) + lbound
            result[k] += 1
    
        for k in xrange(lbound, ubound + 1):
            result[k] /= float(samples)
    
        return result
    
    # One million random numbers,
    N = 1000000
    
    # from 0,
    L = 0
    
    # to 8, inclusive.
    U = L + 8
    
    idist = int_cast_dist(L, U, N)
    pdist = partition_dist(L, U, N)
    print "#Value IntCast Partition"
    print L-0.5
    for k in xrange(L,U+1):
        print k, idist[k], pdist[k]
    print U+0.5
    The -0.5 and 8.5 in the output are there just to make Gnuplot pick a better horizontal range automatically.

    If you save the output to a file, you can visualize the distributions in Gnuplot using
    Code:
    gnuplot -p -e 'plot "output-filename.txt" u 1:2 t "int cast" w boxes fs solid, "" u 1:3 t "partition" w boxes fs solid'
    For me, zero occurs just under 12% of the time, whereas 1, 2, .., 7, and 8 occur only about 11% of the time.

    For larger ranges, like U=L+98, 0 to 9 occur about 1.1% of the time each, whereas 10 to 98 only about 1.0% of the time each.

    (These results are in complete agreement with my previous post, and do not match your results, because you're using a different interval, one that exhibits much smaller aliasing issues.)
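    The arithmetic behind those figures is simple pigeonholing; a quick sketch of my own to show where the 1.1% and 1.0% come from:

    Code:
    interval, factor = 99, 1000       # U = L + 98 above: 1000 = 10 * 99 + 10
    slots = [0] * interval
    for n in range(factor):
        slots[n % interval] += 1
    print(slots[:12])                 # ten 11s, then 10s: values 0..9 get an extra slot
    print(slots[0] / float(factor))   # 0.011 -> the ~1.1% for values 0..9
    print(slots[10] / float(factor))  # 0.010 -> the ~1.0% for values 10..98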
    Quote Originally Posted by Mario F.
    But then you came in with your "this algorithm sucks"
    Ouch

    I love the idea and how it came to be; it's just that the modulus part really hurts the results. Not my fault!

    I tried to think of another way to mix logarithms and random number distributions, but the only thing I came up with is related to cheating: how to create a random number generator that makes the house always win, by tweaking the PRNG output so that it favours some numbers more than others.

    It's ethically a bit iffy, but it might be something that appeals to students: you put together a simple game (in Python), then ask them to try and modify the random number generator so they win more often. Maybe only let them modify one specific file? That might lead to interesting discussions on probability distributions and how they are implemented on computers.
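    Something like this, perhaps (a hypothetical sketch; the function name and the bias knob are my own invention):

    Code:
    import random
    
    def crooked_randint(lo, hi, favourite, bias=0.10):
        # Like a fair randint(lo, hi), except an extra `bias` fraction
        # of the calls simply returns the favourite value.
        if random.random() < bias:
            return favourite
        return random.randint(lo, hi)
    
    rolls = [crooked_randint(1, 6, favourite=6) for _ in range(100000)]
    print(rolls.count(6) / float(len(rolls)))   # ~0.25 vs. the fair ~0.167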

  4. #19
    Mario F. ("(?<!re)tired")
    Join Date: May 2006 | Location: Ireland | Posts: 8,446
    Quote Originally Posted by Nominal Animal
    The most problematic intervals are those one less than a power of ten, i.e. 9, 99, 999, and so on.
    I had completely skipped that problem. I tested it out and boy, it was fun!

    On the [0, 8] interval, the superimposed histograms don't superimpose anymore. They juxtapose! I couldn't believe it, so I had to run it again. And sure enough...

    [Attachment download.png: superimposed amplitude histograms, interval [0, 8]]

    The spread and range of the amplitude for the Partition method are off the charts. The Int Cast has a spread of 500, the Partition a whopping 800. The upper limit of the Int Cast range is 600; the lower limit of the Partition range is 800, a full 200 higher! I laughed. Seriously.

    My test uses the overall amplitude as the basis for analysis, instead of your per-element count. And it not only confirms the bias you observed, it also comes up with a figure very close to your 1%: Int Cast scored only 0.3038%, while Partition scored a much higher 1.1424%.
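    That figure is about what the pigeonhole arithmetic predicts (my own rough check, not part of the test bench):

    Code:
    # For [0, 8]: interval = 9, factor = 10 ** 2 = 100, and 100 % 9 == 1,
    # so value 0 expects 12/100 of all samples and values 1..8 only 11/100.
    # Over 100,000 samples that is a built-in gap of about 1,000 counts,
    # i.e. roughly 1% amplitude before sampling noise is even added.
    samples = 100000
    gap = samples * 12 // 100 - samples * 11 // 100
    print(gap, gap / float(samples))   # 1000 0.01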

    For the larger intervals, things come back to normal, but still with an increased loss over my previous test, where I wasn't using intervals one less than a power of ten. With an interval of [0, 998] I got:

    [Attachment download-1.png: amplitude histograms, interval [0, 998]]

    Quote Originally Posted by Nominal Animal
    I tried to think of another way to mix logarithms and random number distributions


    I'm still taking this to class on Monday. It is very unlikely I will have time to work something out before then. My wife has assigned me to room-painting duty tomorrow.

    It is still mathematically sound. And for the students, most of whom have never touched a computer, the statistical deficiencies will not be of any importance. I will make a note of it, show the Int Cast alternative, and speak against this "partition" for PRNG usage, but not one of them will fully comprehend the meaning of that. Turning a small interval into any type of interval will, however, look magical to them. Or so I hope.

  5. #20
    MutantJohn ("Registered User")
    Join Date: Feb 2013 | Posts: 2,665
    Btw Nominal, I actually really enjoyed your lecture. That was a pretty good read.
