Doubts about detecting the periodicity of psuedo random functions

This is a discussion on Doubts about detecting the periodicity of psuedo random functions within the C++ Programming forums, part of the General Programming Boards category; >> So why follow the link when I can explain it here? Uh, because I didn't grok it, quite frankly! ...

  1. #16
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,655
    >> So why follow the link when I can explain it here?

    Uh, because I didn't grok it, quite frankly! And besides, you've explained it much more clearly. Thanks.

    Still, I'm going to focus on the uniform distribution calculation until I have that worked out first. But I will be sure to move on to the chi-squared test, eventually.

  2. #17
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,655
    Ok, well i've come up with a function to test the uniform distribution *of sorts* (in the sense that I have no idea if it follows the standard convention or if the values even remotely correspond to the official function). I'm fairly happy with the output, but if anyone could point out how the conventional function would be designed I would appreciate it.

    Code:
    /*
    	Returns a score on the interval (0, 1) where 0 == worst and 1 == perfect.
    	Container is filled with the counts of each value where the values correspond
    	to the logical index of that particular element. So for example data[ 11 ]
    	contains a count of the value 11.
    */
    template < typename Container >
    double uniform_distribution( Container const& data )
    {
    	double
    		result = 0, 
    		count = std::accumulate( data.begin( ), data.end( ) , double( 0 ) ), 
    		expected = count / data.size( );
    	if( count == 0 )
    		return std::numeric_limits< double >::quiet_NaN( );
    	for
    	( 
    		typename Container::const_iterator seq = data.begin( ), fin = data.end( ); 
    		seq != fin; 
    		++seq 
    	)
    		result += fabs( *seq - expected );
    	return 1.0 - ( result / count );
    }
    EDIT:
    Ok, well the function is not actually implemented correctly. I'm working on a fix right now...

    EDIT #2:
    Fixed (thanks to Tabstop)!

    EDIT #3:
    Really fixed (thanks to Tabstop)!!
    Last edited by Sebastiani; 07-13-2009 at 10:32 PM.

  3. #18
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,185
    So, I have no idea what you're trying to make result do here, exactly; but the idea is you need to capture how far away your counts are from what you want them to be. You know how much you want them to be: that's what scale is. The important thing here is (value-scale) -- we want that to be 0, any deviation from that is bad (whether it be positive or negative) and the bigger it is the more heartburn it gives us. The canonical way to deal with that is to use sum-of-squares: it turns everything positive, so that no cancellation happens (we want all the errors to accumulate). So we would actually add up (value-scale)^2 to get our measure of how far away we are -- 0 is spot on, large is bad. To get some notion of scaling in, we could maybe divide at the end by scale and -- hey! that's Pearson's chi-square test!

  4. #19
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,655
    Many thanks, Tabstop. I think it's working properly now. Now I'm going to give the chi-squared test a shot!

    Cheers.

  5. #20
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,185
    Quote Originally Posted by Sebastiani View Post
    Ok, well i've come up with a function to test the uniform distribution *of sorts* (in the sense that I have no idea if it follows the standard convention or if the values even remotely correspond to the official function). I'm fairly happy with the output, but if anyone could point out how the conventional function would be designed I would appreciate it.

    Code:
    /*
    	Returns a score on the interval (0, 1) where 0 == worst and 1 == perfect.
    	Container is filled with the counts of each value where the values correspond
    	to the logical index of that particular element. So for example data[ 11 ]
    	contains a count of the value 11.
    */
    template < typename Container >
    double uniform_distribution( Container const& data )
    {
    	double
    		result = 0, 
    		size = double( data.size( ) ), 
    		count = std::accumulate( data.begin( ), data.end( ) , double( 0 ) ), 
    		expected = count / size;
    	if( count == 0 )
    		return std::numeric_limits< double >::quiet_NaN( );
    	for
    	( 
    		typename Container::const_iterator seq = data.begin( ), fin = data.end( ); 
    		seq != fin; 
    		++seq 
    	)
    	{
    		double
    			value = *seq;
    		if( value )
    			result += value - expected;
    	}
    	return 1.0 - ( result / count );
    }
    EDIT:
    Ok, well the function is not actually implemented correctly. I'm working on a fix right now...

    EDIT #2:
    Fixed (thanks to Tabstop)!
    Do you find this always gives you one? I would think it would (unless there were some zeroes in the group), as adding (value-expected) is the same as adding (value) and then adding (expected) and subtracting -- and these should always give you the same. At the very least you'll want an absolute value on that I would think.

  6. #21
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,655
    >> At the very least you'll want an absolute value on that I would think.

    Doh! I was wondering why some generators were reporting results that were either better or worse than expected. Ok, so I've updated the function as necessary. Does it look correct?

  7. #22
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,293
    Quote Originally Posted by laserlight View Post
    What do you mean by "perfect uniformity"?
    I mean that if it wont ever generate an outcome that has been seen n times already while there are other outcomes that have been seen only n-1 times, then clearly each outcome is not equally likely on any given individual outcome produced.

    You certainly can't call a generator biased if after generating n outcomes (where there are exactly n possible outcomes), that the outcomes are not all unique.
    Another way of looking at it, Given enough possible outcomes, the chance that after a given outcome generation, all outcomes have been generated an equal number of times, is remote enough that it would probably mean the RNG is cheating, isn't really unbiased, or isn't truly random.
    None of this is anything anyone here wouldn't already know, and lets not get into a discussion on true randomness.

  8. #23
    Captain Crash brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,235
    Quote Originally Posted by iMalc View Post
    I mean that if it wont ever generate an outcome that has been seen n times already while there are other outcomes that have been seen only n-1 times, then clearly each outcome is not equally likely on any given individual outcome produced.
    Just because the RNG does not produce exactly equal bucket counts doesn't mean it's not perfectly uniform. The estimate of the probability distribution based on frequencies of observed events is just that -- an estimate.

    Honestly, I think you should look at the question in terms of mutual information between the bits of internal state of the RNG. The external state is not enough information to make a good decision on randomness. Unfortunately you may not have access to the internal state bits.

    I wrote an experiment last night to measure this (mutual information between external state bits) for the rand() function of my platform (Ubuntu Linux), and found that the estimate of M.I. depends on the number of iterations performed. I did not collect enough data to see if the curve eventually levels off. However, after one billion observations, the estimate of mutual information between the first bit and the higher 15 bits (I limited the experiment to 16 bits of randomness) was somewhere around 2e-6.

    For a perfect uniform random source, the MI should go to zero as the number of observations goes to infinity. 2e-6 isn't zero but it's pretty damn small (when compared to 1, the assumed entropy of each bit). Conclusion? My implementation of rand() is decent.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  9. #24
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,293
    Quote Originally Posted by brewbuck View Post
    Just because the RNG does not produce exactly equal bucket counts doesn't mean it's not perfectly uniform. The estimate of the probability distribution based on frequencies of observed events is just that -- an estimate.
    Exactly.

  10. #25
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,655
    Well, it looks like there were flaws in both the uniform distribution function and the periodicity detector. But I think I've straightened them out. For the former, the problem was that I wasn't factoring in the maximum error in order to obtain a useful result. The corrected version:

    Code:
    template < typename Container >
    double uniform_distribution( Container const& data )
    {
    	double
    		one = 1,
    		result = 0, 
    		size = data.size( ),
    		count = std::accumulate( data.begin( ), data.end( ) , double( 0 ) ), 
    		expected = count / size, 
    		//maximum_error = ( size - one ) * ( expected + one ); // WRONG
    		maximum_error = ( count - expected ) + ( expected * ( size - one ) );
    	if( count == 0 )
    		return std::numeric_limits< double >::quiet_NaN( );
    	for
    	( 
    		typename Container::const_iterator seq = data.begin( ), fin = data.end( ); 
    		seq != fin; 
    		++seq 
    	)
    		result += fabs( *seq - expected );
    	return one - ( result / maximum_error );
    }
    As for the latter, it was just plain broken. The updated version is much simpler and generic, and is much more rigorous in that it doesn't just check the pattern once but repeatedly until it reaches the end:

    Code:
    template < typename Iterator >
    size_t period_of_sequence( Iterator begin, Iterator end )
    {
        for( ; begin != end; ++begin )
        {
            Iterator
                found = begin;
            while( ( found = find( found + 1, end, *begin ) ) != end )
            {
                bool
                    counting = true;
                size_t
                    count = 1;
                Iterator
                    next = begin, 
                    search = found;
                while( ++search != end )
                {
                    if( *( ++next ) != *search )
                        break;
                    if( next == found )
                        counting = false;
                    if( counting )
                        ++count;
                }             
                if( search == end && !counting )
                    return count;
            }
        }
        return 0;
    }
    
    template < typename Container >
    inline size_t period_of_sequence( Container const& data )
    {
        return period_of_sequence( data.begin( ), data.end( ) );
    }
    There are still a few cases where it could return a possible false positive. For instance if the pattern ends with two identical samples, it has no choice but to assume that the sequence repeats indefinitely. In that case, though, you could just expand the range and retest.

    >> Just because the RNG does not produce exactly equal bucket counts doesn't mean it's not perfectly uniform.

    Perfect uniformity *by definition* means a completely equal distribution of samples, doesn't it? Anyway, you wouldn't want it to be totally uniform but within a certain range. And at any rate, a test of uniformity has a limited relevance when considering the strengths and weakenesses of an RNG.

    >> For a perfect uniform random source, the MI should go to zero as the number of observations goes to infinity. 2e-6 isn't zero but it's pretty damn small (when compared to 1, the assumed entropy of each bit). Conclusion? My implementation of rand() is decent.

    Interesting. I haven't gotten quite that far yet, but it sounds like a useful test. I'm guessing that Ubuntu probably uses the Marsenne twister?

    >> Another way of looking at it, Given enough possible outcomes, the chance that after a given outcome generation, all outcomes have been generated an equal number of times, is remote enough that it would probably mean the RNG is cheating, isn't really unbiased, or isn't truly random.

    That's a good point. Would a good test for that be to sample the uniform distribution at certain intervals and calculate the relative entropy of the output?
    Last edited by Sebastiani; 07-14-2009 at 10:40 PM. Reason: corrected UD function

  11. #26
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,185
    Quote Originally Posted by Sebastiani View Post

    Perfect uniformity *by definition* means a completely equal distribution of samples, doesn't it?
    Not really. For instance, if you had just a simple 50-50 (one bit) random number generator, that is truly random (say the theoretical perfect coin), and ran it 10 times, the probability of getting 5 1's and 5 0's is (approximately) 0.25. In fact anywhere from 3-7 of each would be considered "usual" (happen more often than 5% by chance).

  12. #27
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,655
    >> Not really. For instance, if you had just a simple 50-50 (one bit) random number generator, that is truly random (say the theoretical perfect coin), and ran it 10 times, the probability of getting 5 1's and 5 0's is (approximately) 0.25. In fact anywhere from 3-7 of each would be considered "usual" (happen more often than 5% by chance).

    Maybe we're talking about two different things here. Randomness aside, if you have an array of two counts, one for heads and the other for tails, and each element contains the value of 5, then the distribution is "perfect" (totally uniform), and any other configuration would not be. It has nothing to do with probability.

  13. #28
    Captain Crash brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,235
    Quote Originally Posted by Sebastiani View Post
    Perfect uniformity *by definition* means a completely equal distribution of samples, doesn't it?
    No. There is no reason why the observations should be precisely uniform even if the underlying process is uniform. Consider a perfectly balanced coin. You flip it once. It comes up heads. So based on this observation -- that the coin came up heads 1 time out of 1 -- you conclude that the coin will come up heads 100% of the time, which is obviously not a justifiable conclusion.

    Or consider a RNG which distributes samples uniformly between 0 and 32767, and you generate an odd number of samples. Obviously, there must be at least one set of buckets whose counts are larger than any other bucket. Yet if you drew an even number of samples, you might have perfect buckets. So, does the uniformity of the distribution depend on how many observations you choose to make? No, clearly it's either uniform or not, regardless of your measurements.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  14. #29
    Captain Crash brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,235
    Quote Originally Posted by Sebastiani View Post
    if you have an array of two counts, one for heads and the other for tails, and each element contains the value of 5, then the distribution is "perfect" (totally uniform), and any other configuration would not be. It has nothing to do with probability.
    I think you need to be careful with the word "distribution." The distribution is a theoretical object which represents the actual probabilities of certain observations. But an array which tallies counts is only a measurement of this distribution, not the distribution itself. There are many ways to estimate the probability from a count of frequencies. The method of "take the number and divide by the total number of observations" is the maximum likelihood estimator which means among all consistent estimators it assigns the maximum probabilities to the observed events. But there are many other estimators you could use which are equally valid, depending on which assumptions you make on the underlying distribution. For instance, suppose you are estimating the likelihoods of particular words beginning a sentence. You might do this by scanning millions of sentences and incrementing a counter for each unique word you see. But this ignores the possibility that there are words which you have not seen, simply because your sample isn't big enough. Using the maximum likelihood estimator, you would assign these words a probability of zero, even though we know the true probability is not zero just because a given word didn't occur in the sample set.
    Last edited by brewbuck; 07-15-2009 at 12:29 AM.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  15. #30
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,655
    >> The distribution is a theoretical object which represents the actual probabilities of certain observations. But an array which tallies counts is only a measurement of this distribution, not the distribution itself.

    I was simply referring to the "frequency" distribution, then, I guess.

    >> Consider a perfectly balanced coin. You flip it once. It comes up heads. So based on this observation -- that the coin came up heads 1 time out of 1 -- you conclude that the coin will come up heads 100% of the time, which is obviously not a justifiable conclusion.

    If you're testing the uniformity of samples given a range of possibilities, I would think that you would have to supply some multiple of that range to obtain a useful indicator. Otherwise, the result would be practically meaningless. I mean, if you have an RNG that generates 16-bit numbers but only call it 5 times, you simply do not have enough information to make a judgement about the relative frequency distribution. Period.

    >> For instance, suppose you are estimating the likelihoods of particular words beginning a sentence. You might do this by scanning millions of sentences and incrementing a counter for each unique word you see. But this ignores the possibility that there are words which you have not seen, simply because your sample isn't big enough. Using the maximum likelihood estimator, you would assign these words a probability of zero, even though we know the true probability is not zero just because a given word didn't occur in the sample set.

    That's an interesting thought, but I honestly don't grok the import of the logic there. If you don't know what all of the possibilities are, then why factor them in at all? That's just seems illogical. Of course, probability never really made sense to me, *anyway*. It always seemed be a big much-to-do-about-nothing, personally. Perhaps if it had been given a more appropriate name, like "possibility theory", I would have given it more credence.

Page 2 of 4 FirstFirst 1234 LastLast
Popular pages Recent additions subscribe to a feed

Similar Threads

  1. time Delays and Random functions
    By markphaser in forum C++ Programming
    Replies: 17
    Last Post: 02-20-2006, 06:10 PM
  2. random functions, is it possible?
    By sreetvert83 in forum C++ Programming
    Replies: 8
    Last Post: 09-02-2005, 11:50 AM
  3. How do I restart a random number sequence.
    By jeffski in forum C Programming
    Replies: 6
    Last Post: 05-29-2003, 02:40 PM
  4. Random function's application:Dice Roller
    By angeljicu in forum C Programming
    Replies: 0
    Last Post: 11-02-2002, 04:29 AM
  5. random functions
    By bluehead in forum C++ Programming
    Replies: 5
    Last Post: 11-09-2001, 03:04 AM

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21