Doubts about detecting the periodicity of psuedo random functions

**Sebastiani** · 07-13-2009

>> So why follow the link when I can explain it here?

Uh, because I didn't grok it, quite frankly! And besides, you've explained it much more clearly. Thanks.

Still, I'm going to focus on the uniform distribution calculation until I have that worked out first. But I will be sure to move on to the chi-squared test, eventually.

**Sebastiani** · 07-13-2009

Ok, well i've come up with a function to test the uniform distribution *of sorts* (in the sense that I have no idea if it follows the standard convention or if the values even remotely correspond to the official function). I'm fairly happy with the output, but if anyone could point out how the conventional function would be designed I would appreciate it.

Code:

/*
	Returns a score on the interval (0, 1) where 0 == worst and 1 == perfect.
	Container is filled with the counts of each value where the values correspond
	to the logical index of that particular element. So for example data[ 11 ]
	contains a count of the value 11.
*/
template < typename Container >
double uniform_distribution( Container const& data )
{
	double
		result = 0, 
		count = std::accumulate( data.begin( ), data.end( ) , double( 0 ) ), 
		expected = count / data.size( );
	if( count == 0 )
		return std::numeric_limits< double >::quiet_NaN( );
	for
	( 
		typename Container::const_iterator seq = data.begin( ), fin = data.end( ); 
		seq != fin; 
		++seq 
	)
		result += fabs( *seq - expected );
	return 1.0 - ( result / count );
}

EDIT:
Ok, well the function is not actually implemented correctly. I'm working on a fix right now...

EDIT #2:
Fixed (thanks to Tabstop)!

EDIT #3:
Really fixed (thanks to Tabstop)!!

**tabstop** · 07-13-2009

So, I have no idea what you're trying to make result do here, exactly; but the idea is you need to capture how far away your counts are from what you want them to be. You know how much you want them to be: that's what scale is. The important thing here is (value-scale) -- we want that to be 0, any deviation from that is bad (whether it be positive or negative) and the bigger it is the more heartburn it gives us. The canonical way to deal with that is to use sum-of-squares: it turns everything positive, so that no cancellation happens (we want all the errors to accumulate). So we would actually add up (value-scale)^2 to get our measure of how far away we are -- 0 is spot on, large is bad. To get some notion of scaling in, we could maybe divide at the end by scale and -- hey! that's Pearson's chi-square test!

**Sebastiani** · 07-13-2009

Many thanks, Tabstop. I think it's working properly now. Now I'm going to give the chi-squared test a shot!

Cheers.

**tabstop** · 07-13-2009

Originally Posted by Sebastiani

Ok, well i've come up with a function to test the uniform distribution *of sorts* (in the sense that I have no idea if it follows the standard convention or if the values even remotely correspond to the official function). I'm fairly happy with the output, but if anyone could point out how the conventional function would be designed I would appreciate it.

Code:

/*
	Returns a score on the interval (0, 1) where 0 == worst and 1 == perfect.
	Container is filled with the counts of each value where the values correspond
	to the logical index of that particular element. So for example data[ 11 ]
	contains a count of the value 11.
*/
template < typename Container >
double uniform_distribution( Container const& data )
{
	double
		result = 0, 
		size = double( data.size( ) ), 
		count = std::accumulate( data.begin( ), data.end( ) , double( 0 ) ), 
		expected = count / size;
	if( count == 0 )
		return std::numeric_limits< double >::quiet_NaN( );
	for
	( 
		typename Container::const_iterator seq = data.begin( ), fin = data.end( ); 
		seq != fin; 
		++seq 
	)
	{
		double
			value = *seq;
		if( value )
			result += value - expected;
	}
	return 1.0 - ( result / count );
}

EDIT:
Ok, well the function is not actually implemented correctly. I'm working on a fix right now...

EDIT #2:
Fixed (thanks to Tabstop)!

Do you find this always gives you one? I would think it would (unless there were some zeroes in the group), as adding (value-expected) is the same as adding (value) and then adding (expected) and subtracting -- and these should always give you the same. At the very least you'll want an absolute value on that I would think.

**Sebastiani** · 07-13-2009

>> At the very least you'll want an absolute value on that I would think.

Doh! I was wondering why some generators were reporting results that were either better or worse than expected. Ok, so I've updated the function as necessary. Does it look correct?

**iMalc** · 07-14-2009

Originally Posted by laserlight

What do you mean by "perfect uniformity"?

I mean that if it wont ever generate an outcome that has been seen n times already while there are other outcomes that have been seen only n-1 times, then clearly each outcome is not equally likely on any given individual outcome produced.

You certainly can't call a generator biased if after generating n outcomes (where there are exactly n possible outcomes), that the outcomes are not all unique.
Another way of looking at it, Given enough possible outcomes, the chance that after a given outcome generation, all outcomes have been generated an equal number of times, is remote enough that it would probably mean the RNG is cheating, isn't really unbiased, or isn't truly random.
None of this is anything anyone here wouldn't already know, and lets not get into a discussion on true randomness.

**brewbuck** · 07-14-2009

Originally Posted by iMalc

I mean that if it wont ever generate an outcome that has been seen n times already while there are other outcomes that have been seen only n-1 times, then clearly each outcome is not equally likely on any given individual outcome produced.

Just because the RNG does not produce exactly equal bucket counts doesn't mean it's not perfectly uniform. The estimate of the probability distribution based on frequencies of observed events is just that -- an estimate.

Honestly, I think you should look at the question in terms of mutual information between the bits of internal state of the RNG. The external state is not enough information to make a good decision on randomness. Unfortunately you may not have access to the internal state bits.

I wrote an experiment last night to measure this (mutual information between external state bits) for the rand() function of my platform (Ubuntu Linux), and found that the estimate of M.I. depends on the number of iterations performed. I did not collect enough data to see if the curve eventually levels off. However, after one billion observations, the estimate of mutual information between the first bit and the higher 15 bits (I limited the experiment to 16 bits of randomness) was somewhere around 2e-6.

For a perfect uniform random source, the MI should go to zero as the number of observations goes to infinity. 2e-6 isn't zero but it's pretty damn small (when compared to 1, the assumed entropy of each bit). Conclusion? My implementation of rand() is decent.

**iMalc** · 07-14-2009

Originally Posted by brewbuck

Just because the RNG does not produce exactly equal bucket counts doesn't mean it's not perfectly uniform. The estimate of the probability distribution based on frequencies of observed events is just that -- an estimate.

Exactly.

**Sebastiani** · 07-14-2009

Well, it looks like there were flaws in both the uniform distribution function and the periodicity detector. But I think I've straightened them out. For the former, the problem was that I wasn't factoring in the maximum error in order to obtain a useful result. The corrected version:

Code:

template < typename Container >
double uniform_distribution( Container const& data )
{
	double
		one = 1,
		result = 0, 
		size = data.size( ),
		count = std::accumulate( data.begin( ), data.end( ) , double( 0 ) ), 
		expected = count / size, 
		//maximum_error = ( size - one ) * ( expected + one ); // WRONG
		maximum_error = ( count - expected ) + ( expected * ( size - one ) );
	if( count == 0 )
		return std::numeric_limits< double >::quiet_NaN( );
	for
	( 
		typename Container::const_iterator seq = data.begin( ), fin = data.end( ); 
		seq != fin; 
		++seq 
	)
		result += fabs( *seq - expected );
	return one - ( result / maximum_error );
}

As for the latter, it was just plain broken.

The updated version is much simpler and generic, and is much more rigorous in that it doesn't just check the pattern once but repeatedly until it reaches the end:

Code:

template < typename Iterator >
size_t period_of_sequence( Iterator begin, Iterator end )
{
    for( ; begin != end; ++begin )
    {
        Iterator
            found = begin;
        while( ( found = find( found + 1, end, *begin ) ) != end )
        {
            bool
                counting = true;
            size_t
                count = 1;
            Iterator
                next = begin, 
                search = found;
            while( ++search != end )
            {
                if( *( ++next ) != *search )
                    break;
                if( next == found )
                    counting = false;
                if( counting )
                    ++count;
            }             
            if( search == end && !counting )
                return count;
        }
    }
    return 0;
}

template < typename Container >
inline size_t period_of_sequence( Container const& data )
{
    return period_of_sequence( data.begin( ), data.end( ) );
}

There are still a few cases where it could return a possible false positive. For instance if the pattern ends with two identical samples, it has no choice but to assume that the sequence repeats indefinitely. In that case, though, you could just expand the range and retest.

>> Just because the RNG does not produce exactly equal bucket counts doesn't mean it's not perfectly uniform.

Perfect uniformity *by definition* means a completely equal distribution of samples, doesn't it? Anyway, you wouldn't want it to be totally uniform but within a certain range. And at any rate, a test of uniformity has a limited relevance when considering the strengths and weakenesses of an RNG.

>> For a perfect uniform random source, the MI should go to zero as the number of observations goes to infinity. 2e-6 isn't zero but it's pretty damn small (when compared to 1, the assumed entropy of each bit). Conclusion? My implementation of rand() is decent.

Interesting. I haven't gotten quite that far yet, but it sounds like a useful test. I'm guessing that Ubuntu probably uses the Marsenne twister?

>> Another way of looking at it, Given enough possible outcomes, the chance that after a given outcome generation, all outcomes have been generated an equal number of times, is remote enough that it would probably mean the RNG is cheating, isn't really unbiased, or isn't truly random.

That's a good point. Would a good test for that be to sample the uniform distribution at certain intervals and calculate the relative entropy of the output?

**tabstop** · 07-14-2009

Originally Posted by Sebastiani

Perfect uniformity *by definition* means a completely equal distribution of samples, doesn't it?

Not really. For instance, if you had just a simple 50-50 (one bit) random number generator, that is truly random (say the theoretical perfect coin), and ran it 10 times, the probability of getting 5 1's and 5 0's is (approximately) 0.25. In fact anywhere from 3-7 of each would be considered "usual" (happen more often than 5% by chance).

**Sebastiani** · 07-14-2009

>> Not really. For instance, if you had just a simple 50-50 (one bit) random number generator, that is truly random (say the theoretical perfect coin), and ran it 10 times, the probability of getting 5 1's and 5 0's is (approximately) 0.25. In fact anywhere from 3-7 of each would be considered "usual" (happen more often than 5% by chance).

Maybe we're talking about two different things here. Randomness aside, if you have an array of two counts, one for heads and the other for tails, and each element contains the value of 5, then the distribution is "perfect" (totally uniform), and any other configuration would not be. It has nothing to do with probability.

**brewbuck** · 07-15-2009

Originally Posted by Sebastiani

Perfect uniformity *by definition* means a completely equal distribution of samples, doesn't it?

No. There is no reason why the observations should be precisely uniform even if the underlying process is uniform. Consider a perfectly balanced coin. You flip it once. It comes up heads. So based on this observation -- that the coin came up heads 1 time out of 1 -- you conclude that the coin will come up heads 100% of the time, which is obviously not a justifiable conclusion.

Or consider a RNG which distributes samples uniformly between 0 and 32767, and you generate an odd number of samples. Obviously, there must be at least one set of buckets whose counts are larger than any other bucket. Yet if you drew an even number of samples, you might have perfect buckets. So, does the uniformity of the distribution depend on how many observations you choose to make? No, clearly it's either uniform or not, regardless of your measurements.

**brewbuck** · 07-15-2009

Originally Posted by Sebastiani

if you have an array of two counts, one for heads and the other for tails, and each element contains the value of 5, then the distribution is "perfect" (totally uniform), and any other configuration would not be. It has nothing to do with probability.

I think you need to be careful with the word "distribution." The distribution is a theoretical object which represents the actual probabilities of certain observations. But an array which tallies counts is only a measurement of this distribution, not the distribution itself. There are many ways to estimate the probability from a count of frequencies. The method of "take the number and divide by the total number of observations" is the maximum likelihood estimator which means among all consistent estimators it assigns the maximum probabilities to the observed events. But there are many other estimators you could use which are equally valid, depending on which assumptions you make on the underlying distribution. For instance, suppose you are estimating the likelihoods of particular words beginning a sentence. You might do this by scanning millions of sentences and incrementing a counter for each unique word you see. But this ignores the possibility that there are words which you have not seen, simply because your sample isn't big enough. Using the maximum likelihood estimator, you would assign these words a probability of zero, even though we know the true probability is not zero just because a given word didn't occur in the sample set.

**Sebastiani** · 07-15-2009

>> The distribution is a theoretical object which represents the actual probabilities of certain observations. But an array which tallies counts is only a measurement of this distribution, not the distribution itself.

I was simply referring to the "frequency" distribution, then, I guess.

>> Consider a perfectly balanced coin. You flip it once. It comes up heads. So based on this observation -- that the coin came up heads 1 time out of 1 -- you conclude that the coin will come up heads 100% of the time, which is obviously not a justifiable conclusion.

If you're testing the uniformity of samples given a range of possibilities, I would think that you would have to supply some multiple of that range to obtain a useful indicator. Otherwise, the result would be practically meaningless. I mean, if you have an RNG that generates 16-bit numbers but only call it 5 times, you simply do not have enough information to make a judgement about the relative frequency distribution. Period.

>> For instance, suppose you are estimating the likelihoods of particular words beginning a sentence. You might do this by scanning millions of sentences and incrementing a counter for each unique word you see. But this ignores the possibility that there are words which you have not seen, simply because your sample isn't big enough. Using the maximum likelihood estimator, you would assign these words a probability of zero, even though we know the true probability is not zero just because a given word didn't occur in the sample set.

That's an interesting thought, but I honestly don't grok the import of the logic there. If you don't know what all of the possibilities are, then why factor them in at all? That's just seems illogical. Of course, probability never really made sense to me, *anyway*. It always seemed be a big much-to-do-about-nothing, personally. Perhaps if it had been given a more appropriate name, like "possibility theory", I would have given it more credence.