Well, it looks like there were flaws in both the uniform distribution function and the periodicity detector. But I think I've straightened them out. For the former, the problem was that I wasn't factoring in the maximum error in order to obtain a useful result. The corrected version:
Code:
template < typename Container >
double uniform_distribution( Container const& data )
{
double
one = 1,
result = 0,
size = data.size( ),
count = std::accumulate( data.begin( ), data.end( ) , double( 0 ) ),
expected = count / size,
//maximum_error = ( size - one ) * ( expected + one ); // WRONG
maximum_error = ( count - expected ) + ( expected * ( size - one ) );
if( count == 0 )
return std::numeric_limits< double >::quiet_NaN( );
for
(
typename Container::const_iterator seq = data.begin( ), fin = data.end( );
seq != fin;
++seq
)
result += fabs( *seq - expected );
return one - ( result / maximum_error );
}
As for the latter, it was just plain broken. The updated version is much simpler and generic, and is much more rigorous in that it doesn't just check the pattern once but repeatedly until it reaches the end:
Code:
template < typename Iterator >
size_t period_of_sequence( Iterator begin, Iterator end )
{
for( ; begin != end; ++begin )
{
Iterator
found = begin;
while( ( found = find( found + 1, end, *begin ) ) != end )
{
bool
counting = true;
size_t
count = 1;
Iterator
next = begin,
search = found;
while( ++search != end )
{
if( *( ++next ) != *search )
break;
if( next == found )
counting = false;
if( counting )
++count;
}
if( search == end && !counting )
return count;
}
}
return 0;
}
template < typename Container >
inline size_t period_of_sequence( Container const& data )
{
return period_of_sequence( data.begin( ), data.end( ) );
}
There are still a few cases where it could return a possible false positive. For instance if the pattern ends with two identical samples, it has no choice but to assume that the sequence repeats indefinitely. In that case, though, you could just expand the range and retest.
>> Just because the RNG does not produce exactly equal bucket counts doesn't mean it's not perfectly uniform.
Perfect uniformity *by definition* means a completely equal distribution of samples, doesn't it? Anyway, you wouldn't want it to be totally uniform but within a certain range. And at any rate, a test of uniformity has a limited relevance when considering the strengths and weakenesses of an RNG.
>> For a perfect uniform random source, the MI should go to zero as the number of observations goes to infinity. 2e-6 isn't zero but it's pretty damn small (when compared to 1, the assumed entropy of each bit). Conclusion? My implementation of rand() is decent.
Interesting. I haven't gotten quite that far yet, but it sounds like a useful test. I'm guessing that Ubuntu probably uses the Marsenne twister?
>> Another way of looking at it, Given enough possible outcomes, the chance that after a given outcome generation, all outcomes have been generated an equal number of times, is remote enough that it would probably mean the RNG is cheating, isn't really unbiased, or isn't truly random.
That's a good point. Would a good test for that be to sample the uniform distribution at certain intervals and calculate the relative entropy of the output?