So, I had a go on ARM.
The core of the code looks like this:
Code:
/***** statics *****/
int long long unsigned static volatile __attribute__( (aligned(128)) )
  c0 = 0,
  c1 = 1;

/****************************************************************************/
libshared_pal_thread_return_t LIBSHARED_PAL_THREAD_CALLING_CONVENTION thread_reader( void *thread_argument )
{
  int long long unsigned
    local_c0,
    local_c1;

  assert( thread_argument != NULL );

  reader_start:

  local_c0 = c0;
  local_c1 = c1;

  __atomic_thread_fence( __ATOMIC_ACQUIRE );

  if( local_c0 == c0 and local_c1 == c1 )
    if( local_c0 > local_c1 )
      printf( "uh-oh! c0 is %llu and c1 is %llu\n", local_c0, local_c1 );

  goto reader_start;

  return LIBSHARED_PAL_THREAD_RETURN_CAST(RETURN_SUCCESS);
}

/****************************************************************************/
libshared_pal_thread_return_t LIBSHARED_PAL_THREAD_CALLING_CONVENTION thread_writer( void *thread_argument )
{
  assert( thread_argument != NULL );

  writer_start:

  c0++;
  c1++;

  __atomic_thread_fence( __ATOMIC_RELEASE );

  goto writer_start;

  return LIBSHARED_PAL_THREAD_RETURN_CAST(RETURN_SUCCESS);
}
Now, with this kind of work, it is extremely easy to make mistakes. I do not draw any solid conclusion from this, because it's just not possible to be confident enough in the code. Review by you and others would be a very good thing.
Using this code, I was unable on ARM64 to induce the printf() (i.e. a difference of two or more).
With the memory barriers removed, it showed up immediately and often.
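For anyone who wants to review or reproduce this without the libshared abstraction layer, here is a minimal self-contained sketch of the same reader/writer pair using plain pthreads. It is a stand-in under stated assumptions, not the exact code that was run: the `USE_FENCES` switch, the `WRITER_ITERATIONS` bound, the `writer_done` flag and the `run_experiment()` helper are all hypothetical names added so the run terminates and the barriers can be toggled at compile time. It counts anomalies rather than printing them. It also assumes a 64-bit target where a `volatile unsigned long long` access is a single load or store.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical switch: build with -DUSE_FENCES=0 for the "barriers removed" run. */
#ifndef USE_FENCES
#define USE_FENCES 1
#endif

/* Hypothetical bound so the experiment ends; the original loops forever. */
#define WRITER_ITERATIONS 1000000ULL

/* Same layout as the original: each counter aligned to 128 bytes. */
static volatile unsigned long long c0 __attribute__((aligned(128))) = 0;
static volatile unsigned long long c1 __attribute__((aligned(128))) = 1;

/* Hypothetical stop flag, set by the writer when it has finished. */
static int writer_done = 0;

static void *reader(void *arg)
{
    unsigned long long *anomalies = arg;

    assert(arg != NULL);

    while (!__atomic_load_n(&writer_done, __ATOMIC_RELAXED)) {
        unsigned long long local_c0 = c0;
        unsigned long long local_c1 = c1;

#if USE_FENCES
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
#endif

        /* Same check as the original: both values still current, yet c0 ahead of c1. */
        if (local_c0 == c0 && local_c1 == c1 && local_c0 > local_c1)
            (*anomalies)++;
    }
    return NULL;
}

static void *writer(void *arg)
{
    (void)arg;

    for (unsigned long long i = 0; i < WRITER_ITERATIONS; i++) {
        c0++;
        c1++;
#if USE_FENCES
        __atomic_thread_fence(__ATOMIC_RELEASE);
#endif
    }
    __atomic_store_n(&writer_done, 1, __ATOMIC_RELEASE);
    return NULL;
}

/* Runs one reader and one writer to completion; returns the anomaly count. */
static unsigned long long run_experiment(void)
{
    pthread_t reader_thread, writer_thread;
    unsigned long long anomalies = 0;

    c0 = 0;
    c1 = 1;
    __atomic_store_n(&writer_done, 0, __ATOMIC_RELAXED);

    pthread_create(&reader_thread, NULL, reader, &anomalies);
    pthread_create(&writer_thread, NULL, writer, NULL);
    pthread_join(writer_thread, NULL);
    pthread_join(reader_thread, NULL);

    return anomalies;
}
```

Call `run_experiment()` from `main` and print the result, building with `-pthread`. Whether the count is zero or not will depend on the hardware, the compiler and the load on the machine, so treat a single run as an observation rather than a proof.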
This in and of itself isn't unexpected - McKenney argued the propagation delay from store buffers was so short (nanoseconds) that it wasn't a problem. It may just be that the delay is there, but it's so short the code cannot provoke the problem (especially given the work being done in the reader to perform a read).
There are however other possible causes which could be blocking the problem. On ARM I think there are "levels" of memory barrier (how far out you push the stores), and it might be that they somehow differ in how long they take to execute.
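To make the "levels" idea concrete: on AArch64 the `dmb` instruction takes an option naming a shareability domain (`ish` for inner shareable, `sy` for full system, among others) and optionally restricting it to loads (`ld`) or stores (`st`). GCC and Clang typically lower an acquire fence to `dmb ishld` and a release fence to `dmb ish`. The sketch below wraps a few variants and times them in a tight loop. The wrapper names and `ns_per_barrier()` are hypothetical, the non-ARM fallback exists only so the file builds elsewhere, and timing a barrier with no outstanding stores says little about its cost under real traffic.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

#if defined(__aarch64__)
/* Each variant is a real DMB option; the "memory" clobber stops compiler reordering. */
static void barrier_ishld(void) { __asm__ __volatile__("dmb ishld" ::: "memory"); }
static void barrier_ishst(void) { __asm__ __volatile__("dmb ishst" ::: "memory"); }
static void barrier_ish(void)   { __asm__ __volatile__("dmb ish"   ::: "memory"); }
static void barrier_sy(void)    { __asm__ __volatile__("dmb sy"    ::: "memory"); }
#else
/* Fallback for other architectures: NOT equivalent, only keeps the sketch buildable. */
static void barrier_ishld(void) { __atomic_thread_fence(__ATOMIC_ACQUIRE); }
static void barrier_ishst(void) { __atomic_thread_fence(__ATOMIC_RELEASE); }
static void barrier_ish(void)   { __atomic_thread_fence(__ATOMIC_SEQ_CST); }
static void barrier_sy(void)    { __atomic_thread_fence(__ATOMIC_SEQ_CST); }
#endif

typedef void (*barrier_fn)(void);

/* Average wall-clock nanoseconds per call, including call and loop overhead. */
static double ns_per_barrier(barrier_fn fn, unsigned long iterations)
{
    struct timespec start, end;

    assert(fn != NULL && iterations > 0);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (unsigned long i = 0; i < iterations; i++)
        fn();
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

Calling, say, `ns_per_barrier(barrier_ish, 10000000UL)` and the same for the other three gives a rough comparison on a given core. Swapping one of these wrappers in for the `__atomic_thread_fence()` calls in the reader and writer would be one way to check whether the barrier level changes the outcome.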
Another reason that might contribute is that the system was entirely unloaded except for this code. On busy systems, the delay may (well) be increased. Or maybe it's reduced, because more stores are occurring!