This is the reader thread compiled on ARM64 with -O0.
Code:
reader_start:
{
local_c0 = c0;
28c: 90000000 adrp x0, 0 <main>
290: 91000000 add x0, x0, #0x0
294: f9400000 ldr x0, [x0]
298: f90017a0 str x0, [x29,#40]
local_c1 = c1;
29c: 90000000 adrp x0, 0 <main>
2a0: 91000000 add x0, x0, #0x0
2a4: f9400000 ldr x0, [x0]
2a8: f90013a0 str x0, [x29,#32]
__atomic_thread_fence( __ATOMIC_ACQUIRE );
2ac: d50339bf dmb ishld
if( local_c0 == c0 and local_c1 == c1 )
2b0: 90000000 adrp x0, 0 <main>
2b4: 91000000 add x0, x0, #0x0
2b8: f9400000 ldr x0, [x0]
2bc: f94017a1 ldr x1, [x29,#40]
2c0: eb00003f cmp x1, x0
2c4: 54000201 b.ne 304 <thread_reader+0xb4>
2c8: 90000000 adrp x0, 0 <main>
2cc: 91000000 add x0, x0, #0x0
2d0: f9400000 ldr x0, [x0]
2d4: f94013a1 ldr x1, [x29,#32]
2d8: eb00003f cmp x1, x0
2dc: 54000141 b.ne 304 <thread_reader+0xb4>
if( local_c0 > local_c1 )
2e0: f94017a1 ldr x1, [x29,#40]
2e4: f94013a0 ldr x0, [x29,#32]
2e8: eb00003f cmp x1, x0
2ec: 540000c9 b.ls 304 <thread_reader+0xb4>
printf( "uh-oh! c0 is %llu and c1 is %llu\n", local_c0, local_c1 );
2f0: 90000000 adrp x0, 0 <main>
2f4: 91000000 add x0, x0, #0x0
2f8: f94013a2 ldr x2, [x29,#32]
2fc: f94017a1 ldr x1, [x29,#40]
300: 94000000 bl 0 <printf>
}
goto reader_start;
It's -O0, so no compiler optimization. The acquire barrier is coming out as a DMB ISHLD.
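For reference, the source lines interleaved with the disassembly can be pulled together into a single pass of the reader loop. This is only a sketch under some assumptions: the counter type is guessed as unsigned long long from the 64-bit ldr and the %llu format, the function name reader_check_once is hypothetical, the original `and` is written as `&&`, and the shared reads use relaxed atomic loads (which are still plain loads on AArch64) where the original appears to use ordinary reads.

```c
#include <stdio.h>

/* Shared counters. The type is an assumption based on the 64-bit loads
 * and the %llu format string in the listing. */
unsigned long long c0, c1;

/* One pass of the reader loop (hypothetical helper name).
 * Returns 1 if the "uh-oh" branch was taken, 0 otherwise. */
int reader_check_once(void)
{
    /* Snapshot both counters. Relaxed atomic loads are an assumption;
     * they avoid a formal data race but add no ordering of their own. */
    unsigned long long local_c0 = __atomic_load_n(&c0, __ATOMIC_RELAXED);
    unsigned long long local_c1 = __atomic_load_n(&c1, __ATOMIC_RELAXED);

    /* The acquire fence from the listing, which came out there as dmb ishld. */
    __atomic_thread_fence(__ATOMIC_ACQUIRE);

    /* Re-read both counters and only trust the snapshot if neither moved. */
    if (local_c0 == __atomic_load_n(&c0, __ATOMIC_RELAXED) &&
        local_c1 == __atomic_load_n(&c1, __ATOMIC_RELAXED)) {
        if (local_c0 > local_c1) {
            printf("uh-oh! c0 is %llu and c1 is %llu\n", local_c0, local_c1);
            return 1;
        }
    }
    return 0;
}
```

Called from a single thread, this only exercises the comparison logic; the ordering question needs a concurrent writer, which is not shown in the listing and so is left out here.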
The ARM ARM sayeth:
ISHLD
DMB operation that waits only for loads to complete, and only applies to the inner shareable domain.
Which sounds right.
Right now I'm tending toward not buying that loads can move down over the barrier, because if they can, single-threaded program order is being violated. If, in my single-threaded view of the world, I write that all loads before "now" must be complete before the loads after, then earlier loads *can't* be made to complete after later loads.
It *is* true that earlier loads will not be guaranteed to have *completed* by the time the load barrier returns, but that's fine. All that matters is that when they do complete, they will complete before the later loads.
I suspect the phrase "loads above can move down over the barrier" may actually be a way of saying that the load barrier does not itself force earlier loads to *complete*.
However, the *ordering* imposed by the barrier *is* present.