Thread: Trying to understand hazard pointer implementation

  1. #46
    Quote Originally Posted by King Mir
    An acquire fence will prevent loads from being moved before the fence, but will not prevent loads from being moved below it. So your code is equivalent to:
    Code:
    for(;;)
     {
       __atomic_thread_fence( __ATOMIC_ACQUIRE );
       local_c0 = c0;
       local_c1 = c1;
       if( local_c0 == c0 and local_c1 == c1 )
         if( local_c0 > local_c1 )
           printf( "uh-oh! c0 is %llu and c1 is %llu\n", local_c0, local_c1 );
     }
    That's not very useful.
    I think you are right. I'm in the process of checking.

    Quote Originally Posted by King Mir
    An acquire barrier is not a yield. It does not in any way signal anything to another thread. All it does is prevent your compiler and hardware from reordering instructions, and it forces memory consistency. At most this may cause a stall on some architectures until cache changes are visible from other cores.
    Yes. That was not the misunderstanding; rather, I had forgotten that loads could move down below a load barrier. I was treating it as if it were a barrier to the movement of loads in *both* directions.

    My current thought (not yet confirmed) is that I actually need a full memory barrier here.

    I think this would also explain something I saw in McKenney's sample hazard pointer code, where he was using a full barrier where I had thought a load barrier would be enough.
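
    As a sketch of what I mean (not yet tested), here is the loop with a full fence in place of the acquire fence - GCC compiles __ATOMIC_SEQ_CST to a full DMB ISH on ARM64:

    Code:
    for(;;)
    {
      local_c0 = c0;
      local_c1 = c1;
      /* full barrier - loads can move neither up nor down over it */
      __atomic_thread_fence( __ATOMIC_SEQ_CST );
      if( local_c0 == c0 and local_c1 == c1 )
        if( local_c0 > local_c1 )
          printf( "uh-oh! c0 is %llu and c1 is %llu\n", local_c0, local_c1 );
    }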

  2. #47
    Registered User
    Join Date
    Dec 2009
    Posts
    83
    Mir, do you have any sources you can point me to?

    I've been looking for confirmation of this (from sources I trust) and I can't find any.

    I've not found anything which disproves it, either.

    I think I remember reading it a long time ago, perhaps in some MSDN docs, but I have no idea now.

    I am currently in two minds about it.

    If we consider a case with two threads and a stack data structure, the situation is clear: thread A pushes to the stack (and in doing so emits a store barrier) and thread B later pops (and emits a load barrier). To be *able* to pop, the store barrier must have occurred, so the load barrier ensures thread B will see the stores from thread A. All well and good.
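
    To make the pairing concrete, a minimal sketch (the names here - top, node, use - are hypothetical, not from any real implementation, and top would need to be suitably atomic):

    Code:
    /* thread A : push - the store barrier ensures the node's payload
       is visible before the node itself is published */
    node->value = 42;
    __atomic_thread_fence( __ATOMIC_RELEASE );
    top = node;

    /* thread B : pop - the load barrier ensures that having seen the
       published node, we also see the stores made before it */
    local_top = top;
    __atomic_thread_fence( __ATOMIC_ACQUIRE );
    use( local_top->value );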

    If we now, however, consider the case of a single thread using load barriers, we make the following assertions:

    1. all loads issued prior to the load barrier will complete prior to loads after the load barrier
    2. the load barrier causes the invalidation request queue to be serviced
    3. loads above the load barrier can be reordered to below the barrier, but loads below the barrier cannot be reordered above

    For your observation regarding the code (in your most recent post) to be true, we are saying that the two loads I had above the load barrier have been reordered to below the load barrier - that for this to occur they had not yet *been* issued, so the load barrier cannot cause them to complete before the loads after the barrier; they are in effect themselves now after the barrier.

    If this is the case, then in any single thread all load barriers can in effect move up to the very top of the code, like so:

    Code:
    load A
    load B
    load C
    load_barrier
    load D
    load E
    load_barrier
    load F
    load G
    load H
    load_barrier
    can become

    Code:
    load_barrier
    load_barrier
    load_barrier
    load A
    load B
    load C
    load D
    load E
    load F
    load G
    load H
    And this may indeed be true, but there seems to me something odd about it, which I will try to explain.

    In our original code, and assuming load barriers block movement in both directions, the loads occur as an ordered series of unordered groups, i.e. ABC (these three in any order), followed by DE (these two in any order), followed by FGH (these three in any order).

    If we say loads can move down over a load barrier, then we can imagine the second situation - and we say the loads have not been issued, so they cannot honour the load barrier.

    But if this is true, then there is no need to imagine that the ordering (ABC), (DE), (FGH) still holds. These loads could be in any order, since they were issued after the last load barrier...

    Code:
    load_barrier
    load_barrier
    load_barrier
    load G
    load E
    load A
    load F
    load C
    load H
    load D
    load B
    ...and this seems to violate the principle that a single thread always sees the world as correctly sequential; we *did* issue loads before the load barriers, in our single-threaded view of the world, and the load barriers are *not* being honoured.

    So I'm in two minds about this.

    As an aside, when you talk here about reordering over the barrier, are you thinking of compiler reordering, processor reordering, or both? The barrier is assumed to contain a matching compiler barrier of the same type.
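
    By a compiler barrier I mean something like the following - either form constrains only the compiler and emits no instruction at all:

    Code:
    /* compiler-only barrier - forbids the compiler from reordering
       memory accesses over this point; the processor never sees it */
    __asm__ __volatile__ ( "" : : : "memory" );

    /* the C11 equivalent, from stdatomic.h */
    atomic_signal_fence( memory_order_acquire );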

    The Intel docs describe LFENCE by saying it prevents speculative loads from crossing the fence (which is reasonable), and say nothing else.
    Last edited by Toby Douglass; 01-28-2017 at 09:21 AM.

  3. #49
    This is the reader thread compiled on ARM64 with -O0.

    Code:
      reader_start:
      {
        local_c0 = c0;
     28c:	90000000 	adrp	x0, 0 <main>
     290:	91000000 	add	x0, x0, #0x0
     294:	f9400000 	ldr	x0, [x0]
     298:	f90017a0 	str	x0, [x29,#40]
        local_c1 = c1;
     29c:	90000000 	adrp	x0, 0 <main>
     2a0:	91000000 	add	x0, x0, #0x0
     2a4:	f9400000 	ldr	x0, [x0]
     2a8:	f90013a0 	str	x0, [x29,#32]
        __atomic_thread_fence( __ATOMIC_ACQUIRE );
     2ac:	d50339bf 	dmb	ishld
        if( local_c0 == c0 and local_c1 == c1 )
     2b0:	90000000 	adrp	x0, 0 <main>
     2b4:	91000000 	add	x0, x0, #0x0
     2b8:	f9400000 	ldr	x0, [x0]
     2bc:	f94017a1 	ldr	x1, [x29,#40]
     2c0:	eb00003f 	cmp	x1, x0
     2c4:	54000201 	b.ne	304 <thread_reader+0xb4>
     2c8:	90000000 	adrp	x0, 0 <main>
     2cc:	91000000 	add	x0, x0, #0x0
     2d0:	f9400000 	ldr	x0, [x0]
     2d4:	f94013a1 	ldr	x1, [x29,#32]
     2d8:	eb00003f 	cmp	x1, x0
     2dc:	54000141 	b.ne	304 <thread_reader+0xb4>
          if( local_c0 > local_c1 )
     2e0:	f94017a1 	ldr	x1, [x29,#40]
     2e4:	f94013a0 	ldr	x0, [x29,#32]
     2e8:	eb00003f 	cmp	x1, x0
     2ec:	540000c9 	b.ls	304 <thread_reader+0xb4>
            printf( "uh-oh! c0 is %llu and c1 is %llu\n", local_c0, local_c1 );
     2f0:	90000000 	adrp	x0, 0 <main>
     2f4:	91000000 	add	x0, x0, #0x0
     2f8:	f94013a2 	ldr	x2, [x29,#32]
     2fc:	f94017a1 	ldr	x1, [x29,#40]
     300:	94000000 	bl	0 <printf>
      }
      goto reader_start;
    It's -O0, so no compiler optimization. The acquire barrier is coming out as a DMB ISHLD.

    The ARM ARM sayeth:

    ISHLD

    DMB operation that waits only for loads to complete, and only applies to the inner shareable domain.
    Which sounds right.

    Right now I think I'm tending toward not buying that loads can move down over the barrier, because if they can, single-threaded program order is being violated. If I write, in my single-threaded view of the world, that all loads before "now" must be complete before the loads after, then earlier loads *can't* be made to complete after later loads.

    It *is* true that earlier loads will not be guaranteed to have *completed* by the time the load barrier returns, but that's fine. All that matters is that when they do complete, they will complete before the later loads.

    I suspect the phrase "loads above can move down over the barrier" may actually be a way of saying that loads are not forced to *complete* by the load barrier. I think this is correct, and it is of course the same with store barriers.

    However, the *ordering* imposed by the barrier *is* present, and the point at which the invalidation queue is flushed *is* the same.
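
    As an aside, and as a sketch only, the acquire can instead be attached to the loads themselves; GCC then emits an LDAR for each load on ARM64, rather than plain LDRs followed by a DMB ISHLD:

    Code:
    local_c0 = __atomic_load_n( &c0, __ATOMIC_ACQUIRE );
    local_c1 = __atomic_load_n( &c1, __ATOMIC_ACQUIRE );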
    Last edited by Toby Douglass; 01-28-2017 at 12:20 PM.
