Thread: Trying to understand hazard pointer implementation

  1. #46
    Quote Originally Posted by King Mir
    An acquire fence will prevent loads from being moved before the fence, but will not prevent loads from being moved below it. So your code is equivalent to:
    Code:
    for(;;)
     {
       __atomic_thread_fence( __ATOMIC_ACQUIRE );
       local_c0 = c0;
       local_c1 = c1;
       if( local_c0 == c0 and local_c1 == c1 )
         if( local_c0 > local_c1 )
           printf( "uh-oh! c0 is %llu and c1 is %llu\n", local_c0, local_c1 );
     }
    That's not very useful.
    I think you are right. I'm in the process of checking.

    Quote Originally Posted by King Mir
    An acquire barrier is not a yield. It does not in any way signal anything to another thread. All it does is prevent your compiler and hardware from reordering instructions, and it forces memory consistency. At most this may cause a stall on some architectures until cache changes are visible from other cores.
    Yes. That was not the misunderstanding; rather, I had forgotten that loads could move down below a load barrier. I was treating it as if it were a barrier to the movement of loads in *both* directions.

    My current thought (not yet confirmed) is that I actually need a full memory barrier here.

    I think this would also explain something I saw in McKenney's sample hazard pointer code, where he was using a full barrier where I had thought a load barrier would be enough.
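
    As a sketch of what I mean (not yet tested), here is the loop with a full fence in place of the acquire fence - GCC compiles __ATOMIC_SEQ_CST to a full DMB ISH on ARM64:

    Code:
    for(;;)
    {
      local_c0 = c0;
      local_c1 = c1;
      /* full barrier - loads can move neither up nor down over it */
      __atomic_thread_fence( __ATOMIC_SEQ_CST );
      if( local_c0 == c0 and local_c1 == c1 )
        if( local_c0 > local_c1 )
          printf( "uh-oh! c0 is %llu and c1 is %llu\n", local_c0, local_c1 );
    }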

  2. #47
    Registered User
    Join Date
    Dec 2009
    Posts
    83
    Mir, do you have any sources you can point me to?

    I've been looking for confirmation of this (from sources I trust) and I can't find any.

    I've not found anything which disproves it, either.

    I think I remember reading it a long time ago, perhaps in some MSDN docs, but I have no idea now.

    I am currently in two minds about it.

    If we consider a case with two threads and a stack data structure, the situation is clear: thread A pushes to the stack (and in doing so emits a store barrier) and thread B later pops (and emits a load barrier). To be *able* to pop, the store barrier must have occurred, so the load barrier ensures thread B will see the stores from thread A. All well and good.
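
    To make the pairing concrete, a minimal sketch (the names here - top, node, use - are hypothetical, not from any real implementation, and top would need to be suitably atomic):

    Code:
    /* thread A : push - the store barrier ensures the node's payload
       is visible before the node itself is published */
    node->value = 42;
    __atomic_thread_fence( __ATOMIC_RELEASE );
    top = node;

    /* thread B : pop - the load barrier ensures that having seen the
       published node, we also see the stores made before it */
    local_top = top;
    __atomic_thread_fence( __ATOMIC_ACQUIRE );
    use( local_top->value );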

    If we now, however, consider the case of a single thread using load barriers, we make the following assertions:

    1. all loads issued prior to the load barrier will complete prior to loads after the load barrier
    2. the load barrier causes the invalidation request queue to be serviced
    3. loads above the load barrier can be reordered to below the barrier, but loads below the barrier cannot be reordered above

    For your observation regarding the code (in your most recent post) to be true, we are saying that the two loads I had above the load barrier have been reordered to below the load barrier - that for this to occur they had not yet *been* issued, so the load barrier cannot cause them to complete before the loads after the barrier; they are in effect themselves now after the barrier.

    If this is the case, then in any single thread all load barriers can in effect move up to the very top of the code, like so:

    Code:
    load A
    load B
    load C
    load_barrier
    load D
    load E
    load_barrier
    load F
    load G
    load H
    load_barrier
    can become

    Code:
    load_barrier
    load_barrier
    load_barrier
    load A
    load B
    load C
    load D
    load E
    load F
    load G
    load H
    And this may indeed be true, but there seems to me something odd about it, which I will try to explain.

    In our original code, and assuming load barriers block movement in both directions, the loads occur as an ordered series of unordered groups, i.e. ABC (these three in any order), followed by DE (these two in any order), followed by FGH (these three in any order).

    If we say loads can move down over a load barrier, then we can imagine the second situation - and we say the loads have not been issued, so they cannot honour the load barrier.

    But if this is true, then there is no need to imagine that the ordering (ABC), (DE), (FGH) still holds. These loads could be in any order, since they were issued after the last load barrier...

    Code:
    load_barrier
    load_barrier
    load_barrier
    load G
    load E
    load A
    load F
    load C
    load H
    load D
    load B
    ...and this seems to violate the principle that a single thread always sees the world as correctly sequential; we *did* issue loads before the load barriers, in our single-threaded view of the world, and the load barriers are *not* being honoured.

    So I'm in two minds about this.

    As an aside, when you talk here about reordering over the barrier, are you thinking of compiler reordering, processor reordering, or both? The barrier is assumed to contain a matching compiler barrier of the same type.
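
    By a compiler barrier I mean something like the following - either form constrains only the compiler and emits no instruction at all:

    Code:
    /* compiler-only barrier - forbids the compiler from reordering
       memory accesses over this point; the processor never sees it */
    __asm__ __volatile__ ( "" : : : "memory" );

    /* the C11 equivalent, from stdatomic.h */
    atomic_signal_fence( memory_order_acquire );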

    The Intel docs describe LFENCE by saying it prevents speculative loads from crossing the fence (which is reasonable), and say nothing else.
    Last edited by Toby Douglass; 01-28-2017 at 09:21 AM.

  3. #49
    This is the reader thread compiled on ARM64 with -O0.

    Code:
      reader_start:
      {
        local_c0 = c0;
     28c:	90000000 	adrp	x0, 0 <main>
     290:	91000000 	add	x0, x0, #0x0
     294:	f9400000 	ldr	x0, [x0]
     298:	f90017a0 	str	x0, [x29,#40]
        local_c1 = c1;
     29c:	90000000 	adrp	x0, 0 <main>
     2a0:	91000000 	add	x0, x0, #0x0
     2a4:	f9400000 	ldr	x0, [x0]
     2a8:	f90013a0 	str	x0, [x29,#32]
        __atomic_thread_fence( __ATOMIC_ACQUIRE );
     2ac:	d50339bf 	dmb	ishld
        if( local_c0 == c0 and local_c1 == c1 )
     2b0:	90000000 	adrp	x0, 0 <main>
     2b4:	91000000 	add	x0, x0, #0x0
     2b8:	f9400000 	ldr	x0, [x0]
     2bc:	f94017a1 	ldr	x1, [x29,#40]
     2c0:	eb00003f 	cmp	x1, x0
     2c4:	54000201 	b.ne	304 <thread_reader+0xb4>
     2c8:	90000000 	adrp	x0, 0 <main>
     2cc:	91000000 	add	x0, x0, #0x0
     2d0:	f9400000 	ldr	x0, [x0]
     2d4:	f94013a1 	ldr	x1, [x29,#32]
     2d8:	eb00003f 	cmp	x1, x0
     2dc:	54000141 	b.ne	304 <thread_reader+0xb4>
          if( local_c0 > local_c1 )
     2e0:	f94017a1 	ldr	x1, [x29,#40]
     2e4:	f94013a0 	ldr	x0, [x29,#32]
     2e8:	eb00003f 	cmp	x1, x0
     2ec:	540000c9 	b.ls	304 <thread_reader+0xb4>
            printf( "uh-oh! c0 is %llu and c1 is %llu\n", local_c0, local_c1 );
     2f0:	90000000 	adrp	x0, 0 <main>
     2f4:	91000000 	add	x0, x0, #0x0
     2f8:	f94013a2 	ldr	x2, [x29,#32]
     2fc:	f94017a1 	ldr	x1, [x29,#40]
     300:	94000000 	bl	0 <printf>
      }
      goto reader_start;
    It's -O0, so no compiler optimization. The acquire barrier is coming out as a DMB ISHLD.

    The ARM ARM sayeth:

    ISHLD

    DMB operation that waits only for loads to complete, and only applies to the inner shareable domain.
    Which sounds right.

    Right now I think I'm tending toward not buying that loads can move down over the barrier, because if they can, single-threaded program order is being violated. If I write, in my single-threaded view of the world, that all loads before "now" must be complete before the loads after, then earlier loads *can't* be made to complete after later loads.

    It *is* true that earlier loads will not be guaranteed to have *completed* by the time the load barrier returns, but that's fine. All that matters is that when they do complete, they will complete before the later loads.

    I suspect the phrase "loads above can move down over the barrier" may actually be a way of saying that loads are not forced to *complete* by the load barrier. I think this is correct, and it is of course the same with store barriers.

    However, the *ordering* imposed by the barrier *is* present, and the point at which the invalidation queue is flushed *is* the same.
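
    As an aside, and as a sketch only, the acquire can instead be attached to the loads themselves; GCC then emits an LDAR for each load on ARM64, rather than plain LDRs followed by a DMB ISHLD:

    Code:
    local_c0 = __atomic_load_n( &c0, __ATOMIC_ACQUIRE );
    local_c1 = __atomic_load_n( &c1, __ATOMIC_ACQUIRE );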
    Last edited by Toby Douglass; 01-28-2017 at 12:20 PM.
