Trying to understand hazard pointer implementation

**Toby Douglass** · 01-20-2017

Originally Posted by Codeplug

I believe the Acquire fence goes directly after the label and before the loads.

I may well be wrong, but I think it is not so. (I have not gone on to consider the other changes you've mentioned because I think this first matter affects them, they do not fix this first matter, and this first matter needs to be discussed before we can look further on.)

Before I can explain my thought, the code can be simplified as a result of the change in location of the acquire fence. If the acquire fence is moved directly after the label, then I think we can dispense with the load to local_c0 and local_c1. They're now only checking that in the time from loading local_c0 and local_c1, that no changed have occurred, which is not meaningful - it has no purpose for it does not influence the results of the test.

So, given that change, the reader code looks like this;

Code:

/****************************************************************************/
void *thread_reader( void *thread_argument )
{
  assert( thread_argument != NULL );
 
  reader_start:
  {
    __atomic_thread_fence( __ATOMIC_ACQUIRE );

    if( c0 > c1 )
      printf( "uh-oh! c0 is %llu and c1 is %llu\n", c0, c1 );
  }
  goto reader_start;
 
  return (void *) 1;
}

Now, as I say - I may be completely wrong! but what I think can happen now is that the load barrier occurs, clears the invalidation request queue, and then the read thread pauses; the write thread however continues, does lots of work, and then the read thread resumes.

We see then it is as if there is no load barrier there at all, because we can be immediately after the barrier in any possible state we can be in before the load barrier.

So for example, if we consider the writer thread, it is busy incrementing c0 and c1, and then issuing a store barrier. We then imagine the reader thread has paused after it issued its load barrier, the writer thread - still busy - issues lots of increments. However, since the reader thread is now about the issue its if(), and has not issued another load barrier, it may be that it could for example see lots of updates to c0 and none of the updates to c1. The if() fails - even though we had a load barrier.

This is a race condition, and it is handled by the local_c0 and local_c1 loads.

Here copies are made of c0 and c1. The load barrier is issued. If we see c0 and c1 *have not changed since before the load barrier*, then we *know* we have values in local_c0 and local_c1 which although they may already be out of date, *are correct with regard to each other*.

**Codeplug** · 01-20-2017

https://gcc.gnu.org/onlinedocs/gcc/_...-Builtins.html
>> __ATOMIC_ACQUIRE - Creates an inter-thread happens-before constraint from the release (or stronger) semantic store to this acquire load. Can prevent hoisting of code to before the operation.
In other words, it prevents loads/stores from moving before the Acquire.

Acquire and Release Semantics

gg

**Toby Douglass** · 01-20-2017

Originally Posted by Codeplug

https://gcc.gnu.org/onlinedocs/gcc/_...-Builtins.html

__ATOMIC_ACQUIRE - Creates an inter-thread happens-before constraint from the release (or stronger) semantic store to this acquire load. Can prevent hoisting of code to before the operation.

In other words, it prevents loads/stores from moving before the Acquire.

I may be completely missing something here - I hope not :-) but that does not, according to my mental picture of what's going on with all this stuff, address the problem in any way.

The loads which occur in the reader thread are c0 and c1. They indeed will not be moved above the load barrier - but the problem is not that they are moved; the problem is that the read thread pauses after the load barrier, while the write thread continues.

One minor point - being an acquire fence, it will only prevent *loads* from moving up above the fence, not stores as well.

Would you describe how you see the code in the two threads working (your version of the code, with the acquire right after the label), with regard to loads, stores, barriers, etc?

**Toby Douglass** · 01-20-2017

One small note, since there is a small chance it might reveal a difference in our views which otherwise is hidden, and so allow us to reconcile our views : the acquire fence will only affect the physical core upon which it operates (of that I am certain, but I am less certain about how it relates to threads on that physical core - I think it will affects loads for the whole physical core, but will only affect the ordering of code execution in the thread which issues the barrier). Here we consider the reader thread being on one physical core, and the writer thread being on another physical core. (In fact, if both threads are on the same physical core, no memory barriers are required - only compiler barriers).

**Codeplug** · 01-20-2017

Removing the loops for clarity:

Code:

void Writer()       void Reader()
{                   {
    ++c0;            
    ++c1;                                
    Release <---------> Acquire
}                       LOAD c0
                        LOAD c1
                        FailureTest
                    }

The Acquire and Release work together. The loads cannot move above the line and the stores cannot move below the line. "Moving across the line" is just another mental model. If the stores move below the line, then the Reader CPU may observe c1 incremented before c0. If the loads move above the line, then there is no Acquire to match with the Release, so it can observe c1 incremented before c0 as well.

So what does the fence provide us? It guarantees that the Reader will always see one of two results: 1) c0 and c1 are the same value, 2) c0 is greater than c1.

Since you put this work inside loops, only the first iteration of your Writer loop is susceptible.

>> the problem is that the read thread pauses after the load barrier, ...
Either thread could context-switch out at any point for any length of time - allowing all possible interleavings of code execution between the 2 threads/cpus. Take the non-looping Reader/Writer above - can you come up with an interleaving that breaks the Reader's two guarantees? If not, then I would argue that the looping version also provides those two guarantees.

gg

**Toby Douglass** · 01-20-2017

Originally Posted by Codeplug

Removing the loops for clarity:

Code:

void Writer()       void Reader()
{                   {
    ++c0;            
    ++c1;                                
    Release <---------> Acquire
}                       LOAD c0
                        LOAD c1
                        FailureTest
                    }

The Acquire and Release work together. The loads cannot move above the line and the stores cannot move below the line.

I may be completely wrong, but I think what's being said is that the second thread with the acquire knows not to execute until the release in the first thread is complete.

Is that accurate?

**Codeplug** · 01-20-2017

I look at it as a first come first serve linkage. When a Release is performed, the next Acquire is its pair. When an Acquire is performed, the next Release is its pair.

>> ... knows not to execute ...
I don't look at it as a stall in execution. I see the Acquire getting paired with the Release - so I know that if the Release is reached, then the Acquire ensures that all stores prior to the Release are visible to the Reader CPU.

gg

**Toby Douglass** · 01-21-2017

Originally Posted by Codeplug

I look at it as a first come first serve linkage. When a Release is performed, the next Acquire is its pair. When an Acquire is performed, the next Release is its pair.

So I think this can happen (now I get all the fun of writing two threads vertically in text :-);

Code:

writer            reader
======            ======
store c0            
store c1            
store barrier      
store c0            
store c1            
store barrier      
                  load barrier
store c0            
store c1            
store barrier      
store c0            
store c1            
store barrier      
store c0            
store c1            
store barrier      
store c0            
store c1            
store barrier      
                  load c0
store c0            
store c1            
store barrier      
store c0            
store c1            
store barrier      
store c0            
store c1            
store barrier      
store c0            
store c1            
store barrier      
                  load c1
                  failuretest

The stores which occur in the writer after the reader has issued the acquire barrier can be seen in any order by the reader - so for example he might see all the c0 stores and none of the c1 stores (on systems with say split caches, where one cache handles all even addresses, the other all odd addresses, and one cache is very busy and the other is not).

This then leads to a spurious outcome in the failuretest.

I don't look at it as a stall in execution.

I concur. Both threads (all threads) execute independently, and without any knowledge of the other thread(s).

**King Mir** · 01-22-2017

Here's another possible order:

Code:

writer            reader
======            ======
store c0            
store c1            
store barrier      
                  load barrier
store c1            
                  load c1 //data race!
store c0            
store barrier      
store c0            
                  load c0 //data race again!
store c1            
store barrier

After all, there is no barrier between the loads of c0 and c1, so they can be freely reordered. Like wise can the stores of c0 and c1 can be reordered.

Generally we use an atomic variable, with either memory fences or C11 atomic functions to synchronize other variables. So not only do you need to match acquire operations with release operations, but for them to match they must be associated with the same memory location. And then you would somehow use the value contained there to make sure the threads do not race.

If your goal is to make sure that two counters are incremented always in sync, the easiest way would not make either of them atomic. Instead you'd use a third variable to control them. Something like this untested code:

Code:

int long long unsigned c1=0, c2=0;
_Bool static volatile __attribute__( (aligned(128)) )
  c3 = 0;

void increment(){  
  _Bool expected= 0;
  while(!__atomic_compare_exchange_n(&c3, &expected, 1, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED));
  ++c1;
  ++c2;
  __atomic_store_n(&c3, 0, __ATOMIC_RELEASE);
}

void read(){
  _Bool expected= 0;
  while(!__atomic_compare_exchange_n(&c3, &expected, 1, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED));
  if(c1 !=c2)
    printf( "uh-oh! c0 is %llu and c1 is %llu\n", c0, c1 );
  __atomic_store_n(&c3, 0, __ATOMIC_RELEASE);   
}

This is all a little bit easier in C11.

**Toby Douglass** · 01-22-2017

Mir - it's okay; we're not actually trying to get two values to update atomically :-)

There's been some discussion of the nature of memory barriers and atomic ops, and I had written some code to try to see if it was possible to demonstrate an earlier assertation of mine.

Codeplug and I are now clarifying a particular point relating to memory barriers.

**Toby Douglass** · 01-23-2017

As an aside, the method described is a user-mode mutex. I may be wrong, but I think it is not lock-free and will not scale. It would be interesting to think of a lock-free method to ensure two variables can be atomically updated together. I think probably the MCAS method from the Cambridge Group does this, but I have not learned how it works, and I have read it does not perform well.

I could imagine something like a lock-free queue, where threads enqueue an update, and then process all the queued updates.

**Codeplug** · 01-23-2017

>> Reader will always see one of two results: 1) c0 and c1 are the same value, 2) c0 is greater than c1.
That is clearly wrong based on Toby's and Mir's examples.

If you need a mutex, I tend to use a CriticalSection or pthread_mutex_t. They already have user-mode optimizations that allow them to take a fast-path when acquiring. They tend to include yields after the initial check, as apposed to a while(CAS());.

gg

**Toby Douglass** · 01-23-2017

This brings us to the work done with local_c0 and local_c1 (and will after this bring us back to the point actually in hand);

Code:

  reader_start:
  {
    local_c0 = c0;
    local_c1 = c1;
    __atomic_thread_fence( __ATOMIC_ACQUIRE );
    if( local_c0 == c0 and local_c1 == c1 )
      if( local_c0 > local_c1 )
        printf( "uh-oh! c0 is %llu and c1 is %llu\n", local_c0, local_c1 );
  } 
 goto reader_start;

So, here we make local copies of c0 and c1. They can be absolutely any value at all - any ordering whatsoever. We might see tons of c0 increments and none at all of c1. This is all fine.

We then issue the load barrier. At this point, we process the outstanding invalidation request queue.

Once this is done, we then look again at c0 and c1. If they have changed - i.e. they differ from local_c0 and/or local_c1, then we had invalidation requests which we had not yet seen when we loaded into c0 and c1, or which occurred after those loads.

If this is so, then we're out - we go around and try again.

However, if the values in c0 and c1 are *the same as before the load barrier*, then there were *no* invalidation requests pending from the time we loaded c0 to the time we compared local_c0 with c0 and local_c1 with c1.

Now it may be the very instant after the compare the invalidation request queue refills to the brim - but for us, it doesn't matter; what we *know* now is that *even though we issued a load barrier*, the values in c0 and c1 *did not change by the time of our if() *AFTER* the load barrier*, and *THEREFORE* we have *actually* got hold of the real values at that point.

We're basically looping over a race condition, until it works, and when it does finally work, then we know we have values which were correctly propagated from the storing physical core to ourselves. Of course, the very instant after the if() the actual values may change - but we are still working in our code with local_c0 and local_c1, the copies of the values which we have proved to be correct with regard to each other; we have access to something which at least was true at some point. This is useful, because we then make decisions based on it, and then try to apply those decisions using atomic operations. If the truth has moved on enough that the data upon which we made our decisions has changed, such that our decision is wrong, the atomic operation where we try to apply our decision will fail, and then we iterate around again in some higher-level loop.

So you can pretty much use load barriers as lines in the sand where you can get real values - but you have to do some extra work.

After this, we will go back to whether or not store barriers actually cause all stores to complete by the time they return, or if only by the time of the first store after the store barrier.

**Toby Douglass** · 01-24-2017

When I say "got hold of real values", I mean to say "got hold of values which are correctly ordered".

**King Mir** · 01-26-2017

An acquire fence will prevent loads from being moved before the fence, but will not prevent load from being moved bellow. So your code is equivalent to:

Code:

for(;;)
 {
   __atomic_thread_fence( __ATOMIC_ACQUIRE );
   local_c0 = c0;
   local_c1 = c1;
   if( local_c0 == c0 and local_c1 == c1 )
     if( local_c0 > local_c1 )
       printf( "uh-oh! c0 is %llu and c1 is %llu\n", local_c0, local_c1 );
 }

That's not very useful.

An acquire barrier is not a yield. It does not in any way signal anything to another thread. All it does it prevent your compiler and hardware from reordering instructions and forces memory consistency. At most this may cause a stall on some architectures until cache changes are visible from other cores.