Trying to understand hazard pointer implementation

**Toby Douglass** · 12-31-2016

Hi.

I am implementing hazard pointers.

The original paper is here;

Index of /downloads/white papers/[SMR]%20-%20[Michael]%20-%20Hazard%20Pointers;%20Safe%20Memory%20Reclaimati on%20for%20Lock-Free%20Objects.pdf

McKenney in this white paper;

Structured Deferral: Synchronization via Procrastination - ACM Queue

gives some example code and looks like this (copied nearly verbatim, I moved one opening curley brace);

Code:

1  int hp_record(void **p, void **hp)
2  {
3    void *tmp;
4    tmp = ACCESS_ONCE(*p);
5    ACCESS_ONCE(*hp) = tmp;
6    smp_mb();
7    if (tmp != ACCESS_ONCE(*p))
8    {
9      ACCESS_ONCE(*hp) = NULL;
10     return 0;
11   }
12   return 1;
13 }

We see here in this code there are no atomic operations. The hazard pointer is set on line 5 (and confirmed on line 7) and there is a full barrier on line 6 - this means that in principle, the write we made to "*hp" may never be seen by any other core, ever.

Imagine I want to set a hazard pointer.

The first thing I must do is get a copy of the pointer I wish to protect. I can actually do this using the code above, i.e. I get a copy by putting it into the hazard pointer, but the problem is that code does not use an atomic operation at all. It sets the hazard pointer on line 5, issues a load barrier (as part of a full barrier) on line 6 and then checks the hazard pointer is still correct on line 7; but there is in none of this a forced write to memory (an atomic operation) so in principle the hazard pointer which has been set may *never* been seen by any physical core - i.e. *it has not been made visible*, which means it cannot work.

I may well be mistaken, but reading the original white paper, the intent of line 7 appears to be to ensure that the hazard pointer is visible to other threads.

I do not see how this can be so because store buffers mean that the write to the hazard pointer on line 5 is not necessarily visible to other threads, and the check on line 7 does not address this problem - it only checks that the invalidation requests serviced by the load part of the full barrier on line 6 have caused "*p" to change, which is to say, ensures that the value written to the hazard pointer is correct. I believe it is correct, but I also - as above - believe it is not yet visible to other physical cores, which is a fatal problem.

I am wondering if I properly and fully understand how memory barriers and atomic operations work.

**salquestfl** · 01-01-2017

SMP full memory barrier means reads and writes before the barrier are visible to all processors. (both software and hardware (cache symchronization) ensures this - refer Memory Barriers)
I don't understand your particular example - because we check by reading the same memory location we have written, and here the program writes to "hp" and reads from "p". But, maybe the meaning of the full memory barrier would help you.

**Toby Douglass** · 01-01-2017

Originally Posted by salquestfl

SMP full memory barrier means reads and writes before the barrier are visible to all processors.

I may be wrong, but I think this is not so.

Paul McKenney seems very clear in his document regarding memory barriers that the barrier itself does *not* induces stores to *actually occur*.

"There is no guarantee that any of the memory accesses specified before a memory barrier will be _complete_ by the completion of a memory barrier instruction [...]"

The link you provided does not explicitly discuss this matter - it is very weakly implied in the first sentence, "Memory barriers are used to provide control over the order of memory accesses."

It is very natural to imagine barriers act to force all stores to occur (took me a good while to get past that), and so that they do not needs to be explicitly and firmly stated.

If you assume for now this is true, and look again at the code, you can see why I'm puzzled!

**Toby Douglass** · 01-01-2017

I'm wondering if the code is actually specifically for Intel. I remember reading (but not the details) that Intel with multiple cores offers quite strong guarantees of cross-processor read and write visibility.

**Toby Douglass** · 01-01-2017

salquestfi - it occurred to me that smp_mb() may be doing more than I think (I was thinking it was a memory barrier only), and I have more properly read the article you linked, which seems to say this is so : "It guarantees that any memory access initiated before the memory barrier will be complete before passing the barrier [...]"

I've been looking at __smp_mb() in the Linux source code, and apart from x86, where an atomic instruction has to be issued to issue a memory barrier, it appears to be memory barriers only; no atomic writes. This contradicts the statement in the article.

**Toby Douglass** · 01-02-2017

I emailed Paul McKenney.

He said that the time taken for the write to propagate is so short that it's not a problem.

So my understanding was correct - the store does not force writes to complete, and so there is a race condition - but the counter-argument is that the window in which the race condition can occur is so short that it is never a problem.

I'm now benchmarking to see how long propagation takes, how much variation there is (we care, after all, about the worst case) and if an atomic operation forces the store buffers to flush (and I expect it to).

**Codeplug** · 01-03-2017

When reading the paper, I would stick to the paper's definition of things over anything else.

"Compiler code motion is addressed by ACCESS_ONCE() on lines 6, 7, and 9, which may be implemented either as volatile casts or as C11/C++11 volatile relaxed atomic loads and stores."

So this takes care of compiler code motion with respect to other ACCESS_ONCE() calls. It also implies that all loads/stores are to memory. The cache coherency of the architecture will determine any possible load/store reorderings that another processor may observe.

"Compiler and CPU code motion is addressed by the memory barrier on line 8, ..."

Line 8 is "smp_mb()". This tells me it's just a full fence for code and HW. This means all loads/stores complete before crossing the fence. This does affect what reorderings another processor may observe.

gg

**Toby Douglass** · 01-03-2017

Originally Posted by Codeplug

When reading the paper, I would stick to the paper's definition of things over anything else.

Of course. However, the paper does not deal with all things. It describes the algorithm - it leaves out all system specific implementation details, such as memory barriers. If the pseudo-code in the paper is *directly* implemented, *it is unsafe*.

"Compiler code motion is addressed by ACCESS_ONCE() on lines 6, 7, and 9, which may be implemented either as volatile casts or as C11/C++11 volatile relaxed atomic loads and stores."

So this takes care of compiler code motion with respect to other ACCESS_ONCE() calls. It also implies that all loads/stores are to memory. The cache coherency of the architecture will determine any possible load/store reorderings that another processor may observe.

It doesn't.

Store buffers (and invalidation requests) mean that cache coherency as you and I would naturally think of it, *does not occur*.

You have to ensure that it does, with memory barriers *and forced writes to memory to ensure memory barriers are honoured*.

At least, that was my view - Dr. McKenney is arguing the latency on write propagation is so low you can in effect ignore it.

I'm benchmarking to find out if it's really so, and also to check if atomic operations do (as I am arguing they do) force a write to memory.

"Compiler and CPU code motion is addressed by the memory barrier on line 8, ..."

Line 8 is "smp_mb()". This tells me it's just a full fence for code and HW. This means all loads/stores complete before crossing the fence.

Specifically and exactly, and as forcefully as I can put it - *it does not*. Memory barriers have *no* impact on *when* a store occurs. *None at all*. They affect and *only* affect ordering.

When you look at that smp_mb(), you must look at it and think and only think - okay, when this physical core DOES do a write (if it ever does), then everything before this barrier will go out before eveything after.

You *cannot* think - ah, all the stores will be complete when I return from this barrier call. It is not so.

However, as mentioned earlier, Dr. McKenney argues latency on the propagation of stores is so low that although the memory barrier has no impact on the completion of stores, in fact they complete so quickly that no other thread can end up missing the hazard pointer that has been set by this thread.

My concern is that there may be many factors which can increase that propagation time - heavy load, many sockets, weak memory models, etc - and situations where the time required for the race condition to occur is minimized (very few hazard pointers).

**Codeplug** · 01-06-2017

>> Store buffers (and invalidation requests) mean that cache coherency as you and I would naturally think of it, *does not occur*. You have to ensure that it does ...

The architecture is either cache coherent, or cache non-coherent. The paper assumes a cache coherent architecture. Non-coherent architectures are much more exotic. By definition, the write *does occur* on a cache coherent architecture.

Originally Posted by Wikipedia

In a read made by a processor P1 to location X that follows a write by another processor P2 to X, with no other writes to X made by any processor occurring between the two accesses and with the read and write being sufficiently separated, X must always return the value written by P2. This condition defines the concept of coherent view of memory. Propagating the writes to the shared memory location ensures that all the caches have a coherent view of the memory. If processor P1 reads the old value of X, even after the write by P2, we can say that the memory is incoherent.

>> "with the read and write being sufficiently separated,..."

This separation is extremely small, but it doesn't matter - it's just a race condition to be avoided using tools like memory barriers.

>> When you look at that smp_mb(), you must look at it and think and only think - okay, when this physical core DOES do a write (if it ever does), then everything before this barrier will go out before everything after.

This is not how cache coherency and membars work. When there is a store that occurs before a fence, that store will be cache-coherent with all other CPU's after the fence.

gg

**Codeplug** · 01-06-2017

It seems we are both saying the same things somewhat about fences. As a mental model, I think of a fence as "this is where things before the fence *complete*". But in reality it's really just a marker-in-time that defines the ordering of load/stores, before and after the fence, as seen by other processors.

The later is accurate, but I've adopted that mental model.

gg

**Toby Douglass** · 01-08-2017

Originally Posted by Codeplug

>>
The architecture is either cache coherent, or cache non-coherent. The paper assumes a cache coherent architecture. Non-coherent architectures are much more exotic.

I am in what I am arguing speaking of MESI (or a full variant thereof) compliant systems, such as the Intel or AMD systems we use in our laptops and so on.

By definition, the write *does occur* on a cache coherent architecture.

If we consider the *principle* of a cache coherent system, then yes, of course - by definition it is so. If we consider actual systems - ARM, Intel, AMD, etc - then it is not so. Store buffers and invalidation requests get in the way. The statement which is true in principle is not true in practise, because it says nothing about memory barriers and so on.

So we do consider for example our Intel laptops cache coherent - they offer MESI or something evolved from it - but in fact, in practise, they are not cache coherent, and only become so when we as developers take additional steps in software to ensure that it is so.

>> "with the read and write being sufficiently separated,..."

This separation is extremely small, but it doesn't matter - it's just a race condition to be avoided using tools like memory barriers.

No guarantees are provided, as far as I know, by any processor manufacturers, on the period for which a store buffer will pend a write.

As such, merely waiting is not guaranteed to lead to the write occurring.

I agree that the issue is resolved by using memory barriers (and I suspect also forced writes to memory, to force the honouring of the barrier).

>> When you look at that smp_mb(), you must look at it and think and only think - okay, when this physical core DOES do a write (if it ever does), then everything before this barrier will go out before everything after.

This is not how cache coherency and membars work. When there is a store that occurs before a fence, that store will be cache-coherent with all other CPU's after the fence.

I think this is incorrect.

The store barrier only enforces ordering. It does not, and specifically does not, force writes to become visible.

Quote from the "LINUX KERNEL MEMORY BARRIERS" doc, written by Howells and McKenney (and I trust McKenney, where I do not trust the Wikipedia);

"There are certain things that the Linux kernel memory barriers do not guarantee:

(*) There is no guarantee that any of the memory accesses specified before a memory barrier will be _complete_ by the completion of a memory barrier instruction; the barrier can be considered to draw a line in that CPU's access queue that accesses of the appropriate type may not cross."

**Toby Douglass** · 01-08-2017

Originally Posted by Codeplug

It seems we are both saying the same things somewhat about fences. As a mental model, I think of a fence as "this is where things before the fence *complete*". But in reality it's really just a marker-in-time that defines the ordering of load/stores, before and after the fence, as seen by other processors.

The later is accurate, but I've adopted that mental model.

I may be wrong, but I think the inaccuracy doesn't help you; it's only an inaccuracy. Why not be accurate?

However, it is accurate for load barriers. Considering load barriers, the load barrier will force the processing of the current invalidation request queue. After the load barrier, we will be caught up with every store published prior to that time by every other physical processor; real work is actually done, and must be complete, by the time the load barrier returns.

Of course, the instant after this, however, in a lock-free system, any number of any operations could occur. The invalidation request queue could instantly become absolutely full again. All we get for our efforts is the knowledge that anything published prior to the load barrier *will* have been seen. This is all we need of course if we're dealing with allocations which although visible have been removed from a shared data structure, where by their removal other threads cannot access those allocations any more to do anything to them.

**Codeplug** · 01-10-2017

Let's start over.

Code:

int hp_record(void **p, void **hp)
{
   void *tmp;
   tmp = ACCESS_ONCE(*p);
   ACCESS_ONCE(*hp) = tmp;
   smp_mb();
   if (tmp != ACCESS_ONCE(*p))
   {
     ACCESS_ONCE(*hp) = NULL;
     return 0;
   }
   return 1;
}

Originally Posted by Toby

... I get a copy by putting it into the hazard pointer, but the problem is that code does not use an atomic operation at all. It sets the hazard pointer on line 5, issues a load barrier (as part of a full barrier) on line 6 and then checks the hazard pointer is still correct on line 7; but there is in none of this a forced write to memory (an atomic operation) so in principle the hazard pointer which has been set may *never* been seen by any physical core - i.e. *it has not been made visible*, which means it cannot work.

Atomic operations and "forced writes to memory" are not the same or even related. An operation is atomic if it can't be interrupted by a context switch. Atomic operations can work on memory or registers. "Forced writes to memory" simply means we are not writing to a CPU register.

The paper already claims that ACCESS_ONCE() implements a forced write to memory via volatile accesses or the C11/C++11 atomics-interface. The memory barrier then takes care of compiler and CPU code motion (aka ordering of visibility).

So if multiple threads are executing hp_record() with different values of "p" and the same value for "hp", then there will always been only 1 winner that returns 1. The loosing threads with return 0.

gg

**Toby Douglass** · 01-13-2017

Originally Posted by Codeplug

Atomic operations and "forced writes to memory" are not the same or even related.

I think - I am *not* certain - but I think atomic ops bypass store buffers. They force a write to memory - not just a write to a store buffer. This then I think also since it is a write to memory must cause earlier store barriers to be honoured, which means earlier stores must be flushed to memory prior to the write of the stomic operation.

An operation is atomic if it can't be interrupted by a context switch.

Also, cannot be interrupted by other cores issuing atomic ops on the same memory location.

Atomic operations can work on memory or registers.

I think it is memory only? On Intel, I'm pretty sure atomic op targets are always memory. On LL/SC, you can use the full instruction set, but the atomicity occurs when you do the SC, i.e. when you write (or try to write) to memory.

The paper already claims that ACCESS_ONCE() implements a forced write to memory via volatile accesses or the C11/C++11 atomics-interface. The memory barrier then takes care of compiler and CPU code motion (aka ordering of visibility).

I think this is wrong. Since the operation is not atomic - it's just an ordinary write, and using volatile doesn't change this - it goes to a store buffer and so in theory can be indefinitely postponed.

So if multiple threads are executing hp_record() with different values of "p" and the same value for "hp", then there will always been only 1 winner that returns 1. The loosing threads with return 0.

That's exactly what I think *isn't* true, which is why I'm concerned (and was originally confused, since it *needs* to be true).

**King Mir** · 01-14-2017

Originally Posted by Toby Douglass

I think - I am *not* certain - but I think atomic ops bypass store buffers. They force a write to memory - not just a write to a store buffer. This then I think also since it is a write to memory must cause earlier store barriers to be honoured, which means earlier stores must be flushed to memory prior to the write of the stomic operation.

No, atomics can be an ordinary move, especially on intell. In general an an atomic operation is a read and/or write that is preceded and/or succeeded by memory fences. See C/C++11 mappings to processors

Also, cannot be interrupted by other cores issuing atomic ops on the same memory location.

Nothing special needs to happen to achieve that, on a cache coherent architecture.

I think it is memory only? On Intel, I'm pretty sure atomic op targets are always memory. On LL/SC, you can use the full instruction set, but the atomicity occurs when you do the SC, i.e. when you write (or try to write) to memory.

The concept of atomicity is not specific to memory, but registers aren't shared between cores, so copying between registers does not need synchronisation. However, the entire point of atomics is to talk between threads, so a register atomic would not be useful.

I think this is wrong. Since the operation is not atomic - it's just an ordinary write, and using volatile doesn't change this - it goes to a store buffer and so in theory can be indefinitely postponed.

It's a forced memory write in the sense that repeated write calls are guaranteed to write to memory, and not be elided. That's what volatile means. Nothing skips the store buffer.

That's exactly what I think *isn't* true, which is why I'm concerned (and was originally confused, since it *needs* to be true).

Isn't that the entire point of this code?

Thread: Trying to understand hazard pointer implementation

Thread Tools

Search Thread

Display

Trying to understand hazard pointer implementation

Similar Threads

Suggestion Smart Pointer Class Implementation

This Program may help you to understand pointer

Need help in analyzing pointer and macro implementation.

Help for understand pointer

Don't understand backtrace on invalid pointer

Tags for this Thread