Of course. However, the paper does not cover everything. It describes the algorithm and leaves out all system-specific implementation details, such as memory barriers. If the pseudo-code in the paper is implemented *directly*, *it is unsafe*.
It doesn't. "Compiler code motion is addressed by ACCESS_ONCE() on lines 6, 7, and 9, which may be implemented either as volatile casts or as C11/C++11 volatile relaxed atomic loads and stores."
So this takes care of compiler code motion with respect to other ACCESS_ONCE() calls. It also implies that all loads/stores are to memory. The cache coherency of the architecture will determine any possible load/store reorderings that another processor may observe.
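To make that concrete, here is a minimal sketch of the two implementation styles the paper mentions. The macro body and the variable names (`plain_slot`, `atomic_slot`) are my assumptions for illustration; the volatile-cast form is the style historically used by the Linux kernel, and the relaxed-atomic form is the C11 equivalent.

```c
#include <stdatomic.h>

/* One common rendering of ACCESS_ONCE() as a volatile cast (the style
 * historically used by the Linux kernel). The macro body is an
 * assumption about that implementation, not taken from the paper.
 * __typeof__ is a GCC/Clang extension. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static int plain_slot;          /* accessed via ACCESS_ONCE() */
static _Atomic int atomic_slot; /* accessed via relaxed atomics */

/* Volatile-cast style: the compiler may not cache, tear, or fuse
 * the access, but no ordering against other memory operations is
 * implied. */
int read_plain(void)    { return ACCESS_ONCE(plain_slot); }
void write_plain(int v) { ACCESS_ONCE(plain_slot) = v; }

/* C11 style: a relaxed atomic load/store gives the same
 * compiler-level guarantee, again with no inter-thread ordering. */
int read_atomic(void)
{
    return atomic_load_explicit(&atomic_slot, memory_order_relaxed);
}
void write_atomic(int v)
{
    atomic_store_explicit(&atomic_slot, v, memory_order_relaxed);
}
```

Note that neither form, by itself, constrains what another processor observes; that is exactly the point about hardware reordering below.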
Store buffers (and invalidation requests) mean that cache coherency, as you and I would naturally think of it, *does not occur*.
You have to ensure that it does, with memory barriers *and forced writes to memory to ensure memory barriers are honoured*.
At least, that was my view - Dr. McKenney is arguing that the latency on write propagation is so low you can in effect ignore it.
I'm benchmarking to find out if that's really so, and also to check whether atomic operations do (as I argue they do) force a write to memory.
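To give an idea of the kind of probe I mean, here is a minimal sketch, assuming POSIX threads. The names (`consumer`, `run_probe`) are hypothetical; a real benchmark would pin threads to cores, repeat many times, and compare a plain relaxed store against an atomic RMW.

```c
#include <stdatomic.h>
#include <pthread.h>

static _Atomic int flag;  /* published by one thread, observed by another */
static long spins;        /* crude proxy for propagation latency */

/* Spin until the flag's store propagates to this thread, counting
 * iterations along the way. */
static void *consumer(void *arg)
{
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        spins++;
    return NULL;
}

/* Publish the flag with a plain relaxed store - no barrier, no atomic
 * RMW - and return how long the consumer spun before seeing it. */
long run_probe(void)
{
    pthread_t t;
    pthread_create(&t, NULL, consumer, NULL);
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
    pthread_join(t, NULL);
    return spins;
}
```

The spin count is only a rough signal, but the interesting comparison is how it changes when the store is replaced by, say, `atomic_exchange`.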
Specifically and exactly, and as forcefully as I can put it - *it does not*. Memory barriers have *no* impact on *when* a store occurs. *None at all*. They affect, and *only* affect, ordering. "Compiler and CPU code motion is addressed by the memory barrier on line 8, ..."
Line 8 is "smp_mb()". This tells me it's just a full fence for code and HW. This means all loads/stores complete before crossing the fence.
When you look at that smp_mb(), you must look at it and think, and only think - okay, when this physical core DOES do a write (if it ever does), then everything before this barrier will go out before everything after it.
You *cannot* think - ah, all the stores will be complete when I return from this barrier call. It is not so.
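To pin down what the barrier actually buys you, here is a sketch of the publish-then-revalidate step, with smp_mb() rendered as a C11 seq_cst fence. The names `hazard_slot` and `list_head` are my stand-ins for the paper's hazard-pointer record and the pointer being protected; this is an illustration of the pattern, not the paper's code.

```c
#include <stdatomic.h>
#include <stddef.h>

static _Atomic(void *) hazard_slot; /* this thread's hazard pointer */
static _Atomic(void *) list_head;   /* the shared pointer being protected */

/* Publish a hazard pointer, then re-read the protected pointer. The
 * fence only orders the hazard store before the re-read of list_head
 * within this thread's stream of accesses; it says nothing about
 * *when* the store becomes visible to other cores. */
void *acquire_protected(void)
{
    void *p = atomic_load_explicit(&list_head, memory_order_relaxed);
    void *q;
    do {
        q = p;
        atomic_store_explicit(&hazard_slot, q, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* smp_mb() */
        p = atomic_load_explicit(&list_head, memory_order_relaxed);
    } while (p != q);  /* head changed underneath us: retry */
    return p;
}
```

The whole argument here is about the window between the `atomic_store_explicit` and the moment a reclaiming thread's scan can actually observe it.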
However, as mentioned earlier, Dr. McKenney argues that store propagation latency is so low that, although the memory barrier has no impact on the completion of stores, they in fact complete so quickly that no other thread can end up missing the hazard pointer that has been set by this thread.
My concern is that there may be many factors which can increase that propagation time - heavy load, many sockets, weak memory models, etc. - and situations which minimize the time needed for the race condition to occur (very few hazard pointers to scan).