Thread: Trying to understand hazard pointer implementation

  1. #16
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> ... but I think atomic ops bypass store buffers.
    Store buffers are just one of many techniques that cause visibility and reordering issues. Whatever the HW employs to cause things to appear non-sequential - the solution is memory barriers. Once you understand what the HW can do under the hood, you can forget about it - knowing that your membars will just "do the right thing".
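
    For example, the usual publish/consume pairing - just a sketch, using GCC's __atomic builtins (any equivalent membars will do):

    Code:
    #include <assert.h>
    
    /* Sketch only: thread creation omitted; GCC __atomic builtins assumed. */
    static volatile int payload;
    static volatile int ready;  /* 0 = not yet published */
    
    void writer( void )
    {
      payload = 42;
      __atomic_thread_fence( __ATOMIC_RELEASE );   /* payload ordered before flag */
      ready = 1;
    }
    
    void reader( void )
    {
      if( ready )
      {
        __atomic_thread_fence( __ATOMIC_ACQUIRE ); /* flag ordered before payload */
        assert( payload == 42 );                   /* holds once ready == 1 is seen */
      }
    }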

    gg

  2. #17
    Registered User
    Join Date
    Dec 2009
    Posts
    83
    Quote Originally Posted by Codeplug View Post
    >> ... but I think atomic ops bypass store buffers.
    Store buffers are just one of many techniques that cause visibility and reordering issues. Whatever the HW employs to cause things to appear non-sequential
    Yes. However, I think it all reduces down to something simple; whatever you store may never be seen by other physical cores at all, and if it is seen, it may be seen in any order. I have one unanswered question here though, which has to do with stores to the same location. I wonder whether these are guaranteed to be seen in the order they were stored, or not.

    - the solution is memory barriers. Once you understand what the HW can do under the hood, you can forget about it - knowing that your membars will just "do the right thing".
    I may be wrong, but I think this is not so. Memory barriers do solve the ordering problem, but they do not solve the visibility problem; what you store *if it becomes visible* will become visible in the correct order (which is to say, the order as constrained by store barriers), but there is no guarantee it *will become visible*. A forced write to memory (and only a forced write to memory) guarantees that earlier stores will then be visible (as constrained by such store barriers issued up to that point).

    However, we are now repeating ourselves =-) I think you understand what I mean, and I understand what you mean, so that's that :-)
    Last edited by Toby Douglass; 01-16-2017 at 08:58 AM.

  3. #18
    Registered User
    Join Date
    Dec 2009
    Posts
    83
    The quoting by the forum is *appalling*. It only quotes that which you wrote - it does not include what you are replying to. I am manually adding it back in, by copying the original text and then adding in ALL the markup, which is absolutely awful.

    I think - I am *not* certain - but I think atomic ops bypass store buffers. They force a write to memory - not just a write to a store buffer. Since it is a write to memory, I think it must also cause earlier store barriers to be honoured, which means earlier stores must be flushed to memory prior to the write of the atomic operation.
    No, atomics can be an ordinary move, especially on Intel. In general, an atomic operation is a read and/or write that is preceded and/or succeeded by memory fences. See C/C++11 mappings to processors
    Yes. I have seen this, and it confused me when I first saw it, and I now think it wrongly named. An ordinary store, regardless of memory barriers, may never be seen by any other physical core, and I expect it also falls prey as usual to the load-modify-store problem, exemplified by multiple threads incrementing the same counter; counts are lost. In my book, atomic means in such a scenario, the count is correct; no counts are lost.
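
    To illustrate what I mean - a sketch, assuming GCC's __atomic builtins, with two threads each running one of these loops on the same counter:

    Code:
    /* Sketch, assuming GCC __atomic builtins; thread creation omitted.
       Two threads each run one of these loops on the same counter. */
    int long long unsigned static volatile counter = 0;
    
    void increment_plain( void )
    {
      int i;
      for( i = 0 ; i < 1000000 ; i++ )
        counter++;    /* separate load, modify, store - counts are lost */
    }
    
    void increment_atomic( void )
    {
      int i;
      for( i = 0 ; i < 1000000 ; i++ )
        __atomic_fetch_add( &counter, 1, __ATOMIC_RELAXED );  /* no counts lost */
    }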

    Similarly, when thinking of this (so-called) atomic load, the invalidation request queue is cleared immediately prior to the load - but then anything can happen, and the queue can become completely full, and so the load can be of an invalid-but-not-yet-invalidated cache line. This problem does not happen with (what I would call) atomic loads. You get what was really there at that moment.

    Also, cannot be interrupted by other cores issuing atomic ops on the same memory location.
    Nothing special needs to happen to achieve that, on a cache coherent architecture.
    It was only an observation, for completeness.

    I think it is memory only? On Intel, I'm pretty sure atomic op targets are always memory. On LL/SC, you can use the full instruction set, but the atomicity occurs when you do the SC, i.e. when you write (or try to write) to memory.
    The concept of atomicity is not specific to memory, but registers aren't shared between cores, so copying between registers does not need synchronisation. However, the entire point of atomics is to talk between threads, so a register atomic would not be useful.
    On Intel and ARM, as far as I know, the instructions for atomics do not support a register target; it was with this in mind also that I wrote.

    I think this is wrong. Since the operation is not atomic - it's just an ordinary write, and using volatile doesn't change this - it goes to a store buffer and so in theory can be indefinitely postponed.
    It's a forced memory write in the sense that repeated write calls are guaranteed to write to memory, and not be elided. That's what volatile means. Nothing skips the store buffer.

    [snip]

    That's exactly what I think *isn't* true, which is why I'm concerned (and was originally confused, since it *needs* to be true).
    Isn't that the entire point of this code?
    I'm not sure what you mean. Can you elucidate?
    Last edited by Toby Douglass; 01-16-2017 at 08:41 AM.

  4. #19
    Registered User
    Join Date
    Apr 2006
    Posts
    2,149
    Quote Originally Posted by Toby Douglass View Post
    Yes. I have seen this, and it confused me when I first saw it, and I now think it wrongly named. An ordinary store, regardless of memory barriers, may never be seen by any other physical core, and I expect it also falls prey as usual to the load-modify-store problem, exemplified by multiple threads incrementing the same counter; counts are lost. In my book, atomic means in such a scenario, the count is correct; no counts are lost.
    An atomic increment would indeed be more than just a move instruction. But a store to an atomic variable can be. Read-modify-write operations require either hardware support, or a spinlock. Like move instructions, those operations may also require barriers around them.
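
    For example (C11 atomics; the x86 instruction notes are typical compiler output, not guarantees):

    Code:
    #include <stdatomic.h>
    
    atomic_int x;
    
    void store_relaxed( int v )   /* can compile to a plain MOV on x86 */
    {
      atomic_store_explicit( &x, v, memory_order_relaxed );
    }
    
    void store_seq_cst( int v )   /* typically MOV + MFENCE, or XCHG */
    {
      atomic_store( &x, v );
    }
    
    int add_one( void )           /* read-modify-write: LOCK XADD on x86 */
    {
      return atomic_fetch_add( &x, 1 );
    }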

    Similarly, when thinking of this (so-called) atomic load, the invalidation request queue is cleared immediately prior to the load - but then anything can happen, and the queue can become completely full, and so the load can be of an invalid-but-not-yet-invalidated cache line. This problem does not happen with (what I would call) atomic loads. You get what was really there at that moment.
    Firstly, invalid cachelines are at a lower level of understanding than needed for C's memory model. Thinking of cachelines seems to be confusing you, so I suggest you don't, until you understand C's memory model at a higher level.

    Secondly, there's no such thing as "what was really there"; C is at a higher level than hardware, so this definition is meaningless in a C context. Unsynchronized memory access means instructions can be arbitrarily reordered, provided that data dependency is preserved. Even with relaxed atomics this is true. You can't reason about them as if two threads execute in program order. You can only do that if you use sequentially consistent memory order, which implies the use of memory barriers. Now if you do use sequentially consistent atomic loads and stores, then you don't need memory barriers, because those are already built into the operation.


    I'm not sure what you mean. Can you elucidate?
    I just meant the code is described as working, so if it doesn't make sense to you, it's probably because you don't understand something.

    Quote Originally Posted by Toby Douglass View Post
    Yes. However, I think it all reduces down to something simple; whatever you store may never be seen by other physical cores at all, and if it is seen, it may be seen in any order.
    That can be true for weak memory architectures. However, working in C, you can reason at a higher level. If your code is synchronized by the right kind of atomics, or by locks, loads and stores will have the order you need.

    I have one unanswered question here though, which has to do with stores to the same location. I wonder whether these are guaranteed to be seen in the order they were stored, or not.
    The answer is yes, provided that there is no data race. In other words, C disallows two threads from writing to the same memory location at the same time, unless the variable is atomic. This allows compiler optimizations, but as mentioned above, once compiled, relaxed atomic writes are often simple move instructions.

    I may be wrong, but I think this is not so. Memory barriers do solve the ordering problem, but they do not solve the visibility problem; what you store *if it becomes visible* will become visible in the correct order (which is to say, the order as constrained by store barriers), but there is no guarantee it *will become visible*. A forced write to memory (and only a forced write to memory) guarantees that earlier stores will then be visible (as constrained by such store barriers issued up to that point).
    A memory barrier does ensure that memory operations become visible; that's part of its purpose. The details depend on which type of memory barrier is used.

    You don't actually ever want to force a write to main memory. At most you would invalidate the L1 and L2 caches of other threads, but this is an implementation detail of memory barriers.
    It is too clear and so it is hard to see.
    A dunce once searched for fire with a lighted lantern.
    Had he known what fire was,
    He could have cooked his rice much sooner.

  5. #20
    Registered User
    Join Date
    Dec 2009
    Posts
    83
    Quote Originally Posted by King Mir View Post
    An atomic increment would indeed be more than just a move instruction. But a store to an atomic variable can be. Read-modify-write operations require either hardware support, or a spinlock. Like move instructions, those operations may also require barriers around them.
    I think when you say atomic you mean a store without word tearing?

    Secondly, there's no such thing as "what was really there"; C is at a higher level than hardware, so this definition is meaningless in a C context. Unsynchronized memory access means instructions can be arbitrarily reordered, provided that data dependency is preserved. Even with relaxed atomics this is true. You can't reason about them as if two threads execute in program order. You can only do that if you use sequentially consistent memory order, which implies the use of memory barriers. Now if you do use sequentially consistent atomic loads and stores, then you don't need memory barriers, because those are already built into the operation.
    When working with lock-free data structures, there is fairly often a need to know that a given thread will have seen operations performed by other threads *at least up to a certain point in time*. In *that* sense, you need to make sure you know "what is really there", so that you never end up having a gap between what you read and the current time.

    Yes. However, I think it all reduces down to something simple; whatever you store may never be seen by other physical cores at all, and if it is seen, it may be seen in any order.
    That can be true for weak memory architectures. However, working in C, you can reason at a higher level. If your code is synchronized by the right kind of atomics, or by locks, loads and stores will have the order you need.
    I meant to say that in the absence of any special steps, the situation is as described.

    I have one unanswered question here though, which has to do with stores to the same location. I wonder whether these are guaranteed to be seen in the order they were stored, or not.
    The answer is yes, provided that there is no data race. In other words, C disallows two threads from writing to the same memory location at the same time, unless the variable is atomic. This allows compiler optimizations, but as mentioned above, once compiled, relaxed atomic writes are often simple move instructions.
    The question was with regard to a single thread performing those stores and the visibility of those stores to other threads. The single thread is the only writer to the variables in question; is it possible for those writes to be seen out-of-order by other threads, on all reasonable modern hardware (i.e. all major CPU families, embedded systems, etc)? I *think* the answer might be no, but I am not really sure.

    As an aside, the C I'm using (C89) does permit multiple threads to write non-atomically to the same variable. This behaviour can sometimes be utilized by lock-free data structures to improve performance. It will be a problem if it goes away! However, I don't see how the language can know it is happening. The compiler would have to move beyond seeing the world as a single thread at a time.

    Thinking about it, though, I think what you mean is that the compiler does not support the *correct* writing to the same variable by two or more threads. I don't think you can mean the compiler *prevents* it ("disallows").

    A memory barrier does ensure that memory operations become visible; that's part of its purpose. The details depend on which type of memory barrier is used.
    Yes and no. Although I may be wrong, I think what you have said is correct but incomplete, as described in my post.

    You don't actually ever want to force a write to main memory. At most you would invalidate the L1 and L2 caches of other threads, but this is an implementation detail of memory barriers.
    When I say "force a write to memory", I really mean force a write to the point you reach cache coherency, because that is enough. As described, although I may be wrong, it seems to me such forced writes are necessary, because memory barriers do *not* cause stores to *actually occur*. They only control ordering when the stores *do* occur, and they may *never* occur, and the only way to ensure they *do* occur is a forced write to memory.

    I am however now repeating myself! so I will leave the matter here.
    Last edited by Toby Douglass; 01-17-2017 at 05:13 AM.

  6. #21
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> because memory barriers do *not* cause stores to *actually occur*.
    On a cache coherent architecture, it always occurs eventually. To guarantee that the store has occurred before loading it - you use membars. There is no extra voodoo.

    gg

  7. #22
    Registered User
    Join Date
    Dec 2009
    Posts
    83
    Quote Originally Posted by Codeplug View Post
    >> because memory barriers do *not* cause stores to *actually occur*.
    On a cache coherent architecture, it always occurs eventually. To guarantee that the store has occurred before loading it - you use membars. There is no extra voodoo.
    I (to the extent I know, which isn't much) know of no processor manufacturer or designer who provides guarantees about the latency of stores reaching the cache coherency protocol and so becoming visible to other physical cores.

    As such, it cannot be assumed stores will complete, and we do not know the latency which will occur even for those stores which do complete (and store barriers do not cause those stores to complete).

  8. #23
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> ... guarantees about the latency of stores reaching the cache ...
    That latency does not matter in the presence of proper synchronization (membars). If it does, then you have a race condition and the code is no good.

    >> and store barriers do not cause those stores to complete
    Sure they do...
    Quote Originally Posted by Intel® 64 and IA-32 Architectures
    Optimization Reference Manual
    7.4.5.1 SFENCE Instruction

    The SFENCE (STORE FENCE) instruction makes it possible for every STORE instruction that precedes an SFENCE in program order to be globally visible before any STORE that follows the SFENCE. SFENCE provides an efficient way of ensuring ordering between routines that produce weakly-ordered results.
    Intel(R) 64 and IA-32 Architectures Software Developer Manuals | Intel(R) Software

    That's all there is to it. Other architectures with membars behave similarly.
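
    In code - a sketch, assuming x86 and the SSE intrinsics header:

    Code:
    #include <xmmintrin.h>
    
    static volatile int a, b, flag;
    
    void producer( void )
    {
      a = 1;
      b = 2;
      _mm_sfence();   /* stores to a and b become globally visible before the store to flag */
      flag = 1;
    }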

    gg

  9. #24
    Registered User
    Join Date
    Dec 2009
    Posts
    83
    Quote Originally Posted by Codeplug View Post
    That latency does not matter in the presence of proper synchronization (membars). If it does, then you have a race condition and the code is no good.
    If memory barriers do not cause stores to complete, then the latency would matter.

    and store barriers do not cause those stores to complete
    Sure they do...

    The SFENCE (STORE FENCE) instruction makes it possible for every STORE instruction that precedes an SFENCE in program order to be globally visible before any STORE that follows the SFENCE. SFENCE provides an efficient way of ensuring ordering between routines that produce weakly-ordered results.
    That's all there is to it. Other architectures with membars behave similarly.
    I may be wrong, but I think you are mis-reading the documentation.

    The documentation states and only states that stores preceding the fence will be globally visible before any store that follows.

    It does not and it specifically does not indicate that those stores *will complete by the time the store barrier is complete*. This is important - the documentation says *nothing* about *when the stores complete*. It *only* makes statements about *ordering*.

    The document *is* correct in what it says - stores before will be visible before stores after - but it is incorrect to take this to *also* mean that visibility *will have occurred by the time the barrier completes*. This is *not* the case.

    The documentation is accurate in what it says, but it is insufficient: it does not say other things which it needs to say. By that shortcoming it leads people to misunderstand, as it is natural to assume (given prior experience with mutexes and so on) that it means stores are completed by the time the barrier is complete.

    Put it this way. Imagine for a moment what I am saying is true, and that stores can occur at any time, and the barrier does not cause stores to complete. You can see that what is written in the docs is still correct? For it certainly remains true that when the stores *do* complete, at some later time (if they do, that is - they may be held up indefinitely), they *will* complete in the order required by the memory barrier. As such, the docs do not actually state either your view of things, or mine.

  10. #25
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> ... makes it possible for every STORE instruction that precedes an SFENCE in program order to be globally visible before any STORE that follows the SFENCE.

    To be "globally visible" implies the store completed - what else could it mean? Once the store is globally visible, the next load must return the value that was stored.

    You can look at sfence as the store "occurring" at the fence - or occurring at the next store after the fence. It doesn't matter since the end result is still the same. Imagine the window of time starting at the fence and ending at the first store after the fence. Who cares if the mental model uses the front of the window or the back of the window? Whatever may be in this window we don't care about. All we care about is that our data-dependent stores are "globally visible" in the correct order - thanks to the sfence (... or is it the first store after the sfence? )

    You still claim that stores could be "held up indefinitely". This simply isn't how cache coherent architectures work. All their fancy optimization techniques don't matter as long as the membars do what they are supposed to do. You don't even need to know how the optimizations work. You just need to know that membars work.

    gg

  11. #26
    Registered User
    Join Date
    Dec 2009
    Posts
    83
    Quote Originally Posted by Codeplug View Post
    ... makes it possible for every STORE instruction that precedes an SFENCE in program order to be globally visible before any STORE that follows the SFENCE.
    To be "globally visible" implies the store completed - what else could it mean?
    I may be wrong, but I think you are reading it to mean "preceding stores will be completed and AS SUCH globally visible before later stores".

    I am reading it to mean "preceding stores will be globally visible before later stores" (with no assertion or implication about when stores complete).

    However, we are repeating ourselves :-) I think you understand what I mean, and I understand what you mean, and that's that! :-)

    You still claim that stores could be "held up indefinitely". This simply isn't how cache coherent architectures work. All their fancy optimization techniques don't matter as long as the membars do what they are supposed to do. You don't even need to know how the optimizations work. You just need to know that membars work.
    I think it would be possible to write some code to test our views in this matter and see which actually occurs.

    If stores are not complete after a store barrier, then even with a matching load barrier in another thread, the store to the variable in question may still not be seen. We should then be able to observe that thread A thinks it has written a given value, while thread B (even after a store barrier from A and a load barrier from B) still sees the previous value.

    Perhaps this can be done with two counters. The first is initialized to 0, the second to 1. We then enter a loop; both counters are incremented and a store barrier issued. In another thread, we issue a load barrier and read both counters. They should always be separated by 1. If they are not, then the view that store barriers cause stores to complete is incorrect.
    Last edited by Toby Douglass; 01-18-2017 at 03:08 AM.

  12. #27
    Registered User
    Join Date
    Dec 2009
    Posts
    83
    I think the method I've described isn't quite right; if we imagine the writer thread, it could increment the first counter, pause, and then during the pause the reader thread will see the two counters are equal - nothing to do with memory barriers.

    However, we can modify the method slightly, and say that we should never see a difference of *two or more* between the two counters.

    Note that I think the method will not work on Intel, because Intel offers guarantees above those provided by memory barriers - in particular, that all stores by a single processor are seen in the same order by all other processors, i.e. it is as if there is a store barrier after every store.

    I will try it on ARM and MIPS when I get home.

  13. #28
    Registered User
    Join Date
    Dec 2009
    Posts
    83
    ... makes it possible for every STORE instruction that precedes an SFENCE in program order to be globally visible before any STORE that follows the SFENCE.
    To be "globally visible" implies the store completed - what else could it mean?
    Another way to approach this has occurred to me, from looking at the language of the Intel manual. I think it is something in fact you yourself have already spotted.

    The language says "every preceding store will be globally visible before any store that follows the sfence".

    It does not say "every preceding store will be globally visible by the time the sfence completes".

    We can already see, then, that if there were an sfence and no following stores, we could not expect the previous stores to be globally visible - regardless of whether my view or yours is correct.

    We do see then that the sfence itself is not causing stores to complete; if it were, the language of the manual would be "every preceding store will be globally visible by the time the sfence completes".

    With regard to stores following an sfence, there are many factors which influence whether or not they occur, and if they do, when. The compiler can re-order stores, combine stores, or eliminate them. Store buffers play a role in delaying stores. In short, we cannot *know*, with certainty, that a store will occur unless we take steps to *guarantee* it to occur, which means an atomic instruction.
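
    For example - a sketch; exactly what happens depends on the compiler and flags:

    Code:
    int x;
    
    void two_stores( void )
    {
      x = 1;   /* the compiler may eliminate this store entirely... */
      x = 2;   /* ...keeping only this one */
    }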

    We see then that we can have stores, then a store barrier, and then have no idea when the next store will occur - and it is that next store which actually, as per the language in the manual, causes the earlier stores to become globally visible.

    An atomic instruction *without* a store barrier will *not* cause previous stores to complete, but an atomic instruction where store barriers have previously been issued must of course cause earlier stores to complete, as the atomic instruction is guaranteed to cause a store, and of course in an order honouring the constraints specified by those earlier store barriers.
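
    By way of illustration - a sketch of what I mean, using the GCC builtins, and bearing in mind that I may be wrong about this:

    Code:
    /* Sketch only; GCC __atomic builtins assumed, and this reflects my
       reading of the manual, which may be wrong. */
    static volatile int data;
    static volatile int guard;
    
    void publish( void )
    {
      data = 42;
      __atomic_thread_fence( __ATOMIC_RELEASE );        /* constrains ordering only */
      __atomic_store_n( &guard, 1, __ATOMIC_SEQ_CST );  /* a store guaranteed to occur -
                                                           which, on my reading, is what
                                                           forces the earlier store out */
    }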
    Last edited by Toby Douglass; 01-18-2017 at 10:21 AM.

  14. #29
    Registered User
    Join Date
    Dec 2009
    Posts
    83
    So, I had a go on ARM.

    The core of the code looks like this;

    Code:
    #include <assert.h>
    #include <iso646.h>  /* for "and" */
    #include <stdio.h>
    
    /***** statics *****/
    /* aligned so the two counters sit on separate cache lines */
    int long long unsigned static volatile __attribute__( (aligned(128)) )
      c0 = 0,
      c1 = 1;
    
    /****************************************************************************/
    libshared_pal_thread_return_t LIBSHARED_PAL_THREAD_CALLING_CONVENTION thread_reader( void *thread_argument )
    {
      int long long unsigned
        local_c0,
        local_c1;
    
      assert( thread_argument != NULL );
    
      reader_start:
        local_c0 = c0;
        local_c1 = c1;
        __atomic_thread_fence( __ATOMIC_ACQUIRE );
        if( local_c0 == c0 and local_c1 == c1 )   /* discard snapshots taken mid-update */
          if( local_c0 > local_c1 )               /* c0's stores seen ahead of c1's */
            printf( "uh-oh! c0 is %llu and c1 is %llu\n", local_c0, local_c1 );
      goto reader_start;
    
      return LIBSHARED_PAL_THREAD_RETURN_CAST(RETURN_SUCCESS);
    }
    
    /****************************************************************************/
    libshared_pal_thread_return_t LIBSHARED_PAL_THREAD_CALLING_CONVENTION thread_writer( void *thread_argument )
    {
      assert( thread_argument != NULL );
    
      writer_start:
        c0++;
        c1++;
        __atomic_thread_fence( __ATOMIC_RELEASE );  /* order this iteration's stores before the next's */
      goto writer_start;
    
      return LIBSHARED_PAL_THREAD_RETURN_CAST(RETURN_SUCCESS);
    }
    Now, with this kind of work, it is extremely easy to make mistakes. I do not draw from this any solid conclusion, because it's just not possible to be confident enough in the code. Review by you and others would be a very good thing.

    Using this code on ARM64, I was unable to induce the printf() (i.e. a difference of two or more).

    With the memory barriers removed, it showed up immediately and often.

    This in and of itself isn't unexpected as such - McKenney argued the propagation delay from store buffers was so short (nanoseconds) that it wasn't a problem. It may just be that the delay is there, but it's so short the code cannot provoke the problem (especially given the work being done in the reader to perform a read).

    There are however other possible causes which could be masking the problem. On ARM I think there are "levels" of memory barrier (how far out you push the stores), and it might be that they somehow differ in how long they take to execute.

    Another reason that might contribute is that the system was entirely unloaded except for this code. On busy systems, the delay may (well) be increased. Or maybe it's reduced, because more stores are occurring!

  15. #30
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    I believe the Acquire fence goes directly after the label and before the loads. The fence provides an order of visibility of changes. So the goal of the fence would be for the reader to always see one of two results: 1) c0 and c1 are the same value, 2) c0 is greater than c1.

    So the test-for-failure should be "if (c1 > c0)", meaning that the reader observed c1 being incremented before c0.
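
    In code, the change would look something like this (a sketch, and assuming both counters start at the same value):

    Code:
    libshared_pal_thread_return_t LIBSHARED_PAL_THREAD_CALLING_CONVENTION thread_reader( void *thread_argument )
    {
      int long long unsigned
        local_c0,
        local_c1;
    
      assert( thread_argument != NULL );
    
      reader_start:
        __atomic_thread_fence( __ATOMIC_ACQUIRE );  /* fence moved before the loads */
        local_c0 = c0;
        local_c1 = c1;
        if( local_c0 == c0 and local_c1 == c1 )     /* discard snapshots taken mid-update */
          if( local_c1 > local_c0 )                 /* c1's increment seen before c0's */
            printf( "uh-oh! c0 is %llu and c1 is %llu\n", local_c0, local_c1 );
      goto reader_start;
    
      return LIBSHARED_PAL_THREAD_RETURN_CAST(RETURN_SUCCESS);
    }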

    gg
