C++11 Multi threading memory ordering: Relaxed vs Release and Consume

**SchCle** · 04-10-2015

Hi,
I've a question regarding this example

Code:

#include <thread>
#include <atomic>
#include <cassert>
#include <string>
 
std::atomic<std::string*> ptr;
int data;
 
void producer()
{
    std::string* p  = new std::string("Hello");
    data = 42;
    ptr.store(p, std::memory_order_release);
}
 
void consumer()
{
    std::string* p2;
    while (!(p2 = ptr.load(std::memory_order_consume)))
        ;
    assert(*p2 == "Hello"); // never fires: *p2 carries dependency from ptr
    assert(data == 42); // may or may not fire: data does not carry dependency from ptr
}
 
int main()
{
    std::thread t1(producer);
    std::thread t2(consumer);
    t1.join(); t2.join();
}

given here: std::memory_order - cppreference.com

(Yes, I also read that "no known production compilers track dependency chains: consume operations are lifted to acquire operations.")

I'm wondering whether a std::memory_order_relaxed ordering for the load and store would yield the same... The first assertion can only fire if *p2 != "Hello", which can only happen if the load failed. So the only scenario I can think of is if the storing is "character-wise atomic", i.e. a result could be a partial store of "Hello". If someone can elucidate this to me, I'd be grateful.

**Codeplug** · 04-10-2015

Yes - your assert() comments are still true, and the p2 results are the same, when using memory_order_relaxed.

gg

**phantomotap** · 04-10-2015

I'm wondering whether a std::memory_order_relaxed ordering for the load and store would yield the same... The first assertion can only fire if *p2 != "Hello", which can only happen if the load failed. So the only scenario I can think of is if the storing is "character-wise atomic", i.e. a result could be a partial store of "Hello". If someone can elucidate this to me, I'd be grateful.

O_o

I wouldn't use `std::memory_order_relaxed' for the store. I can't imagine poor visibility actually happening because the context would seem to be a pessimism from the perspective of an optimizing processor, but you might indeed find that the `std::memory_order_relaxed' store becomes visible before all preceding writes occurring within the constructor.

Soma

**Codeplug** · 04-10-2015

>> .. might ... becomes visible before all preceding writes occurring within the constructor.
+1

So the first assert could fire. Classic trap in DCL

.

gg

**phantomotap** · 04-10-2015

O_o

DCL?

[Edit]
*derp*

Double-Checked Locking

I... um... look a pony!
[/Edit]

Soma

**stahta01** · 04-10-2015

My first thought was how does DEC Command Language relate to this post.

Link: http://en.wikipedia.org/wiki/DIGITAL_Command_Language
I was remember the D standing for DEC; the link says Digital. Wow, its been 19 years since I first learned ADA on a VAX system.

Tim S.

Originally Posted by phantomotap

O_o

DCL?

[Edit]
*derp*

Double-Checked Locking

I... um... look a pony!
[/Edit]

Soma

**SchCle** · 04-12-2015

Originally Posted by phantomotap

I wouldn't use `std::memory_order_relaxed' for the store. I can't imagine poor visibility actually happening because the context would seem to be a pessimism from the perspective of an optimizing processor, but you might indeed find that the `std::memory_order_relaxed' store becomes visible before all preceding writes occurring within the constructor.

Thanks for your explanation. However, I'm not exactly sure why `ptr.store(p, std::memory_order_release);` can be moved (due to optimization) "above" `std::string* p = new std::string("Hello");` (this is what you're saying, if I got you right, and this is the reason why the 2nd assertion can fire). I can only understand this if `std::string* p` would be declared globally, as then the assignment `p << "Hello";` could happen before or after `ptr = p`.

**phantomotap** · 04-12-2015

this is what you're saying, if I got you right, and this is the reason why the 2nd assertion can fire

O_o

I don't know how you got any of that information from what I said, but you are wrong regardless.

I said nothing about `ptr.store(p, std::memory_order_release);' being moved. I said the visibility of `ptr' could change when using `std::memory_order_relaxed' before the visibility of earlier writes--the character array--changes. The differences regarding change in visibility has nothing really to do with moving around instructions. The differences regarding change in visibility exist because `std::memory_order_relaxed' only guarantees atomicity. (The `std::memory_order_relaxed' does not guarantee any ordering of visibility.) The `std::memory_order_relaxed' option exists for scenarios where no ordering is required, but you are wanting to use the option for a scenario where a specific order of visibility is required.

We don't need to assume an elaborate environment to see how the lack of ordering requirements may exhibit differences in visibility. The character array--the focus of earlier writes--doesn't likely live in the same operational block as the `ptr' atomic. A simple environment having operations to flush the operational cache to a memory block with specific atomic reads and writes implemented by unbreakable instructions for flushing or invalidating the block where the atomic lives is sufficient to see problems.

Let's pretend we could insert such invalidation/flushing instructions ourselves; take a look at some slightly modified example code which eliminates spurious context:

Code:

std::atomic<char *> g;

void producer()
{
    char * s(new char[6]);
    s[0] = 'H';
    s[1] = 'e';
    s[2] = 'l';
    s[3] = 'l';
    s[4] = 'o';
    s[5] = '\0';
    FLUSH(s); // 1
    // The change in the memory referenced by `s' becomes visible.
    FLUSH(&g);
    // The `consumer' thread is receives the change in `g' value.
    g.store(s, std::memory_order_release);
}

void consumer()
{
    char * s;
    INVALIDATE(&g);
    while (!(s = g.load(std::memory_order_consume)));
    // The `consumer' thread has received the change in `g' value.
    INVALIDATE(s); // 2
    assert(!strcmp(s, "Hello"));
}

We know that the memory regarding `g' and `s' will eventually be copied from processor cache to main memory, but the order the visibility changes--which copy happens first--is relevant because the correct operation of the code depends on the memory referenced by `s' being completely written. We know that `1' happens before `2' because the write to `g' isn't allowed to "float above" the earlier writes.

Let's pretend we didn't care about the order by using the `std::memory_order_relaxed' option; the following reordered instructions changing visibility is perfectly valid:

Code:

std::atomic<char *> g;

void producer()
{
    char * s(new char[6]);
    s[0] = 'H';
    s[1] = 'e';
    s[2] = 'l';
    s[3] = 'l';
    s[4] = 'o';
    s[5] = '\0';
    FLUSH(&g);
    // The `consumer' thread is receives the change in `g' value.
    FLUSH(s); // 1
    // The change in the memory referenced by `s' becomes visible.
    g.store(s, std::memory_order_relaxed);
}

void consumer()
{
    char * s;
    INVALIDATE(&g);
    while (!(s = g.load(std::memory_order_relaxed)));
    // The `consumer' thread has received the change in `g' value.
    INVALIDATE(s); // 2
    assert(!strcmp(s, "Hello"));
}

We know told the environment that we don't care about ordering. We don't know that `1' happens before `2' is completed. We know that the environment is, for some definition or parallel, executing the functions in parallel. However, the environment may take longer to complete `FLUSH(s);' than `INVALIDATE(s);' which means that the memory referenced by `s' within `consumer' doesn't hold the values we expect.

Be fairly warned, I've simplified both example and explanation to the breaking point. One could easily poke holes in both example and explanation. The fact is that the visibility requirements only place an "as if" behavioral expectation with a multitude of different behaviors seen in practice. Despite the arguably frail examples and explanation, the point remains that apparently ordered operations do not necessarily correspond to changes in visibility having the same order. Actually, you could even go a bit further. If the visibility of changes did precisely correspond to the order of operations, you wouldn't need anything other than `std::memory_order_relaxed' to reason about threaded code.

Ultimately, the `std::memory_order_release' and `std::memory_order_acquire' options in particular allow us to reason about code as if the visibility of operations corresponded to the order of operations when correctly used.

Soma

**SchCle** · 04-12-2015

Hi Soma, thanks for the example.
My reasoning behind the latter post was that a change in visibility must be accompanied by some instruction; but your saying I'm wrong because

The differences regarding change in visibility has nothing really to do with moving around instructions

?

Regarding your example, I've a question:
If `while (!(s = g.load(std::...)));` fires due to `FLUSH(&g)` (which must happen at least in the second example), `g.store(s, std::memory_order_release);` might still happen after `INVALIDATE(s)` (whatever this INVALIDATE is doing) and thus causing a problem also in the first case. So I'm quite sure that there is something wrong in my understanding of the code. Maybe lines 12 - 14 in the producer function should just resemble what the storing does?

**Codeplug** · 04-12-2015

https://channel9.msdn.com/Shows/Goin...Weapons-1-of-2

Watch that video and see if it clears anything up.

gg

**SchCle** · 04-13-2015

Hi,
watched it, the second part was more educative. What I'm then wondering about is the expensiveness in general of atomics - compared to a mutex. I can implement all reads and writes with something like

Code:

int X;
int ts_readwrite_X(const bool rw, const int i = 0)
{
mutex mt;
mt.lock();
if(rw == READ)
 return X;
else
 X = i;
mt.unlock();
}

Maybe I've missed something in this sketch, but it outlines the idea. In terms of computation time, can I argue that an atomic that's written in `release` and loaded in `acquire` mode is still less consuming, with the reason being that the visibility of the variable is allowed to float around between those actions?

What I'm after is based on this: "A load operation with this memory order performs the acquire operation on the affected memory location: no memory accesses in the current thread can be reordered before this load." (Quote from http://en.cppreference.com/w/cpp/atomic/memory_order). Compared to the implementation shown above, this sounds to me as if that is a "heavier" restriction for optimization, because with the implementation I impose a strict (and maybe easier-to-read) barrier, but memory accesses that don't concern the `int X` can still be reordered. However, I might overrate the meaning of "reordered" in the quote.

**Codeplug** · 04-13-2015

Any discussion of computation time will greatly depend on the implementation and the HW it's targeting.

Personally, I use a "lightweight mutex" (the fast-path for lock/unlock is one atomic) 99% of the time when synchronization is needed. If there is a single shared scalar with no data-dependencies then I may use atomic ops with it. If you need anything fancier than that, like lock-free data structures, take advantage of existing implementations like Intel TBB or MS PPL.

gg

**SchCle** · 04-21-2015

Hi Codeplug,
is this

Code:

class SimpleLock
{
private:

    std::atomic<uint8_t> v;

public:

    SimpleLock(void)
    {
        v.store(0);
    }
    /** \brief
     *
     * \param void
     * \return bool True if 'z' compares equal to 'v'.
     *
     * Compares the contents of the 'v' value with 'z':
     * If true, it replaces the 'v' value with 1 (like store).
     * If false, it replaces 'z' with the 'v' value.
     */
    bool try_lock(void)
    {
        uint8_t z = 0;
        return v.compare_exchange_strong(z, 1);
    }
    void lock(void)
    {
        while(try_lock() == false)
            std::this_thread::yield();
    }
    void unlock(void)
    {
        v.store(0);
    }
};

a "lightweight mutex"?

**phantomotap** · 04-21-2015

Hi Codeplug, is this [C.O.D.E] a "lightweight mutex"?

O_o

No. You have, almost, posted the quintessential non-recursive spinlock.

A lightweight mutex, called by many different labels, is still implemented with a full mutex. (I usually see a semaphore, but you can use whatever such synchronization primitives you have available if you are clever.) The difference between the mechanisms, full mutex and lightweight mutex, comes from using atomic operations to determine if the more elaborate mechanism, the full mutex, is even necessary for a give pass through a section of code.

The two lighter primitives discussed, lightweight mutex and spinlock, have different uses. The spinlock is more appropriate for situations where you know you don't have much to do in a given pass. (A few operations which generally have a very dependable cost.) The lightweight mutex is more appropriate for situations where you have quite a bit to do in a given pass yet isn't often a source of contention.

Soma

**Codeplug** · 04-21-2015

That's typically called a spin lock, but could be considered "lightweight" by my definition.

Let's talk about the "slow path". The slow path means that the lock is currently being held and so we have to do "something". Spinning is one option. The downside is poor performance when there is high contention or the lock is held for "too long" at a time with medium to high contention. What you don't want is multiple threads syncing cache lines and locking buses over and over with a strong atomic. If their is contention but the lock is always held for very short period of time, then a "back off" algorithm can help reduce bus contention.

Another option is wait in the OS until the lock is released. Here you're not using CPU resources, but at the cost of transitioning in and out of the kernel. This is ideal if you know the lock isn't always held for a very short time.

The two can also be combined, eg. the Win32API InitializeCriticalSectionAndSpinCount().

[duplicate info - I'm a slow poster]

gg

Thread: C++11 Multi threading memory ordering: Relaxed vs Release and Consume

Thread Tools

Search Thread

Display

C++11 Multi threading memory ordering: Relaxed vs Release and Consume

Similar Threads

Multi Threading

Using Multi-threading and/or Multi-process

Multi Threading

Multi-Threading

Tags for this Thread