Thread: C++11 Multi threading memory ordering: Relaxed vs Release and Consume

  1. #1
    Registered User
    Join Date
    Feb 2015
    Posts
    17

    C++11 Multi threading memory ordering: Relaxed vs Release and Consume

    Hi,
    I've a question regarding this example

    Code:
    #include <thread>
    #include <atomic>
    #include <cassert>
    #include <string>
     
    std::atomic<std::string*> ptr;
    int data;
     
    void producer()
    {
        std::string* p  = new std::string("Hello");
        data = 42;
        ptr.store(p, std::memory_order_release);
    }
     
    void consumer()
    {
        std::string* p2;
        while (!(p2 = ptr.load(std::memory_order_consume)))
            ;
        assert(*p2 == "Hello"); // never fires: *p2 carries dependency from ptr
        assert(data == 42); // may or may not fire: data does not carry dependency from ptr
    }
     
    int main()
    {
        std::thread t1(producer);
        std::thread t2(consumer);
        t1.join(); t2.join();
    }
    given here: std::memory_order - cppreference.com

    (Yes, I also read that "no known production compilers track dependency chains: consume operations are lifted to acquire operations.")

I'm wondering whether a std::memory_order_relaxed ordering for the load and the store would yield the same behavior... The first assertion can only fire if *p2 != "Hello", which can only happen if the load failed. So the only scenario I can think of is if the store is "character-wise atomic", i.e. the result could be a partial store of "Hello". If someone can elucidate this for me, I'd be grateful.

  2. #2
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    Yes - your assert() comments are still true, and the p2 results are the same, when using memory_order_relaxed.

    gg

  3. #3
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
I'm wondering whether a std::memory_order_relaxed ordering for the load and the store would yield the same behavior... The first assertion can only fire if *p2 != "Hello", which can only happen if the load failed. So the only scenario I can think of is if the store is "character-wise atomic", i.e. the result could be a partial store of "Hello". If someone can elucidate this for me, I'd be grateful.
    O_o

    I wouldn't use `std::memory_order_relaxed' for the store. I can't imagine poor visibility actually happening, because the context would seem to be a pessimism from the perspective of an optimizing processor, but you might indeed find that the `std::memory_order_relaxed' store becomes visible before the preceding writes that occurred within the constructor.

    Soma
    “Salem Was Wrong!” -- Pedant Necromancer
    “Four isn't random!” -- Gibbering Mouther

  4. #4
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> .. might ... becomes visible before all preceding writes occurring within the constructor.
    +1

    So the first assert could fire. Classic trap in DCL.

    gg

  5. #5
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    O_o

    DCL?

    [Edit]
    *derp*

    Double-Checked Locking

    I... um... look a pony!
    [/Edit]

    Soma
    Last edited by phantomotap; 04-10-2015 at 11:07 AM.
    “Salem Was Wrong!” -- Pedant Necromancer
    “Four isn't random!” -- Gibbering Mouther

  6. #6
    Registered User
    Join Date
    May 2009
    Posts
    3,877
    My first thought was: how does DEC Command Language relate to this post?

    Link: http://en.wikipedia.org/wiki/DIGITAL_Command_Language
    I remembered the D standing for DEC; the link says Digital. Wow, it's been 19 years since I first learned Ada on a VAX system.

    Tim S.

    Quote Originally Posted by phantomotap View Post
    O_o

    DCL?

    [Edit]
    *derp*

    Double-Checked Locking

    I... um... look a pony!
    [/Edit]

    Soma
    Last edited by stahta01; 04-10-2015 at 01:04 PM.
    "...a computer is a stupid machine with the ability to do incredibly smart things, while computer programmers are smart people with the ability to do incredibly stupid things. They are,in short, a perfect match.." Bill Bryson

  7. #7
    Registered User
    Join Date
    Feb 2015
    Posts
    17
    Quote Originally Posted by phantomotap View Post
    I wouldn't use `std::memory_order_relaxed' for the store. I can't imagine poor visibility actually happening, because the context would seem to be a pessimism from the perspective of an optimizing processor, but you might indeed find that the `std::memory_order_relaxed' store becomes visible before the preceding writes that occurred within the constructor.
    Thanks for your explanation. However, I'm not exactly sure why `ptr.store(p, std::memory_order_release);` can be moved (due to optimization) "above" `std::string* p = new std::string("Hello");` (this is what you're saying, if I got you right, and this is the reason why the 2nd assertion can fire). I can only understand this if `std::string* p` were declared globally, as then the assignment `*p = "Hello";` could happen before or after `ptr = p`.

  8. #8
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    this is what you're saying, if I got you right, and this is the reason why the 2nd assertion can fire
    O_o

    I don't know how you got any of that information from what I said, but you are wrong regardless.

    I said nothing about `ptr.store(p, std::memory_order_release);' being moved. I said that, with `std::memory_order_relaxed', the visibility of `ptr' could change before the visibility of the earlier writes--the character array--changes. The differences regarding change in visibility have nothing really to do with moving instructions around. The differences exist because `std::memory_order_relaxed' only guarantees atomicity. (The `std::memory_order_relaxed' option does not guarantee any ordering of visibility.) The `std::memory_order_relaxed' option exists for scenarios where no ordering is required, but you want to use it in a scenario where a specific order of visibility is required.

    We don't need to assume an elaborate environment to see how the lack of ordering requirements may exhibit differences in visibility. The character array--the focus of the earlier writes--doesn't likely live in the same operational block as the `ptr' atomic. A simple environment is sufficient to see problems: one with operations to flush the operational cache to a memory block, and with specific atomic reads and writes implemented by unbreakable instructions that flush or invalidate the block where the atomic lives.

    Let's pretend we could insert such invalidation/flushing instructions ourselves; take a look at some slightly modified example code which eliminates spurious context:

    Code:
    std::atomic<char *> g;
    
    void producer()
    {
        char * s(new char[6]);
        s[0] = 'H';
        s[1] = 'e';
        s[2] = 'l';
        s[3] = 'l';
        s[4] = 'o';
        s[5] = '\0';
        FLUSH(s); // 1
        // The change in the memory referenced by `s' becomes visible.
        FLUSH(&g);
    // The change in the `g' value becomes visible to the `consumer' thread.
        g.store(s, std::memory_order_release);
    }
    
    void consumer()
    {
        char * s;
        INVALIDATE(&g);
        while (!(s = g.load(std::memory_order_consume)));
        // The `consumer' thread has received the change in `g' value.
        INVALIDATE(s); // 2
        assert(!strcmp(s, "Hello"));
    }
    We know that the memory regarding `g' and `s' will eventually be copied from processor cache to main memory, but the order the visibility changes--which copy happens first--is relevant because the correct operation of the code depends on the memory referenced by `s' being completely written. We know that `1' happens before `2' because the write to `g' isn't allowed to "float above" the earlier writes.

    Let's pretend we didn't care about the order by using the `std::memory_order_relaxed' option; the following reordered instructions changing visibility is perfectly valid:

    Code:
    std::atomic<char *> g;
    
    void producer()
    {
        char * s(new char[6]);
        s[0] = 'H';
        s[1] = 'e';
        s[2] = 'l';
        s[3] = 'l';
        s[4] = 'o';
        s[5] = '\0';
        FLUSH(&g);
    // The change in the `g' value becomes visible to the `consumer' thread.
        FLUSH(s); // 1
        // The change in the memory referenced by `s' becomes visible.
        g.store(s, std::memory_order_relaxed);
    }
    
    void consumer()
    {
        char * s;
        INVALIDATE(&g);
        while (!(s = g.load(std::memory_order_relaxed)));
        // The `consumer' thread has received the change in `g' value.
        INVALIDATE(s); // 2
        assert(!strcmp(s, "Hello"));
    }
    We told the environment that we don't care about ordering. We don't know that `1' happens before `2' is completed. We know that the environment is, for some definition of parallel, executing the functions in parallel. However, the environment may take longer to complete `FLUSH(s);' than `INVALIDATE(s);', which means that the memory referenced by `s' within `consumer' doesn't hold the values we expect.

    Be fairly warned, I've simplified both example and explanation to the breaking point. One could easily poke holes in both example and explanation. The fact is that the visibility requirements only place an "as if" behavioral expectation with a multitude of different behaviors seen in practice. Despite the arguably frail examples and explanation, the point remains that apparently ordered operations do not necessarily correspond to changes in visibility having the same order. Actually, you could even go a bit further. If the visibility of changes did precisely correspond to the order of operations, you wouldn't need anything other than `std::memory_order_relaxed' to reason about threaded code.

    Ultimately, the `std::memory_order_release' and `std::memory_order_acquire' options in particular allow us to reason about code as if the visibility of operations corresponded to the order of operations when correctly used.

    Soma
    Last edited by phantomotap; 04-12-2015 at 07:35 AM.
    “Salem Was Wrong!” -- Pedant Necromancer
    “Four isn't random!” -- Gibbering Mouther

  9. #9
    Registered User
    Join Date
    Feb 2015
    Posts
    17
    Hi Soma, thanks for the example.
    My reasoning behind the latter post was that a change in visibility must be accompanied by some instruction; but you're saying I'm wrong because
    The differences regarding change in visibility has nothing really to do with moving around instructions
    ?

    Regarding your example, I've a question:
    If the `while (!(s = g.load(std::...)));` loop exits due to `FLUSH(&g)` (which must happen at least in the second example), `g.store(s, std::memory_order_release);` might still happen after `INVALIDATE(s)` (whatever that INVALIDATE is doing) and thus cause a problem in the first case as well. So I'm quite sure that there is something wrong in my understanding of the code. Maybe lines 12 - 14 in the producer function should just resemble what the storing does?
    Last edited by SchCle; 04-12-2015 at 09:52 AM.

  10. #10

  11. #11
    Registered User
    Join Date
    Feb 2015
    Posts
    17
    Hi,
    Watched it; the second part was more educational. What I'm now wondering about is the general cost of atomics compared to a mutex. I can implement all reads and writes with something like

    Code:
    int X;
    std::mutex mt; // must be shared by all callers, not local to the function

    int ts_readwrite_X(const bool rw, const int i = 0) // READ assumed to be a bool constant
    {
        std::lock_guard<std::mutex> lock(mt); // unlocks on every return path
        if (rw == READ)
            return X;
        X = i;
        return i;
    }
    Maybe I've missed something in this sketch, but it outlines the idea. In terms of computation time, can I argue that an atomic that's written in `release` mode and loaded in `acquire` mode is still less costly, the reason being that the visibility of the variable is allowed to float around between those operations?

    What I'm after is based on this: "A load operation with this memory order performs the acquire operation on the affected memory location: no memory accesses in the current thread can be reordered before this load." (Quote from http://en.cppreference.com/w/cpp/atomic/memory_order). Compared to the implementation shown above, this sounds to me like a "heavier" restriction on optimization, because with my implementation I impose a strict (and maybe easier-to-read) barrier, but memory accesses that don't concern the `int X` can still be reordered. However, I might be overrating the meaning of "reordered" in the quote.
    Last edited by SchCle; 04-13-2015 at 01:13 PM.

  12. #12
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    Any discussion of computation time will greatly depend on the implementation and the HW it's targeting.

    Personally, I use a "lightweight mutex" (the fast-path for lock/unlock is one atomic) 99% of the time when synchronization is needed. If there is a single shared scalar with no data-dependencies then I may use atomic ops with it. If you need anything fancier than that, like lock-free data structures, take advantage of existing implementations like Intel TBB or MS PPL.

    gg

  13. #13
    Registered User
    Join Date
    Feb 2015
    Posts
    17
    Hi Codeplug,
    is this
    Code:
    #include <atomic>
    #include <cstdint>
    #include <thread>

    class SimpleLock
    {
    private:
    
        std::atomic<uint8_t> v;
    
    public:
    
        SimpleLock(void)
        {
            v.store(0);
        }
        /** \brief
         *
         * \param void
         * \return bool True if 'z' compares equal to 'v'.
         *
         * Compares the contents of the 'v' value with 'z':
         * If true, it replaces the 'v' value with 1 (like store).
         * If false, it replaces 'z' with the 'v' value.
         */
        bool try_lock(void)
        {
            uint8_t z = 0;
            return v.compare_exchange_strong(z, 1);
        }
        void lock(void)
        {
            while(try_lock() == false)
                std::this_thread::yield();
        }
        void unlock(void)
        {
            v.store(0);
        }
    };
    a "lightweight mutex"?

  14. #14
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    Hi Codeplug, is this [C.O.D.E] a "lightweight mutex"?
    O_o

    No. You have, almost, posted the quintessential non-recursive spinlock.

    A lightweight mutex, called by many different labels, is still implemented with a full mutex. (I usually see a semaphore, but you can use whatever such synchronization primitives you have available if you are clever.) The difference between the mechanisms, full mutex and lightweight mutex, comes from using atomic operations to determine if the more elaborate mechanism, the full mutex, is even necessary for a given pass through a section of code.

    The two lighter primitives discussed, lightweight mutex and spinlock, have different uses. The spinlock is more appropriate for situations where you know you don't have much to do in a given pass. (A few operations which generally have a very dependable cost.) The lightweight mutex is more appropriate for situations where you have quite a bit to do in a given pass yet isn't often a source of contention.

    Soma
    “Salem Was Wrong!” -- Pedant Necromancer
    “Four isn't random!” -- Gibbering Mouther

  15. #15
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    That's typically called a spin lock, but could be considered "lightweight" by my definition.

    Let's talk about the "slow path". The slow path means that the lock is currently held, so we have to do "something". Spinning is one option. The downside is poor performance when there is high contention, or when the lock is held for "too long" at a time under medium to high contention. What you don't want is multiple threads syncing cache lines and locking buses over and over with a strong atomic. If there is contention but the lock is only ever held for a very short period of time, then a "back off" algorithm can help reduce bus contention.

    Another option is wait in the OS until the lock is released. Here you're not using CPU resources, but at the cost of transitioning in and out of the kernel. This is ideal if you know the lock isn't always held for a very short time.

    The two can also be combined, e.g. the Win32 API InitializeCriticalSectionAndSpinCount().

    [duplicate info - I'm a slow poster]

    gg
