First venture into multithreading

**SyntaxError** · 06-03-2009

I think I was pretty clear. It appears barrier() is a function call, is it not? Since we don't know what's in it, and given your comment, I'm assuming it is just there to guarantee some sort of ordering by the compiler. I see no reason why it should necessarily do this. Yes, putting a mutex around things fixes the problem. This is "by the book". Your solution seems convoluted and not guaranteed to work.

My current game engine project uses threads. I keep thread communication to a minimum and use critical sections where I have to. The threading has turned out to be the least of my problems. I imagine that's because I don't use any circus tricks. I feel overall design simplicity is the key.

I find it interesting that you are the one who is claiming significant debugging complexity with thread programming and you are also the one using funky tricks. I'm thinking that's possibly more than a coincidence.

Then again it's your code and you can do what you want.

**bithub** · 06-04-2009

Your solution seems convoluted and not guaranteed to work.

My solution is guaranteed to work. Your comments seem to indicate you are not nearly as familiar with concurrency as you claim to be. At any rate, you definitely need to read up on memory barriers.

My current game engine project uses threads

Every game engine I've seen is event based and thus is extremely light on threads. It makes sense that you can keep thread communication to a minimum in this case. Wait until you work on something a little more complex that doesn't run off an event loop.

I find it interesting that you are the one who is claiming significant debugging complexity with thread programming and you are also the one using funky tricks

Since when are memory barriers and volatile variables considered "funky tricks"? In environments where you have mutexes that call the barrier function for you (most modern environments), you can usually get away without thinking about what is actually going on underneath. Unfortunately, if you've been doing this long enough, you almost always encounter a critical section that must be optimized in some way. When you see how many instructions it takes to lock a mutex (especially on Windows!), you quickly find yourself looking to see if it's possible to do a lockless solution for the critical section in question.

**CornedBee** · 06-04-2009

It is my belief that you should only create a thread if it is the only way to elegantly solve the problem at hand.

I find such situations extremely rare.

The right place to use threads is to parallelise computations to make use of modern multi-core platforms. And the right way to use them is to use high-level constructs, like TBB's parallel_for, or at least some task scheduler. You should almost never call a CreateThread equivalent yourself.
Keep the data apart. Avoid writing to shared memory, and you can avoid most synchronization. Use pre-built components for communication, such as lock-free queues.
In the area where I work (computation-intensive simulations), I've found that these guidelines pretty much suffice.
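As a minimal sketch of the "keep the data apart" guideline, assuming C++11 `std::thread` rather than TBB or a task scheduler: each worker reads a disjoint slice of the input and writes only to its own result slot, so no locks or atomics are needed while the workers run. The function name `parallel_sum` and its parameters are hypothetical, chosen only for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` using `num_workers` threads.
// Each worker reads a disjoint slice and writes only to its own slot in
// `partial`, so the workers never touch the same memory location.
long long parallel_sum(const std::vector<int>& data, std::size_t num_workers) {
    if (num_workers == 0) num_workers = 1;
    std::vector<long long> partial(num_workers, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_workers - 1) / num_workers;

    for (std::size_t w = 0; w < num_workers; ++w) {
        workers.emplace_back([&data, &partial, chunk, w] {
            const std::size_t begin = std::min(w * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            long long local = 0;  // accumulate privately, publish once
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[w] = local;
        });
    }
    // join() is the only synchronization point: after it returns, each
    // worker's write to its slot is visible to this thread.
    for (auto& t : workers) t.join();

    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Calling `parallel_sum(v, 4)` on a vector holding 1 through 1000 should give the same result as a serial sum. In real code a higher-level construct such as the `parallel_for` mentioned above would likely replace the hand-rolled thread loop.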

There is no race condition in the above code

Unless you have a guarantee from the compiler that assignment to integers is atomic, you do. Also, due to caching, you could see the write to value_saved but not to value in print_value, although that would take some weird memory arrangement by the compiler. But you knew that, else you wouldn't have put the barrier into your fix.

My solution is guaranteed to work.

Only on Microsoft's compiler. No other compiler gives you any guarantee about volatile in multi-threaded environments. In particular, no other compiler will emit the acquire barrier necessary on loading value_saved and value.

**SyntaxError** · 06-04-2009

All I see is a function call. Is it a system function? Is it yours? On what machine(s) does this work? Your link talks about machine code which is architecture dependent. Read the section on "Out-of-order execution versus compiler reordering optimizations". I'm not sure volatile will even fix that for you. Also please post a link to the documentation for your barrier() function or post the code.

PThreads and other threading libraries are guaranteed to work as documented on the architectures they are implemented on. I will stick with that and stay out of trouble by not using parlor tricks since I find them completely unnecessary.

Also a simple ordering of instructions does nothing to guarantee thread safe code in many situations that matter. I don't really care if data protection takes a few extra clocks. I care if it's safe. In my case I generally have to read/write once per frame which is only about 60 times a second max. There is far more expensive stuff that happens during that time. If I'm using the same data in multiple threads frequently throughout the code, it means my code is poorly designed.

It is easy enough to put a critical section or mutex where needed and get your code to perform properly. You can use whatever you like, but again I'm not the one with debugging issues.
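A minimal sketch of that "by the book" approach, assuming C++11 `std::mutex` as a portable stand-in for a PThreads mutex or a Windows critical section. The `FrameState` and `SharedFrameState` types and their fields are hypothetical and do not come from the engine described above.

```cpp
#include <mutex>

// Hypothetical per-frame data shared between two threads.
struct FrameState {
    float player_x = 0.0f;
    float player_y = 0.0f;
};

// All access goes through write() and read(), each of which holds the
// mutex for the duration of one copy. At roughly 60 accesses per second
// the locking cost is negligible next to the rest of the frame.
class SharedFrameState {
public:
    void write(const FrameState& s) {
        std::lock_guard<std::mutex> lock(mutex_);
        state_ = s;
    }

    FrameState read() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return state_;  // copied while the lock is held
    }

private:
    mutable std::mutex mutex_;
    FrameState state_;
};
```

One thread would call `write()` once per frame and the other would call `read()`. Returning a copy keeps the critical section short and means the caller never holds a reference into the protected data.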

**bithub** · 06-04-2009

All I see is a function call. Is it a system function? Is it yours?

It is an OS dependent function. It is the equivalent of saying LockMutex(). LockMutex() isn't a real call, but everyone knows what the intention is. Since there is no standard way to do a memory barrier (as there is no standard way to lock a mutex), I just put in some pseudo code. I assumed you would know what it meant.

Also a simple ordering of instructions does nothing to guarantee thread safe code in many situations that matter

I agree that it only works in some cases. My example was one of those cases though, so I don't understand your point here.

Keep the data apart. Avoid writing to shared memory, and you can avoid most synchronization.

I agree that would be nice, but in practice I find that it is very hard to avoid writing to shared memory between threads in many cases.

Unless you have a guarantee from the compiler that assignment to integers is atomic, you do.

I agree, I should have used an atomic_t variable for clarity. In reality there are very few architectures that support threads yet do not support atomic read/writes with integers. At work I have the benefit of knowing what hardware my software runs on, so sometimes I make assumptions like this.

No other compiler gives you any guarantee about volatile in multi-threaded environments

It works with GCC since linux's atomic primitives rely on it.

Also please post a link to the documentation for your barrier() function or post the code.

MemoryBarrier() (Windows)
smp_mb() and friends (Linux)
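For readers without either platform call at hand, here is a rough sketch of what a `barrier()` placeholder could map to, assuming a compiler that implements the atomics library later standardized in C++11. `std::atomic_thread_fence` is a real standard function. The wrapper name `barrier` and the variables `payload` and `ready` are hypothetical, and the fence is only roughly comparable to `MemoryBarrier()` or `smp_mb()`, not a drop-in replacement.

```cpp
#include <atomic>
#include <thread>

// Hypothetical stand-in for the barrier() pseudo-call: a full fence.
inline void barrier() {
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

std::atomic<int> payload{0};
std::atomic<bool> ready{false};

// Producer: write the data, fence, then raise the flag.
void producer(int v) {
    payload.store(v, std::memory_order_relaxed);
    barrier();  // keeps the payload store ahead of the flag store
    ready.store(true, std::memory_order_relaxed);
}

// Consumer: wait for the flag, fence, then read the data.
int consumer() {
    while (!ready.load(std::memory_order_relaxed)) {
        std::this_thread::yield();
    }
    barrier();  // keeps the payload load behind the flag load
    return payload.load(std::memory_order_relaxed);
}
```

Running `producer(7)` on one thread and `consumer()` on another should return 7 from the consumer. Both fences are required: dropping either one allows the two accesses on that side to be reordered.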

I don't really care if data protection takes a few extra clocks.

Actually it's more like a few hundred. I read somewhere (I don't remember where, so this may be inaccurate) that on Windows it takes over 600 cycles to lock a mutex.

Look guys, I agree that where simple threading models can be used, threading issues are not as big of a problem. Maybe I jumped the gun a little bit on that one. Unfortunately I am not working in a situation where that is possible, and that is probably biasing my opinion.

**CornedBee** · 06-04-2009

Locking a mutex (as created by CreateMutex) on Windows involves kernel mode code. So locking it takes two mode switches, which means you can say good-bye to performance.

A CRITICAL_SECTION is far faster to lock in the no-contention case.

**Codeplug** · 06-04-2009

>> It works with GCC since linux's atomic primitives rely on it.
That's incorrect.
LXR linux/arch/x86/include/asm/atomic_32.h
You'll notice there's just "asm volatile" - with similar constructs for other architectures.

The C/C++ keyword "volatile" has nothing to do with MT programming - even on GCC.

gg

**SyntaxError** · 06-04-2009

Originally Posted by bithub

It is an OS dependent function. It is the equivalent of saying LockMutex(). LockMutex() isn't a real call, but everyone knows what the intention is. Since there is no standard way to do a memory barrier (as there is no standard way to lock a mutex), I just put in some pseudo code. I assumed you would know what it meant.

But everyone does not know. Most people simply use the thread libraries and it works for them. Also threading may not be standard but PThreads is pretty standard for Unix systems. That along with Windows threading handles a majority of the computers out there, and in practice it's easy to write a set of macros that get your job done for most systems.

Originally Posted by bithub

I agree that would be nice, but in practice I find that it is very hard to avoid writing to shared memory between threads in many cases.

This is a design issue. In general you must have some sort of communication between threads but that should be kept to a minimum. I don't know exactly what you are doing but if you are running 600 threads my guess is it's a bad design. There are very few circumstances where this would be justified. We used to have Intel machines with hundreds of processors for doing stuff like atomic bomb simulations. Maybe in that case it is justified. However those machines required special libraries anyway. I can't even begin to imagine why you would "need" 600 or even 200 threads for an application on your average machine.

In general, using low level operating system or machine dependent minutiae doesn't strike me as a good idea. I don't even look into these things until I have a problem I can't solve with the usual tools. I have done plenty of thread programming and I have never had to deal with barriers. Volatile and the normal threading libraries are very adequate. This kind of stuff is only justified in cases when all other avenues have been exhausted. More often than not it's an excuse for poor design. I used to be guilty of this myself: using fancy little tricks where I wouldn't really have had to if I had started with a better design. In my view a good programmer writes code that is as simple as practically possible and easy for others to support and understand. Code should not be written for the sole purpose of impressing someone with knowledge of low level details. I am always more impressed with the simple elegant solution to any problem. I think this soap box is about to collapse so I will get off it now.

Originally Posted by bithub

Actually it's more like a few hundred. I read somewhere (I don't remember where, so this may be inaccurate) that on Windows it takes over 600 cycles to lock a mutex.

The Windows mutex works across processes. They have now provided critical sections for threads, which are faster. In either case it's only a minimal difference to me.

Originally Posted by bithub

Look guys, I agree that where simple threading models can be used, threading issues are not as big of a problem. Maybe I jumped the gun a little bit on that one. Unfortunately I am not working in a situation where that is possible, and that is probably biasing my opinion.

In the past I have had to tell my managers something needs to be rewritten if I think it's klugey. Some software gets to be a support nightmare and it's better to bite the bullet and rewrite it. Maybe you don't have that option. Maybe this is one of those incredibly rare cases where 600 threads makes sense. I'm just getting the feeling that there is probably a better way to do what you are doing, if you are encountering a lot of threading issues.

Edit: One final note

Originally Posted by bithub

MemoryBarrier() (Windows)

From MSDN: Minimum supported client Windows Vista

**bithub** · 06-04-2009

Originally Posted by Codeplug

>> It works with GCC since linux's atomic primitives rely on it.
That's incorrect.
LXR linux/arch/x86/include/asm/atomic_32.h
You'll notice there's just "asm volatile" - with similar constructs for other architectures.

Are you sure? The atomic primitive is declared as volatile:

From /usr/include/asm/atomic.h:

Code:

/*
 * Make sure gcc doesn't try to be clever and move things around
 * on us. We need to use _exactly_ the address the user gave us,
 * not some alias that contains the same information.
 */
typedef struct { volatile int counter; } atomic_t;

The C/C++ keyword "volatile" has nothing to do with MT programming - even on GCC.

I don't think that's true. This article by Andrei Alexandrescu explains it differently. Now I have heard that the volatile keyword is not needed if you use memory barriers, but I am uncertain if this is true. If one thread caches a value in a register, and another thread on a different CPU core changes that value, won't you need to declare the type as volatile to ensure the first thread will read the correct value?

The Windows mutex works across processes. They have now provided critical sections for threads, which are faster.

Ah, so that's why. I don't do much Windows programming these days, so I had never heard this.

Maybe this is one of those incredibly rare cases where 600 threads makes sense.

Well, about 200-300 of those threads make sense. The rest come from programmers that are too used to a blocking paradigm instead of something event based. At any rate, I don't think management will agree to rewrite an application that has taken 3 years of work (1.5 years of devel, and another 1.5 years of maintenance).

**Codeplug** · 06-04-2009

>> Are you sure?
Yes, fairly sure. It's more likely that 'counter' is volatile because there are referencing assembly instructions which require a physical memory address.

Most compilers agree on the interpretation of volatile, in that a volatile variable will not be hoisted into a register as an optimization. This is a side effect of the common interpretation of what a volatile provides you from a standards perspective.

>> This article by Andrei Alexandrescu explains it differently.
Brewbuck tried that approach once... (response)

Here's another thread with rantings on volatile...

gg

**bithub** · 06-04-2009

It's more likely that 'counter' is volatile because there are referencing assembly instructions which require a physical memory address.

No, that's not true. In fact early versions of GCC did not even declare atomic_t values as volatile. Here is a snippet from the patch that changed the declaration to volatile.

Code:

+/*
+ * Make atomic_t volatile to remove the need for barriers in loops that
+ * wait for an outside event.  We generally want to re-load the atomic_t
+ * variable each time anyway, but don't need to re-load everything else.
+ */

The comment at the top of the link seems to indicate the main reason for the change was to support SMP programming.

The 2 links you posted still leave me unconvinced. I had always believed that the general consensus was that volatile variables force the compiler to not optimize the variable by storing its value in a register. Obviously this is something that is useful in MT programming if you don't need a full out lock or memory barrier. If you can provide a resource that definitively contradicts this assertion, I would be interested in reading it.

**CornedBee** · 06-04-2009

I had always believed that the general consensus was that volatile variables force the compiler to not optimize the variable by storing its value in a register.

True. Also, it disallows optimizing away dead writes.

Obviously this is something that is useful in MT programming if you don't need a full out lock or memory barrier.

No, it's not, unless you're on a platform that guarantees cache coherency. If you're not, the cache means that you can loop over that volatile variable forever, and you'll only notice its update if the core that updated it "feels like it", so to say. That's why MS explicitly defines volatile access to emit memory barriers.

Anyway, C++0x ends this debate. Multiple thread access to a variable where at least one access is a write is undefined unless:
a) it's atomic or
b) it's synchronized under the rules of the C++0x memory model.
Volatile doesn't come into it. It plays no part in that model.
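A minimal sketch of rules (a) and (b), assuming a compiler that implements the memory model as it was eventually published in C++11. The code being argued over is not reproduced in this part of the thread, so the names `value` and `value_saved` are borrowed from the posts above, while `save_value` and `wait_for_value` are hypothetical. The flag is atomic, and the plain `int` is synchronized through the flag's release/acquire pair. Nothing is declared volatile.

```cpp
#include <atomic>
#include <thread>

int value = 0;                         // plain data, published via the flag
std::atomic<bool> value_saved{false};  // atomic flag, no volatile needed

// Writer: store the data, then publish it with a release store.
void save_value(int v) {
    value = v;
    value_saved.store(true, std::memory_order_release);
}

// Reader: spin until the flag is set. The acquire load pairs with the
// release store above, so the earlier write to `value` is visible here.
int wait_for_value() {
    while (!value_saved.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return value;
}
```

With `save_value(42)` running on one thread, `wait_for_value()` on another should return 42. If `value_saved` were a plain or volatile `bool`, the same program would have a data race and therefore undefined behaviour under this model.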

**Codeplug** · 06-04-2009

Some architectures currently do not declare the contents of an atomic_t to be
volatile. This causes confusion since atomic_read() might not actually read
anything if an optimizing compiler re-uses a value stored in a register, which
can break code that loops until something external changes the value of an
atomic_t. Avoiding such bugs requires using barrier(), which causes re-loads
of all registers used in the loop, thus hurting performance instead of helping
it, particularly on architectures where it's unnecessary. Since we generally
want to re-read the contents of an atomic variable on every access anyway,
let's standardize the behavior across all architectures and avoid the
performance and correctness problems of requiring the use of barrier() in
loops that expect atomic_t variables to change externally. This is relevant
even on non-smp architectures, since drivers may use atomic operations in
interrupt handlers.

Or signal handlers. Any way an "external change" can occur. The addition of volatile was to prevent it from being cached in a register - just not for the reason I assumed.
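The signal-handler case is one place where volatile is the standard tool. A minimal sketch, assuming only the `<csignal>` facilities of standard C++: the flag is a `volatile std::sig_atomic_t`, which the standard allows a handler to write. The names `got_signal`, `on_signal`, and `raise_and_check` are hypothetical.

```cpp
#include <csignal>

// Written by the handler, read by normal code. volatile forces the
// compiler to re-load the flag instead of reusing a register copy.
volatile std::sig_atomic_t got_signal = 0;

extern "C" void on_signal(int) {
    got_signal = 1;
}

// Hypothetical helper: installs the handler, raises the signal in the
// calling thread, restores the default action, and reports whether the
// handler ran.
bool raise_and_check() {
    got_signal = 0;
    std::signal(SIGINT, on_signal);
    std::raise(SIGINT);  // the handler runs before raise() returns
    std::signal(SIGINT, SIG_DFL);
    return got_signal == 1;
}
```

This covers a handler interrupting the same thread. It says nothing about visibility between threads running on different cores.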

gg

**bithub** · 06-04-2009

No, it's not, unless you're on a platform that guarantees cache coherency.

Wouldn't that indicate that you always need a barrier when sharing variables across threads? If that's the case, then you are correct; volatile is not enough to validly share atomic variables among threads. I guess that also means my use of volatile in addition to barriers has always been redundant.

Anyway, C++0x ends this debate. Multiple thread access to a variable where at least one access is a write is undefined unless:

It doesn't really end the debate since:
1) We are talking about atomic variables (atomic_t)
2) Even though this is the C++ forum, the current discussion just as easily pertains to C.
3) C++0x isn't even finalized yet.

If anything ends the debate, it's your comment pointing out that volatile does not have an effect on the CPU cache.

**CornedBee** · 06-04-2009

Originally Posted by bithub

Wouldn't that indicate that you always need a barrier when sharing variables across threads?

You do.
volatile was originally intended for variables that are mapped to special addresses, like memory-mapped device registers. Action registers react to every write, even if it appears dead to the compiler, and status registers can update at any time. Some registers hold different values than what you write to them, too.
Since the CPU knows these special addresses, it doesn't need volatile to be told about it - but by the same line of reasoning, if volatile marks a variable that is not actually in special memory, the CPU will reorder accesses to it and cache it.

It doesn't really end the debate since:
1) We are talking about atomic variables (atomic_t)

atomic_t is not defined by the C++ standard. But let's assume that it is atomic. Then the debate is ended since the volatile has no further effect on it in the context of multithreading.

2) Even though this is the C++ forum, the current discussion just as easily pertains to C.

The C++0x memory model was developed in close cooperation with the C committee. The next C standard (dubbed C1x) will adopt it, most likely without any changes from C++0x, unless there are errata by that point.

3) C++0x isn't even finalized yet.

The memory model won't change. There are five open issues against that part of the standard, but they're all purely editorial.