An atomic increment would indeed be more than just a move instruction, but a store to an atomic variable can be one. Read-modify-write operations require either hardware support or a spinlock. Like plain moves, those operations may also require barriers around them.
Firstly, invalid cachelines are at a lower level of abstraction than C's memory model. Thinking in terms of cachelines seems to be confusing you, so I suggest you don't, until you understand C's memory model at a higher level.

Similarly, when thinking of this (so-called) atomic load: the invalidation request queue is cleared immediately prior to the load, but then anything can happen, the queue can become completely full, and so the load can read a cacheline that is invalid but not yet invalidated. This problem does not happen with (what I would call) atomic loads: you get what was really there at that moment.
Secondly, there's no such thing as "what was really there"; C is at a higher level than the hardware, so that definition is meaningless in a C context. Unsynchronized memory accesses can be arbitrarily reordered, provided that data dependencies are preserved. Even with relaxed atomics this is true. You can't reason about them as if the two threads execute in program order. You can only do that if you use sequentially consistent memory order, which implies the use of memory barriers. And if you do use sequentially consistent atomic loads and stores, then you don't need explicit memory barriers, because those are already built into the operations.
I just meant the code is described as working, so if it doesn't make sense to you, it's probably because you don't understand something.

I'm not sure what you mean. Can you elucidate?
That can be true for weak memory architectures. However, working in C, you can reason at a higher level. If your code is synchronized by the right kind of atomics, or by locks, loads and stores will have the order you need.
The answer is yes, provided that there is no data race. In other words, C disallows one thread from writing to a memory location while another thread accesses it concurrently, unless the variable is atomic. This allows compiler optimizations, but as mentioned above, once compiled, relaxed atomic stores are often simple store instructions.

I have one unanswered question here though, which is to do with stores to the same location: I wonder whether all threads are guaranteed to see them in the order they were stored, or not.
A memory barrier does ensure that memory operations become visible; that's part of their purpose. The details depend on which type of memory barrier is used.

I may be wrong, but I think this is not so. Memory barriers do solve the ordering problem, but they do not solve the visibility problem: what you store, *if it becomes visible*, will become visible in the correct order (which is to say, the order as constrained by store barriers), but there is no guarantee it *will become visible*. A forced write to memory (and only a forced write to memory) guarantees that earlier stores will then be visible (as constrained by such store barriers issued up to that point).
You don't actually ever want to force a write to main memory. At most you would invalidate the L1 and L2 caches of the cores running other threads, but this is an implementation detail of memory barriers.