Reply to: Hans-J. Boehm
long long x = 0;

then the following program:

    Thread 1: x = -1;
    Thread 2: r1 = x;

could never result in the local variable r1 being assigned a value other than 0 or -1. In fact, it is likely that on a 32-bit machine, the assignment of -1 would require two separate store instructions, and thread 2 might see the intermediate value. And it is often expensive to prevent such outcomes.
The preceding example is only one of many cases in which attempting to fully define the semantics of programs with data races would severely constrain the implementation. By prohibiting conflicting concurrent accesses, we remain consistent with pthread practice. We allow nearly all conventional compiler transformations on synchronization-free code, since we disallow any program that could detect invalid intermediate states introduced in the process.
Disallowing data races also has some more subtle consequences. In particular, when constructing an object with a virtual function table, we do not need to generate code that guards against another thread "seeing" the object with an uninitialized pointer to the table. Doing so would often require inserting expensive memory fence instructions.
Here we do not focus on the treatment of sequential code, since that is assumed to be understood. We instead try to clarify how N2052 assigns meaning to concurrent programs by defining which values can be "seen" by a particular evaluation or object access, and the definition and role of data races.
We will assume that each thread performs a sequence of evaluations in a known order, described by a sequenced before relation. If a and b are performed by the same thread, and a "comes first", we say that a is sequenced before b. (In reality, C++ allows a number of different evaluation orders for each thread, notably as a result of varying argument evaluation order, and this choice may vary each time an expression is evaluated. Here we assume that each thread has already chosen its argument evaluation orders in some way, and we simply define which multithreaded executions are consistent with this choice.)
For the sake of simplicity, it may help to view sequenced before as a relation that orders all evaluations within a thread. (In fact, it does not, even taking the previous parenthetical comment into account. The language specification allows certain unsequenced operations within a thread to themselves be interleaved, as opposed to ordered in one of several ways (indeterminately ordered). But that is not crucial to understanding the rest of this paper.)
    x = 1;
    x = 2;
    in parallel do
        Thread 1: x = 3;
        Thread 2: r1 = x;

The reference to x in thread 2 may "see" either a value of 2 or 3, since in each case the corresponding assignment is not required to be executed after the assignment to r1, and there is no other intervening assignment to x. Thread 2 may not see the value of 1, since the assignment x = 2; intervenes.
Thus our goal is essentially to define the multi-threaded version of the sequenced before relation. We do this by defining an additional relation called happens-before. For reasons of technical convenience, we define happens-before so that it only fully reflects ordering constraints that involve inter-thread communication. An evaluation a will thus have to become visible before an evaluation b if either a happens-before b, or a is sequenced-before b.
A thread T1 normally communicates with a thread T2 by assigning to some shared variable x and then synchronizing with T2. Most commonly, this synchronization would involve T1 acquiring a lock while it updates x, and then T2 acquiring the same lock while it reads x. Certainly any assignment performed prior to releasing a lock should be visible to another thread when acquiring the lock.
We describe this in several stages:
So far our discussion has been in terms of threads that communicate via lock-protected accesses to shared variables. This should indeed be the common case. But it is not the only case we wish to support.
Atomic variables are another, less common, way to communicate between threads. Experience has shown that such variables are most useful if they have at least the same kind of acquire-release semantics as locks. However, there are occasionally situations in which such ordering is not desired. (Additional ordering properties are also commonly desired. We expect that these will be specified in the atomics library, and are not addressed here.)
If atomic variables have acquire/release properties, then we can ensure that the following code does not result in an assertion failure. (This is of course still not proper C++ syntax, which is still TBD.)
    int x = 0;
    atomic_int y = 0;
    in parallel do
        Thread 1: x = 17; y = 1;
        Thread 2: while (!y); assert(x == 17);
In this case, the assignment to y has release semantics, while the reference to y in the while condition has acquire semantics. The pair behaves essentially like a lock release and acquisition with respect to the memory model. As a result, the assignment x = 17 is inter-thread ordered before the release operation y = 1. The release operation synchronizes with the last evaluation of the while condition, which is an acquire operation and loads the value stored by y = 1. The evaluation of y in the condition is again inter-thread ordered before the evaluation of x in the assertion. Thus the assignment x = 17 happens-before the evaluation of x in the assertion, and hence the assertion cannot fail, since the initialization of x to zero is no longer visible.
Once we have defined our happens-before ordering in this way, we largely define visibility as in the sequential case:
If a happens-before b but a sees a value assigned by b, we immediately get a precedes b and b precedes a, and hence, by transitivity, a precedes itself. Below we show that it also precludes more complicated "causal cycles".
(The current version of N2052 misstates this, in that it inadvertently neglects to state that we allow a chain of such relationships. This is an error that will be fixed. We really need to consider the transitive closure of the precedes relation defined in N2052.)
The condition prohibiting intervening assignments is stated directly as part of 1.10p10.
To understand the impact of such raw atomic operations it is best to look at a sequence of examples. We will write =raw for assignments that involve either an unordered load or store. We will assume that all variables have atomic integer type and initially have zero values.
First observe that in the absence of acquire and release operations, we effectively do not impose any ordering on operations performed within a thread. Consider the (perhaps overused) example:
    Thread 1:        Thread 2:
    x =raw 1;        y =raw 1;
    r1 =raw y;       r2 =raw x;
If we view threads as being executed by interleaving the evaluation of statements (or instructions) from each thread, then we should never get r1 = r2 = 0. In reality, this outcome is possible, because both the compiler and the hardware may reorder the two independent operations within each thread.
This is already reflected in our definition of happens-before. There are no happens-before relationships between any of the operations performed by this program. (Formally, only the static initializations of x and y happen-before any of the expression evaluations in the program.) Thus neither of the loads of x and y is restricted from seeing either the corresponding store or the static initialization, and r1 = r2 = 0 does not contradict any of our rules.
(Note that in this case there is actually no difference between raw operations and acquire-release operations. The latter might introduce synchronizes-with relationships, but since the stores precede the loads, there are no interesting inter-thread ordered before relationships. This will be different in the following examples.)
(So far the rules are also the same for ordinary integer variables as for atomics with unordered ("raw") assignments. The crucial difference is that if we used ordinary variables, all the examples in this section would contain data races, and thus exhibit undefined behavior.)
Next we consider two slightly different examples, the first one of which we already handle correctly. Consider first:
    Thread 1:        Thread 2:
    r1 =raw x;       r2 =raw y;
    y =raw 1;        x =raw 1;
Again there are no happens-before relationships, except for those involving static initialization. Hence both r1 and r2 may obtain either a zero or one value. Although r1 = r2 = 1 is less likely in practice than r1 = r2 = 0 in the preceding example, and fewer hardware architectures will allow reordering in this case, it has to be allowed for essentially the same reasons.
But now contrast this with
    Thread 1:        Thread 2:
    r1 =raw x;       r2 =raw y;
    y =raw r1;       x =raw r2;
By the rules we have specified so far, it is also permissible for the loads appearing first in each thread to see either the initial zero value, or the value stored by the other thread. This means that r1 = r2 = 1 continues to be possible if both of the initial loads see a value of 1, causing the stores to store 1 into x and y. The happens-before rules are not violated and each thread in isolation executes according to the language rules.
Unfortunately, the same reasoning can be used to justify r1 = r2 = 42 in this last example. The reasoning is blatantly circular, but nothing we have said so far prevents it: there are no cyclic precedes relationships, since there is no inter-thread ordered before relationship between the two statements in each thread.
In order to avoid this kind of circularity, and the resulting potential for "out of thin air" results, we added a second clause to the definition of inter-thread ordered before in 1.10p5 to order stores that depend on prior loads. Thus in the actual N2052 model, this example, but not the preceding one, results in a cyclic precedes relation, and r1 = r2 = 1 (or r1 = r2 = 42) is disallowed.
There are several possible definitions of a data race. Probably the most intuitive definition is that one occurs when two ordinary accesses to a scalar object, at least one of which is a write, are performed simultaneously by different threads. Our definition is actually quite close to this, but differs in two ways: