Doc. No.: N1284
Date: 2008-03-12
Reply to: Clark Nelson Hans-J. Boehm Lawrence Crowl
Phone: +1-503-712-8433 +1-650-857-3406 +1-650-253-3677
Email: clark.nelson@intel.com Hans.Boehm@hp.com
boehm@acm.org
crowl@google.com
Lawrence@Crowl.org

Parallel memory sequencing model

At its meeting in Kona (2007-10), WG21 adopted a memory sequencing model for parallel programming, along with library facilities for expressing accesses to atomic objects, by which locking and lock-free synchronization can be implemented.

This paper combines the text of WG21/N2429, which contains the wording added to the language section of the C++ standard, with the relevant parts of WG21/N2427, which contains the wording added to the library section. The language changes describe the way existing language operations, including assignments and loads, are affected by atomic operations, and by inference the optimizations that are and are not allowed to be performed in the presence of atomic operations. The library section describes the facilities by which atomic operations are expressed.

For explanation and rationale of the memory model overall, please see WG14 paper N1276 in the post-Kona mailing. This paper contains rationale for the design of the atomic (library) facilities, and continues with the normative text proposed for addition to the C++ standard, modified slightly to fit better into the C standard.

The wording will of course need to be modified further; at this stage, this should not be considered a complete and polished proposal for standardese, but rather a starting point. In particular, wording to specify type-generic macros in C needs close review. Also, the C++ standard in general, and the memory model proposal in particular, makes extensive use of non-normative inline notes (as opposed to footnotes) for explanation and rationale. The C standard today has no corresponding convention. I am not today prepared to propose either to eliminate the included explanation and rationale, or to invent a presentation convention appropriate to the C standard. Similarly, the presentation style of the descriptions of library functions matches C++ conventions, not C conventions.

It should be noted that some further additions to the model have been proposed in WG21, for data-dependency ordering. The most recent published formulation of these proposals may be found in WG21/N2492 and N2493. But as these proposals have not yet been adopted by WG21, they are not included in this paper.

In the remainder of this document, references to N-numbered papers are for WG21 documents by default.

Introduction

Rationale

The traditional shared-memory notion that every store is instantly visible to all threads induces an unacceptable performance loss on modern systems. Therefore programmers must have a mechanism to indicate when stores in one thread should be communicated to another.

Specifically, a program that wishes to communicate the fact that a particular piece of data prepared by one thread is ready to be examined by another thread, needs a shared variable flag, that both:

Although the second aspect is often glossed over, it is usually not automatic with modern hardware and compilers, and is just as important as the first in ensuring correctness.

In POSIX, this communication is achieved by calling certain functions, particularly mutex lock and unlock. While mutexes are adequate for many applications, there is a significant need for finer-grained atomic operations:

Lock-free concurrent data structures
Lock-free concurrent data structures are difficult to design and implement correctly. As such, there is a significant need for programmers capable of doing so to write portably, and for that they need standards.
Inter-thread memory synchronization
Occasionally synchronization with locks offers insufficient performance, and other, more specific idioms suffice.

Goals

We had several goals for our design. Each had significant impact on the details.

Discussion of Design

While our proposal is based on existing practice, that practice is somewhat fragmented, so some design choices may not be obvious. In this section, we discuss our choices and their reasoning.

A Library API

We propose to add atomic types and operations purely as a library API. In practice, for C or C++, this API would have to be implemented largely with either compiler intrinsics or assembly code. (As such, this proposal should be implemented by compiler vendors, not library vendors, much as the exception facilities are implemented by compiler vendors.) For C, a compiler implementation is required for the type-generic macros.

Atomic Types, Operations, and Volatile

We propose atomic types rather than atomic operations on general-purpose types. Doing so enforces a single synchronization protocol for all operations on an object. Using two protocols simultaneously will result in synchronization failure.

We chose to specify atomic types with conventional type names rather than modify the volatile type qualifier for concurrency. The volatile type qualifier has a long history within C and C++, and changing its meaning is both risky and unnecessary. In addition, the existing meaning of volatile, "may be modified by external agents", is somewhat orthogonal to "may be modified concurrently by the program". See N2016 Should volatile Acquire Atomicity and Thread Visibility Semantics? for a more complete discussion. As a consequence, objects of atomic type may also be volatile qualified. Compilers may optimize some non-volatile atomic operations, where they would not be able to optimize volatile operations.

C and C++ Interoperability

The proposal defines the atomic types as C++ standard-layout structs. In C, these would be simply opaque types. Headers common to both C and C++ use exactly the same syntax to declare atomic variables.

Furthermore, the proposal defines the C++ atomic types such that the static initialization syntax can be identical to C aggregate initialization syntax. That is, atomic variables can be initialized in common headers as well.

The core atomic operations are free functions that take pointers to atomic types. Programmers use the same syntax for these operations in both C and C++. That is, a header included from both C and C++ can provide inline functions that operate on atomic types.

The proposal defines the core functions as overloaded functions in C++ and as type-generic macros in C. This approach helps programmers avoid changing code when an atomic type changes size.

Because free functions are occasionally clumsy, the proposal additionally provides member operators and member functions so that C++ programmers may use a more concise syntax.

Memory Ordering

Synchronization operations in the memory model may be either acquire or release operations, or both. These operations govern the communication of non-atomic stores between threads. A release operation ensures that prior memory operations become visible to a thread performing subsequent acquire operation on the same object.

Rather than have atomic operations implicitly provide both acquire and release semantics, we choose to complicate the interface by adding explicit ordering specifications to various operations. Many comparable packages do not, and instead provide only a single version of operations, like compare-and-swap, which implicitly include a full memory fence.

Unfortunately, the extra ordering constraint introduced by the single version is almost never completely necessary. For example, an atomic increment operation may be used simply to count the number of times a function is called, as in a profiler. This requires atomicity, but no ordering constraints. And on many architectures (PowerPC, ARM, Itanium, Alpha, though not X86), the extra ordering constraints are at least moderately expensive.

The proposal defines an enumeration that enables detailed specification of the memory order for every operation.

Consistency

A strict interpretation of acquire and release yields a fairly weak consistency model, which allows threads to have a different notion of the order of writes. For stronger consistency, this proposal distinguishes between an operation with acquire and release semantics and an operation with sequentially consistent semantics. See N2177, Sequential Consistency for Atomics.

The member operators provide no mechanism for specifying memory order enumeration, and so they have only the strongest synchronization. This default is not an undue burden because using weaker synchronization requires considerable thought. Any extra syntactic burden is dwarfed by the semantic burden.

Program auditors can search for uses of the memory order enumeration to identify code that uses synchronization models that are weaker than sequentially consistent.

Fences

It is also unclear how a convention requiring full global memory fences would properly interact with an interface that supported simple atomic loads and stores. Here a full memory fence would generally multiply the cost by a large factor. (The current gcc interface does not seem to support simple atomic loads and stores explicitly, which makes it unclear how to support e.g. lock-based emulation, or architectures on which the relevant loads and stores are not implicitly atomic.)

There are two possible approaches to specifying ordering constraints:

Both approaches appear to have their merits. We chose the latter for reasons described in N2176.

Some architectures provide fences that are limited to loads or stores. We have, so far, not included them, since it seems to be hard to find cases in which both:

However, we have provided per-variable fence operations, which are semantically modeled as read-modify-write operations. Such fences enable programmers to conditionalize memory synchronization, which can substantially improve performance. See N2153 A simple and efficient memory model for weakly-ordered architectures for motivating examples and the Lazy Initialization example from N2427 for a use of the proposed syntax.

We expect that implementations that have hardware fences will use such operations to implement the language fences. See N2362 Converting Memory Fences to N2324 Form, for a discussion of the issues and approaches to converting from existing practice of global fences to the per-variable fences in this proposal.

Dependent Memory Order

Most architectures provide additional ordering guarantees if one memory operation is dependent on another. In fact, these are critical for efficient implementation of languages like Java.

For C++, there is near-universal agreement that it would be nice to have some such guarantees. The fundamental issues are:

See papers N2492 C++ Data-Dependency Ordering: Atomics and Memory Model and N2493 C++ Dependency Ordering: Function Annotation for exploration of these issues.

Lock-Free Property

In some cases, both the decision to use a lock-free algorithm, and sometimes the choice of lock-free algorithm, depends on the availability of underlying hardware primitives. In other cases, e.g. when dealing with asynchronous signals, it may be important to know that operations like compare-and-swap are really lock-free, because a lock-based emulation might result in deadlock.

The proposal defines feature queries to determine whether or not operations are lock-free. We provide two kinds of feature queries, compile-time preprocessor macros and run-time functions. To facilitate optimal storage use, the proposal supplies feature macros that indicates general lock-free status of integral and address atomic types. The run-time function provide per-object information.

The proposal provides run-time lock-free query functions rather than compile-time constants because subsequent implementations of a platform may upgrade locking operations with lock-free operations, so it is common for systems to abstract such facilities behind dynamic libraries, and we wish to leave that possibility open. Furthermore, we recommend that implementations without hardware atomic support use that technique.

The proposal provides lock-free query functions on individual objects rather than whole types to permit unavoidably misaligned atomic variables without penalizing the performance of aligned atomic variables.

Because consistent use of operations requires that all operations on a type must use the same protocol, all operations are lock-free or none of them are. That is, the lock-free property applies to whole objects, not individual operations.

To facilitate inter-process communication via shared memory, it is our intent that lock-free operations also be address-free. That is, atomic operations on the same memory location via two different addresses will communicate atomically. The implementation shall not depend on any per-process state. While such a definition is beyond the scope of the standard, a clear statement of our intent will enable a portable expression of class of a programs already extant.

Flag Type and Operations

The proposal includes a very simple atomic flag type, providing two primary operations, test-and-set and clear. This type is the minimum hardware-implemented type needed to conform to this standard. The remaining types can be emulated with the atomic flag, though with less than ideal properties. Few programmers should be using this type.

We considered adding a wait operation in this proposal, but ultimately rejected it because pure busy waiting has pathological performance characteristics on multi-programmed machines and because doing better requires interaction with the operating system scheduler, which seems inappropriate to a processor-level facility.

Integral and Address Types

The proposal includes a full set of integral atomic types and an address atomic type.

This proposal includes atomic integral types smaller than a machine word, even though many architectures do not have such operations. For machines that implement a word-based compare-and-swap operation, the effect of operations can be achieved by loading the containing word, modifying the sub-word in place, and performing a compare-and-swap on the containing word. In the event that no compare-and-swap is available, the implementation may need to either implement smaller types with larger types or use locking algorithms. Using the same size for the atomic type as its base type eases effort required to port existing code, e.g. code using __sync.

Atomic Operations

The simplest atomic operations are load and store.

There are times when one wishes to store a new value, but also load the old value. An atomic load followed by an atomic store is not an atomic sequence, so we provide an atomic swap operation that does a combined load and store atomically.

We also provide the common fetch-and-modify operations. The fetch-and-modify functions return the original stored value. The original stored value is required for fetch-and-or and fetch-and-and because there is no means to compute to the original stored value from the new value and the modifying argument. In contrast to the functions, the fetch-and-modify assignment operators return the new value. We do this for consistency with normal assignment operators. Unlike normal C++ assignment operators, though, the atomic assignments return values rather than references, which is like C. The reason is that another thread might intervene between an assignment and a subsequent read. Rather than introduce this classic parallel programming bug, we return a value.

The normal signed integral addition operations have undefined behavior in the event of an overflow. For atomic variables, this undefined behavior introduces significant burden because there is in general no way to pre-test for overflow. Rather than leave overflow undefined, we recognize the de facto behavior of modern systems and define atomic fetch-and-add (-subtract) to use two's-complement arithmetic. We are aware of no implementation of C or C++ for which this definition is a problem.

The compare-and-swap operation seems to be the minimum general-purpose synchronization primitive. It appears to be both necessary and sufficient for most interesting lock-free algorithms. The compare-and-swap operations may fail spuriously. That is, the stored value may be known to be equal to the expected value, and yet the operation fails to swap. We permit this failure for two reasons. First, it appears unavoidable for load-locked store-conditional implementations. Second, hardware memory switches may prefer to fail some operations to ensure overall machine performance.

The compare-and-swap operations replace the expected value with the found value in the event that the found value does not match the expected value. The intent of the design is to set up a compare-and-swap loop for the next iteration. The reason for this design is that:

Unlike other operations, the compare-and-swap operations have two memory synchronization order parameters. The first is for operations that succeed; the second is for those that fail. The reason for this pair of specifications is that there are several circumstances in which the failure case can have substantially weaker, and substantially cheaper, synchronization. However, for sequential consistency, full synchronization is required.

Template Types

In C++, atomic type templates provide significant benefits:

The atomic type template has a partial specialization for pointers (derived from the atomic address type), and full specializations for integral types (derived from the atomic integrals).

The primary problem with an atomic template is that effective use of machine operations requires that their parameter types are trivially copy-assignable and bitwise equality-comparable.

Furthermore, parameter types that are not also statically initializable and trivially destructible may be difficult to use.

In the present language, there is no mechanism to require these properties of a type parameter. Roland Schwarz suggests using a template union to enforce POD type parameters. Unfortunately, that approach also prevents the derivation of specializations of atomic for the types above, which is unacceptable. Furthermore, WG21 has accepted a proposal (N2544) to remove POD restrictions on types of union members. We believe that concepts are a more appropriate mechanism to enforce this restriction, and so we have proposed the concepts and concept maps necessary for safe use of atomic templates. The burden on programmers using atomic templates is to declare that types passed as template parameters are in fact bitwise equality-comparable. The proposed mechanisms will infer trivially copy-assignable types.

The intent is that vendors will specialize a fully-general locking implementation of an atomic template with implementations using hardware primitives when those primitives are applicable. This specialization can be accomplished by defining a base template with the size and alignment of the parameter type as additional template parameters, and then specializing on those two arguments.

Implementability

We believe that there is ample evidence for implementability.

The Intel/gcc __sync intrinsics provide evidence for compiler implementability of the proposed interface.

(We do not advocate standardizing these intrinsics as is. They provide far less control over memory ordering than we advocated above. For example, they provide no way to atomically increment a counter without imposing unnecessary ordering constraints. The lack of appropriate ordering control appears to already have resulted in implementation shortcuts, e.g. gcc does not implement __sync_synchronize() as a full memory barrier on X86, in spite of the documentation. We believe a number of issues were not fully understood when that design was developed, and it could greatly benefit from another iteration at this stage.)

Other packages, particularly Boehm's atomic_ops package provide evidence of efficient implementability over a range of architectures.

The remaining implementation issue is the burden on implementers to produce a minimally conforming implementation. The minimum hardware support required is the atomic test-and-set and clear operations that form the basis of the atomic flag type. The original proposal includes an example implementation based on that minimal hardware support, and thus shows the vendor work required.

Non-terminating loops

It is generally felt that it is important to allow the transformation of potentially non-terminating loops (e.g. by merging two loops that iterate over the same potentially infinite set, or by eliminating a side-effect-free loop), even when that may not otherwise be justified in the case in which the first loop never terminates.

Existing compilers commonly assume that code immediately following a loop is executed if and only if code immediately preceding a loop is executed. This assumption is clearly invalid if the loop fails to terminate. Even if we wanted to prohibit this behavior, it is unclear that all relevant compilers could comply in a reasonable amount of time. The assumption appears both pervasive and hard to test for.

The treatment of non-terminating loops in the current standard is very unclear. We believe that some implementations already eliminate potentially non-terminating, side-effect-free, loops, probably based on 1.9p9, which appears to impose very weak requirements on conforming implementations for non-terminating programs. We had previously arrived at a tentative conclusion that non-terminating loops were already sufficiently weakly specified that no changes were needed. We no longer believe this, for the following reasons:

Changes to the Standard

The definition of "memory location"

Add a new subclause to clause 3:

memory location

either an object of scalar type, or a maximal sequence of adjacent bit-fields all having non-zero width.

Note: Two threads of execution can update and access separate memory locations without interfering with each other.

Note: A bit-field and an adjacent non-bit-field are in separate memory locations, and therefore can be concurrently updated by two threads of execution without interference. The same applies to two bit-fields, if one is declared inside a nested struct declaration and the other is not, or if the two are separated by a zero-length bit-field declaration, or if they are separated by a non-bit-field declaration. It is not safe to concurrently update two bit-fields in the same struct if all fields between them are also bit-fields, no matter what the sizes of those intervening bit-fields happen to be.

Example: A structure declared as struct {char a; int b:5, c:11, :0, d:8; struct {int ee:8;} e;} contains four separate memory locations: The field a, and bit-fields d and e.ee are each separate memory locations, and can be modified concurrently without interfering with each other. The bit-fields b and c together constitute the fourth memory location. The bit-fields b and c cannot be concurrently modified, but b and a, for example, can be.

Multi-threaded executions and data races

Insert a new section after 5.1.2.3, titled "Multi-threaded executions and data races".

5.1.2.4p1:

Under a hosted implementation, a C program can have more than one thread of execution (a.k.a. thread) running concurrently. The execution of each thread proceeds as defined by the remainder of this standard. The execution of the entire program consists of an execution of all of its threads. [Note: Usually the execution can be viewed as an interleaving of all its threads. However some kinds of atomic operations, for example, allow executions inconsistent with a simple interleaving, as described below. —end note ] Under a freestanding implementation, it is implementation-defined whether a program can have more than one thread of execution.

5.1.2.4p2:

The value of an object visible to a thread T at a particular point might be the initial value of the object, a value assigned to the object by T, or a value assigned to the object by another thread, according to the rules below. [Note: In some cases, there may instead be undefined behavior. Much of this section is motivated by the desire to support atomic operations with explicit and detailed visibility constraints. However, it also implicitly supports a simpler view for more restricted programs. —end note ]

5.1.2.4p3:

Two expression evaluations conflict if one of them modifies a memory location and the other one accesses or modifies the same memory location.

5.1.2.4p4:

The library defines a number of atomic operations («ATM») and operations on locks (???) that are specially identified as synchronization operations. These operations play a special role in making assignments in one thread visible to another. A synchronization operation is either an acquire operation or a release operation, or both, on one or more memory locations; the semantics of these are described below. In addition, there are relaxed atomic operations, which are not synchronization operations, and atomic read-modify-write operations, which have special characteristics, also described below. [Note: For example, a call that acquires a lock will perform an acquire operation on the locations comprising the lock. Correspondingly, a call that releases the same lock will perform a release operation on those same locations. Informally, performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform an acquire operation on A. We do not include "relaxed" atomic operations as "synchronization" operations though, like synchronization operations, they cannot contribute to data races. —end note ]

5.1.2.4p5:

All modifications to a particular atomic object M occur in some particular total order, called the modification order of M. If A and B are modifications of an atomic object M, and A happens before B, then A shall precede B in the modification order of M, which is defined below. [Note: This states that the modification orders must respect "happens before". —end note ] [Note: There is a separate order for each scalar object. There is no requirement that these can be combined into a single total order for all objects. In general this will be impossible since different threads may observe modifications to different variables in inconsistent orders. —end note ]

5.1.2.4p6:

A release sequence on an atomic object M is a maximal contiguous sub-sequence of side effects in the modification order of M, where the first operation is a release, and every subsequent operation

5.1.2.4p7:

An evaluation A that performs a release operation on an object M synchronizes with an evaluation B that performs an acquire operation on M and reads a value written by any side effect in the release sequence headed by A. [Note: Except in the specified cases, reading a later value does not necessarily ensure visibility as described below. Such a requirement would sometimes interfere with efficient implementation. —end note ] [Note: The specifications of the synchronization operations define when one reads the value written by another. For atomic variables, the definition is clear. All operations on a given lock occur in a single total order. Each lock acquisition "reads the value written" by the last lock release. —end note ]

5.1.2.4p8:

An evaluation A happens before an evaluation B if:

5.1.2.4p9:

A visible side effect A on an object M with respect to a value computation B of M satisfies the conditions:

The value of a non-atomic scalar object M, as determined by evaluation B, shall be the value stored by the visible side effect A. [ Note: If there is ambiguity about which side effect to a non-atomic object is visible, then there is a data race, and the behavior is undefined. —end note ] [ Note: This states that operations on ordinary variables are not visibly reordered. This is not actually detectable without data races, but it is necessary to ensure that data races, as defined here, and with suitable restrictions on the use of atomics, correspond to data races in a simple interleaved (sequentially consistent) execution. —end note ]

5.1.2.4p10:

The visible sequence of side effects on an atomic object M, with respect to a value computation B of M, is a maximal contiguous sub-sequence of side effects in the modification order of M, where the first side effect is visible with respect to B, and for every subsequent side effect, it is not the case that B happens before it. The value of an atomic object M, as determined by evaluation B, shall be the value stored by some operation in the visible sequence of M with respect to B. Furthermore, if a value computation A of an atomic object M happens before a value computation B of M, and the value computed by A corresponds to the value stored by side effect X, then the value computed by B shall either equal the value computed by A, or be the value stored by side effect Y, where Y follows X in the modification order of M. [Note: This effectively disallows compiler reordering of atomic operations to a single object, even if both operations are "relaxed" loads. By doing so, we effectively make the "cache coherence" guarantee provided by most hardware available to C atomic operations.—end note ] [Note: The visible sequence depends on the "happens before" relation, which depends on the values observed by loads of atomics, which we are restricting here. The intended reading is that there must exist an association of atomic loads with modifications they observe that, together with suitably chosen modification orders and the "happens before" relation derived as described above, satisfy the resulting constraints as imposed here. -- end note.]

5.1.2.4p11:

The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior. [Note: It can be shown that programs that correctly use simple locks to prevent all data races, and use no other synchronization operations, behave as though the executions of their constituent threads were simply interleaved, with each observed value of an object being the last value assigned in that interleaving. This is normally referred to as "sequential consistency". However, this applies only to race-free programs, and race-free programs cannot observe most program transformations that do not change single-threaded program semantics. In fact, most single-threaded program transformations continue to be allowed, since any program that behaves differently as a result must perform an undefined operation. —end note ]

5.1.2.4p12:

[Note: Compiler transformations that introduce assignments to a potentially shared memory location that would not be modified by the abstract machine are generally precluded by this standard, since such an assignment might overwrite another assignment by a different thread in cases in which an abstract machine execution would not have encountered a data race. This includes implementations of data member assignment that overwrite adjacent members in separate memory locations. We also generally preclude reordering of atomic loads in cases in which the atomics in question may alias, since this may violate the "visible sequence" rules. —end note ]

5.1.2.4p13:

[Note: Transformations that introduce a speculative read of a potentially shared memory location may not preserve the semantics of the C program as defined in this standard, since they potentially introduce a data race. However, they are typically valid in the context of an optimizing compiler that targets a specific machine with well-defined semantics for data races. They would be invalid for a hypothetical machine that is not tolerant of races or provides hardware race detection. —end note ]

Non-terminating loops

Add a new paragraph as 6.8.5p6:

An iteration statement that

in its body, controlling expression, or, in the case of a for statement, its expression-3, may be assumed by the implementation to terminate. [Note: This is intended to allow compiler transformations, such as removal of empty loops, even when termination cannot be proven. —end note]

«ATM» Atomics <stdatomic.h>

Add a new library subclause with the following paragraphs. «ATM» will be used to refer to the number of this new subclause.

This clause describes components for fine-grained atomic access. This access is provided via operations on atomic objects. [Footnote: Atomic objects are neither active nor radioactive. —end footnote]

The following subclauses describe atomics requirements, and components for types and operations, as summarized below.

Atomics library summary
Subclause Header(s)
«ATM».1 Order and consistency <stdatomic.h>
«ATM».2 Lock-free property
«ATM».3 Atomic flag type and operations
«ATM».4 Atomic integral and address types
«ATM».5 Operations on atomic integral and address types

«ATM».1 Order and consistency

Add a new sub-clause with the following paragraphs.

The enumeration memory_order specifies the detailed regular (non-atomic) memory synchronization order as defined in 5.1.2.4 and may provide for operation ordering. Its enumerated values and their meanings are as follows.

memory_order_relaxed
The operation does not order memory.
memory_order_release
Performs a release operation on the affected memory locations, thus making regular memory writes visible to other threads through the atomic variable to which it is applied.
memory_order_acquire
Performs an acquire operation on the affected memory locations, thus making regular memory writes in other threads released through the atomic variable to which it is applied, visible to the current thread.
memory_order_acq_rel
The operation has both acquire and release semantics.
memory_order_seq_cst
The operation has both acquire and release semantics, and in addition, has sequentially-consistent operation ordering.

The memory_order_seq_cst operations that load a value are acquire operations on the affected locations. The memory_order_seq_cst operations that store a value are release operations on the affected locations. In addition, in a consistent execution, there must be a single total order S on all memory_order_seq_cst operations, consistent with the happens before order and modification orders for all affected locations, such that each memory_order_seq_cst operation observes either the last preceding modification according to this order S, or the result of an operation that is not memory_order_seq_cst. [Note: Although we do not explicitly require that S include locks, it can always be extended to an order that does include lock and unlock operations, since the ordering between those is already included in the happens before ordering. —end note]

An atomic store shall only store a value that has been computed from constants and program input values by a finite sequence of program evaluations, such that each evaluation observes the values of variables as computed by the last prior assignment in the sequence. [Footnote: Among other implications, atomic variables shall not decay. —end footnote] The ordering of evaluations in this sequence must be such that

  1. If an evaluation B observes a value computed by A in a different thread, then B must not happen before A.
  2. If an evaluation A is included in the sequence, then all evaluations that assign to the same variable and are sequenced before or happens-before A must be included.

[Note: Requirement 2 disallows "out-of-thin-air", or "speculative" stores of atomics when relaxed atomics are used. Since unordered operations are involved, evaluations may appear in this sequence out of thread order. For example, with x and y initially zero,

Thread 1:
r1 = atomic_load_explicit(&y, memory_order_relaxed);
atomic_store_explicit(&x, r1, memory_order_relaxed);
Thread 2:
r2 = atomic_load_explicit(&x, memory_order_relaxed);
atomic_store_explicit(&y, 42, memory_order_relaxed);

is allowed to produce r1 == 42 && r2 == 42. The sequence of evaluations justifying this consists of:

atomic_store_explicit(&y, 42, memory_order_relaxed);
r1 = atomic_load_explicit(&y, memory_order_relaxed);
atomic_store_explicit(&x, r1, memory_order_relaxed);
r2 = atomic_load_explicit(&x, memory_order_relaxed);

On the other hand,

Thread 1:
r1 = atomic_load_explicit(&y, memory_order_relaxed);
atomic_store_explicit(&x, r1, memory_order_relaxed);
Thread 2:
r2 = atomic_load_explicit(&x, memory_order_relaxed);
atomic_store_explicit(&y, r2, memory_order_relaxed);

may not produce r1 == 42 && r2 = 42, since there is no sequence of evaluations that results in the computation of 42. In the absence of "relaxed" operations and read-modify-write operations with weaker than memory_order_acq_rel ordering, it has no impact. —end note]

[Note: The requirements do allow r1 == 42 && r2 == 42 in (x, y initially zero):

Thread 1:
r1 = atomic_load_explicit(&x, memory_order_relaxed);
if (r1 == 42) atomic_store_explicit(&y, r1, memory_order_relaxed);
Thread 2:
r2 = atomic_load_explicit(&y, memory_order_relaxed);
if (r2 == 42) atomic_store_explicit(&x, 42, memory_order_relaxed);

Implementations are discouraged from allowing such behavior. —end note]

Implementations shall strive to make atomic stores visible to atomic loads within a reasonable amount of time. They shall never move an atomic operation out of an unbounded loop.

«ATM».2 Lock-Free Property

Add a new sub-clause with the following paragraphs.

The macros ATOMIC_INTEGRAL_LOCK_FREE and ATOMIC_ADDRESS_LOCK_FREE indicates the general lock-free property of integral and address atomic types. A value of 0 indicates that the types are never lock-free. A value of 1 indicates that the types are sometimes lock-free. A value of 2 indicates that the types are always lock-free.

The function atomic_is_lock_free indicates whether or not the object is lock-free. The result of a lock-free query on one object cannot be inferred from the result of a lock-free query on another object.

[Note: Operations that are lock-free should also be also address-free. That is, atomic operations on the same memory location via two different addresses will communicate atomically. The implementation shall not depend on any per-process state. This restriction enables communication via memory mapped into a process more than once and memory shared between two processes. —end note]

«ATM».3 Atomic flag type and operations

Add a new sub-clause with the following paragraphs.

atomic_flag is an object type providing the classic test-and-set functionality. It has two states, set and clear.

Operations on an atomic_flag must be lock free. [Note: Hence the operations must be address-free. No other type requires lock-free operations, and hence the atomic_flag type is the minimum hardware-implemented type needed to conform to this standard. The remaining types can be emulated with atomic_flag, though with less than ideal properties. —end note]

The macro ATOMIC_FLAG_INIT shall be used to initialize an atomic_flag to the clear state. Such initialization shall be valid for static-duration variables. An atomic_flag shall not be zero initialized. [Example:

atomic_flag guard = ATOMIC_FLAG_INIT;

—end example].

bool atomic_flag_test_and_set(volatile atomic_flag *object);
bool atomic_flag_test_and_set_explicit(volatile atomic_flag *object,
					memory_order);
Effects:
Atomically, set the value in object to true. Memory is affected as per order. These operations are read-modify-write operations in the sense of the "synchronizes with" definition in 5.1.2.4, and hence both such an operation and the evaluation that produced the input value synchronize with any evaluation that reads the updated value.
Returns:
Atomically, true if the set of object immediately before the effects.
void atomic_flag_clear(volatile atomic_flag *object);
void atomic_flag_clear_explicit(volatile atomic_flag *object, memory_order);
Effects:
Atomically, set the value in object to true. Memory is affected as per order.
Requires:
The order argument shall be neither memory_order_acquire nor memory_order_acq_rel.
void atomic_flag_fence(const volatile atomic_flag *object, memory_order order);
Effects:
Memory is affected as per order. These operations are read-modify-write operations in the sense of the "synchronizes with" definition in 5.1.2.4, and hence such an operation synchronizes with any operation on the same variable.
Requires:
The order argument shall not be memory_order_relaxed.

The global variable atomic_global_fence_compatibility of type atomic_flag may be used to emulate global fences. [Example:

atomic_flag_fence(&atomic_global_fence_compatibility, memory_order_acquire);

end example]

«ATM».4 Atomic integral and address types

Add a new sub-clause with the following paragraphs.

The atomic integral and address types are listed below. These types shall be object types.

The set of operations available for these types are defined in «ATM».5 Operations on atomic integral and address types.

The atomic_bool type provides an atomic boolean.

The atomic_address type provides atomic void* operations. The unit of addition/subtraction is one byte.

There are a full set of atomic integral types.

atomic_char, atomic_schar, atomic_uchar, atomic_short, atomic_ushort, atomic_int, atomic_uint, atomic_long, atomic_ulong, atomic_llong, atomic_ullong atomic_char16_t, atomic_char32_t, atomic_wchar_t

In addition to integral types, there are typedefs for atomic types corresponding to the <cstdint> typedefs.

atomic_int_least8_t, atomic_uint_least8_t, atomic_int_least16_t, atomic_uint_least16_t, atomic_int_least32_t, atomic_uint_least32_t, atomic_int_least64_t, atomic_uint_least64_t, atomic_int_fast8_t, atomic_uint_fast8_t, atomic_int_fast16_t, atomic_uint_fast16_t, atomic_int_fast32_t, atomic_uint_fast32_t, atomic_int_fast64_t, atomic_uint_fast64_t, atomic_intptr_t, atomic_uintptr_t, atomic_size_t, atomic_ssize_t, atomic_ptrdiff_t, atomic_intmax_t, atomic_uintmax_t

[Note: The representation of atomic integral and address types need not have the same size as their corresponding regular types. They should have the same size whenever possible, as it eases effort required to port existing code. —end note]

«ATM».5 Operations on integral and address types

Add a new sub-clause with the following paragraphs.

There are only a few kinds of operations on atomic types, though there are many instances on those kinds. This section specifies each general kind.

In the following operation definitions:

[Note: Many operations are volatile-qualified. The "volatile as device register" semantics have not changed in the standard. This qualification means that volatility is preserved when applying these operations to volatile objects. It does not mean that operations on non-volatile objects become volatile. Thus, volatile qualified operations on non-volatile objects may be merged under some conditions. —end note]

bool atomic_is_lock_free(const volatile A *object);
Returns:
True if the object's operations are lock-free, false otherwise.
void atomic_store(volatile A *object, C desired);
void atomic_store_explicit(volatile A *object, C desired, memory_order order);
Effects:
Atomically replace the value in object with desired. Memory is affected as per order.
Requires:
The order argument shall be neither memory_order_acquire nor memory_order_acq_rel.
C atomic_load(volatile A *object);
C atomic_load_explicit(volatile A *object, memory_order);
Effects:
Memory is affected as per order.
Returns:
Atomically return value in object. Memory is affected as per order.
Requires:
The order argument shall be neither memory_order_release nor memory_order_acq_rel.
C atomic_swap(volatile A *object, C desired);
C atomic_swap_explicit(volatile A *object, C desired, memory_order);
Effects:
Atomically replace the value in object with desired. Memory is affected as per order. These operations are read-modify-write operations in the sense of the "synchronizes with" definition in 5.1.2.4, and hence both such an operation and the evaluation that produced the input value synchronize with any evaluation that reads the updated value.
Returns:
Atomically, the value of object immediately before the effects.
bool atomic_compare_swap(volatile A *object, C *expected, C desired);
bool atomic_compare_swap_explicit(volatile A *object,
        C *expected, C desired, memory_order success, memory_order failure);
Effects:
Atomically, compares the value in object for equality with that in expected, and if true, replaces the value in object with desired, and if false, updates the value in expected with the value in object. If true, memory is affected as per success. If false, memory is affected as per failure. These operations are read-modify-write operations in the sense of the "synchronizes with" definition in 5.1.2.4, and hence both such an operation and the evaluation that produced the input value synchronize with any evaluation that reads the updated value.
Returns:
The result of the comparison.
Requires:
The failure argument shall be neither memory_order_release nor memory_order_acq_rel. The failure argument shall be no stronger than the success argument.

[Note: The effect of the compare-and-swap operations is

if (*object == *expected)
    *object = desired;
else
    *expected = *object;

end note]

The compare-and-swap operations may fail spuriously, that is return false while leaving the value in expected unchanged. [Note: This spurious failure enables implementation of compare-and-swap on a broader class of machines, e.g. load-locked store-conditional machines. —end note] [Example: A consequence of spurious failure is that nearly all uses of compare-and-swap will be in a loop.

expected = atomic_load(&current);
do desired = function(expected);
while (!atomic_compare_swap(&current, expected, desired));

end example]

void atomic_fence(const volatile A *object, memory_order order);
Effects:
Memory is affected as per order. These operations are read-modify-write operations in the sense of the "synchronizes with" definition in 5.1.2.4, and hence such an operation synchronizes with any operation on the same variable.
Requires:
The failure argument shall not be memory_order_relaxed.

The following operations perform arithmetic and bitwise computations. All of these operations are applicable to an object of any atomic integer type. Only addition and subtraction are applicable to atomic_address. None of these operations is applicable to atomic_bool. The key, operator, and computation correspondence is:

key op computation key op computation
add + addition sub - subtraction
or | bitwise inclusive or xor ^ bitwise exclusive or
and & bitwise and
C atomic_fetch_key(volatile A *object, M operand);
C atomic_fetch_key_explicit(volatile A *object, M operand, memory_order);
Effects:
Atomically replace the value in object with result of the computation applied to the value in object and the given operand. Memory is affected as per order. These operations are read-modify-write operations in the sense of the "synchronizes with" definition in 5.1.2.4, and hence both such an operation and the evaluation that produced the input value synchronize with any evaluation that reads the updated value.
Returns:
Atomically, the value of object immediately before the effects.

For signed integral types, arithmetic is defined to use two's-complement representation. There are no undefined results. For address types, the result may be an undefined address, but the operations otherwise have no undefined behavior.