Doc. No.:	N1284
Date:	2008-03-12
Reply to:	Clark Nelson	Hans-J. Boehm	Lawrence Crowl
Phone:	+1-503-712-8433	+1-650-857-3406	+1-650-253-3677
Email:	clark.nelson@intel.com	Hans.Boehm@hp.com boehm@acm.org	crowl@google.com Lawrence@Crowl.org

Parallel memory sequencing model

At its meeting in Kona (2007-10), WG21 adopted a memory sequencing model for parallel programming, along with library facilities for expressing accesses to atomic objects, by which locking and lock-free synchronization can be implemented.

This paper combines the text of WG21/N2429, which contains the wording added to the language section of the C++ standard, with the relevant parts of WG21/N2427, which contains the wording added to the library section. The language changes describe the way existing language operations, including assignments and loads, are affected by atomic operations, and by inference the optimizations that are and are not allowed to be performed in the presence of atomic operations. The library section describes the facilities by which atomic operations are expressed.

For explanation and rationale of the memory model overall, please see WG14 paper N1276 in the post-Kona mailing. This paper contains rationale for the design of the atomic (library) facilities, and continues with the normative text proposed for addition to the C++ standard, modified slightly to fit better into the C standard.

The wording will of course need to be modified further; at this stage, this should not be considered a complete and polished proposal for standardese, but rather a starting point. In particular, wording to specify type-generic macros in C needs close review. Also, the C++ standard in general, and the memory model proposal in particular, makes extensive use of non-normative inline notes (as opposed to footnotes) for explanation and rationale. The C standard today has no corresponding convention. I am not today prepared to propose either to eliminate the included explanation and rationale, or to invent a presentation convention appropriate to the C standard. Similarly, the presentation style of the descriptions of library functions matches C++ conventions, not C conventions.

It should be noted that some further additions to the model have been proposed in WG21, for data-dependency ordering. The most recent published formulation of these proposals may be found in WG21/N2492 and N2493. But as these proposals have not yet been adopted by WG21, they are not included in this paper.

In the remainder of this document, references to N-numbered papers are for WG21 documents by default.

Introduction

Rationale

The traditional shared-memory notion that every store is instantly visible to all threads induces an unacceptable performance loss on modern systems. Therefore programmers must have a mechanism to indicate when stores in one thread should be communicated to another.

Specifically, a program that wishes to communicate the fact that a particular piece of data prepared by one thread is ready to be examined by another thread, needs a shared variable flag, that both:

Allows atomic accesses, in the sense that concurrent reads and writes are allowed and that the reads result in only one of the assigned values and never undefined behavior.
Ensures that any ordinary data written before the flag is set (i.e. the prepared data) is seen correctly by another thread after it sees a set flag.

Although the second aspect is often glossed over, it is usually not automatic with modern hardware and compilers, and is just as important as the first in ensuring correctness.

In POSIX, this communication is achieved by calling certain functions, particularly mutex lock and unlock. While mutexes are adequate for many applications, there is a significant need for finer-grained atomic operations:

Lock-free concurrent data structures: Lock-free concurrent data structures are difficult to design and implement correctly. As such, there is a significant need for programmers capable of doing so to write portably, and for that they need standards.
Inter-thread memory synchronization: Occasionally synchronization with locks offers insufficient performance, and other, more specific idioms suffice.

Goals

We had several goals for our design. Each had significant impact on the details.

Provide the primitives critical to inter-thread memory synchronization.
Provide the primitives critical to effective lock-free algorithms.
Provide a very high degree of C and C++ interoperability.
Cover the common range of existing hardware primitives.
Expose those hardware primitives efficiently to programmers, that is, the design should minimize overhead.
Provide a syntax that reduces errors.
Support template programming.

Discussion of Design

While our proposal is based on existing practice, that practice is somewhat fragmented, so some design choices may not be obvious. In this section, we discuss our choices and their reasoning.

A Library API

We propose to add atomic types and operations purely as a library API. In practice, for C or C++, this API would have to be implemented largely with either compiler intrinsics or assembly code. (As such, this proposal should be implemented by compiler vendors, not library vendors, much as the exception facilities are implemented by compiler vendors.) For C, a compiler implementation is required for the type-generic macros.

Atomic Types, Operations, and Volatile

We propose atomic types rather than atomic operations on general-purpose types. Doing so enforces a single synchronization protocol for all operations on an object. Using two protocols simultaneously will result in synchronization failure.

We chose to specify atomic types with conventional type names rather than modify the volatile type qualifier for concurrency. The volatile type qualifier has a long history within C and C++, and changing its meaning is both risky and unnecessary. In addition, the existing meaning of volatile, "may be modified by external agents", is somewhat orthogonal to "may be modified concurrently by the program". See N2016 Should volatile Acquire Atomicity and Thread Visibility Semantics? for a more complete discussion. As a consequence, objects of atomic type may also be volatile qualified. Compilers may optimize some non-volatile atomic operations, where they would not be able to optimize volatile operations.

C and C++ Interoperability

The proposal defines the atomic types as C++ standard-layout structs. In C, these would be simply opaque types. Headers common to both C and C++ use exactly the same syntax to declare atomic variables.

Furthermore, the proposal defines the C++ atomic types such that the static initialization syntax can be identical to C aggregate initialization syntax. That is, atomic variables can be initialized in common headers as well.

The core atomic operations are free functions that take pointers to atomic types. Programmers use the same syntax for these operations in both C and C++. That is, a header included from both C and C++ can provide inline functions that operate on atomic types.

The proposal defines the core functions as overloaded functions in C++ and as type-generic macros in C. This approach helps programmers avoid changing code when an atomic type changes size.

Because free functions are occasionally clumsy, the proposal additionally provides member operators and member functions so that C++ programmers may use a more concise syntax.

Memory Ordering

Synchronization operations in the memory model may be either acquire or release operations, or both. These operations govern the communication of non-atomic stores between threads. A release operation ensures that prior memory operations become visible to a thread performing subsequent acquire operation on the same object.

Rather than have atomic operations implicitly provide both acquire and release semantics, we choose to complicate the interface by adding explicit ordering specifications to various operations. Many comparable packages do not, and instead provide only a single version of operations, like compare-and-swap, which implicitly include a full memory fence.

Unfortunately, the extra ordering constraint introduced by the single version is almost never completely necessary. For example, an atomic increment operation may be used simply to count the number of times a function is called, as in a profiler. This requires atomicity, but no ordering constraints. And on many architectures (PowerPC, ARM, Itanium, Alpha, though not X86), the extra ordering constraints are at least moderately expensive.

The proposal defines an enumeration that enables detailed specification of the memory order for every operation.

Consistency

A strict interpretation of acquire and release yields a fairly weak consistency model, which allows threads to have a different notion of the order of writes. For stronger consistency, this proposal distinguishes between an operation with acquire and release semantics and an operation with sequentially consistent semantics. See N2177, Sequential Consistency for Atomics.

The member operators provide no mechanism for specifying memory order enumeration, and so they have only the strongest synchronization. This default is not an undue burden because using weaker synchronization requires considerable thought. Any extra syntactic burden is dwarfed by the semantic burden.

Program auditors can search for uses of the memory order enumeration to identify code that uses synchronization models that are weaker than sequentially consistent.

Fences

It is also unclear how a convention requiring full global memory fences would properly interact with an interface that supported simple atomic loads and stores. Here a full memory fence would generally multiply the cost by a large factor. (The current gcc interface does not seem to support simple atomic loads and stores explicitly, which makes it unclear how to support e.g. lock-based emulation, or architectures on which the relevant loads and stores are not implicitly atomic.)

There are two possible approaches to specifying ordering constraints:

Have the programmer provide explicit memory fences/barriers, perhaps most usefully in a way that is analogous to the SPARC membar instructions.
Associate the ordering semantics with operations. The closest hardware analog for this is probably Itanium, though we carry this through more consistently.

Both approaches appear to have their merits. We chose the latter for reasons described in N2176.

Some architectures provide fences that are limited to loads or stores. We have, so far, not included them, since it seems to be hard to find cases in which both:

Such limited ordering constraints are useful and not excessively brittle.
They actually result in a performance benefit over the more general constraint.

However, we have provided per-variable fence operations, which are semantically modeled as read-modify-write operations. Such fences enable programmers to conditionalize memory synchronization, which can substantially improve performance. See N2153 A simple and efficient memory model for weakly-ordered architectures for motivating examples and the Lazy Initialization example from N2427 for a use of the proposed syntax.

We expect that implementations that have hardware fences will use such operations to implement the language fences. See N2362 Converting Memory Fences to N2324 Form, for a discussion of the issues and approaches to converting from existing practice of global fences to the per-variable fences in this proposal.

Dependent Memory Order

Most architectures provide additional ordering guarantees if one memory operation is dependent on another. In fact, these are critical for efficient implementation of languages like Java.

For C++, there is near-universal agreement that it would be nice to have some such guarantees. The fundamental issues are:

Compilers may remove or change data and/or control dependencies.
Detailed guarantees vary across architectures.

See papers N2492 C++ Data-Dependency Ordering: Atomics and Memory Model and N2493 C++ Dependency Ordering: Function Annotation for exploration of these issues.

Lock-Free Property

In some cases, both the decision to use a lock-free algorithm, and sometimes the choice of lock-free algorithm, depends on the availability of underlying hardware primitives. In other cases, e.g. when dealing with asynchronous signals, it may be important to know that operations like compare-and-swap are really lock-free, because a lock-based emulation might result in deadlock.

The proposal defines feature queries to determine whether or not operations are lock-free. We provide two kinds of feature queries, compile-time preprocessor macros and run-time functions. To facilitate optimal storage use, the proposal supplies feature macros that indicates general lock-free status of integral and address atomic types. The run-time function provide per-object information.

The proposal provides run-time lock-free query functions rather than compile-time constants because subsequent implementations of a platform may upgrade locking operations with lock-free operations, so it is common for systems to abstract such facilities behind dynamic libraries, and we wish to leave that possibility open. Furthermore, we recommend that implementations without hardware atomic support use that technique.

The proposal provides lock-free query functions on individual objects rather than whole types to permit unavoidably misaligned atomic variables without penalizing the performance of aligned atomic variables.

Because consistent use of operations requires that all operations on a type must use the same protocol, all operations are lock-free or none of them are. That is, the lock-free property applies to whole objects, not individual operations.

To facilitate inter-process communication via shared memory, it is our intent that lock-free operations also be address-free. That is, atomic operations on the same memory location via two different addresses will communicate atomically. The implementation shall not depend on any per-process state. While such a definition is beyond the scope of the standard, a clear statement of our intent will enable a portable expression of class of a programs already extant.

Flag Type and Operations

The proposal includes a very simple atomic flag type, providing two primary operations, test-and-set and clear. This type is the minimum hardware-implemented type needed to conform to this standard. The remaining types can be emulated with the atomic flag, though with less than ideal properties. Few programmers should be using this type.

We considered adding a wait operation in this proposal, but ultimately rejected it because pure busy waiting has pathological performance characteristics on multi-programmed machines and because doing better requires interaction with the operating system scheduler, which seems inappropriate to a processor-level facility.

Integral and Address Types

The proposal includes a full set of integral atomic types and an address atomic type.

This proposal includes atomic integral types smaller than a machine word, even though many architectures do not have such operations. For machines that implement a word-based compare-and-swap operation, the effect of operations can be achieved by loading the containing word, modifying the sub-word in place, and performing a compare-and-swap on the containing word. In the event that no compare-and-swap is available, the implementation may need to either implement smaller types with larger types or use locking algorithms. Using the same size for the atomic type as its base type eases effort required to port existing code, e.g. code using __sync.

Atomic Operations

The simplest atomic operations are load and store.

There are times when one wishes to store a new value, but also load the old value. An atomic load followed by an atomic store is not an atomic sequence, so we provide an atomic swap operation that does a combined load and store atomically.

We also provide the common fetch-and-modify operations. The fetch-and-modify functions return the original stored value. The original stored value is required for fetch-and-or and fetch-and-and because there is no means to compute to the original stored value from the new value and the modifying argument. In contrast to the functions, the fetch-and-modify assignment operators return the new value. We do this for consistency with normal assignment operators. Unlike normal C++ assignment operators, though, the atomic assignments return values rather than references, which is like C. The reason is that another thread might intervene between an assignment and a subsequent read. Rather than introduce this classic parallel programming bug, we return a value.

The normal signed integral addition operations have undefined behavior in the event of an overflow. For atomic variables, this undefined behavior introduces significant burden because there is in general no way to pre-test for overflow. Rather than leave overflow undefined, we recognize the de facto behavior of modern systems and define atomic fetch-and-add (-subtract) to use two's-complement arithmetic. We are aware of no implementation of C or C++ for which this definition is a problem.

The compare-and-swap operation seems to be the minimum general-purpose synchronization primitive. It appears to be both necessary and sufficient for most interesting lock-free algorithms. The compare-and-swap operations may fail spuriously. That is, the stored value may be known to be equal to the expected value, and yet the operation fails to swap. We permit this failure for two reasons. First, it appears unavoidable for load-locked store-conditional implementations. Second, hardware memory switches may prefer to fail some operations to ensure overall machine performance.

The compare-and-swap operations replace the expected value with the found value in the event that the found value does not match the expected value. The intent of the design is to set up a compare-and-swap loop for the next iteration. The reason for this design is that:

Recovering from spurious failure requires a loop.
Setting up the loop for the next iteration requires the least syntax in user code.
The more comprehensive operation subsumes the two hardware variants of compare and swap. Some architectures (e.g. SPARC) return the mismatched value in hardware, and it is more efficient to pass that back up rather than artificially introduce another atomic load. Architectures that don't provide the mismatched value must do a subsequent load. Likewise some architectures do not set conditions (e.g. SPARC) and those will need to insert a comparison.
Compilers will be able to examine the compilation context and remove any unnecessary overhead for the relatively rare cases where a loop is not needed.

Unlike other operations, the compare-and-swap operations have two memory synchronization order parameters. The first is for operations that succeed; the second is for those that fail. The reason for this pair of specifications is that there are several circumstances in which the failure case can have substantially weaker, and substantially cheaper, synchronization. However, for sequential consistency, full synchronization is required.

Template Types

In C++, atomic type templates provide significant benefits:

They ease type-safe use of atomic pointers.
They ease the process of making atomic version of user-defined types.
They enable generic algorithms, such as lock-free pointer-based queues.

The atomic type template has a partial specialization for pointers (derived from the atomic address type), and full specializations for integral types (derived from the atomic integrals).

The primary problem with an atomic template is that effective use of machine operations requires that their parameter types are trivially copy-assignable and bitwise equality-comparable.

Furthermore, parameter types that are not also statically initializable and trivially destructible may be difficult to use.

In the present language, there is no mechanism to require these properties of a type parameter. Roland Schwarz suggests using a template union to enforce POD type parameters. Unfortunately, that approach also prevents the derivation of specializations of atomic for the types above, which is unacceptable. Furthermore, WG21 has accepted a proposal (N2544) to remove POD restrictions on types of union members. We believe that concepts are a more appropriate mechanism to enforce this restriction, and so we have proposed the concepts and concept maps necessary for safe use of atomic templates. The burden on programmers using atomic templates is to declare that types passed as template parameters are in fact bitwise equality-comparable. The proposed mechanisms will infer trivially copy-assignable types.

The intent is that vendors will specialize a fully-general locking implementation of an atomic template with implementations using hardware primitives when those primitives are applicable. This specialization can be accomplished by defining a base template with the size and alignment of the parameter type as additional template parameters, and then specializing on those two arguments.

Implementability

We believe that there is ample evidence for implementability.

The Intel/gcc __sync intrinsics provide evidence for compiler implementability of the proposed interface.

(We do not advocate standardizing these intrinsics as is. They provide far less control over memory ordering than we advocated above. For example, they provide no way to atomically increment a counter without imposing unnecessary ordering constraints. The lack of appropriate ordering control appears to already have resulted in implementation shortcuts, e.g. gcc does not implement __sync_synchronize() as a full memory barrier on X86, in spite of the documentation. We believe a number of issues were not fully understood when that design was developed, and it could greatly benefit from another iteration at this stage.)

Other packages, particularly Boehm's atomic_ops package provide evidence of efficient implementability over a range of architectures.

The remaining implementation issue is the burden on implementers to produce a minimally conforming implementation. The minimum hardware support required is the atomic test-and-set and clear operations that form the basis of the atomic flag type. The original proposal includes an example implementation based on that minimal hardware support, and thus shows the vendor work required.

Non-terminating loops

It is generally felt that it is important to allow the transformation of potentially non-terminating loops (e.g. by merging two loops that iterate over the same potentially infinite set, or by eliminating a side-effect-free loop), even when that may not otherwise be justified in the case in which the first loop never terminates.

Existing compilers commonly assume that code immediately following a loop is executed if and only if code immediately preceding a loop is executed. This assumption is clearly invalid if the loop fails to terminate. Even if we wanted to prohibit this behavior, it is unclear that all relevant compilers could comply in a reasonable amount of time. The assumption appears both pervasive and hard to test for.

The treatment of non-terminating loops in the current standard is very unclear. We believe that some implementations already eliminate potentially non-terminating, side-effect-free, loops, probably based on 1.9p9, which appears to impose very weak requirements on conforming implementations for non-terminating programs. We had previously arrived at a tentative conclusion that non-terminating loops were already sufficiently weakly specified that no changes were needed. We no longer believe this, for the following reasons:

On closer inspection, it is at best unclear that this reasoning would continue to apply in a world in which the program may terminate even if one of the threads does not.
In the presence of threads, the elimination of certain side-effect-free potentially infinite loops (e.g. while (!please_self_destruct.load_acquire()) {}; self_destruct()) is clearly hazardous, and a bit more clarity seems appropriate.

Changes to the Standard

The definition of "memory location"

Add a new subclause to clause 3:

memory location

either an object of scalar type, or a maximal sequence of adjacent bit-fields all having non-zero width.

Note: Two threads of execution can update and access separate memory locations without interfering with each other.

Note: A bit-field and an adjacent non-bit-field are in separate memory locations, and therefore can be concurrently updated by two threads of execution without interference. The same applies to two bit-fields, if one is declared inside a nested struct declaration and the other is not, or if the two are separated by a zero-length bit-field declaration, or if they are separated by a non-bit-field declaration. It is not safe to concurrently update two bit-fields in the same struct if all fields between them are also bit-fields, no matter what the sizes of those intervening bit-fields happen to be.

Example: A structure declared as struct {char a; int b:5, c:11, :0, d:8; struct {int ee:8;} e;} contains four separate memory locations: The field a, and bit-fields d and e.ee are each separate memory locations, and can be modified concurrently without interfering with each other. The bit-fields b and c together constitute the fourth memory location. The bit-fields b and c cannot be concurrently modified, but b and a, for example, can be.

Multi-threaded executions and data races

Insert a new section after 5.1.2.3, titled "Multi-threaded executions and data races".

5.1.2.4p1:

Under a hosted implementation, a C program can have more than one thread of execution (a.k.a. thread) running concurrently. The execution of each thread proceeds as defined by the remainder of this standard. The execution of the entire program consists of an execution of all of its threads. [Note: Usually the execution can be viewed as an interleaving of all its threads. However some kinds of atomic operations, for example, allow executions inconsistent with a simple interleaving, as described below. —end note ] Under a freestanding implementation, it is implementation-defined whether a program can have more than one thread of execution.

5.1.2.4p2:

The value of an object visible to a thread T at a particular point might be the initial value of the object, a value assigned to the object by T, or a value assigned to the object by another thread, according to the rules below. [Note: In some cases, there may instead be undefined behavior. Much of this section is motivated by the desire to support atomic operations with explicit and detailed visibility constraints. However, it also implicitly supports a simpler view for more restricted programs. —end note ]

5.1.2.4p3:

Two expression evaluations conflict if one of them modifies a memory location and the other one accesses or modifies the same memory location.

5.1.2.4p4:

The library defines a number of atomic operations («ATM») and operations on locks (???) that are specially identified as synchronization operations. These operations play a special role in making assignments in one thread visible to another. A synchronization operation is either an acquire operation or a release operation, or both, on one or more memory locations; the semantics of these are described below. In addition, there are relaxed atomic operations, which are not synchronization operations, and atomic read-modify-write operations, which have special characteristics, also described below. [Note: For example, a call that acquires a lock will perform an acquire operation on the locations comprising the lock. Correspondingly, a call that releases the same lock will perform a release operation on those same locations. Informally, performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform an acquire operation on A. We do not include "relaxed" atomic operations as "synchronization" operations though, like synchronization operations, they cannot contribute to data races. —end note ]

5.1.2.4p5:

All modifications to a particular atomic object M occur in some particular total order, called the modification order of M. If A and B are modifications of an atomic object M, and A happens before B, then A shall precede B in the modification order of M, which is defined below. [Note: This states that the modification orders must respect "happens before". —end note ] [Note: There is a separate order for each scalar object. There is no requirement that these can be combined into a single total order for all objects. In general this will be impossible since different threads may observe modifications to different variables in inconsistent orders. —end note ]

5.1.2.4p6:

A release sequence on an atomic object M is a maximal contiguous sub-sequence of side effects in the modification order of M, where the first operation is a release, and every subsequent operation

is performed by the same thread that performed the release, or

is a non-relaxed atomic read-modify-write operation.

5.1.2.4p7:

An evaluation A that performs a release operation on an object M synchronizes with an evaluation B that performs an acquire operation on M and reads a value written by any side effect in the release sequence headed by A. [Note: Except in the specified cases, reading a later value does not necessarily ensure visibility as described below. Such a requirement would sometimes interfere with efficient implementation. —end note ] [Note: The specifications of the synchronization operations define when one reads the value written by another. For atomic variables, the definition is clear. All operations on a given lock occur in a single total order. Each lock acquisition "reads the value written" by the last lock release. —end note ]

5.1.2.4p8:

An evaluation A happens before an evaluation B if:

A is sequenced before B or

A synchronizes with B; or

for some evaluation X, A happens before X and X happens before B.

5.1.2.4p9:

A visible side effect A on an object M with respect to a value computation B of M satisfies the conditions:

A happens before B, and

there is no other side effect X to M such that A happens before X and X happens before B.

The value of a non-atomic scalar object M, as determined by evaluation B, shall be the value stored by the visible side effect A. [ Note: If there is ambiguity about which side effect to a non-atomic object is visible, then there is a data race, and the behavior is undefined. —end note ] [ Note: This states that operations on ordinary variables are not visibly reordered. This is not actually detectable without data races, but it is necessary to ensure that data races, as defined here, and with suitable restrictions on the use of atomics, correspond to data races in a simple interleaved (sequentially consistent) execution. —end note ]

5.1.2.4p10:

The visible sequence of side effects on an atomic object M, with respect to a value computation B of M, is a maximal contiguous sub-sequence of side effects in the modification order of M, where the first side effect is visible with respect to B, and for every subsequent side effect, it is not the case that B happens before it. The value of an atomic object M, as determined by evaluation B, shall be the value stored by some operation in the visible sequence of M with respect to B. Furthermore, if a value computation A of an atomic object M happens before a value computation B of M, and the value computed by A corresponds to the value stored by side effect X, then the value computed by B shall either equal the value computed by A, or be the value stored by side effect Y, where Y follows X in the modification order of M. [Note: This effectively disallows compiler reordering of atomic operations to a single object, even if both operations are "relaxed" loads. By doing so, we effectively make the "cache coherence" guarantee provided by most hardware available to C atomic operations.—end note ] [Note: The visible sequence depends on the "happens before" relation, which depends on the values observed by loads of atomics, which we are restricting here. The intended reading is that there must exist an association of atomic loads with modifications they observe that, together with suitably chosen modification orders and the "happens before" relation derived as described above, satisfy the resulting constraints as imposed here. -- end note.]

5.1.2.4p11:

The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior. [Note: It can be shown that programs that correctly use simple locks to prevent all data races, and use no other synchronization operations, behave as though the executions of their constituent threads were simply interleaved, with each observed value of an object being the last value assigned in that interleaving. This is normally referred to as "sequential consistency". However, this applies only to race-free programs, and race-free programs cannot observe most program transformations that do not change single-threaded program semantics. In fact, most single-threaded program transformations continue to be allowed, since any program that behaves differently as a result must perform an undefined operation. —end note ]

5.1.2.4p12:

[Note: Compiler transformations that introduce assignments to a potentially shared memory location that would not be modified by the abstract machine are generally precluded by this standard, since such an assignment might overwrite another assignment by a different thread in cases in which an abstract machine execution would not have encountered a data race. This includes implementations of data member assignment that overwrite adjacent members in separate memory locations. We also generally preclude reordering of atomic loads in cases in which the atomics in question may alias, since this may violate the "visible sequence" rules. —end note ]

5.1.2.4p13:

[Note: Transformations that introduce a speculative read of a potentially shared memory location may not preserve the semantics of the C program as defined in this standard, since they potentially introduce a data race. However, they are typically valid in the context of an optimizing compiler that targets a specific machine with well-defined semantics for data races. They would be invalid for a hypothetical machine that is not tolerant of races or provides hardware race detection. —end note ]

Non-terminating loops

Add a new paragraph as 6.8.5p6:

An iteration statement that

performs no I/O operations, and

does not access or modify volatile objects, and

performs no synchronization or atomic operations

in its body, controlling expression, or, in the case of a for statement, its expression-3, may be assumed by the implementation to terminate. [Note: This is intended to allow compiler transformations, such as removal of empty loops, even when termination cannot be proven. —end note]

«ATM» Atomics `<stdatomic.h>`

Add a new library subclause with the following paragraphs. «ATM» will be used to refer to the number of this new subclause.

This clause describes components for fine-grained atomic access. This access is provided via operations on atomic objects. [Footnote: Atomic objects are neither active nor radioactive. —end footnote]

The following subclauses describe atomics requirements, and components for types and operations, as summarized below.

Atomics library summary
Subclause Header(s)

«ATM».1 Order and consistency <stdatomic.h>

«ATM».2 Lock-free property

«ATM».3 Atomic flag type and operations

«ATM».4 Atomic integral and address types

«ATM».5 Operations on atomic integral and address types

Atomics library summary
Subclause	Header(s)
«ATM».1 Order and consistency	`<stdatomic.h>`
«ATM».2 Lock-free property
«ATM».3 Atomic flag type and operations
«ATM».4 Atomic integral and address types
«ATM».5 Operations on atomic integral and address types