ISO/IEC JTC1 SC22 WG21 N4374 - 2015-02-06
Paul E. McKenney, email@example.com
This is a revision of N4322, with updates based on subsequent discussions. This revision adds references to litmus tests in userspace RCU, a paragraph stating goals, and a section discussing the relationship between volatile atomics and loop unrolling.
The Linux-kernel memory model is currently defined very informally in the memory-barriers.txt and atomic_ops.txt files in the source tree. Although these two files appear to have been reasonably effective at helping kernel hackers understand what is and is not permitted, they are not necessarily sufficient for deriving the corresponding formal model. This document is a first attempt to bridge this gap.
The hope is that this document will help the C and C++ standard committees understand the existing practice and the constraints from the Linux kernel, and also that it will help the Linux community evaluate which portions of the C11 and C++11 memory models might be useful in the Linux kernel.
Loads from and stores to normal variables should be protected with the ACCESS_ONCE() macro, for example:

    r1 = ACCESS_ONCE(x);
    ACCESS_ONCE(y) = 1;
An ACCESS_ONCE() access may be modeled as a volatile memory_order_relaxed access. However, please note that ACCESS_ONCE() is defined only for properly aligned machine-word-sized variables; applying ACCESS_ONCE() to a large array or structure is unlikely to do anything useful.
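For reference, here is a sketch of the classic volatile-cast definition of ACCESS_ONCE() (illustrative only; the kernel's actual definition may differ in detail):

    /* typeof is a gcc extension, as used throughout the Linux kernel. */
    #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))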
Note that the volatile is absolutely required. To see this, consider that ACCESS_ONCE() is used to prevent accesses to concurrently modified variables from being hoisted out of a loop or out of unrolled instances of a loop.
For example, given this loop:

    while (tmp = atomic_load_explicit(a, memory_order_relaxed))
        do_something_with(tmp);
The compiler would be permitted to unroll it as follows:

    while (tmp = atomic_load_explicit(a, memory_order_relaxed)) {
        do_something_with(tmp);
        do_something_with(tmp);
        do_something_with(tmp);
        do_something_with(tmp);
    }
This would be unacceptable for real-time applications, which need the value to be reloaded from a on each iteration, unrolled or not. The volatile qualifier prevents this transformation.
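For comparison, here is a sketch of the same loop written with a volatile access, assuming for simplicity that a is the shared variable itself rather than a pointer to it. The volatile semantics force a fresh load on every iteration, unrolled or not:

    while ((tmp = ACCESS_ONCE(a)) != 0)    /* one load per iteration, guaranteed */
        do_something_with(tmp);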
Note that the Linux kernel will likely be replacing ACCESS_ONCE() with a pair of APIs, one for loading and the other for storing. This will have the benefit of mapping more directly onto the C and C++ atomic operations.
At one time, gcc guaranteed that properly aligned accesses to machine-word-sized variables would be atomic. Although gcc no longer documents this guarantee, there is still code in the Linux kernel that relies on it. These accesses could be modeled as non-volatile memory_order_relaxed accesses.
smp_store_release() may be modeled as a memory_order_release store, and smp_load_acquire() may be modeled as a memory_order_acquire load, for example:

    r1 = smp_load_acquire(&x);
    smp_store_release(&y, 1);
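A minimal message-passing sketch of this modeling in C11, assuming atomic_int variables x and y, both initially zero:

    #include <stdatomic.h>

    atomic_int x, y;    /* both initially zero */

    void thread1(void)
    {
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        atomic_store_explicit(&y, 1, memory_order_release);   /* models smp_store_release(&y, 1) */
    }

    void thread2(void)
    {
        int r1 = atomic_load_explicit(&y, memory_order_acquire);   /* models r1 = smp_load_acquire(&y) */
        int r2 = atomic_load_explicit(&x, memory_order_relaxed);
        /* If r1 == 1, then r2 must also be 1. */
    }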
Members of the rcu_dereference() family can be modeled as memory_order_consume loads. Members of this family include rcu_dereference(), rcu_dereference_bh(), rcu_dereference_sched(), and srcu_dereference(); however, rcu_dereference() should be representative for litmus-test purposes, at least initially.
rcu_assign_pointer() can be modeled as a memory_order_release store.
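A minimal C11 sketch of this publish-subscribe modeling, assuming a hypothetical struct foo and a global pointer gp:

    #include <stdatomic.h>

    struct foo { int a; };
    _Atomic(struct foo *) gp;    /* hypothetical RCU-protected pointer */

    /* Publisher: models rcu_assign_pointer(gp, p). */
    void publish(struct foo *p)
    {
        p->a = 1;    /* initialization is ordered before the publication */
        atomic_store_explicit(&gp, p, memory_order_release);
    }

    /* Subscriber: models p = rcu_dereference(gp). */
    int subscribe(void)
    {
        struct foo *p = atomic_load_explicit(&gp, memory_order_consume);
        return p ? p->a : -1;    /* the dependent load is ordered after the pointer load */
    }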
The set_mb() function assigns the specified value to the specified variable, then executes a full memory barrier, which is described in the next section. This isn't as strong as a memory_order_seq_cst store because, for example, the following code fragment does not guarantee that the stores to x and y will be ordered:

    smp_store_release(&x, 1);
    set_mb(y, 1);

However, set_mb() provides exactly the ordering required for manipulating task state, which is the job for which it was created.
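For litmus-test purposes, set_mb(y, 1) might be sketched in C11 as a relaxed store followed by a full fence, bearing in mind that, as discussed in the next section, C11 fences only approximate the kernel's full memory barrier:

    atomic_store_explicit(&y, 1, memory_order_relaxed);   /* the assignment */
    atomic_thread_fence(memory_order_seq_cst);            /* approximates the trailing smp_mb() */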
The Linux kernel has a variety of memory barriers:
- barrier(), which can be modeled as an atomic_signal_fence(memory_order_acq_rel).
- smp_mb(), which does not have a direct C11 or C++11 counterpart. On an ARM, PowerPC, or x86 system, it can be modeled as a full memory-barrier instruction (dmb, sync, or mfence, respectively). On an Itanium system, it can be modeled as an mf instruction, but this relies on gcc emitting an ld.acq instruction for an ACCESS_ONCE() load and an st.rel instruction for an ACCESS_ONCE() store.
- smp_rmb(), which can be modeled (overly conservatively) as an atomic_thread_fence(memory_order_acq_rel). One difference is that smp_rmb() need not order prior loads against later stores, or prior stores against later stores. Another difference is that smp_rmb() need not provide any sort of transitivity, having (lack of) transitivity properties similar to ARM's or PowerPC's address/control/data dependencies.
- smp_wmb(), which can be modeled (again overly conservatively) as an atomic_thread_fence(memory_order_acq_rel). One difference is that smp_wmb() need not order prior loads against later stores, nor prior loads against later loads. Similar to smp_rmb(), smp_wmb() need not provide any sort of transitivity.
- smp_read_barrier_depends(), which is a no-op on all architectures other than Alpha. On Alpha, smp_read_barrier_depends() may be modeled as an atomic_thread_fence(memory_order_acq_rel) or as a full memory-barrier instruction.
- smp_mb__before_atomic(), which provides a full memory barrier before the immediately following non-value-returning atomic operation.
- smp_mb__after_atomic(), which provides a full memory barrier after the immediately preceding non-value-returning atomic operation. Both smp_mb__before_atomic() and smp_mb__after_atomic() are described in more detail in the later section on atomic operations.
- smp_mb__after_unlock_lock(), which provides a full memory barrier after the immediately preceding lock operation, but only when paired with a preceding unlock operation by this same thread or a preceding unlock operation on the same lock variable. The use of smp_mb__after_unlock_lock() is described in more detail in the section on locking.
There are some additional memory barriers, including mmiowb(); however, these cover interactions with memory-mapped I/O, and so have no counterpart in C11 and C++11 (which is most likely as it should be for the foreseeable future).
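Collecting the conservative modelings above into one C11 sketch (these definitions are illustrative only; in particular, atomic_thread_fence(memory_order_seq_cst) is still weaker than a real full memory-barrier instruction, and memory_order_acq_rel is stronger than smp_rmb() and smp_wmb() actually require):

    #include <stdatomic.h>

    #define model_barrier()   atomic_signal_fence(memory_order_acq_rel)   /* compiler-only barrier */
    #define model_smp_mb()    atomic_thread_fence(memory_order_seq_cst)   /* approximate: no exact C11 counterpart */
    #define model_smp_rmb()   atomic_thread_fence(memory_order_acq_rel)   /* overly conservative */
    #define model_smp_wmb()   atomic_thread_fence(memory_order_acq_rel)   /* overly conservative */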
Some use cases for these memory barriers may be found in the litmus tests of the userspace RCU library. These tests are written in terms of the userspace RCU API, so drop the leading cmm_ to get the corresponding Linux-kernel primitive. For example, the userspace cmm_smp_mb() translates to the Linux-kernel smp_mb().
The Linux kernel features “roach motel” ordering on its locking primitives: Prior operations can be reordered to follow a later acquire, and subsequent operations can be reordered to precede an earlier release. The CPU is permitted to reorder acquire and release operations in this way, but the compiler is not, as compiler-based reordering could result in deadlock.
Note that a release-acquire pair does not necessarily result in a full memory barrier. To see this, consider the following litmus test, with x and y both initially zero, and locks l1 and l3 both initially held by the threads releasing them:

    Thread 1                Thread 2
    --------                --------
    y = 1;                  x = 1;
    spin_unlock(&l1);       spin_unlock(&l3);
    spin_lock(&l2);         spin_lock(&l4);
    r1 = x;                 r2 = y;

    assert(r1 != 0 || r2 != 0);
In the above litmus test, the assertion can trigger, meaning that an
unlock followed by a lock is not guaranteed to be a full memory barrier.
And this is where smp_mb__after_unlock_lock() comes in:

    Thread 1                       Thread 2
    --------                       --------
    y = 1;                         x = 1;
    spin_unlock(&l1);              spin_unlock(&l3);
    spin_lock(&l2);                spin_lock(&l4);
    smp_mb__after_unlock_lock();   smp_mb__after_unlock_lock();
    r1 = x;                        r2 = y;

    assert(r1 != 0 || r2 != 0);
In contrast, after addition of smp_mb__after_unlock_lock(), the assertion cannot trigger.
The above example showed how smp_mb__after_unlock_lock() can cause an unlock-lock sequence in the same thread to act as a full barrier, but it also applies in cases where one thread unlocks and another thread locks the same lock, as shown below:

    Thread 1             Thread 2                        Thread 3
    --------             --------                        --------
    y = 1;               spin_lock(&l1);                 x = 1;
    spin_unlock(&l1);    smp_mb__after_unlock_lock();    smp_mb();
                         r1 = y;                         r3 = y;
                         r2 = x;

    assert(r1 == 0 || r2 != 0 || r3 != 0);
Without smp_mb__after_unlock_lock(), the above assertion can trigger, and with it, it cannot.
The fact that it can trigger without smp_mb__after_unlock_lock() might seem strange at first glance, but locks are only guaranteed to give sequentially consistent ordering to their critical sections. If you want an observer thread to see the ordering without holding the lock, you need smp_mb__after_unlock_lock().
(Note that there is some possibility that the Linux kernel's memory model will change such that an unlock followed by a lock forms a full memory barrier even without the smp_mb__after_unlock_lock().)
The Linux kernel has an embarrassingly large number of locking primitives, but spin_lock() and spin_unlock() should be representative for litmus-test purposes, at least initially.
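For litmus-test purposes, spin_lock() and spin_unlock() might be sketched in C11 as an acquire exchange loop and a release store, respectively, ignoring the kernel's queueing, preemption, and fairness machinery:

    #include <stdatomic.h>

    typedef atomic_int spinlock_model_t;   /* 0 == free, 1 == held; sketch only */

    static void model_spin_lock(spinlock_model_t *l)
    {
        while (atomic_exchange_explicit(l, 1, memory_order_acquire))
            ;   /* spin until the lock is observed free */
    }

    static void model_spin_unlock(spinlock_model_t *l)
    {
        atomic_store_explicit(l, 0, memory_order_release);
    }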
The Linux kernel's atomic operations fall into three sets: those that are defined on atomic_t, those that are defined on atomic_long_t, and those that are defined on aligned machine-sized variables (for example, xchg() and cmpxchg()). However, in the near term, it should be acceptable to focus on a small subset of these operations.
Variables of type atomic_t may be stored to using atomic_set(), and variables of type atomic_long_t may be stored to using atomic_long_set(). Similarly, variables of these types may be loaded from using atomic_read() and atomic_long_read(), respectively.
The historical definition of these primitives has lacked any
sort of concurrency-safe semantics, so the user is responsible
for ensuring that these primitives are not used concurrently
in a conflicting manner.
That said, many architectures treat atomic_read() and atomic_long_read() as memory_order_relaxed loads, and a few architectures similarly treat atomic_set() and atomic_long_set() as memory_order_relaxed stores. There is therefore some chance that concurrent conflicting accesses will be allowed at some point in the future, at which point their semantics will be those of memory_order_relaxed loads and stores.
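Under that future interpretation, these primitives would be modeled as relaxed C11 accesses, sketched here with atomic_t standing in for C11's atomic_int:

    #include <stdatomic.h>

    /* Sketch: atomic_read()/atomic_set() modeled as relaxed accesses. */
    int model_atomic_read(atomic_int *v)
    {
        return atomic_load_explicit(v, memory_order_relaxed);
    }

    void model_atomic_set(atomic_int *v, int i)
    {
        atomic_store_explicit(v, i, memory_order_relaxed);
    }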
The remaining atomic operations are divided into those that return
a value and those that do not.
The atomic operations that do not return a value are similar to their C11 counterparts, and can be modeled as memory_order_relaxed operations.
However, the Linux-kernel atomic operations that do return a value cannot be
implemented in terms of the C11 atomic operations.
These operations can instead be modeled as memory_order_relaxed operations that are both preceded and followed by the Linux-kernel smp_mb() full memory barrier, which is implemented using the DMB instruction on ARM and the sync instruction on PowerPC.
Note that in the case of the CAS operations, such as cmpxchg(), full barriers are required in both the success and failure cases.
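As a concrete sketch, a value-returning operation such as atomic_add_return() might be modeled as follows, with the caveat noted above that C11 fences only approximate smp_mb():

    #include <stdatomic.h>

    /* Sketch: model of the value-returning atomic_add_return(i, v). */
    int model_atomic_add_return(int i, atomic_int *v)
    {
        atomic_thread_fence(memory_order_seq_cst);   /* approximates the leading smp_mb() */
        int ret = atomic_fetch_add_explicit(v, i, memory_order_relaxed) + i;
        atomic_thread_fence(memory_order_seq_cst);   /* approximates the trailing smp_mb() */
        return ret;
    }

    /* Sketch: the non-value-returning atomic_add(i, v) is simply relaxed. */
    void model_atomic_add(int i, atomic_int *v)
    {
        atomic_fetch_add_explicit(v, i, memory_order_relaxed);
    }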
Strong memory ordering can be added to the non-value-returning atomic operations by placing smp_mb__before_atomic() before and/or smp_mb__after_atomic() after them.
The operations are summarized in the following table.
An initial implementation of a tool could start with a representative subset of these operations.
The rows marked “(Generic)” are type-generic, applying to any aligned machine-word-sized quantity supported by all architectures that the Linux kernel runs on. The set of such types currently comprises those of the size of an int and those of the size of a long.
The “Lock-Barrier Operations” have memory_order_acquire semantics for lock-acquisition primitives such as _atomic_dec_and_lock(), and have memory_order_release semantics for the other primitives.
Otherwise, the usual Linux-kernel rule holds: if no value is returned, memory_order_relaxed semantics apply; otherwise, the operation behaves as if there were an smp_mb() before and after it.
The Linux kernel provides a limited notion of control dependencies, ordering prior loads against control-dependent stores in some cases. Extreme care is required to avoid control-dependency-destroying compiler optimizations. The restrictions applying to control dependencies include the following (see the sketch after this list):
- Control dependencies order a prior load only against a later control-dependent store; they do not order a prior load against a later load. Use smp_load_acquire(), modeled as memory_order_acquire, if you require this behavior.
- The load heading the control dependency must use ACCESS_ONCE(). Similarly, the store at the other end of a control dependency must also use ACCESS_ONCE().
- If both legs of an if statement or all cases of a switch statement store the same value to the same variable, then those stores cannot participate in control-dependency ordering, as the compiler is within its rights to hoist such a store above the conditional branch.
The C and C++ standards do not guarantee any sort of control dependency. Therefore, this list of restrictions is subject to change as compilers become increasingly clever and aggressive.
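The following sketch illustrates both a working control dependency and the same-store restriction above; q, x, and y are hypothetical shared variables:

    q = ACCESS_ONCE(x);
    if (q)
        ACCESS_ONCE(y) = 1;   /* store ordered after the load from x */

    /* In contrast, control-dependency ordering is lost below, because the
     * compiler may hoist the identical store above the conditional: */
    q = ACCESS_ONCE(x);
    if (q)
        ACCESS_ONCE(y) = 1;
    else
        ACCESS_ONCE(y) = 1;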
The publish-subscribe portions of RCU are captured by the combination of rcu_assign_pointer(), which can be modeled as a memory_order_release store, and of the rcu_dereference() family of primitives, which can be modeled as memory_order_consume loads, as was noted earlier.
Grace periods can be modeled as described in Appendix D of “User-Level Implementations of Read-Copy Update”.
There are a number of grace-period primitives in the Linux kernel, but rcu_read_lock(), rcu_read_unlock(), and synchronize_rcu() are good places to start.
The grace-period relationships can be described using the following abstract litmus test:

    Thread 1              Thread 2
    --------              --------
                          rcu_read_lock();
                          S2a;
    S1a;
    synchronize_rcu();
    S1b;
                          S2b;
                          rcu_read_unlock();
If either of S2a or S2b precedes S1a, then both must precede S1b. Conversely, if either of S2a or S2b follows S1b, then both must follow S1a.
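As a concrete sketch of how this guarantee is typically used, consider a hypothetical RCU-protected pointer gp; the updater may free the old structure only after synchronize_rcu() returns. (The structure, pointer, and do_something_with() consumer here are illustrative, not taken from kernel source.)

    struct foo { int a; };
    struct foo *gp;    /* RCU-protected; hypothetical */

    /* Reader (plays the role of Thread 2). */
    void reader(void)
    {
        rcu_read_lock();
        struct foo *p = rcu_dereference(gp);
        if (p)
            do_something_with(p->a);
        rcu_read_unlock();
    }

    /* Updater (plays the role of Thread 1). */
    void updater(void)
    {
        struct foo *old = gp;
        rcu_assign_pointer(gp, NULL);
        synchronize_rcu();    /* waits for all pre-existing readers */
        kfree(old);           /* safe: no reader can still reference old */
    }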
Additional litmus tests may be found in the userspace RCU library. Again, these are written in terms of the userspace RCU API, so drop the leading cmm_ to get the corresponding Linux-kernel primitives.
This document makes a first attempt to present a formalizable description of the Linux-kernel memory model, including variable access, memory barriers, locking operations, atomic operations, control dependencies, and RCU grace-period relationships. The general approach is to reduce the kernel's memory model to aspects of memory models that have already been formalized, in particular those of C11, C++11, ARM, and PowerPC.