Rationale for C-Language Dependency Ordering

ISO/IEC JTC1 SC22 WG14 N1461 – 2010-05-09

Paul E. McKenney, paulmck@linux.vnet.ibm.com

Scope

This document lays out a rationale for the memory model, including dependency ordering. There are many parallel programming models, and different applications will be best served by different programming models. As always, use the right tool for the job at hand. This document's focus is on the case where the right tool for the job is the multithreaded shared-memory programming model, as this is the programming model that stresses the memory model, which is the topic at hand.

Alternative Approaches

Even within the confines of the multithreaded shared-memory programming model, there are a number of synchronization mechanisms and a number of ways of using them. Locking is perhaps the best known and has perhaps the longest history of successful use. A very natural question is therefore “why not restrict the memory model to the minimum required to support locking?”

The first thing to keep in mind is that a great many projects that ostensibly use only simple locking in fact use very sophisticated techniques on critical code paths. In most cases, these techniques rely on non-standard extensions, and in too many cases, they are vulnerable to any number of compiler and hardware optimizations. A major goal of the memory model is therefore to provide a standard basis for such techniques, in order to avoid accidental but subtle breakage of critical code paths.

Of course, it is possible to achieve quite a bit within the confines of simple locking. Unfortunately, so-called “simple” locking can be surprisingly complex because it is not possible to protect a dynamically allocated data element using a lock located inside of that element. The reason for this restriction can be seen by considering situations involving concurrent access to and deletion of that data element. For more detail on this issue, see Section 5 of Gamsa's 1999 paper, where it is called “existence guarantees.” Some form of existence guarantee is required to attain decent scalability and performance, even on multicore systems of moderate size.

Production software has used a number of ways of providing existence guarantees without resorting to dependency ordering, but all of them are problematic in one way or another:

  1. Fixed-length arrays of type-invariant elements
  2. Global lock
  3. Global hashed array of locks
  4. Partition data across threads
  5. Global reference counter
  6. Global hashed array of reference counters
  7. Per-thread reference counters
  8. Per-thread pair of reference counters
  9. Garbage collectors

Each of these approaches is described in one of the following sections.

Fixed-length arrays of type-invariant elements

One way of sidestepping the need to provide existence guarantees is to avoid deallocation. The traditional method of accomplishing this without leaking memory is to provide fixed-length arrays for each type of data element, and this in fact was the approach taken by Sequent Computer Systems throughout the 1980s and the early 1990s. The fatal flaw with this approach is that it requires that each system be painstakingly configured, which leaves such systems woefully vulnerable to changes in workload.

There are nevertheless some situations in which preconfigured fixed-length arrays are the right tool for the job, for example, in small fixed-function devices with predictable workloads. However, this approach is simply no longer practical for any sort of general-purpose computing system.

Global lock

This approach uses a global lock to guard acquiring a reference to the data element in question. This same global lock must then be held when deleting this same data element.

In many cases, use of a global lock will result in problematic levels of lock contention, even on modest systems by 2010 standards. Use of a global lock also introduces other problems that are discussed in the next section.
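
For concreteness, the following minimal sketch shows the global-lock approach using POSIX threads and an intrusive linked list. All names here are invented for illustration; a newly created element is assumed to start with a reference count of one, representing the list's own reference.

    #include <pthread.h>
    #include <stdlib.h>

    struct elem {
        struct elem *next;
        int refcnt;                 /* Protected by global_lock; starts at 1
                                     * for the list's own reference. */
        /* ... payload ... */
    };

    static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct elem *head;       /* Also protected by global_lock. */

    /* Drop one reference; caller holds global_lock.  Frees on last drop. */
    static void release_ref_locked(struct elem *ep)
    {
        if (--ep->refcnt == 0)
            free(ep);
    }

    /* Acquire a reference to ep, but only if it is still in the list. */
    struct elem *acquire_ref(struct elem *ep)
    {
        struct elem *p;

        pthread_mutex_lock(&global_lock);
        for (p = head; p != NULL; p = p->next)
            if (p == ep) {
                p->refcnt++;
                break;
            }
        pthread_mutex_unlock(&global_lock);
        return p;                   /* NULL if ep was concurrently deleted. */
    }

    void release_ref(struct elem *ep)
    {
        pthread_mutex_lock(&global_lock);
        release_ref_locked(ep);
        pthread_mutex_unlock(&global_lock);
    }

    /* Unlink ep and drop the list's reference; the last holder frees it. */
    void delete_elem(struct elem *ep)
    {
        struct elem **pp;

        pthread_mutex_lock(&global_lock);
        for (pp = &head; *pp != NULL; pp = &(*pp)->next)
            if (*pp == ep) {
                *pp = ep->next;
                release_ref_locked(ep);
                break;
            }
        pthread_mutex_unlock(&global_lock);
    }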

Global hashed array of locks

When faced with a lock-contention bottleneck such as the one mentioned in the preceding section, the standard parallel-programming technique is to break up the lock in the hope of reducing contention. One way to accomplish this is to replace the single global lock with a fixed array of locks. The address of the data element in question is hashed to obtain the index of the lock to be used to guard acquisition of a reference to that data element. This approach permits lock contention to be reduced to an arbitrarily low level by using an arbitrarily large global array of locks, but there are a number of problems with this approach:

  1. The hashed array of locks introduces another level of locking, which in turn increases the difficulty of avoiding deadlocks.
  2. The random nature of hashing implies that there is poor locality of reference to the locks, which in turn results in poor performance due to cache thrashing of the cache lines containing the data structures representing the locks.
  3. It is not uncommon for data elements to contain other data elements, and these inner data elements are often linked into lists. Code traversing these lists would naturally use the lock corresponding to the address of the inner data structure to guard acquisition of a reference to that inner data structure. Unfortunately, developers writing the code that deallocates the aggregate data element would naturally tend to guard deallocation using the lock corresponding to the address of the overall element, thus failing to exclude traversals of the lists linking the inner data elements, which in turn results in bugs. Although it is possible to work around this problem, it remains a fertile field for complex and subtle bugs.
  4. The pointer through which a reference is being acquired might well change during the time the lock is being acquired. In this case, the lock just acquired might no longer match the pointer, which means that the lock must be released and the correct lock acquired. Murphy's Law implies that this process could repeat indefinitely, which degrades performance and real-time response.

Despite its severe disadvantages, this approach is heavily used in practice. In my opinion, the pain induced by heavy use of this approach is an important reason for the perceived difficulty of shared-memory parallel programming.
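
The following sketch shows the reference-acquisition side of this approach, including the retry loop forced by item 4 above. The hash function, array size, and all names are invented for illustration; note also that the initial unprotected read of the pointer is exactly the kind of access that is vulnerable to the optimizations discussed later in this document.

    #include <pthread.h>
    #include <stdint.h>

    #define NLOCKS 256

    struct elem;                    /* Only pointers to it are used here. */

    static pthread_mutex_t lock_array[NLOCKS];  /* Each initialized with
                                                 * pthread_mutex_init(). */

    static pthread_mutex_t *elem_lock(void *addr)
    {
        return &lock_array[((uintptr_t)addr >> 4) % NLOCKS];
    }

    /* Lock the element that *pp currently references, retrying as long
     * as *pp changes while the corresponding lock is being acquired. */
    struct elem *lock_elem(struct elem **pp)
    {
        struct elem *p;
        pthread_mutex_t *lp;

        for (;;) {
            p = *pp;                /* Unprotected read; revalidated below. */
            if (p == NULL)
                return NULL;
            lp = elem_lock(p);
            pthread_mutex_lock(lp);
            if (p == *pp)
                return p;           /* Lock and pointer match: success. */
            pthread_mutex_unlock(lp);   /* Pointer changed: wrong lock. */
        }
    }

    void unlock_elem(struct elem *p)
    {
        pthread_mutex_unlock(elem_lock(p));
    }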

Partition data across threads

Another solution to lock contention is to partition the data across the threads. Per-thread locks are used to guard access to the corresponding thread's data elements.

This partitioning across threads usually avoids the poor locality of reference, and usually also avoids the confusion as to which lock should be acquired. However, the increased difficulty of avoiding deadlock still remains, and if data elements can migrate among threads, the need to repeatedly acquire locks also remains. In addition, it is not always feasible to partition data across threads.
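
A skeletal illustration of such partitioning, again with invented names: each partition has its own lock and list, and a thread that confines itself to its own partition never contends with the others.

    #include <pthread.h>

    #define NTHREADS 32

    struct elem {
        struct elem *next;
        /* ... payload ... */
    };

    static struct {
        pthread_mutex_t lock;       /* Guards only this partition's list. */
        struct elem *head;
    } part[NTHREADS];

    /* Apply fn to each element of thread t's partition.  Deadlock
     * avoidance stays simple as long as no thread ever holds more
     * than one partition lock at a time. */
    void visit_partition(int t, void (*fn)(struct elem *))
    {
        struct elem *p;

        pthread_mutex_lock(&part[t].lock);
        for (p = part[t].head; p != NULL; p = p->next)
            fn(p);
        pthread_mutex_unlock(&part[t].lock);
    }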

Global reference counter

Yet another solution to lock contention is to avoid locks, for example, using a global reference counter to guard access to the data elements. Deleting a given element requires removing it from the enclosing data structure (so that no new references to it can be obtained), waiting for the reference count to drop to zero, and then freeing the element. Reference counting is heavily used in production code.
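
The following sketch illustrates the scheme using the proposed C1x atomic operations; it assumes a single deleter and glosses over the element-publication ordering issue discussed below. All names are invented.

    #include <stdatomic.h>
    #include <stdlib.h>
    #include <sched.h>

    struct elem {
        struct elem *next;
        /* ... payload ... */
    };

    static atomic_int global_refs;  /* Number of threads accessing data. */
    static struct elem *_Atomic head;

    /* Reader: hold a "reference" for the duration of the traversal. */
    void reader(void (*fn)(struct elem *))
    {
        struct elem *p;

        atomic_fetch_add(&global_refs, 1);
        for (p = atomic_load(&head); p != NULL; p = p->next)
            fn(p);
        atomic_fetch_sub(&global_refs, 1);
    }

    /* Single deleter: unlink first, so that no new references can be
     * obtained, then wait for pre-existing references to drain. */
    void delete_first(void)
    {
        struct elem *p = atomic_load(&head);

        if (p == NULL)
            return;
        atomic_store(&head, p->next);
        while (atomic_load(&global_refs) != 0)
            sched_yield();          /* Wait for readers to drain. */
        free(p);
    }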

Unfortunately, all the atomic operations on the single global reference counter can result in intolerable levels of memory contention, degrading both scalability and performance.

Furthermore, this approach is vulnerable to compiler and hardware optimizations when attempting to acquire a reference to a data element that is being concurrently added to the enclosing data structure. Such optimizations can result in the thread acquiring the reference seeing pre-initialization contents of the data element. One way to solve this problem is to require use of atomic operations on the pointer itself or the fields of the data structure pointed to. Of course, another way to solve this problem is to use dependency ordering, as discussed below.
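
The hazard can be seen in the following sketch, which again uses the proposed C1x atomics with invented names. The writer fully initializes the element before publishing it, and the reader sees a non-NULL pointer, yet without ordering on the reader's load, weakly ordered hardware such as DEC Alpha (or a value-speculating compiler) can deliver the pre-initialization value of the field:

    #include <stdatomic.h>
    #include <stdlib.h>

    struct foo {
        int a;
    };

    static struct foo *_Atomic gp;

    /* Writer: initialize, then publish. */
    void publish(void)
    {
        struct foo *p = malloc(sizeof(*p));

        if (p == NULL)
            abort();
        p->a = 42;                  /* Initialization... */
        atomic_store_explicit(&gp, p, memory_order_release);
    }

    /* Reader: the relaxed load provides no ordering, so the subsequent
     * dereference might return pre-initialization garbage, not 42. */
    int read_it(void)
    {
        struct foo *p = atomic_load_explicit(&gp, memory_order_relaxed);

        if (p != NULL)
            return p->a;            /* Might not see 42! */
        return -1;
    }

Upgrading the reader's load to memory_order_acquire forbids this outcome, as does the dependency-ordered memory_order_consume discussed later in this document.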

Regardless of the solution used, this violation of causality is profoundly counter-intuitive to most developers, even more so than the possibility that different threads might see unrelated operations occur in different orders. I am a case in point. It took the DEC Alpha architects a long time to convince me that DEC Alpha could violate causality in this manner — and even longer for me to convince them that their documentation of this point was insufficient.

In addition, this approach suffers from many of the problems listed in the next section.

Global hashed array of reference counters

The standard approach to solving memory-contention problems is similar to that for lock-contention problems, leading to a hashed array of reference counters. This approach suffers from many of the same problems that plague hashed arrays of locks:

  1. The random nature of hashing implies that there is poor locality of reference to the reference counters, which in turn results in poor performance due to cache thrashing of the cache lines containing the reference counters.
  2. Aggregated data structures can result in confusion as to which reference counter to acquire, similar to the situation with hashed arrays of locks.
  3. Concurrent modifications to the enclosing data structure can result in acquiring a reference to a data element while holding the wrong reference counter, in a manner similar to that for the hashed array of locks. And also as with the hashed array of locks, this issue requires repeatedly attempting to gain access until the reference counter and the pointer match.
  4. If there are many concurrent readers acquiring references to a given data element, the corresponding reference counter might never reach zero, preventing any element hashing to this same reference counter from ever being freed.
  5. As mentioned in the earlier section, this approach is vulnerable to hardware and compiler optimizations that can result in a thread acquiring a reference to a newly added data element and seeing pre-initialization values for that element's fields.

This approach sees some use in practice, and, as with the hashed arrays of locks, is partially responsible for parallel programming's reputation of complexity.
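
For completeness, here is the reference-acquisition side of this approach, structured like the hashed-lock sketch earlier, with invented names and the proposed C1x atomics:

    #include <stdatomic.h>
    #include <stdint.h>

    #define NCOUNTERS 256

    struct elem;                    /* Only pointers to it are used here. */

    static atomic_int refcnt_array[NCOUNTERS];

    static atomic_int *elem_counter(void *addr)
    {
        return &refcnt_array[((uintptr_t)addr >> 4) % NCOUNTERS];
    }

    /* Acquire a reference to whatever *pp references, retrying as long
     * as the pointer changes after the counter is incremented. */
    struct elem *acquire_ref(struct elem *_Atomic *pp)
    {
        struct elem *p;

        for (;;) {
            p = atomic_load(pp);
            if (p == NULL)
                return NULL;
            atomic_fetch_add(elem_counter(p), 1);
            if (p == atomic_load(pp))
                return p;           /* Counter and pointer match. */
            atomic_fetch_sub(elem_counter(p), 1);  /* Mismatch: retry. */
        }
    }

    void release_ref(struct elem *p)
    {
        atomic_fetch_sub(elem_counter(p), 1);
    }

    /* Deleter, after unlinking p: wait for p's hashed counter to drain.
     * Per item 4 above, unrelated readers hashing to the same counter
     * can postpone this wait indefinitely. */
    void wait_for_refs(struct elem *p)
    {
        while (atomic_load(elem_counter(p)) != 0)
            continue;
    }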

Per-thread reference counters

Similar to locking, another solution to the memory contention problem is to partition the data elements across threads, at least in cases where such partitioning is feasible. Again similar to locking, such partitioning usually avoids both the poor locality of reference and the confusion as to which reference should be acquired. However, if data elements can migrate among threads, the need to repeatedly acquire reference counters still remains. And regardless of whether or not data elements can migrate, updaters can potentially be indefinitely postponed while waiting for a reference counter to go to zero.

Per-thread pair of reference counters

The indefinite-postponement issue called out in the previous section can be avoided by using a pair of reference counters, in a manner similar to “generation counters.” The idea is to keep a notion of the “current” member of the pair, so that threads attempting to acquire new references to data elements increment the current member. However, these threads keep track of which counter they incremented, and decrement the same counter that they incremented, regardless of which happens to be current at the time of the decrement.

Threads deleting elements can then remove the data element from the enclosing data structure, switch which counter of the pair is current, and then wait for the other counter to decrease to zero. As long as each thread that increments a reference counter decrements it in finite time, indefinite postponement cannot happen.
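
The following deliberately skeletal sketch conveys the idea using the proposed C1x atomics, with invented names. As the next paragraph cautions, it glosses over real complications, most notably a reader that is delayed between fetching the current index and incrementing its counter:

    #include <stdatomic.h>

    #define NTHREADS 32

    static atomic_long ctr[NTHREADS][2];   /* Per-thread pair of counters. */
    static atomic_int current_idx;         /* Selects the "current" member. */

    /* Reader: increment own thread's current counter, remembering which
     * member was incremented so that the same one can be decremented. */
    int start_access(int tid)
    {
        int i = atomic_load(&current_idx);

        atomic_fetch_add(&ctr[tid][i], 1);
        return i;                   /* Caller passes this to end_access(). */
    }

    void end_access(int tid, int i)
    {
        atomic_fetch_sub(&ctr[tid][i], 1);  /* Same counter as incremented. */
    }

    /* Updater, after unlinking an element: flip "current", then wait for
     * the now-old counters to drain.  New readers use the new member, so
     * this wait cannot be indefinitely postponed.  NOTE: a reader delayed
     * between its load of current_idx and its increment defeats this
     * sketch; real implementations handle that case as well. */
    void wait_for_readers(void)
    {
        int old = atomic_fetch_xor(&current_idx, 1);  /* Flip 0 <-> 1. */
        long sum;
        int t;

        do {
            sum = 0;
            for (t = 0; t < NTHREADS; t++)
                sum += atomic_load(&ctr[t][old]);
        } while (sum != 0);
    }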

Please note that there are complications that must be handled correctly, so please see a real implementation (LGPL license) rather than attempting to implement something on the basis of this skeletal description. However, these complications can be hidden behind a simple API.

Garbage collectors

The preceding discussion of reference counting might well have suggested that garbage collectors could be helpful, and in fact they are. However, garbage collectors help only with removing data elements from the enclosing data structure. In particular, garbage collectors do nothing to prevent compiler and hardware optimizations from engaging in the profoundly counter-intuitive practice of breaking dependency chains, thus permitting a thread accessing a newly inserted data element to see the pre-initialization values of that element's fields.

Furthermore, not all workloads can tolerate a garbage collector. However, other methods, such as RCU, can often be used for these workloads. That said, these other methods must still handle the pre-initialization-values issue.

Shortcomings of Alternative Approaches

The main shortcoming remaining in the best-of-breed approaches (garbage collectors and RCU) is the lack of dependency ordering.

Although many projects have lived for quite some time without use of dependency ordering, I hope that the problems listed in this section demonstrate why such a life is not necessarily a happy one. Worse yet, many of these projects are likely to be relying on dependency ordering without realizing that hardware and software optimizations can break dependency chains, and perhaps without even realizing that they are relying on dependency ordering to begin with.

This situation motivated the proposals to add dependency ordering to the C standard, most recently in the form of n1444.
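
For reference, here is the n1444 approach applied to the publication example given earlier: the reader tags its load with memory_order_consume, and the data dependency from the pointer to the field carries the required ordering, without the cost of a full acquire operation on weakly ordered hardware.

    #include <stdatomic.h>
    #include <stdlib.h>

    struct foo {
        int a;
    };

    static struct foo *_Atomic gp;

    void publish(void)
    {
        struct foo *p = malloc(sizeof(*p));

        if (p == NULL)
            abort();
        p->a = 42;
        atomic_store_explicit(&gp, p, memory_order_release);
    }

    int read_it(void)
    {
        /* The dependency from p to p->a now carries the ordering, so a
         * reader seeing non-NULL p is guaranteed to see p->a == 42. */
        struct foo *p = atomic_load_explicit(&gp, memory_order_consume);

        if (p != NULL)
            return p->a;
        return -1;
    }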

Of course, as noted earlier, practitioners should use the right tool for the job. For a great many jobs, the right tool will continue to be single-threaded programs.

Objections to Dependency Ordering

There have been a number of objections to dependency ordering. This section discusses these objections.

  1. Complexity of compiler implementation, particularly for implementations whose users will not avail themselves of much multithreading. A key goal throughout the development of the memory model was permitting simple implementations, a goal that I firmly believe has been met. In particular, the trivial implementation of dependency ordering simply promotes memory_order_consume to memory_order_acquire. Such implementations will suffer some performance degradation and will unnecessarily prohibit some types of code-motion optimizations, but this is nevertheless a valid design point. (A sketch of this trivial promotion follows this list.) Implementations supporting the gcc-style "memory" clobber on the __asm directive can gain similar simplicity with somewhat smaller performance degradation.
  2. Lack of user experience with this technique. There is in fact ample experience stretching over decades. The first publication of this technique in a garbage-collected environment appeared in 1980. More recently, this technique has seen heavy use over a period of more than seven years in the Linux kernel, and within the past few years it has made its way into user applications as well.
  3. Lack of compiler experience with this technique. Compilers have had just as much experience with this technique as have developers. What has changed is compiler writers' willingness to apply aggressive optimizations that break programmers' intuitions. I am quite happy to trust you when you assert that these optimizations are critically needed, provided that you supply the tools needed to get my job done despite their presence.
  4. Compilers should not engage in optimizations with such counter-intuitive results in the first place. I have nothing but sympathy for this line of argument, as I attempted it myself a few years ago. Needless to say, it failed to gain support. First, developers rely on a very small fraction of the dependency chains in their programs, so a blanket prohibition will unnecessarily prohibit valuable optimizations. Second, real-world code that relies on dependency chains marks those dependency chains in order to enable software correctness-checking tooling. It was therefore quite difficult to argue that the standard should not also require marking of important dependency chains.

In summary, it has been about three decades since the notion of relying on dependency ordering was introduced. It is time to get it into the C standard.