A Formal Model of Linux-Kernel Memory Ordering (Part 1) [LWN.net]

April 5, 2017

This article was contributed by Jade Alglave, Paul E. McKenney, Alan Stern, Luc Maranget, and Andrea Parri

Introduction

It has been said that Documentation/memory-barriers.txt can be used to frighten small children, and perhaps this is true. But even if it is true, it is woefully inefficient. After all, there is a huge number of children in this world, so a correspondingly huge amount of time and effort would be required in order to read it to them all.

This situation clearly calls for automated tooling, which is now available in prototype form. This tool takes short fragments of concurrent C code as input, and exhaustively analyzes the possible results. In other words, instead of perusing memory-barriers.txt to find the answer to a memory-ordering question, you can get your answer by writing a (very!) small test case and feeding it to the tool, at least for test cases within the tool's current capabilities and limitations. This article gives an introduction to the tool, describing how to use it and how it works.

To the best of our knowledge, this tool is the first realistic automated representation of Linux-kernel memory ordering, and is also the first to incorporate RCU ordering properties.

This article is organized as follows, with the intended audience for each section in parentheses:

Why Formal Memory Models? (all).
Guiding Principles (all).
Causality and Ordering (people interested in using memory-ordering tools).

This is followed by the inevitable answers to the quick quizzes.

Those wishing to dive directly into the strong model will find it here. (Yes, there is also a weak model, but it is described in terms of how it differs from the strong model. So you should start with the strong model.)

Why Formal Memory Models?

Even before Linux, kernel hacking has tended to involve more intuition and less formal methods. Formal methods can nevertheless be useful for providing definite answers to difficult questions.

For example, how many different behaviors can a computer program exhibit? Particularly one that uses only values in memory, with no user input or output? Computers being the deterministic automata they are, most people would say only one, and for uniprocessor systems they would be basically correct. But multiprocessor systems can give rise to a much wider range of behaviors, owing to subtle variations in the relative timing of the processors and the signals transmitted among them, their caches, and main memory. Memory models try to bring some order to the picture, first and foremost by characterizing exactly which outcomes are possible for a Symmetric Multiprocessor (SMP) system running a certain (small!) program.

Even better, a formal memory model enables tools to automatically analyze small programs, as described here and here. However, those tools are specialized to specific CPU families. For analyzing the Linux kernel, what we need is a tool targeted at a higher level, one that will be applicable to every CPU architecture supported by the kernel.

Formal memory models require extreme precision, far beyond what the informal discussion in memory-barriers.txt can possibly provide. To bridge this gap in the best way possible, we have formulated the guiding principles listed in the following section.

Guiding Principles

Our memory model is highly constrained because it must match the kernel's behavior (or intended behavior!). However, there are numerous choices to be made, so we formulated the following principles to guide those choices:

Strength Preferred to Weakness.
Simplicity Preferred to Complexity.
Support Existing Non-Buggy Linux-Kernel Code.
Be Compatible with Hardware Supported by the Linux Kernel.
Support Future Hardware, Within Reason.
Be Compatible with the C11 Memory Model, Where Prudent and Reasonable.
Expose Questions and Areas of Uncertainty.

Strength Preferred to Weakness

When all else is equal, a stronger memory model is clearly better, but this raises the question of what is meant by “stronger”. For our purposes, one memory model is considered to be stronger than another if it rules out a larger set of behaviors. Thus, the weakest possible memory model is one that would allow a program to behave in any way at all (as exemplified by the “undefined behavior” so common in programming-language standards), whereas the strongest possible memory model is one that says no program can ever do anything. Of course, neither of these extremes is appropriate for the Linux kernel, or for much of anything else.

The strongest memory model typically considered is sequential consistency (SC), and the weakest is release consistency process consistency (RC_pc). SC prohibits any and all reordering, so that all processes agree on some global order of all processes' accesses, which is theoretically appealing but expensive, so much so that no mainstream microprocessor provides SC by default. In contrast, RC_pc is fairly close to the memory models we propose for the Linux kernel, courtesy of the Alpha, ARM, Itanium, MIPS, and PowerPC hardware that the Linux kernel supports.

On the other hand, we don't want to go overboard. Although strength is preferred over weakness as a general rule, small increases in strength are not worth order-of-magnitude increases in complexity.

Simplicity Preferred to Complexity

Simpler is clearly better; however, simplicity will always be a subjective notion. A formal-methods expert might prefer a model with a more elegant definition, while a kernel hacker might prefer the model that best matched his or her intuition. Nevertheless, simplicity remains a useful decision criterion. For example, assuming all else is equal, a model with a simpler definition that better matched the typical kernel hacker's intuition would clearly be preferred over a complex counterintuitive model.

Support Existing Non-Buggy Linux-Kernel Code

The memory model must support existing non-buggy code in the Linux kernel. However, our model (in its current form) is rather limited in scope. Because it is not intended to be a replacement for either hardware emulators or production compilers, it does not support:

Any number of compiler optimizations. For example, our model currently does not account for compiler optimizations that hoist identical stores from both branches of an “if” statement to precede that “if” statement. (On the other hand, the model also does not cover normal variable access, instead requiring at least READ_ONCE() or WRITE_ONCE(), each of which greatly limits the compiler's ability to optimize. This restriction is therefore less of a problem than it might at first appear.)
Arithmetic. Not even integer arithmetic!
Multiple access sizes.
Partially overlapping accesses.
Nontrivial data, including arrays and structures. However, trivial linked lists are possible.
Dynamic memory allocation.
Complete modeling of read-modify-write atomic operations. Currently, only atomic exchange is supported.
Locking, though some subset of the Linux kernel's numerous locking primitives is likely be added to a future version. In the meantime, locking may be emulated using atomic exchange.
Exceptions and interrupts.
I/O, including DMA.
Self-modifying code, as found in the kernel's alternative mechanism, function tracer, Berkeley Packet Filter JIT compiler, and module loader.
Complete modeling of read-copy update (RCU). For example, we currently exclude asynchronous grace-period primitives such as call_rcu() and rcu_barrier(). However, we believe that this work includes the first comprehensive formal model of the interaction between RCU reader and synchronous grace periods with memory accesses and memory-ordering primitives.

Quick Quiz 1: But my code contains simple unadorned accesses to shared variables! So what possible use is this memory model to me?
Answer

As always, adding more detail and functionality to the model slows it down, so the goal is therefore to balance the needs for speed and for functionality. The current model is a starting point, and we hope to incorporate additional functionality over time. We also hope that others will incorporate this memory model into their tools.

Be Compatible with Hardware Supported by the Linux Kernel

The memory model must be compatible with the hardware that the Linux kernel runs on. Although the memory model can be (and is) looser than any given instance of hardware, it absolutely must not be more strict. In other words, the memory model must in some sense provide the least common denominator of the guarantees of all memory models of all CPU families that run the Linux kernel. This requirement is ameliorated, to some extent, by the ability of the compiler and the Linux kernel to mask hardware weaknesses. For example:

The Alpha port of the Linux kernel provides memory-barrier instructions as needed to compensate for the fact that Alpha does not respect read-to-read address dependencies.
The Itanium port of gcc emits ld.acq for volatile loads and st.rel for volatile stores, which compensates for the fact that Itanium does not guarantee read-to-read ordering for normal loads from the same variable.

Nevertheless, the memory model must be sufficiently weak that it does not rule out behaviors exhibited by any of the CPU architectures the Linux kernel has been ported to. Different CPU families can have quite divergent properties, so that each of Alpha, ARM, Itanium, MIPS, and PowerPC required special attention at some point or another. In addition, hardware memory models are subject to change over time, as are the use cases within the Linux kernel. The Linux-kernel memory model must therefore evolve over time to accommodate these changes, which means that the version presented in this paper should be considered to be an initial draft rather than as being set in stone. It seems likely that this memory model will have the same rate of change as does Documentation/memory-barriers.txt.

Providing compatibility with all the SMP systems supporting Linux is one of the biggest memory-model challenges, especially given that some systems' memory models have not yet been fully defined and documented. In each case, we have had to take our best guess based on:

Existing documentation.
Consultation with those hardware architects willing to consult.
Formal memory models, for those systems having them.
Experiments on real hardware, for those systems we have access to. In at least one case, this might someday involve a computer museum.

Thankfully, this situation has been improving. For example, although formal memory models have been available for quite some time (such as here [PDF]), tools that apply memory models to litmus tests have only appeared much more recently. We most certainly hope that this trend towards more accessible and better-defined memory models continues, but in the meantime we will continue to work with whatever is available.

Support Future Hardware, Within Reason

The memory model should support future hardware, within reason. Linux-kernel ports to new hardware must supply their own code for the various memory barriers, and might one day also need to supply their own code for similar common-code primitives. But since common code is valuable, an architecture wishing to supply its own code for (say) READ_ONCE() will need a very good reason for doing so.

This proposal assumes that future hardware will not deviate too far from current practice. For example, if you are porting Linux to a quantum supercomputer, the memory model is likely to be the least of your worries.

Be Compatible with the C11 Memory Model, Where Prudent and Reasonable

Where prudent and reasonable, the model should be compatible with the existing C and C++ memory models. However, there are a couple areas where it is necessary to depart from these memory models:

The smp_mb() full memory barrier is stronger than that of C and C++. But let's face it, smp_mb() was there first, and there is a lot of code in the kernel that might be adapted to smp_mb()'s current semantics.
The Linux kernel's value-returning read-modify-write atomics feature ordering properties that are not found in their C/C++ counterparts.
The smp_mb__before_atomic(), smp_mb__after_atomic(), and smp_mb__after_unlock_lock() barrier-amplification APIs have no counterparts in the C/C++ API.
The smp_read_barrier_depends() macro does not have a direct equivalent in the C/C++ memory model.
The Linux-kernel notion of control dependency does not exist in C/C++. However, control dependencies are an important example of instruction ordering, so the memory model must account for them.
The Linux-kernel notion of RCU grace periods does not exist in C/C++. (However, the RCU-related proposals P0190R3, P0461R1, P0462R1, P0561R0, and P0566R0 are being considered by the committee.)

On the positive side, the Linux kernel has recently been adding functionality that is closer to that of C and C++ atomics, with the ongoing move from ACCESS_ONCE() to READ_ONCE() and WRITE_ONCE() being one example and the addition of smp_load_acquire() and smp_store_release() being another.

Expose Questions and Areas of Uncertainty

Defining a memory model inevitably uncovers interesting questions and areas of uncertainty. For example:

The Linux-kernel memory model is more strict than that of C11. It is useful to flag the differences in order to alert people who might otherwise be tempted to rely solely on C11. It is also quite possible that some of the Linux-kernel strictness is strictly historical, in which case it might (or might not!) be worth considering matching C11 semantics for those specific situations.
Release-acquire chains are required to provide ordering to those tasks participating in the chain. Failure to provide such ordering would have many problematic consequences, not least being that locking would not work correctly. For tasks external to the chain, ordering cannot be provided for a write preceding the first release and a read following the last acquire due to hardware limitations. For example, if one process writes to variable x while holding a lock and a later critical section for that same lock reads from variable y, the read of y might execute before the write of x has propagated to an unrelated process not holding the lock.
It turns out that release-acquire chains can be implemented using READ_ONCE() instead of smp_load_acquire(). (However, substituting WRITE_ONCE() for smp_store_release() does not work on all architectures.) Should the model require the use of smp_load_acquire()?
Some architectures can “erase” writes, so that ordering specific to a given write might not apply to a later write to that same variable by that same task, even though coherence ordering would normally order the two writes. This can give rise to bizarre results, such as the possible outcomes of a code sequence depending on the code that follows it. (However, such results appear to be restricted to litmus tests that can best be described as “rather strange”.)
One interesting corner case of hardware memory models is that weak barriers (i.e., smp_wmb()) suffice to provide transitive orderings when all accesses are writes. However, we were unable to come up with reasonable use cases, and furthermore, the things that looked most reasonable proved to be attractive nuisances. Should the memory model nevertheless provide ordering in this case? (If you know of some reason why this ordering should be respected by the memory models, please don't keep it a secret!)

In a perfect world, we would resolve each and every area of uncertainty, then produce a single model reflecting full knowledge of all the hardware that the Linux kernel supports. However, astute readers might have noticed that the world is imperfect. Furthermore, rock-solid certainties can suddenly be cast into doubt, either with the addition of an important new architecture or with the uncovering of a misunderstanding or an error in documentation of some existing architecture. It will therefore be sometimes necessary for the Linux kernel memory model to say “maybe”.

Unfortunately, existing software tools are unable to say “maybe” in response to a litmus test. We therefore constructed not one but two formal models, one strong and the other less strong. These two models will disagree in “maybe” cases. Kernel hackers should feel comfortable relying on ordering only in cases where both models agree that ordering should be provided, and hardware architects should feel the need to provide strong ordering unless both models agree that strong ordering need not be provided. (Currently these models are still very much under development, so it is still unwise to trust either model all that much.)

Causality and Ordering

Causality is an important property of memory models, in part because causality looms large in most peoples' intuitive understanding of concurrent code. However, causality is a very generic term, lacking the precision required for a formal memory model. In this article we will therefore use the terms “causality” and “causal relationship” quite sparingly, instead defining precise terms that will be used directly within the memory model. But a brief discussion now will help illuminate the topic and will introduce some important relationships between causality, ordering, and memory models.

Causality is simply the principle that a cause happens before its effect, not after. It is therefore a statement about ordering of events in time. Let's start with the simplest and most direct example. If CPU A writes a value to a shared variable in memory, and CPU B reads that value back from the shared variable, then A's write must execute before B's read. This truly is an example of a cause-and-effect relation; the only way B can possibly know the value stored by A is to receive some sort of message sent directly or indirectly by A (for example, a cache-coherence protocol message). Messages take time to propagate from one CPU or cache to another, and they cannot be received before they have been sent. (In theory, B could guess the value of A's write, act on that guess, check the guess once the write message arrived, and if the guess was wrong, cancel any actions that were inconsistent with the actual value written. Nevertheless, B could not be entirely certain that its guess is correct until the message arrives—and our memory models assume that CPUs do not engage in this sort of guessing, at least not unless they completely hide its effects from the software they are running.) On the other hand, if B does not read the value stored by A but rather an earlier value, then there need not be any particular temporal relation between A's write and B's read. B's read could have executed either before or after A's write, as long as it executed before the write message reached B. In fact, on some architectures, the read could return the old value even if it executed a short time after the message's arrival! A fortiori, there would be no cause-and-effect relation.

Another example of ordering also involves the propagation of writes from one CPU to another. If CPU A writes to two shared variables, these writes need not propagate to CPU B in the same order as the writes were executed. In some architectures it is entirely possible for B to receive the messages conveying the new values in the opposite order. In fact, it is even possible for the writes to propagate to CPU B in one order and to CPU C in the other order. The only portable way for the programmer to enforce write propagation in the order given by the program is to use appropriate memory barriers or barrier-like constructs, such as smp_mb(), smp_store_release(), or C11 non-relaxed atomic operations.

A third example of ordering involves events occurring entirely within a single CPU. Modern CPUs can and do reorder instructions, executing them in an order different from the order they occur in the instruction stream. There are architectural limits to this sort of thing, of course. Perhaps the most pertinent for memory models is the general principle that a CPU cannot execute an instruction before it knows what that instruction is supposed to do. For example, consider the statement “x = y;”. To carry out this statement, a CPU must first load the value of y from memory and then store that value to x. It cannot execute the store before the load; if it tried then it would not know what value to store. This is an example of a data dependency. There are also address dependencies (for example, “a[n] = 3;” where the value of n must be loaded before the CPU can know where to store the value 3). Finally, there are control dependencies (for example, “if (i == 0) y = 5;” where the value of i must be loaded before the CPU can know whether to store anything into y). In the general case where no dependency is present, however, the only portable way for the programmer to force instructions to be executed in the order given by the program is to use appropriate memory barriers or barrier-like constructs.

Finally, at a higher level of abstraction, source code statements can be reordered or even eliminated entirely by an optimizing compiler. We won't discuss this very much here; memory-barriers.txt contains a number of examples demonstrating the sort of shenanigans a compiler can get up to when translating a program from source code to object code.

Acknowledgments

We owe thanks to H. Peter Anvin, Will Deacon, Andy Glew, Derek Williams, Leonid Yegoshin, and Peter Zijlstra for their patient explanations of their respective systems' memory models. We are indebted to Peter Sewell, Susmit Sarkar, and their groups for their seminal work formalizing many of these same memory models. We all owe thanks to Dmitry Vyukov, Boqun Feng, and Peter Zijlstra for their help making this human-readable. We are also grateful to Michelle Rankin and Jim Wasko for their support of this effort.

This work represents the views of the authors and does not necessarily represent the views of University College London, INRIA Paris, Scuola Superiore Sant'Anna, Harvard University, or IBM Corporation.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service marks of others.

Answers to Quick Quizzes

Quick Quiz 1: But my code contains simple unadorned accesses to shared variables! So what possible use is this memory model to me?

Answer: You are of course free to use simple unadorned accesses to shared variables in your code, but you are then required to make sure that the compiler isn't going to trip you up—as has always been the case. Once you have made sure that the compiler won't trip you up, simply translate those accesses to use READ_ONCE() and WRITE_ONCE() when using the model. Of course, if your code gives the compiler the freedom to rearrange your memory accesses, you may need multiple translations, one for each possible rearrangement.

Back to Quick Quiz 1.