Emulating RCU Read-Side Critical Sections with Transactional Memory

In theory, transactional memory (TM) can emulate RCU read-side critical sections quite straightforwardly:

Change all occurrences of rcu_read_lock() into your favorite start-of-transaction primitive.
Change all occurrences of rcu_read_unlock() into your favorite end-of-transaction primitive.
Replace all rcu_dereference() primitives with their arguments.
Rewrite all RCU updates as transactions.

All else being equal, the fewer synchronization mechanisms the better, so if TM can subsume RCU, this would be a good thing. But how well does this theoretical transformation work in practice?

In practice, TM emulation of RCU suffers from some severe shortcomings, including the following:

Hardware transactional memory (HTM) requires special hardware
HTM transaction-size limitations
TM's atomicity requirements
TM's ordering requirements
TM non-deterministic overhead
TM limits non-idempotent operations
TM does not support parallelism within transactions

These shortcomings are described in the following sections, followed by a section describing how TM might overcome some of them.

HTM Requires Special Hardware

Recent years have brought us a few commercial implementations of HTM, for example, Sun's Rock hardware TM implementation and Azul Systems's HTM implementation.

However, these implementations are not commonly available, and, as will be seen in the next section, come with severe constraints. In contrast, RCU requires no special hardware features, and thus operates perfectly well not only on current commodity hardware, but also on decades-old commodity hardware.

HTM's Transaction-Size Limitations

In Sun's Rock hardware TM implementation, transactions must fit within the system's store buffer and must not encounter any of a number of hardware events, such as TLB faults and function returns. Azul's implementation tolerates a greater number of hardware events, but still limits the size of a transaction to that of the CPU's L1 cache.

Some research work puts forward the notion of unbounded transactional memory (UTM), while other work describes a log-based implementation which is expected to support extremely large transactions, but which incurs expensive writes to and reads from a DRAM-based log.

In contrast, RCU read-side critical-section size is not subject to arbitrary limitations, even on pre-existing commodity hardware. Of course, software transactional memory (STM) can also handle arbitrarily large transactions, but STM suffers from some other limitations discussed below.

TM's Atomicity Requirements

Transactions by definition must execute as if they were atomic. This means that although pairs of disjoint transactions can and do execute concurrently, the execution of any pair of conflicting transactions (where both transactions update the same variable or where one transaction reads a variable that the other writes) must be at least partially serialized.

Due to things like aliasing, compile-time detection of conflicts is not always possible. For example, a pair of transactions might both write through a pair of pointers, so that Transaction A writes to *pa1 and *pa2, while Transaction writes to *pb1 and *pb2, both in that order. If the values of pa1 and pb2 are equal, and if the values of pa2 and pb1 are equal, then the transactions conflict. In general, the compiler cannot be expected to be able to determine the values of the pointers, so run-time checks will be required in general. But if both transactions have already written through their first pointer at the time that the conflict is detected, then it will not be possible to serialize the transactions without aborting one or the other of them.

These aborts (or rollbacks) cause delays, raising serious questions about the suitability of TM for real-time applications. In contrast, RCU reader and updater execution can proceed without aborts or rollbacks, even in cases where the equivalent transactions would conflict. Any attempt to subsume RCU into TM in real-time situations must therefore address TM's issues in this area, perhaps by using conservative strategies to determine which transactions may safely exploit RCU's weaker guarantees.

TM's Ordering Requirements

All else being equal, larger transactions have higher conflict probabilities, which has the interesting consequence that support for large transactions is a two-edged sword. However, small transactions are not problem-free, courtesy of the fact that any transaction that observes one effect of some other transaction must observe all effects of that other transaction. This property requires that the start and end of a transaction must have memory-ordering semantics, which will incur significant overhead, even for read-only transactions.

In contrast, there are a number of RCU implementations that do not incur memory-ordering overhead in the rcu_read_lock() and rcu_read_unlock() primitives marking the beginning and end of an RCU read-side critical section. In fact, there are a number of RCU implementations where these primitives incur no overhead whatsoever. Any production-quality attempt to subsume RCU into TM should therefore address TM's relatively high transaction overheads.

TM Non-Deterministic Overhead

Converting RCU read-side critical sections to transactions replaces RCU's deterministic properties with TM's non-deterministic failure/rollback/retry semantics. These TM semantics raise the possibility of spurious HTM transaction failure, for example, due to cache geometry, interrupts, and even the single-step exceptions used by many debuggers. Because failure/rollback/retry probability generally increases with increasing transaction duration and memory footprint, the larger the RCU read-side critical section, the greater the probability of non-deterministic behavior for the corresponding transaction. In some TM implementations, these difficulties can include starvation and livelock.

In contrast, RCU updates do not interfere with RCU read-side critical sections and vice versa. Any attempt to subsume RCU into TM in the general case must therefore address TM's non-deterministic behavior.

TM Limits Non-Idempotent Operations

TM has well-known difficulties with non-idempotent operations such as I/O operations (especially things like RPCs), system calls (especially memory-mapping operations, exec(), and time delays), locking (especially reader-writer locking), and dynamic linking and loading. Although there are TM proposals such as “inevitable transactions” that accommodate non-idempotent operations, these proposals have severe performance and scalability shortcomings.

In contrast, the fact that RCU read-side critical sections do not suffer from aborts, rollbacks, and retries means that non-idempotent operations may be freely included in RCU read-side critical sections. Any attempt to subsume RCU read-side critical sections into TM in the general case must therefore surmount TM's non-idempotency challenges.

TM Does Not Support Parallelism Within Transactions

Although there has been some recent work on providing parallel transactions in certain special cases, TM currently does not support general-purpose parallelism within a single transaction.

In contrast, several RCU implementations support fully general parallel execution within an RCU read-side critical section. For example, in a user-mode RCU implementation, one could do the following:

rcu_read_lock();
pthread_create(...);
do_something();
pthread_join(...);
rcu_read_unlock();

The RCU read-side critical section delimited by the rcu_read_lock() and rcu_read_unlock() primitives protects not just the intervening code, but any thread spawned by and joined to the intervening code. The RCU guarantees also apply recursively, so that any thread spawned by a protected thread and joined to a thread also protected by this same RCU read-side critical section is itself protected. This is not specific to RCU, for example, locking also supports parallelism within critical sections.

If TM is to subsume all possible uses of RCU, then TM will need to provide equally general support of parallelism within a transaction.

TM: Towards True Support for RCU

Transactional memory clearly fails to support a number of RCU's properties. This leads to the question: "what must happen for TM to subsume RCU?"

Some claim that hardware penalties for supporting TM semantics have been decreasing with time and will continue to decrease, so that there will shortly come a time where TM can trivially support all of RCU's properties. It is not clear that this claim will hold up, but it is quite clear that even if it does hold, it addresses only one of the five shortcomings listed above, namely the performance impact of TM's ordering requirements. TM's other shortcomings would still need to be addressed.

Another approach is to recognize situations where TM's heavy-weight guarantees are unnecessary, perhaps permitting the compiler to generate RCU read-side critical sections in place of transactions in such situations. Although this might one day be possible, current TM work requires that the developer recognize and explicitly flag such situations. And indeed, automatic conversion of transactions into RCU appears to be at least as challenging as generating fine-grain-locking from transactions. It also appears that RCU patterns of usage are often quite far removed from transactional modes of thinking, particularly when the RCU read-side critical sections contains updates. It therefore seems that it would be unwise to hold one's breath waiting for such compiler support to appear.

Therefore, transformations from transactions to RCU will probably require hints from the developer, or, at the very least, conformance by the developer to transactional constructs that the compiler can easily recognize. Note that transaction-by-transaction analysis is insufficient: in order to determine whether a given read-only transaction may be converted to RCU, it is also necessary to analyze any other transactions reading or updating the variables read by the given transaction. In short, a profound level of understanding of the whole program is required to apply RCU optimizations to a given transaction. It will be a challenge to provide the compiler with this level of understanding while limiting the developers' burden so as to be consistent with any reasonable definition of “ease of use” on the other.

This situation might best be addressed by allowing the developer to manually specify the RCU read-side critical sections, permitting the compiler to take on the somewhat easier job of issuing diagnostics when the developer fails to adhere to good practice. This approach has the added benefit of permitting the developer to avail himself or herself of the RCU capabilities that TM currently lacks. Of course, this approach does requires solving the problem of interfacing RCU to TM.

Does TM Have Any Redeeming Qualities???

Lest the reader conclude that TM has no possibility of having any redeeming qualities whatsoever, it is important to note that RCU is a specialized mechanism, that it focuses on read-mostly situations, and that some other mechanism (perhaps TM) is required to coordinate RCU updates. Recent work optimizes read-only STM transactions through use of a global sequence lock, and in addition, there is the intriguing possibility that, given manual annotation of accesses within transactions, TM will provide good support for disjoint parallelism, providing many of the benefits of fine-grained locking, but at lower levels of complexity — at least assuming that the annotations are easier to carry out than is fine-grained locking. It also seems possible that HTM will be quite useful for small transactions that may be used to atomically update classic data structures.

So, yes, TM does have redeeming qualities, and might well turn out to be the best tool for some important jobs. However, recent TM work focuses primarily on TM performance, which as noted earlier is but one of the difficulties that TM must overcome in order to subsume RCU.

Concluding Remarks

In short, the statement “we will just subsume RCU into TM” might well make a stirring sound bite, but a very large amount of hard work needs to be done if it is to stand up to close inspection. Although “one synchronization mechanisms to replace them all” might well be a noble goal, it is a goal that TM cannot realistically aspire to any time soon.

Acknowledgments

Many thanks to Dmitriy V'jukov, Jonathan Walpole, and Phil Howard for their insightful comments and discussions.