Why HTM Transaction Size Limits?

Why are there size limits for hardware transactional memory (HTM)?

Here are a few possible reasons:

  1. HTM implementations typically use caches and speculative execution to make each transaction appear atomic with respect to other transactions. This means that cache geometry limits the maximum size of a transaction: A transaction that cannot fit into cache cannot complete successfully. Note that cache associativity can be just as limiting as the cache's total size, because a transaction that touches more cache lines mapping to a given set than that set has ways cannot fit, regardless of its total footprint. Although a transaction whose size only slightly exceeds the cache's associativity has an extremely high probability of success (its accesses are unlikely to all land in the same set), success cannot be guaranteed, and the probability of success decreases with increasing transaction size. Furthermore, if a given CPU reacts to cache SRAM failure by decreasing associativity, then it is this decreased associativity that limits transactions' cache footprints. (The first sketch following this list shows how such capacity aborts manifest on one HTM implementation.)
  2. Because current HTM implementations cannot tolerate interrupts or exceptions within a transaction, there cannot be a page fault within a transaction, which further means that the translation lookaside buffer (TLB) limits transaction size. As with the CPU caches, the TLB's associativity is the limiting factor, not necessarily its overall size.
  3. Because current HTM implementations cannot tolerate interrupts or exceptions within a transaction, the expected duration of a given transaction must be significantly shorter than the expected time between interrupts. A transaction that runs longer than the interval between successive interrupts will always be interrupted, and therefore can never complete successfully.
  4. Because current HTM implementations cannot tolerate interrupts or exceptions within a transaction, breakpoints and single-step exceptions will abort the enclosing transaction (note that load-linked/store-conditional (LL/SC) sequences have similar problems). This means that normal debuggers do not work within transactions, so that debugging considerations strongly favor small transactions.
  5. Current HTM implementations do not handle debugging printf() statements gracefully: Unbuffered debugging print statements abort the transaction, while buffered debugging print statements are of no help if the transaction aborts, because the buffered output is rolled back along with the rest of the transaction's state. Furthermore, buffering the output increases the size of the transaction, which in turn increases the probability of abort due to transaction-size limitations. (The second sketch following this list illustrates both points.)
  6. Conflict probabilities increase with increasing transaction size: A larger transaction touches more data, and typically also runs longer, making it more likely to overlap some other transaction in both space and time. If transactions' conflict probabilities are too high, both performance and scalability will suffer.
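
To make the first of these points concrete, here is a minimal sketch using Intel's RTM interface, which is one concrete HTM implementation (compile with gcc -mrtm and run on RTM-capable hardware). It attempts a transaction whose footprint is nlines cache lines and reports how the hardware classified any abort; try_transaction(), the buffer size, and the 64-byte line size are illustrative assumptions:

    #include <immintrin.h>
    #include <stdio.h>

    static char buf[1 << 24];   /* 16 MB: larger than typical L1/L2 caches */

    /* Touch nlines cache lines inside one transaction; return 0 on commit. */
    int try_transaction(size_t nlines)
    {
        unsigned int status;

        if (nlines * 64 > sizeof(buf))
            return -1;
        status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            for (size_t i = 0; i < nlines; i++)
                buf[i * 64] = 1;        /* one store per 64-byte line */
            _xend();
            return 0;                   /* committed */
        }

        /* The status word hints at why the hardware aborted us. */
        if (status & _XABORT_CAPACITY)
            printf("capacity abort: footprint exceeded cache/TLB geometry\n");
        else if (status & _XABORT_CONFLICT)
            printf("conflict abort: another thread touched our footprint\n");
        else
            printf("other abort (interrupt, fault, ...): status %#x\n", status);
        return 1;
    }

As nlines grows past the cache geometry, the capacity aborts shift from merely probable to certain, which is the size limit in action.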
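
And to make the printf() point concrete, here is a sketch of buffered debug output under the same RTM assumptions; traced_update() and the buffer size are hypothetical, and snprintf()'s own footprint might itself provoke aborts on some implementations:

    #include <immintrin.h>
    #include <stdio.h>

    static __thread char dbg[128];      /* per-thread debug buffer */

    int traced_update(int *counter)
    {
        unsigned int status = _xbegin();

        if (status == _XBEGIN_STARTED) {
            (*counter)++;
            /* An unbuffered printf() here would enter the kernel and
             * abort the transaction, so buffer instead.  Note that the
             * buffer write enlarges the transaction's footprint. */
            snprintf(dbg, sizeof(dbg), "counter now %d\n", *counter);
            _xend();
            fputs(dbg, stderr);         /* safe: the transaction committed */
            return 0;
        }
        /* On abort, dbg was rolled back along with everything else, so
         * it tells us nothing about the failed attempt. */
        return -1;
    }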

HTM implementations based on unbounded transactional memory (UTM) might eventually offer significantly larger transaction size limits, though some form of associativity limitation would likely still be in force. Use of high-associativity victim caches could help alleviate this limitation.

Debugging support might be provided via emulators, but the low performance of typical emulators is likely to be a significant problem for a number of workloads. Alternatively, although software transactional memory (STM) could be used while debugging, there are subtle differences between HTM and STM that could prove problematic in some cases. For but one example, consider a program that uses both locking and transactions, running on a lock-based STM implementation. Testing on STM could result in false-positive deadlocks involving the locks used by the STM implementation; these deadlocks would not occur while running on the HTM.
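
As a minimal sketch of such a false-positive deadlock, assume a toy lock-based STM in which every transaction serializes on a single internal mutex; the stm_begin()/stm_end() names are hypothetical stand-ins, not a real STM API:

    #include <pthread.h>

    static pthread_mutex_t stm_internal_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t user_lock = PTHREAD_MUTEX_INITIALIZER;

    static void stm_begin(void) { pthread_mutex_lock(&stm_internal_lock); }
    static void stm_end(void) { pthread_mutex_unlock(&stm_internal_lock); }

    void *thread_a(void *arg)
    {
        pthread_mutex_lock(&user_lock);
        stm_begin();                    /* blocks if B holds the STM lock */
        /* ... transactional accesses ... */
        stm_end();
        pthread_mutex_unlock(&user_lock);
        return NULL;
    }

    void *thread_b(void *arg)
    {
        stm_begin();                    /* holds the STM's internal lock */
        pthread_mutex_lock(&user_lock); /* blocks if A holds user_lock */
        /* ... transactional accesses ... */
        pthread_mutex_unlock(&user_lock);
        stm_end();
        return NULL;
    }

If A holds user_lock while B holds the STM's internal lock, each waits forever on the other. Under HTM, thread_b's transaction acquires no internal lock at all, so this cycle simply cannot form.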

Adaptive tickless kernels might help as well by reducing the frequency of scheduling-clock interrupts. However, this reduction requires that there be only one runnable user thread on a given CPU at any given time, which will not be the case for all workloads. Therefore, although adaptive tickless kernels would greatly increase the probability of HTM transaction success on some workloads (for example, high performance computing (HPC)), they will not be helpful on others. This should be no surprise: To the best of my knowledge, adaptive tickless kernels were not designed with HTM in mind.
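
For example, on a Linux kernel built with CONFIG_NO_HZ_FULL and booted with a parameter such as nohz_full=3, the scheduling-clock tick stops on CPU 3 whenever that CPU runs a single user thread. A sketch of pinning a transaction-heavy thread there follows; the CPU number is an assumption for illustration:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int pin_to_tickless_cpu(int cpu)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        /* With only this thread runnable on the tickless CPU, the tick
         * stops, so long transactions are less likely to be aborted by
         * scheduling-clock interrupts. */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }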

Although we should expect continued HTM innovation, transaction sizes are likely to remain limited. The question is whether the limits will eventually grow large enough that no one cares, and if so, when. In the meantime, it will continue to be very important to combine HTM with other synchronization mechanisms that are less subject to size limitations. Failing to do so will result in HTM techniques that work extremely well on toy problems, but that are subject to embarrassing failures when applied to large real-world applications.
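
The classic form of that combining is lock elision: attempt the update a few times in a hardware transaction, then fall back to plain locking once the hardware indicates that retrying is futile. A minimal sketch under the same RTM assumptions as above; the retry budget and the fallback_held protocol are illustrative choices, not a prescribed recipe:

    #include <immintrin.h>
    #include <pthread.h>

    #define MAX_TX_RETRIES 3

    static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;
    static volatile int fallback_held;  /* lock state visible to transactions */
    static long shared_counter;         /* stand-in for real shared data */

    void critical_update(void)
    {
        for (int i = 0; i < MAX_TX_RETRIES; i++) {
            unsigned int status = _xbegin();

            if (status == _XBEGIN_STARTED) {
                if (fallback_held)      /* adds the flag to our read set... */
                    _xabort(0xff);      /* ...so a fallback holder aborts us */
                shared_counter++;       /* the real critical-section work */
                _xend();
                return;
            }
            if (!(status & _XABORT_RETRY))
                break;                  /* e.g., capacity abort: retry is futile */
        }

        /* Fallback: plain locking, which is not subject to size limits. */
        pthread_mutex_lock(&fallback_lock);
        fallback_held = 1;
        shared_counter++;
        fallback_held = 0;
        pthread_mutex_unlock(&fallback_lock);
    }

Small transactions commit in hardware, while oversized ones take the lock, so the algorithm degrades gracefully rather than failing embarrassingly.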