
Thursday, December 8, 2016

Practical TSX

I previously speculated about how Intel's TSX is implemented; however, I did not have access to any machines supporting TSX until this year.  I still have not done much testing personally, but I did direct two students who explored and measured this support.  As a reminder, TSX is an implementation of hardware transactional memory, which can simplify concurrency designs by avoiding the need for locks.  Being a hardware implementation, it has certain fixed costs and constraints.

Mario Dehesa-Azuara and Nick Stanley completed a six-week project in the spring, and the summary below is taken from their final project report.  As they were students, their report may not be available indefinitely, so this link may be broken at some future date.
First, they reviewed the best practices for writing TSX-enabled code.  In particular, the TSX path and the fallback path (the fallback path is required to ensure that the code can make progress even under repeated aborts) must be mutually exclusive, which can require additional operations compared to a purely transactional approach.

Second, they measured the cost of concurrent data structure updates.  Their work noted that writing a transactional implementation was significantly easier than a fine-grained locking approach.  However, their measurements revealed some counter-intuitive results.  For example, an AVL tree is a self-balancing data structure, and its balanced shape is a benefit in that fewer memory accesses should be required per operation.  Yet the rotations required to maintain this balance actually enlarged the set of locations each transaction accessed and therefore resulted in a high rate of aborts.

To understand this, we must turn to the actual implementation.  We know that TSX can only track a limited number of memory locations (at most, the size of the L1 data cache).  As soon as any transactional memory location (i.e., cache line) can no longer be stored in the L1, the transaction must abort.  Thus, limiting the size of a transaction's read and write sets is vital to completing it.  In Mario and Nick's experiments, they observed that after 5 million insertions into an AVL tree, transactions were at a 50% failure rate (regardless of the tested number of threads).  In contrast, a treap, with its probabilistic balancing, has relatively constant failure rates that depend on the number of threads (and not the total number of insertions).

Third, using TSX has an inherent cost that is significantly higher than other atomic operations.  The standing advice remains that simple atomic updates should use the appropriate atomic instructions.  What if you need to perform several of these operations?  Again, we turn to the final report.  The measurements show that simple transactional operations on consecutive memory locations will be faster than the equivalent atomic operations on those locations once you access at least 6 locations per "transaction".  And if the program must obey an additional constraint, such as updating all of the elements or none of them, then locks or transactions are required regardless of the count.

It is important to remember that a major benefit of transactional memory is in design and implementation effort, not in the actual runtime of the program.

Wednesday, March 27, 2013

Transactional Memory and Intel's TSX

It was some years ago that I sat in the audience and heard AMD present a sketch for how transactional memory (TM) would be implemented in the x64 ISA.  More recently, a fellow student mentioned that Intel has some new extensions entering the x64 ISA for locks, etc.  I'm always interested in better approaches to locking, as locks so often limit performance and scalability.  So let's dig into Intel's TSX and why I actually want to go buy a gadget when it's released.

A programmer can delineate the transactional section with XBEGIN and XEND instructions.  Within the transactional section, all reads and writes are added to a read-set or a write-set accordingly.  The granularity for tracking is a cache line.  If another processor writes to a line in the read-set, or reads or writes a line in the write-set, then the transaction aborts.

Transactions can be nested, but the nest is flattened: an inner transaction does not commit on its XEND, and the nest can only commit once the outermost transaction is complete.  If any transaction in the nest aborts, then the entire nest aborts.  When the number of XBEGINs executed equals the number of XENDs, the entire transaction commits and becomes globally visible.  Transactions can also be explicitly aborted by the XABORT instruction, which enables the code to abort early when it can determine that the transaction will or should fail.

As I understand it, TSX is being built on top of the existing cache coherence mechanisms.  Each cache line gains an additional bit to indicate if it is part of a transaction.  Each memory operation is treated normally between the processor and the coherence hierarchy with several caveats.  If a dirty, transactional block is evicted, then the transaction fails.  If a dirty, transactional block is demand downgraded from modified to shared or invalid, then the transaction fails.  In this case, a new message would indicate that the request to forward the data fails and the request should be satisfied by memory.

If the transaction commits, then the transactional bits are cleared on each cache line.  And the lines operate normally according to existing cache coherence mechanisms.

Wrapping this up, TSX is an ISA extension that almost every program could take advantage of, which makes it appealing to test personally, just like building my own x64 machine back in 2005.