I previously speculated about how Intel's TSX is implemented; however, I did not have access to any machines supporting TSX until this year. I still have not done much testing personally, but I did direct two students who explored and measured this support. As a reminder, TSX is an implementation of hardware transactional memory, which can simplify concurrency designs by avoiding the need for locks. Being a hardware implementation, it has certain fixed costs and constraints.
Mario Dehesa-Azuara and Nick Stanley completed a six-week project in the spring, and the summary below is drawn from their final project report. As they are students, the report may not remain hosted indefinitely, so this link may break at some future date.
First, they reviewed the best practices for writing TSX-enabled code. In particular, the TSX path and the fallback path (which is required to guarantee forward progress, since a transaction can always abort) must be mutually exclusive: the transaction has to detect that some thread is on the fallback path and abort if so. This requires additional operations compared with a purely transactional approach.
Second, they measured the cost of concurrent data structure updates. Their work noted that writing a transactional implementation was significantly easier than a fine-grained locking approach. However, their measurements revealed some counter-intuitive results. For example, an AVL tree is a self-balancing data structure. The self-balanced nature is a benefit in that fewer memory accesses should be required. Yet, the rotations required to maintain this condition actually increased the set of locations accessed and therefore resulted in a high rate of aborts.
To understand this, we must turn to the actual implementation. TSX can only track a limited number of memory locations: at most, what fits in the L1 data cache. As soon as any transactionally accessed cache line can no longer be held in the L1, the transaction must abort. Limiting the size of a transaction's read and write sets is therefore vital to completing it. In Mario and Nick's experiments, after 5 million insertions into an AVL tree, transactions failed at a 50% rate regardless of the number of threads tested. In contrast, a treap, with its probabilistic balancing, showed a relatively constant failure rate that depended on the number of threads rather than on the total number of insertions.
Third, using TSX has an inherent cost that is significantly higher than that of individual atomic operations, so the standard advice still holds: simple atomic updates should use the appropriate atomic instructions. What if you need to perform several of these operations? Again, we turn to the final report. The measurements show that simple transactional operations on consecutive memory locations become faster than the equivalent atomic operations once at least 6 locations are accessed per "transaction". And if the program must obey a stronger constraint, such as updating all of the elements or none of them, then locks or transactions are required regardless of the count.
It is important to remember that a major benefit of transactional memory is in design and implementation effort, not in the actual runtime of the program.