Wednesday, October 17, 2018
Ben Blum defended his dissertation work today on Practical Concurrency Testing. What follows are the notes from that defense.
The goal is to prove that a program is correct across arbitrary concurrent interleavings. There are three common testing approaches:
unit testing, which covers only the most likely interleavings; stress testing, which is not systematic; and formal verification, which requires separate tools and techniques to describe the program.
Landslide is a proposed technique based on stateless model checking (Godefroid '97), which tests a different execution interleaving on every iteration. However, naively enumerating interleavings yields O(2^n) states to test. [Flanagan '05] identified equivalent interleavings, and [Musuvathi '08] proposed heuristic orderings to find the likely bugs faster. These approaches can often require annotations, so adoption requires automated instrumentation. Work in this space is addressing further concurrency problems such as weak memory models, but hardware transactional memory is still open.
This instrumentation requires choosing preemption points. Finer-grained preemption points find more bugs, but increase the number of states to test. Bugs and failures follow certain patterns, such as use-after-free, deadlocks, assertion failures, and invalid memory accesses. Dynamic data-race analysis can help identify the necessary preemption points.
As a reminder, a data race occurs when two threads access the same memory location and:
- at least one of the accesses is a write,
- the threads do not hold the same mutex, and
- no other ordering requirement (condition variable, etc.) separates the accesses.
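To make this concrete, here is a minimal C sketch (the counter and function names are illustrative, not from the talk). The racy version meets all three conditions above; the mutex version breaks the second condition and so has no race:

    #include <pthread.h>
    #include <stdio.h>

    static int counter = 0;                       /* shared location */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *racy(void *arg) {                /* data race: unprotected writes */
        for (int i = 0; i < 100000; i++)
            counter++;                            /* read-modify-write, no mutex */
        return NULL;
    }

    static void *safe(void *arg) {                /* no race: all writers hold the same mutex */
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, racy, NULL);     /* swap in 'safe' to fix the race */
        pthread_create(&b, NULL, racy, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("count = %d (likely < 200000 due to the race)\n", counter);
        return 0;
    }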
Quicksand applies this analysis to select smaller problem spaces using subsets of the possible preemption points. Each subset represents a smaller part of the full problem space, so if every subset is tested to completion, the result is a full verification of the program. Prior work explored using APIs such as mutex_lock/unlock, or every shared variable access, as the preemption points.
This tester is deployed in OS courses at CMU, PSU, and U Chicago. Manual annotation is not viable for students, especially the struggling ones for whom the traces would be most valuable. Furthermore, students regularly deploy ad-hoc synchronization, such as while (!ready) yield();, which requires heuristics, as naive model checking must otherwise test every possible count of yields and their interleavings.
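For illustration, here is roughly what that ad-hoc pattern looks like next to a condition-variable version (a sketch with made-up names; compile with -pthread). The yield loop creates an unbounded number of schedules to explore, while the condition variable gives the checker a single well-defined blocking point:

    #include <pthread.h>
    #include <sched.h>
    #include <stdbool.h>

    /* Ad-hoc: the checker must consider every possible number of spins. */
    volatile bool ready = false;

    void wait_adhoc(void) {
        while (!ready)
            sched_yield();          /* each yield is another interleaving to test */
    }

    /* Condition-variable version: one blocking point, far fewer interleavings. */
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
    bool ready_cv = false;

    void wait_cv(void) {
        pthread_mutex_lock(&m);
        while (!ready_cv)
            pthread_cond_wait(&c, &m);
        pthread_mutex_unlock(&m);
    }

    void signal_cv(void) {
        pthread_mutex_lock(&m);
        ready_cv = true;
        pthread_cond_signal(&c);
        pthread_mutex_unlock(&m);
    }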
When used by students, about 75% of tested kernels / libraries have bugs identifiable by the testing framework. Across seven semesters of tested submissions at CMU, there is an improvement in grades, but it is not statistically significant after correcting for the opt-in bias. Most students are then able to fix the bugs found by the tool.
Hardware transactional memory poses a separate challenge for model checking. Aborted transactions are observationally equivalent to transactions that fail immediately. Furthermore, all transactions must be assumed to be abortable, as there are many possible causes of aborts. As prior posts covered, this fact requires that every transaction have a valid abort path, and this abort path requires most of the verification effort.
Landslide was tested using hand-written tests, transactional data structures, and a TSX-based spinlock, where each set of tests has a concurrency or performance bug in the implementation. What about demonstrating that there are no bugs in an implementation? With 10 hours of CPU time, verification is only possible for small cases of complex code. That said, practical testing so far only requires fewer than 4 preemptions to create the buggy scenario. There can be other bugs requiring increasingly complex orderings, but those are generally very rare.
Abstraction reduction [Simsa '13] reduces primitives within implementations to verified components, such as mutual exclusion, etc. Using this technique then allows Landslide to verify the complex HTM implementations at higher thread counts.
In attendance are the recent instructors of Operating Systems and the TAs.
Thursday, December 8, 2016
Practical TSX
I previously speculated about how Intel's TSX is implemented; however, I did not have access to any machines supporting TSX until this year. I still have not done much testing personally, but I did direct two students who explored and measured this support. As a reminder, TSX is an implementation of hardware transactional memory, which can simplify concurrency designs by avoiding the need for locks. Being a hardware implementation, it has certain fixed costs and constraints.
Mario Dehesa-Azuara and Nick Stanley completed a 6 week project in the spring and the summary below is taken from their final project report. Also, being students, their report may not be available indefinitely, so this link may be broken at some future date.
First, they reviewed the best practices for writing TSX-enabled code. In particular, the TSX path and the fallback path (the fallback path is required to ensure that the code can make progress even with aborts) must be mutually exclusive. This can require additional operations versus a purely transactional approach.
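As a concrete sketch of that requirement (assuming GCC/Clang's RTM intrinsics from immintrin.h, compiled with -mrtm on TSX hardware; the lock and function names are mine, not theirs): the transaction reads the fallback lock word, so any thread taking the real lock conflicts with, and aborts, the transaction.

    #include <immintrin.h>          /* _xbegin, _xend, _xabort, _XBEGIN_STARTED */

    static int fallback_lock = 0;   /* 1 = held by a fallback-path thread */

    void critical_section(void);    /* the protected work, defined elsewhere */

    void enter(void) {
        if (_xbegin() == _XBEGIN_STARTED) {
            if (fallback_lock)      /* subscribe to the lock word: its cache line
                                       joins our read set, so a later lock
                                       acquisition aborts this transaction */
                _xabort(0xff);
            critical_section();
            _xend();
            return;
        }
        /* Fallback path (a real user would retry the transaction a few
           times first): take the actual lock so progress is guaranteed. */
        while (__atomic_exchange_n(&fallback_lock, 1, __ATOMIC_ACQUIRE))
            ;                       /* spin */
        critical_section();
        __atomic_store_n(&fallback_lock, 0, __ATOMIC_RELEASE);
    }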
Second, they measured the cost of concurrent data structure updates. Their work noted that writing a transactional implementation was significantly easier than a fine-grained locking approach. However, their measurements revealed some counter-intuitive results. For example, an AVL tree is a self-balancing data structure. The self-balanced nature is a benefit in that fewer memory accesses should be required. Yet, the rotations required to maintain this condition actually increased the set of locations accessed and therefore resulted in a high rate of aborts.
To understand this, we must turn to the actual implementation. We know that TSX can only track a limited number of memory locations (at most, the size of the L1 data cache). As soon as any transactional memory location (i.e., cache line) cannot be stored in the L1, the transaction must abort. Thus, limiting the size of the read and write sets of the transaction is vital for completing transactions. In Mario's and Nick's experiments, they observed that after 5 million insertions into an AVL tree, transactions were at a 50% failure rate (regardless of the tested number of threads). In contrast, a treap, with its probabilistic balancing, has relatively constant failure rates that depend on the number of threads (and not the total insertions).
Third, using TSX has an inherent cost that is significantly higher than other atomic operations. The standing advice remains that simple atomic updates should use the appropriate atomic instructions. But what if you need to perform several of these operations? Again, we turn to the final report. The measurements show that simple transactional operations on consecutive memory locations will be faster than the equivalent atomic operations on those locations once you access at least 6 locations per "transaction". Furthermore, if the program must obey another constraint, such as updating all or none of the elements, then locks or transactions are required regardless.
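A sketch of the two shapes being compared (again using the RTM intrinsics; N = 6 reflects their measured crossover, which this snippet illustrates rather than proves):

    #include <immintrin.h>

    #define N 6                     /* around the measured crossover point */
    static long counters[N];

    /* N separate atomic read-modify-writes. */
    void update_atomics(void) {
        for (int i = 0; i < N; i++)
            __atomic_fetch_add(&counters[i], 1, __ATOMIC_SEQ_CST);
    }

    /* One transaction covering all N locations: plain stores, one commit,
       and all-or-nothing semantics the loop above lacks. (Fallback path
       omitted for brevity; as above, a real user needs one.) */
    int update_tsx(void) {
        if (_xbegin() == _XBEGIN_STARTED) {
            for (int i = 0; i < N; i++)
                counters[i]++;
            _xend();
            return 1;               /* committed */
        }
        return 0;                   /* aborted; caller retries or falls back */
    }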
It is important to remember that a major benefit of transactional memory is in design and implementation effort, not in the actual runtime of the program.
Wednesday, May 22, 2013
More on Haswell
Since I am already excited about some of the features of the new Haswell family, it was great to see an even deeper look into its microarchitecture. There are the usual improvements, like expanding the instruction window or increasing cache bandwidth. So read for yourselves at Ars Technica.
Wednesday, March 27, 2013
Transactional Memory and Intel's TSX
It was some years ago that I sat in the audience and heard AMD present a sketch of how transactional memory (TM) would be implemented in the x64 ISA. More recently, a fellow student mentioned that Intel has some new extensions entering the x64 ISA for locks, etc. I'm always a fan of properly implemented locks, as locking so often limits performance and scalability. So let's dig into Intel's TSX and why I actually want to go buy a gadget when it's released.
A programmer can delineate the transactional section with XBEGIN and XEND instructions. Within the transactional section, all reads and writes are added to a read-set or a write-set accordingly. The granularity for tracking is a cache line. If another processor makes any request for a line in the write-set, or a write request for a line in the read-set, then the transaction aborts.
Transactions can be semi-nested: a transaction can only commit when the outermost transaction is complete. Internally nested transactions do not commit on XEND, and if any transaction in the nest aborts, then the entire transaction aborts. When the count of XBEGINs equals the count of XENDs, the entire transaction commits and becomes globally visible. Transactions can also be explicitly aborted by the XABORT instruction, which enables the code to abort early when it can determine that the transaction will or should fail.
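In terms of the compiler intrinsics that wrap these instructions, the nesting behavior looks roughly like this (a sketch, assuming the immintrin.h RTM intrinsics and -mrtm; variable names are illustrative):

    #include <immintrin.h>

    static int a, b;

    static void inner(void) {
        if (_xbegin() == _XBEGIN_STARTED) {   /* nested XBEGIN: bumps the nesting
                                                 depth, starts nothing new */
            b = 2;
            _xend();                          /* inner XEND: does not commit yet */
        }
    }

    int outer(void) {
        if (_xbegin() == _XBEGIN_STARTED) {
            a = 1;
            inner();
            _xend();                          /* outermost XEND: a and b become
                                                 globally visible together */
            return 1;
        }
        return 0;                             /* an abort anywhere discards both writes */
    }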
As I understand it, TSX is being built on top of the existing cache coherence mechanisms. Each cache line gains an additional bit to indicate if it is part of a transaction. Each memory operation is treated normally between the processor and the coherence hierarchy with several caveats. If a dirty, transactional block is evicted, then the transaction fails. If a dirty, transactional block is demand downgraded from modified to shared or invalid, then the transaction fails. In this case, a new message would indicate that the request to forward the data fails and the request should be satisfied by memory.
If the transaction commits, then the transactional bits are cleared on each cache line. And the lines operate normally according to existing cache coherence mechanisms.
Wrapping this up, TSX is an ISA extension that almost every program can take advantage of and therefore has an appeal toward conducting personal testing, just like building my own x64 machine back in 2005.
Friday, March 4, 2011
Parallel Programming - First Pass
With my time dominated by preparing to take PhD qualifying exams (i.e., quals), I have been even more slack than usual with regard to preparing regular posts. Nonetheless, let's talk a little about parallel programming. In one respect, the parallel paradigm is the future of computer science, even if I remain highly skeptical about the specifics of what this computing will be. But even if its role in general computing is unclear, the usefulness of parallel computing in specific domains is not in doubt. This post will serve as an overview of several concepts in parallel programming.
First, let's distinguish between concurrent and parallel execution. Concurrent execution has the possibility or potential for executing simultaneously. Parallel execution is when this potential is realized. Concurrent execution is possible with a single core; parallel execution is not.
Synchronization is the main question when writing concurrent code. Synchronization introduces a specific ordering to what was otherwise independent execution. There are two common flavors: exclusion and notification. Exclusion consists of mutexes, spinlocks, and other constructs that guarantee only a single instance of concurrent execution performs a specific set of operations at a time. With notification, concurrent executions establish information with respect to each other, for example that every instance has reached a specific point (e.g., a barrier).
An ongoing quest in synchronization research is transactional memory (TM). TM provides the ability to make a set of memory updates atomically. Processors provide the ability to make simple updates atomic (see Compiler Intrinsics), yet a series of updates requires the explicit exclusion guarantee provided by spinlocks, etc. TM brings the exclusion to the memory addresses themselves, rather than an abstract object protected by the spinlock, and allows an arbitrary set of accesses to be encapsulated in one atomic operation. However, TM is not presently feasible.
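To see the gap TM would fill, compare a single-word update, which an intrinsic handles, with a two-word update, which today needs a lock (made-up account names; __sync_fetch_and_add is the GCC intrinsic):

    #include <pthread.h>

    static long balance;

    void deposit(long amount) {
        __sync_fetch_and_add(&balance, amount);   /* one word: atomic, lock-free */
    }

    /* Two words that must change together: no intrinsic makes the pair
       atomic, so we fall back to a lock -- exactly what TM aims to replace. */
    static long from_acct, to_acct;
    static pthread_mutex_t accounts_lock = PTHREAD_MUTEX_INITIALIZER;

    void transfer(long amount) {
        pthread_mutex_lock(&accounts_lock);
        from_acct -= amount;
        to_acct   += amount;
        pthread_mutex_unlock(&accounts_lock);
    }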
Parallel patterns are formed based on the observation that parallel programs and algorithms can be classified into several distinct groups (i.e., patterns). An assembly line is a parallel operation and fits the "pipelined" pattern. By recognizing the pattern, the programmer can avoid certain common errors. With the pipeline, the programmer recognizes that the data is to be passed through discrete stages, as sketched below.
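Here is a minimal two-stage pipeline in C (illustrative names; compile with -pthread): stage 1 produces items and hands each one through a one-slot buffer to stage 2, so the stages run in parallel on different items.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
    static int slot, full = 0, done = 0;

    static void *stage1(void *arg) {              /* e.g., read / parse input */
        for (int i = 1; i <= 5; i++) {
            pthread_mutex_lock(&m);
            while (full)
                pthread_cond_wait(&cv, &m);
            slot = i * i;                         /* hand the item downstream */
            full = 1;
            pthread_cond_broadcast(&cv);
            pthread_mutex_unlock(&m);
        }
        pthread_mutex_lock(&m);
        done = 1;
        pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&m);
        return NULL;
    }

    static void *stage2(void *arg) {              /* e.g., process / output */
        for (;;) {
            pthread_mutex_lock(&m);
            while (!full && !done)
                pthread_cond_wait(&cv, &m);
            if (!full && done) {                  /* upstream finished, buffer empty */
                pthread_mutex_unlock(&m);
                break;
            }
            int item = slot;
            full = 0;
            pthread_cond_broadcast(&cv);
            pthread_mutex_unlock(&m);
            printf("stage2 got %d\n", item);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, stage1, NULL);
        pthread_create(&t2, NULL, stage2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }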
Well, that's my prelude to what will likely be many more posts on parallel programming.
Monday, December 6, 2010
Conference Time: MICRO-43 Morning Day 1
As of noon, I've spent most of my morning as a volunteer at MICRO; however, I did slip in to hear the keynote and AMD's talk about an ISA extension for supporting transactional memory. Transactional memory (TM) is a technology of which I am deeply skeptical. But given that AMD's talk was from the ISA perspective rather than the HW itself, I gave it a go.
Their approach was to try and find the minimal set of instructions required to provide some meaningful TM functionality for programs. SPECULATE, COMMIT, and LOCK MOV are the three instructions. Furthermore, in providing minimal support, programs are not burdened by unnecessary overhead. For example, SPECULATE begins a speculative region (i.e., transaction), but only checkpoints the instruction and stack pointers.
Within this region, only the memory operations that are made with "LOCK MOV" are given transactional semantics. All other operations are non-speculative and therefore have normal semantics. This enables a speculative region to only incur the overhead from required operations and not from incidental work. And the HW support for transactions is reduced as fewer operations need to be tracked to provide sufficient support for applications.
COMMIT closes the speculative region and clears the transaction bits from the pending memory operations. If there is a conflict during the region, then the HW rolls back to the checkpointed RIP, which lands in a trampoline that determines the appropriate failure behavior (retry, etc.).
This approach had far greater appeal to me than previous works. I think the minimalist approach is favorable from a performance standpoint. This approach also provides greater flexibility to use the TM support for other applications.
For more details see: AMD's ASF or the MICRO 43 proceedings (not yet available)