Wednesday, October 17, 2018

Thesis Defense: Practical Concurrency Testing

Ben Blum defended his dissertation work today on Practical Concurrency Testing.  What follows are the notes from that defense.

To prove that a program is correct across arbitrary concurrency.  There are three testing approaches:
unit testing of the most likely, stress testing that is not systematic, and verification that requires separate tools and techniques to describe.

Landslide is a proposed technique that is based on Stateless Model Checking (Godefroid '97), which tests a different execution interleaving on every iteration.  However, the naive interleaving provides O(2^n) states to test.  [Flanagan '05] identified equivalent interleavings and [Musuvathi '08] proposed heuristic orderings to identify the possible bugs faster.  This approach can often require annotations, so adoption requires automated instrumentation.  This space is addressing further concurrency problems such as weak memory models, but hardware transactional memory is still open.

This instrumentation requires preemption points.  Finer-grained finds more bugs, but increases the states to test.  Bugs / failures follow certain cases, such as use-after-free, deadlocks, assertion failures, and invalid memory accesses.  Dynamic data-race analysis can help inform the necessary preemption points.

As a reminder, a data race:
- one or more accesses is write
- threads are not holding the same mutex
- Nor is there other ordering requirements (condition variable, etc)

Quicksand applies this analysis to select different smaller problem spaces using subsets of possible preemption points.  Each subset also represents smaller parts of the larger possible problem space.  If these subsets are all satisfied, then represents a full verification of the program.  Prior work explored using APIs such as mutex_lock/unlock, or using every shared variable access as preemption points.

This tester is deployed in OS courses at CMU, PSU, and U Chicago.  Manual annotation is not viable for students, especially those struggling for whom the traces would be valuable.  That said, students regularly deploy ad-hoc synchronization, such as while (!ready) yield();, requires heuristics as the naive model checking must test every possible count of yielding and its interleaving.

When used by students, about 75% of tested kernels / libraries have identifiable bugs from the testing framework.  For the tested submissions (7 semesters) of students at CMU, there is an improvement in grades, but it is not statistically significant when correcting for the opt-in bias.  Most students are then able to fix their bugs found by the tool.

Hardware transactional memory poses a separate challenge for model checking.  Aborted transactions are observationally equivalent to an immediately failed transaction.  Furthermore, all transactions must be assumed to abortable, as there are many possible causes of aborts.  As prior posts covered, this fact requires that any transaction have a valid abort path.  And this abort path requires most of the verification.

Testing Landslide using hand-written tests, transactional data structures, and a TSX-based spinlock.  Each set of tests has a concurrency or performance bug in the implementations.  What about demonstrating that there are no bugs in implementation?  With 10 hours of CPU time, verification is only possible for small cases on complex code.  That said, practical testing so far only requires <4 preemptions to create the buggy scenario.  There can be other bugs requiring an increasingly complex ordering, but generally those are very rare.

Abstraction reduction [Simsa '13], works to reduce primitives within implementations to verified components, such as mutual exclusion, etc.  Using this technique then allows Landslide to verify the complex HTM implementations at higher thread counts.

In attendance are the recent instructors of Operating Systems and the TAs.


No comments: