Wednesday, October 17, 2018

Thesis Defense: Practical Concurrency Testing

Ben Blum defended his dissertation work today on Practical Concurrency Testing.  What follows are the notes from that defense.

The goal is to show that a program is correct across arbitrary concurrent interleavings.  There are three common testing approaches:
unit testing of the most likely interleavings, stress testing which is not systematic, and formal verification which requires separate tools and techniques to describe the program.

Landslide is a proposed technique based on Stateless Model Checking (Godefroid '97), which tests a different execution interleaving on every iteration.  However, naive enumeration yields O(2^n) interleavings to test.  [Flanagan '05] identified equivalent interleavings, and [Musuvathi '08] proposed heuristic orderings to find possible bugs faster.  This approach can often require annotations, so adoption requires automated instrumentation.  The field is addressing further concurrency problems such as weak memory models, but hardware transactional memory is still open.
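
As a rough illustration (my own toy sketch in C, not Landslide itself), the following explores every schedule of a two-thread, non-atomic increment, replaying the program from its initial state for each complete schedule and checking an invariant at the end.  Even this tiny program has 6 interleavings, 4 of which lose an update.

/* Toy stateless model checker: enumerate every schedule of a two-thread
 * program and check an assertion at the end of each.  The "program" is a
 * non-atomic increment (load, then store) per thread, so some
 * interleavings lose an update. */
#include <stdio.h>

#define STEPS_PER_THREAD 2      /* step 0 = load, step 1 = store */

static int x;                   /* shared variable */
static int reg[2];              /* per-thread register holding the loaded value */
static int pc[2];               /* per-thread program counter */
static int buggy, total;

static void step(int tid) {
    if (pc[tid] == 0)
        reg[tid] = x;           /* load  */
    else
        x = reg[tid] + 1;       /* store */
    pc[tid]++;
}

/* Depth-first exploration of schedules.  "Stateless" means no state
 * snapshots: each complete schedule is replayed from the initial state. */
static void explore(int *sched, int depth) {
    if (depth == 2 * STEPS_PER_THREAD) {
        x = pc[0] = pc[1] = 0;  /* reset to the initial state */
        for (int i = 0; i < depth; i++)
            step(sched[i]);
        total++;
        if (x != 2)
            buggy++;            /* lost update under this interleaving */
        return;
    }
    for (int tid = 0; tid < 2; tid++) {
        int done = 0;
        for (int i = 0; i < depth; i++)
            done += (sched[i] == tid);
        if (done < STEPS_PER_THREAD) {
            sched[depth] = tid;
            explore(sched, depth + 1);
        }
    }
}

int main(void) {
    int sched[2 * STEPS_PER_THREAD];
    explore(sched, 0);
    printf("%d of %d interleavings violate x == 2\n", buggy, total);
    return 0;
}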

This instrumentation requires preemption points.  Finer-grained preemption points find more bugs, but increase the number of states to test.  Bugs and failures follow certain patterns, such as use-after-free, deadlocks, assertion failures, and invalid memory accesses.  Dynamic data-race analysis can help identify the necessary preemption points.
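
For example (a deliberately buggy, hypothetical sketch, not code from the defense), a use-after-free of this kind only manifests if a preemption lands between the NULL check and the dereference, which is why the choice of preemption points matters:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct msg { int value; };

static struct msg *shared;      /* unprotected shared pointer */

static void *reader(void *arg) {
    (void)arg;
    if (shared != NULL) {
        /* If free_msg() runs between the check above and the line below,
         * this dereferences freed memory. */
        printf("%d\n", shared->value);
    }
    return NULL;
}

static void *free_msg(void *arg) {
    (void)arg;
    free(shared);
    shared = NULL;
    return NULL;
}

int main(void) {
    shared = malloc(sizeof *shared);
    shared->value = 42;

    pthread_t a, b;
    pthread_create(&a, NULL, reader, NULL);
    pthread_create(&b, NULL, free_msg, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}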

As a reminder, a data race occurs when two threads access the same memory location and (see the sketch below):
- at least one of the accesses is a write
- the threads are not holding the same mutex
- nor is there any other ordering requirement between them (condition variable, etc.)
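
A minimal sketch of the second condition, which often surprises students: both threads hold a mutex here, just not the same one, and one access is a write, so the accesses to balance still race (hypothetical example, not from the defense).

#include <pthread.h>
#include <stdio.h>

static int balance;                       /* shared */
static pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;

static void *deposit(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m1);
    balance += 100;                       /* write under m1 */
    pthread_mutex_unlock(&m1);
    return NULL;
}

static void *audit(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m2);
    printf("balance = %d\n", balance);    /* read under m2: races with the write */
    pthread_mutex_unlock(&m2);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, deposit, NULL);
    pthread_create(&b, NULL, audit, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}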

Quicksand applies this analysis to split the search into smaller problem spaces using subsets of the possible preemption points.  Each subset represents a smaller part of the larger problem space.  If all of these subsets are satisfied, together they represent a full verification of the program.  Prior work explored using APIs such as mutex_lock/unlock, or every shared variable access, as preemption points.
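
A rough sketch of that idea (my own reading, with placeholder names; run_job stands in for launching a model-checking job): treat each class of preemption points as a bit, enumerate subsets from smallest to largest so that shallow bugs surface quickly, and note that completing the job with every class enabled is what constitutes the full verification.

#include <stdio.h>

#define NUM_CLASSES 4
static const char *class_name[NUM_CLASSES] = {
    "mutex_lock", "mutex_unlock", "data-race site A", "data-race site B"
};

static int popcount(unsigned v) {
    int n = 0;
    for (; v; v &= v - 1)
        n++;
    return n;
}

/* Placeholder: a real job would run the model checker restricted to the
 * given subset of preemption points, and might time out instead. */
static int run_job(unsigned subset) {
    (void)subset;
    return 1;                   /* pretend the state space was exhausted */
}

int main(void) {
    /* Smaller subsets first: they finish sooner and find shallow bugs early. */
    for (int size = 0; size <= NUM_CLASSES; size++) {
        for (unsigned s = 0; s < (1u << NUM_CLASSES); s++) {
            if (popcount(s) != size)
                continue;
            printf("job {");
            for (int i = 0; i < NUM_CLASSES; i++)
                if (s & (1u << i))
                    printf(" %s;", class_name[i]);
            printf(" } -> %s\n", run_job(s) ? "completed" : "timed out");
        }
    }
    return 0;
}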

This tester is deployed in OS courses at CMU, PSU, and U Chicago.  Manual annotation is not viable for students, especially the struggling ones for whom the traces would be most valuable.  Moreover, students regularly deploy ad-hoc synchronization, such as while (!ready) yield();, which requires heuristics, as naive model checking would have to test every possible count of yields and their interleavings.
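
The pattern in minimal form (sched_yield() standing in here for the course's yield()): without special handling, the checker must consider schedules with 0, 1, 2, ... spins of the loop, so the space of interleavings has no natural bound.

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static volatile int ready;      /* ad-hoc flag instead of a condition variable */
static int data;

static void *producer(void *arg) {
    (void)arg;
    data = 42;
    ready = 1;                  /* no mutex, no condition variable */
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    while (!ready)
        sched_yield();          /* spin until the flag flips */
    printf("data = %d\n", data);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}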

When used by students, about 75% of tested kernels / libraries have bugs identifiable by the testing framework.  For the tested submissions (7 semesters of students at CMU), there is an improvement in grades, but it is not statistically significant after correcting for opt-in bias.  Most students are then able to fix the bugs found by the tool.

Hardware transactional memory poses a separate challenge for model checking.  Aborted transactions are observationally equivalent to transactions that fail immediately.  Furthermore, all transactions must be assumed to be abortable, as there are many possible causes of aborts.  As prior posts covered, this fact requires that every transaction have a valid abort path, and it is this abort path that requires most of the verification.
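
For reference, a minimal sketch of that pattern using Intel's RTM intrinsics (compile with -mrtm; my own example, not code from the defense).  The transaction reads the fallback lock so that a concurrent slow-path holder forces an abort, and the abort path is an ordinary spinlock protecting the same data.

#include <immintrin.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int fallback_lock;        /* 0 = free, 1 = held */
static int shared_counter;

static void locked_increment(void) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        /* Fast path: inside the transaction.  Reading the fallback lock
         * here makes the transaction abort if another thread holds it. */
        if (atomic_load(&fallback_lock) != 0)
            _xabort(0xff);
        shared_counter++;
        _xend();                        /* commit */
        return;
    }
    /* Abort path: a plain spinlock protecting the same data.  Every
     * transaction needs this path, since aborts can happen for many
     * reasons (conflicts, capacity, interrupts). */
    while (atomic_exchange(&fallback_lock, 1) != 0)
        ;                               /* spin */
    shared_counter++;
    atomic_store(&fallback_lock, 0);
}

static void *worker(void *arg) {
    (void)arg;
    locked_increment();
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return shared_counter == 2 ? 0 : 1;
}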

Landslide was tested using hand-written tests, transactional data structures, and a TSX-based spinlock.  Each set of tests has a concurrency or performance bug in its implementation.  What about demonstrating that there are no bugs in an implementation?  With 10 hours of CPU time, verification is only possible for small cases of complex code.  That said, practical testing so far has required fewer than 4 preemptions to create the buggy scenario.  There can be other bugs requiring increasingly complex orderings, but those are generally very rare.

Abstraction reduction [Simsa '13] works by reducing primitives within implementations to verified components, such as mutual exclusion.  Using this technique then allows Landslide to verify the complex HTM implementations at higher thread counts.
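
My reading of the idea, as a sketch (not code from the thesis): once the concrete lock below has been verified on its own, code that uses it can be tested against the abstract version instead, so the checker no longer explores interleavings inside lock() and unlock().

#include <stdatomic.h>

#ifdef ABSTRACTION_REDUCTION
/* Abstract mutex: one shared access per operation, standing in for a
 * component already verified in isolation. */
static atomic_flag m = ATOMIC_FLAG_INIT;
static void lock(void)   { while (atomic_flag_test_and_set(&m)) ; }
static void unlock(void) { atomic_flag_clear(&m); }
#else
/* Concrete ticket lock: several shared accesses per operation, each one
 * a potential preemption point for the model checker. */
static atomic_uint next_ticket, now_serving;
static void lock(void) {
    unsigned ticket = atomic_fetch_add(&next_ticket, 1);
    while (atomic_load(&now_serving) != ticket)
        ;                               /* spin */
}
static void unlock(void) { atomic_fetch_add(&now_serving, 1); }
#endif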

In attendance are the recent instructors of Operating Systems and the TAs.


Wednesday, October 10, 2018

NASA Talk: Making Miracles Happen

Dr. Thomas H. Zurbuchen, the Associate Administrator of NASA's Science Mission Directorate, gave a talk today about some key lessons he has learned and observed from his years with NASA, as well as the years before.  The following are my quick notes from his talk.

Planning for the future: Voyager launched with the plan to visit Jupiter and Saturn, but the scientists also had plans for the possibility of visiting Uranus and Neptune based on that launch window, and even for the probes leaving the heliosphere within their lifetimes.  Imagine that over 40 years after launch, there are still papers to be written in Nature or Science from the mission.  Voyager 2 is nearing this mark, which Voyager 1 has already crossed.

The solar probe problem was to explore close to the Sun; the desire has been there for 60 years.  Eventually, after decades of other work, the technology was there to support components such as solar panels that can operate close to the Sun.  Once the technology was available, the proposal was for a $4 billion probe, which was outside the budget, but NASA offered to design and launch it for $1 billion.  This forced a complete redesign.  The original mission used a single Jupiter flyby to bring the probe to 4 solar radii.  For $1 billion, the flight path would instead use repeated Venus flybys to eventually lower the perihelion to 9 solar radii.  While 9 > 4, the Venus flight path provides many close passes of the Sun, which yields a richer dataset.  This probe is the Parker Solar Probe, named, exceptionally, for a still-living scientist: Eugene Parker.

Cassini reinforced the need for teams to have diverse backgrounds and training; greater success is possible with great teams.

Hubble provided the research for ~1000 publications last year alone.  At the initial launch, there was the well-known flaw in the mirror, when NASA had been predicting great images.  After studying the results, a technician worked out that the mirror was slightly out of position, and with a small shift of the sensor, everything worked.

Shortly after finishing his PhD, Dr. Z came up with a new sensor in '98 to possibly be part of Messenger, and before its launch in 2004 the instrument was added to and removed from the craft multiple times.  Even after the launch, it took Messenger 7 years to complete the orbital path and shed enough energy for the probe to enter orbit around Mercury.  This work requires patience.

On working in teams, he tells a story from his time in the Swiss army.  Barbed wire had to be placed around the camp, a job given to those in trouble.  But he talked with the soldiers on this duty and helped them.  Eventually, rather than being the punishment job, it became a good team, and some soldiers wanted to work on it.  Better to be on a good team doing a bad task than on a bad team doing a good task.

The National Academy of Sciences sets the science priorities (I read a great deal about this process from priority to mission in Chasing New Horizons), but ultimately the specific mission to meet a science priority is decided by Dr. Z.  That mission then moves through multiple steps and reviews.  One of the key points is that while Dr. Z makes the decision, these decisions are based on good advice from the experts.

For the missions flown, Dr. Z has seen about 5 failures for every success.  These failures are things like being removed from a program or a mission being canceled, and sometimes they are technical failures of the device or the probe.  Things will fail.  He goes to watch every launch and brings a large mission plan; most of that plan covers what to do if things go wrong.  Plan for the failure.