Ben Blum defended his dissertation work today on Practical Concurrency Testing. What follows are the notes from that defense.
The goal is to show that a program is correct across arbitrary concurrent interleavings. There are three common testing approaches:
unit testing, which only exercises the most likely interleavings; stress testing, which is not systematic; and formal verification, which requires separate tools and techniques to describe the program.
Landslide is a proposed technique based on stateless model checking (Godefroid '97), which tests a different execution interleaving on every iteration. However, naively enumerating interleavings gives O(2^n) states to test. [Flanagan '05] identified equivalent interleavings that can be pruned, and [Musuvathi '08] proposed heuristic orderings to reach likely bugs faster. These approaches often require manual annotations, so adoption depends on automated instrumentation. The field is addressing further concurrency problems such as weak memory models, but hardware transactional memory is still open.
This instrumentation requires choosing preemption points. Finer-grained preemption points find more bugs, but increase the number of states to test. The bugs / failures sought follow certain patterns, such as use-after-free, deadlocks, assertion failures, and invalid memory accesses. Dynamic data-race analysis can help identify which preemption points are necessary.
As a reminder, a data race occurs when two threads access the same memory location and (a minimal sketch follows this list):
- at least one of the accesses is a write,
- the threads are not holding a common mutex,
- and there is no other ordering requirement between them (condition variable, etc.).
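The following is a minimal sketch of that definition (my own, not from the talk), in C with pthreads: two threads write the same counter, hold no common mutex, and have no other ordering between them.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                      /* shared location */

static void *increment(void *arg) {
    for (int i = 0; i < 100000; i++)
        counter++;                            /* unsynchronized read-modify-write */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, increment, NULL);
    pthread_create(&b, NULL, increment, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("%ld\n", counter);                 /* frequently prints less than 200000 */
    return 0;
}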
Quicksand applies this analysis to select smaller problem spaces, each defined by a subset of the possible preemption points. Each subset thus covers a smaller part of the full interleaving space, and if every subset is exhaustively tested, together they constitute a full verification of the program. Prior work explored using APIs such as mutex_lock/unlock, or every shared variable access, as the preemption points.
This tester is deployed in OS courses at CMU, PSU, and U Chicago. Manual annotation is not viable for students, especially the struggling students for whom the traces would be most valuable. Furthermore, students regularly write ad-hoc synchronization, such as while (!ready) yield();, which requires heuristics, because naive model checking would have to test every possible number of yields along with their interleavings (a sketch of the pattern follows).
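As a hedged illustration (mine, not the dissertation's), the ad-hoc pattern looks something like the following in C with pthreads; a model checker must consider every possible number of trips through the yield loop.

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static volatile int ready = 0;                /* ad-hoc flag: no mutex, no condvar */
static int result;

static void *producer(void *arg) {
    result = 42;                              /* publish the data */
    ready = 1;                                /* flip the flag */
    return NULL;
}

static void *consumer(void *arg) {
    while (!ready)
        sched_yield();                        /* spin-and-yield "synchronization" */
    printf("%d\n", result);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}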
When used by students, about 75% of the tested kernels / libraries had bugs identifiable by the testing framework. Across the tested submissions (7 semesters of CMU students), there is an improvement in grades, but it is not statistically significant after correcting for the opt-in bias. Most students are then able to fix the bugs found by the tool.
Hardware transactional memory poses a separate challenge for model checking. An aborted transaction is observationally equivalent to a transaction that fails immediately. Furthermore, every transaction must be assumed to be abortable, as there are many possible causes of aborts. As prior posts covered, this means every transaction must have a valid abort path, and that abort path is where most of the verification effort goes (a hedged sketch follows).
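As a sketch of what such an abort path looks like (my own example using Intel's TSX/RTM intrinsics, not Landslide's test code; lock_acquire, lock_release, lock_is_held, and critical_section_update are assumed to be defined elsewhere, and the file must be compiled with -mrtm):

#include <immintrin.h>

extern void lock_acquire(void);               /* assumed spinlock fallback */
extern void lock_release(void);
extern int  lock_is_held(void);
extern void critical_section_update(long *);  /* assumed: the protected work */

void run_transaction(long *shared) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        if (lock_is_held())                   /* read the fallback lock so the */
            _xabort(0xff);                    /* transaction aborts if it is taken */
        critical_section_update(shared);
        _xend();
    } else {
        /* Abort path: capacity overflow, conflicts, or interrupts can all abort
           the transaction, so this path must be correct on its own, and it is
           the part that carries most of the verification burden. */
        lock_acquire();
        critical_section_update(shared);
        lock_release();
    }
}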
Landslide was evaluated using hand-written tests, transactional data structures, and a TSX-based spinlock, with each set of tests containing a concurrency or performance bug in its implementations. What about demonstrating that there are no bugs in an implementation? With 10 hours of CPU time, verification is only possible for small cases of complex code. That said, practical testing so far has required fewer than 4 preemptions to create the buggy scenario; there can be other bugs requiring an increasingly complex ordering, but those are generally very rare.
Abstraction reduction [Simsa '13] reduces primitives within an implementation to already-verified components, such as mutual exclusion. Using this technique then allows Landslide to verify the complex HTM implementations at higher thread counts.
In attendance were the recent instructors of Operating Systems and the TAs.
A discussion of how to do Computer Science well, particularly writing code and architecting program solutions.
Wednesday, October 17, 2018
Wednesday, October 10, 2018
NASA Talk: Making Miracles Happen
Dr. Thomas H. Zurbuchen, the Associate Administrator of NASA's Science Mission Directorate, gave a talk today about some key lessons he has learned and observed over his years with NASA, as well as the years before. The following are my quick notes from his talk.
Plan for the future: Voyager launched with the plan to visit Jupiter and Saturn, but the scientists also planned for the possibility of visiting Uranus and Neptune based on that launch window, and for the chance that, within their lifetimes, the probes might leave the heliosphere. Imagine that over 40 years after launch, there are still papers to be written in Nature or Science from the mission. Voyager 2 is nearing this crossing, which Voyager 1 has already made.
The solar probe problem was to explore close to the Sun, a desire going back 60 years. Eventually, after decades of other work, the technology existed to support components such as solar panels that can operate close to the Sun. Once the technology was available, the proposal was for a $4 billion probe, which was outside the budget, so NASA offered to design and launch it for $1 billion. This forced a complete redesign. The original mission used a single Jupiter flyby to bring the probe to 4 solar radii; for $1 billion, the flight path would instead use repeated Venus flybys to eventually lower the perihelion to 9 solar radii. While 9 > 4, the Venus flight path provides many close passes of the Sun, which yields a richer dataset overall. This probe is the Parker Solar Probe, exceptional also in being named for a still-living scientist, Eugene Parker.
Cassini reinforced the need for teams to have diverse backgrounds and training, and that greater success is possible when you have great teams.
Hubble provided the basis, in just the last year alone, for ~1000 publications. At the initial launch there was the well-known flaw in the mirror, after NASA had been predicting great images. After studying the results, a technician worked out that the mirror was slightly out of position, and with a small shift of the sensor, everything worked.
Shortly after finishing his PhD, Dr. Z proposed a new sensor in '98 to possibly be part of Messenger at its 2004 launch; the sensor was added to and removed from the craft multiple times. Even after launch, it took Messenger 7 years to complete the flight path and shed enough energy for the probe to enter orbit around Mercury. This work requires patience.
On working in teams, he told a story from his time in the Swiss army. Barbed wire had to be placed around the camp, a job given to soldiers who were in trouble. He talked with them while on this duty and helped out, and eventually, rather than being the punishment job, it became a good team that some soldiers wanted to join. Better to be on a good team doing a bad task than on a bad team doing a good task.
The National Academy of Sciences sets the science priorities (a process from priority to mission that I read about at length in Chasing New Horizons), but ultimately the specific mission to meet a science priority is decided by Dr. Z. That mission then moves through multiple steps and reviews. A key point is that while Dr. Z makes the decision, the decision is based on good advice from the experts.
For the missions he has worked on, Dr. Z has had about 5 failures for every success. These failures include being removed from a program or a mission being canceled, and sometimes a technical failure of the instrument or the probe. Things will fail. He goes to watch every launch and brings a large mission plan, most of which covers what to do if things go wrong. Plan for the failure.
Friday, September 14, 2018
Is C low level?
A recent ACM article, C is Not a Low-Level Language, argues that, for all of our impressions that C is close to hardware, it is not actually "low-level". The argument is as follows: C was written for the PDP-11 architecture, and at the time it was a low-level language. As architectures have evolved, C's machine model has diverged from the hardware, which has forced processor designs to add new features to attain good performance with C, such as superscalar execution for ILP and extensive branch prediction.
Processors must also maintain caches to hide memory latency, which requires significant logic to maintain coherence and the illusion that memory is shared between the threads of a process. Furthermore, the compiler is also called upon to find optimization opportunities that may be unsound and that certainly take programmer-years to implement.
The author repeatedly contrasts CPUs with GPUs, while noting that GPUs require very specific problems, or come "at the expense of requiring explicitly parallel programs". If you were not keeping track, a GPU requires thousands of threads to match a CPU's performance. The author calls for the following: "A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model." That generally sounds like the GPU design.
I appreciate the callouts to C's shortcomings, which it certainly has. The notion that C has driven processor design is odd, yet it does reflect the fact that processors are designed to run existing programs fast; and with those programs written in either C or a language built on C, many programs are forced into particular patterns. I even spent some time in my PhD studies considering a version of this problem: how do you design a new "widget" for the architecture if no programs are written to take advantage of widgets?
All to say, I think C is a low-level language, and while its distance from hardware may be growing, there is nothing else beneath it. This is a gap that needs to be addressed, and by a language that has explicit parallel support.
Wednesday, August 29, 2018
Thesis Defense: Responsive Parallel Computation
We can consider parallelism strategies as either competitive (such as pthreads) or cooperative (such as Cilk). Cooperative systems assume that the parallel computation is distributed evenly across the threads, so that no task needs to be prioritized over another.
Into this, Stefan Muller presented a Responsive Parallel Computation approach where the two models can be unified. Or put another way, "We can extend existing cooperative language and cost models to account for competitive threading constructs, and to design and implement efficient and responsive scheduling algorithms."
Cooperative systems, such as Cilk, rely on task queues and work stealing to implement the parallelism, but when I/O is involved, the common implementations use blocking system calls. In a cooperative system, such a blocking call blocks one of the worker threads, rather than just the task that made the call (a sketch of this problem follows).
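A minimal sketch of that failure mode (my own, with hypothetical task and queue helpers rather than Cilk's actual runtime), in C:

typedef struct task { void (*run)(void *); void *state; } task_t;

task_t *pop_local_or_steal(int worker_id);    /* hypothetical work-stealing pop */

void worker_loop(int worker_id) {
    for (;;) {
        task_t *t = pop_local_or_steal(worker_id);
        if (t == NULL)
            continue;                         /* nothing local and nothing to steal yet */
        /* If t->run() makes a blocking read() or similar system call, this OS
           worker thread sleeps in the kernel, and every other task queued on
           this worker is stuck behind it. */
        t->run(t->state);
    }
}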
For operating system scheduling, there are many threads across the myriad processes in the system that may need to run, and the OS often uses a priority associated with each thread / process to make these scheduling decisions. When programmers assign priorities (if they do at all), they are usually trying to express a relative ordering between tasks rather than absolute priorities. Since systems limit the priority levels to anywhere from a handful to around 100, the mapping of priority classes to numeric levels can be unclear.
The thesis proposes PriML, an extension to ML with two new types:
t cmd[p] - a command at priority p returning t
t thread[p] - a thread at priority p returning t
priority and order declarations define the priorities and specify the partial order between them. spawn[priority] {expression} creates a thread, sync retrieves its result, and poll returns an option that can be matched, so that we can select among multiple threads. For example, the following PriML sequence spawns two threads to sort emails with different methods and selects whichever finishes first.
t1 <- spawn[p] {do (qsort date emails)};
t2 <- spawn[p] {do (mergesort date emails)};
do (let fun choose () =
cmd[p]
{
v1 <- poll t1;
v2 <- poll t2;
do (case (v1, v2) of
(SOME l, _) =>
cmd[p] {cancel t2; ret l}
| (_, SOME l) =>
cmd[p] {cancel t1; ret l}
| (NONE, NONE) => choose ())
}
in
choose ()
end)
These features then allow the language to enforce the priority constraints and avoid priority inversion: sync may only target a thread at an equal or higher priority, while poll may target any priority.
Returning to parallel program analysis, we represent programs as DAGs; however, the DAG must be extended with priority annotations alongside the usual work / span calculations. This analysis can then explore different schedules, for example showing how adding fairness across all thread priorities extends the bound on completion time (the classic baseline bound is recalled below).
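As a reminder of the baseline being extended (this is the classic greedy-scheduling bound for cooperative programs, not the thesis's responsive bound): a greedy scheduler on P processors executes a DAG with work W and span S in time

    T_P <= W / P + S

and the thesis, roughly speaking, layers priorities and fairness onto this style of cost analysis.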
Monday, August 27, 2018
Book Review: Valley of Genius
Valley of Genius is a history of Silicon Valley based on the people who made it. Since the work was based on interviews, I expected that it would read as actual interviews, with dialog between the author and the "genius". Instead, the author is absent from the chapters, and the entire text consists of the different participants being quoted. This writing style took some time to accept. Initially, I wanted to know exactly who each person was and their role in the narration, which, given the number of people involved, would significantly detract from the story being told. Eventually I stopped bothering with the names, only looking them up when two (or more) speakers referred to each other. That aspect is the best part of the book: the different individuals effectively debating one another in the dialog; otherwise each chapter is just narration. That said, one concern is that context may be lost from the quotes, as the author explicitly states that the interviews were spliced together.
All in all, I enjoyed the book, but it only merits 3.5/5 stars.
(A free copy of the book was provided through Goodreads.)
Friday, August 17, 2018
Repost: CRA Memo on Best Practices for Engaging Teaching Faculty in Research Computing Departments
Mark Guzdial noted today that the CRA has prepared its memo about teaching faculty in research departments. For the past two years, I have been going to a CRA event at SIGCSE geared toward preparing this memo, so good to see its release. I am thankful that at my institution, we have most of the things outlined in the memo and are treated roughly as equals to the tenure-track faculty.
Tuesday, July 31, 2018
Book Review: The Art of Application Performance Testing
The Art of Application Performance Testing covers what its title says. The book starts with concepts general to any performance testing, which was the part most interesting to me. Most of the text, though, focuses on the Application part of the title. The applications here are primarily web-based, or other client-server setups, and not the generic "application" referring to any program. Since I do not work on such applications, the remainder of the text was of less value to me.
In testing applications, a performance analyst needs to establish a representative workload, which includes both the actions to perform and the combined load. For example, most users logging in to their bank will view their account balance, while others might transfer money or pay a bill; combined, these actions represent most of the work generated by users. Then, for each unit of server capacity, the question is how many users should be able to perform that mix of actions, which forms the load (a small sketch of such a mix follows).
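As a small sketch of such a mix (my own, not from the book; the 80/15/5 split and the action names are assumptions for illustration), a load generator can pick each simulated user action by weight, in C:

#include <stdlib.h>

typedef void (*action_fn)(void);

static void view_balance(void)   { /* drive the "view balance" request here */ }
static void transfer_money(void) { /* drive the "transfer money" request here */ }
static void pay_bill(void)       { /* drive the "pay a bill" request here */ }

struct weighted_action { action_fn run; int weight; };

/* Assumed mix: 80% balance views, 15% transfers, 5% bill payments. */
static struct weighted_action mix[] = {
    { view_balance,   80 },
    { transfer_money, 15 },
    { pay_bill,        5 },
};

static void perform_one_action(void) {
    const int n = sizeof(mix) / sizeof(mix[0]);
    int total = 0;
    for (int i = 0; i < n; i++)
        total += mix[i].weight;
    int pick = rand() % total;                /* uniform draw over the total weight */
    for (int i = 0; i < n; i++) {
        if (pick < mix[i].weight) {
            mix[i].run();
            return;
        }
        pick -= mix[i].weight;
    }
}

int main(void) {
    for (int i = 0; i < 1000; i++)            /* actions for one simulated load unit */
        perform_one_action();
    return 0;
}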
After establishing the workload, the analyst needs to implement it, which requires a tool that generates the load (either by driving the application itself or by replaying a synthetic trace of the load). For those tools: what additional hardware is required to deploy this load? Does the deployment account for geographic and other user variations, so that the load generation is representative of the user base? Finally, what tooling and methodology exist for profiling and recording the execution of the workload for present and future analysis?
So I appreciated the content of the book and would recommend it to individuals focusing on testing of user-facing applications.