
Saturday, March 2, 2019

Conference Attendance - SIGCSE 2019 - Day 2.5

Continuing at SIGCSE, here are several more paper talks that I attended on Friday.  Most of the value at SIGCSE comes from the friendly conversations with other attendees; from 5 to 11 pm, I was in the hotel lobby talking with faculty and students, discussing research ideas, telling stories from teaching, and generally renewing friendships within the field.

On the Effect of Question Ordering on Performance and Confidence in Computer Science Examinations
On the exams, students were offered a bonus if they could predict their score to within 10%.  Does the order of questions (easy -> hard, or hard -> easy) have any impact on their estimated or actual performance on an exam?  On average, students overpredicted their scores by more than 10%.  As a whole, the hard-to-easy students did worse, but this result was not statistically significant.  Women showed a small improvement when the exam started with the hardest problem.

I wonder whether students were biased in their predictions by the reward.  Ultimately, the authors gave the reward to all students regardless of the quality of their predictions.

The Relationship between Prerequisite Proficiency and Student Performance in an Upper-Division Computing Course
We have prerequisites to ensure that students are prepared for the later course, here an upper-level data structures class.  Students started the course with, on average, 57% of the expected prerequisite knowledge and finished with an 8% improvement on that knowledge.  There is a correlation between a student's prerequisite score and their final score.  Across the several prerequisites, some knowledge concepts have greater correlation than others; assembly is a surprising example of a concept that correlates.  Students benefit from interventions that address these deficiencies early in the term.

Afterward, we discussed that this work did not explore which prerequisite knowledge correlated only weakly with student learning.  How might we better understand which prerequisites actually support the learning in a course?  Furthermore, can we better understand the general background of students in the course, such as class standing or general experience?

Visualizing Classic Synchronization Problems
This work covers three classic synchronization problems: dining philosophers, bounded producer-consumer, and readers-writers.  Each has a window displaying the operations, as well as multiple algorithmic strategies.  With these visualizations, do students learn better, and do they find them more engaging than reading about the problems in the textbook?  While not statistically significant, the control group exhibited better recall, although the visualization group reported higher engagement.  That said, the control group also had higher course grades, so the difference in learning may actually stem from unrelated factors.
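As a refresher on what one of these problems looks like in code, here is a minimal bounded producer-consumer sketch of my own in C with pthreads (not taken from the paper or its visualizations), using a mutex and two condition variables:

    #include <pthread.h>
    #include <stdio.h>

    #define CAPACITY 4

    /* Bounded buffer: producers wait when full, consumers wait when empty. */
    static int buffer[CAPACITY];
    static int count = 0, head = 0, tail = 0;
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

    static void put(int item) {
        pthread_mutex_lock(&m);
        while (count == CAPACITY)
            pthread_cond_wait(&not_full, &m);
        buffer[tail] = item;
        tail = (tail + 1) % CAPACITY;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&m);
    }

    static int get(void) {
        pthread_mutex_lock(&m);
        while (count == 0)
            pthread_cond_wait(&not_empty, &m);
        int item = buffer[head];
        head = (head + 1) % CAPACITY;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&m);
        return item;
    }

    static void *producer(void *arg) {
        for (int i = 0; i < 8; i++)
            put(i);
        return NULL;
    }

    static void *consumer(void *arg) {
        for (int i = 0; i < 8; i++)
            printf("consumed %d\n", get());
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }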

Wednesday, October 17, 2018

Thesis Defense: Practical Concurrency Testing

Ben Blum defended his dissertation work today on Practical Concurrency Testing.  What follows are the notes from that defense.

The goal is to prove that a program is correct across arbitrary concurrent interleavings.  There are three common testing approaches: unit testing of the most likely interleavings, stress testing that is not systematic, and formal verification that requires separate tools and techniques to describe the program.

Landslide is a proposed technique based on stateless model checking (Godefroid '97), which tests a different execution interleaving on every iteration.  However, naive interleaving yields O(2^n) states to test.  [Flanagan '05] identified equivalent interleavings, and [Musuvathi '08] proposed heuristic orderings to find likely bugs faster.  This approach can often require annotations, so adoption requires automated instrumentation.  Work in this space is addressing further concurrency problems such as weak memory models, but hardware transactional memory is still open.
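For intuition about that state explosion, here is a toy counter of my own (not part of Landslide): two threads that each execute n atomic steps already admit a number of interleavings that grows exponentially in n.

    #include <stdio.h>

    /* Count the interleavings of two threads with a and b remaining steps.
     * Each schedule chooses, at every point, which thread runs next. */
    static unsigned long long interleavings(int a, int b) {
        if (a == 0 || b == 0)
            return 1;
        return interleavings(a - 1, b) + interleavings(a, b - 1);
    }

    int main(void) {
        /* Prints C(2n, n): 2, 6, 20, 70, ..., 184756 for n = 10. */
        for (int n = 1; n <= 10; n++)
            printf("%2d steps each: %llu interleavings\n", n, interleavings(n, n));
        return 0;
    }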

This instrumentation requires preemption points.  Finer-grained preemption finds more bugs but increases the number of states to test.  Bugs and failures fall into certain classes, such as use-after-free, deadlocks, assertion failures, and invalid memory accesses.  Dynamic data-race analysis can help inform the necessary preemption points.

As a reminder, a data race occurs when:
- two threads access the same memory location, and at least one of the accesses is a write
- the threads are not holding the same mutex
- and there is no other ordering requirement between the accesses (condition variable, etc.)
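Here is a minimal sketch of my own (not from the defense) showing the classic racy counter in C with pthreads: the racy threads write the same location with no common mutex, while the safe version holds a lock around each increment.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Racy: unsynchronized read-modify-write of a shared variable. */
    static void *racy_increment(void *arg) {
        for (int i = 0; i < 1000000; i++)
            counter++;
        return NULL;
    }

    /* Fixed: both threads hold the same mutex around the access. */
    static void *safe_increment(void *arg) {
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;

        pthread_create(&a, NULL, racy_increment, NULL);
        pthread_create(&b, NULL, racy_increment, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("racy: %ld (often less than 2000000; increments are lost)\n", counter);

        counter = 0;
        pthread_create(&a, NULL, safe_increment, NULL);
        pthread_create(&b, NULL, safe_increment, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("safe: %ld (always 2000000)\n", counter);
        return 0;
    }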

Quicksand applies this analysis to select smaller problem spaces using subsets of the possible preemption points.  Each subset represents a smaller part of the larger problem space; if all of these subsets are satisfied, that constitutes a full verification of the program.  Prior work explored using APIs such as mutex_lock/unlock, or every shared-variable access, as the preemption points.

This tester is deployed in OS courses at CMU, PSU, and U Chicago.  Manual annotation is not viable for students, especially the struggling ones for whom the traces would be most valuable.  That said, students regularly deploy ad-hoc synchronization, such as while (!ready) yield();, which requires heuristics, as naive model checking must test every possible count of yields and their interleavings.
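To make that concrete, here is a small sketch of my own contrasting the ad-hoc spin with a condition-variable wait; the spin loop forces naive model checking to consider every possible number of yield iterations, while the condition variable gives the checker a single, well-defined blocking point.

    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* volatile so the spin loop re-reads the flag; real code would use atomics. */
    static volatile int ready = 0;
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

    /* Ad-hoc: a checker must explore every possible count of yields before ready flips. */
    static void *wait_adhoc(void *arg) {
        while (!ready)
            sched_yield();
        printf("ad-hoc waiter woke\n");
        return NULL;
    }

    /* Structured: a single blocking point the checker can treat as one preemption. */
    static void *wait_condvar(void *arg) {
        pthread_mutex_lock(&m);
        while (!ready)
            pthread_cond_wait(&c, &m);
        pthread_mutex_unlock(&m);
        printf("condvar waiter woke\n");
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, wait_adhoc, NULL);
        pthread_create(&t2, NULL, wait_condvar, NULL);

        pthread_mutex_lock(&m);
        ready = 1;
        pthread_cond_broadcast(&c);
        pthread_mutex_unlock(&m);

        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }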

When used by students, about 75% of tested kernels and libraries have bugs identifiable by the testing framework.  Across the tested submissions (7 semesters) of students at CMU, there is an improvement in grades, but it is not statistically significant after correcting for the opt-in bias.  Most students are then able to fix the bugs found by the tool.

Hardware transactional memory poses a separate challenge for model checking.  Aborted transactions are observationally equivalent to an immediately failed transaction.  Furthermore, all transactions must be assumed to be abortable, as there are many possible causes of aborts.  As prior posts covered, this fact requires that every transaction have a valid abort path, and this abort path requires most of the verification effort.
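As a rough illustration of what an abort path looks like, here is a sketch of my own using Intel's RTM intrinsics with a simple spinlock fallback (not code from the dissertation; it needs RTM-capable hardware and -mrtm to compile):

    #include <immintrin.h>   /* _xbegin / _xend / _xabort; compile with -mrtm */
    #include <stdatomic.h>

    static atomic_int fallback_lock = 0;   /* 0 = free, 1 = held */
    static long shared_value = 0;

    static void lock_fallback(void)   { while (atomic_exchange(&fallback_lock, 1)) ; }
    static void unlock_fallback(void) { atomic_store(&fallback_lock, 0); }

    void update(long delta) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            /* Transactional path: read the fallback lock inside the transaction so
               that a thread taking the lock aborts any in-flight transactions. */
            if (atomic_load(&fallback_lock))
                _xabort(0xff);
            shared_value += delta;
            _xend();
        } else {
            /* Abort path: any transaction can abort (conflict, capacity, interrupt),
               so this path must exist and must be correct on its own. */
            lock_fallback();
            shared_value += delta;
            unlock_fallback();
        }
    }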

Landslide was tested using hand-written tests, transactional data structures, and a TSX-based spinlock, with each set of tests containing a concurrency or performance bug in the implementations.  What about demonstrating that there are no bugs in an implementation?  With 10 hours of CPU time, verification is only possible for small cases of complex code.  That said, practical testing so far only requires fewer than 4 preemptions to create the buggy scenario.  There can be other bugs requiring an increasingly complex ordering, but generally those are very rare.

Abstraction reduction [Simsa '13] works by reducing primitives within implementations to verified components, such as mutual exclusion.  Using this technique then allows Landslide to verify the complex HTM implementations at higher thread counts.

In attendance are the recent instructors of Operating Systems and the TAs.


Wednesday, May 16, 2018

Review: Lessons from Building Static Analysis Tools at Google

The Communications of the ACM recently had several development articles, and I found the one on static analysis tools at Google particularly interesting.  The article works through how Google went about integrating static analysis tools into every developer's workflow.  The first challenge is that the tools have to be in the workflow, or developers will "forget" to use them.  The second is ensuring that the feedback is useful: currently, each developer marks flagged items as either useful or incorrect, and if a tool exceeds a 10% false-positive rate, it is temporarily disabled until that tool's developers can fix the flagged issues.  The third issue is that some analyses are expensive; depending on the type of static analysis, the time required may be significant, so the tools are classified into two camps: those run on each compile, and those run on each code review / commit.  It is also important that some tools can be temporarily disabled, so that during debugging or refactoring the code may temporarily pass through an "unsafe" state to simplify the process.

Personally, I am glad that they are integrating analysis tools into the development workflow.  Much work has been done to find bugs and issues within source code, so it is good that these analyses can be utilized regularly to improve code quality.

(As a note, I do not, nor have I ever, worked for Google, so I can only write based on the ACM article and not personal experience.)