Wednesday, November 22, 2017

Review: Languages and Code Quality in GitHub

The study is several years old, but was recently reprinted in the Communications of the ACM.  In it, the authors mined GitHub data for active open source projects, collecting defect and development rates.  They classified the defects by type, and the development languages by their features.  And they found that language choice matters, but only marginally.  Some types of bug are far more common, such as memory management bugs in C / C++.  Functional languages have the lowest defect rate; however, this analysis is based only on commit history and does not account for development time or for differences between programmers.  So the takeaway is that language features do matter, but programmers just write buggy code.

Tuesday, October 24, 2017

PhD Defense - Low-level Concurrent Programming Using the Relaxed Memory Calculus

Today, I went to Michael Sullivan's thesis defense, which he passed.  The work was at a delightful intersection of my interests.

We want better (more usable, etc) semantics for low-level atomic operations, those below std::atomic<> and similar designs.  Perhaps this is achievable with ordering constraints.  Given the following simple example, what constraints are required?

int data, flag;

void send(int msg) {
  data = msg;
  flag = 1;
}

int recv() {
  while (!flag) continue;
  return data;
}

Two constraints: data must be visible before flag, and flag must be executed (read) before data.  These constraints are explicitly programmer-specified, and it is contended that this is practical.

- rmc::atomic<T> - a variable that can be concurrently accessed
- L(label, expr) - labels an expression
- VEDGE and XEDGE - specify orderings between labeled expressions; effectively, V constrains write visibility and X constrains read execution
- rmc::push() or PEDGE - pushes have a total order, and provide orderings between reads and writes that are not possible with just V and X
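
Putting the primitives together on the message-passing example above, a sketch (reconstructed from my notes, so treat the exact macro usage as illustrative rather than authoritative):

rmc::atomic<int> data, flag;

void send(int msg) {
  VEDGE(wdata, wflag);          // the write to data is visible before flag
  L(wdata, data = msg);
  L(wflag, flag = 1);
}

int recv() {
  XEDGE(rflag, rdata);          // the read of flag executes before data
  while (L(rflag, flag) == 0) continue;
  return L(rdata, data);
}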

Moving to more advanced uses, do we need to add constraints to spinlock_lock and spinlock_unlock?  Let's add two special labels: pre and post.  These serve as interface boundaries, denoting that everything before this point has executed, or that everything before it is visible.

The next problem is loop iterations.  Do the constraints need to hold within a single iteration, or across every iteration?  The order specifiers are extended, so in the following, the ordering constraint applies just within the iteration, whereas the constraint without "_HERE" would also hold between iterations.

for (i = 0; i < 2; i++) {
  VEDGE_HERE(before, after);
  L(before, x = i);
  L(after, y = i + 10);
}

The code extends LLVM and is on GitHub.  The compiler takes the RMC extensions and inserts the appropriate fence instructions into the IR, and the existing backend then lowers this to assembly.  The compiler uses an SMT solver to determine the minimal set of locations that need fences (or other instructions), and the lowering can then take advantage of the specific constraints required.  Overall, performance is better than the C++11 model on ARMv7 and Power, and comparable on ARMv8.  I suspect that x86's TSO model is not as interesting for finding performance benefits.

Usable / practical - Can Seqlocks Get Along With Programming Language Memory Models? argues that C++11 would require acquire semantics on unlock.  Here it is argued that RMC is much more straightforward.  Further, students in 15-418 found gains from RMC versus the C11 model.

Other future items include exploring whether additional consistency instructions might give the compiler more nuanced ways to inform the hardware about required orderings.  Recall that the coarsest-grained such instruction is the full memory fence.

Thursday, July 13, 2017

PhD Defense - Automated Data-Driven Hint Generation for Learning Programming

Kelly Rivers defended her PhD work this afternoon.  She will be returning to CMU this fall as a teaching professor.

Student enrollment is increasing, so more work is needed to automate support, as TAs / instructors do not scale.  Prior work (The Hint Factory) developed models from prior student submissions, and a current student's work can then be located within the model, providing suggestions for how to proceed.  However, programming may not fit this model, due to the larger and more varied space of ways in which students can solve a problem.

First, student code proceeds through a series of canonicalization steps - parsing to an AST, anonymization, and simplification - such that the following Python code is transformed:

import string
def any_lowercase(s):
  lst = [string.ascii_lowercase]
  for elem in s:
    if (elem in lst) == True:
      return True
    return False

Becomes

import string
def any_lowercase(p0):
  for v1 in p0:
    return (v1 in string.ascii_lowercase)

Studies then covered 41 different problems, with hundreds of correct solutions and thousands of incorrect solutions.  The model can then generate edits and chain these hints as necessary.  In more than 99.9% of cases, the model could successfully generate a hint chain reaching a correct solution.

To further test this approach, a model seeded with only the teacher's solution was compared against the final model.  Ideally, the final model will propose fewer edits than the initial model.  For 56% of problems, this was true; 40% of problems were already optimal; and 3% are opportunities for improving the model.

Next, given this model, how do the hints impact student learning?  Half of the students were selected to be given optional access to the hint model.  Using a pre / post assessment, the measurement was a wash.  So a second study was designed that required the students to use the system within a two-hour OLI module.  Hints were provided with every submission, either before or after the midtest in the OLI module.  Only half of the students actually proceeded through the module in order; however, most learning occurred within the pretest -> practice -> midtest sequence, so adding those students increased the population.  The results show that the hints reduce the time required to learn an equal amount.

From interviews, students need and want targeted help on their work; however, the hints generated thus far were not always useful.  She proposed another study based on different styles of hints: location, next-step, structure, and solution.  This study found that participants with lower expertise wanted more detailed hints.  Hints were sometimes used to learn what is wrong rather than how to solve it.  And often, students know what to do and just need a reference (an example or prior work) for how to do it, rather than a hint about what to do.

Monday, July 10, 2017

Wrote my own Calling Convention

I have been listening to the Dungeon Hacks audiobook, and it has reminded me both of my past joys of playing Angband, and of some interesting little "hacks" I did in high school.

In high school, I wrote many programs on my TI-83, often to help me in other classes, which led to an interesting conversation:
Student: "Teacher, Brian has written programs on his calculator for this class."
Teacher: calls me forward "Brian, is this true?"
Me: "Yes."
Teacher: "Did you write them yourself?"
Me: "Yes."
Teacher: "I do not see what the problem is."

And besides writing useful programs, I also wrote games, all in the TI-83's BASIC.  However, the TI-83 only had 24kB of space for programs and data, and eventually I started exceeding this limit.  So I started finding interesting hacks to reduce program size, such as omitting the trailing " of a string when that string ends the line.  Each would save 1 byte, but it started to add up.

Now, TI-BASIC on the 83 was very primitive.  In particular, it had no GOSUB instruction, only GOTO, so you could not write functions / subroutines to reuse code, which would both improve the design and reduce the space required.  I already knew the basics (no pun intended) of C, so I knew that code should be able to call other functions and get values back.  And TI-BASIC would let you call another program and then return to the calling program after the called one finished.  That's like a function call, right?

Therefore, I wrote a library program.  Variables were global, so the calling program could set specific parameters in the global variables and call the library program.  The library would use one parameter to determine which functionality to execute, update the necessary globals, and then exit, thus returning.  Consequently, I had a library of common functions, which sped up my development and reduced the space I needed.
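
In modern terms, the scheme looked something like this sketch (written in C rather than TI-BASIC, with invented names):

int ARG1, ARG2, OP, RESULT;       // globals standing in for registers

void library(void) {              // the single callable "program"
  if (OP == 1) RESULT = ARG1 + ARG2;        // dispatch on OP
  else if (OP == 2) RESULT = ARG1 * ARG2;
}

int call_add(int a, int b) {      // caller: set globals, call, read result
  ARG1 = a; ARG2 = b; OP = 1;
  library();
  return RESULT;
}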

And so it was only in listening to the audiobook last week that I realized that, long ago, I had developed a simple calling convention in high school.  And now I teach, among other things, the x86-64 Linux ABI (i.e. calling convention) to college students.

A calling convention just dictates how each register is used when calling a function - which ones must be preserved, which carry the arguments, which holds the return value, etc.  It also dictates the management of the stack.
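
For example, under the x86-64 Linux (System V) ABI that I teach:

// The first six integer arguments arrive in rdi, rsi, rdx, rcx, r8, r9;
// the return value goes back in rax; rbx, rbp, and r12-r15 must be
// preserved by the callee.
long add3(long a /* rdi */, long b /* rsi */, long c /* rdx */) {
  return a + b + c;               // result returned in rax
}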

Wednesday, May 3, 2017

PhD Defense - Meeting Tail Latency SLOs in Shared Networked Storage

Today I went to Timothy Zhu's PhD thesis defense.  His work is on better sharing of data center resources to improve performance, and particularly to reduce tail latency.  He also TA'd for me last fall.

Workloads are generally bursty, and their characteristics differ.  Furthermore, they may have service level objectives (SLOs), and the system needs to meet these different objectives.  The system also contains a variety of resources that must be shared in some form; it is not sufficient to just divide the bandwidth.  Nor can the system measure latency and react, particularly as bursty workloads do not give it sufficient time to do so.  While each workload has deadlines, it would be too complex to tag request packets with deadlines for queuing and routing; however, the deadlines can be used to generate priorities for requests.

The system is architected to have storage and network enforcement components to ensure QoS.  There is also a controller that receives an initial trace to characterize each workload, and that workload's SLOs.  The controller works through a sequence of analyses to successfully place each workload into the overall system.

Effectively, each workload is assigned a "bucket" of tokens, where the bucket size provides the ability to absorb bursts and the rate at which tokens are added covers the workload's request rate.  Bursty workloads receive large buckets and low rates, while steady workloads with few bursts receive small buckets and high rates.  In both cases, only when the bucket is empty is the workload rate-limited in its requests, and those requests receive the lowest priority.  Deterministic Network Calculus (DNC) is used to model the worst-case queueing scenarios.  This plots two curves: the requesting flow and the service curve, both plotted as tokens as a function of window size (dt).  The maximum distance between the curves is the maximum latency.
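
A minimal token-bucket sketch (my own names and granularity, not the talk's code):

#include <algorithm>

struct TokenBucket {
  double tokens;      // current fill
  double capacity;    // bucket size: absorbs bursts
  double rate;        // tokens added per second: sustained request rate

  void refill(double elapsed_sec) {
    tokens = std::min(capacity, tokens + rate * elapsed_sec);
  }

  // True: the request proceeds at normal priority.  False: the bucket
  // is empty, so the request is rate-limited at the lowest priority.
  bool admit() {
    if (tokens >= 1.0) { tokens -= 1.0; return true; }
    return false;
  }
};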

Using three traces - DisplayAds, MSN, and LiveMaps - they tested three approaches: Cake (a reactive approach), earliest deadline first, and Timothy's scheme (PriorityMeister).  His scheme did significantly better than the others at meeting the SLOs.  However, the DNC analysis was based on achieving 100% of requests, not the SLO's 99% (or other percentile).  Depending on the workload characteristics, there can be significant differences between these guarantees.  To model latency percentiles, Stochastic Network Calculus (SNC) can be used; however, the math is significantly more complex, and it had not previously been applied to this problem.  DNC is still better when bursts are assumed to be correlated or the system is in an adversarial setting.  Relaxing these assumptions (uncorrelated workloads), the SNC-based analysis permitted the system to admit 3x as many workloads as the DNC analysis.

Each workload has a curve of satisfying (bucket size, token rate) pairs.  Many systems require the user to provide a rate limit; other systems use simple heuristics to either find the "knee of the curve" or select a rate limit as a multiple of the average rate.  For an individual workload, however, all pairs are satisfying; the different pairs only matter when workloads are combined in a system.  The configuration of the set of workloads on the system can be solved for using a system of linear equations.  Therefore, when placing new workloads, the controller can find successful placements, while potentially reconfiguring the workloads already assigned.

One extension would be addressing failure modes.  Currently, the system is assumed to be at degraded performance when components have failed.

Monday, May 1, 2017

Book Review: Multicore and GPU Programming: An Integrated Approach

I wanted to like this book, Multicore and GPU Programming: An Integrated Approach, and be able to use it in the classroom.  I teach a class where we cover OpenMP, MPI, Cilk, and CUDA.  Yet I would not use this book, and I have told the publishers this as well; they acknowledged my reasoning.

First, the author elects to use QtThreads as the base parallel programming approach.  He notes that he considered pthreads and C++11 thread support, and rejected C++11 threads because the support was incomplete at the time; pthreads is left without comment.  That may be, but I have heard of and used those options, while QtThreads is something I had never considered.

Second, the book serves as an admirable API reference.  For each function call, or set of calls, significant space is dedicated to how the call works, with illustrative examples.  Structured Parallel Programming also covers many APIs, yet it maintains the feel of being about parallel programming in general rather than the calls specifically.  However, that work covers different API sets, so the two are not directly comparable.

Third, and this issue is really for the editors rather than the author, the typesetting on each page is poor.  There is significant white space left bordering the text on each page.  Furthermore, the code wraps, often not because the code is long, but because of long comments or comments on the same line as the code.  I understand that these are stylistic choices in programming; however, line wraps from long comments leave the text looking unprofessional.  I assume the author wrote the code separately and provided it for inclusion in the book, but the editor failed to make allowances in the typesetting.

In conclusion, I wanted to like and use this book.  Whenever I speak with the publisher, they always direct me to it; I just have to hope for something else to come along.  Use it as a reference perhaps, but I am cautious in my recommendation.

(This book was provided free by the publisher to review for possible use in the classroom.)

Friday, April 21, 2017

Repost: What Makes a Program Elegant?

In a recent issue of the Communications of the ACM, there was a short article titled What Makes a Program Elegant?  I found it an interesting discussion that summarizes well the characteristics of elegant programming: minimality, accomplishment, modesty, and revelation.  Revelation is one I had not considered before, but I think it is the most important of all.  There are code sequences I have written whose elegance rests most of all on revelation: using and showing some aspect of computers and programming that I had never seen before, or revealing that there is a modest way to accomplish something new or different.

Monday, March 13, 2017

Book Review: Optimized C++: Proven Techniques for Heightened Performance

I have spent significant time on performance issues and have been in search of a book that summarizes the diversity of issues and techniques well.  I hoped that Optimized C++: Proven Techniques for Heightened Performance would provide some of the guidance I want, but this book is not quite it.  There is good material here, yet I repeatedly found myself thinking that the author was not aware of the past 10(?) years of changes to the field.  It is not that the book is from the early 2000s; it was published last year.

A key step in improving the performance of programs is measuring it, and there are a variety of techniques for doing so: tools based on instrumentation and tools based on sampling profiling.  I find greater value in sampling profilers (for measuring performance) due to their lower overhead and their ability to pinpoint where in a function the cost lies.  Yet the book's focus is limited to gprof-esque approaches.  I tell students that this approach is best with deep call trees, which may be a greater issue for C++ programming specifically.

The author is somewhat dismissive of compiler optimizations, and emphasizes that the benefit he has observed has been limited mostly to function inlining.  There are many more optimizations, and you should care about them.  But again, I wonder if his experience of C++ has been deep call trees that particularly benefit from inlining.

In a take-it-or-leave-it aside, this work also discourages the use of dynamic libraries.  Yes, they impose a performance penalty, but they also provide valuable functionality.  Whether you should statically or dynamically link your code depends on your use case.  Code that is reused by separate executables should be in a dynamic library, as this reduces memory requirements when running and reduces the effort to patch and update those executables.  Components used by only a single executable should be statically linked, unless they are of significant size such that decoupling still benefits memory usage and the update process.

The author relates that replacing printf with puts to print a plain string has performance advantages, due to printf being a complicated "God function".  The basic point is valid: printf carries significant functionality.  However, the anecdote should be taken with a grain of salt, as current compilers will perform this replacement automatically.
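
For example, GCC and Clang transform the following with optimization enabled (easily confirmed by inspecting the generated assembly):

#include <stdio.h>

int main(void) {
  // The format string has no conversions and ends in a newline,
  // so the compiler emits a call to puts("Hello") instead.
  printf("Hello\n");
  return 0;
}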

While most of the work provides small examples, the final chapters on concurrency (?) and memory management do not.  The concurrency chapter reads as a reference, listing the various APIs available and what each does.  It would be better for the book to assume that readers are familiar with these calls (as the author does with many other topics) and discuss possible optimizations within this scope.

To conclude, the book is not bad, but I also cannot say it is accurate on every point.  Especially with performance, programmers are apt to make snap design decisions based on "their experience" or "recent publications".  Measure your code's performance; only then can you discern which techniques will provide value.

Saturday, March 11, 2017

Conference Time: SIGCSE 2017 - Day 2

I started my morning by attending my regular POGIL session.  I like the technique and use it in the classroom.  However, I should probably make the full transition, attend the (multi-day) workshop, and perhaps get one of those "ask me about POGIL" pins.

Lunch was then kindly provided by the CRA for all teaching-track faculty in attendance.  There is the start of an effort to ultimately prepare a memo to departments on how best to support / utilize us (including me).  One thing for me is recognizing how to evaluate the quality of teaching / learning.

Micro-Classes: A Structure for Improving Student Experience in Large Classes - How can we provide the personal interactions that are valuable when enrollments are large and increasing?  We have a resource that is scaling: the students.  The class is partitioned into micro-classes, with clear physical separation in the lecture room, and each micro-class has a dedicated TA / tutor.  Did this work in an advanced (sophomore / junior) class on data structures?

Even though the same instructor taught both the micro-class and the control class, the micro-class students reported higher scores for the instructor's preparedness, concern for students, etc.  Yet there was no statistical difference in learning (as measured by grades).

Impact of Class Size on Student Evaluations for Traditional and Peer Instruction Classrooms - How can we compare the effectiveness of peer instruction across courses of varying class sizes?  For dozens of courses, the evaluation scores for PI and non-PI classes were compared.  There was a statistical difference between the two sets, particularly in evaluations of the course and instructor.  This difference persists even when splitting by course, and it stems neither from the frequency of the course nor from the role of the instructor (teaching, tenure, etc).

Thursday, March 9, 2017

Conference Attendance SIGCSE 2017 - Day 1

Here in Seattle, where I used to live, attending SIGCSE 2017.

Exposed! CS Faculty Caught Lecturing in Public: A Survey of Instructional Practices - Postsecondary Instructional Practices Survey (24 items), 7000 CS faculty invited, about 800 responses. If the evidence is clear that active learning is better for instruction, then we should be doing more of it. The overall split for CS was even between student-centered and instructor-centered (exactly the same average, 61.5). The survey showed clear differences between non-STEM (student-centered) and STEM (instructor-centered) fields. So CS is doing better than its overall group.

Now, to dig into the differences across demographics. The major differences among instructors are for women, and for those with 15 years of experience versus 30, both showing a 5+ point difference between student- and instructor-centered. For those who are strongly committed to one style, there are about 20% on each side, while the remaining 60% are "whatevers" who are not strongly committed.

Investigating Student Plagiarism Patterns and Correlations to Grades - What are the patterns of plagiarism, such as copying parts or all of an assignment, and how do students try to obfuscate their "work"? Data came from 2400 students taking a sophomore-level data structures course. After discarding assignments with insufficient solution space, four assignments remained across six semesters. A plagiarism detector was used to find likely cases of cheating.

First, even though the assignments remained unchanged, the rate of cases stayed constant. Most cases involved work from prior semesters, and about two thirds of students who cheated did so on only one assignment. Second, the rate of cheating on the individual assignments was similar to that on the partner assignment. Third, while students who cheated did better on those assignments, they did not receive perfect scores, and those cheating did worse in the course than those who did not; those who took the follow-on course showed an even larger grade difference (p=0.00019). Fourth, the analysis used the raw gradebook data, which is independent of the detection and its outcome.

Six detectors were used; the fraction of total cases found by each follows.
- Lazy detector (common case; ignores comments and whitespace): 15.69%
- Token-based (all names become generic; functions sorted by token length), identical token stream: 18.49%
- Token-based, modified token edit distance: 49.71%
- Token-based, inverted token index (compute 12-grams and inversely weight how common these are): 72.77%
- "Weird variable name" (lowercased, underscores removed): 67.35%
- Obfuscation detector (all on one line, long variable names, etc): 0.38%

Monday, February 20, 2017

Repost: Learn by Doing

I want to take a brief moment to link to two of Mark Guzdial's recent posts, both touching on an important theme in teaching: students learn best by doing, not hearing.  Oddly, students commonly repeat the misconception themselves.  If I structure our class time to place them as the ones doing something, rather than me "teaching" by speaking, the appraisal can be that I did not teach.  They may not dispute that they learned, but I failed to teach them.

Students learn when they do, not just hear.  And Learning in MOOCs does not take this requirement into account.

I have to regularly review these points.  So much so that I was able to give them to a group of reporters last week (part of new faculty orientation, but still).

Wednesday, February 8, 2017

Conference Attendance CGO (etc) 2017 - Day 3

Today is the last day of the conference.  I attended several more talks, and my student took 2nd in the student research competition.  So it has been a good attendance: I received some interesting and valuable feedback on my own work, and found some possible collaborations for the future.

Optimistic Loop Optimization
The compiler wants to optimize loops in the program.  However, C and LLVM IR have many complex characteristics that the polyhedral model cannot represent, such as aliasing, integer wrapping, or out-of-bounds accesses.  Rather than just assuming these characteristics are absent, the code can be analyzed to determine which violating characteristics may be present.  These assumptions are placed in the code and can then be reduced to a set of preconditions under which the optimized loop can execute.  Should the preconditions fail at runtime, the code instead branches to the original version.  The assumptions themselves can also be optimized (for example, N < 127 implies N < 128).  For SPEC200x, the assumptions fail about 2% of the time and impose 4% runtime overhead.
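
The versioning shape is roughly the following sketch (my own illustration, not the paper's code):

// Run the aggressively optimized loop only when the runtime
// precondition (here: no aliasing between a and b) holds;
// otherwise fall back to the original loop.
void scale(float *a, float *b, int n) {
  if (a + n <= b || b + n <= a) {     // precondition check
    for (int i = 0; i < n; i++)       // optimizer may vectorize freely
      a[i] = 2.0f * b[i];
  } else {
    for (int i = 0; i < n; i++)       // fallback: original version
      a[i] = 2.0f * b[i];
  }
}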

Software Prefetching for Indirect Memory Accesses
What should we prefetch in software? A[x + N] is easy for hardware, A->next is hard for everyone, while A[B[x + N]] is easy for software and hard for hardware to predict.  So given a loop with this indirect structure (such as exists in NAS IS), prefetches can be inserted that will speed up the loop.  Yet there are three key points for inserting these prefetch instructions (a sketch follows the list).
- You have to prefetch both A[B[]] and B[].  Otherwise, the prefetch will block on the load of B[].
- Having both prefetches requires that they are offset from the actual access, as well as from each other.  Too close, and they are not parallel; too far, and the data is evicted before use.
- As the first point implies, there is an actual load (not a prefetch) of B[], which therefore needs to be bounds checked.
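
A sketch of the resulting loop (the prefetch distances are illustrative, not the paper's values; __builtin_prefetch is the GCC / Clang prefetch intrinsic):

long sum_indirect(const int *A, const int *B, int n) {
  const int D1 = 64, D2 = 32;   // D1 > D2: B[] runs ahead of A[B[]]
  long sum = 0;
  for (int x = 0; x < n; x++) {
    if (x + D1 < n) __builtin_prefetch(&B[x + D1]);
    // Prefetching A[B[x + D2]] performs a real load of B[x + D2],
    // hence the bounds check.
    if (x + D2 < n) __builtin_prefetch(&A[B[x + D2]]);
    sum += A[B[x]];
  }
  return sum;
}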

Tuesday, February 7, 2017

Conference Attendance CGO (etc) 2017 - Day 2

Several of the talks were great and very interesting; others particularly needed further presentation practice.  Unfortunately, this sometimes comes from English being a second language.  So I am torn between wanting presenters to have practice and being open to a wider pool of researchers, while also wanting to easily understand the presentations.

Tapir: Embedding Fork-Join Parallelism into LLVM’s Intermediate Representation
Let's start by considering code that normalizes a vector.  This code takes 0.3s to run.  Then switch the "for" with a "cilk_for", and the execution time degrades to 180s (w/ 18 cores).  When the compiler sees "cilk_for" or other parallel keywords, it generally converts these into runtime function calls that take a function pointer for the parallel component (similar to thread creation routines taking a function to execute).  With the function call in place, many optimization passes cannot cross the call, whereas previously they could cross the "for".
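
As I recall, the running example looked roughly like the following (reconstructed from my notes: serially, the compiler can hoist the loop-invariant call to norm, but outlining the cilk_for body into a runtime call blocks that hoisting, so the parallel loop does O(n^2) work):

#include <cilk/cilk.h>    // provides cilk_for (Cilk Plus / OpenCilk)
#include <cmath>

double norm(const double *a, int n) {   // O(n), loop-invariant result
  double s = 0;
  for (int i = 0; i < n; ++i) s += a[i] * a[i];
  return std::sqrt(s);
}

void normalize(const double *in, double *out, int n) {
  cilk_for (int i = 0; i < n; ++i)
    out[i] = in[i] / norm(in, n);
}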

Instead, let's propose three new instructions to include in the LLVM IR (detach, reattach, and sync).  Supporting these instructions required approximately 6000 lines of changes.  When the updated LLVM compiles a set of parallel programs, most can now reach 99+% work efficiency, indicating that the parallel overhead is near 0.

Prior work created parallel tasks symmetrically; for example, each task would represent a separate path in the classic "diamond" CFG.  The problem is that the parallel program actually takes both paths concurrently, which is not an expected behavior of control flow.  Instead, Tapir's IR is asymmetric, so compilers can continue to reason about the basic blocks as they would appear in sequential code.

Incremental Whole Program Optimization and Compilation
This covers the feature within Microsoft's Visual Studio compiler.  Each component stores hashes of the components on which it depends.  When a file is changed, it generates different hashes, which the compiler can use to determine which dependencies need to be re-analyzed and code gen'd.  These hash changes then either propagate, if the result changed, or the compilation process completes.
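
A sketch of the idea (identifiers are mine, not Visual Studio's):

#include <cstdint>
#include <string>
#include <unordered_map>

// Recompile a module only if the recorded hash of some dependency
// no longer matches that dependency's current hash.
using Hash = uint64_t;
using HashMap = std::unordered_map<std::string, Hash>;

bool needs_recompile(const HashMap &recorded, const HashMap &current) {
  for (const auto &entry : recorded) {
    auto it = current.find(entry.first);
    if (it == current.end() || it->second != entry.second)
      return true;    // dependency changed: re-analyze and codegen
  }
  return false;       // all hashes match: reuse the previous result
}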

Optimizing Function Placement for Large-Scale Data-Center Applications
The common binaries at Facebook are 10s-100s of MBs in size.  These binaries have IPCs less than 1.0 (recall that processors can run above 2.0, and higher is better) and experience frequent front-end stalls attributable to iTLB and I$ misses (as high as 60 per 1000, eww).  Hardware profilers can determine the hot functions, and this information is processed to decide which hot functions should be clustered together.  These clusters are mapped to separate loader sections that are loaded using huge pages.

Minimizing the Cost of Iterative Compilation with Active Learning
There are too many possibilities for optimization, so let's ask machine learning to figure it out.  The danger is always finding the right level of training to provide valuable insights without overfitting, etc.


Monday, February 6, 2017

Conference Attendance CGO (etc) 2017 - Day 1

I started my conference attendance on Saturday, with the LLVM-Performance workshop at which I presented an extension to my dissertation work.  I received some interesting and useful feedback from the other attendees, and saw additional possibilities for its usage and collaboration.  Now that it is Monday, it is time to attend conference talks.  This evening, I will be acting as an advisor, watching one of my students present our work; we practiced today, so it should go great!

Checking Concurrent Data Structures Under the C/C++11 Memory Model
C/C++11 included additional keywords that allow specifying features of the memory model, as previously covered here.  In order to check data structure implementations, the data structures need to be further annotated so as to describe valid and invalid executions.  For example, is a queue required to always return an element, or can it fail if an element was only recently added?  Using these annotations, the authors were able to find issues in several data structure implementations.

Efficient Abortable-locking Protocol for Multi-level NUMA Systems
The impact of NUMA can be significant: on the largest shared-memory machines, accessing lock data local to an SMT thread versus data at the farthest distance can differ by over 2000x.  To avoid this overhead, a hierarchy of locks is created that mirrors the system's topology, with each level of the hierarchy acting as an MCS-style queue lock.  How then can threads abort from their queue?  In a single level, threads mark their status as aborted and are then skipped when handing off the lock.  In the hierarchy, the requirement of waiting at a specific level is passed along to the appropriate thread, as determined using the lower-level links.
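
A single-level sketch of the abort-and-skip hand-off (my own simplification; a real implementation must also CAS the queue tail when no successor exists):

#include <atomic>

enum Status { WAITING, GRANTED, ABORTED };

struct QNode {
  std::atomic<QNode*> next{nullptr};
  std::atomic<int>    status{WAITING};
};

// Hand the lock to the first successor that has not aborted.
void release(QNode *me) {
  QNode *succ = me->next.load(std::memory_order_acquire);
  while (succ && succ->status.load() == ABORTED)
    succ = succ->next.load(std::memory_order_acquire);
  if (succ)
    succ->status.store(GRANTED, std::memory_order_release);
}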

The implementation was model checked and then run on an HP Superdome with 576 hardware threads.  The results showed that the lock performs best when it respects all 4 levels of the NUMA (and NUCA) hierarchy.

Thread Data Sharing in Cache: Theory and Measurement
In this work, they collected memory access traces using Pin to determine thread data sharing in parallel programs.  In particular, they showed how the different metrics of data sharing can be derived from a single pass over the trace, in time linear in the trace size.  The concern is that trace-based analyses are very slow and could potentially skew the results; and when the results are derived from only one trace, there is additional question about their applicability.