Tuesday, October 24, 2017

PhD Defense - Low-level Concurrent Programming Using the Relaxed Memory Calculus

Today, I went to Michael Sullivan's thesis defense, who passed.  The work was at a delightful intersection of my interests.

We want better (more usable, etc) semantics for low-level operations, those below the std::atomic<> and similar designs.  Perhaps this is achievable with ordering constraints.  Given the following simple example, what constraints are required?

int data, flag;

void send(int msg) {
  data = msg;
  flag = 1;
}

int recv() {
  while (!flag) continue;
  return data;
}

Two constraints: data ("visible") before flag, flag ("executed") before data.  These constraints are explicitly programmer-specified, and that it is contended that this is practical.

rmc::atomic<T> - a variable that can be concurrently accessed
L(label, expr) - labels an expression
VEDGE and XEDGE - specify orders between labeled expressions, effectively V is write visibility ordering and X is execution of read ordering
rmc::push() or PEDGE - Pushes have a total order, and provide orderings between reads and writes which is not possible with just V and X.

In more advanced space, do we need to add constraints to spinlock_lock and spinlock_unlock?  Let's add two special labels: pre, post.  These serve for interface boundaries to denote that everything has executed before this point, or is visible.

Next problem is loop iterations.  Do the constraints need to be within a single iteration or constraining every iteration?  Extend the order specifiers, so in the following, the ordering constraint is just within the iteration, whereas the constraint outside the iteration (without "_HERE") is also between the iterations.

for (i = 0; i < 2; i++) {
  VEDGE_HERE(before, after);
  L(before, x = i);
  L(after, y = i + 10);
}

Code extends LLVM and is on GitHub.  The compiler takes the RMC extensions and puts the appropriate fence instructions into the IR, and then the existing compiler lowers this to assembly.  The compiler uses an SMT solver to determine the minimal set of locations that need the necessary fences (or other instructions).  Then in lowering, the lowering to assembly can better take advantage of the specific constraints required.  Overall, the performance is better than the C++11 model on ARMv7, Power, and comparable on ARMv8.  I suspect that x86's TSO model is not as interesting for finding performance benefits.

Usable / practical - Can Seqlocks Get Along With Programming Language Memory Models?  argues that C++11 would require acquire semantics on unlock.  Here it is stated that RMC is much more straightforward.  Further, students in 15-418 found gains from RMC versus the C11 model.

Other future items include the exploration of whether there are additional consistency instructions that might provide a better nuance for the compiler to inform hardware about required orderings.  Recall that the coarsest grained instruction is the full memory fence.