Tuesday, February 7, 2017

Conference Attendance CGO (etc) 2017 - Day 2

Several of the talks were great and very interesting.  Other talks particularly needed further presentation practice.  Unfortunately, sometimes this may come from English as a second language.  And so I am torn between wanting presenters to have practice and be open to a wider pool of researches, while also wanting to be able to easily understand the presentations.

Tapir: Embedding Fork-Join Parallelism into LLVM’s Intermediate Representation
Let's start by considering code that normalizes a vector.  This code takes 0.3s to run.  Then switch the "for" with a "cilk_for", and the execution time improves to 180s (w/ 18 cores).  When the compiler sees "cilk_for" or other parallel keywords, generally it converts these into runtime function calls that take in a function pointer for the parallel component.  (Similar to thread create routines taking in a function to execute).  With the function call, many optimization passes cannot cross the call, while previously being able to cross the "for".

Instead, let's propose three new instructions to include in the LLVM IR.  Supporting these lines required approximately 6000 lines of changes.  When the updated LLVM compiles a set of parallel programs, most can now reach 99+% work efficiency, which indicates that the parallel overhead is near 0.

Prior work would create parallel tasks symmetrically, for example each task would represent separate paths in the classic "diamond" CFG.  The problem is that the parallel program is actually taking both paths concurrently, which is not an expected behavior of the control flow.  Instead, the IR is asymmetric so that compilers can continue to reason about the basic blocks as a sequential code would appear.

Incremental Whole Program Optimization and Compilation
This covers the feature within Microsoft's Visual Studio compiler.  Each component stores hashes of the components on which it depends.  When a file is changed, it generates different hashes, which the compiler then can use to determine that its dependencies need to be re-analyzed and code gen'd.  These hash changes can then either propagate, if changed, or the compilation process will complete.

Optimizing Function Placement for Large-Scale Data-Center Applications
The common binaries for facebook are 10s-100s MBs in size.  These binaries have IPCs less than 1.0 (recall that processors can run above 2.0 and higher is better), and are experiencing frequent front-end stalls that are attributable to iTLB and I$ misses (as high as 60 per 1000, eww).  Hardware profilers can then determine the hot functions.  This information is then processed to determine the hot functions that should be clustered together.  These clusters are mapped to separate loader sessions that will load them using huge pages.

Minimizing the Cost of Iterative Compilation with Active Learning
There are too many possibilities for optimization.  Let's ask machine learning to figure this out.  The danger is always finding the right level of training to provide valuable insights without overfitting, etc.


No comments: