Wednesday, February 8, 2017
Conference Attendance CGO (etc) 2017 - Day 3
Today is the last day of the conference. I attended several more talks, and my student took 2nd in the student research competition. It has been a worthwhile trip: I received some interesting and valuable feedback on my own work, and found some possible collaborations for the future.
Optimistic Loop Optimization
The compiler wants to optimize loops in the program. However, C and LLVM IR have many complex characteristics that the polyhedral model cannot represent, such as aliasing, wrapping arithmetic, or out-of-bounds accesses. Rather than simply assuming these characteristics are absent, the code can be analyzed to determine which violating behaviors may actually be present. The resulting assumptions are placed in the code and can then be reduced to a set of preconditions under which the optimized loop may execute. Should the assumptions fail at runtime, the code instead branches to the original version. The assumption checks themselves can also be simplified (for example, checking N < 127 also establishes N < 128). For SPEC200x, the assumptions fail about 2% of the time and impose a 4% runtime overhead.
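Here is a minimal, hedged C++ sketch of the versioning scheme as I understood it; the particular loop, the preconditions, and the unrolled "optimized" body are invented for illustration, not taken from the talk.

#include <cstddef>

void scale(float *a, const float *b, std::size_t n)
{
    // Runtime preconditions the optimistic analysis emitted (illustrative):
    // the arrays do not overlap, and n is small enough that the index
    // arithmetic cannot wrap.
    bool assumptionsHold = (a + n <= b || b + n <= a) && n < (1u << 30);

    if (assumptionsHold) {
        // Optimized version, valid only under the assumptions above
        // (a hand-unrolled stand-in for whatever the polyhedral transform did).
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            a[i]     = 2.0f * b[i];
            a[i + 1] = 2.0f * b[i + 1];
            a[i + 2] = 2.0f * b[i + 2];
            a[i + 3] = 2.0f * b[i + 3];
        }
        for (; i < n; ++i)
            a[i] = 2.0f * b[i];
    } else {
        // Original, conservatively compiled loop kept as the fallback.
        for (std::size_t i = 0; i < n; ++i)
            a[i] = 2.0f * b[i];
    }
}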
Software Prefetching for Indirect Memory Accesses
What should we prefetch in software? A[x + N] is easy for hardware, A->next is hard for everyone, while A[B[x + N]] is easy for software to compute and hard for hardware to predict. So given a loop with this indirect structure (such as exists in NAS IS), prefetches can be inserted that will speed up the loop. There are three key points for inserting these prefetch instructions; a sketch follows the list.
- You have to prefetch both the A[B[]] and B[]. Otherwise, the prefetch will block on the B[].
- Having both prefetches requires that they are both offset from the actual access as well as each other. Too close and they are not parallel. Too far and the data is evicted before use.
- The first point implies that there is an actual load (not a prefetch) of B[], which therefore needs to be bounds checked.
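To make the three points concrete, here is a hedged C++ sketch of inserting such prefetches by hand; the array names, the prefetch distance, and the use of the GCC/Clang __builtin_prefetch intrinsic are my own choices rather than details from the talk.

#include <cstddef>

double indirectSum(const double *A, const int *B, std::size_t n)
{
    const std::size_t DIST = 16;                 // prefetch distance: a tuning knob
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        // Prefetch B[] far ahead so its value is ready when needed below.
        if (i + 2 * DIST < n)
            __builtin_prefetch(&B[i + 2 * DIST]);
        // Prefetch A[B[]] closer in; this performs a real load of B[i + DIST],
        // hence the bounds check on the index.
        if (i + DIST < n)
            __builtin_prefetch(&A[B[i + DIST]]);
        sum += A[B[i]];
    }
    return sum;
}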
Tuesday, February 7, 2017
Conference Attendance CGO (etc) 2017 - Day 2
Several of the talks were great and very interesting; others clearly needed further presentation practice. Unfortunately, some of this may stem from English being a second language, so I am torn between wanting the field to be open to a wider pool of researchers and wanting presenters to have enough practice that their presentations are easy to understand.
Tapir: Embedding Fork-Join Parallelism into LLVM’s Intermediate Representation
Let's start by considering code that normalizes a vector. The serial code takes 0.3s to run. Then switch the "for" to a "cilk_for", and the execution time balloons to 180s (with 18 cores). When the compiler sees "cilk_for" or other parallel keywords, it generally converts them into runtime function calls that take a function pointer to the parallel component (similar to thread-creation routines taking a function to execute). Many optimization passes cannot cross that opaque call, whereas previously they could operate across the plain "for".
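As a rough illustration of why the function-pointer lowering hurts, here is a hedged C++ sketch (the runtime entry point, the names, and the recomputed norm are invented, not Cilk's actual API or the talk's exact example): the loop body is outlined into its own function, and work that a serial compiler could have hoisted out of the loop is now trapped behind an opaque call.

#include <cmath>
#include <cstddef>
#include <vector>

// Stand-in "runtime": takes the outlined body as a function pointer.
// (A real parallel runtime would schedule iterations across workers.)
static void parallel_for(std::size_t n, void (*body)(void *, std::size_t), void *ctx)
{
    for (std::size_t i = 0; i < n; ++i)
        body(ctx, i);
}

struct NormalizeCtx {
    const std::vector<double> *in;
    std::vector<double> *out;
};

// Outlined loop body, roughly what a front end generates for the cilk_for body.
static void normalizeBody(void *rawCtx, std::size_t i)
{
    auto *ctx = static_cast<NormalizeCtx *>(rawCtx);
    double norm = 0.0;
    for (double v : *ctx->in)       // in the serial version this computation could be
        norm += v * v;              // hoisted out of the loop; here it is recomputed
    (*ctx->out)[i] = (*ctx->in)[i] / std::sqrt(norm);   // on every iteration
}

void normalize(const std::vector<double> &in, std::vector<double> &out)
{
    NormalizeCtx ctx{&in, &out};
    parallel_for(in.size(), normalizeBody, &ctx);
}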
Instead, the authors propose three new instructions (detach, reattach, and sync) to include in the LLVM IR. Supporting them required approximately 6000 lines of changes. When the updated LLVM compiles a set of parallel programs, most now reach 99+% work efficiency, which indicates that the parallel overhead is near zero.
Prior work created parallel tasks symmetrically; for example, each task would represent a separate path in the classic "diamond" CFG. The problem is that the parallel program actually takes both paths concurrently, which is not an expected behavior of control flow. Instead, Tapir's IR is asymmetric, so that the compiler can continue to reason about the basic blocks as it would for sequential code.
Incremental Whole Program Optimization and Compilation
This covers a feature of Microsoft's Visual Studio compiler. Each component stores hashes of the components on which it depends. When a file is changed, it produces different hashes, which the compiler uses to determine which dependent components need to be re-analyzed and code gen'd. If a hash changes, the invalidation propagates onward to further dependents; otherwise the compilation completes using the previously generated results.
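A much-simplified, hedged sketch of that dependency-hash idea (the component representation, the names, and the FNV-style hash are mine, not Visual Studio's): a component is re-analyzed and re-code-gen'd only when one of its stored dependency hashes no longer matches.

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using Hash = std::uint64_t;

Hash hashOf(const std::string &contents)             // stand-in FNV-1a hash
{
    Hash h = 14695981039346656037ull;
    for (unsigned char c : contents) { h ^= c; h *= 1099511628211ull; }
    return h;
}

struct Component {
    std::vector<std::string> deps;                   // names of dependencies
    std::unordered_map<std::string, Hash> storedDepHashes;
};

bool needsRebuild(const Component &c,
                  const std::unordered_map<std::string, std::string> &currentSources)
{
    for (const std::string &d : c.deps) {
        auto it = c.storedDepHashes.find(d);
        if (it == c.storedDepHashes.end() || it->second != hashOf(currentSources.at(d)))
            return true;                             // a dependency changed
    }
    return false;                                    // hashes match: reuse cached codegen
}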
Optimizing Function Placement for Large-Scale Data-Center Applications
The common binaries at Facebook are 10s-100s of MBs in size. These binaries have IPCs below 1.0 (recall that processors can run above 2.0, and higher is better) and experience frequent front-end stalls attributable to iTLB and I-cache misses (as high as 60 per 1000 instructions, eww). Hardware profilers identify the hot functions, and that profile is processed to determine which hot functions should be clustered together. These clusters are then mapped separately by the loader, which loads them using huge pages.
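As a rough, hedged illustration of the layout idea (greedy and simplified, not Facebook's actual algorithm): hot functions from the profile are packed together so the hot text occupies as few huge pages as possible.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct FuncProfile { std::string name; std::size_t size; std::uint64_t samples; };

constexpr std::size_t kHugePage = 2u * 1024 * 1024;     // 2 MiB huge page

std::vector<std::vector<std::string>> clusterHotFunctions(std::vector<FuncProfile> funcs)
{
    // Hottest functions first, so they end up sharing pages with each other.
    std::sort(funcs.begin(), funcs.end(),
              [](const FuncProfile &a, const FuncProfile &b) { return a.samples > b.samples; });

    std::vector<std::vector<std::string>> clusters(1);
    std::size_t used = 0;
    for (const FuncProfile &f : funcs) {
        if (f.samples == 0)
            break;                                       // cold functions keep the default layout
        if (used + f.size > kHugePage && !clusters.back().empty()) {
            clusters.push_back({});                      // start a new huge page
            used = 0;
        }
        clusters.back().push_back(f.name);
        used += f.size;
    }
    return clusters;
}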
Minimizing the Cost of Iterative Compilation with Active Learning
There are too many possibilities for optimization. Let's ask machine learning to figure this out. The danger is always finding the right level of training to provide valuable insights without overfitting, etc.
Monday, February 6, 2017
Conference Attendance CGO (etc) 2017 - Day 1
I started the conference attendance this time on Saturday, with the LLVM-Performance workshop, at which I presented an extension to my dissertation work. I received some interesting and useful feedback from the other attendees, and also saw additional possibilities for its usage and for collaboration. Now that it is Monday, it is time to attend the conference talks. This evening I will be wearing my advisor hat and watching one of my students present our work, which we practiced today, so it should go great!
Checking Concurrent Data Structures Under the C/C++11 Memory Model
C/C++11 added keywords that allow specifying features of the memory model (covered previously). In order to check data structure implementations, the data structures need to be further annotated so as to describe valid and invalid executions. For example, is a queue required to always return an element, or can it fail if an element was only recently added? Using these annotations, the authors were able to find issues in, and otherwise characterize, the data structures.
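For reference, this is the vocabulary those keywords provide; the data-structure-level annotations in the talk build on top of it. A minimal example of release/acquire pairing: the release store on ready guarantees the plain write to data is visible to the acquiring reader.

#include <atomic>
#include <cassert>
#include <thread>

int data = 0;
std::atomic<bool> ready{false};

void producer()
{
    data = 42;                                          // plain write
    ready.store(true, std::memory_order_release);       // publish
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) { }  // wait for publication
    assert(data == 42);                                 // guaranteed to see the write
}

int main()
{
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}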
Efficient Abortable-locking Protocol for Multi-level NUMA Systems
The impact of NUMA can be significant. On the largest shared-memory machines, accessing lock data that is local to an SMT thread versus data at the farthest distance can differ by over 2000x. To avoid this overhead, a hierarchy of locks is created that mirrors the system's topology, and each level of the hierarchy acts as an MCS-style queue lock. How then can threads abort from their queue? In a single level, threads mark their status as aborted and are then skipped when handing off the lock. In the hierarchy, the obligation to wait at a given level is passed along to the appropriate thread, which can be determined by following the lower-level links.
The implementation was model checked and then run on an HP Superdome with 576 hardware threads. The results showed that the lock implementation performs best when it respects all 4 levels of the NUMA (and NUCA) hierarchy.
Thread Data Sharing in Cache: Theory and Measurement
In this work, they collected memory-access traces using Pin to determine thread data sharing in parallel programs. In particular, they showed how the different data-sharing metrics can be derived from a single pass over the trace, in time linear in the trace size. The concern is that trace-based analysis is very slow and could potentially skew the results; and when the results are derived from only one trace, there is an additional question about their applicability.
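As a hedged sketch of the single-pass flavor (the metric here is a simple shared-cache-line count, not the paper's actual metrics): one linear scan of the trace records which threads touched each cache line, and the sharing summary falls out afterwards.

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Access { int tid; std::uint64_t addr; };          // one trace record

std::size_t sharedLines(const std::vector<Access> &trace)
{
    constexpr std::uint64_t kLine = 64;                   // cache-line size
    std::unordered_map<std::uint64_t, std::uint32_t> touchedBy;  // line -> thread bitmask
    for (const Access &a : trace)                         // single linear pass
        touchedBy[a.addr / kLine] |= (1u << a.tid);       // assumes tid < 32

    std::size_t shared = 0;
    for (const auto &kv : touchedBy)
        if ((kv.second & (kv.second - 1)) != 0)           // more than one thread bit set
            ++shared;
    return shared;
}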
Friday, May 6, 2016
Tail Duplication with LLVM
Let me start by saying that tail duplication exists in LLVM, but only as a backend, machine code optimization. I am going to detail several engineering points that were required to support this operation strictly in the LLVM intermediate representation (IR).
To demonstrate tail duplication, I will take an extremely simple example of returning whether a number is even. (This is just a demonstration; an optimizing compiler can reduce this example to two inlined x86 instructions.) In the following code, there are four basic blocks: A, B, C, and D. Block A initializes the value and executes the conditional instruction with control flow to both blocks B and C. Blocks B and C update the value and then fall through to block D, which returns the value updated in either B or C. This control flow is the classic diamond pattern, named for the picture it makes in the control flow graph (CFG).
bool isEven(int x)
{
    bool returnValue = false;  // A
    if (x % 2 == 0)
        returnValue = true;    // B
    else
        returnValue = false;   // C
    return returnValue;        // D
}
The idea is to duplicate a tail block and then merge each duplicate into its predecessor. This (often) increases the code size, but provides a larger block of instructions for optimization and scheduling and is a vital step in superblock scheduling. In the above example, tail duplication would duplicate block D into D' and then merge D with B and D' with C.
The second thing to know is SSA form. This is a way of representing instructions such that every value comes from a single static assignment (SSA). If every variable is only assigned once, how do loops work? How does the above example work? A special assignment, called a PHI node, is added. A PHI node selects from multiple values depending on which edge in the CFG was followed. SSA makes it easy to see where a value is created and every place it is used.
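To make the PHI node concrete, here is a hedged sketch that builds the isEven diamond directly with LLVM's C++ IRBuilder API (names and structure are mine, written against the public API rather than taken from this pass). Note that in SSA nothing "updates" returnValue: B and C each contribute a value, and the PHI in D picks one based on the incoming edge.

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"

using namespace llvm;

Function *buildIsEven(Module &M)
{
    LLVMContext &Ctx = M.getContext();
    FunctionType *FTy =
        FunctionType::get(Type::getInt1Ty(Ctx), {Type::getInt32Ty(Ctx)}, false);
    Function *F = Function::Create(FTy, Function::ExternalLinkage, "isEven", &M);

    BasicBlock *A = BasicBlock::Create(Ctx, "A", F);
    BasicBlock *B = BasicBlock::Create(Ctx, "B", F);
    BasicBlock *C = BasicBlock::Create(Ctx, "C", F);
    BasicBlock *D = BasicBlock::Create(Ctx, "D", F);

    IRBuilder<> IRB(A);                              // A: test x % 2 == 0
    Value *X = &*F->arg_begin();
    Value *Rem = IRB.CreateSRem(X, IRB.getInt32(2));
    Value *Cond = IRB.CreateICmpEQ(Rem, IRB.getInt32(0));
    IRB.CreateCondBr(Cond, B, C);

    IRB.SetInsertPoint(B);                           // B: the value is true on this edge
    IRB.CreateBr(D);
    IRB.SetInsertPoint(C);                           // C: the value is false on this edge
    IRB.CreateBr(D);

    IRB.SetInsertPoint(D);                           // D: the PHI selects by incoming edge
    PHINode *Ret = IRB.CreatePHI(Type::getInt1Ty(Ctx), 2, "returnValue");
    Ret->addIncoming(IRB.getTrue(), B);
    Ret->addIncoming(IRB.getFalse(), C);
    IRB.CreateRet(Ret);
    return F;
}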
Tail duplication follows two basic steps that LLVM provides support to accomplish:
- Clone the basic block
- Merge cloned blocks with the predecessors
Let's start by calling CloneBasicBlock. The first problem is that the cloned block is a perfect clone, in that it uses all of the old instructions' values. All of them. It is like you move across the country and discover that your lamp is still plugged in at your old place, somehow. Tackling this problem requires the code to fix all of these use-def chains, so that the new uses depend on the new definitions in the cloned block. Another function, CloneFunctionInto, has a loop that fixes all of the instructions in the cloned function, which will serve here too.
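A hedged sketch of that cloning step and the fix-up loop (the helper's name and the flag choices are mine; the remapping mirrors what CloneFunctionInto does internally, and the flag names follow current LLVM):

#include "llvm/IR/BasicBlock.h"
#include "llvm/Transforms/Utils/Cloning.h"
#include "llvm/Transforms/Utils/ValueMapper.h"

using namespace llvm;

BasicBlock *cloneTail(BasicBlock *Tail)
{
    ValueToValueMapTy VMap;
    BasicBlock *Clone = CloneBasicBlock(Tail, VMap, ".dup", Tail->getParent());

    // Rewrite the use-def chains inside the clone: operands that map to cloned
    // values are replaced; operands defined outside the tail are left alone.
    for (Instruction &I : *Clone)
        RemapInstruction(&I, VMap, RF_NoModuleLevelChanges | RF_IgnoreMissingLocals);
    return Clone;
}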
And the second problem with cloning basic blocks is that new definitions have been introduced into the program, which may now conflict with the existing definitions. There is a simple mechanism for merging these values: PHI nodes. Of course, these PHI nodes have nothing to do with the old PHI nodes already present. I spent a long time thinking about how I would update the existing PHIs, until I realized that they were all going to be cloned and merged anyway, so I would need a completely new set, as the one definition had become many and the many uses had become one.
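Here is a hedged sketch of creating those new PHI nodes. It assumes, purely for simplicity, that the tail has a single successor whose only predecessors will be the original tail and its clone; the helper's name is mine, and any dead PHIs it creates would be cleaned up later.

#include "llvm/IR/Instructions.h"
#include "llvm/Transforms/Utils/ValueMapper.h"

using namespace llvm;

void mergeClonedDefs(BasicBlock *Tail, BasicBlock *Clone,
                     BasicBlock *Succ, ValueToValueMapTy &VMap)
{
    for (Instruction &I : *Tail) {
        Value *Dup = VMap.lookup(&I);
        if (!Dup)
            continue;
        // One definition became two; a PHI in the successor makes it one again.
        PHINode *PN = PHINode::Create(I.getType(), 2, I.getName() + ".merge",
                                      &*Succ->begin());
        // Redirect uses outside the original tail to the new PHI first, then
        // wire up the PHI's own incoming values.
        I.replaceUsesOutsideBlock(PN, Tail);
        PN->addIncoming(&I, Tail);
        PN->addIncoming(Dup, Clone);
    }
}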
Next each basic block is to be merged with its one predecessor. LLVM provides MergeBlockIntoPredecessor; however, the IR kept failing in odd ways after using this function. What I discovered is that the basic block was no longer well formed. The cloned block has only one predecessor, but its PHINodes still have multiple predecessors. Therefore, the merge step would take the first value for each PHINode (which would be correct if the block was well formed), but in my case I was in the process of reforming the IR. Therefore, I had to fix the PHINodes to only have one incoming value.
I thought I would iterate through the incoming paths of each PHI instruction and delete any that did not match the single predecessor. Unfortunately, when an incoming value is deleted, the iterators are invalidated (not an uncommon problem when iterating). So in the end, the tail duplication I was writing had to do what MergeBlockIntoPredecessor already does: select one value for the PHINode and update the uses accordingly.
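For completeness, the invalidation can also be sidestepped by always operating on the block's first instruction, so no iterator has to survive a deletion. A hedged sketch of that single-predecessor PHI clean-up (the helper's name is mine):

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Collapse every PHI in a block that now has exactly one predecessor.
void collapsePHIs(BasicBlock *BB, BasicBlock *OnlyPred)
{
    // PHIs sit at the top of the block; keep taking the front instruction
    // until it is no longer a PHI.
    while (auto *PN = dyn_cast<PHINode>(&BB->front())) {
        PN->replaceAllUsesWith(PN->getIncomingValueForBlock(OnlyPred));
        PN->eraseFromParent();
    }
}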
With that development complete, now I can move on to the interesting question of whether there are cases when doing tail duplication would be a good idea and provide some benefit to my instrumentation.