To demonstrate tail duplication, I will take an extremely simple example of returning whether a number is even. (This is just demonstration, an optimizing compiler can reduce this example to two inlined x86 instructions.) In the following code, there are four basic blocks: A, B, C, and D. Block A initializes the value and executes the conditional instruction with control flow to both blocks B and C. Blocks B and C update the value and then fall through to block D, which returns the value updated in either B or C. This control flow is the classic diamond pattern, from the picture it makes in the control flow graph (CFG).
bool isEven(int x)
{
bool returnValue = false; // A
if (x % 2 == 0)
returnValue = true; // B
else
returnValue = false; // C
return returnValue; // D
}
The idea is to duplicate a tail block and then merge each duplicate into its predecessor. This (often) increases the code size, but provides a larger block of instructions for optimization and scheduling and is a vital step in superblock scheduling. In the above example, tail duplication would duplicate block D into D' and then merge D with B and D' with C.
The second thing that you should know is SSA form. This is a way of representing the instructions, such that every value is from a single static assignment (SSA). If every variable is only assigned to once, how do loops work? How did the above example work? A special assignment is added, called a PHI node. A PHI node selects from multiple values depending on which edge in CFG was followed. SSA provides an ease of understanding where a value is created and every place that it is used.
Tail duplication follows two basic steps that LLVM provides support to accomplish:
- Clone the basic block
- Merge cloned blocks with the predecessors
Let's start by calling CloneBasicBlock. The first problem is that the cloned block is a perfect clone, in that it uses all of the old instructions' values. All of them. It is like you move across the country and discover that your lamp is still plugged in at your old place, somehow. Tackling this problem requires the code to fix all of these use-def chains, so that the new uses depend on the new definitions in the cloned block. Another function CloneFunctionInto has a loop that fixes all of the instructions in the cloned function, which will serve here too.
And then the second problem with cloning basic blocks is that the program has introduced new definitions into the program, which may now vie with the existing definitions. There is a simple solution to merging these values, PHI nodes. Of course, these PHI nodes have nothing to do with the old PHI nodes already present. I spent a long time thinking about how I would update the existing PHIs, until I realized that they were all going to be cloned and merged, so I would new a completely new set as the one definition had become many and the many uses had become one.
Next each basic block is to be merged with its one predecessor. LLVM provides MergeBlockIntoPredecessor; however, the IR kept failing in odd ways after using this function. What I discovered is that the basic block was no longer well formed. The cloned block has only one predecessor, but its PHINodes still have multiple predecessors. Therefore, the merge step would take the first value for each PHINode (which would be correct if the block was well formed), but in my case I was in the process of reforming the IR. Therefore, I had to fix the PHINodes to only have one incoming value.
I thought what I would do is iterate through the incoming paths into each PHI instruction and delete any that do not match the single predecessor. Unfortunately, when the incoming value is deleted, the iterators were invalidated (not an uncommon problem when iterating). So in the end, the tail duplication I was writing had to do what the MergeBlockIntoPredecessor already did, select one value for the PHINode and update the uses accordingly.
With that development complete, now I can move on to the interesting question of whether there are cases when doing tail duplication would be a good idea and provide some benefit to my instrumentation.
 
4 comments:
So this was investigating tail duplication in IR rather than later in backend with the TailDuplicator pass. Was there a benefit?
Doing this in the IR provided new opportunities for my instrumentation; however, measurements showed that it only impacted a small fraction of cases. It could be useful, but not for the benchmarks that I was using.
Apparently LLVM used to have tail duplication but it was removed circa 2011 with discussion at the time.
"Remove the old tail duplication pass. It is not used and is unable to update
ssa, so it has to be run really early in the pipeline. Any replacement
should probably use the SSAUpdater."
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20110829/127208.html
Still, I'm surprised that it isn't available for O3 optimizations: optimizations that take longer to perform or that may generate larger code (in an attempt to make the program run faster).
Anyways, Apple added it to their B3 (non LLVM) JIT and claim a >2% win:
https://webkit.org/blog/5852/introducing-the-b3-jit-compiler/
Good post! I totally agree that IR-level tail duplication would help your use case (i.e. the contech project). But I think from a practical and LLVM maintainers' perspective, it just not worth it to maintain it in mid-end, given the fact that it mostly benefits backend optimizations, instruction scheduling to name a few.
Post a Comment