A recent ACM article, C is Not a Low-Level Language, argues that for all of our impressions that C is close to hardware, it is not actually "low-level". The argument is as follows: C was written for the PDP-11 architecture, and at the time it was a low-level language. As architectures have evolved, C's machine model has diverged from the hardware, which has forced processor designs to add new features to attain good performance with C, such as superscalar execution for instruction-level parallelism (ILP) and extensive branch prediction.
Processors must also maintain caches to hide memory latency, which requires significant logic to keep the caches coherent and to preserve the illusion that memory is shared between the threads of a process. Furthermore, the compiler is also called upon to find optimization opportunities that may be unsound and that certainly require programmer-years to implement.
The author repeatedly contrasts CPUs with GPUs, while noting that GPUs suit only very specific problems, or come "at the expense of requiring explicitly parallel programs". If you were not keeping track, a GPU requires thousands of threads to match a CPU's performance. The author writes, "A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model." That generally sounds like a GPU.
I appreciate the callouts to C's shortcomings, which it certainly has. The notion that C has driven processor design is odd, yet it does reflect the fact that processors are designed to run current programs fast. And with those programs written in either C or a language built on C, many programs are forced into particular patterns. I even spent some time during my PhD studies considering a version of this problem: how do you design a new "widget" for the architecture when no programs are written to take advantage of widgets?
All that to say, I think C is a low-level language, and while its distance from hardware may be growing, there is nothing else beneath it. That gap needs to be addressed, and by a language with explicit support for parallelism.
Wednesday, May 16, 2018
Review: Lessons from Building Static Analysis Tools at Google
The Communications of the ACM recently ran several development articles, and I found the one on static analysis tools at Google particularly interesting. The article works through how Google integrated static analysis tools into every developer's workflow. First, the tools have to be in the workflow, or developers will "forget" to use them. The second problem is ensuring that the feedback is useful: each developer marks reported items as either useful or incorrect, and if a tool exceeds a 10% false-positive rate, it is temporarily disabled until that tool's developers can fix the flagged issues. The third issue is that some analyses are expensive; depending on the type of static analysis, the time required may be significant, so the tools are classified into two camps: run on each compile, or run on each code review / commit. It is also important that some tools can be temporarily disabled, so that during debugging or refactoring the code may temporarily mutate into an "unsafe" state to simplify the process.
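For a flavor of what such tools flag (a generic C++ example of my own, not one of Google's checks), consider a use-after-move, which clang-tidy and similar analyzers report at review time:

#include <memory>
#include <vector>

void consume(std::unique_ptr<std::vector<int>> v) { (void)v; }  // takes ownership

void buggy() {
    auto v = std::make_unique<std::vector<int>>(10);
    consume(std::move(v));
    v->push_back(1);   // flagged: 'v' is used after being moved from
}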
Personally, I am glad that they are integrating analysis tools into the development workflow. Much work has been done to find bugs and issues within source code, so it is good that these analyses can be utilized regularly to improve code quality.
(As a note, I have never worked for Google, so I can only write based on the ACM article and not from personal experience.)
Friday, April 27, 2018
Thesis Defense: Systems Support for Intermittent Computing
Today I attended Alexei Colin's thesis defense titled, Systems Support for Intermittent Computing.
For small, embedded devices, batteries are expensive or impractical, so energy is harvested instead from RF, light, temperature gradients, motion, et cetera. In such a device, the direct energy source is insufficient to power it continuously, so a small capacitor (or other storage medium) accumulates energy until the device can be powered for a short time. The discharge provides an intermittent period of execution before the supply drops below the threshold required for execution. Programs can be annotated with latches or other progress points, so that after a power failure execution resumes at the most recent such point once power is again available.
To model the computation, the program is decomposed into tasks, where each task can only transfer control to other tasks but contains arbitrary code. Tasks communicate through channels, and the channels provide the memory model: any internal updates within a task are ultimately exposed via the channels. However, while this model reduces the overhead required to execute the tasks, it requires a greater quantity of non-volatile memory.
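To make the model concrete, here is a small sketch in the spirit of tasks and channels; the types and helpers are hypothetical stand-ins of my own, not the system's actual API:

#include <cstdio>
#include <cstdint>

struct Channel { std::uint32_t value; };              // would live in non-volatile memory

static Channel ch_count{0};

using Task = void (*)();
static Task next_task;                                // the "latch": also kept in non-volatile memory

static void transition_to(Task t) { next_task = t; }  // commit the transition to the next task

static void task_report();                            // forward declaration

static void task_sense() {
    std::uint32_t sample = 42;                        // stand-in for reading a sensor
    ch_count.value = sample;                          // updates become visible only through the channel
    transition_to(task_report);                       // control transfers only between tasks
}

static void task_report() {
    std::printf("sample = %u\n", (unsigned)ch_count.value);
    transition_to(task_sense);
}

int main() {
    next_task = task_sense;
    for (int i = 0; i < 4; ++i)                       // the real runtime loops until power fails and
        next_task();                                  // resumes at next_task once energy is available
}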
How do we then derive the tasks, and thus the latches? Given a control flow graph (CFG), task boundaries need to be inserted between specific basic blocks of the graph. The compiler can be extended to model (or to consume modeled results for) the energy requirements of each block and thereby estimate which path segments will have sufficient energy to execute to completion. Each block actually has not a single energy cost but a PDF over costs, owing to possible microarchitectural effects. The model combines these PDFs and computes the corresponding CDF to determine the probability that a given path will execute successfully with a specific amount of stored energy. Note that each task boundary imposes overhead in time, and therefore energy, so we want the set of task boundaries that minimizes overhead while also accounting for the energy wasted when a task fails. This compiler pass produces better task decompositions than the manual programmer annotations of prior work.
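As a toy illustration of that estimate (entirely my own sketch, using discretized energy bins rather than whatever representation the thesis uses), the per-block PDFs along a path can be convolved and the resulting CDF read off at the stored-energy budget:

#include <cstddef>
#include <vector>

std::vector<double> convolve(const std::vector<double>& a, const std::vector<double>& b) {
    std::vector<double> c(a.size() + b.size() - 1, 0.0);
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            c[i + j] += a[i] * b[j];
    return c;
}

double successProbability(const std::vector<std::vector<double>>& blockPdfs, std::size_t budgetBins) {
    std::vector<double> pathPdf{1.0};                 // PDF of "zero energy consumed"
    for (const auto& pdf : blockPdfs)
        pathPdf = convolve(pathPdf, pdf);             // a path's energy is the sum over its blocks
    double cdf = 0.0;
    for (std::size_t e = 0; e < pathPdf.size() && e <= budgetBins; ++e)
        cdf += pathPdf[e];                            // CDF evaluated at the stored-energy budget
    return cdf;
}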
There are other system-support issues. The system should also be able to dynamically select how much stored energy is necessary for a task's execution, which first requires splitting the energy storage device into multiple banks in hardware. Debugging issues in such a system is also difficult, particularly when the device is expected to "fail" regularly, so a new debugger was built that combines the program state of traditional debuggers while still allowing the device to operate intermittently. Such devices will also need intermittent-aware networking stacks, and will then be built into a larger IoT hierarchy.
In conclusion, energy-harvesting embedded computers will form the edge of the IoT hierarchy, and this system stack will form the basis for supporting them.
Tuesday, October 24, 2017
PhD Defense - Low-level Concurrent Programming Using the Relaxed Memory Calculus
Today I went to Michael Sullivan's thesis defense, which he passed. The work sits at a delightful intersection of my interests.
We want better (more usable, etc) semantics for low-level operations, those below the std::atomic<> and similar designs. Perhaps this is achievable with ordering constraints. Given the following simple example, what constraints are required?
int data, flag;

void send(int msg) {
    data = msg;
    flag = 1;
}

int recv() {
    while (!flag) continue;
    return data;
}
Two constraints are needed: the write to data must be visible ("V") before the write to flag, and the read of flag must be executed ("X") before the read of data. These constraints are explicitly programmer-specified, and it is contended that this is practical. (A sketch of the annotated example appears after the list of primitives below.)
rmc::atomic<T> - a variable that can be concurrently accessed
L(label, expr) - labels an expression
VEDGE and XEDGE - specify orderings between labeled expressions: effectively, V orders the visibility of writes and X orders the execution of reads
rmc::push() or PEDGE - Pushes have a total order, and provide orderings between reads and writes which is not possible with just V and X.
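Putting the pieces together, the earlier send/recv example annotated with these primitives looks roughly like this (my reconstruction of the syntax from the talk, so details may be off):

// requires the RMC header and compiler support
rmc::atomic<int> data, flag;

void send(int msg) {
    VEDGE(wdata, wflag);               // the write to data is visible before the write to flag
    L(wdata, data = msg);
    L(wflag, flag = 1);
}

int recv() {
    XEDGE(rflag, rdata);               // the read of flag executes before the read of data
    while (!L(rflag, flag)) continue;
    return L(rdata, data);
}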
In a more advanced setting, do we need to add constraints to spinlock_lock and spinlock_unlock? Two special labels are added: pre and post. These serve as interface boundaries, denoting that everything before this point has executed, or that everything before it is visible.
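As an illustration of how pre and post might be used (my own sketch, not code from the talk, and assuming rmc::atomic provides an exchange operation like std::atomic does), a lock release wants everything before it to be visible before the releasing store, and an acquire wants the successful read to execute before everything after it:

rmc::atomic<int> lock_flag;

void spinlock_lock() {
    XEDGE(acquire, post);                        // the acquiring read executes before everything after
    while (L(acquire, lock_flag.exchange(1)) == 1)
        continue;
}

void spinlock_unlock() {
    VEDGE(pre, release);                         // everything before is visible before the releasing store
    L(release, lock_flag = 0);
}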
The next problem is loop iterations. Do the constraints need to hold within a single iteration, or constrain every iteration? The order specifiers are extended: in the following, the ordering constraint applies just within an iteration, whereas the same constraint without the "_HERE" suffix would also apply between iterations.
for (i = 0; i < 2; i++) {
    VEDGE_HERE(before, after);
    L(before, x = i);
    L(after, y = i + 10);
}
The code extends LLVM and is on GitHub. The compiler takes the RMC extensions and places the appropriate fence instructions into the IR, and the existing backend then lowers this to assembly. An SMT solver determines the minimal set of locations that need the necessary fences (or other instructions), and the lowering to assembly can then better take advantage of the specific constraints required. Overall, performance is better than the C++11 model on ARMv7 and POWER, and comparable on ARMv8. I suspect that x86's TSO model leaves less room for finding performance benefits.
On usability and practicality: Can Seqlocks Get Along With Programming Language Memory Models? argues that C++11 would require acquire semantics on unlock, whereas it is claimed here that RMC is much more straightforward. Further, students in 15-418 found gains from RMC versus the C11 model.
Other future work includes exploring whether additional consistency instructions might give the compiler a more nuanced way to inform the hardware of the required orderings; recall that the coarsest-grained such instruction is the full memory fence.
Monday, March 13, 2017
Book Review: Optimized C++: Proven Techniques for Heightened Performance
I have spent significant time on performance issues and have been searching for a book that summarizes the diversity of issues and techniques well. I hoped that Optimized C++: Proven Techniques for Heightened Performance would provide some of the guidance I want.
This book is not quite it. There is good material here, yet I repeatedly found myself thinking that the author was not aware of the past ten or so years of changes to the field. It is not that the book dates from the early 2000s; it was published last year.
A key step in improving a program's performance is measuring it. There are a variety of techniques for doing so: tools based on instrumentation and tools based on sampling profilers. I find greater value in the sampling profilers (for measuring performance) due to their lower overhead and their ability to pinpoint where within a function the cost lies. Yet the book's focus is limited to gprof-esque approaches. I tell students that such an approach is best suited to deep call trees, which may be a greater issue for C++ programming specifically.
The author is somewhat dismissive of compiler optimizations and emphasizes that the benefit he has observed is largely limited to function inlining. There are many more optimizations, and you should care about them. But again, I wonder whether his experience of C++ has been deep call trees that particularly benefit from inlining.
In a take-it-or-leave-it recommendation, this work also discourages the use of dynamic libraries. Yes, they impose a performance penalty, but they also provide valuable functionality. Whether you should statically or dynamically link depends on your use case. Code that is reused by separate executables should be in a dynamic library, as this reduces the memory requirements when running and reduces the effort to patch and update those executables. Components used by only a single executable should be statically linked, unless the components are of significant size such that decoupling still benefits memory usage and the update process.
The author relates that replacing printf with puts to print a plain string has performance advantages, since printf is a complicated "God function". The basic point is valid, as printf carries significant functionality; however, the anecdote should be taken with a grain of salt, since current compilers perform this replacement (printf to puts) automatically.
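For example, with optimizations enabled, both GCC and Clang turn a format-free, newline-terminated printf into a call to puts:

#include <cstdio>

int main() {
    // Compiled with optimizations, GCC and Clang emit puts("Hello, world")
    // for this call, because the format string has no conversion specifiers
    // and ends in a newline.
    std::printf("Hello, world\n");
    return 0;
}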
While most of the book works through small examples, the final chapters on concurrency and memory management do not. The concurrency chapter reads like a reference, listing the various APIs available and what each does. It would be better for the book to assume that readers are familiar with these calls (as the author does with many other topics) and discuss possible optimizations within that scope.
To conclude, the book is not bad, but I also cannot say it is accurate on every point. Especially with performance, programmers are apt to make quick design decisions based on "their experience" or "recent publications". Measure your code's performance; only then can you discern which techniques will provide value.
Wednesday, February 8, 2017
Conference Attendance CGO (etc) 2017 - Day 3
Today was the last day of the conference. I attended several more talks, and my student took 2nd in the student research competition. So it has been a good trip: I received some interesting and valuable feedback on my own work, and found some possible collaborations for the future.
Optimistic Loop Optimization
The compiler wants to optimize the loops in a program. However, C and LLVM IR have many complex characteristics that the polyhedral model cannot represent, such as aliasing, wrapping, or out-of-bounds accesses. Rather than just assuming that these characteristics are absent, the code can be analyzed to determine which violating characteristics may be present. These assumptions are placed in the code and reduced to the set of preconditions under which the optimized loop may execute; should the assumptions fail at runtime, the code branches to the original version instead. The assumptions themselves can also be optimized (for example, N < 127 implies N < 128). For SPEC200x, the assumptions fail about 2% of the time and impose 4% runtime overhead.
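A hand-written sketch of the versioning idea (my own, not the actual implementation from the talk), with the aggressive version guarded by a runtime precondition check:

void scale(float *a, const float *b, int n) {
    bool noAlias = (a + n <= b) || (b + n <= a);   // runtime precondition: the arrays do not overlap
    if (noAlias) {
        for (int i = 0; i < n; ++i)                // optimized version: safe to vectorize
            a[i] = 2.0f * b[i];
    } else {
        for (int i = 0; i < n; ++i)                // fall back to the original, conservative loop
            a[i] = 2.0f * b[i];
    }
}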
Software Prefetching for Indirect Memory Accesses
What should we prefetch in software? A[x + N] is easy for hardware, A->next is hard for everyone, while A[B[x + N]] is easy for software and hard for hardware to predict. So given a loop with this indirect structure (such as exists in the NAS IS benchmark), prefetches can be inserted that speed up the loop. Yet there are three key points when inserting these prefetch instructions; a sketch follows the list.
- You have to prefetch both the A[B[]] and B[]. Otherwise, the prefetch will block on the B[].
- Having both prefetches requires that they are both offset from the actual access as well as each other. Too close and they are not parallel. Too far and the data is evicted before use.
- The first point means there is an actual load (not a prefetch) of B[], which therefore needs to be bounds checked.
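A sketch of what the inserted prefetches look like, using GCC/Clang's __builtin_prefetch (the distances here are illustrative, not tuned values from the paper):

void gather_add(const double *A, const int *B, double *out, int n) {
    const int D1 = 64;                               // look-ahead for the index array B[]
    const int D2 = 32;                               // shorter look-ahead for the indirect target A[B[]]
    for (int i = 0; i < n; ++i) {
        if (i + D1 < n)
            __builtin_prefetch(&B[i + D1]);          // prefetch the future index
        if (i + D2 < n)                              // bounds check: B[i + D2] is a real load, not a prefetch
            __builtin_prefetch(&A[B[i + D2]]);       // prefetch the future indirect access
        out[i] += A[B[i]];
    }
}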
Tuesday, February 7, 2017
Conference Attendance CGO (etc) 2017 - Day 2
Several of the talks were great and very interesting. Other talks clearly needed further presentation practice, and unfortunately this sometimes comes down to English being a second language. So I am torn between wanting presenters to have more practice, wanting the field to be open to a wider pool of researchers, and wanting to be able to easily understand the presentations.
Tapir: Embedding Fork-Join Parallelism into LLVM’s Intermediate Representation
Let's start by considering code that normalizes a vector. This code takes 0.3s to run serially. Then switch the "for" to a "cilk_for", and the execution time balloons to 180s (with 18 cores). When the compiler sees "cilk_for" or other parallel keywords, it generally converts them into runtime function calls that take a function pointer for the parallel body (similar to thread-creation routines taking a function to execute). With the function call in place, many optimization passes cannot cross the call, whereas previously they could cross the "for".
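A sketch of the kind of loop in question (my reconstruction, requiring a Cilk-enabled compiler such as OpenCilk; the exact code from the talk is assumed):

#include <cilk/cilk.h>
#include <cmath>

static double norm(const double *a, int n) {   // O(n) reduction, loop-invariant below
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += a[i] * a[i];
    return std::sqrt(s);
}

void normalize(double *out, const double *in, int n) {
    cilk_for (int i = 0; i < n; ++i)
        // A serial compiler hoists the invariant norm(in, n) out of the loop;
        // once the cilk_for body is outlined into a runtime call, it cannot,
        // turning O(n) work into O(n^2).
        out[i] = in[i] / norm(in, n);
}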
Instead, the talk proposes three new instructions to include in the LLVM IR. Supporting them required approximately 6,000 lines of changes. When the updated LLVM compiles a set of parallel programs, most now reach 99+% work efficiency, which indicates that the parallel overhead is near zero.
Prior work created parallel tasks symmetrically; for example, each task would represent a separate path in the classic "diamond" CFG. The problem is that the parallel program actually takes both paths concurrently, which is not an expected behavior of control flow. Instead, this IR is asymmetric, so that compilers can continue to reason about the basic blocks as they would for sequential code.
Incremental Whole Program Optimization and Compilation
This talk covers a feature of Microsoft's Visual Studio compiler. Each component stores hashes of the components on which it depends. When a file changes, it generates different hashes, which the compiler uses to determine which dependents need to be re-analyzed and code-generated. These hash changes then either propagate further, if the regenerated output also changed, or the compilation process completes.
Optimizing Function Placement for Large-Scale Data-Center Applications
The common binaries at Facebook are tens to hundreds of megabytes in size. These binaries have IPCs below 1.0 (recall that processors can run above 2.0, and higher is better) and experience frequent front-end stalls attributable to iTLB and I-cache misses (as high as 60 per 1000 instructions, eww). Hardware profilers identify the hot functions, and this information is processed to determine which hot functions should be clustered together. The clusters are mapped to separate loader segments that are loaded using huge pages.
Minimizing the Cost of Iterative Compilation with Active Learning
There are too many possibilities for optimization, so let's ask machine learning to figure it out. The danger is always finding the right amount of training to provide valuable insights without overfitting, etc.
Friday, May 6, 2016
Tail Duplication with LLVM
Let me start by saying that tail duplication already exists in LLVM, but only as a backend, machine-code optimization. I am going to detail several engineering points that were required to support this operation purely in the LLVM intermediate representation (IR).
To demonstrate tail duplication, I will take an extremely simple example of returning whether a number is even. (This is just a demonstration; an optimizing compiler can reduce this example to two inlined x86 instructions.) In the following code, there are four basic blocks: A, B, C, and D. Block A initializes the value and executes the conditional instruction with control flow to both blocks B and C. Blocks B and C update the value and then fall through to block D, which returns the value updated in either B or C. This control flow is the classic diamond pattern, named for the picture it makes in the control flow graph (CFG).
bool isEven(int x)
{
    bool returnValue = false; // A
    if (x % 2 == 0)
        returnValue = true;   // B
    else
        returnValue = false;  // C
    return returnValue;       // D
}
The idea is to duplicate a tail block and then merge each duplicate into its predecessor. This (often) increases the code size, but provides a larger block of instructions for optimization and scheduling and is a vital step in superblock scheduling. In the above example, tail duplication would duplicate block D into D' and then merge D with B and D' with C.
The second thing that you should know is SSA form. This is a way of representing the instructions such that every value comes from a single static assignment (SSA). If every variable is only assigned once, how do loops work? How did the above example work? A special assignment, called a PHI node, is added. A PHI node selects from multiple values depending on which edge of the CFG was followed. SSA makes it easy to see where a value is created and every place it is used.
Tail duplication follows two basic steps that LLVM provides support to accomplish:
- Clone the basic block
- Merge cloned blocks with the predecessors
Let's start by calling CloneBasicBlock. The first problem is that the cloned block is a perfect clone, in that it uses all of the old instructions' values. All of them. It is like moving across the country and discovering that your lamp is somehow still plugged in at your old place. Tackling this problem requires fixing all of these use-def chains, so that the new uses depend on the new definitions in the cloned block. Another function, CloneFunctionInto, has a loop that fixes all of the instructions in the cloned function, which serves here too.
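A minimal sketch of that step with the LLVM C++ API (details vary by LLVM version):

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Transforms/Utils/Cloning.h"
#include "llvm/Transforms/Utils/ValueMapper.h"
using namespace llvm;

static BasicBlock *cloneTail(BasicBlock *Tail, Function *F) {
    ValueToValueMapTy VMap;
    BasicBlock *Clone = CloneBasicBlock(Tail, VMap, ".dup", F);
    // Fix the use-def chains: every use of an old value inside the clone
    // must refer to the cloned value instead (CloneFunctionInto performs
    // the same remapping across a whole function).
    for (Instruction &I : *Clone)
        RemapInstruction(&I, VMap, RF_IgnoreMissingLocals);
    return Clone;
}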
The second problem with cloning basic blocks is that new definitions have been introduced into the program, which may now vie with the existing definitions. There is a simple mechanism for merging these values: PHI nodes. Of course, these PHI nodes have nothing to do with the old PHI nodes already present. I spent a long time thinking about how I would update the existing PHIs, until I realized that they were all going to be cloned and merged anyway, so I would create a completely new set, as the one definition had become many and the many uses had become one.
Next, each basic block is merged with its single predecessor. LLVM provides MergeBlockIntoPredecessor; however, the IR kept failing in odd ways after using this function. What I discovered is that the basic block was no longer well formed: the cloned block has only one predecessor, but its PHI nodes still list multiple predecessors. The merge step takes the first value for each PHI node (which would be correct if the block were well formed), but in my case I was still in the process of reforming the IR. Therefore, I had to fix the PHI nodes to have only one incoming value.
I thought I would iterate through the incoming paths of each PHI instruction and delete any that do not match the single predecessor. Unfortunately, when an incoming value is deleted, the iterators are invalidated (not an uncommon problem when iterating). So in the end, the tail duplication I was writing had to do what MergeBlockIntoPredecessor already does: select one value for the PHI node and update the uses accordingly.
With that development complete, now I can move on to the interesting question of whether there are cases when doing tail duplication would be a good idea and provide some benefit to my instrumentation.
Monday, February 23, 2015
Compilers and Optimizations, should you care?
Compiler optimizations matter. Once, while helping a fellow Ph.D. student improve a simulator's performance, I did two things: first, I replaced an expensive data structure with a more efficient one; second, I turned on compiler optimizations. Together, these made the simulator run 100x faster.
A question posted on Stack Exchange asked, "Why are there so few C compilers?" The main answer pointed out that any C compiler needs to be optimizing. Lots of optimizations occur on every compilation, each gaining a tiny increment of performance. While I enjoy discussing them in detail, I generally wave my hands and say that they are good, yet make debugging difficult. These optimizations are lumped together as the "optimization level".
In "What Every Programmer Should Know About Compiler Optimizations", we revisit optimizations. First, the compiler is no panacea and cannot correct for inefficient algorithms or poor data structure choices (although I am party to research on the later). The article then suggests four points to assist the compiler in its efforts at optimizing the code.
"Write understandable, maintainable code." Please do this! Usually, the expensive resource is the programmer. So the first optimization step is to improve the programmer's efficiency with the source code. Remember Review: Performance Anti-patterns and do not start optimizing the code until you know what is slow.
"Use compiler directives." Scary. Excepting the inline directive, I have used these less than a half dozen times in almost as many years of performance work. Furthermore, the example of changing the calling convention is less relevant in 64-bit space where most conventions have been made irrelevant.
"Use compiler-intrinsic functions." (see Compiler Intrinsics the Secret Sauce) This can often dovetail with the first point by removing ugly bit twiddling code and putting in clean function calls.
"Use profile-guided optimization (PGO)." This optimization is based on the dynamic behavior of the program. Meaning that if you take a profile of the program doing X, and later the program does Y; executing Y can be slower. The key is picking good, representative examples of the program's execution.
So you have dialed up the optimization level and written understandable code sprinkled with intrinsics. Now what? The next step (with which I agree) is to use link-time optimization (LTO) / link-time code generation (LTCG). This flag delays many optimizations until the entire program is available to be linked. One of the principles of software performance is that the more of the program is available, the better it can be optimized (this principle also applies in computer architecture). Thus, by delaying many optimizations until the entire program is present, the linker can find additional and better opportunities than were available in the individual components.
The article notes, "The only reason not to use LTCG is when you want to distribute the resulting object and library files." And alas, I have fought several battles to overcome this point, as my work requires the use of LTO. Perhaps in the next decade, LTO will be standard.
Tuesday, September 16, 2014
Atomic Weapons in Programming
In parallel programming, the use of locks is good enough for the application most of the time. And when it is not, you may need to resort to atomic weapons. While I can and have happily written my own lock implementations, it is like the story of the lawyer who remodels his kitchen himself: it is not a good use of the lawyer's time unless he enjoys it.
That said, I have had to use atomic weapons against a compiler. The compiler happily reordered several memory operations in an unsafe way. Using fence instructions, I was able to prevent this reordering, while not seeing fences in the resulting assembly. I still wonder if there was some information I was not providing.
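In C++11 terms, the closest standard tool for that situation is a compiler-only fence, which constrains the compiler's reordering yet typically emits no fence instruction (whether this is exactly what I used back then, I will not claim):

#include <atomic>

int payload;
std::atomic<int> ready;

void publish(int v) {
    payload = v;
    // Compiler-only barrier: the store to 'payload' may not be reordered
    // after the store to 'ready' by the compiler, yet no fence instruction
    // is emitted on typical targets.
    std::atomic_signal_fence(std::memory_order_release);
    ready.store(1, std::memory_order_relaxed);
}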
Regardless, the weapons are useful! And I can thank the following presentation for pointing me to the particular weapon that was needed: Atomic Weapons. I have reviewed earlier work by Herb Sutter and he continues to garner my respect (not that he is aware), and I suggest any low-level programmer be aware of the tools that are available, as well as the gremlins that lurk in these depths and might necessitate appropriate weaponry.
Tuesday, April 16, 2013
When whitespace matters
In the process of grading student programming projects, I discovered that whitespace can matter in how C code is compiled. Some programmers may be aware that strings can be concatenated together:
const char* t = "this string" " is one";
This works because adjacent string literals are concatenated and the whitespace between them is ignored. Furthermore, programmers may encode strings via macros and replace the instance with the macro:
#define A_STRING " is one"
Now, with the C++11 support in recent versions of g++, the token following a string literal can be treated as either a macro or a user-defined literal. The compiler makes the determination based on whether there is whitespace.
const char* t = "this string"A_STRING; // compiler error
const char* t = "this string" A_STRING; // expected behavior
I like consistent use of whitespace in code for stylistic reasons, but introducing a singular dependency on whitespace is odd for C/C++, in contrast to Python, which is highly whitespace dependent.
Friday, March 2, 2012
Usage of NULL
I read an argument recently about whether a test for NULL should be:
if (ptr)
- or -
if (ptr != NULL)
Do you know that the assembly for these two sequences is the same? So the first has not optimized the running time of the program; it just saves eight characters, which mattered long ago, but now we have plenty of disk space. So always write the test for NULL by actually testing against NULL, or I'll take off style points the next time I'm grading. (If you are writing one-use code that no one will ever see, go ahead.)
Separately, the operating system provides explicit support for NULL. It sets up every process's virtual memory so that virtual addresses 0x0 - 0x3ff (and possibly more) have no mapping so any access to NULL will never be valid.
Wednesday, February 9, 2011
Compiler Intrinsics - The Secret Sauce
Compiler intrinsics are one of the things that C programmers are often unaware of. Sometimes they know that such functionality should exist but suspect it is cloaked in assembly. Other times, the intrinsic does something otherwise unimagined by the developer. For this article, I will confine my notation and functionality to what Visual Studio provides (see MSDN); however, I have every expectation that other platforms and compilers operate in a similar manner.
An intrinsic provides two things to a programmer. First, compiler-supported functionality that might otherwise be unavailable. Second, and related, the compiler understands what each intrinsic does. The intrinsics come in three flavors: obscured functionality, replicated functionality, and compiler insights. In discussing these flavors, what they provide will become clearer.
Architectures provide many instructions that are not immediately accessible to the average programmer. A simple example is counting the set bits in a variable. Such functionality is quite simple for hardware and already provided by the processor, yet I am aware of no language with notation for the operation. Instead, a programmer can invoke __popcnt(var) and be done. Invoking this intrinsic provides two benefits: first performance, and second correctness. On Stack Overflow, the equivalent C implementation is suggested as:
int NumberOfSetBits(int i)
{
    i = i - ((i >> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
    return ((i + (i >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
}
Replicated functionality steps up one level in complexity. These operations may not have simple assembly equivalents, yet there still may be advantages from using the intrinsics. An example is memcpy (in a prior job, I worked extensively to improve the implementation of memcpy so this is particularly dear). There is a version of memcpy in the C runtime library, which is a good general purpose implementation. Yet, if the program is not copying variable length data, perhaps inlining the call can provide benefit. Beyond just avoiding the function call, having a fixed copy size also enables the compiler to generate instructions without the loops, etc that are in the memcpy implementation. By utilizing the memcpy intrinsic, the compiler can do all of this, whereas without the intrinsic the linker is limited to linking in the function from the separate library.
Many times I've heard someone cite the halting problem and correctly assert that a compiler cannot fully intuit the operation of a program. Beyond this, even limited analysis of source code is a difficult problem. To this end, there exist a limited set of intrinsics that provide the compiler with further information regarding program state. In particular, there is __assume. This intrinsic posits that some expression is true, which can result in further optimization of the subsequent code. It is recommended that this intrinsic is paired with asserts, as follows.
#ifdef DEBUG
#define INVARIANT(e) assert(e)    // debug builds: verify the condition at runtime
#else
#define INVARIANT(e) __assume(e)  // release builds: let the optimizer rely on it
#endif
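A small, hypothetical sketch of the payoff, using the INVARIANT macro above (the function is mine, and whether a particular compiler exploits the hint is not guaranteed):
unsigned round_down(unsigned n, unsigned align)
{
    INVARIANT(align == 64);    // debug: checked by assert; release: promised via __assume
    return n - (n % align);    // with the promise, n % align may be reduced to n & 63
}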
Compiler intrinsics provide the programmer the opportunity to improve application performance, both by simplifying algorithms with better hardware support and by giving the compiler a greater understanding of the code's operation.
EDIT: The intrinsics are also termed "builtins" for GCC, see here.
Thursday, December 30, 2010
A Pointer on Pointers Part 2
In this second part of the series on pointers, we'll cover what they point to and how that relates to reality. Let's consider four pointer types: void*, char[], struct*, and void (*f)(int) (i.e., a function pointer). With each pointer, we'll learn something further about C, assembly, and how a processor interacts with memory.
To begin, there is the all-purpose void* pointer (and yes, void* means pointer to void, but I'll keep writing void* pointer for emphasis). It is a C type that means a pointer to nothing, or to anything. This pointer is important for allocating space and for casting (changing the type of) that space. This second piece only has representation in the programming language; in assembly, a pointer carries no type information. So by casting, the programmer tells the compiler that this block of memory is now to be treated differently. Therefore, void* pointers are used when the type is not known (e.g., pthread_create(..., void* arg) or CreateThread(..., LPVOID lpParameter, ...)) or to discard existing type information.
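As a minimal sketch of that first pattern (assuming POSIX threads; the worker function and its argument are my own):
#include <pthread.h>
#include <stdio.h>

void* worker(void* arg)
{
    int* value = (int*) arg;      // the cast restores the meaning only the caller knew
    printf("got %d\n", *value);
    return NULL;
}

int main(void)
{
    int x = 42;
    pthread_t t;
    pthread_create(&t, NULL, worker, &x);  // &x travels across as an untyped void*
    pthread_join(t, NULL);
    return 0;
}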
A char[] is the next C type that will be discussed here. In almost every expression, an array in C decays to a pointer to its first element. Learn this fact. Internalize it. The two are so close that you can usually use them interchangeably, like so:
char* foo = (char*) malloc(1024 * sizeof(char));
foo[0] = '\0';     // 1
*(foo + 0) = '\0'; // 2
Lines 1 and 2 are equivalent. So a pointer to many chars behaves as an array of chars: we can reach any element with the array subscript, or use pointer arithmetic to reach it instead. Now, the next time you see char*, you can think of it as an array of characters. (In a later post, we'll cover char** and other more complex pointers.)
Third, we work with the struct* pointer. Structs are merely logical arrangements of data that the programmer specifies, so accessing a field through the pointer compiles down to a memory access at some fixed offset from the base address. If you want to bypass this compiler support, that is your purview, and in seeing how, we'll learn what the compiler is doing and what a struct* pointer really is. We want a struct with two shorts followed by a long, and we'll assign 'A', 'B', and 1024 to the three fields respectively.
typedef struct _pop2 {
    short a, b;
    long c;
} POP2, *PPOP2;

// Option 1
PPOP2 structP = (PPOP2) malloc(sizeof(POP2));
structP->a = 'A';
structP->b = 'B';
structP->c = 1024;

// Or option 2 (assumes a 4-byte long and natural alignment, so c lands at offset 4)
char* no_struct = (char*) malloc(8);
*(short*)(no_struct + 0) = 'A';
*(short*)(no_struct + 2) = 'B';
*(long*)(no_struct + 4) = 1024;
You might be scratching your head, but options 1 and 2 do the same thing. Option 1 is how a programmer should write the code. Option 2 is roughly what the computer will actually be executing. So just remember that structs and struct* pointers are just logical arrangements of the data for your convenience.
Lastly, the function pointer cannot be ignored. A vital tool in modern architectures is the ability to call functions. Most calls use addresses established at compile time, but sometimes where the code wants to go isn't known until runtime. Function pointers are what enable inheritance in OO development and dynamic linking, and they often find use in switch statements and other constructs. (They are especially fun with runtime code generation, but that's another post.) At their core, the programmer is merely telling the computer that execution should continue at some address.
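To make that concrete, a small sketch (the names are mine) where the call target is chosen at runtime:
#include <stdio.h>

typedef int (*binop)(int, int);   // a pointer to a function taking two ints

int add(int a, int b) { return a + b; }
int mul(int a, int b) { return a * b; }

int main(void)
{
    binop op = add;               // decide at runtime where the call will go
    printf("%d\n", op(2, 3));     // prints 5
    op = mul;
    printf("%d\n", op(2, 3));     // prints 6
    return 0;
}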
To summarize, memory is just a sequence of bytes that is given meaning by the types and usage in the source code, and consequently in the program itself. Different pointer types are simply different lenses through which to look at that memory.