Elegant C: microarchitecture

Friday, September 14, 2018

Is C low level?

A recent ACM article, C is Not a Low-Level Language, argues that, for all of our impressions that C is close to the hardware, it is not actually "low-level". The argument is as follows: C was written for the PDP-11 architecture, and at the time it was a low-level language. As architectures have evolved, C's machine model has diverged from the hardware, which has forced processor designers to add features, such as superscalar execution to extract ILP and extensive branch prediction, in order to attain good performance with C.
Processors must also maintain caches to hide memory latency, and those caches require significant logic to maintain coherence and the illusion that memory is shared between the threads of a process. Furthermore, the compiler is called upon to find optimization opportunities that may be unsound and that definitely take programmer-years to implement.

The author repeatedly contrasts CPUs with GPUs, while noting that GPUs suit only very specific problems, or that their performance comes "at the expense of requiring explicitly parallel programs". If you were not keeping track, a GPU requires thousands of threads to match the CPU's performance. The author calls for the following: "A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model." That generally sounds like the GPU design.

I appreciate the callouts to C's shortcomings, which it certainly has. The notion that C has driven processor design is odd, yet it does reflect the fact that processors are designed to run current programs fast. And since those programs are written in either C or a language built on C, many of them are forced into particular patterns. I even spent some time in my PhD studies considering a version of this problem: how do you design a new "widget" for the architecture when no existing programs are written to use widgets?

All that is to say, I think C is a low-level language, and while its distance from the hardware may be growing, there is nothing else beneath it. This is a gap that needs to be addressed, and addressed by a language that has explicit parallel support.

Wednesday, December 16, 2015

PhD Defense - Diagnosing performance limitations in HPC applications

Kenneth Czechowski defended his dissertation work this week.

He is trying to develop a science out of the normal art of diagnosing low-level performance issues, such as the well-known cases of processing a sorted array and the i7 loop performance anomaly. I have much practice with this art, but I would really appreciate having more formalism in these efforts.
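
To make the sorted-array case concrete, here is a minimal sketch in C of the kind of loop involved. The function names (sum_at_least, sort_bytes) are my own for illustration and are not taken from the dissertation. The loop contains a data-dependent branch, which is easy to predict when the input is sorted and hard to predict when it is shuffled. Note that an optimizing compiler may replace the branch with a conditional move or vectorize the loop, in which case the timing difference can disappear.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum only the elements that are >= threshold.
 * The `if` is a data-dependent branch: on sorted input its outcome
 * changes once across the whole array, on shuffled input it changes
 * unpredictably from element to element. */
int64_t sum_at_least(const uint8_t *data, size_t n, uint8_t threshold)
{
    assert(data != NULL || n == 0);
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= threshold)
            sum += data[i];
    }
    return sum;
}

/* Three-way comparison of two bytes for qsort. */
static int cmp_u8(const void *a, const void *b)
{
    uint8_t x = *(const uint8_t *)a;
    uint8_t y = *(const uint8_t *)b;
    return (x > y) - (x < y);
}

/* Sort in place (ascending) so the branch above becomes predictable. */
void sort_bytes(uint8_t *data, size_t n)
{
    qsort(data, n, sizeof data[0], cmp_u8);
}
```

Timing sum_at_least on a large random array before and after calling sort_bytes is the usual way to see the effect. The result is the same in both cases, only the branch behavior differs.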

One effort is to identify the cause of performance issues using the hardware performance counters. These counters are not well documented, and so the existing tools are low-level. Instead, he develops a meta tool that intelligently iterates over the counters, thereby conducting a hierarchical event-based analysis: it starts with 6 performance counters and then iterates on more detailed counters that relate to the performance issue. The goal is to diagnose why the core is unable to retire the full bandwidth of 4 micro-ops per cycle.
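
As a rough sketch of the first step of such an analysis, the following C code computes how much of the 4 micro-ops per cycle retirement bandwidth a code region actually uses. The struct, the function names, and the idea of comparing against a target fraction are my own assumptions for illustration. They are not the tool from the dissertation, and reading the real counters (which is platform specific) is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Retirement width in micro-ops per cycle, as stated in the text. */
#define RETIRE_WIDTH 4

/* Hypothetical container for two raw counter readings of one region. */
struct counter_sample {
    uint64_t cycles;        /* unhalted core cycles */
    uint64_t uops_retired;  /* micro-ops retired in those cycles */
};

/* Fraction of the available retirement slots that were used (0.0 to 1.0). */
double retire_utilization(struct counter_sample s)
{
    assert(s.cycles > 0);
    return (double)s.uops_retired /
           ((double)RETIRE_WIDTH * (double)s.cycles);
}

/* First-level decision: returns 1 if the region falls short of the
 * target fraction and so deserves a look at more detailed counters. */
int needs_drill_down(struct counter_sample s, double target)
{
    return retire_utilization(s) < target;
}
```

For example, a region that retires 1000 micro-ops in 1000 cycles uses only a quarter of the available slots, so a hierarchical tool would go on to measure more detailed counters for it.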

Even if a tool can provide measurements of specific counters that indicate "bad" behavior, the next problem is that observing certain "bad" behaviors, such as bank conflicts, does not always correlate with performance loss, as the affected operation must be on the critical path to cost any time.

The final approach is to take the program and build artificial versions of the hot code, for example by removing the memory or compute operations from the loop body. For some applications, several loops account for most of the time. Those loops can then be perturbed in different ways that force certain resources to be exercised further. For example, the registers in each instruction are scrambled so that the dependency graph is changed to either increase or decrease the ILP while the instruction types themselves are unchanged.
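
A source-level analogue of that last perturbation can be sketched in C. This is only an illustration under my own assumptions: the dissertation work rewrites registers at the instruction level, whereas here the dependency graph is changed by hand, and the function names are hypothetical. Both functions perform the same additions, but the first forms one long dependency chain and the second forms four independent chains. A compiler may unroll or vectorize either version, so the generated assembly should be checked before drawing conclusions from timings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every add depends on the previous add (low ILP). */
uint64_t sum_serial(const uint64_t *a, size_t n)
{
    assert(a != NULL || n == 0);
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four accumulators: the same adds, split into four independent
 * dependency chains that can execute in parallel (higher ILP). */
uint64_t sum_four_chains(const uint64_t *a, size_t n)
{
    assert(a != NULL || n == 0);
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Because unsigned addition wraps and is associative, the two functions always return the same value, so any difference in run time comes from the shape of the dependency graph rather than from the work done.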