Wednesday, December 16, 2015

PhD Defense - Diagnosing performance limitations in HPC applications

Kenneth Czechowski defended his dissertation work this week.

He is trying to bring a science to what is normally the art of diagnosing low-level performance issues, such as the sorted-array processing case and the i7 loop performance anomaly.  I have much practice with this art, but I would really appreciate more formalism in these efforts.

One effort is to identify the cause of performance issues using the hardware performance counters.  These counters are not well documented, and so the existing tools are low-level.  Instead, his approach is to develop a meta-tool that intelligently iterates over the counters, thereby conducting a hierarchical event-based analysis: it starts with 6 performance counters and then iterates on more detailed counters that relate to the observed performance issue.  The goal is to diagnose why the core is unable to retire at its full bandwidth of 4 micro-ops per cycle.
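To make the hierarchical idea concrete, here is a minimal sketch of my own, not the tool from the dissertation.  The category names (frontend, backend, memory, core), the tree layout, and the measure callback are all hypothetical; a real tool would map each category to actual hardware counters and rerun the program at each level.  Only the 4 micro-ops per cycle width comes from the talk.

```python
PIPELINE_WIDTH = 4  # maximum micro-ops retired per cycle, as stated in the talk


def retire_utilization(cycles, uops_retired, width=PIPELINE_WIDTH):
    """Fraction of the available retire slots that were actually used."""
    return uops_retired / (width * cycles)


def drill_down(tree, measure, node="root", threshold=0.05):
    """Follow the most significant category at each level of the tree.

    tree    -- dict mapping a category to its more detailed sub-categories
               (hypothetical names, not real counter names)
    measure -- callable returning the fraction of slots lost to a category;
               in a real tool this would program counters and rerun the code
    """
    path = []
    while node in tree:
        shares = {child: measure(child) for child in tree[node]}
        worst = max(shares, key=shares.get)
        if shares[worst] < threshold:
            break  # nothing at this level is significant enough to refine
        path.append((worst, shares[worst]))
        node = worst
    return path


# Hypothetical hierarchy and made-up measurements, for illustration only.
tree = {"root": ["frontend", "backend"], "backend": ["memory", "core"]}
fake_measurements = {"frontend": 0.1, "backend": 0.6, "memory": 0.45, "core": 0.15}

print(retire_utilization(cycles=1000, uops_retired=1000))
print(drill_down(tree, fake_measurements.get))
```

The point of the sketch is that only the counters along the chosen path ever need to be measured, which matters when the hardware exposes just a few counters at a time.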

Even if a tool can provide measurements of specific counters that indicate "bad" behavior, the next problem is that observing certain "bad" behaviors, such as bank conflicts, does not always correlate with performance loss, since the operation must be on the critical path to have an impact.

The final approach is to take the program and build artificial versions of the hot code, for example by removing the memory or compute operations from the loop body.  For some applications, several loops account for most of the time.  These loops can then be perturbed in different ways that force certain resources to be exercised further.  For example, the registers in each instruction are scrambled so that the dependency graph changes to either increase or decrease the ILP, while the instruction types themselves are unchanged.
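As a rough illustration of the register scrambling, the sketch below (mine, not from the dissertation) models a loop body as (opcode, destination, sources) tuples and estimates ILP as the instruction count divided by the length of the longest dependency chain.  It assumes unit latency and only read-after-write dependencies, on the assumption that register renaming hides the others; the instruction sequences are made up.

```python
def ilp(instructions):
    """Estimate ILP as instruction count / longest read-after-write chain.

    Each instruction is (opcode, dest_register, source_registers).
    Assumes every instruction has unit latency and that register renaming
    removes write-after-read and write-after-write dependencies.
    """
    ready = {}  # register -> chain depth of the instruction that last wrote it
    longest = 0
    for _opcode, dest, sources in instructions:
        depth = 1 + max((ready.get(src, 0) for src in sources), default=0)
        ready[dest] = depth
        longest = max(longest, depth)
    return len(instructions) / longest


# Four adds forming one serial chain through r1.
chain = [
    ("add", "r1", ("r1", "r2")),
    ("add", "r1", ("r1", "r3")),
    ("add", "r1", ("r1", "r4")),
    ("add", "r1", ("r1", "r5")),
]

# The same four adds with registers scrambled so that no add reads
# a register written by an earlier one.
scrambled = [
    ("add", "r1", ("r8", "r9")),
    ("add", "r2", ("r8", "r9")),
    ("add", "r3", ("r8", "r9")),
    ("add", "r4", ("r8", "r9")),
]

print(ilp(chain))      # 1.0
print(ilp(scrambled))  # 4.0
```

Both versions execute the same instruction types, so any difference in measured time between them can be attributed to the dependency structure rather than to the mix of operations.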
