Tuesday, July 31, 2018

The Art of Application Performance Testing covers what its title says. The book starts with concepts general to any performance testing, which was the part most interesting to me. Most of the text, though, focuses on the "Application" part of the title. The applications here are primarily web-based or other client-server setups, not the generic "application" that refers to any program. That said, I do not work on such applications, so the remainder of the text was of less value to me.
In testing applications, a performance analyst needs to establish a representative workload, which includes both the actions to perform and the combined load. For example, most users logging in to their bank will view their account balance, while others might transfer money or pay a bill; combined, these actions might represent most of the work from users. Then, for each unit of server capacity, how many users should be able to perform a mix of those actions? That target forms the load.
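As a toy sketch (the action names and numbers are my own stand-ins, not figures from the book), such a workload can be written down as little more than a table of action weights plus a per-server load target:

    # A minimal workload definition: each user action gets a share of the
    # session mix, and the load is sized per unit of server capacity.
    # All names and numbers here are illustrative stand-ins.
    WORKLOAD_MIX = {
        "view_balance":   0.70,  # most sessions only check the balance
        "transfer_money": 0.20,
        "pay_bill":       0.10,
    }
    assert abs(sum(WORKLOAD_MIX.values()) - 1.0) < 1e-9  # shares total 100%

    USERS_PER_SERVER = 500  # target concurrent users per unit of server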
After establishing the workload, the analyst needs to implement it, which requires a tool that generates the load, either by driving the application itself or by replaying a synthetic trace. Such tools raise further questions. What additional hardware is required to deploy this load? Does the deployment take into account geographic and other user variations, so that the load generation is representative of the user base? Finally, what tooling and methodology exist for profiling and recording the execution of the workload for present and future analysis?
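To make the load-generation side concrete, here is a toy generator in that spirit, assuming hypothetical endpoints on localhost; a real tool would add error accounting, ramp-up, and geographically distributed injectors:

    # Toy load generator: each thread is a virtual user that draws actions
    # from the weighted mix and records latencies for later analysis.
    # Endpoints and timings are placeholders, not a real deployment.
    import random, threading, time
    from urllib.request import urlopen

    ACTIONS = {  # action -> (weight, URL)
        "view_balance":   (0.70, "http://localhost:8000/balance"),
        "transfer_money": (0.20, "http://localhost:8000/transfer"),
        "pay_bill":       (0.10, "http://localhost:8000/paybill"),
    }
    latencies = []           # (action, seconds) samples
    lock = threading.Lock()  # guards the shared sample list

    def virtual_user(n_requests):
        names = list(ACTIONS)
        weights = [ACTIONS[a][0] for a in names]
        for _ in range(n_requests):
            action = random.choices(names, weights)[0]
            start = time.perf_counter()
            try:
                urlopen(ACTIONS[action][1], timeout=5).read()
            except OSError:
                pass  # a real harness would count failures separately
            with lock:
                latencies.append((action, time.perf_counter() - start))
            time.sleep(random.uniform(0.5, 2.0))  # simulated think time

    users = [threading.Thread(target=virtual_user, args=(10,)) for _ in range(20)]
    for u in users: u.start()
    for u in users: u.join()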
So I appreciated the content of the book and would recommend it to anyone focused on testing user-facing applications.
Wednesday, January 13, 2016
Repost: Avoid Panicking about Performance
In a recent post, another blogger related how a simple attempt to improve performance nearly spiraled out of control. The lesson: always measure and understand your performance problem before attempting any solution. The scope of your measurements and understanding can vary with the complexity of your solution. And when your "optimizations" have sent the system sideways, it is time for a careful appraisal of whether to revert or continue. I have done both, and more often I have wished that I had reverted rather than continued. After all, it is better for the code to work slowly than to not work at all.
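In that spirit, even a minimal profiling pass, say with Python's built-in cProfile, produces data before you start cutting (the hot path below is just a stand-in for whatever you are tempted to optimize):

    # Profile first: let the data name the slow part before optimizing.
    import cProfile, io, pstats

    def suspected_hot_path(n=100_000):
        # stand-in for the code under suspicion
        return sum(i * i for i in range(n))

    pr = cProfile.Profile()
    pr.enable()
    suspected_hot_path()
    pr.disable()

    s = io.StringIO()
    pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats(5)
    print(s.getvalue())  # read this before touching the code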
Again, always measure before cutting.
Wednesday, December 16, 2015
PhD Defense - Diagnosing performance limitations in HPC applications
Kenneth Czechowski defended his dissertation work this week.
He is trying to bring science to what is normally the art of diagnosing low-level performance issues, such as the classic puzzles of processing a sorted array and the i7 loop performance anomaly. I have much practice with this art, but I would really appreciate having more formalism behind these efforts.
One effort tries to identify the cause of performance issues using the hardware performance counters. These counters are not well documented, and the existing tools are low-level. Instead, he developed a meta-tool that intelligently iterates over the counters, conducting a hierarchical event-based analysis: it starts with six top-level performance counters and then iterates over more detailed counters that relate to the observed issue, trying to diagnose why the core is unable to retire the full bandwidth of 4 micro-ops per cycle.
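I do not have his tool, but the flavor of a hierarchical counter analysis can be sketched against the Linux perf stat command: measure a few top-level events, and only when they look suspicious drill into more detailed ones. The event names below are common predefined perf events; the threshold and the drill-down hierarchy are my own illustration, not the dissertation's:

    # Sketch of hierarchical event-based analysis over hardware counters,
    # shelling out to `perf stat`. Thresholds and drill-down choices are
    # illustrative assumptions, not the dissertation's actual tool.
    import subprocess

    TOP_LEVEL = ["cycles", "instructions"]
    LOW_IPC_DETAIL = ["cache-misses", "branch-misses",
                      "stalled-cycles-frontend", "stalled-cycles-backend"]

    def perf_stat(events, cmd):
        """Run `perf stat` on cmd and return {event: count}."""
        r = subprocess.run(
            ["perf", "stat", "-x", ",", "-e", ",".join(events), "--"] + cmd,
            capture_output=True, text=True)
        counts = {}
        for line in r.stderr.splitlines():   # -x, emits CSV on stderr
            fields = line.split(",")
            if len(fields) >= 3 and fields[0].strip().isdigit():
                counts[fields[2]] = int(fields[0])
        return counts

    def analyze(cmd):
        top = perf_stat(TOP_LEVEL, cmd)
        ipc = top.get("instructions", 0) / max(top.get("cycles", 1), 1)
        print(f"IPC = {ipc:.2f}")
        if ipc < 1.0:  # far below the 4 micro-ops/cycle retirement bound
            detail = perf_stat(LOW_IPC_DETAIL, cmd)
            for event, count in sorted(detail.items(), key=lambda kv: -kv[1]):
                print(f"  {event}: {count}")

    analyze(["/bin/ls"])  # placeholder workload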
Even if a tool can measure specific counters that indicate "bad" behavior, the next problem is that certain "bad" behaviors, such as bank conflicts, do not always correlate with performance loss; to matter, the offending operation must lie on the critical path.
The final approach takes the program and builds artificial versions of the hot code, such as variants with the memory or the compute operations removed from the loop body. For some applications, a few loops account for most of the running time. These loops can then be perturbed in ways that force certain resources to be exercised harder. For example, the registers in each instruction are scrambled so that the dependency graph changes, increasing or decreasing the ILP while the instruction types themselves are unchanged.
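His rewriting happens at the instruction level, but the attribution idea can be shown with a toy: build artificial variants of a hot loop that drop the arithmetic or the loads, then compare times. In Python the interpreter overhead swamps everything, so treat this strictly as a sketch of the methodology:

    # Toy loop perturbation: time the original loop against artificial
    # variants with the compute or the memory operations removed.
    # Illustrative only; real attribution works on machine instructions.
    import random, time

    N = 1_000_000
    a = [random.random() for _ in range(N)]
    b = [random.random() for _ in range(N)]

    def full():          # original: two loads plus a multiply-accumulate
        s = 0.0
        for i in range(N):
            s += a[i] * b[i]
        return s

    def memory_only():   # loads kept, arithmetic removed
        for i in range(N):
            _x = a[i]
            _y = b[i]

    def compute_only():  # arithmetic kept, loads replaced by constants
        s, x, y = 0.0, 1.5, 2.5
        for _ in range(N):
            s += x * y
        return s

    def best_of(f, reps=3):
        best = float("inf")
        for _ in range(reps):
            t0 = time.perf_counter()
            f()
            best = min(best, time.perf_counter() - t0)
        return best

    for variant in (full, memory_only, compute_only):
        print(f"{variant.__name__:>12}: {best_of(variant):.3f}s")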