Friday, December 23, 2011

Measuring Performance

Starting with the primer Understanding processing time, let's delve into some methods for measuring the performance of software. Along the way, you might gain a better understanding of how the operating system (OS) tracks time. We'll take the different methods in order from least overhead to greatest.

First, the wall clock. I stare at my watch and see the second hand move. With each tick of the hand, another second has passed. Eventually the program completes, and I can record how many ticks have elapsed. (Yes, you can use time or another program to do this.) I am constrained in knowing the time by my watch, and the computer is no different; it has an internal clock that ticks only so fast (see my comment on stackoverflow). We now know how long the program takes to run, which is a measure of performance, but not what it was doing (perhaps it sleeps for 20 minutes and then prints "complete!").
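
As a rough sketch of this stopwatch-style timing (assuming a POSIX system; do_work() is a hypothetical stand-in for the code under test), it amounts to two reads of a monotonic clock:

```c
/* Wall-clock timing sketch: read the clock, do the work, read it again. */
#include <stdio.h>
#include <time.h>

static void do_work(void) {
    /* hypothetical workload */
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < 100000000UL; i++)
        sum += i;
}

int main(void) {
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);   /* glance at the watch */
    do_work();
    clock_gettime(CLOCK_MONOTONIC, &end);     /* glance again */

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("wall-clock time: %.3f s\n", elapsed);
    return 0;
}
```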

The operating system keeps some metric of every program's CPU usage, viewable with top, taskmgr, etc. On every scheduling interval, the operating system receives a timer interrupt and must decide whether to multitask and switch a different thread into execution. The OS also records that the past quantum of time was spent doing whatever was just interrupted. So we view the percentages of one second: this chunk of 16ms (the default Windows quantum) was spent executing Word, that chunk was spent in the kernel, and all the rest were idle. There are still inaccuracies, but the overhead is low. As a note, the OS may actually record the cycles elapsed on every task switch and thus compute a higher-accuracy percentage. (Of course, power management can play havoc with the relationship between elapsed cycles and elapsed wall-clock time.)
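
The distinction between time charged to a process and time that merely passed shows up directly in the numbers the OS reports. Here is a small sketch (POSIX assumed; spin() and nap() are hypothetical busy and idle phases) contrasting the CPU time the scheduler has billed to the process with the wall-clock time that elapsed:

```c
/* CPU time vs. wall-clock time: spin() burns CPU (charged to this
 * process), nap() sleeps (those quanta are charged to the idle task). */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

static void spin(void) {
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < 200000000UL; i++)
        sum += i;
}

static void nap(void) {
    sleep(2);
}

int main(void) {
    struct timespec w0, w1;
    struct rusage ru;

    clock_gettime(CLOCK_MONOTONIC, &w0);
    spin();
    nap();
    clock_gettime(CLOCK_MONOTONIC, &w1);

    getrusage(RUSAGE_SELF, &ru);
    double cpu = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
               + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
    double wall = (w1.tv_sec - w0.tv_sec) + (w1.tv_nsec - w0.tv_nsec) / 1e9;

    printf("wall-clock: %.2f s, CPU charged to us: %.2f s\n", wall, cpu);
    return 0;
}
```

The sleeping phase appears in the wall-clock total but not in the CPU total, because those quanta were billed to whatever else (often idle) was running.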

So we now know that our application was using the CPU during its entire execution, but what was it doing?! Often the next step is to use a profiler. A sampling profiler works much like the OS's CPU accounting in the previous paragraph, except it does so at application granularity (although some profilers, like xperf, can cover the whole system). Now on each interrupt (which may also be more frequent, say every 1ms, with corresponding overhead), the profiler records the instruction pointer (IP) of the running process and again ascribes that quantum of time to the function containing the IP. Optionally, the profiler might record a call stack on each interrupt so that calling functions can be identified, which is particularly useful when shared routines are invoked (like STL container classes). Some profilers will break a function's usage down by IP, allowing the analyst to see which specific assembly instructions are executed most frequently.
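
For intuition, here is a minimal sketch of the sampling mechanism itself, assuming x86-64 Linux: a periodic SIGPROF fires on CPU time, and the handler grabs the interrupted instruction pointer. A real profiler would then map each address back to a function name (and might walk the stack); do_work() is again a hypothetical workload.

```c
/* Sketch of a sampling profiler: SIGPROF fires periodically on CPU
 * time, and the handler records the interrupted instruction pointer. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>
#include <ucontext.h>

#define MAX_SAMPLES 4096

static volatile sig_atomic_t nsamples;
static void *samples[MAX_SAMPLES];

static void on_sample(int sig, siginfo_t *si, void *uc_void) {
    (void)sig; (void)si;
    ucontext_t *uc = (ucontext_t *)uc_void;
    if (nsamples < MAX_SAMPLES)   /* x86-64 Linux register layout assumed */
        samples[nsamples++] = (void *)uc->uc_mcontext.gregs[REG_RIP];
}

static void do_work(void) {       /* hypothetical workload */
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < 300000000UL; i++)
        sum += i;
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = on_sample;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigaction(SIGPROF, &sa, NULL);

    /* Ask the OS to deliver SIGPROF roughly every 1 ms of CPU time. */
    struct itimerval it = { {0, 1000}, {0, 1000} };
    setitimer(ITIMER_PROF, &it, NULL);

    do_work();

    struct itimerval off = { {0, 0}, {0, 0} };
    setitimer(ITIMER_PROF, &off, NULL);

    printf("%d samples; first few IPs:\n", (int)nsamples);
    for (int i = 0; i < nsamples && i < 10; i++)
        printf("  %p\n", samples[i]);
    return 0;
}
```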

But the profiler is still just relying on probability and sampling to give an estimate of the program's CPU usage. If we accept significantly more overhead, we can know precisely what the program is executing. I've used systems where the compiler can put callbacks around every function call. Each callback logs the function being invoked and the current time (in cycles). One then has a complete function trace of the running system, along with the cycles required for each segment of code. The only problem is that the overhead is at least 100% (so performance is halved or worse).
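
One widely available example of this approach (a sketch, not necessarily the exact system described above) is GCC/Clang's -finstrument-functions flag, which makes the compiler call two hooks on every function entry and exit. Here the hooks log the function's address and the cycle counter (x86 assumed for the rdtsc intrinsic):

```c
/* Compiler-inserted callbacks: build with
 *   gcc -finstrument-functions thisfile.c
 * and every instrumented function entry/exit calls the hooks below. */
#include <stdio.h>
#include <x86intrin.h>

/* The hooks themselves must not be instrumented, or they would recurse. */
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *fn, void *call_site) {
    (void)call_site;
    fprintf(stderr, "enter %p at %llu cycles\n", fn,
            (unsigned long long)__rdtsc());
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *fn, void *call_site) {
    (void)call_site;
    fprintf(stderr, "exit  %p at %llu cycles\n", fn,
            (unsigned long long)__rdtsc());
}

static int fib(int n) {           /* hypothetical instrumented code */
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

int main(void) {
    printf("fib(10) = %d\n", fib(10));
    return 0;
}
```

The logged addresses can later be mapped back to names (e.g., with addr2line), and the fprintf on every call makes the 100%-plus overhead easy to believe.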

Tangential to this entire discussion is the fact that modern processors also provide hardware performance counters, so the analysis can see cache misses, branch mispredictions, and other metrics of how efficiently the code is executing on the processor. And I have not discussed I/O, although my general rule is: asynchronous and nonexistent.
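
On Linux, for instance, these counters are exposed through the perf_event_open system call (or, more conveniently, through the perf tool). Below is a minimal sketch that counts cache misses around a region of code; do_work() is a hypothetical, deliberately cache-unfriendly workload.

```c
/* Hardware performance counter sketch (Linux perf_event_open). */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper; glibc provides no perf_event_open() function. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static void do_work(void) {       /* hypothetical cache-unfriendly workload */
    size_t n = 1 << 24;
    int *a = calloc(n, sizeof(int));
    for (size_t stride = 1; stride <= 4096; stride *= 2)
        for (size_t i = 0; i < n; i += stride)
            a[i]++;
    free(a);
}

int main(void) {
    struct perf_event_attr pe;
    memset(&pe, 0, sizeof(pe));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(pe);
    pe.config = PERF_COUNT_HW_CACHE_MISSES;   /* usually last-level cache */
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    pe.exclude_hv = 1;

    int fd = perf_event_open(&pe, 0, -1, -1, 0);  /* this process, any CPU */
    if (fd == -1) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    do_work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long misses = 0;
    read(fd, &misses, sizeof(misses));
    printf("cache misses: %lld\n", misses);
    close(fd);
    return 0;
}
```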

Again, always measure performance before cutting (i.e., optimizing).