Elegant C: June 2013

Friday, June 21, 2013

What is Time: Keeping Time in Software

In a prior post, Measuring Performance, I discussed some techniques for how a developer might measure the performance of his or her application. However, sometimes what you want is to measure an interval of time. We can generally agree to do this by taking two measurements: the time before and the time after. Then find the difference, which is the interval of time elapsed. Great! But how exactly are we going to measure time?

A common theme to the questions I answer on stackoverflow is proper measurement of time (see for example, Analysis of CPU caches, Function Pointers versus Inline Functions, or Windows system time with millisecond resolution). A programmer starts with the tools most readily available, usually a call to gettimeofday or GetTickCount (for example). These functions provide time in (roughly) millisecond resolution, like having a stopwatch. For its own timekeeping purposes, an OS sets up hardware to send an interrupt every ~16ms (i.e., 64Hz via the PIT). This interrupt primarily denotes a scheduling quantum, but for little additional overhead, the OS updates its clock. Some applications need finer-grained scheduling and increase the frequency of the interrupt (i.e., decrease the quantum), which increases overhead and power usage. Using these calls, the programmer can measure events that take at least 1 centisecond with reasonable accuracy.

But, you object, I want better accuracy! Let's turn to the HPET then. A hardware device that internally operates at 10+MHz (or 100ns ticks). Thanks to this device, the system has a microsecond time source. For this resolution, a programmer may use clock_gettime or QueryPerformanceCounter. And both come also with supplemental routines that indicate the resolution of the time returned. For example:

#include <stdio.h>
#include <time.h>

int main(int argc, char** argv)
{

    struct timespec n, o;
    clockid_t t = CLOCK_MONOTONIC;

    clock_getres(t, &n);
    printf("%d - %d\n",n.tv_sec, n.tv_nsec);
    clock_gettime(t, &n);
    clock_gettime(t, &o);
    printf("%d - %d\n",n.tv_sec, n.tv_nsec);
    printf("%d - %d\n",o.tv_sec, o.tv_nsec);

    return 0;
}

% ./t_test
0 - 999848
13390932 - 798720095
13390932 - 798720095

This time source has a reported resolution of approximately 1ms; however, sometimes the printed "n,o" pair actually shows a one microsecond difference, which suggests that the return from clock_getres could be smaller. Thus on my test machine, I could measure time by microseconds.

But, you object, I want better accuracy! So let's move on to the timestamp counter then. This counter is per-processor and runs at the processor's frequency (1+ GHz or <1ns per tick). clock_gettime and QueryPerformanceCounter may use this time source. So we can query it a nanosecond accurate time measurement. Done, or are we?

But, I object, it takes time to do this measurement; perhaps a very small amount of time, but time nonetheless. If we switch the code above to use CLOCK_PROCESS_CPUTIME_ID, the measurements show a 400 - 500ns difference between n and o, even though the reported "resolution" is 1ns. Did the function lie about its resolution? No. Imagine calling NIST or CERN to get the time from their clock. Their clock is accurate to better than 1ns, but it takes time to call them. In the code example, each call is taking about 400ns to make the function call(s), query the timestamp and then convert into the appropriate format. Note that this overhead was also present in the earlier experiments, which is why the earlier output occasionally differed by 1 microsecond, as the 400ns of overhead caused the measurements to fall in different microsecond ticks.

But, you object, I really need to know how many nanoseconds this something takes. There are two approaches to remedy this: first, we could write some code does the above repeatedly to determine the overhead and then subtract it from the time difference, or find a different time source.

These function calls are querying the timestamp counter, but architects have exposed the counter to programmers for years, so why make the call? In x86, it is the instruction RDTSC or RDTSCP, which provides the 64-bit timestamp value. A single instruction is about as small of overhead as we're going to get (although, we can also look at register pressure, cycles required, etc).

Hah! You exclaim, now I can know which instruction sequence is fastest, because I can measure CPU cycles. But at this timescale, your program's measurement has left the "classical" architecture model of executing one instruction at a time and is drifting into the reality of many 10s of instructions executing simultaneously. When you toss in your RDTSC, there is no guarantee where in this mix of instructions it will execute and complete (with RDTSCP, you can guarantee that all previous instructions have completed). So it is recommended that even if you are measuring a "small instruction sequence" like some of the cited stackoverflow questions, that you perform thousands of operations and measure the time of this aggregate. Of course, architectural behaviors like branch prediction, cache misses, prefetchers, etc will change over the course of the experiment.

As if this wasn't bad enough, the timestamp counter (including calls to CLOCK_PROCESS_CPUTIME_ID) is specific to each processor. The counter is not global system time, so taking measurements on different processors (say different threads or a thread that migrated) will not necessarily give meaningful results. Many modern OSes try to set the counters so that the measurements are within a small margin of each other. In practice, I have observed deltas from low 100s to 10s of millions, so be warned.

Even worse, older processors really did count CPU cycles as the timestamp, so if the processor slowed down, then a "cycle" meant more nanoseconds. Modern chips have invariant TSC, per Intel architecture manual:

The invariant TSC will run at a constant rate in all ACPI P-, C-. and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource.

As bonus, all of the clocks can be affected by thermal drift, as their frequency is related to their temperature. Others have investigated this behavior in greater detail, and while it can be accounted for, in practice I've happily ignored it (although I should note in future writing that I have not controlled for this possible measurement bias).

In conclusion, I use RDTSC for most of my measurements and try to control for the clock difference that exists between processors. In my current PhD research, I am still working to try to control and properly account for this difference so that I can measure the behavior that I am interested in. So be warned and be cautious, and be as precise as you need to be.

Wednesday, June 5, 2013

Atomics versus Locks

I've written in the past about using intrinsics to improve performance. And I've discussed using this functionality to build your own synchronization. I'm not alone in this knowledge, and I want to further expand the answer to a stackoverflow question. But sometimes a little knowledge is a dangerous thing. The trouble came from a paper that looked at, among other things, the performance of incrementing a value using locks versus using interlocked compare-and-swap for two threads.

Alright, compare-and-swap (CAS) should be faster, as one can approximate lock acquire and release as this operation. So let's test and see how the results turn out. First, the base code:

unsigned long long counter = 0;

int main(int argc, char** argv)
{
    unsigned long long start, end, o;

    start = rdtsc();
    do {
        counter++;
    } while (counter < MAX);
    end = rdtsc();

    printf("%lld\n", (end - start));
    return 0;
}

This version is the "Single Thread" result. But the first variation is to switch counter to be volatile, as this variable needs this property when we move to multithreaded. By doing so, the compiler has to read, modify, then write the result to memory on every loop iteration. Overhead, versus keeping the value in a register, like the assembly that follows:

loop: add    $0x1,%rax
      cmp    $0x1dcd64ff,%rax
      jbe    loop

And the volatile code adds: mov    0x1429(%rip),%rax and a corresponding write after. Already this is enough to measurably impact performance. Let's move forward and switch to performing our increment using compare-and-swap.

    do {
        unsigned long long new, old = counter;

        if (old < MAX)
        {
            new = old + 1;
            __sync_val_compare_and_swap(&counter, old, new);
        }
        else
            i = 0;

    } while (i == 1);

Now, we only want to increment when we haven't reached max. If we have, then terminate. But the assembly is now unpleasant:

loop: mov    0x144b(%rip),%rax        # 401a90 <counter>
      cmp    $0x1dcd64ff,%rax
      jbe    swap
...
swap: lea    0x1(%rax),%rcx
      lock cmpxchg %rcx,0x140b(%rip)        # 401a90 <counter>
      jmp    loop

First, the optimized code has split this into two pieces. The check and the increment, and these have been split. A fun exercise would be to convince the compiler to switch the jbe into ja and combine the pieces back together. But the fact is, the impact of using the atomic operation looks like 6x.

As stated before, a lock is approximately twice the cost of the atomic operation and that result is roughly observed (along with some further overhead). Each version is then modified to spawn a second thread that does the same work as the first. All of my results are cycles per increment and follow (taken from a single run, reproducibility not guaranteed):

TEST - -O3 - -O0

Single Thread - 1 cyc / inc - 7 cyc / inc

Single Thread volatile - 6 cyc / inc - 16 cyc / inc

Single Thread w/ CAS - 38 cyc / inc - 46 cyc / inc

Single Thread w/ lock - 90 cyc / inc - 92 cyc / inc

Two Threads w/ CAS - 250 cyc / inc - 251 cyc / inc

Two Threads w/ lock - 730 cyc / inc - 630 cyc / inc

What can we conclude? First, adding -O3 (optimized) had almost no effect on results when the synchronization mechanisms are introduced. Yes, unoptimized code is bad, but it doesn't matter when all of your work is synchronization and the effects that it imposes on the great processor performance tricks.

Second, but why did unoptimized lock work better? Because there was a different thread interleaving. Perhaps the OS put one of threads asleep and the other thread ran happily for a while. The point is that if you are synchronizing, then that is your overhead.

Third, and here is the dangerous knowledge, you might conclude that you want to use CAS instead of the lock. Yet, this test was highly artificial. The synchronized operation took just a couple of cycles optimized, so we could observe the overhead of synchronization. Suppose your work is now 1250 cycles, now using CAS is only roughly a 33% improvement (1250 + 250 versus 1250 + 730) and not a 3x (250 versus 730). Furthermore, how often is your synchronization a single variable?

To conclude, IF what you want to synchronize is a single variable, then using CAS can work. Have I done this? Yes. Do I normally? No. I have this tool in my programming toolbox, for those times that I really need it.

Anyway, why not use atomic increment instead? Because we want to ensure that the final value is exactly MAX, no more and no less.