Wednesday, June 5, 2013

Atomics versus Locks

I've written in the past about using intrinsics to improve performance.  And I've discussed using this functionality to build your own synchronization.  I'm not alone in this knowledge, and I want to expand further on an answer to a stackoverflow question.  But sometimes a little knowledge is a dangerous thing.  The trouble came from a paper that looked at, among other things, the performance of incrementing a value using locks versus using interlocked compare-and-swap on two threads.

Alright, compare-and-swap (CAS) should be faster, since acquiring and releasing a lock can each be approximated as roughly one such operation.  So let's test and see how the results turn out.  First, the base code:

#include <stdio.h>
#include <x86intrin.h>   /* for __rdtsc(); one way to provide the rdtsc() helper used below */

#define MAX 500000000ULL /* matches the 0x1dcd64ff bound visible in the assembly below */

static unsigned long long rdtsc(void) { return __rdtsc(); }

unsigned long long counter = 0;

int main(int argc, char** argv)
{
    unsigned long long start, end;

    start = rdtsc();
    do {
        counter++;
    } while (counter < MAX);
    end = rdtsc();

    printf("%llu\n", (end - start));
    return 0;
}


This version is the "Single Thread" result.  The first variation is to make counter volatile, since the variable will need that property when we move to multiple threads.  By doing so, the compiler has to read, modify, then write the value back to memory on every loop iteration.  That is overhead compared to keeping the value in a register, as in the assembly that follows:

loop: add    $0x1,%rax
      cmp    $0x1dcd64ff,%rax
      jbe    loop
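
For reference, the volatile variant only changes the declaration (a minimal sketch; the rest of main stays the same):

volatile unsigned long long counter = 0;   /* force a load and store to memory on every iteration */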

The volatile code adds a load, mov    0x1429(%rip),%rax, at the top of the loop and a corresponding store after the add.  That alone is enough to measurably impact performance.  Let's move forward and switch to performing the increment using compare-and-swap.

    int i = 1;   /* loop flag; the original snippet doesn't show its declaration */
    do {
        unsigned long long new, old = counter;

        if (old < MAX)
        {
            new = old + 1;
            /* only writes counter if it still equals old; a lost race simply retries */
            __sync_val_compare_and_swap(&counter, old, new);
        }
        else
            i = 0;

    } while (i == 1);


Now, we only want to increment when we haven't reached MAX; if we have, we terminate.  But the assembly is now unpleasant:

loop: mov    0x144b(%rip),%rax        # 401a90 <counter>
      cmp    $0x1dcd64ff,%rax
      jbe    swap
...
swap: lea    0x1(%rax),%rcx
      lock cmpxchg %rcx,0x140b(%rip)        # 401a90 <counter>
      jmp    loop

First, the optimized code has split the work into two pieces: the check and the increment.  A fun exercise would be to convince the compiler to switch the jbe into a ja and merge the pieces back together.  But the fact is, the impact of using the atomic operation looks to be about 6x.
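
One attempt at that exercise (a sketch only; this restructuring is my own, and the compiler is under no obligation to fuse the check and the cmpxchg) is to invert the test so the loop exits on the ja side:

    do {
        unsigned long long old = counter;
        if (old >= MAX)     /* exit test; a compiler could emit this as ja */
            break;
        /* a failed CAS (another thread won the race) just retries next iteration */
        __sync_val_compare_and_swap(&counter, old, old + 1);
    } while (1);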

As stated before, a lock is approximately twice the cost of the atomic operation, and that is roughly what I observe (along with some further overhead).  Each version is then modified to spawn a second thread that does the same work as the first; a sketch of the lock-based, two-thread variant appears after the table.  All of my results are in cycles per increment and follow (taken from a single run, reproducibility not guaranteed):

TEST                      -      -O3        -      -O0
Single Thread             -   1 cyc / inc   -   7 cyc / inc
Single Thread volatile    -   6 cyc / inc   -  16 cyc / inc
Single Thread w/ CAS      -  38 cyc / inc   -  46 cyc / inc
Single Thread w/ lock     -  90 cyc / inc   -  92 cyc / inc
Two Threads   w/ CAS      - 250 cyc / inc   - 251 cyc / inc
Two Threads   w/ lock     - 730 cyc / inc   - 630 cyc / inc
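
As promised, here is a minimal sketch of roughly what the lock-based, two-thread variant looks like.  It uses a pthread mutex and my own names, and omits the rdtsc timing for brevity; it is not the exact code behind the numbers above.

#include <pthread.h>
#include <stdio.h>

#define MAX 500000000ULL

unsigned long long counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* each thread increments the shared counter under the lock until it reaches MAX */
static void* work(void* arg)
{
    int done = 0;
    do {
        pthread_mutex_lock(&lock);
        if (counter < MAX)
            counter++;
        else
            done = 1;
        pthread_mutex_unlock(&lock);
    } while (!done);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, work, NULL);   /* spawn the second thread */
    work(NULL);                             /* main acts as the first thread */
    pthread_join(t, NULL);
    printf("%llu\n", counter);              /* prints exactly MAX */
    return 0;
}

Compile with -pthread.  The single-thread lock test is the same loop without the second thread.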

What can we conclude?  First, adding -O3 (optimizing) had almost no effect on the results once the synchronization mechanisms were introduced.  Yes, unoptimized code is bad, but it doesn't matter when nearly all of your work is synchronization and the stalls it imposes on the processor's usual performance tricks.

Second, why did the unoptimized lock version do better?  Because there was a different thread interleaving.  Perhaps the OS put one of the threads to sleep and the other ran happily for a while.  The point is that if you are synchronizing, then that is your overhead.

Third, and here is the dangerous knowledge: you might conclude that you want to use CAS instead of the lock.  Yet this test was highly artificial.  The synchronized operation took just a couple of cycles when optimized, precisely so we could observe the overhead of the synchronization itself.  Suppose your work is instead 1250 cycles; now using CAS is only about a 1.3x improvement (1250 + 250 = 1500 versus 1250 + 730 = 1980 cycles), not the 3x seen above (250 versus 730).  Furthermore, how often is the thing you are synchronizing a single variable?

To conclude: IF what you want to synchronize is a single variable, then using CAS can work.  Have I done this?  Yes.  Do I normally?  No.  I keep this tool in my programming toolbox for those times when I really need it.

Anyway, why not use atomic increment instead?  Because we want to ensure that the final value is exactly MAX, no more and no less.
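
For contrast, here is a minimal sketch (my own, not from the original tests) of why a plain atomic increment can overshoot: the test and the add are separate steps, so two threads can both see counter == MAX - 1 and both add.

    /* atomic add: two threads can both pass the check and push counter past MAX */
    while (counter < MAX)
        __sync_fetch_and_add(&counter, 1);

The CAS version avoids this: if another thread raced ahead, the expected old value no longer matches, the swap fails, and counter never exceeds MAX.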
