Monday, July 25, 2011

Book Review: Masterminds of Programming

Several years ago I read a couple of interesting interviews with language designers, so my curiosity was piqued by Masterminds of Programming: Conversations with the Creators of Major Programming Languages (Theory in Practice (O'Reilly)).  While the title is pretentious, the premise is interesting: here are 17 well-used languages, so what was each designer thinking?  Why did he make the choices that he did?

Alas, it was not quite to be.  Interesting, yes, but the discussions often wandered into computer science more generally.  Still, nearly every interview had memorable lines and claims, a few of which I shall reproduce here:

"C is a reasonably good language for compilers to generate, but the idea that humans beings should program in it is completely absurd." - Bertrand Meyer, Eiffel

"One of our programming languages guys proposed a competition in which people would program the same program in all three of those [languages] and we'd see .... It turned out the only thing we figured out is it all depended on where the brightest programmer went...." - Charles Geschke, PostScript

"A good programmer writes good code quickly. Good code is correct, compact and readable. 'Quickly' means hours to days."
and
"An operating system does absolutely nothing for you. As long as you had something - a subroutine called a disk driver, a subroutine called some kind of communication support, in the modern world, it doesn't do anything else." - Chuck Moore, FORTH

"I do recommend [C++] and not everybody is reluctant. In fact, I don't see much reluctance in [system software or embedded systems] beyond the natural reluctance to try something new in established organizations. Rather, I see steady and significant growth in C++ use."
and
"I have never seen a program that could be written better in C than in C++. I don't think such a program could exist." - Bjarne Stroustrup, C++

So now I go to Amazon to remove the book from my to-read list, and record a score of 3 out of 5.

Saturday, July 16, 2011

Parallel Programming - Synchronization

Having submitted a paper to a conference this week, I have been lax about posting updates.  With the submission behind me, I shall present another topic that has arisen recently in research and conversations - scalable synchronization.  For this post, I presume everyone has some knowledge of synchronization constructs like mutual exclusion; knowledge of cache coherence is also valuable.  The story of this post begins 20 years ago, when Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors was published.

This work studied the costs of mutexes / spinlocks and barriers, and presented several algorithms that scale well with increasing core counts.  When I read a two-decade-old work in Computer Science, at best I might think "Ah, here is where this idea was first proposed."  That is true of the above work, but I also sit scratching my head and wonder why these ideas haven't diffused into common knowledge.

This post will focus on a single type of spinlock, the queued spinlock.  I'll discuss the hidden costs and dangers of common spinlocks and why it is sometimes worth using a queued spinlock.  Some of the authors were so good as to propose several adaptations in Scalable queue-based spin locks with timeout, which I will also discuss.  For this post, we will use an atomic_swap instruction, which is available (in one form or another) on modern architectures.

T atomic_swap(T* loc, T val) { T old = *loc; *loc = val; return old; }
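
As a concrete aside (this mapping is mine, not from the papers): in C11, this pseudocode corresponds to atomic_exchange from <stdatomic.h>.

#include <stdatomic.h>

// One possible realization of atomic_swap for ints, using C11 atomics.
static inline int atomic_swap_int(atomic_int* loc, int val)
{
    return atomic_exchange(loc, val); // atomically stores val and returns the old value
}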

A basic (test-and-set) spinlock tries swapping in the value "locked" until it gets "unlocked" back, as follows.

void acquire_spinlock(int* lock)
{
    int value;
    do {
        value = atomic_swap(lock, LOCKED);
    } while (value == LOCKED);
}
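
For completeness, here is a matching release.  This is my sketch, not code from the papers, assuming the same int-based lock word:

void release_spinlock(int* lock)
{
    *lock = UNLOCKED; // a plain store suffices at this level of sketch;
                      // production code also needs a release barrier/fence here
}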

Programmers soon realized that this implementation has some drawbacks.  Each core constantly makes requests for the same cache line, and each one tries to modify it.  The poor cache line therefore bounces from processor to processor without anything useful happening.  So two changes were made: test-and-test-and-set and exponential backoff.  The following acquire routine has been extended with these changes; see the "//" comments.

void acquire_spinlock(int* lock)
{
    int value, delay = 1;
    value = atomic_swap(lock, LOCKED);
    while (value == LOCKED)
    {
        pause(delay); // take time before retrying
        delay = delay * 2;
        if (*lock != LOCKED) // test before swapping
            value = atomic_swap(lock, LOCKED);
    }
}

By delaying some time before attempting to acquire the lock again, each processor reduces its load on the memory system.  However, a core can be unlucky and still be delaying while the lock is available, if the lock was released just after the test.  "Test-and-test-and-set" (TTS) is a simple paradigm where the code attempts an inexpensive check, like "*lock != LOCKED", before the expensive operation, "atomic_swap".

However, TTS can still let many threads of execution enter the same expensive region.  The spinlock here is safe, but a programmer might write:
if (global_buffer == NULL) { global_buffer = malloc(SPACE); }
This sequence could have several threads each allocating the same large buffer, thereby leaking memory, or worse.
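
A common repair is to re-test after acquiring a lock, so only the first thread through actually allocates.  In this sketch, buffer_lock is a hypothetical lock word I am introducing for illustration:

if (global_buffer == NULL)            // cheap test, no lock held
{
    acquire_spinlock(&buffer_lock);   // buffer_lock: a hypothetical int lock word
    if (global_buffer == NULL)        // re-test: another thread may have won the race
        global_buffer = malloc(SPACE);
    release_spinlock(&buffer_lock);
}

On weakly ordered machines, safely publishing the pointer also requires memory fences, but that is beyond this sketch.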

But back to spinlocks.  With TTS, many threads have all requested read permissions for the lock's cache line.  Each sees that the lock is UNLOCKED and requests write permissions.  The cores send a storm of requests that concludes with one core acquiring the lock and the rest spinning again.

Skipping over some intermediary research, the state of the art has generally settled on the queued spinlock.  The idea is that each waiter spins on a local variable rather than on the "lock" itself.  While both referenced works give code for the lock, it is relatively simple to derive once you know it exists (a fellow student and I worked out the algorithm in under 30 minutes), but I'll provide a sketch here.

typedef struct entry
{
    struct entry* volatile next; // the pointer itself is volatile: release spins on it
    volatile int state;
} entry;

void acquire_queued_spinlock(entry** lock, entry* me)
{
    me->next = NULL;
    me->state = UNLOCKED;
    entry* prev = atomic_swap(lock, me); // atomically become the new tail
    if (prev == NULL) return;            // lock was not held
    me->state = LOCKED;
    prev->next = me;                     // link behind the old tail
    while (me->state == LOCKED) ;        // spin on our own entry only
}

The queued spinlock maintains only a tail pointer.  Each thread has a pointer to its own entry, and new entries are inserted only at the tail.  After swapping itself in, the thread holds a unique pointer to the "old" tail of the queue, behind which it links itself.  The thread then spins on the local variable "me->state".  Furthermore, on release only one thread is woken, which minimizes the memory traffic from the cores fighting over the lock.  However, the queued lock is more expensive to acquire when there is no contention, and it can be a greater pain to debug.
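
To complete the sketch, here is a release in the same style.  This is my reconstruction, not the papers' code; it assumes an atomic_compare_and_swap(loc, expected, new_val) primitive that returns the old value - which modern architectures also provide - to close the race with a thread that has swapped itself in as the tail but has not yet linked:

void release_queued_spinlock(entry** lock, entry* me)
{
    if (me->next == NULL) // no successor visible yet
    {
        // If we are still the tail, swing the lock back to empty and return.
        if (atomic_compare_and_swap(lock, me, NULL) == me)
            return;
        // Otherwise a waiter is mid-insert; wait for it to link itself behind us.
        while (me->next == NULL) ;
    }
    me->next->state = UNLOCKED; // hand the lock to exactly one successor
}

Each thread typically keeps its entry on its own stack for the duration of the critical section, so no allocation is required.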

Reviewing the measurements in Algorithms..., the simple test-and-set spinlock performs best at very low levels of contention.  At a modest amount of lock contention, TTS with exponential backoff performs best.  But as the number of requests continues to increase, the queued spinlock dominates from then on.

As with all performance work, measure twice, cut once.  The correct spinlock for your parallel code will depend on the level of contention and the specific architecture of the system.