This work studied the cost of mutexes / spinlocks and barriers. Its authors presented several different algorithms that scale well with increasing core counts. When I read a work two decades old in Computer Science, at best I might think "Ah, here is where this idea was first proposed." That is true of the work above, but I also sit scratching my head and wonder why these ideas haven't diffused into common knowledge.
This post will focus on a single type of spinlock, the queued spinlock. I'll discuss the hidden costs and dangers of common spinlocks and why it is sometimes worth using a queued spinlock. Some of the authors were also kind enough to propose several adaptations in Scalable queue-based spin locks with timeout, which I will also discuss. For this post, we will use an atomic_swap instruction, which is available (in one form or another) on modern architectures.
T atomic_swap(T* loc, T val) { T old = *loc; *loc = val; return old; }
A basic (test-and-set) spinlock tries swapping in the value "locked" until it gets "unlocked" back, as follows.
void acquire_spinlock(int* lock)
{
    int value;
    do {
        value = atomic_swap(lock, LOCKED);
    } while (value == LOCKED);
}
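The papers focus on the acquire path; for completeness, here is a minimal release sketch of my own. (A production version would issue a memory barrier before the store, so that writes made inside the critical section become visible before the lock is released.)

void release_spinlock(int* lock)
{
    *lock = UNLOCKED; // a plain store releases a test-and-set lock
}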
Programmers soon realized that this implementation has some drawbacks. Each core constantly requests the same cache line, and each one tries to modify it. The poor cache line therefore bounces from processor to processor without anything useful happening. So two changes were made: test-and-test-and-set and exponential backoff. The following acquire routine has been extended with both changes; see the "//" comments.
void acquire_spinlock(int* lock)
{
    int value, delay = 1;
    value = atomic_swap(lock, LOCKED);
    while (value == LOCKED)
    {
        pause(delay); // take time before retrying
        delay = delay * 2;
        if (*lock != LOCKED) // test before swapping
            value = atomic_swap(lock, LOCKED);
    }
}
By delaying for some time before attempting the lock again, each processor reduces its load on the memory system. However, a core can be unlucky and still be delaying while the lock is available, if the lock was released just after the test. "Test-and-test-and-set" (TTS) is a simple paradigm where the code attempts an inexpensive check, like "*lock != LOCKED", before the expensive operation, "atomic_swap".
However, TTS can still have many threads of execution enter the same expensive region. The spinlock here is safe, but a programmer might have:
if (global_buffer == NULL) { global_buffer = malloc(SPACE); }
This sequence could have several threads attempting to allocate the same large buffer, thereby leaking memory, or worse; I'll sketch the usual fix below. But back to spinlocks. With TTS, many threads have all requested read permission for the lock's cache line. Each sees that the lock is UNLOCKED and requests write permission. The cores send a storm of requests that concludes with one core acquiring the lock and the rest spinning again.
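Returning to the buffer example: the usual fix is to re-test the condition after acquiring the lock, so that only one thread performs the allocation. A minimal sketch of my own follows (ensure_buffer and buffer_lock are hypothetical names; on weakly ordered machines this pattern also needs memory barriers):

void* global_buffer = NULL;
int buffer_lock = UNLOCKED; // guards initialization of global_buffer

void ensure_buffer(void)
{
    if (global_buffer == NULL)          // cheap test, may race
    {
        acquire_spinlock(&buffer_lock); // serialize the slow path
        if (global_buffer == NULL)      // re-test while holding the lock
            global_buffer = malloc(SPACE);
        release_spinlock(&buffer_lock);
    }
}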
Skipping over some intermediary research, the state of the art has generally settled on the queued spinlock. The idea is that each waiter spins on a local variable rather than on the "lock" itself. While both referenced works give code for the lock, it is relatively simple to derive once you know it exists (a fellow student and I worked out the algorithm in less than 30 minutes), but I'll provide a sketch here.
struct entry
{
    entry* volatile next;  // written by our successor when it links in
    volatile int state;    // written by our predecessor on release
};

void acquire_queued_spinlock(entry** lock, entry* me)
{
    me->next = NULL;
    me->state = UNLOCKED;
    entry* prev = atomic_swap(lock, me); // append ourselves at the tail
    if (prev == NULL) return;            // lock was not held
    me->state = LOCKED;                  // must be set before linking in
    prev->next = me;                     // let the old tail find us
    while (me->state == LOCKED) ;        // spin on our own cache line
}
The queued spinlock only maintains a tail pointer. Each thread holds a pointer to its own entry, and new entries are inserted only at the tail. After the swap, the thread has a unique pointer to the "old" tail of the queue, which it links to its own entry, extending the queue. The thread then spins on the local variable "me->state". Furthermore, on release only one thread is unlocked, which minimizes the memory traffic from cores fighting over the lock. However, the queued lock is more expensive to acquire when there is no contention, and it can be a greater pain to debug. The release routine is sketched below.
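Both papers give the full release code; the sketch below is my reconstruction, and it assumes a compare_and_swap(loc, expected, desired) primitive (returning true on success) in addition to atomic_swap:

void release_queued_spinlock(entry** lock, entry* me)
{
    if (me->next == NULL) // no successor is visible yet
    {
        // If we are still the tail, swing the lock back to empty.
        if (compare_and_swap(lock, me, NULL))
            return;
        // A new waiter swapped itself onto the tail but has not
        // linked itself in yet; wait for prev->next to be set.
        while (me->next == NULL) ;
    }
    me->next->state = UNLOCKED; // hand the lock to our successor
}

The window between a waiter swapping itself onto the tail and setting prev->next is why the releaser may itself have to spin briefly.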
Reviewing the measurements in Algorithms..., the simple test-and-set spinlock performs best at very low levels of contention. At a modest amount of lock contention, TTS with exponential backoff performs best. But as the number of requests continues to increase, the queued spinlock dominates.
As with all performance work, measure twice, cut once. The correct spinlock for your parallel code will depend on the level of contention and the specific architecture of the system.