
Wednesday, October 17, 2018

Thesis Defense: Practical Concurrency Testing

Ben Blum defended his dissertation work today on Practical Concurrency Testing.  What follows are the notes from that defense.

The goal is to prove that a program is correct across arbitrary concurrent interleavings.  There are three common testing approaches:
unit testing of the most likely interleavings, stress testing that is not systematic, and formal verification, which requires separate tools and techniques to describe the program.

Landslide is a proposed technique based on stateless model checking (Godefroid '97), which tests a different execution interleaving on every iteration.  However, naively enumerating the interleavings yields O(2^n) states to test.  [Flanagan '05] identified equivalent interleavings, and [Musuvathi '08] proposed heuristic orderings to identify possible bugs faster.  These approaches can often require annotations, so adoption requires automated instrumentation.  Work in this space is addressing further concurrency problems such as weak memory models, but hardware transactional memory is still open.

This instrumentation requires preemption points.  Finer-grained preemption points find more bugs, but increase the number of states to test.  Bugs and failures follow certain cases, such as use-after-free, deadlocks, assertion failures, and invalid memory accesses.  Dynamic data-race analysis can help inform the necessary preemption points.

As a reminder, a data race occurs when two threads access the same memory location and (a minimal example follows the list):
- at least one of the accesses is a write
- the threads are not holding the same mutex
- there is no other ordering requirement (condition variable, etc) between the accesses
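
A minimal sketch in C (my example, not from the defense): two threads increment a shared counter with no common mutex, so the read-modify-write sequences can interleave and lose updates.

#include <pthread.h>

int counter = 0;                  // shared, no mutex

void* worker(void* arg)
{
    for (int i = 0; i < 100000; i++)
        counter++;                // load, add, store: a preemption between
                                  // the load and the store loses an update
    return 0;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, 0, worker, 0);
    pthread_create(&b, 0, worker, 0);
    pthread_join(a, 0);
    pthread_join(b, 0);
    return 0;                     // counter often ends well below 200000
}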

Quicksand applies this analysis to select different, smaller problem spaces using subsets of the possible preemption points.  Each subset also represents a smaller part of the larger problem space.  If all of these subsets are verified, the result represents a full verification of the program.  Prior work explored using APIs such as mutex_lock/unlock, or every shared variable access, as the preemption points.

This tester is deployed in OS courses at CMU, PSU, and U Chicago.  Manual annotation is not viable for students, especially the struggling students for whom the traces would be most valuable.  Furthermore, students regularly deploy ad-hoc synchronization, such as while (!ready) yield();, which requires heuristics, as naive model checking must test every possible count of yields and their interleavings.

When used by students, about 75% of tested kernels / libraries have bugs identifiable by the testing framework.  For the tested submissions (7 semesters) of students at CMU, there is an improvement in grades, but it is not statistically significant when correcting for the opt-in bias.  Most students are then able to fix the bugs found by the tool.

Hardware transactional memory poses a separate challenge for model checking.  An aborted transaction is observationally equivalent to a transaction that fails immediately.  Furthermore, all transactions must be assumed to be abortable, as there are many possible causes of aborts.  As prior posts covered, this fact requires that every transaction have a valid abort path, and that abort path requires most of the verification.

Landslide was tested using hand-written tests, transactional data structures, and a TSX-based spinlock; each set of tests has a concurrency or performance bug in its implementation.  What about demonstrating that there are no bugs in an implementation?  With 10 hours of CPU time, verification is only possible for small cases of complex code.  That said, practical testing so far has required fewer than 4 preemptions to create the buggy scenario.  There can be other bugs requiring an increasingly complex ordering, but generally those are very rare.

Abstraction reduction [Simsa '13] works by reducing primitives within implementations, such as mutual exclusion, to verified components.  Using this technique then allows Landslide to verify the complex HTM implementations at higher thread counts.

In attendance are the recent instructors of Operating Systems and the TAs.


Monday, February 6, 2017

Conference Attendance CGO (etc) 2017 - Day 1

I started the conference attendance this time on Saturday, with the LLVM-Performance workshop, at which I presented an extension to my dissertation work.  I received some interesting and useful feedback from the other attendees, and saw additional possibilities for its usage and collaboration.  Now that it is Monday, it is time to attend some conference talks.  This evening, I will be acting as an advisor, watching one of my students present our work; we practiced today, so it should go great!

Checking Concurrent Data Structures Under the C/C++11 Memory Model
C/C++11 included additional keywords that allow specifying features of the memory model, as previously covered.  In order to check data structure implementations, the data structures need to be further annotated to describe which executions are valid and which are invalid.  For example, is a queue required to always return an element, or can it fail if an element was only recently added?  Using these annotations, the authors were able to find bugs in the data structure implementations.
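
As a refresher (my sketch, not the paper's annotation language), C11 exposes these memory-model features through <stdatomic.h>; a release store paired with an acquire load publishes data from one thread to another:

#include <stdatomic.h>
#include <stdbool.h>

int data;                         // ordinary shared data
atomic_bool ready = false;        // publication flag

void producer(void)
{
    data = 42;
    atomic_store_explicit(&ready, true, memory_order_release);
}

int consumer(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                         // spin until the flag is set
    return data;                  // guaranteed to observe 42
}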

Efficient Abortable-locking Protocol for Multi-level NUMA Systems
The impact of NUMA can be significant.  On the largest shared-memory machines, accessing lock data local to an SMT thread can be over 2000x faster than accessing it at the farthest distance.  To avoid this overhead, a hierarchy of locks is created that mirrors the system's topology, where each level of the hierarchy acts as an MCS-style queue lock.  How then can these threads abort from their queue?  In a single level, threads mark their status as aborted and are then skipped when handing off the lock.  In the hierarchy, the requirement of waiting on a specific level is passed along to the appropriate thread, which can be determined by using the lower-level links.

The implementation was model checked and then run on an HP Superdome with 576 hardware threads.  The results showed that the lock implementation performs best when it respects all 4 levels of the NUMA (and NUCA) hierarchy.

Thread Data Sharing in Cache: Theory and Measurement
In this work, they collected memory access traces using Pin to determine the thread data sharing in parallel programs.  In particular, they showed how the different metrics of data sharing can be derived from a single pass over the trace, in time linear in the trace size.  One concern is that such trace-based analyses are very slow and could potentially skew the results.  And when the results are derived from only one trace, there is an additional question about their general applicability.


Wednesday, March 27, 2013

Transactional Memory and Intel's TSX

It was some years ago that I sat in the audience and heard AMD present a sketch for how transactional memory (TM) would be implemented in the x64 ISA.  More recently a fellow student mentioned that Intel has some new extensions entering the x64 ISA for locks, etc.  I'm always a fan of properly implemented locks, as they so often limit performance and scalability.  So let's dig into Intel's TSX and why I actually want to go buy a gadget when it's released.

A programmer can delineate the transactional section with XBEGIN and XEND instructions.  Within the transactional section, all reads and writes are added to a read- or a write-set accordingly.  The granularity for tracking is a cache line.  If another processor makes any request for a line in the write-set, or a write request for a line in the read-set, then the transaction aborts.

Transactions can be nested, but the nesting is flattened: an internally nested transaction does not commit on its XEND, and if any transaction in the nest aborts, then the entire transaction aborts.  Only when the number of executed XENDs equals the number of XBEGINs does the entire transaction commit and become globally visible.  Transactions can be explicitly aborted by the XABORT instruction, which enables the code to abort early when it can determine that the transaction will or should fail.
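
For concreteness, compilers expose these instructions as intrinsics in immintrin.h (compile with -mrtm on GCC/Clang).  Below is a minimal sketch of the usual pattern, pairing the transactional path with the required fallback path; fallback_lock and its helper functions are placeholder names of my own, not a real API:

#include <immintrin.h>      // _xbegin, _xend, _xabort, _XBEGIN_STARTED

extern int fallback_lock;                 // an ordinary spinlock (assumed)
void lock_acquire(int* lock);             // placeholder declarations
void lock_release(int* lock);
int  lock_is_held(int* lock);

void critical_section(void)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED)
    {
        // Transactional path: reading the fallback lock adds it to our
        // read-set, so a concurrent acquisition aborts this transaction.
        if (lock_is_held(&fallback_lock))
            _xabort(0xff);
        // ... read and write shared data ...
        _xend();                          // commit; writes become visible
    }
    else
    {
        // Abort path: required, since any transaction can abort.
        lock_acquire(&fallback_lock);
        // ... perform the same work non-transactionally ...
        lock_release(&fallback_lock);
    }
}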

As I understand it, TSX is built on top of the existing cache coherence mechanisms.  Each cache line gains an additional bit to indicate whether it is part of a transaction.  Each memory operation is treated normally between the processor and the coherence hierarchy, with several caveats.  If a dirty, transactional block is evicted, then the transaction fails.  If a dirty, transactional block is demand-downgraded from modified to shared or invalid, then the transaction also fails; in this case, a new coherence message indicates that the request to forward the data has failed and that the request should instead be satisfied by memory.

If the transaction commits, then the transactional bits are cleared on each cache line.  And the lines operate normally according to existing cache coherence mechanisms.

Wrapping this up, TSX is an ISA extension that almost every program can take advantage of, which gives it the same appeal for personal testing as building my own x64 machine back in 2005.

Saturday, July 16, 2011

Parallel Programming - Synchronization

Having submitted a paper this week to a conference, I have been lax on posting any updates.  With the submission behind me, I shall present another topic that has arisen recently in research and conversations - scalable synchronization.  For this post, I presume everyone has some knowledge of synchronization constructs like mutual exclusion, and knowledge of cache coherence is also valuable.  The story of this post begins 20 years ago when Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors was published.

This work studied the cost of mutexes / spinlocks and barriers.  They presented several different algorithms that scale well with increasing core counts.  When I read a work two decades old in Computer Science, at best I might think, "Ah, here is where this idea was first proposed."  That is true of the above work, but I also sit scratching my head and wondering how these ideas haven't diffused into common knowledge.

This post will focus on a single type of spinlock: the queued spinlock.  I'll discuss the hidden costs and dangers of common spinlocks and why it is sometimes worth using a queued spinlock.  Some of the authors were so good as to propose several adaptations in Scalable queue-based spin locks with timeout, which I will also be discussing.  For this post, we will use an atomic_swap instruction, which is available (in one form or another) on modern architectures.

T atomic_swap(T* loc, T val)
{
    T old = *loc;   // the read and write of *loc happen as one atomic step
    *loc = val;
    return old;
}

A basic (test-and-set) spinlock tries swapping in the value "locked" until it gets "unlocked" back, as follows.

void acquire_spinlock(int* lock)
{
    int value;
    do {
        value = atomic_swap(lock, LOCKED);
    } while (value == LOCKED);
}
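
For completeness, here is a minimal sketch of the matching release (assuming the same LOCKED / UNLOCKED constants).  On x86 a plain store suffices; weaker memory models would need a release barrier before the store.

void release_spinlock(int* lock)
{
    *lock = UNLOCKED;  // plain store on x86; add a barrier on weaker models
}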

Programmers soon realized that this implementation has some drawbacks.  Each core is constantly making requests for the same cache line and each one is trying to modify it.  The poor cache line is therefore going from processor to processor without anything useful happening.  So two changes were made: test-and-test-and-set and exponential backoff.  The following acquire routine has been extended with these changes; see the "//" comments.

void acquire_spinlock(int* lock)
{
    int value, delay = 1;
    value = atomic_swap(lock, LOCKED);
    while (value == LOCKED)
    {
        pause(delay); // take time before retrying
        delay = delay * 2;
        if (*lock != LOCKED) // test before swapping
            value = atomic_swap(lock, LOCKED);
    }
}

By delaying some time before attempting the lock again, each processor reduces its load on the memory system.  However, the core can be unlucky and be delaying when the lock is available, if it was released just after the test.  "Test-and-test-and-set" (TTS) is a simple paradigm where the code attempts an inexpensive check, like "*lock != LOCKED" before the expensive operation "atomic_swap".

However, TTS can still have many threads of execution enter the same expensive region.  The spinlock here is safe, but a programmer might have:
if (global_buffer == null) {global_buffer = malloc(SPACE);}
This sequence could have several threads each allocate the same large buffer, thereby leaking memory, or worse.
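
A hedged sketch of the safer pattern, reusing acquire_spinlock and release_spinlock from above (buffer_lock and ensure_buffer are hypothetical names, and a production version would still want atomics for the unlocked first read): perform the cheap test, then re-test under the lock.

int* global_buffer = null;
int buffer_lock = UNLOCKED;   // spinlock guarding global_buffer (assumed)

void ensure_buffer(void)
{
    if (global_buffer == null)            // cheap, possibly racy test
    {
        acquire_spinlock(&buffer_lock);
        if (global_buffer == null)        // re-test under the lock
            global_buffer = malloc(SPACE);
        release_spinlock(&buffer_lock);
    }
}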

But back to spinlocks.  With TTS, many threads have all requested read permissions for the lock's cache line.  Each sees that the lock is UNLOCKED and requests write permissions.  The cores are sending a storm of requests that will conclude with one core acquiring the lock and the rest spinning again.

Skipping over some intermediary research, the state of the art has generally settled on the queued spinlock.  The idea is that each waiter spins on a local variable rather than on the "lock" itself.  While both referenced works give code for the lock, it is relatively simple to derive once you know it exists (a fellow student and I worked out the algorithm in less than 30 minutes), but I'll provide a sketch here.

struct entry
{
    volatile entry* next;
    volatile int state;
};

void acquire_queued_spinlock(entry** lock, entry* me)
{
    me->next = null;
    me->state = UNLOCKED;
    entry* prev = atomic_swap(lock, me);  // atomically become the new tail
    if (prev == null) return;             // lock was not held
    me->state = LOCKED;
    prev->next = me;                      // link in behind the old tail
    while (me->state == LOCKED) ;         // spin on our own cache line
}

The queued spinlock only maintains a tail pointer.  Each thread has a pointer to its own entry, and insertions of new entries are made only at the tail.  After inserting, the thread holds a unique pointer to the "old" tail of the queue, which it uses to link itself in and thereby extend the queue.  The thread then spins on the local variable "me->state".  Furthermore, on the release event, only one thread is unlocked, which minimizes the memory traffic from the cores fighting over the lock.  However, the queued lock is more expensive to acquire when there is no contention, and it can be a greater pain to debug.
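
I haven't shown the release routine; here is a sketch of it, assuming an atomic_cas(loc, expected, new) compare-and-swap primitive (in the same spirit as atomic_swap) that returns nonzero on success.

void release_queued_spinlock(entry** lock, entry* me)
{
    if (me->next == null)
    {
        // No known successor: try to swing the tail back to empty.
        if (atomic_cas(lock, me, (entry*)null))
            return;
        // Another thread became the new tail between our test and the
        // CAS; wait for it to link itself in behind us.
        while (me->next == null) ;
    }
    me->next->state = UNLOCKED;   // hand the lock to exactly one waiter
}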

Reviewing the measurements in Algorithms..., the simple test-and-set spinlock performs best at very low levels of contention.  At a modest amount of lock contention, the TTS with exponential backoff performs best.  But as the number of requests continues to increase, the queued spinlock dominates from then on.

As with all performance work, measure twice, cut once.  The correct spinlock for your parallel code will depend on the level of contention and the specific architecture of the system.