Showing posts with label parallel programming. Show all posts
Showing posts with label parallel programming. Show all posts

Wednesday, April 22, 2020

Thesis Proposal - Lightweight Preemptable Functions

Sol Boucher presented his thesis proposal today via Zoom.  It was well attended and I think many of us are already looking forward to his defense.

Lightweight Preemptable Functions (LPF)
Function calls require some knowledge of the cost for a program to know whether it can invoke it or not in other time critical execution.

In the current space of multitasking, three approaches will be first reviewed: futures, threads, and processes.  Futures rely on the runtime to provide asynchronous execution, but preemption requires yield points, and not all execution can support these points, nor can it be clear what this future will do.  With threads, execution is fairly independent, but minimal information is provided toward scheduling and cancellation is hard.  Processes provide good support for cancellation, except that on fork(), only one thread is required to be carried over in the new process.  The other threads are canceled, which can result in inconsistent state.

The thesis proposal is: Introducing a novel abstraction for preemption at the granularity of a synchronous function call, which includes a timeout.

launch(function, timeout, argument) returning a continuation.  If the timeout elapses, the continuation is returned, but it is then the choice of the caller for whether to resume it or cancel.  This LPF executes within the same thread context as the caller, thereby reducing overhead.  However, to resume the LPF, it will need a new stack.  To support the timeout, it relies on a timer signal that can occur every ~5 microseconds.  Launch / resume have overhead comparable to this, significantly better than fork or pthread_create.  However, cancel is extremely expensive.

LPFs also have an issue with calling functions that are non-reentrant, similar to the rules governing signal handlers.  To address this, the runtime provides selective relinking to capture what the LPF is calling via the global offset table (GOT).  Some GOT entries point to dynamic libraries, other entries are initially pointing to the dynamic linker.  This runtime support also needs to intercept thread local variables.  This interception support imposes about 10ns of  overhead, which is little above the cost of function calls themselves.

Microservice approaches have significant latency, often tends to hundreds of microseconds.  Primarily the requirement to create a sufficient container, often via processes or virtualization.  If the microservice was instead written safely and using LPFs, then the latency could be reduced toward the hardware bound as measured by communicating between VMs or committing transactions.

Cancellation cleanup is difficult in languages, such as C, that require explicit cleanup.  In other languages, adding a new exception path for timeout and cancellation could then invoke the necessary destructors.  Nonetheless, this can be expensive (perhaps single milliseconds).

Other possible future work:
Another cancellation step is the cost of unloading the mirrored library, so could the runtime instead track the changes made and then determine whether to roll back or discard.
Is it possible to reduce the overhead of the preemption signals or improving their granularity.

Friday, September 14, 2018

Is C low level?

A recent ACM article, C is Not a Low-Level Language, argues that for all of our impressions that C is close to hardware, it is not actually "low-level".  The argument is as follows, C was written for the PDP-11 architecture and at the time, it was a low-level language.  As architectures have evolved, C's machine model has diverged from hardware, which has forced processor design to add new features to attain good performance with C, such as superscalar for ILP and extensive branch prediction. 
Processors must also maintain caches to hide the memory latency, which require significant logic to maintain coherence and the illusion that the memory is shared between the threads of a process.  Furthermore, the compiler is also called upon to find optimization opportunities that may be unsound and definitely require programmer years to implement.

The author repeatedly contrasts with GPUs, while noting that they require very specific problems, or "at the expense of requiring explicitly parallel programs".  If you were not keeping track, a GPU requires thousands of threads to match the CPU's performance. The author calls for, "A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model."  Which generally sounds like the GPU design.

I appreciate the callouts to C's shortcomings, which it certainly has.  The notion that C has driven processor design is odd, yet it does reflect the fact that processors are designed to run current programs fast.  And with the programs being written in either C or a language built on C, that forces many programs into particular patterns.  I even spent some time in my PhD studies considering a version of this problem: how do you design a new "widget" for the architecture if no programs are designed for widgets to be available?

All to say, I think C is a low-level language, and while its distance from hardware may be growing, there is nothing else beneath it.  This is a gap that needs to be addressed, and by a language that has explicit parallel support.

Wednesday, August 29, 2018

Thesis Defense: Responsive Parallel Computation

We can consider parallelism strategies as either competitive (such as pthreads) or cooperative (such as Cilk).  Cooperative systems assume that the parallel computation is being equally distributed across the threads, such that other tasks are not prioritized.

Into this, Stefan Muller presented a Responsive Parallel Computation approach where the two models can be unified.  Or put another way, "We can extend existing cooperative language and cost models to account for competitive threading constructs, and to design and implement efficient and responsive scheduling algorithms."

Cooperative systems, such as Cilk, rely on task queues and work stealing to implement the parallelism, but when IO is involved, the common implementations use blocking system calls.  In the cooperative system, this blocking call then blocks one of the worker threads, rather than just the working task.

For operating system scheduling, there are many threads within the myriad of processes in the system that may need to run.  The OS often uses a priority associated with any thread / process to determine these scheduling decisions.  When programmers assign priorities (if they do at all), they often are trying to express a relative ordering between different tasks rather than absolute priorities.  As some systems limit priorities to a handful to as many as 100, the mapping of priority classes to numeric levels can be unclear.

Propose an extension to ML, PriML, with two new types:
t cmd[p] - command p returning t
t thread[p] - thread at priority p returning t

priority and order declarations enable defining the priority and the specifying the partial order between those priorities.  spawn[priority] {expression} to create the thread.  sync to retrieve the result.  Or poll to return an option that can be matched, so that we can select from multiple threads.  For example, the following PriML sequence spawns two threads to sort emails with different methods and selects the faster one.

t1 <- spawn[p] {do (qsort date emails)};
t2 <- spawn[p] {do (mergesort date emails)};
do (let fun choose () =
  cmd[p]
  {
    v1 <- poll t1;
    v2 <- poll t2;
    do (case (v1, v2) of
      (SOME l, _) =>
        cmd[p] {cancel t2; ret l}
      | (_, SOME l) =>
        cmd[p] {cancel t1; ret l}
      | (NONE, NONE) => choose ())
    }
  in
    choose ()
  end)

These features then allow the language to enforce the priority constraints, and avoid any priority inversion, as sync must always be to an equal or higher priority, and poll can be to any.

Returning to parallel program analysis, we represent them with DAGs; however, the DAG must be extended with priority annotations and the work / span calculations.  This analysis can explore different schedules, such that adding fairness for all thread priorities would extend the bound on time.

Tuesday, October 24, 2017

PhD Defense - Low-level Concurrent Programming Using the Relaxed Memory Calculus

Today, I went to Michael Sullivan's thesis defense, who passed.  The work was at a delightful intersection of my interests.

We want better (more usable, etc) semantics for low-level operations, those below the std::atomic<> and similar designs.  Perhaps this is achievable with ordering constraints.  Given the following simple example, what constraints are required?

int data, flag;

void send(int msg) {
  data = msg;
  flag = 1;
}

int recv() {
  while (!flag) continue;
  return data;
}

Two constraints: data ("visible") before flag, flag ("executed") before data.  These constraints are explicitly programmer-specified, and that it is contended that this is practical.

rmc::atomic<T> - a variable that can be concurrently accessed
L(label, expr) - labels an expression
VEDGE and XEDGE - specify orders between labeled expressions, effectively V is write visibility ordering and X is execution of read ordering
rmc::push() or PEDGE - Pushes have a total order, and provide orderings between reads and writes which is not possible with just V and X.

In more advanced space, do we need to add constraints to spinlock_lock and spinlock_unlock?  Let's add two special labels: pre, post.  These serve for interface boundaries to denote that everything has executed before this point, or is visible.

Next problem is loop iterations.  Do the constraints need to be within a single iteration or constraining every iteration?  Extend the order specifiers, so in the following, the ordering constraint is just within the iteration, whereas the constraint outside the iteration (without "_HERE") is also between the iterations.

for (i = 0; i < 2; i++) {
  VEDGE_HERE(before, after);
  L(before, x = i);
  L(after, y = i + 10);
}

Code extends LLVM and is on GitHub.  The compiler takes the RMC extensions and puts the appropriate fence instructions into the IR, and then the existing compiler lowers this to assembly.  The compiler uses an SMT solver to determine the minimal set of locations that need the necessary fences (or other instructions).  Then in lowering, the lowering to assembly can better take advantage of the specific constraints required.  Overall, the performance is better than the C++11 model on ARMv7, Power, and comparable on ARMv8.  I suspect that x86's TSO model is not as interesting for finding performance benefits.

Usable / practical - Can Seqlocks Get Along With Programming Language Memory Models?  argues that C++11 would require acquire semantics on unlock.  Here it is stated that RMC is much more straightforward.  Further, students in 15-418 found gains from RMC versus the C11 model.

Other future items include the exploration of whether there are additional consistency instructions that might provide a better nuance for the compiler to inform hardware about required orderings.  Recall that the coarsest grained instruction is the full memory fence.

Monday, May 1, 2017

Book Review: Multicore and GPU Programming: An Integrated Approach

I wanted to like this book, Multicore and GPU Programming: An Integrated Approach, and be able to use it in the classroom.  I teach a class where we cover OpenMP, MPI, Cilk, and Cuda.  Yet, I would not use this book and I have told the publishers this as well, to which they acknowledged my reasoning.

First, the author elects to use QtThreads for the base parallel programming approach.  He notes that he had considered using pthreads or C++11 thread support, and comments that he rejected C++11 threads for the book as the support was incomplete at the time.  Pthreads is left without comment.  That may be, but I have heard of and used those options, while QtThreads is something that I have never considered.

Second, the book serves as an admirable API reference.  For one or a set of function calls, significant space is dedicated to how that call works and illustrating examples for it.  Structured Parallel Programming also covers many APIs, yet it maintains a feel of being about parallel programming in general rather that the calls specifically.  However, that work covers different API sets so the two are not explicitly comparable.

Third, and this issue is really more for the editors rather than the author, the typesetting on each page is poor.  There is significant white space left bordering the text on each page.  Furthermore, the code wraps, and often not for being long code, but for long comments and comments on the same line as the code.  I understand that in programming these are stylistic choices; however, the impact of finding line wraps from long comments leaves the text looking unprofessional.  I must assume that the author wrote the code separately and then provided for being included into the book, but the editor failed to make allowances for typesetting.

In conclusion, I wanted to like and use the book.  Whenever I speak with the publisher, they always direct me to it, I just have to hope for something else to come along.  Use it as a reference perhaps, but I am cautious in my recommendation.

(This book was provided free by the publisher to review for possible use in the classroom.)

Tuesday, February 7, 2017

Conference Attendance CGO (etc) 2017 - Day 2

Several of the talks were great and very interesting.  Other talks particularly needed further presentation practice.  Unfortunately, sometimes this may come from English as a second language.  And so I am torn between wanting presenters to have practice and be open to a wider pool of researches, while also wanting to be able to easily understand the presentations.

Tapir: Embedding Fork-Join Parallelism into LLVM’s Intermediate Representation
Let's start by considering code that normalizes a vector.  This code takes 0.3s to run.  Then switch the "for" with a "cilk_for", and the execution time improves to 180s (w/ 18 cores).  When the compiler sees "cilk_for" or other parallel keywords, generally it converts these into runtime function calls that take in a function pointer for the parallel component.  (Similar to thread create routines taking in a function to execute).  With the function call, many optimization passes cannot cross the call, while previously being able to cross the "for".

Instead, let's propose three new instructions to include in the LLVM IR.  Supporting these lines required approximately 6000 lines of changes.  When the updated LLVM compiles a set of parallel programs, most can now reach 99+% work efficiency, which indicates that the parallel overhead is near 0.

Prior work would create parallel tasks symmetrically, for example each task would represent separate paths in the classic "diamond" CFG.  The problem is that the parallel program is actually taking both paths concurrently, which is not an expected behavior of the control flow.  Instead, the IR is asymmetric so that compilers can continue to reason about the basic blocks as a sequential code would appear.

Incremental Whole Program Optimization and Compilation
This covers the feature within Microsoft's Visual Studio compiler.  Each component stores hashes of the components on which it depends.  When a file is changed, it generates different hashes, which the compiler then can use to determine that its dependencies need to be re-analyzed and code gen'd.  These hash changes can then either propagate, if changed, or the compilation process will complete.

Optimizing Function Placement for Large-Scale Data-Center Applications
The common binaries for facebook are 10s-100s MBs in size.  These binaries have IPCs less than 1.0 (recall that processors can run above 2.0 and higher is better), and are experiencing frequent front-end stalls that are attributable to iTLB and I$ misses (as high as 60 per 1000, eww).  Hardware profilers can then determine the hot functions.  This information is then processed to determine the hot functions that should be clustered together.  These clusters are mapped to separate loader sessions that will load them using huge pages.

Minimizing the Cost of Iterative Compilation with Active Learning
There are too many possibilities for optimization.  Let's ask machine learning to figure this out.  The danger is always finding the right level of training to provide valuable insights without overfitting, etc.


Tuesday, January 19, 2016

Conference Attendance HiPEAC - Day 2 - Papers

Another conference day.  Much of my time is spent talking with other attendees and doing "work", such as preparing my presentation, send emails, etc.  However, I do take some time to actually sit in on other presentations, so here are two highlights:

PARSECs - This work explores rewriting some of the PARSEC benchmarks to use a task-based parallelism (OpenMP tasks), rather than pthreads.  For many workloads, these changes provide improved scaling.  For almost all workloads, the code size was reduced as the original thread pools, job queues, etc could be removed.  In the near future, these revised versions should be released.

HRF-Relaxed - The original OpenCL had no memory model; however, many vendors implemented one.  Now, C++ and other languages use SC for DRF (sequential consistency for data-race-free programs).  Unfortunately, if you use this consistency model in OpenCL, you will lose performance.  Instead, this work proposes a hierarchical race-free model, where the races are only checked at a certain scope of the program.

Tuesday, December 1, 2015

Course Design Series (Post 3 of N) - Features versus Languages

One problem I had in preparing the curriculum for this fall is how to balance teaching programming languages versus the features that exist in programming languages.  Some analogs to my course have generally been surveys of programming languages, for example - Michael Hewner's course.  In such a course, study is focused strongly on learning a diverse set of languages, particularly pulling from logic and functional programming paradigms.

In general, the problem is that the each feature needs examples from programming languages to illustrate how it is used; however, many of these languages have never been previously taught to the students.  Therefore, I could spend the first month teaching the basics of Ada, Fortran, Pascal, et cetera.  But in doing this, essentially the class starts as a language class.  I had considered this approach; however, the non-survey courses and the textbooks do not begin with the languages.  These examples all begin with the features.  Furthermore, I have taken the further approach to avoid focusing on specific syntax of languages and instead devote my time to teaching features and design.

Having then accepted that the textbooks had a purpose and were rational in design, I followed their structure.  And in retrospect I can see that having upper-level students provides a base of knowledge about programming languages such that the need to cover specifics is avoided.  I have still taken some class periods to delve into specific languages; however, this is the exception rather than a rule.  I do note that in the future, I would spend some time teaching / requiring specific languages.  Without this requirement, I have been faced with grading a myriad of languages and find myself unable to assign problems / questions based on specific languages.

Monday, April 13, 2015

Book Review: Structured Parallel Programming

At SIGCSE this year, I spoke with several publishers about possible textbooks.  Primarily, for ones that would work in the different classes that I might teach.  As well as those that relate to my present research.  In this case, I received a copy of Structured Parallel Programming: Patterns for Efficient Computation from the publisher to review and consider for possible future use in a course.

This work is in three parts: the basics of parallelism and performance, the common patterns in which parallelism is expressed, and example implementations of several algorithms.  The second part is the core of the work.  To show maps, reduce, scatter and gather, stencil, fork-join, and pipeline.  But before we learned those details, we would come to key quotes for all that I do:
You cannot neglect the performance of serial code, hoping to make up the difference with parallelism.
And:
[The] performance of scalar processing is important; if it is slow it can end up dominating performance.
Therefore, parallelism is a method for improving the performance of already efficient code.

With both the common patterns, as well as the example implementations, the authors generally provide the source code for each pattern and implementation using Cilk, TBB, and OpenMP.  This source is not for casual readers.  More involved implementations can stretch for several pages, as the initial implementation and then subsequent refinements are explored.  While it serves well as a reference, it may have worked better to focus on one parallelism approach for each section and therefore give further explanation to the code, especially the language features used.  And thereby retain the pattern itself rather than becoming a practitioners' reference.

The example implementations (the third part) are perhaps the least interesting for the classroom and potentially the most interesting for practitioners.  Clearly, if I was trying to write code similar to one of these problems, I would have an excellent reference and starting point.  However, that is quite rarely the case for myself and I suspect most people as well.

If I was teach a parallel programming course, I might consider using this work (although I still have other, similar textbooks to review); however, were I to do so I would be confining my teaching to the first two parts and may even to just 1 parallel programming paradigm.  Yes, I will admit that the last parallel programming course I took covered a diversity of paradigms (Cilk, vectorization, GPUs, OpenMP, MPI), yet I would have preferred to focus more on what one or two paradigms are capable of rather than just the taste of many.  Parallel programming takes a lot of work to learn and this book is one piece in that effort.

Thursday, March 5, 2015

Conference Attendance SIGCSE 2015 - Day 1 Morning

It is colder here in Kansas City.  Fortunately, I will only be outside briefly.  Most often I will be networking and continuing my efforts to both become a better teacher, as well as finding an academic job teaching.

This morning, I am focusing on the "Curriculum" track.  I am excited by the three papers in this track, the first looks at research, the second is on systems courses, and the last on parallel computing courses.  Alas, I was in the hallway track and missed the first work.  Perhaps I can find the authors later.

Backward Design: An Integrated Approach to a Systems Curriculum
The goal of systems is "higher level software creation".  Computer Science courses are split into Core Tier 1 and Tier 2 (a term from the ACM 2013 curriculum), where the former are taken by all CS majors and the later are only taken by most or some.  One issue in the old curriculum was that OS also taught C.  In crafting a new curriculum, first establish a vision statement, which can be used in conflict resolution (and also revised).  Establish SMART objectives to prepare and build the assessments.  The results can be found on github.

A Module-based Approach to Adopting the 2013 ACM Curricular Recommendations on Parallel Computing
Parallel computing is important and important for CS graduates to know.  The 2013 ACM Curriculum increased the number of hours that students should take in parallel computing.  Part of the recommendations are to place parallel computing into the curriculum and not just as a course.  Thus parallelism modules are placed throughout the curriculum (perhaps as early as CS1 or CS2).  Find the level of abstraction for a concept and introduce it appropriately.  For example, Amdahl's Law in CS1 versus cache coherence in senior-level class.  5 modules of parallelism were established, which have equivalences with the ACM.  Each course in the curriculum may have 1 or more modules, which then teaches and reinforces the topics.  Even after adding these modules, there has continued to be incremental development and revisions, which have improved student outcomes.  The key take away is that it is possible to introduce these recommendations without completely rewriting the curriculum.

In the afternoon, I will be standing with my poster -  Using Active Learning Techniques in Mixed Undergraduate / Graduate Courses.  Later I will post updates from my afternoon.

Tuesday, September 16, 2014

Atomic Weapons in Programming

In parallel programming, most of the time the use of locks is good enough for the application.  And when it is not, then you may need to resort to atomic weapons.  While I can and have happily written my own lock implementations, its like the story of a lawyer redoing his kitchen himself.  It is not a good use of the lawyer's time unless he's enjoying it.

That said, I have had to use atomic weapons against a compiler.  The compiler happily reordered several memory operations in an unsafe way.  Using fence instructions, I was able to prevent this reordering, while not seeing fences in the resulting assembly.  I still wonder if there was some information I was not providing.

Regardless, the weapons are useful!  And I can thank the following presentation for illuminating me to the particular weapon that was needed, Atomic Weapons.  I have reviewed earlier work by Herb Sutter and he continues to garner my respect (not that he is aware), but nonetheless I suggest any low-level programmer be aware of the tools that are available, as well as the gremlins that lurk in these depths and might necessitate appropriate weaponry.

Monday, July 22, 2013

Review: Patterns for Cache Optimizations on Multi-processor Machines (ParaPLoP '10)

In order to optimize any code, know that the optimizations "depend on the exact architecture of the machine (processor, memory, etc), the exact version of the compiler, the exact version of the operating system and the particular configuration of the program that [you] are trying to optimize."  At minimum, you should establish what your workload is and how its performance is measured, and then have some gauge of what its current performance is (including the distribution of measurements).  All of which is covered in Patterns for cache optimizations on multi-processor machines before they are touch on any optimization.  And I applaud them for doing so.

In this work, they explore three patterns of cache (mis-)use on modern processors.  The first pattern is termed "loop interchange", named for its solution.  In this pattern, the program does not access data with spatial locality.  Instead of accessing every element in a cache line, the program has a different ordering and only touches a subset of the cache line, while later (after the line has been evicted) it accesses other subsets.  In the example below, assume that N and M are both quite large (say 1+ million), so this code will likely have significant cache misses (at minimum L1 misses), while switching the "i" and "j" for loops (i.e. interchange) will considerably reduce the number of cache misses.

int X[N][M];
for (j = 0; j < M; j++)
  for (i = 0; i < N; i ++)
    f(X[i][j]); // Any set of operations applied to this element.

The next pattern is false sharing.  Threads in a program intentionally and unintentionally share data.  Data structures are written by programmers to logically group data; however, the grouping and structuring of the data is often made by the programmer and not for algorithmic need.  The hardware is expecting locality from arrays and data structures.  When multithreaded, the cache line is the unit by which the hardware tracks sharing of data.  So if different threads write to different data in the same cache line, then hardware treats the writes as being made to the same thing, which precludes it from caching.  The usual recommendation for solving this problem is to pad the data, so that the software notion (int) and hardware notion (cache line) are the same size.

int X[N];
void* thread_work(int tid)
{
  for (int i = 0; i < N; i++)
    if (i % num_threads == tid)
      X[i] = do_work(X[i]);
}

This second example goes beyond the paper's scope for false sharing.  Common data structures may also have different sharing patterns for each element.  For example in this data structure, the following fields are written to: encoding, sum, weight_left, and weight_right.  The rest are read-only. Currently the data structure uses two cache lines (as all fields are 8-bytes in size).   If the structure was rearranged so that the written fields were in one cache line and the read-only fields in the second line, then updates by any thread would only result in one cache miss rather than two.  Padding may be required, but the key insight here is arranging data by sharing pattern, which is a generalization of the previous paragraph.

typedef struct _node {
    graph_value_t value, encoding;
    unsigned long long sum;
    struct _edge* fwd;
    struct _edge* back;
   
    // tree sorted by value
    struct _node* left;
    struct _node* right;
   
    // tree sorted by encoding

    struct _node* weight_left;
    struct _node* weight_right;
} node, *pnode;


The final pattern explored in the paper is data alignment.  Ignoring the issue of misaligned accesses, let's look at misaligned allocations.  Suppose we allocate an array of 48-byte data structures in a multithreaded program.  Sometimes accessing an element is one cache miss, but sometimes it is two.  The runtime system has packed the data structures together, with 4 fitting in 3 cache lines.  In general, when you allocate data structures, they come with the same alignment as in the array, made to a 16-byte boundary, but this boundary is not guaranteed to be the start of a cache line.  The primary solution is to use support calls that change the allocation alignment.  This may waste space, but now the allocation comes using our expected number of cache lines.  And by using the lines we expect, we can tailor the program to the architecture and observe the expected performance characteristics.

The patterns are three simple ones that architects and performance minded programmers have known for years.  I am pleased to see them being reiterated, but the response may be like that from the developer after I changed his code per these patterns years ago, "Why can't the compiler just do that for me?!"

Saturday, July 16, 2011

Parallel Programming - Synchronization

Having submitted a paper this week to a conference, I have been lax on posting any updates.  With the submission behind me, I shall present another topic that has arisen recently in research and conversations - scalable synchronization.  For this post, I presume everyone has some knowledge of synchronization constructs like mutual exclusion, and knowledge of cache coherence is also valuable.  The story of this post begins 20 years ago when Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors was published.

This work studied the cost of mutexes / spinlocks and barriers.  They presented several different algorithms that scale well with increasing core counts.  When I read a work two decades old in Computer Science, at best I might think "Ah, here is where this idea was first proposed."  Which is true about the above work, but I also sit scratching my head and wondering how these ideas haven't diffused into common knowledge.

This post will focus on a single type of spinlock, the queued spinlock.  I'll discuss the hidden costs and dangers of common spinlocks and why it is sometimes worth using a queued spinlock.  Some of the authors were so good to propose several adaptations in Scalable queue-based spin locks with timeout, which I will also be discussing.  For this post, we will use an atomic_swap instruction, which is available (in one form or another) on modern architectures.

T atomic_swap(T* loc, T val){ T old = *loc; *loc = *val; return old;}

A basic (test-and-set) spinlock tries swapping in the value "locked" until it gets "unlocked" back, as follows.

void acquire_spinlock(int* lock)
{
    int value;
    do {
        value = atomic_swap(lock, LOCKED);
    } while (value == LOCKED);
}

Programmers soon realized that this implementation has some drawbacks.  Each core is constantly making requests for the same cache line and each one is trying to modify it.  The poor cache line is therefore going from processor to processor without anything useful happening.  So two changes were made: test-and-test-and-set and exponential backoff.  The following acquire routine has been extended with these changes, see "//".

void acquire_spinlock(int* lock)
{
    int value, delay = 1;
    value = atomic_swap(lock, LOCKED);
    while (value == LOCKED)
    {
        pause(delay); // take time before retrying
        delay = delay * 2;
        if (*lock != LOCKED) // test before swapping
            value = atomic_swap(lock, LOCKED);
    }
}

By delaying some time before attempting the lock again, each processor reduces its load on the memory system.  However, the core can be unlucky and be delaying when the lock is available, if it was released just after the test.  "Test-and-test-and-set" (TTS) is a simple paradigm where the code attempts an inexpensive check, like "*lock != LOCKED" before the expensive operation "atomic_swap".

However, TTS can still have many threads of execution enter the same expensive region.  The spinlock here is safe, but a programmer might have:
if (global_buffer == null) {global_buffer = malloc(SPACE);}
This sequence could have several threads attempting to allocate the same large buffer and therefore leaking memory, or worse.

But back to spinlocks.  With TTS, many threads have all requested read permissions for the lock's cache line.  Each sees that the lock is UNLOCKED and requests write permissions.  The cores are sending a storm of requests that will conclude with one core acquiring the lock and the rest spinning again.

Skipping over some intermediary research, the state of the art has generally settled on the queued spinlock.  The idea is that each waiter will spin on a local variable rather than the "lock" itself.  While both referenced works give code for the lock, it is relatively simple to derive once you know it exists (a fellow student and I worked out the algorithm in less than 30min), but I'll provide a sketch here.

struct entry
{
    volatile entry* next;
    volatile int state;
};

void acquire_queued_spinlock(void* lock, entry* me)
{
    me->next = null; 
    me->state = UNLOCKED;
    entry* prev = atomic_swap(lock, me);
    if (prev == null) return;  // lock is not held
    me->state = LOCKED;
    prev->next = me;
    while (me->state == LOCKED) ;
}

The queued spinlock only maintains a tail-pointer.  Each thread should have a pointer to its entry and insertions of new entries are only made to the tail.  After inserting, the thread has a unique pointer to the "old" tail of the queue, where it extends the queue.  The thread now spins on the local variable "me->state". Furthermore, on the release event, only one thread is unlocked, which minimizes the memory traffic from the cores fighting over the lock.  However, the queued lock is more expensive to acquire when there is no contention and it can be a greater pain to debug.

Reviewing the measurements in Algorithms..., the simple test-and-set spinlock performs best at very low levels of contention.  After a modest amount of lock contention, the TTS with exponential backoff is performing best.  But as the number of requests continue to increase, the queued spinlock dominates performance from then on.

As with all performance work, measure twice, cut once.  The correct spinlock for your parallel code will depend on: level of contention and the specific architecture of the system.

Tuesday, June 7, 2011

Conference Time: HotPar-3

A little over a week ago, I was at Berkeley for the 3rd annual Workshop on Hot Topics in Parallelism.  The exciting thing was that researchers from a variety of areas were there, unlike my impression of conferences where everyone is in the same area.  Computational scientists, computer architects, programming language and compiler folk, and more were present to discuss their work on parallelism.

There are a couple of works that I would like to highlight (none of which are my own), from the complete list available at HotPar11.  First, "Parallel Programming with Inductive Synthesis" had some of the coolest and most unexpected ideas that I heard.  Their work provides a programmer with the ability to sketch a solution.  A simple example is factoring.  Given an initial function: f(x) = x^2 - 2 * x - 3.  And the programmer then requests the system provide an equivalent function in the form of g(x) = (x - ??) * (x - ??).  (?? is the notation for the system to synthesize the value).

Factoring is a known problem, so the system just factors, right?  Wrong!  Rather (based on my memory of the presentation), the system converts the two functions into circuits, then sets them equal to each other and finds a satisfying assignment to all variables.  Therefore, the system can create the synthesized functions in the general case.

The second interesting work explored data races.  In "How to Miscompile Programs with "Benign" Data Races", the attendees saw how even benign data races could cause errors given certain compiler optimizations.  Unfortunately, the examples are more involved and therefore I refer you to the paper.

The final interesting result that I wish to highlight is that given current technology trends, matrix multiply will be memory bound in 10 years.  In "Balance Principles for Algorithm-Architecture Co-Design", the authors derive simplified formulas for specific functions that can show how their performance relates to current / future technology.  The intent is to enable architects, etc to understand where to best focus the design "dollar", in order to achieve the best performance for that dollar.

I've selected these works primarily on the basis of the results staying with me.  I found many others professionally interesting, but the three papers here are the ones that made me sit up, scratch my beard, and go "hmm".

Friday, March 4, 2011

Parallel Programming - First Pass

With my time dominated by preparing to take PhD qualifying exams (i.e., quals), I have been even more slack than usual with regards to preparing regular posts.  Nonetheless, let's talk a little on parallel programming.  In one aspect, the parallel paradigm is the future of computer science, even if I remain highly skeptical about what the specifics of this computing will be.  But just because its usage in general computing may be occluded, the specific usefulness of parallel computing is not in doubt.  This post will serve as an overview of several concepts in parallel programming.

First to distinguish between concurrent and parallel execution.  Concurrent execution has the possibility or potential for executing simultaneously.  Parallel execution is when this potential is realized.  Concurrent execution is possible with a single core; however, parallel execution is not.

Synchronization is the main question when writing concurrent code.  Synchronization introduces a specific ordering to what was otherwise independent execution.  There are two common flavors: exclusion and notification.  Exclusion consists of mutexes, spinlocks, and other constructs that guarantee a single instance of concurrent execution performing a specific set of operations.  With notification, concurrent executions establish information with respect to each other, for example every instance has reached a specific point (e.g., barrier).

An ongoing quest with synchronization research is transactional memory (TM).  TM provides the ability to make a set of memory updates atomicly.  Processors provide the ability to make simple updates atomic (see Compiler Intrinsics), yet a series of updates requires the explicit exclusion guarantee provided by spinlocks, etc.  TM brings the exclusion to the memory address itself, rather than the abstract object protected by the spinlock, and allows an arbitrary set of accesses to be encapsulated in the atomic operation.  However, TM is not presently feasible.

Parallel patterns are formed based on the observation that parallel programs and algorithms can be classified into several distinct groups (i.e., patterns).  An assembly line is a parallel operation and fits the "pipelined" pattern.  By the programmer recognizing the pattern, certain common errors can be avoided.  With the pipeline, the programmer recognizes that the data is to be passed through discreet stages.

Well, that's my prelude to what will likely be many more posts on parallel programming.