
Thursday, March 7, 2019

Talk: Concurrent Data Structures for Non-Volatile Memory

Today, Michal Friedman gave a talk on Concurrent Data Structures for Non-Volatile Memory.

Future systems will contain non-volatile memory: memory that exhibits DRAM-like performance characteristics, but maintains its contents even across power failures.  In current systems, caches update memory only when lines are evicted or explicitly flushed.  Flushes, however, impose overhead, both from the memory access time and from overriding the write-back behavior of most caches.

Linearizability is one correctness condition for concurrency, governing how operations may be observed.  It can be extended to durable linearizability for a durable system: data is flushed before it becomes globally visible (initialization), an operation flushes the prior operations it depends on (dependence), and an operation is persisted before it completes (completion).  A further extension is required to know when a sequence of operations has completed, beyond just taking snapshots of the memory state.

The talk presented relaxed, durable, and log versions of a lock-free queue that extend Michael and Scott's baseline queue implementation.  Each version provides successively stronger guarantees: the relaxed version is the existing queue augmented with a sync operation to snapshot state, the durable version preserves the data structure across failures, and the log version additionally identifies the specific state that has persisted.  The main guarantee is that the data structure remains consistent for any set of thread crashes, which is stronger than the lock-free guarantee.

This is done by extending the prior lock-free versions with memory flushes of key state, so that later updates which observe still-volatile state flush that state before completing their own operations.  This meets durable linearizability.  It can be further extended by also keeping a log of operations that is written and persisted before the operations themselves execute.  These logs are per-thread, so they remain unordered and individually track each thread's state.
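
To make the flush rules concrete, here is my own rough sketch in C (not the code from the talk) of a durable Michael-and-Scott-style enqueue, where persist() stands in for flushing the relevant cache lines and fencing.  It assumes the queue was initialized with a durable dummy node, as in the Michael-and-Scott design:

#include <immintrin.h>   /* _mm_clflush, _mm_sfence */
#include <stdlib.h>
#include <stdbool.h>

typedef struct node {
    int value;
    struct node *next;
} node_t;

typedef struct queue {
    node_t *head;
    node_t *tail;
} queue_t;

/* Flush the cache line(s) backing [p, p+len) and fence, making the data durable. */
static void persist(const void *p, size_t len)
{
    const char *c = (const char *)p;
    for (size_t i = 0; i < len; i += 64)
        _mm_clflush(c + i);
    _mm_sfence();
}

static bool cas(node_t **addr, node_t *expected, node_t *desired)
{
    return __atomic_compare_exchange_n(addr, &expected, desired, false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

void enqueue(queue_t *q, int value)
{
    node_t *n = malloc(sizeof(node_t));
    n->value = value;
    n->next  = NULL;
    persist(n, sizeof(*n));                  /* initialization: durable before visible */

    for (;;) {
        node_t *last = q->tail;
        node_t *next = last->next;
        if (next == NULL) {
            if (cas(&last->next, NULL, n)) {
                persist(&last->next, sizeof(last->next));  /* completion: persist our link */
                cas(&q->tail, last, n);      /* swing the tail (volatile helper state) */
                return;
            }
        } else {
            persist(&last->next, sizeof(last->next));      /* dependence: help persist the
                                                              pending link we observed */
            cas(&q->tail, last, next);
        }
    }
}

The comments mark where the durable-linearizability conditions (initialization, dependence, completion) show up; the dequeue side and failure recovery are omitted from this sketch.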

The relaxed version implements sync by creating a special object that indicates a snapshot is occurring.  If other concurrent operations find this object, they take over the snapshot and finish persisting the state before completing their own operations.  Thus a snapshot does not block other operations, but still occurs at that point in the sequence of operations.

Based on performance measurements, the relaxed version performs similarly to the baseline implementation, while the durable and log-based implementations run slower than the relaxed version but perform similarly to each other.

Finally, TSO guarantees that the stores will reach the cache in the desired order, so no flush is required between writes to enforce ordering.

Monday, February 6, 2017

Conference Attendance CGO (etc) 2017 - Day 1

I started the conference attendance this time on Saturday, with the LLVM-Performance workshop, at which I presented an extension to my dissertation work.  I received some interesting and useful feedback from the other attendees, and I also saw additional possibilities for its usage and for collaboration.  Now that it is Monday, it is time to attend some conference talks.  This evening, I will be acting as an advisor, watching one of my students present our work; we practiced today, so it should go great!

Checking Concurrent Data Structures Under the C/C++11 Memory Model
C/C++11 included additional keywords that allow specifying aspects of the memory model, as previously covered.  In order to check data structure implementations, the data structures need to be further annotated so as to describe which executions are valid and which are invalid.  For example, is a queue required to always return an element, or can it fail even if an element was recently added?  Using these annotations, the authors were able to find issues in the data structures and otherwise characterize their behavior.

Efficient Abortable-locking Protocol for Multi-level NUMA Systems
The impact of NUMA can be significant.  On the largest shared-memory machines, accessing lock data at the farthest distance can be over 2000x slower than accessing data local to an SMT thread.  To avoid this overhead, a hierarchy of locks is created that mirrors the system's topology, where each level of the hierarchy acts as an MCS-style queue lock.  How then can threads abort and leave their queue?  At a single level, threads mark their status as aborted and are then skipped when the lock is handed off.  In the hierarchy, the responsibility of waiting at a specific level is passed along to the appropriate thread, which can be determined using the lower-level links.

The implementation was model checked and then run on an HP Superdome with 576 hardware threads.  The results showed that the lock implementation performs best when it respects all 4 levels of the NUMA (and NUCA) hierarchy.

Thread Data Sharing in Cache: Theory and Measurement
In this work, the authors collected memory access traces using Pin to determine the thread data sharing in parallel programs.  In particular, they showed how the different metrics of data sharing can be derived from a single pass over the trace, in time linear in the trace size.  The concern is that trace-based analysis approaches are very slow and could potentially skew the results.  And when the results are derived from only one trace, there is an additional question about their general applicability.


Thursday, December 8, 2016

Practical TSX

I previously speculated about how Intel's TSX is implemented; however, I did not have access to any machines supporting TSX until this year.  I still have not done much testing personally, but I did direct two students who explored and measured this support.  As a reminder, TSX is an implementation of hardware transactional memory, which can simplify concurrency designs by avoiding the need for locks.  Being a hardware implementation, it has certain fixed costs and constraints.

Mario Dehesa-Azuara and Nick Stanley completed a 6-week project in the spring, and the summary below is taken from their final project report.  Also, as they are students, their report may not be available indefinitely, so the link may be broken at some future date.
First, they reviewed the best practices for writing TSX-enabled code.  In particular, the TSX path and the fallback path (the fallback path is required to ensure that the code can make progress even with aborts) must be mutually exclusive.  This can require additional operations versus a purely transactional approach.
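
To make that mutual exclusion concrete, here is a minimal sketch of the usual lock-elision pattern (my own illustration, not code from their report); it assumes the RTM intrinsics from immintrin.h and a compiler flag like gcc's -mrtm:

#include <immintrin.h>   /* _xbegin, _xend, _xabort, _XBEGIN_STARTED, _mm_pause */
#include <stdatomic.h>

static atomic_int fallback_lock;   /* 0 = free, 1 = held */

static void lock_fallback(void)
{
    while (atomic_exchange(&fallback_lock, 1))
        while (atomic_load(&fallback_lock))
            _mm_pause();
}

static void unlock_fallback(void)
{
    atomic_store(&fallback_lock, 0);
}

/* Run body() either inside a hardware transaction or under the fallback lock. */
void run_critical(void (*body)(void))
{
    for (int attempt = 0; attempt < 3; attempt++) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            /* Read the lock inside the transaction: if a lock-based thread is
               running, abort, so the two paths never run concurrently. */
            if (atomic_load(&fallback_lock))
                _xabort(0xff);
            body();
            _xend();
            return;
        }
        /* Aborted: status encodes the reason; a real implementation would wait
           for the lock to be free before retrying. */
    }
    lock_fallback();   /* slow path: guarantees progress despite repeated aborts */
    body();
    unlock_fallback();
}

Reading the fallback lock inside the transaction is the kind of "additional operation" mentioned above: it adds the lock to the transaction's read set, so any thread that takes the lock aborts the in-flight transactions.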

Second, they measured the cost of concurrent data structure updates.  Their work noted that writing a transactional implementation was significantly easier than a fine-grained locking approach.  However, their measurements revealed some counter-intuitive results.  For example, an AVL tree is a self-balancing data structure.  The self-balanced nature is a benefit in that fewer memory accesses should be required.  Yet, the rotations required to maintain this condition actually increased the set of locations accessed and therefore resulted in a high rate of aborts.

To understand this, we must turn to the actual implementation.  We know that TSX can only track a limited number of memory locations (at most, the size of the L1 data cache).  As soon as any transactional memory location (i.e., cache line) cannot be stored in the L1, the transaction must abort.  Thus limiting the size of the read and write sets of the transaction is vital for completing transactions.  In Mario's and Nick's experiments, they observed that after 5 million insertions into an AVL tree, transactions were at a 50% failure rate (regardless of the tested number of threads).  In contrast, a treap, with its probabilistic balancing, has a relatively constant failure rate that depends on the number of threads (and not the total insertions).

Third, using TSX has an inherent cost that is significantly higher than other atomic operations.  The standard advice still holds: simple atomic updates should use the appropriate atomic instructions.  What if you need to perform several of these operations?  Again, we turn to the final report.  The measurements show that simple transactional operations on consecutive memory locations become faster than the equivalent atomic operations on those locations once you access at least 6 locations per "transaction".  Furthermore, if the program must obey another constraint, such as updating all of the elements or none of them, then locks or transactions would be required regardless.

It is important to remember that a major benefit of transactional memory is in design and implementation effort, not in the actual runtime of the program.

Monday, July 14, 2014

Book Review: The Practice of Programming

(contains Amazon affiliate link)
I recently found The Practice of Programming (Addison-Wesley Professional Computing Series) sitting on the shelf at my local library.  I am generally skeptical when it comes to programming books, and particularly those from a different decade, but I trusted the name "Brian Kernighan", so I checked the book out.

And I am so glad that I did.  From the first chapter, which discussed style, I wanted to read more.  And the only reason to ever stop reading was to pull out a computer and put these things into practice.  I didn't even mind that performance wasn't discussed until chapter 7.  Still, I will readily acknowledge that I disagree with some of the statements in the book.  Furthermore, some parts of the text are clearly dated, like the discussion of the then-current C / C++ standards.

I'd like to conclude with a brief code snippet from the work.  This code is part of a serializer / deserializer.  Such routines are always a pain to write, particularly if you have many different classes / structs that need them.  Thus the authors suggest using varargs and writing a single routine that can handle this for you.  Here is the unpack (i.e., deserialize) routine:

/* The snippet assumes <stdarg.h> and the short typedefs below. */
#include <stdarg.h>

typedef unsigned char  uchar;
typedef unsigned short ushort;
typedef unsigned long  ulong;

/* unpack: unpack packed items from buf, return length */
int unpack(uchar *buf, char *fmt, ...)
{
    va_list args;
    char *p;
    uchar *bp, *pc;
    ushort *ps;
    ulong *pl;

    bp = buf;
    va_start(args, fmt);
    for (p = fmt; *p != '\0'; p++) {
        switch (*p) {
        case 'c': /* char */
            pc = va_arg(args, uchar*);
            *pc = *bp++;
            break;
        case 's': /* short */
            ps = va_arg(args, ushort*);
            *ps = *bp++ << 8;
            *ps |= *bp++;
            break;
        case 'l': /* long */
            pl = va_arg(args, ulong*);
            *pl = *bp++ << 24;
            *pl |= *bp++ << 16;
            *pl |= *bp++ << 8;
            *pl |= *bp++;
            break;
        default: /* illegal type character */
            va_end(args);
            return -1;
        }
    }
    va_end(args);
    return bp - buf;
}

So now we have a little language for describing the format of the data in the buffer.  We invoke unpack with a string like "cscl" and pointers to store the char, short, char, and long.  Hah!  That's it.  Anytime we add new types, we just have to call pack / unpack.
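
As a hedged illustration (my own, not from the book), a caller for that "cscl" layout might look like this, reusing the uchar / ushort / ulong typedefs from above:

#include <stdio.h>

int main(void)
{
    /* A buffer holding: char 0x01, short 0x0203, char 0x04, long 0x05060708 */
    uchar buf[] = { 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08 };
    uchar c1, c2;
    ushort s;
    ulong l;
    char fmt[] = "cscl";

    int len = unpack(buf, fmt, &c1, &s, &c2, &l);
    printf("read %d bytes: %x %x %x %lx\n", len, c1, s, c2, l);
    return 0;
}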

Does it matter that the variables are only sequences like "pl" or "bp"?  No.  Variable names should be meaningful and consistent.  "i" can be fine for a loop iterator.

We have given up some performance (*gasp*), but gained in the other parts that matter like readability and maintainability.  I plan on using this in my current research (but only the unpack, as my serializers are already highly optimized).  All in all, I approve of this book and may even someday require it as a textbook for students.

Monday, March 4, 2013

Maps and Red-Black Trees

When collaborating on a prior piece of research, Brainy: effective selection of data structures, I learned about the cost of selecting particular data structures.  I expect that every Computer Scientist knows the general cost of using standard data structures like trees and linked lists.  The extension that Brainy provided was to recognize that a "tree" can have different implementations, and these implementations can have different costs for a given workload.  I, and I expect others, learned about AVL, splay, red-black, and other tree types.

All this is good until your type hides its implementation and you reason about it in the abstract.  A map is commonly conflated with its implementation.  A map is a key-value store, where a "key" has an associated "value".  An address book can be understood as a map from name to address / phone number / etc.

Now, a map is often assumed to be implemented as a hash table, giving O(1) look-up cost.  However, in the C++ standard library, map uses a red-black tree, giving O(log n).  Before you go and request that the libraries be changed, understand that experimental measurements suggest the red-black implementation wins out when n > 500, as hash collisions come to dominate the idealized O(1) cost.  This is the problem that Brainy attempts to solve: the programmer selects the functionality, e.g., map, and Brainy selects the best implementation given your program and architecture.

So go ahead and use a map, but recognize that you may not have O(1) cost and definitely won't as n grows increasingly large.

Wednesday, December 1, 2010

Elegance in Structure

Recently, I came across a request for an elegant implementation of linked lists (again on stack overflow).  And now that I'm finally through conference deadlines, etc, let's consider this request in further detail.

Specifically, the issue is a singly linked, circular list.  I'd love to hear the benefit of making a singly linked list circular, but I'll assume that there is a reason.  Therefore, we'll explore three design questions.
  1. Layout of the structure
  2. Inserting elements
  3. Deleting elements
Structure layout revolves around a key issue that will decide the rest of the design: is the head element tracked in a particular way?  To specifically track the head element, the data structure needs two pieces: a container and individual elements.  This is the questioner's design.  Alternatively, one would treat the head implicitly as the first element and solely use the "node" structure.  The advantage is that there is only one data structure, and the necessary code for managing the special "head" is already in place.  However, with an implicit constraint, the data structure can break if the constraint isn't obeyed; the constraint stands in for the container structure, but without the compiler's support / guarantees.

The second issue is preparing an elegant insertion of elements.  Two design questions are present: whether the list is ordered, and if it is unordered, whether it acts as a queue or a stack.  Based on the poster's design, we have an unordered queue.  Unfortunately, this gives an unpleasant O(n) insertion cost, unless we include a tail pointer.  Since elegance encompasses more than simplicity, elegant tail insertion should use a tail pointer whenever possible.  The potential inelegance of the implementations (for insertion and removal) is dealing with the three different insertion cases: zero elements, one element, many elements.  Can we do it in fewer?

#include <stdlib.h>    /* malloc, free */
#include <stdbool.h>

typedef struct node {
    int value;
    struct node* next;
} Node;

typedef struct list {
    Node* head;
    Node* tail;
} List;

bool Insert(List* L, int value)
{
    Node* N = (Node*) malloc(sizeof(Node));

    if (N == NULL) return false;

    N->value = value;

    if (L->head == NULL)  // case zero elements
    {
        L->head = L->tail = N->next = N;
    }
    else  // cases 1+ elements
    {
        L->tail->next = N;
        L->tail = N;
        N->next = L->head;  // keep circular
    }

    return true;
}

The final elegance piece rests on element removal.  Again, we're trying to reduce the inelegance by combining the different cases into fewer tests and less code.  The original implementation has three cases: remove the only element, remove the head, and remove some other element.  The distinguishing step in removing the head element is that the head pointer needs to be updated.  Additionally, when any element is removed, the previous element needs to be updated.  However, we already have the previous element for the head: the tail!  Thus, element removal has a common case of finding the element, followed by a couple of quick checks based on that element.

void Remove(List* L, int value)
{
    Node *T, *E, *P;

    if (L == NULL) return;

    T = L->head;
    P = E = L->tail;

    if (T == NULL) return;

    do {
        if (T->value == value) break;
        P = T;
        T = T->next;
    } while (T != E);

    // Ideally, the break would skip into this block
    if (T->value == value)
    {
        P->next = T->next;   // Remove elem
        if (T == E) L->tail = P;  // Removed tail
        if (P == E) L->head = P->next;  // Removed head
        if (T == P) {L->head = L->tail = NULL;} // Removed only elem
        free(T);  // Cleanup
    }
}
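
For completeness, here is a small usage sketch of my own (assuming the List starts out zeroed):

#include <stdio.h>

int main(void)
{
    List L = { NULL, NULL };         /* empty circular list */

    Insert(&L, 1);
    Insert(&L, 2);
    Insert(&L, 3);

    Remove(&L, 2);                   /* remove a middle element */
    Remove(&L, 1);                   /* remove the head */

    printf("%d\n", L.head->value);   /* prints 3, the only remaining element */
    return 0;
}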

Initially, I wanted to treat the list implicitly, but I find myself liking the inclusion of a tail pointer. Yet on small lists, the trade-offs may be different.  I wouldn't give this as the be all and end all of circular lists, but I would present this as my generic implementation.

Tuesday, October 26, 2010

A Pointer on Pointers Part 1

People often express uncertainty about the working of pointers, structures, etc.  They may not be my readers, but hearing their plight moves me to write nonetheless.  Others have noticed this too, again looking at StackOverflow.  Everything in this post is written from a C-centric point of view, unless otherwise noted.

There are two separate items to understand about pointers.  First, where they come from.  And second, what they point to.  We will roughly be working off of the following picture of memory usage, but understand that this is so simplified that it is almost wrong.


Memory can be allocated from two places: stack and heap.  Both are generalizations of the actual mechanics.  Stack allocations are the explicitly named variables in the source, declared either globally or local to a function.  Global variables use space delineated in the binary image rather than stack pages, but a programmer should treat them identically in that the allocations are only valid within their scope.  This leads to my first example:

struct foo* newFoo()
{
    struct foo F;
    return &F;
}

Managed language programmers often attempt the above example.  newFoo() creates a foo and returns a reference; however, the referenced foo is on the stack.  In case there is uncertainty, let's see the assembly for function newFoo():

; Using 32bit x86 assembly, as 64bit has calling convention differences
  push    ebp
  mov     ebp, esp
  sub     esp, 0x10  ; Allocate 16 bytes for a "foo"
  mov     eax, esp   ; Put the address into the return value
  mov     esp, ebp   ; Cleanup
  pop     ebp
  ret

The register esp points to the end of the stack, so the subtract instruction changes the location it points to, thereby setting aside the required space for a foo object on the stack; however, the space is only available while in the function.  When the cleanup section is reached, the space is "reclaimed" as esp is changed back.  (The funny thing is, an optimizing compiler may inline newFoo() and therefore change the scope of this stack allocation.)  So to make an allocation persist beyond its allocation scope, we need the "heap".

Heap memory is memory that won't be reclaimed until the program explicitly requests it (or the program terminates).  While malloc() is the most common method for heap allocations, it is not the only one.  Through these methods, a section of memory is set aside for the program's use.  To again simplify our picture of computing, a Java implementation of newFoo would disassemble to something like the following:

  push    ebp
  mov     ebp, esp
  sub     esp, 0x4     ; Space for arguments
  mov     dword [esp], 0x10  ; "Push argument" (the allocation size)
  call    malloc              ; the heap address is returned in eax
  mov     esp, ebp     ; Cleanup
  pop     ebp
  ret

A language like Java might prohibit allocating an object foo on the stack and therefore force the allocation to be made from the heap.  It does so by not using "sub esp, 0x10" to make the allocation, but instead calling out to malloc() to set aside the necessary space.  Now, when this function returns, the address it returns is not reclaimed.
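
For comparison, a minimal C sketch of the heap-allocating version (with a hypothetical 16-byte struct foo to match the assembly above) would be:

#include <stdlib.h>

struct foo { int a, b, c, d; };   /* hypothetical 16-byte layout for illustration */

struct foo* newFoo(void)
{
    struct foo* F = malloc(sizeof(struct foo));  /* lives on the heap */
    return F;   /* still valid after returning; the caller must eventually free(F) */
}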

An entire series of courses would be required to fully explain the trade-offs and reasoning, and since the length is growing long, I will conclude here and discuss the usage of this memory in a later post.

Thursday, October 21, 2010

Accessing Structs with Side Effects

In my recent research, I've needed to take certain additional actions when the fields in a data structure are changed.  Since all code involved is my own, my initial approach was to annotate any access with explicit calls.  Obviously, this isn't sustainable in larger projects, but for quick prototyping, it is sufficient.  Let's see a simple example (with pseudo-C as usual):

typedef struct _tree {
    struct _tree *left, *right;
    int data;
} Tree, *pTree;

Tree t;
t.left = t.right = NULL;
// Now take action

Since we want to ensure that we take action after every update, get / set routines could be introduced for each field.  Yet, now I have the overhead of calling these routines on every access.  And I still have to trust that all accesses will use the routines (though using another language like C++, I could make the fields private and force access to the routines).
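
As a sketch of that get / set approach in C (my own illustration, where on_update() is a hypothetical stand-in for whatever side effect is needed):

void on_update(pTree changed);            /* hypothetical side-effect hook */

void treeSetLeft(pTree t, pTree newLeft)
{
    if (t->left == newLeft) return;       /* discard redundant stores */
    t->left = newLeft;
    on_update(t);                         /* e.g., push the changed node onto a dirty stack */
}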

My research is currently using C#, so I have further tools available to me, specifically accessors.  Initially, they seemed to be yet another silly hoop to jump through while writing code.  However, they provide a great tool for my present problem, where modifications to fields need to incur side effects.  Now for a C# snippet:

private Node pLeft;  // Internal storage
public Node left     // Public accessor
{
    get { return pLeft;}
    set
    {
        if (pLeft == value) return;  // discard redundant stores
        pLeft = value;
        pStack.Push(this); // side effect on update
    }
}

So my data structure now has implicit side effects from updates (good for the research).  And so any future development doesn't need explicit annotations.  All in all, I was quite pleased with this change (except for spending the time debugging the differences between implicit and explicit detection).