
Wednesday, May 2, 2018

Performance of Atomic Operations on NUMA Systems

It is the end of the semester, so it is time for posters about student projects.  I have visited two sessions so far, with three more to go.  I specifically wanted to highlight the results from one poster.

The pair of students wrote a microbenchmark around compare-and-swap: the value is read, a local update is computed, and then compare-and-swap attempts to place the new value into memory if and only if the old value is still present; otherwise the operation fails and retries (a hedged sketch of such a loop follows the caveats below).  Running the code in a tight loop with a thread per hardware context clearly generates significant contention.  In this scenario, they made two observations from the results:
  1. If the requesting thread is located on the same node as the memory, it will almost always fail, implying that NUMA-local memory accesses take a different path than NUMA-remote ones and thereby exhibit worse performance on contended atomic operations.
  2. The Intel processors had a higher success rate, as neighboring threads were more likely to pass access along to each other.  The AMD system did not exhibit this behavior.
Caveats: the precise NUMA topology was not known, and the AMD processors were several generations older than the Intel processors.
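
For concreteness, here is a minimal sketch of the kind of contended compare-and-swap loop they describe, using C++11 std::atomic (the counter variable, thread count, and iteration count are my placeholders, not the students' actual benchmark):

    #include <atomic>
    #include <cstdio>
    #include <functional>
    #include <thread>
    #include <vector>

    std::atomic<long> counter{0};      // shared value everyone contends on

    void worker(long iterations, long& failures) {
        for (long i = 0; i < iterations; ++i) {
            long old_val = counter.load();
            // compare_exchange_weak stores old_val + 1 iff counter still
            // holds old_val; on failure it reloads old_val, so we retry
            // with the freshly observed value.
            while (!counter.compare_exchange_weak(old_val, old_val + 1))
                ++failures;            // another thread won the race
        }
    }

    int main() {
        unsigned n = std::thread::hardware_concurrency();  // thread per context
        std::vector<long> fails(n, 0);
        std::vector<std::thread> threads;
        for (unsigned t = 0; t < n; ++t)
            threads.emplace_back(worker, 1000000L, std::ref(fails[t]));
        for (auto& th : threads) th.join();
        long total = 0;
        for (long f : fails) total += f;
        std::printf("failed CAS attempts: %ld\n", total);
        return 0;
    }

Comparing the per-thread failure counts against the NUMA placement of the contended value is essentially the experiment described above.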

Monday, February 6, 2017

Conference Attendance CGO (etc) 2017 - Day 1

I started the conference attendance this time on Saturday with the LLVM-Performance workshop, at which I presented an extension to my dissertation work.  I received interesting and useful feedback from the other attendees, and saw additional possibilities for its usage and for collaboration.  Now that it is Monday, it is time to attend the conference talks.  This evening I will be acting as an advisor and watching one of my students present our work; we practiced today, so it should go great!

Checking Concurrent Data Structures Under the C/C++11 Memory Model
C/C++11 introduced additional keywords for specifying aspects of the memory model, as previously covered.  In order to check data structure implementations, the structures need to be further annotated to describe which executions are valid and which are invalid.  For example, is a queue required to always return an element, or may it fail if an element was only recently added?  Using these annotations, the authors were able to find issues in the data structures they checked.
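
As a refresher on those keywords (plain C++11 below, not the authors' annotation language), a release store paired with an acquire load is what publishes data between threads:

    #include <atomic>

    int payload = 0;                   // hypothetical non-atomic data
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;                  // plain write
        ready.store(true, std::memory_order_release);       // publish it
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) { }  // wait
        // The acquire load synchronizes with the release store, so
        // payload is guaranteed to read as 42 here.
    }

Weakening either side to memory_order_relaxed admits executions where the consumer sees ready == true but a stale payload; those subtle executions are exactly what such checkers enumerate.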

Efficient Abortable-locking Protocol for Multi-level NUMA Systems
The impact of NUMA can be significant.  On the largest shared-memory machines, accessing lock data local to an SMT thread can be over 2000x faster than accessing it from the farthest part of the machine.  To avoid this overhead, a hierarchy of locks is created that mirrors the system's topology.  Each level of the hierarchy acts as an MCS-style queue lock.  How, then, can threads abort from their queue?  In a single level, a thread marks its status as aborted and is then skipped when the lock is handed off.  In the hierarchy, the obligation to wait at a specific level is passed along to the appropriate thread, which can be determined by following the lower-level links.

The implementation was model checked and then run on an HP Superdome with 576 hardware threads.  The results showed that the lock implementation performs best when it respects all four levels of the NUMA (and NUCA) hierarchy.
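
For reference, here is a minimal sketch of the plain, single-level MCS queue lock that each level of their hierarchy resembles; the abort path and the NUMA-aware layering, which are the paper's actual contributions, are omitted:

    #include <atomic>

    struct MCSNode {
        std::atomic<MCSNode*> next{nullptr};
        std::atomic<bool> locked{false};
    };

    struct MCSLock {
        std::atomic<MCSNode*> tail{nullptr};

        void lock(MCSNode* me) {
            me->next.store(nullptr, std::memory_order_relaxed);
            me->locked.store(true, std::memory_order_relaxed);
            // Swing the tail to ourselves; the previous tail (if any)
            // becomes our predecessor in the queue.
            MCSNode* prev = tail.exchange(me, std::memory_order_acq_rel);
            if (prev) {
                prev->next.store(me, std::memory_order_release);
                // Spin locally on our own node (the key to MCS scalability).
                while (me->locked.load(std::memory_order_acquire)) { }
            }
        }

        void unlock(MCSNode* me) {
            MCSNode* succ = me->next.load(std::memory_order_acquire);
            if (!succ) {
                MCSNode* expected = me;
                // No visible successor: try to mark the lock free.
                if (tail.compare_exchange_strong(expected, nullptr,
                                                 std::memory_order_acq_rel))
                    return;
                // A successor is mid-enqueue; wait for its link to appear.
                while (!(succ = me->next.load(std::memory_order_acquire))) { }
            }
            succ->locked.store(false, std::memory_order_release);  // hand off
        }
    };

An abortable variant additionally lets a queued thread set an aborted flag in its node, which the handoff in unlock then checks and skips.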

Thread Data Sharing in Cache: Theory and Measurement
In this work, the authors collected memory access traces using Pin to determine the thread data sharing in parallel programs.  In particular, they showed how the different metrics of data sharing can all be derived from a single pass over the trace, in time linear in the trace size (a toy illustration follows below).  The concern is that trace-based analysis approaches are very slow and could potentially skew the results.  And when the results are derived from only a single trace, there is an additional question about how generally they apply.
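
As a toy illustration of a single-pass metric (my own simplification, not the paper's algorithm), the following counts 64-byte cache lines touched by more than one thread, given a trace of (thread, address) records:

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Hypothetical record format, as a Pin tool might emit.
    struct Access { int tid; uint64_t addr; };

    // One pass over the trace: record which threads (up to 64, as a
    // bitmask) touch each 64-byte line, then count multi-thread lines.
    std::size_t count_shared_lines(const std::vector<Access>& trace) {
        std::unordered_map<uint64_t, uint64_t> sharers;  // line -> bitmask
        for (const Access& a : trace)
            sharers[a.addr >> 6] |= 1ULL << (a.tid & 63);
        std::size_t shared = 0;
        for (const auto& kv : sharers)
            if (kv.second & (kv.second - 1))   // more than one bit set
                ++shared;
        return shared;
    }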


Monday, June 20, 2011

Book Review: Write Great Code Volume 1 (part 4 of 4)

As I stated in part 3, I found some erroneous information in chapter 11 about NUMA.  While this book was published in 2004, I cannot excuse the mistakes given that the research dates to the 90s (cf. The SGI Origin: A ccNUMA Highly Scalable Server, ISCA'97).  Fortunately, the mistakes are relatively small (only a couple of paragraphs); however, the topic is valuable to know, and I've spent a fair amount of my time trying to educate programmers on good practices with NUMA (cf. Topology information under Windows), so I don't appreciate books providing bad information.

NUMA, or non-uniform memory access (or architecture), describes a system that has multiple sets of main memory.  For programmers, this usually leads to varying access times, as some memory is "closer" than other memory.  Fortunately, modern operating systems (Windows and Linux) will give applications "close" memory when they allocate it, which works well most of the time.
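
A program can also request node-local memory explicitly; for example, on Linux there is the libnuma API.  A minimal sketch (the buffer size is a placeholder), assuming libnuma is installed and the program is linked with -lnuma:

    #include <cstddef>
    #include <cstdio>
    #include <numa.h>              // Linux libnuma

    int main() {
        if (numa_available() < 0) {
            std::puts("NUMA API not available on this system");
            return 1;
        }
        const std::size_t size = 1 << 20;
        // Allocate pages on the node where the calling thread is
        // currently running, i.e., the "close" memory described above.
        void* buf = numa_alloc_local(size);
        if (buf != nullptr) {
            // ... touch buf from this thread so the pages are placed ...
            numa_free(buf, size);
        }
        return 0;
    }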

In Write Great Code, the author is apparently unaware that main memory itself can have varying access times and therefore writes, "the term NUMA is used to describe blocks of memory that are electronically similar to main memory but, for one reason or another, operate significantly slower than main memory."  He furthermore gives graphics card memory and flash devices as examples.

The last chapter, 12, discusses I/O.  Of all the chapters, this one is the most tied to "current" hardware and therefore the most dated.  Here are three topics that I would revise, if there were a second edition: AGP versus PCI-E, polling versus interrupts, and synchronous versus asynchronous I/O.  Accelerated Graphics Port (AGP) was a late-90s / early-00s connection that has since been phased out in favor of the more general PCI Express (PCI-E) interface; however, PCI-E wasn't standardized until 2004 and therefore missed being included.  Second, at the highest performance levels, polling of devices can perform faster than interrupts.  For example, there is always more data available on a 10Gb/s connection, so polling always returns successfully; however, interrupts are still better at lower transfer rates (although I don't know the crossover point).  Finally, using asynchronous I/O is at the top of my list of things to know about devices.  In all, this chapter stood at 75 pages and is effectively a rehash of what can now be found on Wikipedia.

Overall, most of this material was subsumed by my undergraduate Computer Science coursework (particularly the intro systems course), so it is hard for me to say who the appropriate audience would be.  I therefore have a neutral take on it.  But this isn't the last review, as volume 2 has also been checked out from the library.