Wednesday, May 2, 2018

Performance of Atomic Operations on NUMA Systems

It is the end of the semester, so time for posters about student projects.  I visited two sessions so far with three more to go.  I specifically wanted to highlight the results from one poster.

The pair of students wrote a microbenchmark around compare-and-swap, where the value is read, a local update is computed and then compare-and-swap attempts to place the new value into memory iff the old value is present, otherwise fail and retry.  Running the code in tight loop with a thread per hardware context, there is clearly going to be significant contention.  In this scenario, they had two observations from the results:
  1. If the requesting thread is located on the same node as the memory, it will almost always fail.  Implying that accessing NUMA local memory takes a different path than NUMA remote, thereby exhibiting worse performance on contended atomic operations.
  2. The Intel processors had a higher success rate as neighboring threads were more likely to pass along access between each other.  The AMD system did not exhibit this behavior.
Caveats: The precise NUMA topology was not known.  And the AMD processors were several generations older than the Intel processors.

No comments: