The pair of students wrote a microbenchmark around compare-and-swap (CAS): a thread reads the value, computes a local update, and then uses compare-and-swap to place the new value into memory only if the old value is still present; otherwise the operation fails and the thread retries. Running this code in a tight loop with one thread per hardware context clearly produces significant contention. In this scenario, they made two observations from the results:
- If the requesting thread is located on the same node as the memory, its compare-and-swap will almost always fail. This implies that accesses to NUMA-local memory take a different path than accesses to NUMA-remote memory, and that this path performs worse on contended atomic operations.
- The Intel processors had a higher success rate, as neighboring threads were more likely to pass access along to each other. The AMD system did not exhibit this behavior.
Caveats: the precise NUMA topology was not known, and the AMD processors were several generations older than the Intel processors.
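For readers who want to see the shape of such a benchmark, here is a minimal sketch of the read, compute, compare-and-swap, retry loop described above. It is not the students' code: the names `CasStats` and `run_cas_benchmark` are hypothetical, the update is assumed to be a simple increment, and the sketch does not pin threads to cores or place memory on a particular NUMA node, so it cannot reproduce the local-versus-remote comparison by itself.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical result type: counts every CAS call and every CAS that succeeded.
struct CasStats {
    std::uint64_t attempts = 0;
    std::uint64_t successes = 0;
};

// Hypothetical driver: each thread performs `increments_per_thread` successful
// increments of one shared counter, retrying whenever its CAS fails.
CasStats run_cas_benchmark(unsigned num_threads, std::uint64_t increments_per_thread) {
    std::atomic<std::uint64_t> counter{0};
    std::vector<CasStats> per_thread(num_threads);
    std::vector<std::thread> threads;
    threads.reserve(num_threads);

    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&counter, &per_thread, t, increments_per_thread] {
            CasStats local;
            for (std::uint64_t i = 0; i < increments_per_thread; ++i) {
                // Read the current value.
                std::uint64_t expected = counter.load(std::memory_order_relaxed);
                for (;;) {
                    // Compute the local update (assumed here to be +1).
                    std::uint64_t desired = expected + 1;
                    ++local.attempts;
                    // Install `desired` only if the counter still holds `expected`.
                    if (counter.compare_exchange_strong(expected, desired)) {
                        ++local.successes;
                        break;
                    }
                    // On failure, `expected` now holds the value another thread
                    // installed, so the loop recomputes and retries.
                }
            }
            per_thread[t] = local;  // each thread writes only its own slot
        });
    }
    for (auto& th : threads) {
        th.join();
    }

    CasStats total;
    for (const auto& s : per_thread) {
        total.attempts += s.attempts;
        total.successes += s.successes;
    }
    // Every successful CAS added exactly one to the counter.
    assert(counter.load() == total.successes);
    return total;
}
```

Calling `run_cas_benchmark(std::thread::hardware_concurrency(), 1000000)` would give one thread per hardware context (note that `hardware_concurrency()` may return 0 on some platforms), and `successes / attempts` is the success rate. The sketch uses `compare_exchange_strong` so that a failure always means another thread changed the value, rather than a spurious failure. Per-thread counters, combined with platform-specific thread pinning, would be needed to compare threads on different nodes.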