Elegant C: cache

Showing posts with label cache. Show all posts

Monday, October 14, 2019

Conference Attendance - MICRO 52 - Day 1

I am in Columbus Ohio for MICRO 52. A third of the attendees drove from other "midwestern" universities, of which I am one.

Keynote: Rejuvenating Computer Architecture Research with Open-Source Hardware

Moore's Law is irrelevant now, as the cost per transistor has held steady since the 28mm technology node. The cost of any deployment depends on the development cost and only at very large scales, is the cost per transistor dominant. Given that, how can we reduce the cost of hardware development.

Cambrian explosion of (RISC) ISAs in mid-1980s on with a great diversity of ISAs being created and competing. Then the Intel Pentium came out, which combined the CISC ISA with a translation into the RISC micro ops. This extinction event destroyed most of those ISAs.

Why does the instruction set architecture (ISA) matter? It is the dominant interface in the system, defining the interaction between software and hardware. But ISAs are currently proprietary, and tied to the fortunes of the company. Many ISAs have come and gone. And then each SoC (system on a chip) gets custom ISAs for each accelerator.

So there is now the RISC-V ISA that is open for use and development (which I wrote about here). The RISC-V foundation was formed in 2015 to be the neutral guardian of the specification and formal model. Based on this specification, there are both open-source and commercial implementations of the hardware as well as the software ecosystem.

ComputeDRAM: In-Memory Compute Using Off-the-Shelf DRAMsDRAM is designed based on commands being sent in a specific order with appropriate timings. The oddity is that if specific commands and timings are used that violate the normal usage, then the DRAM module can perform certain operations, such as AND and OR using three specially prepared rows (source x2 and destination).

Hybrid Skiplist: Combining the Best of Near-Data-Processing and Lock-Free Algorithms

This is a student research competition work that I want to highlight. The work is taking skip-lists, a multi-level linked list to support more efficient traversals, which has been implemented on both near-data processing (NDP) systems as well as lock-free. The performance of the two implementations is comparable, but we should be able to do better. The observation is that lock-free gains by having the long, frequently-accessed links in the cache, while NDP gets the data items close. Therefore, let's combine the two approaches so the algorithm uses the lock-free approach on the long links, and leaves the rest in NDP. A dynamic approach then adapts which nodes are in the long list and promotes them, while demoting less frequently accessed elements.

Applying Deep Learning to the Cache Replacement ProblemLet's apply machine learning to cache replacement. Offline, a ML model can perform better than the best replacement schemes, but offline this requires lots of space, more than the cache itself. Current algorithms (such as Hawkeye) use just the current PC, whereas the observation is that the machine learning model includes history, so perhaps history can have value. Using this, they analyzed the history further to notice that this history information is not complete nor does it have to be ordered. If it does not need to be ordered, then the history is a feature list (i.e., bitvector) and not a full list, so the history feature gives an index into a table of predictors for whether a line is cache friendly in usage.

NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs

This is a release of a Pin-like tool, but for GPUs. Using the framework, you can write specific instrumentation to be applied to CUDA kernels. The framework does an analysis of the kernel to find the specific instrumentation points and then recompile / JIT the code to integrate the request types into the kernel without requiring the actual source code for the kernel. Such types as the specific instructions executed as counts or traces. And thereby build a simulator or error checker.

Thursday, September 12, 2019

Thesis Proposal: Theoretical Foundations for Modern Multiprocessor Hardware

Naama Ben-David gave her proposal this morning on Theoretical Foundations for Modern Multiprocessor Hardware.

Is there a theoretical foundation for why exponential backoff is a good design? Exponential backoff is a practically developed algorithm that 0.

To develop such a foundation, we need to a model of time; however, requests are asynchronous and not according to a single time source. To address this, model time with adversarial scheduling. Thus when performing a request, there are three sources of delay:

self-delay: backoff, sleep, local computation
system-delay: interrupts, context switches
contention-delay: delay caused by contention

Given this model, the adversary can, to a limited degree, decide when requests that an entity's request have passed from self-delay into the system delay can then move to contention-delay and ultimately be completed.

In BBlelloch'17, this model was applied and the work measured for different approaches.

With no backoff, there is omega(n³) work.
Exp backoff reduces to theta(n² log n) bound on work
The paper also proposes a new algorithm that has high probability of O(n²)

The second phase of work is developing simple and efficient algorithms for systems that have non-volatile memory (NVRAM). With NVRAM, on a crash or system failure, the contents in memory persist across reboot (or other restore). This permits the system to restore the running program(s) to a finer degree than happens from auto-saves or other current techniques. However, systems also have caches, which are not persistent. Caches are presently managed by hardware and make decisions as to when to write contents back to memory. Algorithms must work with the caches to ensure that results are safely in memory at selected points of execution. There are a variety of approaches for how to select these points.

The third phase of work is modeling RDMA (remote direct memory access) systems. Can there be a model of the different parts of such a system: memory, NIC (network interface card), and CPU? Then explore the contention as well as possible failures in the system.

One scheme is for every processes to also be able to send messages on behalf of its shared memory neighbors, so that even if a process fails, its ability to participate in algorithms, such as consensus, is still possible.

Being a proposal, ongoing work will work on instantiations of these algorithms to measure the practical performance.

Friday, September 14, 2018

Is C low level?

A recent ACM article, C is Not a Low-Level Language, argues that for all of our impressions that C is close to hardware, it is not actually "low-level". The argument is as follows, C was written for the PDP-11 architecture and at the time, it was a low-level language. As architectures have evolved, C's machine model has diverged from hardware, which has forced processor design to add new features to attain good performance with C, such as superscalar for ILP and extensive branch prediction.
Processors must also maintain caches to hide the memory latency, which require significant logic to maintain coherence and the illusion that the memory is shared between the threads of a process. Furthermore, the compiler is also called upon to find optimization opportunities that may be unsound and definitely require programmer years to implement.

The author repeatedly contrasts with GPUs, while noting that they require very specific problems, or "at the expense of requiring explicitly parallel programs". If you were not keeping track, a GPU requires thousands of threads to match the CPU's performance. The author calls for, "A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model." Which generally sounds like the GPU design.

I appreciate the callouts to C's shortcomings, which it certainly has. The notion that C has driven processor design is odd, yet it does reflect the fact that processors are designed to run current programs fast. And with the programs being written in either C or a language built on C, that forces many programs into particular patterns. I even spent some time in my PhD studies considering a version of this problem: how do you design a new "widget" for the architecture if no programs are designed for widgets to be available?

All to say, I think C is a low-level language, and while its distance from hardware may be growing, there is nothing else beneath it. This is a gap that needs to be addressed, and by a language that has explicit parallel support.

Monday, March 21, 2016

PhD Defense - Simple DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Subsystems

Vivek Seshadri gave his PhD defense today covering: how different memory resources have different granularities of access, and therefore need different management techniques. These techniques come out of understanding what hardware could do, without necessarily identifying common features in existing applications that would require / benefit from these techniques.

Page Overlays / Overlay-on-Write: provide the ability to assign physical addresses at sub-page granularities (call them overlays). This reduces the cost of sparse copy-on-writes. In effect, assign a fresh sub-page unit and copy to that location. On every access, check the overlay table in parallel to determine whether to use the normal translation or go to the overlay location.

Gather-Scatter DRAM: provide support for only requesting a subset of cachelines. First, shuffle the data in a cacheline so that the same subset of multiple cache lines will map to different chips in DRAM. Second, introduce additional logic (just a few gates) in DRAM that will compute a modified address, where the default pattern (stride 1) is the normal, un-modified access.

RowClone + BuddyDRAM: can DRAM speedup memcpy (and other bulk memory operations)? First, by opening one row after another, the bitline will take the initial value and then write it into another row. More complex is opening multiple rows simultaneously, which results in bit-wise operations across the three rows: final = C (A | B) | ~C (A & B). By controlling C, bulk bitwise operations are possible. Using this technique, the system can exceed the memory bandwidth for these operations.

DirtyBlock Index: the problem is that if the source is dirty, then it needs to be written back before the previous techniques can be used. DBI provides a faster lookup mechanism to determine if / where are any dirty block lines.

These techniques are interesting, but as the candidate noted, they are in effect solutions in search of a problem. And with DRAM being commodity hardware, it is difficult to envision these techniques being adopted without further work.

Monday, July 22, 2013

Review: Patterns for Cache Optimizations on Multi-processor Machines (ParaPLoP '10)

In order to optimize any code, know that the optimizations "depend on the exact architecture of the machine (processor, memory, etc), the exact version of the compiler, the exact version of the operating system and the particular configuration of the program that [you] are trying to optimize." At minimum, you should establish what your workload is and how its performance is measured, and then have some gauge of what its current performance is (including the distribution of measurements). All of which is covered in Patterns for cache optimizations on multi-processor machines before they are touch on any optimization. And I applaud them for doing so.

In this work, they explore three patterns of cache (mis-)use on modern processors. The first pattern is termed "loop interchange", named for its solution. In this pattern, the program does not access data with spatial locality. Instead of accessing every element in a cache line, the program has a different ordering and only touches a subset of the cache line, while later (after the line has been evicted) it accesses other subsets. In the example below, assume that N and M are both quite large (say 1+ million), so this code will likely have significant cache misses (at minimum L1 misses), while switching the "i" and "j" for loops (i.e. interchange) will considerably reduce the number of cache misses.

int X[N][M];
for (j = 0; j < M; j++)
for (i = 0; i < N; i ++)
f(X[i][j]); // Any set of operations applied to this element.

The next pattern is false sharing. Threads in a program intentionally and unintentionally share data. Data structures are written by programmers to logically group data; however, the grouping and structuring of the data is often made by the programmer and not for algorithmic need. The hardware is expecting locality from arrays and data structures. When multithreaded, the cache line is the unit by which the hardware tracks sharing of data. So if different threads write to different data in the same cache line, then hardware treats the writes as being made to the same thing, which precludes it from caching. The usual recommendation for solving this problem is to pad the data, so that the software notion (int) and hardware notion (cache line) are the same size.

int X[N];
void* thread_work(int tid)
{
for (int i = 0; i < N; i++)
    if (i % num_threads == tid)
    X[i] = do_work(X[i]);
}

This second example goes beyond the paper's scope for false sharing. Common data structures may also have different sharing patterns for each element. For example in this data structure, the following fields are written to: encoding, sum, weight_left, and weight_right. The rest are read-only. Currently the data structure uses two cache lines (as all fields are 8-bytes in size).   If the structure was rearranged so that the written fields were in one cache line and the read-only fields in the second line, then updates by any thread would only result in one cache miss rather than two. Padding may be required, but the key insight here is arranging data by sharing pattern, which is a generalization of the previous paragraph.

typedef struct _node {
    graph_value_t value, encoding;
    unsigned long long sum;
    struct _edge* fwd;
    struct _edge* back;

    // tree sorted by value
    struct _node* left;
    struct _node* right;

    // tree sorted by encoding
    struct _node* weight_left;
    struct _node* weight_right;
} node, *pnode;

The final pattern explored in the paper is data alignment. Ignoring the issue of misaligned accesses, let's look at misaligned allocations. Suppose we allocate an array of 48-byte data structures in a multithreaded program. Sometimes accessing an element is one cache miss, but sometimes it is two. The runtime system has packed the data structures together, with 4 fitting in 3 cache lines. In general, when you allocate data structures, they come with the same alignment as in the array, made to a 16-byte boundary, but this boundary is not guaranteed to be the start of a cache line. The primary solution is to use support calls that change the allocation alignment. This may waste space, but now the allocation comes using our expected number of cache lines. And by using the lines we expect, we can tailor the program to the architecture and observe the expected performance characteristics.

The patterns are three simple ones that architects and performance minded programmers have known for years. I am pleased to see them being reiterated, but the response may be like that from the developer after I changed his code per these patterns years ago, "Why can't the compiler just do that for me?!"

Wednesday, March 27, 2013

Transactional Memory and Intel's TSX

It was some years ago that I sat in the audience and heard AMD present a sketch for how transactional memory (TM) would be implemented in the x64 ISA. More recently a fellow student mentioned that Intel has some new extensions entering the x64 ISA for locks, etc. I'm always a fan of properly implemented locks, as they so often limit performance and scalability. So let's dig into Intel's TSX and why I actually want to go buy a gadget when it's released.

A programmer can delineate the transactional section with XBEGIN and XEND instructions. Within the transactional section, all reads and writes are added to a read- or a write-set accordingly. The granularity for tracking is a cache line. If another processor makes a read request to a line in the write-set or either request to a read-set, then the transaction aborts.

Transactions can be semi-nested. A transaction can only commit if the outer transaction is complete. Internally nested transactions do not commit on XEND. If any transaction in the nest aborts, then the entire transaction aborts. If |XBEGIN| equals |XEND|, then the entire transaction commits and becomes globally visible. Transactions can be explicitly aborted by the XABORT instruction, which enables the code to abort early when it can determine that the transaction will or should fail.

As I understand it, TSX is being built on top of the existing cache coherence mechanisms. Each cache line gains an additional bit to indicate if it is part of a transaction. Each memory operation is treated normally between the processor and the coherence hierarchy with several caveats. If a dirty, transactional block is evicted, then the transaction fails. If a dirty, transactional block is demand downgraded from modified to shared or invalid, then the transaction fails. In this case, a new message would indicate that the request to forward the data fails and the request should be satisfied by memory.

If the transaction commits, then the transactional bits are cleared on each cache line. And the lines operate normally according to existing cache coherence mechanisms.

Wrapping this up, TSX is an ISA extension that almost every program can take advantage of and therefore has an appeal toward conducting personal testing, just like building my own x64 machine back in 2005.

Monday, June 20, 2011

Book Review: Write Great Code Volume 1 (part 4 of 4)

As I stated in part 3, I found some erroneous information in chapter 11 about NUMA. While this book was published in 2004, I cannot excuse the mistakes given that the research dates to the 90s (c.f., The SGI Origin: A ccNUMA Highly Scalable Server - ISCA'97). Fortunately, they are relatively small (only a couple of paragraphs); however, the topic is valuable to know and I've spent a fair amount of my time trying to educate programmers on good practices with NUMA (c.f., Topology information under Windows) and so I don't appreciate books providing bad information.

NUMA or non-uniform memory access (or architecture) describes a system that has multiple sets of main memory. For programmers, this will usually lead to varying access time as some memory is "closer" than other memory addresses. Fortunately, modern operating systems (Windows and Linux) will give applications "close" memory when they allocate it, which works well most of the time.

In Write Great Code, the author is apparently unaware that main memory can have varying access times and therefore writes, "the term NUMA is used to describe blocks of memory that are electronically similar to main memory but, for one reason or another, operate significantly slower than main memory." And furthermore provides the examples of graphics card memory and flash devices.

The last chapter, 12, discusses I/O. Of all the chapters, this one is most tied to "current" hardware and therefore is the most dated. Here are three topics that I would revise, if there was a second edition: AGP versus PCI-E, polling versus interrupts, synchronous versus asynchronous. Accelerated Graphics Port (AGP) was a late-90s / early-00s connection that has been phased out for the more general PCI Express (PCI-E) interface; however, PCI-E wasn't standardized until 2004 and therefore missed being included. Second, at the highest performance levels, polling of devices can perform faster than interrupts. For example, there is always more data available with a 10Gb/s connection, so polling always returns successfully; however, interrupts are still better at lower transfer rates (although I don't know the crossover point). Finally, using asynchronous I/O is at the top of my list for things to know about devices. In all, this chapter stood at 75 pages and is effectively a rehash of what can now be found on wikipedia.

Overall, most of this work was subsumed by my undergraduate Computer Science coursework (particularly the intro systems course). It is therefore hard for me to know who the appropriate audience would be. Therefore, I have a neutral take on it. But this isn't the last, as volume 2 has also been checked out from the library.