Elegant C: DRAM

Monday, October 14, 2019

Conference Attendance - MICRO 52 - Day 1

I am in Columbus, Ohio for MICRO 52. A third of the attendees drove in from other "midwestern" universities, and I am one of them.

Keynote: Rejuvenating Computer Architecture Research with Open-Source Hardware

Moore's Law is irrelevant now, as the cost per transistor has held steady since the 28nm technology node. The cost of any deployment depends on the development cost, and only at very large scales is the cost per transistor dominant. Given that, how can we reduce the cost of hardware development?

There was a Cambrian explosion of (RISC) ISAs from the mid-1980s on, with a great diversity of ISAs being created and competing. Then the Intel Pentium came out, which combined the CISC ISA with a translation into RISC micro-ops. This extinction event destroyed most of those ISAs.

Why does the instruction set architecture (ISA) matter? It is the dominant interface in the system, defining the interaction between software and hardware. But ISAs are currently proprietary, and tied to the fortunes of the company. Many ISAs have come and gone. And then each SoC (system on a chip) gets custom ISAs for each accelerator.

So there is now the RISC-V ISA that is open for use and development (which I wrote about here). The RISC-V foundation was formed in 2015 to be the neutral guardian of the specification and formal model. Based on this specification, there are both open-source and commercial implementations of the hardware as well as the software ecosystem.

ComputeDRAM: In-Memory Compute Using Off-the-Shelf DRAMs

DRAM is designed based on commands being sent in a specific order with appropriate timings. The oddity is that if specific commands and timings are used that violate the normal usage, then the DRAM module can perform certain operations, such as AND and OR using three specially prepared rows (source x2 and destination).

Hybrid Skiplist: Combining the Best of Near-Data-Processing and Lock-Free Algorithms

This is a student research competition work that I want to highlight. The work looks at skip-lists, multi-level linked lists that support more efficient traversals, which have been implemented both on near-data processing (NDP) systems and as lock-free data structures. The performance of the two implementations is comparable, but we should be able to do better. The observation is that lock-free gains by having the long, frequently-accessed links in the cache, while NDP gets the data items close. Therefore, let's combine the two approaches so the algorithm uses the lock-free approach on the long links, and leaves the rest in NDP. A dynamic approach then adapts which nodes are in the long list, promoting frequently accessed elements and demoting less frequently accessed ones.
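To make the split concrete, here is a minimal sketch of my own (not the authors' code) of a skip-list search that counts how many hops land on the long links versus the short ones. The names `SPLIT_LEVEL`, `struct hops`, and the deterministic node heights are all assumptions for illustration; a real skip-list picks heights randomly, and the dynamic promotion and demotion is not modelled.

```c
#include <stddef.h>

#define MAX_LEVEL   4    /* levels 0..3 */
#define SPLIT_LEVEL 2    /* hypothetical: levels >= 2 are the "long" links kept host-side */
#define MAX_NODES   64

struct node {
    int key;
    int next[MAX_LEVEL]; /* index of the next node at each level, -1 for none */
};

struct skiplist {
    struct node nodes[MAX_NODES + 1]; /* nodes[0] is the head sentinel */
    int count;
};

struct hops {
    int host; /* hops taken on long links (lock-free, cached on the host) */
    int ndp;  /* hops taken on short links (left to near-data processing) */
};

/* Deterministic height so the example is reproducible: every 2nd node
 * reaches level 1, every 4th level 2, every 8th level 3. */
static int height_of(int i)
{
    int h = 1;
    while (h < MAX_LEVEL && (i & 1) == 0) {
        h++;
        i >>= 1;
    }
    return h;
}

/* Build the list from keys that are already sorted in ascending order. */
static void skiplist_build(struct skiplist *s, const int *sorted_keys, int n)
{
    int last[MAX_LEVEL] = {0}; /* last node linked at each level, starts at head */

    if (n > MAX_NODES)
        n = MAX_NODES;
    s->nodes[0].key = 0;
    for (int l = 0; l < MAX_LEVEL; l++)
        s->nodes[0].next[l] = -1;

    for (int i = 1; i <= n; i++) {
        int h = height_of(i);
        s->nodes[i].key = sorted_keys[i - 1];
        for (int l = 0; l < MAX_LEVEL; l++)
            s->nodes[i].next[l] = -1;
        for (int l = 0; l < h; l++) {
            s->nodes[last[l]].next[l] = i;
            last[l] = i;
        }
    }
    s->count = n;
}

/* Standard top-down search; returns 1 if key is present, 0 otherwise,
 * and records on which side of the split each hop happened. */
static int skiplist_find(const struct skiplist *s, int key, struct hops *h)
{
    int cur = 0;

    h->host = 0;
    h->ndp = 0;
    for (int l = MAX_LEVEL - 1; l >= 0; l--) {
        int nxt;
        while ((nxt = s->nodes[cur].next[l]) != -1 && s->nodes[nxt].key < key) {
            cur = nxt;
            if (l >= SPLIT_LEVEL)
                h->host++;
            else
                h->ndp++;
        }
    }
    int cand = s->nodes[cur].next[0];
    return cand != -1 && s->nodes[cand].key == key;
}
```

In this toy version, most of the distance is covered by the few long links, which is the part that would stay in the host cache under the hybrid scheme.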

Applying Deep Learning to the Cache Replacement Problem

Let's apply machine learning to cache replacement. Offline, an ML model can perform better than the best replacement schemes, but that model requires lots of space, more than the cache itself. Current algorithms (such as Hawkeye) use just the current PC, whereas the machine learning model includes history, so perhaps history has value. Analyzing the history further, they noticed that the history information need not be complete, nor does it have to be ordered. If it does not need to be ordered, then the history is a feature set (i.e., a bitvector) rather than a full list, and the history feature gives an index into a table of predictors for whether a line's usage is cache friendly.
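A rough sketch of the "unordered history as a bitvector" idea, as I understood it. This is not the paper's predictor: the hash, the 12-bit feature width, the saturating counters, and every name here (`history_feature`, `train`, `predict_friendly`) are my own assumptions. The training label is simply supplied by the caller.

```c
#include <stddef.h>
#include <stdint.h>

#define FEATURE_BITS 12u                  /* assumed width of the history bitvector */
#define TABLE_SIZE   (1u << FEATURE_BITS) /* one predictor per possible bitvector */

/* Saturating counters in [-4, 3]; >= 0 means "predict cache friendly". */
static int8_t counters[TABLE_SIZE];

/* Fold a list of recent PCs into a bitvector. Because the bits are ORed
 * together, neither the order nor repeated PCs change the result. */
static uint32_t history_feature(const uint64_t *pcs, size_t n)
{
    uint32_t bits = 0;
    for (size_t i = 0; i < n; i++)
        bits |= 1u << ((pcs[i] >> 2) % FEATURE_BITS);
    return bits;
}

/* Look up the predictor selected by the history feature. */
static int predict_friendly(uint32_t feature)
{
    return counters[feature % TABLE_SIZE] >= 0;
}

/* Nudge the selected counter toward the observed outcome. */
static void train(uint32_t feature, int was_friendly)
{
    int8_t *c = &counters[feature % TABLE_SIZE];
    if (was_friendly) {
        if (*c < 3)
            (*c)++;
    } else {
        if (*c > -4)
            (*c)--;
    }
}
```

The point of the sketch is only that two histories with the same PCs in a different order select the same predictor, which is what makes a small table feasible.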

NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs

This is a release of a Pin-like tool, but for GPUs. Using the framework, you can write specific instrumentation to be applied to CUDA kernels. The framework analyzes the kernel to find the specific instrumentation points and then recompiles / JITs the code to integrate the requested instrumentation into the kernel, without requiring the actual source code for the kernel. The instrumentation can collect, for example, the specific instructions executed as counts or traces, and can thereby be used to build a simulator or error checker.

Monday, March 21, 2016

PhD Defense - Simple DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Subsystems

Vivek Seshadri gave his PhD defense today covering: how different memory resources have different granularities of access, and therefore need different management techniques. These techniques come out of understanding what hardware could do, without necessarily identifying common features in existing applications that would require / benefit from these techniques.

Page Overlays / Overlay-on-Write: provide the ability to assign physical addresses at sub-page granularities (call them overlays). This reduces the cost of sparse copy-on-writes. In effect, assign a fresh sub-page unit and copy to that location. On every access, check the overlay table in parallel to determine whether to use the normal translation or go to the overlay location.
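As a minimal software model of overlay-on-write (the real mechanism lives in hardware and the page tables), the sketch below assumes 4 KB pages and 64-byte overlay units, so one 64-bit bitmap covers a page. The struct layout and the function names are hypothetical, and the bitmap check stands in for the parallel overlay-table lookup.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE      4096u
#define UNIT_SIZE      64u                      /* assumed overlay granularity */
#define UNITS_PER_PAGE (PAGE_SIZE / UNIT_SIZE)  /* 64, so one uint64_t bitmap */

struct overlay_page {
    const uint8_t *shared;          /* the shared, read-only physical page */
    uint64_t overlaid;              /* bit i set: unit i lives in the overlay */
    uint8_t *unit[UNITS_PER_PAGE];  /* private sub-page copies, allocated lazily */
};

/* Read one byte: use the overlay if the unit has one, else the shared page. */
static uint8_t overlay_read(const struct overlay_page *p, uint32_t off)
{
    uint32_t u = off / UNIT_SIZE;
    if ((p->overlaid >> u) & 1)
        return p->unit[u][off % UNIT_SIZE];
    return p->shared[off];
}

/* Write one byte: on the first write to a unit, copy only that unit
 * (not the whole page) into fresh storage. Returns 0, or -1 if the
 * allocation fails. */
static int overlay_write(struct overlay_page *p, uint32_t off, uint8_t value)
{
    uint32_t u = off / UNIT_SIZE;
    if (!((p->overlaid >> u) & 1)) {
        uint8_t *fresh = malloc(UNIT_SIZE);
        if (fresh == NULL)
            return -1;
        memcpy(fresh, p->shared + u * UNIT_SIZE, UNIT_SIZE);
        p->unit[u] = fresh;
        p->overlaid |= (uint64_t)1 << u;
    }
    p->unit[u][off % UNIT_SIZE] = value;
    return 0;
}
```

A single write therefore costs a 64-byte copy rather than a 4 KB one, which is where the savings for sparse copy-on-write come from. A complete version would also free the overlay units when the page is torn down.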

Gather-Scatter DRAM: provide support for only requesting a subset of cachelines. First, shuffle the data in a cacheline so that the same subset of multiple cache lines will map to different chips in DRAM. Second, introduce additional logic (just a few gates) in DRAM that will compute a modified address, where the default pattern (stride 1) is the normal, un-modified access.
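The two steps can be modelled in a few lines. This is a simplified sketch under my own assumptions, not the design from the defense: 8 chips each supplying one 8-byte word, an XOR-based shuffle, and a per-chip column computed as (chip AND pattern) XOR column. All names are hypothetical.

```c
#include <stdint.h>

#define CHIPS 8 /* assumed: 8 chips, each supplying one 8-byte word of a 64-byte line */
#define COLS  8 /* small model: 8 cache lines (columns) in the open row */

/* chip_data[chip][col] is the word that a chip stores at a column. */
struct dram_row {
    uint64_t chip_data[CHIPS][COLS];
};

/* Step 1, shuffle on write: word w of the line at column col is stored in
 * chip (w XOR col), so word w of different lines lands in different chips. */
static void write_line(struct dram_row *r, unsigned col, const uint64_t line[CHIPS])
{
    for (unsigned w = 0; w < CHIPS; w++)
        r->chip_data[w ^ (col % CHIPS)][col % COLS] = line[w];
}

/* Step 2, per-chip address logic: pattern 0 leaves the column unchanged,
 * which is the normal stride-1 access. */
static unsigned chip_column(unsigned chip, unsigned col, unsigned pattern)
{
    return ((chip & pattern) ^ col) % COLS;
}

/* Read with a pattern. Each chip returns the word at its own translated
 * column; the word from chip k is placed at position (k XOR col).
 * With pattern 0, out[] is the line at col in its original word order.
 * With pattern 7, out[c] is word number col of line c. */
static void read_pattern(const struct dram_row *r, unsigned col,
                         unsigned pattern, uint64_t out[CHIPS])
{
    for (unsigned k = 0; k < CHIPS; k++)
        out[k ^ (col % CHIPS)] = r->chip_data[k][chip_column(k, col, pattern)];
}
```

With pattern 7, one read command returns the same field from eight different cache lines, which is the strided access that normally wastes most of each line fetched.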

RowClone + BuddyDRAM: can DRAM speed up memcpy (and other bulk memory operations)? First, by opening one row after another, the bitlines take the value of the first row and then write it into the second row. More complex is opening multiple rows simultaneously, which results in bit-wise operations across the three rows: final = C (A | B) | ~C (A & B). By controlling C, bulk bitwise operations are possible. Using this technique, the system can exceed the memory bandwidth for these operations.
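The formula is just a bitwise majority of the three rows, and it is easy to check in software that fixing C selects the operation. The sketch below models rows as arrays of 64-bit words; the function names are mine, and this says nothing about the DRAM timing needed to trigger the effect.

```c
#include <stddef.h>
#include <stdint.h>

/* final = C (A | B) | ~C (A & B), applied to 64 bitlines at once. */
static uint64_t tra_word(uint64_t a, uint64_t b, uint64_t c)
{
    return (c & (a | b)) | (~c & (a & b));
}

/* Model of the three-row operation over a whole row, with the control
 * row C holding the same word everywhere. */
static void bulk_tra(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                     uint64_t control, size_t words)
{
    for (size_t i = 0; i < words; i++)
        dst[i] = tra_word(a[i], b[i], control);
}

/* Control row of all zeros: the result is A AND B. */
static void bulk_and(uint64_t *dst, const uint64_t *a, const uint64_t *b, size_t words)
{
    bulk_tra(dst, a, b, 0, words);
}

/* Control row of all ones: the result is A OR B. */
static void bulk_or(uint64_t *dst, const uint64_t *a, const uint64_t *b, size_t words)
{
    bulk_tra(dst, a, b, ~(uint64_t)0, words);
}
```

In the DRAM version the loop disappears: every bitline in the row computes its result at the same time, which is why the operation is not limited by the memory channel bandwidth.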

DirtyBlock Index: the problem is that if the source is dirty in the cache, then it needs to be written back before the previous techniques can be used. DBI provides a faster lookup mechanism to determine whether, and where, there are any dirty blocks.
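One way to picture such an index, assuming it keeps one bit vector of dirty blocks per DRAM row (the sizes, layout, and names below are my own illustration, not the actual hardware structure):

```c
#include <stdint.h>

#define DBI_ROWS       1024u /* hypothetical number of DRAM rows tracked */
#define BLOCKS_PER_ROW 64u   /* hypothetical, so one uint64_t covers a row */

struct dbi {
    uint64_t dirty[DBI_ROWS]; /* bit b of dirty[r]: block b of row r is dirty */
};

/* Record that a cached block of the given row has been modified. */
static void dbi_mark_dirty(struct dbi *d, uint32_t row, uint32_t block)
{
    d->dirty[row % DBI_ROWS] |= (uint64_t)1 << (block % BLOCKS_PER_ROW);
}

/* One lookup answers "is anything in this row dirty?", instead of
 * probing the cache tags for every block of the row. */
static int dbi_row_has_dirty(const struct dbi *d, uint32_t row)
{
    return d->dirty[row % DBI_ROWS] != 0;
}

/* Return the lowest-numbered dirty block of the row and clear its bit,
 * as if it had just been written back; return -1 if the row is clean. */
static int dbi_pop_dirty(struct dbi *d, uint32_t row)
{
    uint64_t *v = &d->dirty[row % DBI_ROWS];
    for (uint32_t b = 0; b < BLOCKS_PER_ROW; b++) {
        if ((*v >> b) & 1) {
            *v &= ~((uint64_t)1 << b);
            return (int)b;
        }
    }
    return -1;
}
```

Before a row is used as the source of an in-DRAM copy, the controller would drain it by calling `dbi_pop_dirty` until it returns -1, writing back each block it reports.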

These techniques are interesting, but as the candidate noted, they are in effect solutions in search of a problem. And with DRAM being commodity hardware, it is difficult to envision these techniques being adopted without further work.