Elegant C

Monday, August 24, 2020

The Martian, Computer Science, and College

This summer, the school asked professors if they would be interested in leading book discussions with incoming first-year students in Computer Science. I, along with many other professors, volunteered, where each of us could select our specific title to discuss. I proposed reading, The Martian, by Andy Weir. What follows is not a review of the book, which I really enjoyed, but rather a summary of the discussion points from the hour we had together.

The following text contains many book spoilers.

We started the discussion with a short summary of my background and then a student asked about the Martian rover hacking. It is in my opinion, plausible. It depends on several assumptions, such as the rover's driver being able to be modified so easily to log the malformed network data (sent by the probe). It would then be reasonable to send commands to the probe to broadcast the necessary data to construct an executable on the rover. Then assuming that Mark can run it with sufficient privileges or that there is a known vulnerability allowing the executable to gain the privilege, the probe's data could be a patch. Personally, I enjoyed the thought of using ASCII to communicate and myself and my TAs agree, "man ascii". Besides, I carry an ASCII chart in my wallet.

We discussed how there was significant cooperation in solving the problems. The crew worked together. NASA had many teams working on the problems. Internationally, China also provided assistance. The people working on these problems were diverse. And there were continual concerns about the crew's mental state and about Mark's. Similarly, Computer Science students will need to learn to work with others, to work in groups with diverse backgrounds and skill sets, and know that there are many many more people that are wanting to see and willing to support them in having successful lives and taking steps whenever circumstances dictate.

Mark Watney survives in part by having a diverse education and training. Being a fictional character, he has the right skills to survive, but this is based in reality. Astronauts are trained in a diverse set of skills, particularly to maximize the value gained from their time in space. They are not experts, but rather trained well to exercise the guidance of experts on Earth. And similarly, I reinforced to the students that their studies should work to give them a broad foundation beyond Computer Science.

The final topic brought up by the students was about ethics. First, should NASA tell the crew of the Hermes that Mark Watney was alive on Mars, when it was first determined. Or instead, NASA would censor all communication to ensure that they were not informed that they abandoned Mark. What is the trade-off between the truth and mission results? Second, the Chinese scientists had to make a decision, is the life of one astronaut worth their probe? Should they give up their long-prepared mission of great scientific value to instead make a "grocery delivery"? How much is one life worth? Third, when the Rich Purnell plan presented an alternative to rescuing Mark, was NASA obligated to consult the crew in evaluating this option? And related, the crew of the Hermes decided to return to Mars (on the low chance of killing everyone plan) to save Mark Watney, and also extending their mission duration. Also briefly discussed was that governments also have to decide how much a life is worth. It is noted that the science that Mark can perform makes up for the cost of his rescue, which addresses this concern in story.

I think that the 20 or so students appreciated the hour we had together. I hope to some day be able to meet and ultimately teach them in person.

Wednesday, April 22, 2020

Thesis Proposal - Lightweight Preemptable Functions

Sol Boucher presented his thesis proposal today via Zoom. It was well attended and I think many of us are already looking forward to his defense.

Lightweight Preemptable Functions (LPF)
Function calls require some knowledge of the cost for a program to know whether it can invoke it or not in other time critical execution.

In the current space of multitasking, three approaches will be first reviewed: futures, threads, and processes. Futures rely on the runtime to provide asynchronous execution, but preemption requires yield points, and not all execution can support these points, nor can it be clear what this future will do. With threads, execution is fairly independent, but minimal information is provided toward scheduling and cancellation is hard. Processes provide good support for cancellation, except that on fork(), only one thread is required to be carried over in the new process. The other threads are canceled, which can result in inconsistent state.

The thesis proposal is: Introducing a novel abstraction for preemption at the granularity of a synchronous function call, which includes a timeout.

launch(function, timeout, argument) returning a continuation. If the timeout elapses, the continuation is returned, but it is then the choice of the caller for whether to resume it or cancel. This LPF executes within the same thread context as the caller, thereby reducing overhead. However, to resume the LPF, it will need a new stack. To support the timeout, it relies on a timer signal that can occur every ~5 microseconds. Launch / resume have overhead comparable to this, significantly better than fork or pthread_create. However, cancel is extremely expensive.

LPFs also have an issue with calling functions that are non-reentrant, similar to the rules governing signal handlers. To address this, the runtime provides selective relinking to capture what the LPF is calling via the global offset table (GOT). Some GOT entries point to dynamic libraries, other entries are initially pointing to the dynamic linker. This runtime support also needs to intercept thread local variables. This interception support imposes about 10ns of overhead, which is little above the cost of function calls themselves.

Microservice approaches have significant latency, often tends to hundreds of microseconds. Primarily the requirement to create a sufficient container, often via processes or virtualization. If the microservice was instead written safely and using LPFs, then the latency could be reduced toward the hardware bound as measured by communicating between VMs or committing transactions.

Cancellation cleanup is difficult in languages, such as C, that require explicit cleanup. In other languages, adding a new exception path for timeout and cancellation could then invoke the necessary destructors. Nonetheless, this can be expensive (perhaps single milliseconds).

Other possible future work:
Another cancellation step is the cost of unloading the mirrored library, so could the runtime instead track the changes made and then determine whether to roll back or discard.
Is it possible to reduce the overhead of the preemption signals or improving their granularity.

Wednesday, October 16, 2019

Conference Attendance - MICRO 52 - Day 2/3

This is a rough writing of the notes from the other two keynotes.

Keynote Bill Dally on Domain-Specific Accelerators

Moore's Law is over. Sequential performance is increasing at 3% per year. And cost per transistor is steady or increasing.

Most of power is spent moving data around, so simple ISAs such as RISC are actually inefficient power-wise versus specialized operations. With special data types and operations, the hardware can be designed so that something taking 10s to 100s of cycles is done in 1. Memory bandwidth can bottleneck, as "bits are bits".

Genome matching, via Smith-Waterman algorithm, can be done in single cycle for many bases (10), while a CPU would be 35 ALU ops and 15 load/store. And the specialized hardware is 3.1pJ (10% is memory) and the CPU is 81nJ.

Communication is expensive in power, so be small, be local. 5pJ/word for L1, 50pJ/word for LLC, and 640pJ/word for DRAM. And most of this power is driving the wires.

Conventionally, sparse matrices need <1% set bits to be worth using due the overhead of pointers, etc. However, special purpose hardware can overcome this overhead.

Tensor core performs D = AB + C, so how to execute this in an instruction. For a GPU, 30pJ to fetch / decode / operand fetch the instruction. So specialized instructions can then operate as efficiently as specialized hardware, but with that overhead. On a GPU that power is ~20% overhead.

Keynote: An Invisible Woman: The Inside Story Behind the VLSI Microelectronic Computing Revolution in Silicon Valley

Conjecture: Almost all people are blind to innovations, especially ones by 'others' whom they did not expect to make innovations. ('others' = 'almost all people')

Basically, we rarely notice any innovation, so they are ascribed to the perceived likely cause (c.f., the Matthew effect or the Matilda effect). Credit for innovations is highly visible, and many awards are ascribed to people with certain reputations rather than the specific innovator.

Monday, October 14, 2019

Conference Attendance - MICRO 52 - Day 1

I am in Columbus Ohio for MICRO 52. A third of the attendees drove from other "midwestern" universities, of which I am one.

Keynote: Rejuvenating Computer Architecture Research with Open-Source Hardware

Moore's Law is irrelevant now, as the cost per transistor has held steady since the 28mm technology node. The cost of any deployment depends on the development cost and only at very large scales, is the cost per transistor dominant. Given that, how can we reduce the cost of hardware development.

Cambrian explosion of (RISC) ISAs in mid-1980s on with a great diversity of ISAs being created and competing. Then the Intel Pentium came out, which combined the CISC ISA with a translation into the RISC micro ops. This extinction event destroyed most of those ISAs.

Why does the instruction set architecture (ISA) matter? It is the dominant interface in the system, defining the interaction between software and hardware. But ISAs are currently proprietary, and tied to the fortunes of the company. Many ISAs have come and gone. And then each SoC (system on a chip) gets custom ISAs for each accelerator.

So there is now the RISC-V ISA that is open for use and development (which I wrote about here). The RISC-V foundation was formed in 2015 to be the neutral guardian of the specification and formal model. Based on this specification, there are both open-source and commercial implementations of the hardware as well as the software ecosystem.

ComputeDRAM: In-Memory Compute Using Off-the-Shelf DRAMsDRAM is designed based on commands being sent in a specific order with appropriate timings. The oddity is that if specific commands and timings are used that violate the normal usage, then the DRAM module can perform certain operations, such as AND and OR using three specially prepared rows (source x2 and destination).

Hybrid Skiplist: Combining the Best of Near-Data-Processing and Lock-Free Algorithms

This is a student research competition work that I want to highlight. The work is taking skip-lists, a multi-level linked list to support more efficient traversals, which has been implemented on both near-data processing (NDP) systems as well as lock-free. The performance of the two implementations is comparable, but we should be able to do better. The observation is that lock-free gains by having the long, frequently-accessed links in the cache, while NDP gets the data items close. Therefore, let's combine the two approaches so the algorithm uses the lock-free approach on the long links, and leaves the rest in NDP. A dynamic approach then adapts which nodes are in the long list and promotes them, while demoting less frequently accessed elements.

Applying Deep Learning to the Cache Replacement ProblemLet's apply machine learning to cache replacement. Offline, a ML model can perform better than the best replacement schemes, but offline this requires lots of space, more than the cache itself. Current algorithms (such as Hawkeye) use just the current PC, whereas the observation is that the machine learning model includes history, so perhaps history can have value. Using this, they analyzed the history further to notice that this history information is not complete nor does it have to be ordered. If it does not need to be ordered, then the history is a feature list (i.e., bitvector) and not a full list, so the history feature gives an index into a table of predictors for whether a line is cache friendly in usage.

NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs

This is a release of a Pin-like tool, but for GPUs. Using the framework, you can write specific instrumentation to be applied to CUDA kernels. The framework does an analysis of the kernel to find the specific instrumentation points and then recompile / JIT the code to integrate the request types into the kernel without requiring the actual source code for the kernel. Such types as the specific instructions executed as counts or traces. And thereby build a simulator or error checker.

Saturday, September 28, 2019

Repost: Active Learning is Better even if Students Don't Like It

Active learning is a set of techniques that require the student to take an active role in their learning during lecture. Research strongly supports that students will learn more when the lecture utilizes these techniques. And I have measured this effect in my own courses. However, this research shows that students like lectures that use these techniques less even though they are learning more. And I have also informally measured this, such as students who say at the end of the first lecture, "If you are going to require me to participate in lecture, I will not return". Unfortunately, the present educational model is based on the student evaluations (primarily measuring what students like) to evaluate the quality of instruction. Therefore perversely, this aggregate model encourages suboptimal teaching and learning.

The paper recommends then that professors take time in the beginning of the semester to demonstrate the benefits and gain buy in from the students. And then continue to do so. Students want to learn, so they will support this pedagogy. And many students will recognize the value with time, if they give it.

Thursday, September 12, 2019

Thesis Proposal: Theoretical Foundations for Modern Multiprocessor Hardware

Naama Ben-David gave her proposal this morning on Theoretical Foundations for Modern Multiprocessor Hardware.

Is there a theoretical foundation for why exponential backoff is a good design? Exponential backoff is a practically developed algorithm that 0.

To develop such a foundation, we need to a model of time; however, requests are asynchronous and not according to a single time source. To address this, model time with adversarial scheduling. Thus when performing a request, there are three sources of delay:

self-delay: backoff, sleep, local computation
system-delay: interrupts, context switches
contention-delay: delay caused by contention

Given this model, the adversary can, to a limited degree, decide when requests that an entity's request have passed from self-delay into the system delay can then move to contention-delay and ultimately be completed.

In BBlelloch'17, this model was applied and the work measured for different approaches.

With no backoff, there is omega(n³) work.
Exp backoff reduces to theta(n² log n) bound on work
The paper also proposes a new algorithm that has high probability of O(n²)

The second phase of work is developing simple and efficient algorithms for systems that have non-volatile memory (NVRAM). With NVRAM, on a crash or system failure, the contents in memory persist across reboot (or other restore). This permits the system to restore the running program(s) to a finer degree than happens from auto-saves or other current techniques. However, systems also have caches, which are not persistent. Caches are presently managed by hardware and make decisions as to when to write contents back to memory. Algorithms must work with the caches to ensure that results are safely in memory at selected points of execution. There are a variety of approaches for how to select these points.

The third phase of work is modeling RDMA (remote direct memory access) systems. Can there be a model of the different parts of such a system: memory, NIC (network interface card), and CPU? Then explore the contention as well as possible failures in the system.

One scheme is for every processes to also be able to send messages on behalf of its shared memory neighbors, so that even if a process fails, its ability to participate in algorithms, such as consensus, is still possible.

Being a proposal, ongoing work will work on instantiations of these algorithms to measure the practical performance.

Monday, April 8, 2019

Presentation: The Quest for Energy Proportionality in Mobile & Embedded Systems

This is a summary of the presentation on "The Quest for Energy Proportionality in Mobile & Embedded Systems" by Lin Zhong.

We want mobile and other systems to be energy efficient, and particularly use energy in proportion to the intensity of the required operation. However, processor architectures only have limited regions where these are in proportion, given certain physical and engineering constraints on the design. ARM's big.LITTLE gives the a greater range in efficiency by placing two similar cores onto the same chip; however, it is constrained by a need to ensure the cores remain cache coherent.

The recent TI SoC boards also contained another ARM core, running the Thumb ISA for energy efficiency. This additional core was hidden behind a TI driver (originally to support MP3 playing), but was recently exposed, so allowing further design to utilize it as part of computation. But this core is not cache coherent with the other, main core on the board.

So Linux was extended to be deployed onto both cores (compiled for the different ISAs), while maintaining the data structures, etc in the common, shared memory space. Then the application can run and migrate between the cores, based on application hints as to the required intensity of operations. With migration, one of the core domains is put to sleep and releases the memory to the other core. This design avoids synchronization between the two domains, which simplifies the code and the concurrency demands are low in the mobile space. And here was a rare demonstration of software-managed cache coherence.

Therefore, DVFS provides about a 4x change in power, then big.LITTLE has another 5x. The hidden Thumb core supports an additional 10x reduction in power for those low intensity tasks, such as mobile sensing. Thus together, this covers a significant part of the energy / computation space.

However, this does not cover the entire space of computation. At the lowest space, there is still an energy intensive ADC component (analog digital conversion). This component is the equivalent of tens of thousands of gates. However, for many computations, they could be pushed into the analog space, which saves on power by computing a simpler result for digital consumption and that the computation can be performed on lower quality input (tolerating noise), which reduces the energy demand.