This is a summary of the presentation on "The Quest for Energy Proportionality in Mobile & Embedded Systems" by Lin Zhong.
We want mobile and other systems to be energy efficient and, in particular, to use energy in proportion to the intensity of the required operation. However, given physical and engineering constraints on the design, processor architectures are energy proportional over only a limited operating range. ARM's big.LITTLE gives a greater range in efficiency by placing two cores with the same ISA but different microarchitectures onto the same chip; however, it is constrained by the need to keep the cores cache coherent.
The recent TI SoC boards also contain another ARM core, running the Thumb ISA for energy efficiency. This additional core was hidden behind a TI driver (originally to support MP3 playback), but it was recently exposed, allowing further designs to utilize it as part of general computation. But this core is not cache coherent with the main core on the board.
So Linux was extended to be deployed onto both cores (compiled for the two different ISAs), while keeping the data structures, etc., in the common, shared memory space. The application can then run on and migrate between the cores, based on application hints about the required intensity of operations. On migration, one core domain is put to sleep and releases the memory to the other core. This design avoids synchronization between the two domains, which simplifies the code; concurrency demands are low in the mobile space anyway. And here was a rare demonstration of software-managed cache coherence.
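As a hedged sketch of the migration sequence described above (every name below is my own illustration, not the actual system's interface):

```c
/* Hypothetical sketch of the cross-domain migration; all names are my
 * own illustration, not the real system's. Shared state lives in the
 * common DRAM; only one domain runs at a time, which is what removes
 * the need for cross-domain synchronization. */
#include <stdio.h>

struct domain { const char *name; };

static void quiesce(struct domain *d)      { printf("quiesce %s\n", d->name); }
static void flush_caches(struct domain *d) { printf("flush %s caches\n", d->name); }
static void sleep_domain(struct domain *d) { printf("sleep %s\n", d->name); }
static void wake_domain(struct domain *d)  { printf("wake %s\n", d->name); }

/* Software-managed coherence: every dirty line is flushed before the
 * hand-off, so the destination sees a consistent shared memory. */
void migrate(struct domain *src, struct domain *dst) {
    quiesce(src);
    flush_caches(src);
    sleep_domain(src);   /* old domain releases the shared memory... */
    wake_domain(dst);    /* ...and the other core picks it up        */
}

int main(void) {
    struct domain main_core = { "main" }, thumb_core = { "thumb" };
    migrate(&main_core, &thumb_core);   /* drop to the low-power core */
    return 0;
}
```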
Summing up: DVFS provides about a 4x range in power, big.LITTLE adds another 5x, and the hidden Thumb core supports an additional 10x reduction for low-intensity tasks such as mobile sensing. Together that is roughly a 200x dynamic range, covering a significant part of the energy / computation space.
However, this does not cover the entire space of computation. At the lowest end, there is still an energy-intensive analog-to-digital conversion (ADC) component, equivalent to tens of thousands of gates. Many computations, though, could be pushed into the analog domain. This saves power in two ways: the analog stage computes a simpler result for digital consumption, and the computation can be performed on lower-quality input (tolerating noise), which reduces the energy demand.
Friday, September 14, 2018
Is C low level?
A recent ACM article, C is Not a Low-Level Language, argues that, for all of our impressions that C is close to the hardware, it is not actually "low-level". The argument is as follows: C was written for the PDP-11 architecture, and at the time it was a low-level language. As architectures have evolved, C's machine model has diverged from the hardware, which has forced processor designs to add new features to attain good performance with C, such as superscalar execution for ILP and extensive branch prediction.
Processors must also maintain caches to hide the memory latency, which requires significant logic to maintain coherence and the illusion that memory is shared between the threads of a process. Furthermore, the compiler is called upon to find optimization opportunities that may be unsound and certainly require programmer-years to implement.
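To make the compiler point concrete, here is a standard C aliasing example of my own (not from the article): without restrict, the compiler must assume the two pointers may overlap and generate conservative code.

```c
/* Without restrict, the compiler must assume a and b may alias, so it
 * cannot freely reorder or vectorize the loop without a runtime overlap
 * check; proving independence falls to the programmer. */
void scale(float *a, const float *b, int n) {
    for (int i = 0; i < n; i++)
        a[i] = 2.0f * b[i];   /* conservative: loads/stores kept in order */
}

/* With restrict, the programmer asserts no aliasing, and the compiler
 * is free to emit wide vector loads and stores. */
void scale_restrict(float *restrict a, const float *restrict b, int n) {
    for (int i = 0; i < n; i++)
        a[i] = 2.0f * b[i];   /* vectorizable */
}
```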
The author repeatedly contrasts C with GPUs, while noting that they require very specific problems, or come "at the expense of requiring explicitly parallel programs". If you were not keeping track, a GPU requires thousands of threads to match the CPU's performance. The author calls for "A processor designed purely for speed, not for a compromise between speed and C support, [which] would likely support large numbers of threads, have wide vector units, and have a much simpler memory model." That generally sounds like the GPU design.
I appreciate the callouts to C's shortcomings, which it certainly has. The notion that C has driven processor design is odd, yet it does reflect the fact that processors are designed to run existing programs fast. And with those programs written in either C or a language built on C, many programs are forced into particular patterns. I even spent some time in my PhD studies considering a version of this problem: how do you design a new "widget" into the architecture when no programs have been written expecting widgets to be available?
All to say, I think C is a low-level language; while its distance from the hardware may be growing, there is nothing else beneath it. This is a gap that needs to be addressed, ideally by a language with explicit parallel support.
Monday, March 21, 2016
PhD Defense - Simple DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Subsystems
Vivek Seshadri gave his PhD defense today, covering how different memory resources have different granularities of access and therefore need different management techniques. These techniques came out of understanding what the hardware could do, without necessarily identifying common features in existing applications that would require or benefit from them.
Page Overlays / Overlay-on-Write: provide the ability to assign physical addresses at sub-page granularity (call these units overlays). This reduces the cost of sparse copy-on-write: in effect, assign a fresh sub-page unit and copy only to that location. On every access, check the overlay table in parallel with normal translation to determine whether to use the normal mapping or go to the overlay location.
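A rough sketch of that parallel check (the structure names and sizes are my own illustration, assuming 4 KB pages and 64-byte units, not the thesis's actual design):

```c
/* Toy model of the overlay lookup; names and sizes are illustrative.
 * Assumes 4 KB pages divided into 64 sub-page units (64-byte lines). */
#include <stdint.h>

typedef struct {
    uint64_t  overlay_bitmap;  /* bit i set => unit i has been overlaid */
    uintptr_t overlay_base;    /* physical base of the overlay copies   */
} overlay_entry;

uintptr_t translate(uintptr_t vaddr, uintptr_t normal_paddr,
                    const overlay_entry *oe) {
    unsigned unit = (vaddr >> 6) & 63;             /* which 64-byte unit  */
    if ((oe->overlay_bitmap >> unit) & 1)          /* checked in parallel */
        return oe->overlay_base + (vaddr & 0xFFF); /* redirect to overlay */
    return normal_paddr;                           /* normal translation  */
}
```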
Gather-Scatter DRAM: provide support for requesting only a subset of each cache line. First, shuffle the data within a cache line so that the same subset of multiple cache lines maps to different chips in DRAM. Second, introduce additional logic (just a few gates) in DRAM to compute a modified address, where the default pattern (stride 1) remains the normal, unmodified access.
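To make the shuffling idea concrete, here is a toy model of my own (the actual mapping in the work is more careful than a plain rotation):

```c
/* Toy model of the in-line shuffle: word k of cache line j is stored in
 * DRAM chip (k + j) mod N, so gathering "column k" across N consecutive
 * lines touches all N chips in parallel instead of hammering one chip.
 * The rotation is an illustrative stand-in for the real shuffle. */
#define CHIPS 8   /* one word per chip per line, illustrative */

unsigned chip_of(unsigned line_index, unsigned word_index) {
    return (word_index + line_index) % CHIPS;
}
```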
RowClone + Buddy DRAM: can DRAM speed up memcpy (and other bulk memory operations)? First, by opening one row after another, the bitlines take the first row's values and then write them into the second row. More complex is opening three rows simultaneously, which results in a bitwise operation across the three rows: final = (C & (A | B)) | (~C & (A & B)). By controlling C, bulk bitwise operations are possible. Using this technique, the system can exceed the memory bandwidth for these operations.
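The quoted formula is a bitwise select between OR (C = 1) and AND (C = 0), and it equals the three-input bitwise majority; a quick brute-force check of my own over all input combinations confirms this:

```c
/* Check the quoted identity final = C(A|B) | ~C(A&B) against the
 * three-input bitwise majority over all 8 single-bit combinations. */
#include <assert.h>
#include <stdint.h>

int main(void) {
    for (uint32_t x = 0; x < 8; x++) {
        uint32_t A = x & 1, B = (x >> 1) & 1, C = (x >> 2) & 1;
        uint32_t maj   = (A & B) | (B & C) | (C & A);
        uint32_t final = (C & (A | B)) | (~C & (A & B));
        assert(maj == (final & 1));   /* identical on every input */
    }
    return 0;
}
```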
Dirty-Block Index: the problem is that if the source is dirty in the cache, it must be written back before the previous techniques can be used. DBI provides a faster lookup mechanism to determine whether, and where, any dirty blocks exist.
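A toy sketch of the kind of structure involved (sizes and names are my own, not the thesis's):

```c
/* Toy Dirty-Block Index: one bit vector per tracked DRAM row, so
 * "are any blocks of this row dirty, and which?" becomes a single
 * lookup instead of a walk over the cache tags. Sizes illustrative. */
#include <stdint.h>

typedef struct {
    uint64_t row_tag;      /* which DRAM row this entry tracks         */
    uint64_t dirty_bits;   /* bit i set => block i of the row is dirty */
} dbi_entry;

static inline int row_has_dirty(const dbi_entry *e, uint64_t row) {
    return e->row_tag == row && e->dirty_bits != 0;
}
```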
These techniques are interesting, but as the candidate noted, they are in effect solutions in search of a problem. And with DRAM being commodity hardware, it is difficult to envision these techniques being adopted without further work.
Wednesday, March 27, 2013
Transactional Memory and Intel's TSX
It was some years ago that I sat in the audience and heard AMD present a sketch of how transactional memory (TM) would be implemented in the x64 ISA. More recently, a fellow student mentioned that Intel has some new extensions entering the x64 ISA for locks, etc. I'm always a fan of properly implemented locks, as poorly implemented ones so often limit performance and scalability. So let's dig into Intel's TSX and why I actually want to go buy a gadget when it's released.
A programmer can delineate the transactional section with the XBEGIN and XEND instructions. Within the transactional section, all reads and writes are added to a read-set or a write-set accordingly. The granularity of tracking is a cache line. If another processor writes to a line in the read-set, or reads or writes a line in the write-set, then the transaction aborts.
Transactions can be semi-nested: a nested transaction can only commit when the outermost transaction completes, so internally nested transactions do not commit on XEND. If any transaction in the nest aborts, the entire transaction aborts. When the number of XENDs executed equals the number of XBEGINs (|XBEGIN| = |XEND|), the entire transaction commits and becomes globally visible. Transactions can also be explicitly aborted with the XABORT instruction, which lets the code abort early when it can determine that the transaction will or should fail.
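For reference, the RTM side of TSX is exposed in C through intrinsics in <immintrin.h> that map onto these instructions. A minimal sketch (gcc/clang with -mrtm, TSX hardware at runtime; the fallback here is a simple atomic increment, though a real elision scheme would take a lock):

```c
/* Minimal RTM sketch using the <immintrin.h> intrinsics, which map
 * onto XBEGIN/XEND/XABORT. A non-transactional fallback is mandatory:
 * any transaction may abort (conflict, capacity, interrupt, ...). */
#include <immintrin.h>
#include <stdio.h>

static long counter;

void increment(void) {
    unsigned status = _xbegin();            /* XBEGIN */
    if (status == _XBEGIN_STARTED) {
        counter++;                          /* line joins the write-set  */
        _xend();                            /* XEND: commit if outermost */
    } else {
        /* aborted; status flags report why, e.g. _XABORT_CONFLICT */
        __sync_fetch_and_add(&counter, 1);  /* fallback: atomic RMW */
    }
}

int main(void) {
    increment();
    printf("%ld\n", counter);
    return 0;
}
```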
As I understand it, TSX is built on top of the existing cache coherence mechanisms. Each cache line gains an additional bit to indicate whether it is part of a transaction. Each memory operation is treated normally between the processor and the coherence hierarchy, with several caveats. If a dirty transactional block is evicted, the transaction fails. Likewise, if a dirty transactional block is demand-downgraded from modified to shared or invalid, the transaction fails; in this case, a new coherence message would indicate that the request to forward the data has failed and that the request should instead be satisfied by memory.
If the transaction commits, the transactional bits are cleared on each cache line, and the lines thereafter operate normally under the existing cache coherence mechanisms.
Wrapping this up: TSX is an ISA extension that almost every program can take advantage of, which makes it appealing for hands-on personal testing, just like building my own x64 machine back in 2005.