
Wednesday, October 16, 2019

Conference Attendance - MICRO 52 - Day 2/3

These are rough notes from the other two keynotes.

Keynote: Bill Dally on Domain-Specific Accelerators

Moore's Law is over.  Sequential performance is increasing at only about 3% per year, and the cost per transistor is steady or increasing.

Most power is spent moving data around, so simple ISAs such as RISC are actually inefficient power-wise compared to specialized operations.  With special data types and operations, the hardware can be designed so that something taking tens to hundreds of cycles completes in one.  Memory bandwidth can still be the bottleneck, as "bits are bits".

Genome matching via the Smith-Waterman algorithm can be done in a single cycle for many bases (around 10) in specialized hardware, while a CPU would need about 35 ALU ops and 15 loads/stores.  The specialized hardware spends 3.1 pJ (10% of it in memory) where the CPU spends 81 nJ.
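To make the per-base work concrete, here is a minimal Smith-Waterman local-alignment sketch in Python; the scoring parameters and sequences are illustrative assumptions, not values from the talk.  Each inner-loop cell does the compare/add/max work that the accelerator collapses into a single cycle across an anti-diagonal.

# Minimal Smith-Waterman local alignment; scoring values are illustrative.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]  # DP score matrix
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            # Each cell: one compare, a few adds, and a 4-way max.
            h[i][j] = max(0, diag, h[i-1][j] + gap, h[i][j-1] + gap)
            best = max(best, h[i][j])
    return best

print(smith_waterman("GATTACA", "GCATGCU"))  # small example pair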

Communication is expensive in power, so be small and be local: 5 pJ/word from the L1, 50 pJ/word from the LLC, and 640 pJ/word from DRAM.  Most of this power goes to driving the wires.
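A quick back-of-the-envelope in Python using those per-word costs (the traffic volume is my own assumption) shows how quickly the hierarchy separates:

# Energy to stream one million words from each level, using the
# per-word costs quoted in the talk (traffic volume is assumed).
costs_pj = {"L1": 5, "LLC": 50, "DRAM": 640}  # pJ per word
words = 1_000_000
for level, pj in costs_pj.items():
    print(f"{level}: {words * pj / 1e6:.0f} uJ")  # 1e6 pJ per uJ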

Conventionally, sparse matrices need less than 1% nonzeros to be worth using, due to the overhead of the pointers and indices.  However, special-purpose hardware can overcome this overhead.
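For a sense of where the overhead comes from, here is a rough storage comparison of dense fp32 against a CSR-style layout (a 4-byte value plus a 4-byte column index per nonzero, plus row pointers); the matrix size is an arbitrary assumption.  Storage alone breaks even near 50% density, so the <1% figure is really about the added cost of indirection and irregular access on general-purpose cores.

# Dense fp32 vs CSR storage; sizes chosen for illustration.
def dense_bytes(n, m):
    return n * m * 4  # 4 bytes per fp32 element

def csr_bytes(n, m, nnz):
    return nnz * (4 + 4) + (n + 1) * 4  # value + column index, plus row pointers

n = m = 10_000
for density in (0.5, 0.1, 0.01):
    nnz = int(n * m * density)
    print(f"density {density}: dense/CSR = {dense_bytes(n, m) / csr_bytes(n, m, nnz):.1f}x")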

A tensor core performs D = AB + C, so the question is how to execute this as one instruction.  On a GPU, fetching, decoding, and operand-fetching an instruction costs about 30 pJ.  Specialized instructions can then operate nearly as efficiently as specialized hardware, but with that per-instruction overhead, which on a GPU is roughly 20% of the power.
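For reference, here is the semantics of one such matrix-multiply-accumulate instruction written out with NumPy; the 4x4 tile shape matches what Volta-class tensor cores use (fp16 inputs, fp32 accumulation), but this shows only the math, not the datapath.

# Semantics of one tensor-core MMA: D = A @ B + C on a 4x4 tile,
# fp16 inputs with fp32 accumulation.
import numpy as np

a = np.random.rand(4, 4).astype(np.float16)
b = np.random.rand(4, 4).astype(np.float16)
c = np.random.rand(4, 4).astype(np.float32)
d = a.astype(np.float32) @ b.astype(np.float32) + c  # one instruction's worth of work
print(d)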

Keynote: An Invisible Woman: The Inside Story Behind the VLSI Microelectronic Computing Revolution in Silicon Valley

Conjecture: almost all people are blind to innovations, especially innovations by 'others' whom they did not expect to innovate.  (And 'others' here covers almost all people.)

Basically, we rarely notice innovations directly, so they are ascribed to the perceived likely source (cf. the Matthew effect or the Matilda effect).  Credit for innovation is highly visible, and many awards go to people with established reputations rather than to the actual innovator.

Monday, April 8, 2019

Presentation: The Quest for Energy Proportionality in Mobile & Embedded Systems

This is a summary of the presentation on "The Quest for Energy Proportionality in Mobile & Embedded Systems" by Lin Zhong.

We want mobile and other systems to be energy efficient, and in particular to use energy in proportion to the intensity of the required operation.  However, a processor design is energy proportional only over a limited operating region, given physical and engineering constraints.  ARM's big.LITTLE gives a greater range of efficiency by placing two ISA-compatible cores of different sizes onto the same chip; however, it is constrained by the need to keep the cores cache coherent.

Recent TI SoC boards also contain another ARM core, running the Thumb ISA for energy efficiency.  This additional core was hidden behind a TI driver (originally there to support MP3 playback), but it was recently exposed, allowing designs to use it as part of the computation.  This core, however, is not cache coherent with the main core on the board.

So Linux was extended to be deployed onto both cores (compiled for the different ISAs) while keeping its data structures in the common, shared memory space.  The application can then run on, and migrate between, the cores based on application hints about the required intensity of operations.  On migration, one core domain is put to sleep and releases the memory to the other.  This design avoids synchronization between the two domains, which simplifies the code, and the concurrency demands in the mobile space are low anyway.  It was also a rare demonstration of software-managed cache coherence.

In sum, DVFS provides about a 4x change in power, big.LITTLE another 5x, and the hidden Thumb core a further 10x reduction for low-intensity tasks such as mobile sensing.  Multiplied together, that is roughly a 200x range, covering a significant part of the energy / computation space.
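A toy governor sketch of how those ranges might be stitched together: pick an operating point from a task-intensity hint.  The thresholds and labels are illustrative assumptions, not from the presentation.

# Map a task-intensity hint in [0, 1] to an operating point.
# Thresholds and labels are illustrative assumptions.
def pick_operating_point(intensity):
    if intensity < 0.05:
        return "Thumb core"             # ~10x below LITTLE; sensing-style tasks
    if intensity < 0.3:
        return "LITTLE core, low DVFS"
    if intensity < 0.7:
        return "big core, mid DVFS"
    return "big core, max DVFS"         # roughly 200x the lowest point

for hint in (0.01, 0.2, 0.5, 0.9):
    print(hint, "->", pick_operating_point(hint))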

This still does not cover the entire space of computation, however.  At the lowest end there remains an energy-intensive ADC (analog-to-digital conversion) component, equivalent to tens of thousands of gates.  Many computations could instead be pushed into the analog domain, which saves power in two ways: the analog stage delivers a simpler result for digital consumption, and the computation can be performed on lower-quality input (tolerating noise), which reduces the energy demand.

Monday, January 18, 2016

Conference Attendance HiPEAC - Day 1 - MULTIPROG

It is once again conference time.  For North Americans this might seem rather early, as I am writing from Prague, Czech Republic (or at least I was when I started 12 hours ago).  I am attending HiPEAC, the premier European computer architecture conference.  HiPEAC is a dual-track conference.  Throughout the three days there is the paper track, where papers accepted to TACO (such as mine) are presented, and simultaneously there are workshops.  For the first day, I am starting with the MULTIPROG workshop, on Programmability and Architectures for Heterogeneous Multicores.

Let's start with the keynote, given by David Kaeli of Northeastern University, which covered:
- Concurrent execution of compute kernels
- Scheduling of kernels, deadlines
- Sharing / access to host memory (i.e., RAM)

The current model of using a GPGPU is that it runs one computation kernel; however, many problems would decompose better into several separate kernels.  It would also be valuable to have further examples of such problems (i.e., benchmarks).  And whenever you try running multiple anything on a computational resource, there is a runtime scheduling problem: which kernel should run next to best complete the overall problem?  A follow-on research question explores this in a cloud-based environment, where the GPU may be shared across entirely independent compute kernels; this requires the kernels to be tagged with IDs to keep their memory separate.  All of this sounds as if we need an OS for the GPU.
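As a sketch of that scheduling problem, here is a toy earliest-deadline-first queue for compute kernels in Python; the kernel names, costs, and deadlines are invented for illustration.

# Toy EDF scheduler for compute kernels with deadlines (all values invented).
import heapq

kernels = [("fft", 3, 10), ("conv", 7, 8), ("reduce", 1, 4)]  # (name, cost, deadline)
queue = [(deadline, cost, name) for name, cost, deadline in kernels]
heapq.heapify(queue)  # pop earliest deadline first

t = 0
while queue:
    deadline, cost, name = heapq.heappop(queue)
    t += cost  # run the kernel to completion
    status = "OK" if t <= deadline else "MISSED"
    print(f"{name}: done at t={t}, deadline {deadline} -> {status}")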

Following the late-morning break, we heard next from MECCA (MEeting the Challenges in Computer Architecture), framed around the 3 Ps: parallelism, power, and performance.  The idea is to annotate parallel programs to describe their concurrency, and then have the runtime manage the caches using those annotations: transferring data before it is required, installing it with the appropriate coherence states, and indicating when a block is dead and can be evicted from the cache.
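A minimal sketch of that annotation idea in Python, with the runtime actions reduced to prints; the API names are hypothetical, not from the talk.

# Hypothetical annotation API: mark when a data region becomes live
# (prefetch, set coherence state) and when it is dead (evict early).
from contextlib import contextmanager

@contextmanager
def data_region(name):
    print(f"prefetch {name}, install in appropriate coherence state")
    try:
        yield
    finally:
        print(f"mark {name} dead, eligible for early eviction")

with data_region("tile_a"):
    pass  # compute on tile_a here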

Then there was lunch, a rest after my flights, and then networking, especially the part where I stood by my poster and discussed my research for three hours.  Now to rest for day 2.