Keynote: Bill Dally on Domain-Specific Accelerators
Moore's Law is over: sequential performance is increasing at only ~3% per year, and cost per transistor is flat or increasing.
Most power is spent moving data around, so simple ISAs such as RISC are actually power-inefficient compared with specialized operations. With special data types and operations, the hardware can be designed so that something taking 10s to 100s of cycles completes in one. Memory bandwidth can still bottleneck, since "bits are bits".
Genome matching, via the Smith-Waterman algorithm, can be done in a single cycle for many bases (10), whereas a CPU needs ~35 ALU ops and 15 loads/stores. The specialized hardware spends 3.1 pJ (10% of it in memory) versus 81 nJ on the CPU.
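As a concrete picture of the operation being specialized, here is a minimal pure-Python Smith-Waterman scoring kernel (the match/mismatch/gap scores are illustrative assumptions, not from the talk); each inner-loop cell is the kind of per-base ALU and load/store work the CPU pays for, while the specialized hardware evaluates a whole batch of cells in one cycle.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local-alignment score: H[i][j] is the best alignment score
    ending at a[i-1], b[j-1]; negative scores reset to 0."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Each cell: a compare, three adds, a 4-way max, plus the
            # loads/stores for the neighboring cells.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

The key structural point is that each anti-diagonal of `H` is independent, which is what lets hardware compute many cells in parallel per cycle.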
Communication is expensive in power, so be small and be local: 5 pJ/word for L1, 50 pJ/word for LLC, and 640 pJ/word for DRAM. Most of this power goes into driving the wires.
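A back-of-the-envelope with these figures (the 1024-word tile size is a hypothetical choice for illustration):

```python
# Energy per word moved, from the keynote's figures (picojoules).
ENERGY_PJ = {"L1": 5, "LLC": 50, "DRAM": 640}

def transfer_energy_nj(n_words, level):
    """Energy in nanojoules to move n_words from the given level."""
    return n_words * ENERGY_PJ[level] / 1000.0

# Streaming a 1024-word tile from DRAM vs. re-reading it from L1:
dram = transfer_energy_nj(1024, "DRAM")  # 655.36 nJ
l1 = transfer_energy_nj(1024, "L1")      # 5.12 nJ
ratio = dram / l1                        # 128x: why locality dominates
```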
Conventionally, sparse matrices need <1% of bits set to be worth using, due to the overhead of pointers, etc. Special-purpose hardware, however, can overcome this overhead.
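A quick sketch of where that overhead sits, assuming CSR format with one-word values and one-word indices (a common layout, chosen here for illustration): storage alone breaks even near 50% density, so the ~1% practical threshold on CPUs comes from the pointer-chasing indirection, which is exactly what specialized hardware hides.

```python
def dense_words(rows, cols):
    # Dense storage: one word per element.
    return rows * cols

def csr_words(rows, nnz):
    # CSR storage: one value word + one column-index word per nonzero,
    # plus rows + 1 row-pointer words.
    return 2 * nnz + rows + 1

rows = cols = 1000
# At 50% density CSR is already larger than dense (1_001_001 vs 1_000_000
# words), yet CPUs only win around ~1% density (21_001 words): the gap is
# the cost of indirect, irregular access, not of storage.
half_dense = csr_words(rows, rows * cols // 2)
one_percent = csr_words(rows, rows * cols // 100)
```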
A tensor core performs D = AB + C; the question is how to execute this as an instruction. On a GPU it costs ~30 pJ to fetch and decode an instruction and fetch its operands. Specialized instructions can then operate as efficiently as specialized hardware, but still pay that overhead; on a GPU it is ~20% of the power.
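A pure-Python sketch of the primitive's semantics (not the hardware API): for an n-by-n tile, one matrix-multiply-accumulate instruction covers n^3 multiply-adds, so the ~30 pJ fetch/decode/operand-fetch cost is amortized over the whole tile rather than paid per scalar op.

```python
def mma(A, B, C):
    """D = A @ B + C on small square tiles, the tensor-core primitive.
    For 4x4 tiles this is 64 multiply-adds behind one instruction's
    worth of fetch/decode overhead."""
    n = len(A)
    return [[C[i][j] + sum(A[i][k] * B[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)]
```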
Keynote: An Invisible Woman: The Inside Story Behind the VLSI Microelectronic Computing Revolution in Silicon Valley
Conjecture: Almost all people are blind to innovations, especially ones by 'others' whom they did not expect to make innovations. ('others' = 'almost all people')
Basically, we rarely notice innovations themselves, so they are ascribed to the perceived likely source (cf. the Matthew effect or the Matilda effect). Credit for innovation is highly visible, and many awards go to people with established reputations rather than to the actual innovator.