
Monday, April 8, 2019

Presentation: The Quest for Energy Proportionality in Mobile & Embedded Systems

This is a summary of the presentation on "The Quest for Energy Proportionality in Mobile & Embedded Systems" by Lin Zhong.

We want mobile and other systems to be energy efficient, and in particular to use energy in proportion to the intensity of the required operation.  However, processor architectures are energy proportional only within limited regions, given certain physical and engineering constraints on the design.  ARM's big.LITTLE extends the efficiency range by pairing a high-performance core with a power-efficient core (implementing the same ISA) on the same chip; however, it is constrained by the need to keep the cores cache coherent.

Recent TI SoC boards also contain another ARM core, running the Thumb ISA for energy efficiency.  This additional core was hidden behind a TI driver (originally to support MP3 playback), but it has recently been exposed, allowing further designs to use it as part of general computation.  This core, however, is not cache coherent with the main core on the board.

So Linux was extended to be deployed onto both cores (compiled for the two ISAs), while keeping the data structures, etc. in the common, shared memory space.  An application can then run on and migrate between the cores, based on application hints about the required intensity of operations.  On migration, one core domain is put to sleep and releases the memory to the other core.  This design avoids synchronization between the two domains, which simplifies the code; concurrency demands are low in the mobile space anyway.  It was also a rare demonstration of software-managed cache coherence.  A hypothetical sketch of such a hint interface follows.
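To make that concrete, here is a minimal sketch of what an application hint might look like.  This is hypothetical: the hint_intensity() name and interface are invented for illustration, not taken from the talk.

#include <cstdio>

// Hypothetical intensity levels an application might report.
enum class Intensity { Low, High };

// Stub standing in for the real (hypothetical) migration call.  In
// the system as described, the kernel would put the current core
// domain to sleep and release the shared memory (Linux data
// structures included) to the other domain, so the two domains never
// run concurrently.
static void hint_intensity(Intensity level) {
  std::printf("hint: %s\n", level == Intensity::Low ? "low" : "high");
}

int main() {
  hint_intensity(Intensity::Low);   // e.g., migrate to the Thumb core for sensing
  // ... low-intensity sensing work ...
  hint_intensity(Intensity::High);  // migrate back to the main core
  // ... heavier processing ...
  return 0;
}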

In sum, DVFS provides about a 4x change in power, big.LITTLE another 5x, and the hidden Thumb core a further 10x reduction for low-intensity tasks such as mobile sensing; end to end, that is roughly 4 × 5 × 10 ≈ 200x of range.  Together, this covers a significant part of the energy / computation space.

However, this does not cover the entire space of computation.  At the lowest end, there is still an energy-intensive ADC (analog-to-digital conversion) component, equivalent to tens of thousands of gates.  Many computations, though, could be pushed into the analog domain, which saves power in two ways: the digital side only consumes a simpler, already-computed result, and the computation can be performed on lower-quality input (tolerating noise), which reduces the energy demand.

Tuesday, October 24, 2017

PhD Defense - Low-level Concurrent Programming Using the Relaxed Memory Calculus

Today, I went to the thesis defense of Michael Sullivan, who passed.  The work was at a delightful intersection of my interests.

We want better (more usable, etc) semantics for low-level operations, those below std::atomic<> and similar designs.  Perhaps this is achievable with ordering constraints.  Given the following simple example, what constraints are required?

int data, flag;            // plain globals: nothing yet enforces an ordering

void send(int msg) {
  data = msg;              // write the payload...
  flag = 1;                // ...then raise the flag
}

int recv() {
  while (!flag) continue;  // spin until the flag is raised...
  return data;             // ...then read the payload
}

Two constraints: data must be visible before flag, and flag must be executed before data.  These constraints are explicitly programmer-specified, and the thesis contends that this is practical.

rmc::atomic<T> - a variable that can be concurrently accessed
L(label, expr) - labels an expression
VEDGE and XEDGE - specify orders between labeled expressions; effectively, V constrains the visibility of writes and X constrains the execution order of reads
rmc::push() or PEDGE - pushes have a total order, and provide orderings between reads and writes that are not possible with just V and X.
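
Putting the primitives together, the message-passing example above might be annotated as follows (a sketch in the style of the RMC examples; the label names are mine, and the header name is assumed):

#include <rmc++.h>       // RMC's C++ header (name assumed)

rmc::atomic<int> data, flag;

void send(int msg) {
  VEDGE(wdata, wflag);   // the data write is visible before the flag write
  L(wdata, data = msg);
  L(wflag, flag = 1);
}

int recv() {
  XEDGE(rflag, rdata);   // the flag read executes before the data read
  while (!L(rflag, flag)) continue;
  return L(rdata, data);
}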

In a more advanced setting, do we need to add constraints to spinlock_lock and spinlock_unlock?  Let's add two special labels: pre and post.  These serve as interface boundaries: pre stands for everything before the labeled point and post for everything after, so a constraint can state that everything before this point has executed, or is visible.
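For instance, an unlock must make everything before it visible, and a lock must execute before everything that follows; a sketch of how the labels might be used (my adaptation, assuming rmc::atomic supports an exchange operation):

void spinlock_unlock(rmc::atomic<int> &lock) {
  VEDGE(pre, release);    // everything before the call is visible before the release write
  L(release, lock = 0);
}

void spinlock_lock(rmc::atomic<int> &lock) {
  XEDGE(acquire, post);   // the acquiring read executes before everything after the call
  while (L(acquire, lock.exchange(1)) == 1) continue;
}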

The next problem is loop iterations.  Do the constraints need to hold within a single iteration, or across every iteration?  The order specifiers are extended: in the following, the "_HERE" variant constrains only within an iteration, whereas the plain specifier (without "_HERE") also orders across iterations.

for (i = 0; i < 2; i++) {
  VEDGE_HERE(before, after);  // orders the two writes within this iteration only
  L(before, x = i);
  L(after, y = i + 10);
}

The code extends LLVM and is on GitHub.  The compiler takes the RMC annotations and inserts the appropriate fence instructions into the IR, and the existing compiler then lowers this to assembly.  An SMT solver determines the minimal set of locations that need fences (or other instructions), so the lowering to assembly can take advantage of the specific constraints required.  Overall, performance is better than the C++11 model on ARMv7 and POWER, and comparable on ARMv8.  I suspect that x86's TSO model is not as interesting for finding performance benefits.

On usability and practicality: the paper "Can Seqlocks Get Along With Programming Language Memory Models?" argues that C++11 would require acquire semantics on unlock; here it is stated that RMC handles this much more straightforwardly.  Further, students in 15-418 found gains from RMC versus the C11 model.

Other future work includes exploring whether additional consistency instructions might give the compiler a more nuanced way to inform the hardware of required orderings.  Recall that the coarsest-grained such instruction is the full memory fence.

Wednesday, February 18, 2015

Going ARM (in a box)

ARM, that exciting architecture, is ever more available for home development.  At first, I was intrigued by the low price point of the Raspberry Pi.  I have one, yet I ran into three difficulties: the ARMv6 ISA, the single core, and the 512MB of RAM.  For my purposes, NVIDIA's development board served far better.  At present, that board is no longer available on Amazon; however, I have heard rumors of a 64-bit design being released soon.

With the release of the Raspberry Pi 2, many of my concerns have been allayed.  I am also intrigued by the possibility it offers of running Windows 10.

Tuesday, February 4, 2014

Book Review: ARM Assembly Language: Fundamentals and Techniques

Besides reading an average of 5 research papers every week, I also read an average of one book each week.  Occasionally those books relate to computers, and usually then I'll write about them here.  A couple of months ago I realized that I didn't really know anything about ARM processors, besides that they are low power.  It seemed remiss to be studying computer architecture and not know one of the modern architectures.  Thus I visited my school library and checked out a book.

This is the story of that book: ARM Assembly Language: Fundamentals and Techniques.  An interesting book that covered the basics of what I wanted to learn; its shortcoming was that it assumed an environment different from mine.  ARM processors can be found in a greater diversity of devices than, say, x86.  Yet I still think of the ARM processor as a drop-in replacement: I look to devices like Microsoft's Surface or a smartphone, and assume the presence of an OS, etc.

I particularly learned that classic ARM instructions carry condition bits that make them predicated.  I realized then that conditional branches are really just predicated instructions: if the predicate(s) are true, then take the branch.  Just another perspective on instruction sets.  Anyway, I look forward to getting a Raspberry Pi, so I can try out some of what I've learned and also work through the assembly generated by compilers.
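
As a small illustration of predication (my own example; the ARMv7 instructions in the comments are the would-be compiled form), a compiler can turn a short conditional into a predicated move instead of a branch:

// The conditional assignment can compile to a single predicated
// MOVGT on ARMv7, avoiding a branch entirely.
int max(int a, int b) {
  int r = a;   // MOV   r2, r0
  if (b > a)   // CMP   r1, r0        ; set the condition flags
    r = b;     // MOVGT r2, r1        ; executes only when b > a
  return r;    // MOV   r0, r2
}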