
Friday, March 1, 2019

Conference Attendance: SIGCSE 2019 - Day 1.5

Back at SIGCSE again, this one the 50th to be held.  Much of my time is spent dashing about and renewing friendships.  That said, I made it to several sessions.  I've included at least one author and linked to their paper.

Starting on day 2, we begin with the keynote from Mark Guzdial.

"The study of computers and all the phenomena associated with them." (Perlis, Newell, and Simon, 1967).  The early uses of Computer Science were proposing its inclusion in education to support all of education (1960s).  For example, given the equation "x = x0 + v*t + 1/2 a * t^2", we can also teach it as a algorithm / program.  The program then shows the causal relation of the components.  Benefiting the learning of other fields by integrating computer science.

Do we have computing for all?  Most high school students have no access to computing courses, and even when they do, most do not take them.

Computing is a 21st century literacy.  What is the core literacy that everyone needs?  C.f. K-8 Learning Trajectories Derived from Research Literature: Sequence, Repetition, Conditionals.  Our goal is not teaching Computer Science, but rather supporting learning.

For example, let's learn about acoustics.  Mark explains the straight physics.  Then he brings up a program (in a block-based language) that can display the sound reaching the microphone.  So the learning came from the program, demonstration, and prediction.  Not from writing and understanding the code itself.  Taking data and helping build narratives.

We need to build more, try more, and innovate.  To meet our mission, "to provide a global forum for educators to discuss research and practice related to the learning and teaching of computing at all levels."

Now for the papers from day 1:

Lisa Yan - The PyramidSnapshot Challenge

The core problem is that we only view student work by the completed snapshots.  Extended Eclipse with a plugin to record every compilation, giving 130,000 snapshots from 2,600 students.  For those snapshots, they needed to develop an automated approach to classifying the intermediate states.  Tried autograders and abstract syntax trees, but those could not capture the full space.  But!  The output is an image, so why not try image classification?  Of the 138,531 snapshots, they generated 27,220 images.  Lisa then manually labeled 12,000 of those images into 16 labels that are effectively four milestones in development.  Then, a neural network classifier classified the images.  Plot the milestones using a spectrum of colors (blue being start, red being perfect).  Good students quickly reach the complete milestones.  Struggling students are often in early debugging stages.  Tinkering students (~73rd percentile on exams) take a lot of time, but mostly spend it on later milestones.  From these, we can review assignments and whether students are in the declared milestones, or if other assignment structure is required.

For the following three papers, I served as the session chair.

Tyler Greer - On the Effects of Active Learning Environments in Computing Education

Replication study on the impact of using an active learning classroom versus traditional room.  Using the same instructor to teach the same course, but using different classrooms and lecture styles (traditional versus peer instruction).  The most significant factor was the use of active learning versus traditional, with no clear impact from the type of room used.

Yayjin Ham, Brandon Myers - Supporting Guided Inquiry with Cooperative Learning in Computer Organization

Taking a computer organization course with peer instruction and guided inquiry, can the peer instruction be traded for cooperative learning to further emphasize engagement and learning?  Exploration of a model (program, documentation), then concept invention (building an understanding), then application (applying the learned concepts to a new problem).  Reflect on the learning at the end of each "lecture".  In back-to-back semesters, they measured the learning gains from this intervention, as well as surveying other secondary items (such as engagement and peer support).  However, the students in the intervention group did worse, most of which is explained by prior GPA.  And across the other survey points, students in the intervention group rated lower.  The materials used are available online.

Aman, et al - POGIL in Computer Science: Faculty Motivation and Challenges

Faculty try implementing POGIL in the classroom: start with training, then implementation in the classroom, and continued innovation.  Faculty want to see more student motivation, better retention of the material, and students staying in the course (as well as in the program).  Students have a mismatch between their learning and their perceived learning.  There are many challenges and concerns from faculty about the costs of adoption.

Friday, April 27, 2018

Thesis Defense: Systems Support for Intermittent Computing

Today I attended Alexei Colin's thesis defense titled, Systems Support for Intermittent Computing.

For small, embedded devices, batteries are expensive / difficult, so energy can be harvested from RF, light, temperature gradients, motion, et cetera.  This direct energy source is insufficient to power the device continuously, so a small capacitor (or other storage medium) retains the energy until the device can be powered for a short time.  The discharge provides an intermittent period of execution before the power source drops below the threshold for execution.  Programs can be annotated with latches or other progress points, such that execution after a power failure can resume at this point once power is again available.

To model the computation, the program will be decomposed into tasks, where each task can only transfer control to other tasks, but contains arbitrary code.  Tasks will communicate through channels.  The channels provide the memory model, such that any internal updates within the task are ultimately exposed via the channels.  However, this model, while reducing the overhead required to execute the tasks, requires a greater quantity of non-volatile memory.
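
To make the model concrete, here is a rough sketch of my own (not the dissertation's actual API) of tasks, channels, and a progress latch; read_sensor() and all of the names are invented for illustration:

struct Channel { int value; };               // assume: placed in non-volatile memory

int read_sensor() { return 42; }             // stand-in for a real sensor read

void task_sample();
void task_process();

typedef void (*Task)();
Task next_task = task_sample;                // assume: the progress latch, also non-volatile

Channel ch_sample, ch_result;

void task_sample() {
    ch_sample.value = read_sensor();         // internal work is exposed only via channels
    next_task = task_process;                // commit progress to the latch
}

void task_process() {
    ch_result.value = ch_sample.value * 2;
    next_task = task_sample;
}

int main() {
    for (;;) next_task();                    // after a power failure, execution resumes at the latch
}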

How do we then get the tasks and thus the latches?  Given a control flow graph (CFG), task boundaries will need to be inserted between specific basic blocks of the graph.  The compiler can be extended to model (or receive model results for) the energy requirements of each block and thereby estimate which path segments will have sufficient energy to execute completely.  Each block actually has not a single energy cost, but a PDF based on possible microarchitectural effects.  The model then combines these PDFs to compute the CDF, which gives the probability that a given path will successfully execute with a specific amount of energy available.  Note that each task boundary imposes overhead in both time and therefore energy, so we want the set of task boundaries to minimize overhead, while also accounting for task failures wasting energy.  This compiler pass produces better task decompositions than are achieved via the manual programmer annotations provided by prior work.

There are other system support issues as well.  The system should also be able to dynamically select the stored energy necessary for task execution, which first requires splitting the energy storage device into multiple banks in hardware.  Debugging issues in the system is also difficult, particularly when the device is expected to regularly "fail", so a new debugger was prepared that can provide the program state of traditional debuggers while still allowing the device to operate intermittently.  Such devices will also need further design for intermittent networking stacks, and then to be built into a larger IoT hierarchy.

In conclusion, energy-harvesting embedded computers will form the edge of the IoT hierarchy.  And the system stack will form the basis for support.

Tuesday, February 27, 2018

Conference Attendance SIGCSE 2018

I have just finished attending SIGCSE 2018 in Baltimore.  In contrast to my earlier conference attendance, this time I have had higher involvement in its execution.

On Wednesday I went to the New Educator's Workshop (NEW).  Even after being faculty for two years, there were still a number of things that were either new or good reminders, such as including or discussing learning objectives with each lecture and assignment, or being careful about increasing one's level of service.  As a new faculty member, each service request seems exciting, as no one has asked me before!  But many senior faculty emphasized that this is the time in which they are protecting us from lots of service opportunities so that we can spend time on our teaching and research.

On Thursday morning, I presented my recent work that updated a programming assignment in Introduction to Computer Systems, and from which we saw improvements in student exam scores.  We did not research the specific cause, and are therefore left with two theories.  First, the improvement could be from using better style in the starter code and emphasizing this style in submissions.  Second, we redesigned the traces to require submissions to address different cases and thereby implement different features.  I lean toward the former, but have no data-driven basis for this hypothesis.

Let's discuss active learning briefly.  I attended (or ran) several sessions focused on this class of techniques.  The basic idea is that students have better engagement and learning by actively participating in class.  There are a variety of techniques that work to help increase student activity.  On Thursday afternoon, Sat Garcia of USD presented Improving Classroom Preparedness Using Guided Practice, which showed how student learning improved from participating in Peer Instruction, which particularly requires students to come to class prepared.  Shortly after, Cynthia Taylor joined Sat and me in organizing a Birds of a Feather (BoF) session on using Active Learning in Systems Courses.  We had about 30-40 attendees there, split into two groups discussing techniques they have used and problems they have observed.  Five years ago, a similar BoF had attendance of around 15-20, so we are making progress as a field.

On Friday, I spoke with Brandon Myers, who has done work on using POGIL in Computer Organization and Architecture.  In POGIL, students work in groups of 3-4 with specific roles through a guided activity that leads them to discover the concepts themselves.  We had a nice conversation and may be merging our draft resources.  This last point is often the tricky part of using active learning, in that developing reasonable materials can be both time intensive and require several iterations.

The Friday morning keynote presentation was given by Tim Bell, who spoke about K-12.  This topic is rather distant from my own work and research, so I was skeptical.  Yet, I came out quite enthused.  It was interesting to think about presenting Computer Science concepts in non-traditional ways, based initially on having to explain your field at elementary school when the other presenters are a cop and a nurse (his example).  How could you get 6 year olds to sort?  Or see the advantage of binary search as the data grows?

In the afternoon, I was a session chair for the first time.  I moderated the session on Errors, so obviously the AV system stopped working for a short duration.  Beyond that incident, the session seemed to go well.

I always like going to SIGCSE.  It is rejuvenating and exhausting.  So many teachers to speak with about courses, curriculum, and other related topics.  And then you find that you've been social for 16 hours or so.

Tuesday, October 24, 2017

PhD Defense - Low-level Concurrent Programming Using the Relaxed Memory Calculus

Today, I went to Michael Sullivan's thesis defense; he passed.  The work was at a delightful intersection of my interests.

We want better (more usable, etc.) semantics for low-level operations, those below std::atomic<> and similar designs.  Perhaps this is achievable with ordering constraints.  Given the following simple example, what constraints are required?

int data, flag;

void send(int msg) {
  data = msg;
  flag = 1;
}

int recv() {
  while (!flag) continue;
  return data;
}

Two constraints: data ("visible") before flag, flag ("executed") before data.  These constraints are explicitly programmer-specified, and it is contended that this is practical.

rmc::atomic<T> - a variable that can be concurrently accessed
L(label, expr) - labels an expression
VEDGE and XEDGE - specify orders between labeled expressions, effectively V is write visibility ordering and X is execution of read ordering
rmc::push() or PEDGE - Pushes have a total order, and provide orderings between reads and writes which is not possible with just V and X.
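
Putting these constructs together, the send/recv example above might be annotated roughly as follows.  This is a sketch based on the constructs as just described; the exact header and macro spellings in the real RMC implementation may differ, and it requires the RMC compiler to build:

rmc::atomic<int> data, flag;

void send(int msg) {
    VEDGE(wdata, wflag);                 // data must be visible before flag
    L(wdata, data = msg);
    L(wflag, flag = 1);
}

int recv() {
    XEDGE(rflag, rdata);                 // the read of flag executes before the read of data
    while (L(rflag, flag) == 0) continue;
    return L(rdata, data);
}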

In more advanced cases, do we need to add constraints to spinlock_lock and spinlock_unlock?  Let's add two special labels: pre and post.  These serve as interface boundaries, denoting that everything before this point has executed, or is visible.

Next problem is loop iterations.  Do the constraints need to be within a single iteration or constraining every iteration?  Extend the order specifiers, so in the following, the ordering constraint is just within the iteration, whereas the constraint outside the iteration (without "_HERE") is also between the iterations.

for (i = 0; i < 2; i++) {
  VEDGE_HERE(before, after);
  L(before, x = i);
  L(after, y = i + 10);
}

Code extends LLVM and is on GitHub.  The compiler takes the RMC extensions and puts the appropriate fence instructions into the IR, and then the existing compiler lowers this to assembly.  The compiler uses an SMT solver to determine the minimal set of locations that need the necessary fences (or other instructions).  The lowering to assembly can then better take advantage of the specific constraints required.  Overall, the performance is better than the C++11 model on ARMv7 and Power, and comparable on ARMv8.  I suspect that x86's TSO model is not as interesting for finding performance benefits.

Usable / practical - Can Seqlocks Get Along With Programming Language Memory Models?  argues that C++11 would require acquire semantics on unlock.  Here it is stated that RMC is much more straightforward.  Further, students in 15-418 found gains from RMC versus the C11 model.

Other future items include the exploration of whether there are additional consistency instructions that might provide a better nuance for the compiler to inform hardware about required orderings.  Recall that the coarsest grained instruction is the full memory fence.

Monday, January 18, 2016

Conference Attendance HiPEAC - Day 1 - MULTIPROG

It is, once again, conference time.  For North Americans, this might seem rather early, as I am writing from Prague, Czech Republic (or at least I was when I started this post 12 hours ago).  I am attending HiPEAC, which is the premier European computer architecture conference.  HiPEAC is a dual-track conference.  Throughout the three days there is the paper track, where the papers accepted to TACO (such as mine) are presented.  And simultaneously there are workshops.  For the first day, I am starting with the MULTIPROG workshop, which is on Programmability and Architectures for Heterogeneous Multicores.

Let's start with the keynote, given by David Kaeli of Northeastern University.
- Concurrent execution of compute kernels
- Scheduling of kernels, deadlines
- Sharing / access to host memory (i.e., RAM)

The current model of using a GPGPU is that it runs one computation kernel; however, there are many problems that would better decompose into several separate kernels.  It would also be valuable if there were further examples of these problems (i.e., benchmarks).  Now, whenever you try running multiple of anything on a computational resource, there is a runtime scheduling problem: which should run to best complete the overall problem?  A follow-on research question explores this in a cloud-based environment, where the GPU may be shared across entirely independent compute kernels.  This requires the kernels to be tagged with IDs to ensure that their memory is kept separate.  All of this sounds as if we need an OS for the GPU.

Following the late-morning break, we heard next from MECCA (MEeting the Challenges in Computer Architecture) on the 3Ps: parallelism, power, and performance.  Consider parallel program annotations for describing the concurrency.  The runtime then manages the caches using these annotations: it can see the flow of data and transfer data before it is required, with the appropriate coherence states, and it can tell when a block is dead and can be evicted from the cache.

Then there was lunch, resting from my flights, then networking, especially the part where I stood by my poster and discussed my research for 3 hours.  Now to rest for day 2.

Tuesday, November 10, 2015

CSE Distinguished Lecture - Professor David Patterson - Instruction Sets want to be Free: The Case for RISC-V

Similar to every other talk on Computer Architecture, first we need to revisit history.  Only by knowing from where we came, do we envision where to go.

History of ISA:
IBM/360 was proposed to unify the diverse lines of mainframes.  Slowly the ISAs started adding more instructions to support more things (see below).

Intel 8086 was a crash ISA design program to cover for their original ISA design that was delayed.  Then IBM wanted to adopt the Motorola 68000, but the chip was late, so the IBM PC used 8088s.

In the 1980s, studies found that if compiled code used only simple instructions, then programs ran faster than when using all of the instructions.  Why not design a simple ISA?

RISC (Reduced Instruction Set Computing) was that ISA.  The processor is simpler and faster.  Secretly, all processors are now RISC (internally), for example, the Intel and AMD processors translate from their x86 ISAs into their internal RISC ISA.

Maybe several simple instructions could execute together, so the architecture could be simplified further, and the compiler could find these instructions rather than spending time and energy while the program is running.  This ISA is VLIW (very long instruction word), where many simple instructions are merged into one long instruction.

Open ISA:
Computer Architecture is reaching certain limits such that processor gains will soon come from custom and dedicated logic.  IP issues limit the ability to do research on ISAs.  We are forced to write simulators that may or may not mimic actual hardware behavior.

Proprietary ISAs are continuing to grow in size, about 2 instructions per month.  This provides the marketing reason to purchase the new cores, rather than just the architectural improvements.

Instead, let's develop a modular ISA using many of the ideas from existing designs.  For example, atomic instructions for both fetch-and-op, as well as load link / store conditional.  Or, compressed instruction format so certain instructions can use a 16-bit format rather than 32-bits (see ARM).

RISC-V has support for the standard open-source software: compilers (gcc, LLVM), Linux, etc.  It also provides synthesizable core designs, simulators, etc.

Sunday, June 14, 2015

Conference Attendance FCRC - Day 1 - WCAE / SPAA

In Portland for the next 5 days attending the Federated Computing Research Conference, which is a vast co-location of the top ACM conferences.  For my part, this includes ISCA and PLDI.  Following registration and checking in as a student volunteer, I ducked into the Workshop on Computer Architecture Education (WCAE).  There were a couple of presentations on different tools being used to teach architectural concepts.

Following the morning break, it was time for the keynote for SPAA, given by Hans-J. Boehm, titled "Myths and Misconceptions about Threads".  For example,
#include "foo.h"
void f() {
    foo_t x, a;
    ...
    x = a; // Is this atomic?
}
Which led to the discussion of "is assignment atomic?", and the audience tossed out increasingly complex examples of how it is not.  Fundamentally, the programming model is becoming "data-race free", and the specifications can treat races as "undefined behavior".  In general, a sequential program will view its execution as following the sequential consistency model, even if the hardware is executing the code with a weaker model.

What then should the programming language provide for atomics / synchronization?  Recall that the compiler has considerable flexibility in emitting the final program.  With data-race-free code, the compiler treats anything that is not an atomic as part of sequential code and therefore subject to any reordering that would still be valid sequentially.  The following example shows how this can go awry.  Here x is a global, and the compiler could substitute x anyplace tmp appears, because the model assumes "there are no races on x".  And if the program does happen to modify x in a racy manner, then the behavior is undefined.
bool tmp = x;
if (tmp) f = new ...
...
if (tmp) f->foo();
Gah!  The programmer wanted to take a snapshot of the global value, but ended up with a different result.  So the atomics are becoming more than just the "hacker's" way to quickly update shared values, and instead can be seen as annotations to the compiler to clearly encapsulate the shared state.  This means the type of x is not bool, but atomic<bool>.  Then the compiler knows the programmer's (likely) intent of this code.  And this then rolls back to a deeper question of my research, "What could the system do more efficiently if it knew more about the programmer's intent?"
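
As a concrete sketch (mine, not the speaker's), the snapshot example with that annotation applied might look like the following; the surrounding declarations are invented just to make it self-contained:

#include <atomic>

std::atomic<bool> x{false};          // the shared flag, now explicitly atomic
struct Foo { void foo() {} };
Foo* f = nullptr;

void consumer() {
    bool tmp = x.load();             // one explicit read: a genuine snapshot
    if (tmp) f = new Foo;
    /* ... */
    if (tmp) f->foo();               // the compiler may not replace tmp with a re-read of x
}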

Wednesday, February 18, 2015

Going ARM (in a box)

ARM, that exciting architecture, is ever more available for home development.  At first, I was intrigued by the low price point of a Raspberry Pi.  And I have one, yet I ran into three difficulties: the ARMv6 ISA, the single core, and the 512MB of RAM.  For my purposes, NVIDIA's development board served far better.  At present, that board is no longer available on Amazon; however, I have heard rumors of a 64-bit design being released soon.

With the release of the Raspberry Pi 2, many of my concerns have been allayed.  I am also intrigued by the possibility it offers of running Windows 10.

Tuesday, March 18, 2014

Turing Complete x86 Mov

Here is a small work that showed the power (or perhaps horror) of modern assembly languages.  I've told people before that an instruction to "subtract and branch if negative" is sufficient for all programming.  But never would I have imagined that x86's mov is also sufficient.  Until you think about what Intel lets you do:
  • Set the instruction pointer
  • Compute arithmetic expressions via address calculation
  • Compare two values by using their corresponding locations
 I am skeptical about whether this could work, given aliasing.  Regardless, it is an interesting exercise and I applaud the author.  Now, I can only imagine one of these oddities legitimately arising in programming.  You have all been warned. :-)

Friday, June 21, 2013

What is Time: Keeping Time in Software

In a prior post, Measuring Performance, I discussed some techniques for how a developer might measure the performance of his or her application.  However, sometimes what you want is to measure an interval of time.  We can generally agree to do this by taking two measurements: the time before and the time after.  Then find the difference, which is the interval of time elapsed.  Great!  But how exactly are we going to measure time?

A common theme to the questions I answer on stackoverflow is proper measurement of time (see for example, Analysis of CPU caches, Function Pointers versus Inline Functions, or Windows system time with millisecond resolution).  A programmer starts with the tools most readily available, usually a call to gettimeofday or GetTickCount (for example).  These functions provide time in (roughly) millisecond resolution, like having a stopwatch.  For its own timekeeping purposes, an OS sets up hardware to send an interrupt every ~16ms (i.e., 64Hz via the PIT).  This interrupt primarily denotes a scheduling quantum, but for little additional overhead, the OS updates its clock.  Some applications need finer-grained scheduling and increase the frequency of the interrupt (i.e., decrease the quantum), which increases overhead and power usage.  Using these calls, the programmer can measure events that take at least 1 centisecond with reasonable accuracy.

But, you object, I want better accuracy!  Let's turn to the HPET then, a hardware device that internally operates at 10+ MHz (i.e., ticks of at most 100ns).  Thanks to this device, the system has a microsecond time source.  For this resolution, a programmer may use clock_gettime or QueryPerformanceCounter.  And both also come with supplemental routines that indicate the resolution of the time returned.  For example:

#include <stdio.h>
#include <time.h>

int main(int argc, char** argv)
{
    struct timespec n, o;
    clockid_t t = CLOCK_MONOTONIC;

    clock_getres(t, &n);
    printf("%ld - %ld\n", (long)n.tv_sec, n.tv_nsec);

    clock_gettime(t, &n);
    clock_gettime(t, &o);

    printf("%ld - %ld\n", (long)n.tv_sec, n.tv_nsec);
    printf("%ld - %ld\n", (long)o.tv_sec, o.tv_nsec);

    return 0;
}


% ./t_test
0 - 999848
13390932 - 798720095
13390932 - 798720095

This time source has a reported resolution of approximately 1ms; however, sometimes the printed "n, o" pair actually shows a one-microsecond difference, which suggests that the value returned by clock_getres is conservative.  Thus, on my test machine, I could measure time in microseconds.

But, you object, I want better accuracy!  So let's move on to the timestamp counter then.  This counter is per-processor and runs at the processor's frequency (1+ GHz, or <1ns per tick).  clock_gettime and QueryPerformanceCounter may use this time source.  So we can query it for a nanosecond-accurate time measurement.  Done, or are we?

But, I object, it takes time to do this measurement; perhaps a very small amount of time, but time nonetheless.  If we switch the code above to use CLOCK_PROCESS_CPUTIME_ID, the measurements show a 400 - 500ns difference between n and o, even though the reported "resolution" is 1ns.  Did the function lie about its resolution?  No.  Imagine calling NIST or CERN to get the time from their clock.  Their clock is accurate to better than 1ns, but it takes time to call them.  In the code example, each call is taking about 400ns to make the function call(s), query the timestamp and then convert into the appropriate format.  Note that this overhead was also present in the earlier experiments, which is why the earlier output occasionally differed by 1 microsecond, as the 400ns of overhead caused the measurements to fall in different microsecond ticks.

But, you object, I really need to know how many nanoseconds this something takes.  There are two approaches to remedy this: first, we could write some code that does the above repeatedly to determine the overhead and then subtract it from the time difference, or second, we could find a different time source.
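
Here is a rough sketch of the first remedy, assuming Linux and the same clock_gettime interface as above: time a large batch of back-to-back calls and divide to estimate the per-call overhead.

#include <stdio.h>
#include <time.h>

static long long to_ns(struct timespec t) {
    return (long long)t.tv_sec * 1000000000LL + t.tv_nsec;
}

int main(void)
{
    const int N = 100000;
    struct timespec start, end;

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for (int i = 0; i < N; i++) {
        struct timespec scratch;
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &scratch);   // the call being calibrated
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

    printf("per-call overhead: ~%lld ns\n", (to_ns(end) - to_ns(start)) / N);
    return 0;
}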

These function calls are querying the timestamp counter, but architects have exposed the counter to programmers for years, so why make the call?  In x86, it is the instruction RDTSC or RDTSCP, which provides the 64-bit timestamp value.  A single instruction is about as small of overhead as we're going to get (although, we can also look at register pressure, cycles required, etc).

Hah!  You exclaim, now I can know which instruction sequence is fastest, because I can measure CPU cycles.  But at this timescale, your program's measurement has left the "classical" architecture model of executing one instruction at a time and is drifting into the reality of many 10s of instructions executing simultaneously.  When you toss in your RDTSC, there is no guarantee where in this mix of instructions it will execute and complete (with RDTSCP, you can guarantee that all previous instructions have completed).  So it is recommended that even if you are measuring a "small instruction sequence" like some of the cited stackoverflow questions, that you perform thousands of operations and measure the time of this aggregate.  Of course, architectural behaviors like branch prediction, cache misses, prefetchers, etc will change over the course of the experiment.
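
A sketch of that recommendation, using the GCC/Clang intrinsics for RDTSC and RDTSCP from <x86intrin.h> (x86-only; the loop body is a stand-in for whatever sequence you actually care about):

#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
    const int N = 100000;
    volatile long sink = 0;                      // volatile keeps the work from being optimized away
    unsigned int aux;

    unsigned long long start = __rdtsc();        // RDTSC: read the timestamp counter
    for (int i = 0; i < N; i++)
        sink += i;                               // the "small instruction sequence" under test
    unsigned long long end = __rdtscp(&aux);     // RDTSCP: waits for prior instructions to complete

    printf("~%llu cycles per iteration\n", (end - start) / N);
    return 0;
}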

As if this wasn't bad enough, the timestamp counter (including calls to CLOCK_PROCESS_CPUTIME_ID) is specific to each processor.  The counter is not global system time, so taking measurements on different processors (say different threads or a thread that migrated) will not necessarily give meaningful results.  Many modern OSes try to set the counters so that the measurements are within a small margin of each other.  In practice, I have observed deltas from low 100s to 10s of millions, so be warned.

Even worse, older processors really did count CPU cycles as the timestamp, so if the processor slowed down, then a "cycle" meant more nanoseconds.  Modern chips have invariant TSC, per Intel architecture manual:

The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource.
As a bonus, all of the clocks can be affected by thermal drift, as their frequency is related to their temperature.  Others have investigated this behavior in greater detail, and while it can be accounted for, in practice I've happily ignored it (although I should note in future writing that I have not controlled for this possible measurement bias).

In conclusion, I use RDTSC for most of my measurements and try to control for the clock difference that exists between processors.  In my current PhD research, I am still working to try to control and properly account for this difference so that I can measure the behavior that I am interested in.  So be warned and be cautious, and be as precise as you need to be.

Wednesday, May 22, 2013

More on Haswell

Since I am already excited about some of the features of the new Haswell family, it was great to see an even deeper look into microarchitecture.  There are the usual improvements, like expanding the instruction window or increasing cache bandwidth.  So read for yourselves at Ars Technica.

Wednesday, April 3, 2013

Architecture Abstraction Leaks

As a computer architect, I like to think that the hardware abstraction is generally opaque.  If a programmer really needs performance or certain effects, then it is possible to reach through the layer, but in general the computer gives you what you want.  Sometimes, a programmer can do simple things that pull the covers back and reveal the architecture in all of its gritty detail.  One example would be row-major versus column-major indexing when looping over large pieces of data.
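
As a small illustration of that example (my own sketch), both loops below touch the same elements, but only the first walks memory contiguously, since C arrays are row-major:

#include <cstddef>

const std::size_t N = 1024;
static double grid[N][N];

double sum_row_order() {             // contiguous accesses: cache-friendly
    double s = 0.0;
    for (std::size_t i = 0; i < N; i++)
        for (std::size_t j = 0; j < N; j++)
            s += grid[i][j];
    return s;
}

double sum_column_order() {          // strided accesses: pulls back the covers on the cache
    double s = 0.0;
    for (std::size_t j = 0; j < N; j++)
        for (std::size_t i = 0; i < N; i++)
            s += grid[i][j];
    return s;
}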

In this post, I wanted to draw your attention to a question posed on stackoverflow, Why is processing a sorted array faster than an unsorted array?  In this question, we are ignoring the time to sort the array.  The operation being applied to each array element is relatively simple and is conditional on the value of the element.  By sorting the array, the same operations end up applied to contiguous elements.  And as the same branch outcome repeats across those contiguous elements, the code in question was getting a clear win from branch prediction.

But the branch predictor is abstracted hardware, so once again the programmer just learned about the architecture when he / she didn't intend to.

Wednesday, March 27, 2013

Transactional Memory and Intel's TSX

It was some years ago that I sat in the audience and heard AMD present a sketch for how transactional memory (TM) would be implemented in the x64 ISA.  More recently a fellow student mentioned that Intel has some new extensions entering the x64 ISA for locks, etc.  I'm always a fan of properly implemented locks, as they so often limit performance and scalability.  So let's dig into Intel's TSX and why I actually want to go buy a gadget when it's released.

A programmer can delineate the transactional section with XBEGIN and XEND instructions.  Within the transactional section, all reads and writes are added to a read- or a write-set accordingly.  The granularity for tracking is a cache line.  If another processor writes to a line in the read-set, or makes either a read or a write request to a line in the write-set, then the transaction aborts.

Transactions can be semi-nested: a nested transaction can only commit when the outer transaction is complete.  Internally nested transactions do not commit on XEND.  If any transaction in the nest aborts, then the entire transaction aborts.  When the number of XENDs matches the number of XBEGINs, the entire transaction commits and becomes globally visible.  Transactions can also be explicitly aborted by the XABORT instruction, which enables the code to abort early when it can determine that the transaction will or should fail.
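
For a concrete feel, here is a minimal sketch using the RTM intrinsics that compilers expose for these instructions (<immintrin.h>, compiled with -mrtm); the fallback-lock pattern is my own boilerplate rather than anything prescribed by the ISA:

#include <immintrin.h>
#include <atomic>

std::atomic<int> fallback_lock{0};
long counter = 0;

void increment() {
    unsigned status = _xbegin();                 // XBEGIN: enter the transactional section
    if (status == _XBEGIN_STARTED) {
        if (fallback_lock.load() != 0)           // put the fallback lock in the read-set
            _xabort(0xff);                       // XABORT if someone holds it for real
        counter += 1;                            // tracked in the write-set
        _xend();                                 // XEND: attempt to commit
        return;
    }
    // Aborted (conflict, capacity, etc.): take the fallback lock non-transactionally.
    while (fallback_lock.exchange(1) != 0) { /* spin */ }
    counter += 1;
    fallback_lock.store(0);
}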

As I understand it, TSX is being built on top of the existing cache coherence mechanisms.  Each cache line gains an additional bit to indicate if it is part of a transaction.  Each memory operation is treated normally between the processor and the coherence hierarchy with several caveats.  If a dirty, transactional block is evicted, then the transaction fails.  If a dirty, transactional block is demand downgraded from modified to shared or invalid, then the transaction fails.  In this case, a new message would indicate that the request to forward the data fails and the request should be satisfied by memory.

If the transaction commits, then the transactional bits are cleared on each cache line.  And the lines operate normally according to existing cache coherence mechanisms.

Wrapping this up, TSX is an ISA extension that almost every program can take advantage of and therefore has an appeal toward conducting personal testing, just like building my own x64 machine back in 2005.

Tuesday, February 14, 2012

Bit Packing: A World Gone Mad

Let's start with a mythical program. It has some classes (structs, objects, etc) that are instantiated. Each field has a meaning given to it by the code. Many fields can hold far more data than is required of them. In our mythical program, these unused bits are of no concern, as memory is plentiful. But then the mythical program is run on a real machine and bad things happen. Maybe it runs slowly, or crashes when malloc() returns NULL. Now what?!

In this post, I'm going to discuss some tricks and ideas for reducing the memory usage of a program. Before trying these steps, you should understand the program and its workload using profilers, etc. The program should also be verified to have no memory leaks.

First example. A program was running out of memory. This program went through a memory trace and counted where each memory request went. A count existed for every N locations of memory, and each one was an unsigned 64-bit integer (uint64_t). With 1 billion memory operations, the worst case is a single count of 1 billion (roughly 2^30). With a fairly uniform distribution, each of a billion counters would hold 10, perhaps 100. Switching to a 32-bit integer reduced the program's footprint by 4GB, allowing work to continue. Other proposals included using a two-level counter scheme, where overflow of a small counter (16-bit) results in using a full-sized (64-bit) counter, or only tracking subsets of the counters via random sampling.
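
A sketch of the two-level proposal (my own illustration; the names and sizes are assumptions): small 16-bit counters cover the common case, and a side table of 64-bit counters is used only after overflow.

#include <stdint.h>
#include <map>

static uint16_t small_counts[1 << 20];            // one small counter per tracked location
static std::map<uint32_t, uint64_t> big_counts;   // overflow counters, allocated on demand

void count(uint32_t loc)                          // loc assumed < (1 << 20)
{
    if (small_counts[loc] == UINT16_MAX) {
        big_counts[loc]++;                        // already promoted: bump the 64-bit count
    } else if (++small_counts[loc] == UINT16_MAX) {
        big_counts[loc] = UINT16_MAX;             // promote: carry the saturated value over
    }
}

uint64_t total(uint32_t loc)
{
    auto it = big_counts.find(loc);
    return (it != big_counts.end()) ? it->second : small_counts[loc];
}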

The lesson learned is that having some spare bits is acceptable, but counting to, say, "100" when the location can go to 1.84467441 × 10^19 might be excessive.

Second example. I've probably written about this before, and I'll likely do so again. Locks are often wastes of space. Consider the following:
void atomic_add(int* loc, int val)
{
    lock();      /* acquire the lock word protecting *loc */
    *loc += val;
    unlock();
}

Looks fine, right? Consider that the lock is 4 bytes of space, of which 1 bit indicates whether it is acquired. Two compactions are possible. First, the lock could be integrated into the value itself. This would reduce the range of the value, as one bit is now dedicated to holding the lock state. Or second, many architectures (like x86) support the above operation as a single instruction. So using the instruction removes the need for a lock bit / word entirely.

Why don't we always replace the locks with atomic instructions? Two reasons: first, if the locked section is ever more complex, then the atomic sequence becomes vastly more complex, as only the simplest of operations (add, subtract, or, and, etc) are supported with atomicity. Second, the determination of the appropriate transformations in the compiler is exceedingly complex and may not be solvable in the general case. Since the compiler wouldn't be able to do this generally, it looks instead to the programmer to introduce this change.
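
For the simple case, the second compaction might look like the following sketch: with C++11 atomics (or the equivalent compiler builtins), the read-modify-write becomes a single atomic instruction on x86 and no separate lock word is needed.

#include <atomic>

void atomic_add(std::atomic<int>* loc, int val)
{
    loc->fetch_add(val);    // lowers to a single locked add/xadd on x86
}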

Third example. Storing lots of pointers to long lists of numbers (these could be strings or perhaps lists of edges in a graph). On a 64-bit system, each pointer is a 64-bit number, or is it? Current x86-64 architectures are limited to 48 bits of virtual addresses, to reduce the overhead of tracking the virtual memory space. So every pointer gives us 48 bits of information and 16 bits of 0s. (Many pointers are also aligned, which zeros some of the low order bits). And whenever the programmer knows the values at compile time, it is a constant!

A common operation is to ask for the length of a list. Ideally, this value is stored. But where? If the leading pointer has 16 available bits, then the length could be made part of the pointer. But what if the length could be greater than 65,535 (2^16 - 1)? A hybrid floating-point design could be used.

[I/F:1][LENGTH:15][ADDR:48]

The first bit indicates whether an integer or a floating-point length is stored. The next 15 bits store the length. If it is a float, then there is an exponent and a fractional component. 6 bits of exponent will cover lengths up to a 64-bit number (note that these floats should only be used for lengths greater than 2^15, since smaller values have a precise integer representation available). Or 4 bits of exponent would cover 2^16 through 2^31. The remainder of the 15 bits are for the fraction.
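
Here is a minimal sketch of the integer case only (my own illustration, no floating-point encoding), packing a 16-bit length into the unused high bits of a user-space x86-64 pointer:

#include <cstdint>
#include <cassert>

static const uint64_t ADDR_MASK = 0x0000FFFFFFFFFFFFull;   // low 48 bits hold the address

uint64_t pack(const int* list, uint16_t length) {
    uint64_t addr = reinterpret_cast<uintptr_t>(list);
    assert((addr & ~ADDR_MASK) == 0);                       // high 16 bits must be free
    return (static_cast<uint64_t>(length) << 48) | addr;
}

const int* unpack_ptr(uint64_t packed) { return reinterpret_cast<const int*>(packed & ADDR_MASK); }
uint16_t   unpack_len(uint64_t packed) { return static_cast<uint16_t>(packed >> 48); }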

Most programs will not need bit packing. Every time it is packed, there is the cost of extra instructions. But you will sometimes just need to save the space, even if it means making the design more complex.

Friday, December 23, 2011

Measuring Performance

Starting with a primer of Understanding processing time, let's delve into some methods for how we might measure the performance of software. Along the way, you might gain a better understanding into how the operating system (OS) tracks time. So let's take different methods in order from least overhead to greatest.

First, the wall clock. I stare at my watch and see the second-hand move. With each tick of the hand, another second has passed. Eventually the program completes, and I can record how many ticks have elapsed. (Yes, you can use time or another program to do this). I am constrained in knowing the time by my watch, and the computer is no different; it has an internal clock that ticks only so fast (see my comment on stackoverflow). We now know how long the program takes to run, a measure of performance, but not what it was doing (perhaps it sleeps for 20min then prints "complete!").

The operating system gives some metric of every program's CPU usage, viewable with top, taskmgr, etc. On every scheduling interval, the operating system receives an interrupt and must decide whether to multitask and switch a different thread into execution. Also, the OS will record that the past quantum of time was spent doing whatever was just interrupted. So we view the percentages of one second, where this chunk of 16ms (the default Windows quantum) was spent executing Word, that chunk was spent by the kernel, and all those were idle. There are still inaccuracies, but it is low overhead. As a note, the OS may actually record cycles elapsed on every task switch and thus compute a higher accuracy percentage. (Of course, power management can play havoc with elapsed cycles and elapsed wall clock time).

So we now know that our application was using the CPU during its entire execution, but what was it doing?! Often the next step is to use a profiler. A profiler works just like the OS's CPU usage in the previous paragraph, except it does so on an application granularity (although some profilers like xperf can do the whole system). Now on each interrupt (which also may be more frequent, like 1ms with corresponding overhead), the profiler records the instruction pointer (IP) of the running process and again ascribes the quantum of time to the function containing that IP. Optionally, the profiler might record a call stack on the interrupt, so calling functions can be identified, which is particularly useful when shared routines are invoked (like stl container classes). Some profilers will break a function's usage down by the IP, allowing the analysis to understand the specific assembly instructions that are being executed frequently.

But the profiler is still just relying on probability and sampling to give an estimate for the CPU usage of the program.  If we increase the overhead significantly more, we can know precisely what the program is executing.  I've used systems where the compiler can put callbacks around every function call.  Each callback logs the function being invoked and the current time (in cycles).  Now, one has a complete function trace of the running system, along with the cycles required for each segment of code.  Only one problem: the overhead is at least 100% (so performance is halved or worse).
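
One concrete way to get such callbacks (not necessarily the system I used) is GCC/Clang's -finstrument-functions flag, which inserts enter/exit hooks around every function; the hooks can then be filled with timestamp logging, as in this sketch:

#include <stdio.h>
#include <x86intrin.h>

extern "C" {

__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void* this_fn, void* call_site)
{
    fprintf(stderr, "enter %p from %p at %llu\n",
            this_fn, call_site, (unsigned long long)__rdtsc());
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void* this_fn, void* call_site)
{
    fprintf(stderr, "exit  %p from %p at %llu\n",
            this_fn, call_site, (unsigned long long)__rdtsc());
}

}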

Tangential to this entire discussion is the fact that modern processors also provide hardware performance counters, so that the analysis can see cache misses, branch predictions and other metrics of how efficiently the code is executing on the processor. And I have not discussed IO, although my general rule is asynchronous and nonexistent.

Again, always measure performance before cutting (i.e., optimizing).

Thursday, June 16, 2011

Book Review: Write Great Code Volume 1 (part 3 of 4)

With Parts 1 and 2 behind us, the text can race forward through character sets, present memory usage / layout (like my Pointers on Pointers series), discuss Boolean logic (which I only learned 6? times in college), and settle into CPU architecture, which is chapter 9.  This chapter would roughly serve as a syllabus for my undergraduate and graduate computer architecture classes, whereby the reader is told "Hey, processors are capable of doing this... somehow" and I note the class where one can learn "this is how."

In learning about CPU architectures, one of the main points was made 30 pages into the chapter, to quote "The techniques up to this point in this chapter can be treated as if they were transparent to the programmer."  The paragraph continues by stating that programmers can achieve further improvements by knowing the architecture; however, the huge caveat is whether the changes required are "worth it" (c.f., Performance Anti-Patterns).  Nonetheless, there are two things to note.  First, branches can significantly affect processor performance.  Changing their prevalence and particular direction can have benefits.  Second, shorter instructions are usually faster, as they save cache space and therefore memory accesses.

Chapter 10 presents the other half of architecture, the instruction set (ISA).  Let's just say that I've been working off and on with the Intel instruction set for years and I still had difficulty with this chapter.  At one level, ISAs are simple to explain, and so this chapter does in its introduction.  But modern ISAs (e.g., x86) have many complexities that make understanding difficult, as this work took whole pages to print the charts for particular opcode bits.  So I reached the end of the chapter knowing what the author intended to achieve, but not actually having gained that knowledge.

I had hoped to conclude volume 1 in three parts, but chapter 11 (memory architecture) warrants its own post as it has wrong information.

Tuesday, October 26, 2010

A Pointer on Pointers Part 1

People often express uncertainty about the working of pointers, structures, etc.  They may not be my readers, but hearing their plight moves me to write nonetheless.  Others have noticed this too, again looking at StackOverflow.  Everything in this post is written from a C-centric point of view, unless otherwise noted.

There are two separate items to understand about pointers.  First, where they come from.  And second, what they point to.  We will roughly be working off of the following picture of memory usage, but understand that this is so simplified that it is almost wrong.


Memory can be allocated from two places: stack and heap.  Both are generalizations of the actual mechanics.  Stack allocations are explicitly named variables in the source that are made either globally or locally to a function.  Global variables use space delineated in the binary image, rather than stack pages, but a programmer should treat them identically in that the allocations are only good within their scope.  This leads to my first example:

struct foo* newFoo()
{
    struct foo F;
    return &F;
}

Managed language programmers often attempt the above example.  newFoo() creates a foo and returns a reference; however, the referenced foo is on the stack.  In case there is uncertainty, let's see the assembly for function newFoo():

; Using 32bit x86 assembly, as 64bit has calling convention differences
  push    ebp
  mov     ebp, esp
  sub     esp, 0x10  ; Allocate 16 bytes for a "foo"
  mov     eax, esp   ; Put the address into the return value
  mov     esp, ebp   ; Cleanup
  pop     ebp
  ret

Esp points to the top of the stack, so the subtract instruction changes the location pointed to, thereby setting aside the required space for a foo object on the stack; but the space is only available while in the function.  When the cleanup section is reached, the space is "reclaimed" as esp is changed back.  (The funny thing is, an optimizing compiler may inline newFoo() and therefore change the scope of this stack allocation.)  So to make an allocation persist beyond its allocation scope, we need the "heap".

Heap memory is memory that won't be reclaimed until the program explicitly requests it (or terminates).  While malloc() is the most common method for heap allocations, it is not the only one.  From these methods, a section of memory has been set aside for the program's usage.  To again simplify our picture of computing, a Java implementation of newFoo would disassemble to something like the following:

  push    ebp
  mov     ebp, esp
  sub     esp, 0x4     ; Space for arguments
  mov     [esp], 0x10  ; "Push argument"
  call    malloc
  mov     esp, ebp     ; Cleanup
  pop     ebp
  ret

A language like Java might prohibit making an allocation of an object foo on the stack and therefore force the allocation to be made from the heap.  And it does so by not using "sub esp, 0x10" to make the allocation but instead calling out to malloc() to set aside the necessary space.  Now, once out of this function, the address returned is not reclaimed.
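
For completeness, here is a minimal sketch of the corrected C version (the struct layout is assumed, simply matching the 16 bytes above): the allocation comes from the heap, so the returned pointer remains valid after newFoo() returns and must eventually be released with free().

#include <stdlib.h>

struct foo { int data[4]; };   /* assumed layout: 16 bytes, matching the 0x10 above */

struct foo* newFoo(void)
{
    struct foo* F = (struct foo*) malloc(sizeof(struct foo));
    return F;                  /* heap memory persists beyond this scope */
}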

An entire series of courses would be required to fully explain the trade-offs and reasoning, and since the length is growing long, I will conclude here and discuss the usage of this memory in a later post.