Monday, September 23, 2013

Book Review: Turing's Cathedral

Recently, I read Turing's Cathedral: The Origins of the Digital Universe (Vintage), an interesting history that tells of the development of some of the first computers in the United States.  Its particular focus is on the founding of the Institute for Advanced Study (IAS) and John von Neumann's time there.  The von Neumann architecture is something I regularly conceptualize and use in teaching, so it was interesting to read how this architectural model was developed, and why.

In contrast, a significant portion of the book is written as a history of the Institute itself.  Given that the Institute provided access to the records that form much of the source material, it is understandable that the author's focus would be so directed.  However, it contributes to the book's misleading focus.

Perhaps a greater slight is that a work titled "Turing's Cathedral" features Alan Turing himself for only a small part of the writing.  Instead, the greater focus is placed on his work and how it fit into the research of that time, which eventually led von Neumann to explore the use of electronic digital computers to solve the US military's problems.  Von Neumann, like many European scientists, had left his homeland ahead of Hitler, and these scientists supported work leading to Germany's defeat.

The grand development that von Neumann introduced was making a computer programmable.  Beyond being merely reconfigurable, the machine he led the development of at the IAS was programmable: the electronic device could store both data and the codes that were instructions for what the device was to do.  Consider that for the next 50 years, programs would be constrained by having to store their instructions in memory, which was often a very limited resource.

So John von Neumann stars in a book titled for Turing, a book that devotes a third of its pages to references.  It is a good, interesting work that could probably have been improved by an editor's scissors: trimming the writing down to the core material about computers and setting aside many of the well-researched chapters to attain the focus that is lacking.

Wednesday, July 24, 2013

Repost: There's nothing like a tidy codebase

When I grade student assignments, around 90% of the grade is for what the code does.  The other 10% is for how it looks, because There's nothing like a tidy codebase.  So take some additional time and help ensure that the code you write is readable.

Monday, July 22, 2013

Review: Patterns for Cache Optimizations on Multi-processor Machines (ParaPLoP '10)

In order to optimize any code, know that the optimizations "depend on the exact architecture of the machine (processor, memory, etc), the exact version of the compiler, the exact version of the operating system and the particular configuration of the program that [you] are trying to optimize."  At minimum, you should establish what your workload is and how its performance is measured, and then have some gauge of what its current performance is (including the distribution of measurements).  All of which is covered in Patterns for cache optimizations on multi-processor machines before the authors touch on any optimization.  And I applaud them for doing so.

In this work, they explore three patterns of cache (mis-)use on modern processors.  The first pattern is termed "loop interchange", named for its solution.  In this pattern, the program does not access data with spatial locality.  Instead of accessing every element in a cache line, the program has a different ordering and only touches a subset of the cache line, while later (after the line has been evicted) it accesses other subsets.  In the example below, assume that N and M are large enough that the array far exceeds the cache, so this code will likely have significant cache misses (at minimum L1 misses), while switching the "i" and "j" for loops (i.e., the interchange) will considerably reduce the number of cache misses, as shown in the sketch after the example.

int X[N][M];
for (int j = 0; j < M; j++)
  for (int i = 0; i < N; i++)
    f(X[i][j]); // Any set of operations applied to this element.
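
Applying the interchange, the loops are swapped so that the traversal order matches the row-major layout of X.  A sketch of the fix (same X, f, N, and M as above; not code from the paper):

int X[N][M];
for (int i = 0; i < N; i++)      // walk one row at a time
  for (int j = 0; j < M; j++)    // consecutive elements share cache lines
    f(X[i][j]);                  // every element of a fetched line is used before eviction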

The next pattern is false sharing.  Threads in a program share data, both intentionally and unintentionally.  Programmers write data structures to logically group data; however, that grouping reflects the programmer's view and not necessarily the algorithm's access pattern.  The hardware expects locality from arrays and data structures, and when the program is multithreaded, the cache line is the unit by which the hardware tracks sharing.  So if different threads write to different data in the same cache line, the hardware treats the writes as contending for the same thing, and the line bounces between the cores' caches instead of staying cached.  The usual recommendation is to pad the data so that the software unit (here, an int) and the hardware unit (the cache line) are the same size.

int X[N];
void* thread_work(int tid)
{
  for (int i = 0; i < N; i++)
    if (i % num_threads == tid)      // thread tid handles every num_threads-th element
      X[i] = do_work(X[i]);          // adjacent ints written by different threads share a line
  return NULL;
}
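
One common fix is to pad each element so that no two threads ever write to the same cache line.  A minimal sketch, assuming 64-byte cache lines (the constant and the struct name are illustrative; N, num_threads, and do_work are as above):

#define CACHE_LINE 64                      // assumed cache line size

struct padded_int {
  int value;
  char pad[CACHE_LINE - sizeof(int)];      // one element per cache line
};

struct padded_int X[N];

void* thread_work(int tid)
{
  for (int i = 0; i < N; i++)
    if (i % num_threads == tid)
      X[i].value = do_work(X[i].value);    // no cache line is written by two threads
  return NULL;
}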

The next example goes beyond the paper's scope for false sharing.  Common data structures may also have different sharing patterns for each field.  For example, in the data structure below, the following fields are written to: encoding, sum, weight_left, and weight_right.  The rest are read-only.  Currently the data structure uses two cache lines (as all fields are 8 bytes in size).  If the structure were rearranged so that the written fields were in one cache line and the read-only fields in the second, then updates by any thread would result in only one cache miss rather than two.  Padding may be required, but the key insight here is arranging data by sharing pattern, which is a generalization of the previous paragraph.

typedef struct _node {
    graph_value_t value, encoding;
    unsigned long long sum;
    struct _edge* fwd;
    struct _edge* back;

    // tree sorted by value
    struct _node* left;
    struct _node* right;

    // tree sorted by encoding
    struct _node* weight_left;
    struct _node* weight_right;
} node, *pnode;
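
Rearranged by sharing pattern, the structure might look like the following sketch: the written fields are grouped first so that (assuming the structure starts on a cache-line boundary) they fall in one line, with the read-only fields filling the rest.  Only the field order changes; alignment attributes are omitted for brevity.

typedef struct _node {
    // Written fields, grouped into the first cache line:
    graph_value_t encoding;
    unsigned long long sum;
    struct _node* weight_left;   // tree sorted by encoding
    struct _node* weight_right;

    // Read-only fields in the remaining line:
    graph_value_t value;
    struct _edge* fwd;
    struct _edge* back;
    struct _node* left;          // tree sorted by value
    struct _node* right;
} node, *pnode;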


The final pattern explored in the paper is data alignment.  Ignoring the issue of misaligned accesses, let's look at misaligned allocations.  Suppose we allocate an array of 48-byte data structures in a multithreaded program.  Sometimes accessing an element is one cache miss, but sometimes it is two.  The runtime system has packed the data structures together, with 4 elements fitting in 3 cache lines.  In general, allocations come aligned to a 16-byte boundary, but that boundary is not guaranteed to be the start of a cache line.  The primary solution is to use support calls that change the allocation alignment.  This may waste space, but the allocation now occupies the expected number of cache lines.  And by using the lines we expect, we can tailor the program to the architecture and observe the expected performance characteristics.
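
As a sketch of such a support call, posix_memalign can request storage that starts on a cache-line boundary; padding the 48-byte structure up to a full 64-byte line then makes each element occupy exactly one line.  The sizes and names below are assumptions carried over from the example:

#include <stdio.h>
#include <stdlib.h>

#define CACHE_LINE 64

// The 48-byte structure from the example, padded to a full cache line so that
// no element straddles a line boundary (costing 16 bytes per element).
typedef struct {
    char payload[48];
    char pad[CACHE_LINE - 48];
} record_t;

int main(void)
{
    record_t* records = NULL;
    size_t count = 1024;

    // Request storage whose first byte is cache-line aligned.
    if (posix_memalign((void**)&records, CACHE_LINE, count * sizeof(record_t)) != 0)
        return 1;

    printf("each record: %zu bytes, base %p\n", sizeof(record_t), (void*)records);
    free(records);
    return 0;
}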

The patterns are three simple ones that architects and performance minded programmers have known for years.  I am pleased to see them being reiterated, but the response may be like that from the developer after I changed his code per these patterns years ago, "Why can't the compiler just do that for me?!"

Thursday, July 11, 2013

Book Review: Exceptional C++ Style 40 New Engineering Puzzles, ...

Generally, I do not write code in C++; however, on occasion (like when writing LLVM compiler passes), I am forced into using the language.  I also regularly find myself grading student assignments that use C++.  Particularly when reading these assignments, I will be thankful to have read this book, as I can better explain how students have violated the standard, as they have done in the past.

Were I forced to read a book directly on the C++ standards, let's just say I can think of lots of things I'd rather be doing.  But while Exceptional C++ Style: 40 New Engineering Puzzles, Programming Problems, and Solutions exposed me to more of the standards, I never felt like I was reading quotes from the standard.  Instead, I was listening to an interesting conversation about some real programming questions that just may require invoking the standards to answer.

I enjoyed chapters 20 and 21, as I appreciate the effort toward explaining how memory is allocated and structures are laid out.  It helps dispel the misconception that new / malloc are how the OS provides memory.  I also learned that new will throw an exception instead of returning NULL.  Perhaps it is time to rewrite some code.  Furthermore, I now understand why most C++ code uses preincrement on iterators.

It is not strictly a book on style, but instead this tome covers the style I care most about: good programming practices.  I don't care which style convention you use, so long as you use one consistently.  But for whatever style your code has, it had better be good code.

I recommend reading this book even if you do not regularly use C++.  I will note that it is dated; however, unless you are using C++11, the text is still timely, and even if you are using C++11, I doubt that everything has changed (though I did notice the discussion of the auto keyword is out of date).

Friday, June 21, 2013

What is Time: Keeping Time in Software

In a prior post, Measuring Performance, I discussed some techniques for how a developer might measure the performance of his or her application.  However, sometimes what you want is to measure an interval of time.  We can generally agree to do this by taking two measurements: the time before and the time after.  Then find the difference, which is the interval of time elapsed.  Great!  But how exactly are we going to measure time?

A common theme to the questions I answer on stackoverflow is proper measurement of time (see for example, Analysis of CPU caches, Function Pointers versus Inline Functions, or Windows system time with millisecond resolution).  A programmer starts with the tools most readily available, usually a call to gettimeofday or GetTickCount (for example).  These functions provide time in (roughly) millisecond resolution, like having a stopwatch.  For its own timekeeping purposes, an OS sets up hardware to send an interrupt every ~16ms (i.e., 64Hz via the PIT).  This interrupt primarily denotes a scheduling quantum, but for little additional overhead, the OS updates its clock.  Some applications need finer-grained scheduling and increase the frequency of the interrupt (i.e., decrease the quantum), which increases overhead and power usage.  Using these calls, the programmer can measure events that take at least 1 centisecond with reasonable accuracy.
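
For example, a minimal interval measurement with gettimeofday might look like the following sketch (the usleep is just a stand-in for the work being measured):

#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    struct timeval before, after;

    gettimeofday(&before, NULL);
    usleep(20 * 1000);                       // stand-in for ~20ms of work
    gettimeofday(&after, NULL);

    long elapsed_us = (after.tv_sec - before.tv_sec) * 1000000L
                    + (after.tv_usec - before.tv_usec);
    printf("elapsed: ~%ld us (trust it only to the millisecond here)\n", elapsed_us);
    return 0;
}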

But, you object, I want better accuracy!  Let's turn to the HPET then, a hardware device that internally operates at 10+ MHz (ticks of 100ns or less).  Thanks to this device, the system has a microsecond time source.  For this resolution, a programmer may use clock_gettime or QueryPerformanceCounter.  Both also come with supplemental routines that report the resolution of the time returned.  For example:

#include <stdio.h>
#include <time.h>

int main(int argc, char** argv)
{
    struct timespec n, o;
    clockid_t t = CLOCK_MONOTONIC;

    clock_getres(t, &n);
    printf("%ld - %ld\n", (long)n.tv_sec, n.tv_nsec);   // reported resolution

    clock_gettime(t, &n);                               // two back-to-back readings
    clock_gettime(t, &o);
    printf("%ld - %ld\n", (long)n.tv_sec, n.tv_nsec);
    printf("%ld - %ld\n", (long)o.tv_sec, o.tv_nsec);

    return 0;
}


% ./t_test
0 - 999848
13390932 - 798720095
13390932 - 798720095

This time source has a reported resolution of approximately 1ms; however, sometimes the printed "n, o" pair actually shows a one microsecond difference, which suggests that the true resolution is finer than what clock_getres returns.  Thus on my test machine, I can measure time to the microsecond.

But, you object, I want better accuracy!  So let's move on to the timestamp counter then.  This counter is per-processor and runs at the processor's frequency (1+ GHz, or <1ns per tick).  clock_gettime and QueryPerformanceCounter may use this time source.  So we can query it for a nanosecond-accurate time measurement.  Done, or are we?

But, I object, it takes time to do this measurement; perhaps a very small amount of time, but time nonetheless.  If we switch the code above to use CLOCK_PROCESS_CPUTIME_ID, the measurements show a 400 - 500ns difference between n and o, even though the reported "resolution" is 1ns.  Did the function lie about its resolution?  No.  Imagine calling NIST or CERN to get the time from their clock.  Their clock is accurate to better than 1ns, but it takes time to call them.  In the code example, each call is taking about 400ns to make the function call(s), query the timestamp and then convert into the appropriate format.  Note that this overhead was also present in the earlier experiments, which is why the earlier output occasionally differed by 1 microsecond, as the 400ns of overhead caused the measurements to fall in different microsecond ticks.

But, you object, I really need to know how many nanoseconds this something takes.  There are two approaches to remedy this: first, write some code that performs the measurement repeatedly to determine the overhead and then subtract that overhead from the time difference; or second, find a different time source.  A sketch of the first approach follows.
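
Something like the following estimates the per-measurement overhead by taking back-to-back readings (the sample count is arbitrary):

#include <stdio.h>
#include <time.h>

#define SAMPLES 100000

int main(void)
{
    struct timespec a, b;
    long long total_ns = 0;

    // Back-to-back calls: the difference is (approximately) the measurement overhead.
    for (int i = 0; i < SAMPLES; i++)
    {
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &a);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &b);
        total_ns += (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
    }

    printf("average overhead: %lld ns per measurement\n", total_ns / SAMPLES);
    return 0;
}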

These function calls are querying the timestamp counter, but architects have exposed the counter to programmers for years, so why make the call?  In x86, it is the instruction RDTSC or RDTSCP, which provides the 64-bit timestamp value.  A single instruction is about as small an overhead as we're going to get (although we can also look at register pressure, cycles required, etc).
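
On GCC and Clang, the instruction is available without writing assembly via the __rdtsc() intrinsic in <x86intrin.h>.  A minimal sketch:

#include <stdio.h>
#include <x86intrin.h>   // __rdtsc()

int main(void)
{
    unsigned long long start = __rdtsc();   // executes RDTSC

    volatile long sink = 0;
    for (long i = 0; i < 1000000; i++)      // measure an aggregate, not a single instruction
        sink += i;

    unsigned long long end = __rdtsc();
    printf("%llu TSC ticks for the loop\n", end - start);
    return 0;
}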

Hah!  You exclaim, now I can know which instruction sequence is fastest, because I can measure CPU cycles.  But at this timescale, your program's measurement has left the "classical" architecture model of executing one instruction at a time and is drifting into the reality of many 10s of instructions executing simultaneously.  When you toss in your RDTSC, there is no guarantee where in this mix of instructions it will execute and complete (with RDTSCP, you can guarantee that all previous instructions have completed).  So it is recommended that even if you are measuring a "small instruction sequence" like some of the cited stackoverflow questions, that you perform thousands of operations and measure the time of this aggregate.  Of course, architectural behaviors like branch prediction, cache misses, prefetchers, etc will change over the course of the experiment.

As if this wasn't bad enough, the timestamp counter (including calls to CLOCK_PROCESS_CPUTIME_ID) is specific to each processor.  The counter is not global system time, so taking measurements on different processors (say different threads or a thread that migrated) will not necessarily give meaningful results.  Many modern OSes try to set the counters so that the measurements are within a small margin of each other.  In practice, I have observed deltas from low 100s to 10s of millions, so be warned.

Even worse, older processors really did count CPU cycles as the timestamp, so if the processor slowed down, then a "cycle" meant more nanoseconds.  Modern chips have an invariant TSC, per the Intel architecture manual:

The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource.
As a bonus, all of these clocks can be affected by thermal drift, as their frequency is related to their temperature.  Others have investigated this behavior in greater detail, and while it can be accounted for, in practice I've happily ignored it (although I should note in future writing that I have not controlled for this possible measurement bias).

In conclusion, I use RDTSC for most of my measurements and try to control for the clock difference that exists between processors.  In my current PhD research, I am still working to try to control and properly account for this difference so that I can measure the behavior that I am interested in.  So be warned and be cautious, and be as precise as you need to be.

Wednesday, June 5, 2013

Atomics versus Locks

I've written in the past about using intrinsics to improve performance.  And I've discussed using this functionality to build your own synchronization.  I'm not alone in this knowledge, and I want to further expand the answer to a stackoverflow question.  But sometimes a little knowledge is a dangerous thing.  The trouble came from a paper that looked at, among other things, the performance of incrementing a value using locks versus using interlocked compare-and-swap for two threads.

Alright, compare-and-swap (CAS) should be faster than a lock, as a lock acquire and a lock release can each be approximated as one such operation.  So let's test and see how the results turn out.  First, the base code:

#include <stdio.h>
#include <x86intrin.h>

#define rdtsc() __rdtsc()  // one possible rdtsc(); the original implementation is not shown
#define MAX 500000000ULL   // matches the 0x1dcd64ff (= MAX - 1) in the assembly below

unsigned long long counter = 0;

int main(int argc, char** argv)
{
    unsigned long long start, end;

    start = rdtsc();
    do {
        counter++;
    } while (counter < MAX);
    end = rdtsc();

    printf("%llu\n", (end - start));
    return 0;
}


This version is the "Single Thread" result.  The first variation is to make counter volatile, as the variable needs that property once we move to multiple threads.  By doing so, the compiler has to read, modify, and then write the result to memory on every loop iteration.  That is overhead compared with keeping the value in a register, as in the assembly that follows:

loop: add    $0x1,%rax
      cmp    $0x1dcd64ff,%rax
      jbe    loop

And the volatile code adds: mov    0x1429(%rip),%rax and a corresponding write after.  Already this is enough to measurably impact performance.  Let's move forward and switch to performing our increment using compare-and-swap.

    int i = 1;  // loop flag, cleared once the counter reaches MAX
    do {
        unsigned long long new, old = counter;

        if (old < MAX)
        {
            new = old + 1;
            // Updates counter only if it still equals old; otherwise another
            // thread won the race and this iteration simply retries.
            __sync_val_compare_and_swap(&counter, old, new);
        }
        else
            i = 0;

    } while (i == 1);


Now, we only want to increment when we haven't reached max.  If we have, then terminate.  But the assembly is now unpleasant:

loop: mov    0x144b(%rip),%rax        # 401a90 <counter>
      cmp    $0x1dcd64ff,%rax
      jbe    swap
...
swap: lea    0x1(%rax),%rcx
      lock cmpxchg %rcx,0x140b(%rip)        # 401a90 <counter>
      jmp    loop

First, the optimized code has been split into two pieces: the check and the increment.  A fun exercise would be to convince the compiler to switch the jbe into a ja and combine the pieces back together.  But the fact is, the impact of using the atomic operation looks like roughly 6x.
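
For reference, the lock-based variant measured below might look something like this sketch using a pthread_mutex (the post does not show its actual lock code, so the details here are assumptions):

#include <pthread.h>

#define MAX 500000000ULL                      // same bound as above
extern unsigned long long counter;            // the same counter as above
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

void locked_increment_loop(void)
{
    int i = 1;
    do {
        pthread_mutex_lock(&counter_lock);    // acquire
        if (counter < MAX)
            counter++;
        else
            i = 0;                            // reached MAX, stop looping
        pthread_mutex_unlock(&counter_lock);  // release
    } while (i == 1);
}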

As stated before, a lock is approximately twice the cost of the atomic operation and that result is roughly observed (along with some further overhead).  Each version is then modified to spawn a second thread that does the same work as the first.  All of my results are cycles per increment and follow (taken from a single run, reproducibility not guaranteed):

     TEST              -      -O3      -      -O0
Single Thread          -   1 cyc / inc -   7 cyc / inc
Single Thread volatile -   6 cyc / inc -  16 cyc / inc
Single Thread w/ CAS   -  38 cyc / inc -  46 cyc / inc
Single Thread w/ lock  -  90 cyc / inc -  92 cyc / inc
Two Threads   w/ CAS   - 250 cyc / inc - 251 cyc / inc
Two Threads   w/ lock  - 730 cyc / inc - 630 cyc / inc

What can we conclude?  First, adding -O3 (optimizing) had almost no effect on the results once the synchronization mechanisms are introduced.  Yes, unoptimized code is bad, but it hardly matters when nearly all of your work is synchronization and the constraints it imposes on the processor's performance tricks.

Second, why did the unoptimized lock version do better?  Because there was a different thread interleaving.  Perhaps the OS put one of the threads to sleep and the other thread ran happily for a while.  The point is that if you are synchronizing, then that is your overhead.

Third, and here is the dangerous knowledge, you might conclude that you want to use CAS instead of a lock.  Yet this test was highly artificial.  The synchronized operation took just a couple of cycles when optimized, so we could observe the overhead of synchronization itself.  Suppose your work is instead 1250 cycles; then using CAS is only roughly a 33% improvement (1250 + 250 versus 1250 + 730) and not a 3x one (250 versus 730).  Furthermore, how often is your synchronization over a single variable?

To conclude, IF what you want to synchronize is a single variable, then using CAS can work.  Have I done this?  Yes.  Do I normally?  No.  I have this tool in my programming toolbox, for those times that I really need it.

Anyway, why not use an atomic increment instead?  Because we want to ensure that the final value is exactly MAX, no more and no less.  With an unconditional atomic increment, two threads could both pass the bound check and push the counter past MAX, whereas the CAS only succeeds when the counter still holds the value that was checked.

Wednesday, May 22, 2013

More on Haswell

Since I am already excited about some of the features of the new Haswell family, it was great to see an even deeper look into microarchitecture.  There are the usual improvements, like expanding the instruction window or increasing cache bandwidth.  So read for yourselves at Ars Technica.

Tuesday, April 16, 2013

When whitespace matters

In the process of grading student programming projects, I discovered that whitespace can matter in how C/C++ code is compiled.  Some programmers may be aware that string literals can be concatenated together:

const char* t = "this string" " is one";

The whitespace between the literals is ignored.  Furthermore, programmers may define strings via macros and then replace each instance with the macro:

#define A_STRING " is one"

Now, in recent versions of g++ with C++11 support, the token following a string literal can be treated either as a macro or as a user-defined literal suffix.  The compiler makes the determination based on whether there is whitespace between the literal and the token.

const char* t = "this string"A_STRING; // compiler error
const char* t = "this string" A_STRING; // expected behavior

I like consistent use of whitespace in code for stylistic reasons, but introducing a singular dependency on whitespace is odd for C/C++, in contrast to Python, which is highly whitespace dependent.

Wednesday, April 3, 2013

Architecture Abstraction Leaks

As a computer architect, I like to think that the hardware abstraction is generally opaque.  If a programmer really needs performance or certain effects, then it is possible to reach through the layer, but in general the computer gives you what you want.  Sometimes, a programmer can do simple things that pull the covers back and reveal the architecture in all of its gritty detail.  One example would be following row versus column indexing when looping over large pieces of data.

In this post, I wanted to draw your attention to a question posed on stackoverflow, Why is processing a sorted array faster than an unsorted array?  In this question, we are ignoring the time to sort the array.  The operation applied to each array element is relatively simple and is conditional on the value of the element.  By sorting the array, the same operations are applied to contiguous runs of elements, and because the branch outcomes become predictable, the code in question gets a clear win from branch prediction.

But the branch predictor is abstracted hardware, so once again the programmer just learned about the architecture when he / she didn't intend to.

Wednesday, March 27, 2013

Transactional Memory and Intel's TSX

It was some years ago that I sat in the audience and heard AMD present a sketch for how transactional memory (TM) would be implemented in the x64 ISA.  More recently a fellow student mentioned that Intel has some new extensions entering the x64 ISA for locks, etc.  I'm always a fan of properly implemented locks, as they so often limit performance and scalability.  So let's dig into Intel's TSX and why I actually want to go buy a gadget when it's released.

A programmer can delineate a transactional section with the XBEGIN and XEND instructions.  Within the transactional section, all reads and writes are added to a read-set or a write-set accordingly.  The granularity for tracking is a cache line.  If another processor writes to a line in the read-set, or reads or writes a line in the write-set, then the transaction aborts.

Transactions can be semi-nested: a nested transaction only commits when the outermost transaction completes, and internally nested transactions do not commit on XEND.  If any transaction in the nest aborts, then the entire transaction aborts.  When the number of XBEGINs executed equals the number of XENDs (|XBEGIN| equals |XEND|), the entire transaction commits and becomes globally visible.  A transaction can be explicitly aborted by the XABORT instruction, which enables the code to abort early when it can determine that the transaction will or should fail.
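
For a flavor of how this might look in code, GCC exposes these instructions through the RTM intrinsics _xbegin, _xend, and _xabort in <immintrin.h> (compiled with -mrtm).  The following is only a sketch under those assumptions; a production lock-elision path would also check, inside the transaction, that the fallback lock is free:

#include <immintrin.h>   // _xbegin(), _xend(), _XBEGIN_STARTED
#include <pthread.h>

static pthread_mutex_t fallback = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter = 0;          // hypothetical shared data

void transactional_increment(void)
{
    if (_xbegin() == _XBEGIN_STARTED)    // XBEGIN: start tracking reads and writes
    {
        shared_counter++;                // tracked at cache-line granularity
        _xend();                         // XEND: commit if no conflict occurred
    }
    else                                 // aborted: conflict, capacity, explicit XABORT, etc.
    {
        pthread_mutex_lock(&fallback);
        shared_counter++;
        pthread_mutex_unlock(&fallback);
    }
}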

As I understand it, TSX is being built on top of the existing cache coherence mechanisms.  Each cache line gains an additional bit to indicate if it is part of a transaction.  Each memory operation is treated normally between the processor and the coherence hierarchy with several caveats.  If a dirty, transactional block is evicted, then the transaction fails.  If a dirty, transactional block is demand downgraded from modified to shared or invalid, then the transaction fails.  In this case, a new message would indicate that the request to forward the data fails and the request should be satisfied by memory.

If the transaction commits, then the transactional bits are cleared on each cache line.  And the lines operate normally according to existing cache coherence mechanisms.

Wrapping this up, TSX is an ISA extension that almost every program can take advantage of and therefore has an appeal toward conducting personal testing, just like building my own x64 machine back in 2005.

Monday, March 4, 2013

Maps and Red-Black Trees

While collaborating on a prior research project, Brainy: effective selection of data structures, I learned about the cost of selecting particular data structures.  I expect that every computer scientist knows the general cost of using standard data structures like trees and linked lists.  The extension that Brainy provided was to recognize that a "tree" can have different implementations, and these implementations can have different costs for a given workload.  I, and I expect others, learned about AVL, splay, red-black, and other tree types.

All this is good until your type hides its implementation and you reason about it in the abstract.  A map is commonly conflated with a particular implementation.  A map is a key-value store, where a "key" has an associated "value".  An address book can be understood as a map from name to address / phone number / etc.

Now, a map is often assumed to be implemented as a hash table, giving O(1) look-up cost.  However, in the C++ standard library, map is typically implemented as a red-black tree, giving O(log n).  Before you go and request that the libraries be changed, understand that experimental measurements suggest the red-black implementation wins out when n > 500, as hash collisions erode the idealized O(1) cost.  This is the problem that Brainy attempts to solve: the programmer selects the functionality, e.g., a map, and Brainy selects the best implementation given your program and architecture.
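
For illustration, in C++11 the two implementations sit behind nearly identical interfaces, so switching between the tree-backed and hash-backed versions is mostly a change of type.  A small sketch (not related to Brainy itself):

#include <iostream>
#include <map>
#include <string>
#include <unordered_map>

int main()
{
    // Tree-backed (a red-black tree in common implementations): O(log n) look-up, keys kept sorted.
    std::map<std::string, std::string> book_tree;
    // Hash-backed: amortized O(1) look-up, no ordering, cost depends on collisions.
    std::unordered_map<std::string, std::string> book_hash;

    book_tree["Alice"] = "123 Main St";
    book_hash["Alice"] = "123 Main St";

    std::cout << book_tree["Alice"] << "\n" << book_hash["Alice"] << "\n";
    return 0;
}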

So go ahead and use a map, but recognize that you may not have O(1) cost and definitely won't as n grows increasingly large.

Monday, January 28, 2013

Repost: When Beauty is Not Truth

While programming is not a field discussed in the article, When Beauty Is Not Truth, it nonetheless makes me wonder about the focus I put on elegance in programming (starting with the name of this blog).  So let's consider a couple of quick observations about Computer Science and the beauty of code.  I should add that the article treats beauty, elegance, and simplicity as roughly equivalent.

First, beautiful code is not always correct.

Second, beautiful code, by virtue of its simplicity, is less likely to have bugs.

Third, beautiful code is more readable, which facilitates comprehension.

(Now back to preparing the lecture for today's class)