Wednesday, July 24, 2013

Repost: There's nothing like a tidy codebase

When I grade student assignments, around 90% of the grade is for what the code does.  The other 10% is for how it looks, because there's nothing like a tidy codebase.  So take some additional time to ensure that the code you write is readable.

Monday, July 22, 2013

Review: Patterns for Cache Optimizations on Multi-processor Machines (ParaPLoP '10)

In order to optimize any code, know that the optimizations "depend on the exact architecture of the machine (processor, memory, etc), the exact version of the compiler, the exact version of the operating system and the particular configuration of the program that [you] are trying to optimize."  At minimum, you should establish what your workload is and how its performance is measured, and then have some gauge of its current performance (including the distribution of measurements).  All of this is covered in Patterns for cache optimizations on multi-processor machines before the authors touch on any optimization.  And I applaud them for doing so.

In this work, they explore three patterns of cache (mis-)use on modern processors.  The first pattern is termed "loop interchange", named for its solution.  In this pattern, the program does not access data with spatial locality: instead of accessing every element in a cache line, the program's ordering touches only a subset of the line, and it accesses the other subsets later, after the line has been evicted.  In the example below, assume that N and M are both quite large (say 1 million or more), so this code will incur significant cache misses (at minimum, L1 misses), while switching the "i" and "j" for loops (i.e., the interchange) will considerably reduce the number of misses.

int X[N][M]; // Row-major: X[i][j] and X[i][j+1] are adjacent in memory.
for (int j = 0; j < M; j++)
  for (int i = 0; i < N; i++) // Inner loop strides M ints per access: poor spatial locality.
    f(X[i][j]); // Any set of operations applied to this element.
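
For comparison, a sketch of the interchanged loops: the inner loop now walks each row in memory order, touching every element of a cache line before moving on.

int X[N][M];
for (int i = 0; i < N; i++)
  for (int j = 0; j < M; j++) // Inner loop hits consecutive ints in the same cache line.
    f(X[i][j]);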

The next pattern is false sharing.  Threads in a program share data, both intentionally and unintentionally.  Programmers write data structures to logically group data; however, that grouping reflects the programmer's view, not algorithmic need.  The hardware expects locality from arrays and data structures, and in a multithreaded program the cache line is the unit by which the hardware tracks sharing.  So if different threads write to different data in the same cache line, the hardware treats the writes as writes to the same thing, and the line bounces between the cores' caches rather than staying cached in any of them.  The usual recommendation is to pad the data so that the software notion of an element (an int, below) and the hardware notion (a cache line) are the same size.

int X[N];
void* thread_work(int tid)
{
  // Threads stripe across X: adjacent ints are written by different
  // threads, so each write invalidates the line in the other threads' caches.
  for (int i = 0; i < N; i++)
    if (i % num_threads == tid)
      X[i] = do_work(X[i]);
  return NULL;
}
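
Here is a sketch of the padding fix, assuming 64-byte cache lines (the padded_int name is mine, not the paper's): each element now occupies its own line, so no two threads ever write into the same one.

#define CACHE_LINE 64
struct padded_int {
  int value;
  char pad[CACHE_LINE - sizeof(int)];
};
struct padded_int X[N];

void* thread_work(int tid)
{
  for (int i = 0; i < N; i++)
    if (i % num_threads == tid)
      X[i].value = do_work(X[i].value); // Writes stay within this thread's own line.
  return NULL;
}

Assigning each thread a contiguous block of indices, rather than striping by tid, would avoid both the false sharing and the wasted pad space.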

This second example goes beyond the paper's scope for false sharing.  Common data structures may also have different sharing patterns for each element.  For example, in the data structure below, four fields are written to: encoding, sum, weight_left, and weight_right.  The rest are read-only.  Currently the structure occupies two cache lines (all nine fields are 8 bytes each, 72 bytes total).  If the structure were rearranged so that the written fields sat in one cache line and the read-only fields in the second, then an update by any thread would cause one cache miss rather than two.  Padding may be required, but the key insight here is arranging data by sharing pattern, which is a generalization of the previous paragraph.

typedef struct _node {
    graph_value_t value, encoding;
    unsigned long long sum;
    struct _edge* fwd;
    struct _edge* back;

    // tree sorted by value
    struct _node* left;
    struct _node* right;

    // tree sorted by encoding
    struct _node* weight_left;
    struct _node* weight_right;
} node, *pnode;
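
A sketch of one possible rearrangement, assuming 64-byte cache lines and 8-byte fields (the pad size is my arithmetic, not the paper's): the four written fields fill the first line, and the five read-only fields sit in the second.

typedef struct _node {
    // Written fields: an update touches only this first cache line.
    graph_value_t encoding;
    unsigned long long sum;
    struct _node* weight_left;
    struct _node* weight_right;
    char pad[32]; // 4 fields x 8 bytes = 32; pad out to the 64-byte boundary.

    // Read-only fields: this line stays valid in every core's cache.
    graph_value_t value;
    struct _edge* fwd;
    struct _edge* back;
    struct _node* left;
    struct _node* right;
} node, *pnode;

This assumes the structure itself starts at a cache line boundary, which is the subject of the next pattern.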


The final pattern explored in the paper is data alignment.  Ignoring the issue of misaligned accesses, let's look at misaligned allocations.  Suppose we allocate an array of 48-byte data structures in a multithreaded program.  Accessing an element sometimes costs one cache miss and sometimes two: the runtime system has packed the structures together, with 4 structures fitting in 3 cache lines.  In general, allocations are aligned to a 16-byte boundary, but that boundary is not guaranteed to be the start of a cache line.  The solution is to use the support calls that change the allocation's alignment.  This may waste space, but each structure now occupies the expected number of cache lines.  And by using the lines we expect, we can tailor the program to the architecture and observe the expected performance characteristics.
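
A minimal sketch of such a support call, using posix_memalign (aligned_alloc is the C11 equivalent); the 64-byte line size, the rounding, and the function name are my assumptions, not the paper's.

#include <stdlib.h>

// Allocate count elements of elem_size bytes, each starting on its own cache line.
void* alloc_on_lines(size_t count, size_t elem_size)
{
    void* p = NULL;
    size_t padded = (elem_size + 63) & ~(size_t)63; // Round each element up to a multiple of 64.
    if (posix_memalign(&p, 64, count * padded) != 0)
        return NULL;
    return p;
}

For the 48-byte structure above, this rounds each element up to 64 bytes: one full line per element, at the cost of 16 wasted bytes apiece.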

These are three simple patterns that architects and performance-minded programmers have known for years.  I am pleased to see them reiterated, though the response may echo what one developer told me after I changed his code per these patterns years ago: "Why can't the compiler just do that for me?!"

Thursday, July 11, 2013

Book Review: Exceptional C++ Style: 40 New Engineering Puzzles, ...

Generally, I do not write code in C++; however, on occasion (like when writing LLVM compiler passes), I am forced into using this language.  More regularly, I also find myself grading student assignments that use C++.  When reading those assignments in particular, I will be thankful to have read this book, as it leaves me better able to explain how students have violated the standard, as they have done in the past.

Were I forced to read a book directly on the C++ standards, let's just say I can think of lots of things I'd rather be doing.  But while Exceptional C++ Style: 40 New Engineering Puzzles, Programming Problems, and Solutions exposed me to more of the standards, I never felt like I was reading quotes from the standard.  Instead, I was listening to an interesting conversation about some real programming questions that just may require invoking the standards to answer.

I enjoyed chapters 20 and 21, as I appreciate the effort toward explaining how memory is allocated and structures are laid out.  They help dispel the misconception that new / malloc are how the OS provides memory.  I also learned that new throws an exception instead of returning NULL; perhaps it is time to rewrite some code.  Furthermore, I now understand why most C++ code uses preincrement on iterators.
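
A small sketch of those last two points as I understand them (the demo function is mine, not the book's): plain new throws std::bad_alloc on failure, while the nothrow form restores the old check-for-NULL style; and preincrement is preferred because postincrement must construct and return a copy of the iterator's prior value.

#include <new>
#include <vector>
#include <iostream>

void demo()
{
    int* p = new (std::nothrow) int[1000]; // Returns NULL on failure; plain new would throw std::bad_alloc.
    if (p == NULL)
        return;

    std::vector<int> v(p, p + 1000);
    // ++it advances in place; it++ would construct and discard a copy of the old iterator.
    for (std::vector<int>::const_iterator it = v.begin(); it != v.end(); ++it)
        std::cout << *it << '\n';

    delete[] p;
}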

It is not strictly a book on style, but instead this tome covers the style I care most about: good programming practices.  I don't care which style convention you use, so long as you use one consistently.  But for whatever style your code has, it had better be good code.

I recommend reading this book even if you do not regularly use C++.  I will note that it is dated; however, unless you have moved to C++11, the text is still timely, and even if you have, I doubt that everything has changed (though I did notice that the discussion of the auto keyword is out of date).