Elegant C: patterns

Monday, April 13, 2015

Book Review: Structured Parallel Programming

At SIGCSE this year, I spoke with several publishers about possible textbooks, primarily ones that would work in the different classes I might teach, as well as those that relate to my present research. In this case, I received a copy of Structured Parallel Programming: Patterns for Efficient Computation from the publisher to review and consider for possible future use in a course.

This work is in three parts: the basics of parallelism and performance, the common patterns in which parallelism is expressed, and example implementations of several algorithms. The second part is the core of the work, covering map, reduce, scatter and gather, stencil, fork-join, and pipeline. But before reaching those details, we come to quotes that are key to all that I do:

You cannot neglect the performance of serial code, hoping to make up the difference with parallelism.

And:

[The] performance of scalar processing is important; if it is slow it can end up dominating performance.

Therefore, parallelism is a method for improving the performance of already efficient code.

For both the common patterns and the example implementations, the authors generally provide source code using Cilk, TBB, and OpenMP. This source is not for casual readers. More involved implementations can stretch for several pages, as the initial implementation and then subsequent refinements are explored. While it serves well as a reference, it may have worked better to focus on one parallelism approach for each section and therefore give further explanation to the code, especially the language features used. That would keep the focus on the pattern itself rather than turning the book into a practitioners' reference.

The example implementations (the third part) are perhaps the least interesting for the classroom and potentially the most interesting for practitioners. Clearly, if I were trying to write code similar to one of these problems, I would have an excellent reference and starting point. However, that is quite rarely the case for me, and I suspect for most people as well.

If I were to teach a parallel programming course, I might consider using this work (although I still have other, similar textbooks to review); however, were I to do so, I would confine my teaching to the first two parts and maybe even to just one parallel programming paradigm. Yes, I will admit that the last parallel programming course I took covered a diversity of paradigms (Cilk, vectorization, GPUs, OpenMP, MPI), yet I would have preferred to focus more on what one or two paradigms are capable of rather than just a taste of many. Parallel programming takes a lot of work to learn, and this book is one piece in that effort.

Monday, February 23, 2015

Compilers and Optimizations, should you care?

Compiler optimizations matter. Once, while helping a fellow Ph.D. student improve a simulator's performance, I did two things: first, I replaced an expensive data structure with a more efficient one. Second, I turned on compiler optimizations. Together, these changes made the simulator run 100x faster.

A question posted on Stack Exchange asked, "Why are there so few C compilers?" The main answer pointed out that any C compiler needs to be optimizing. Many optimizations occur on every compilation, each one gaining the tiniest increment in performance. While I enjoy discussing them in detail, I generally wave my hands and say that they are good, yet make debugging difficult. These optimizations are lumped together as the "optimization level".

In "What Every Programmer Should Know About Compiler Optimizations", we revisit optimizations. First, the compiler is no panacea and cannot correct for inefficient algorithms or poor data structure choices (although I am party to research on the latter). The article then suggests four points to assist the compiler in its efforts at optimizing the code.

"Write understandable, maintainable code." Please do this! Usually, the expensive resource is the programmer. So the first optimization step is to improve the programmer's efficiency with the source code. Remember Review: Performance Anti-patterns and do not start optimizing the code until you know what is slow.

"Use compiler directives." Scary. Excepting the inline directive, I have used these less than a half dozen times in almost as many years of performance work. Furthermore, the example of changing the calling convention is less relevant in 64-bit space where most conventions have been made irrelevant.

"Use compiler-intrinsic functions." (see Compiler Intrinsics the Secret Sauce) This can often dovetail with the first point by removing ugly bit twiddling code and putting in clean function calls.

"Use profile-guided optimization (PGO)." This optimization is based on the dynamic behavior of the program. Meaning that if you take a profile of the program doing X, and later the program does Y; executing Y can be slower. The key is picking good, representative examples of the program's execution.

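As a minimal sketch of the intrinsics point above, here is a hand-written bit-twiddling loop next to the equivalent intrinsic call. It assumes GCC or Clang, where __builtin_popcount is available; the article itself targets Visual C++, which has its own intrinsics. The function names are hypothetical.

```c
#include <assert.h>

/* Bit twiddling: clear the lowest set bit until none remain. */
unsigned popcount_twiddle(unsigned x)
{
    unsigned count = 0;
    while (x) {
        x &= x - 1;   /* drops the lowest set bit */
        count++;
    }
    return count;
}

/* Intrinsic: one readable call that the compiler can map to a single
   instruction when the target supports it (GCC/Clang builtin). */
unsigned popcount_intrinsic(unsigned x)
{
    return (unsigned)__builtin_popcount(x);
}

/* Both versions must agree on any input. */
void check_popcount(unsigned x)
{
    assert(popcount_twiddle(x) == popcount_intrinsic(x));
}
```

The intrinsic version states the intent directly, which is how this point dovetails with writing understandable code.
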
So you have dialed up the optimization level and written understandable code sprinkled with intrinsics. Now what? The next step (with which I agree) is to use link-time optimization (LTO) / Link-Time Code Generation (LTCG). This flag delays many optimizations until the entire program is available to be linked. One of the principles of software performance is that the more of the program that is available to be optimized, the better it can be optimized. (This principle also applies in computer architecture.) Thus, by delaying many optimizations until the entire program is available, the linker can find additional and better opportunities than were present in individual components.

The article notes, "The only reason not to use LTCG is when you want to distribute the resulting object and library files." And alas, I have fought several battles to overcome this point, as my work requires the use of LTO. Perhaps in the next decade, LTO will be standard.

Monday, July 22, 2013

Review: Patterns for Cache Optimizations on Multi-processor Machines (ParaPLoP '10)

In order to optimize any code, know that the optimizations "depend on the exact architecture of the machine (processor, memory, etc), the exact version of the compiler, the exact version of the operating system and the particular configuration of the program that [you] are trying to optimize." At minimum, you should establish what your workload is and how its performance is measured, and then have some gauge of what its current performance is (including the distribution of measurements). All of this is covered in Patterns for Cache Optimizations on Multi-processor Machines before the authors touch on any optimization. And I applaud them for doing so.

In this work, they explore three patterns of cache (mis-)use on modern processors. The first pattern is termed "loop interchange", named for its solution. In this pattern, the program does not access data with spatial locality. Instead of accessing every element in a cache line, the program has a different ordering and only touches a subset of the cache line, while later (after the line has been evicted) it accesses other subsets. In the example below, assume that N and M are both quite large (say 1+ million), so this code will likely have significant cache misses (at minimum L1 misses), while switching the "i" and "j" for loops (i.e. interchange) will considerably reduce the number of cache misses.

int X[N][M];
int i, j;
for (j = 0; j < M; j++)          // outer loop walks columns
    for (i = 0; i < N; i++)      // inner loop jumps a full row each step
        f(X[i][j]); // Any set of operations applied to this element.
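The interchange described above can be sketched as follows. This is only an illustration: N and M are small hypothetical sizes so that it runs quickly, and a simple sum stands in for f.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical small dimensions; the post assumes far larger N and M. */
enum { N = 512, M = 1024 };

static int X[N][M];

/* Fill the array with a simple known pattern. */
void fill(void)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < M; j++)
            X[i][j] = (int)((i + j) % 7);
}

/* Original order: the inner loop strides down a column, so consecutive
   accesses are M * sizeof(int) bytes apart. */
long sum_column_order(void)
{
    long total = 0;
    for (size_t j = 0; j < M; j++)
        for (size_t i = 0; i < N; i++)
            total += X[i][j];
    return total;
}

/* Interchanged order: the inner loop walks along a row, so consecutive
   accesses are adjacent in memory and use every element of a line. */
long sum_row_order(void)
{
    long total = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < M; j++)
            total += X[i][j];
    return total;
}

/* Both orders visit the same elements, so the results must agree. */
void check_interchange(void)
{
    fill();
    assert(sum_column_order() == sum_row_order());
}
```

Interchange is only valid when the loop body has no dependence on the iteration order, as is the case for this sum.
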

The next pattern is false sharing. Threads in a program share data both intentionally and unintentionally. Programmers write data structures to group data logically; however, that grouping often reflects the programmer's convenience rather than algorithmic need. The hardware expects locality from arrays and data structures. In a multithreaded program, the cache line is the unit by which the hardware tracks sharing of data. So if different threads write to different data in the same cache line, the hardware treats the writes as being made to the same thing, and the line moves back and forth between the cores' caches instead of staying cached in one. The usual recommendation for solving this problem is to pad the data, so that the software notion (int) and hardware notion (cache line) are the same size.

int X[N];
void* thread_work(int tid)
{
    // Interleaved assignment: neighboring ints belong to different threads.
    for (int i = 0; i < N; i++)
        if (i % num_threads == tid)
            X[i] = do_work(X[i]);
    return NULL;
}
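A padded layout might look like the sketch below. It assumes 64-byte cache lines (common, but not universal), gives each thread a single slot, and uses a hypothetical stand-in for do_work; the helper names are also made up for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64        /* assumed line size */
enum { NUM_SLOTS = 4 };      /* hypothetical: one slot per thread */

/* The alignment requirement pads each slot out to a full cache line,
   so writes by different threads never land in the same line. */
typedef struct {
    alignas(CACHE_LINE) int value;
} padded_int;

static_assert(sizeof(padded_int) == CACHE_LINE, "one slot per cache line");

static padded_int X[NUM_SLOTS];

/* Hypothetical stand-in for the post's do_work(). */
static int do_work(int v) { return v + 1; }

/* Each thread would call this with its own tid and touch only its slot. */
void thread_work(int tid)
{
    X[tid].value = do_work(X[tid].value);
}

/* Helpers for inspecting the layout and contents. */
int slot_value(int tid) { return X[tid].value; }

size_t slot_offset(int tid)
{
    return (size_t)((char *)&X[tid] - (char *)&X[0]);
}
```

The cost is space: each int now occupies 64 bytes, which is the trade the padding recommendation makes.
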

This second example goes beyond the paper's scope for false sharing. Common data structures may also have different sharing patterns for each element. For example, in this data structure, the following fields are written to: encoding, sum, weight_left, and weight_right. The rest are read-only. Currently the data structure uses two cache lines (as all fields are 8 bytes in size). If the structure were rearranged so that the written fields were in one cache line and the read-only fields in the second line, then updates by any thread would only result in one cache miss rather than two. Padding may be required, but the key insight here is arranging data by sharing pattern, which is a generalization of the previous paragraph.

typedef struct _node {
    graph_value_t value, encoding;
    unsigned long long sum;
    struct _edge* fwd;
    struct _edge* back;

    // tree sorted by value
    struct _node* left;
    struct _node* right;

    // tree sorted by encoding
    struct _node* weight_left;
    struct _node* weight_right;
} node, *pnode;
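One possible rearrangement is sketched below. It assumes 64-byte cache lines and that graph_value_t is an 8-byte integer (the post only states that every field is 8 bytes); the pad field and the alignment request are additions for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64                      /* assumed line size */

typedef unsigned long long graph_value_t;  /* assumption: 8-byte integer */
struct _edge;                              /* defined elsewhere */

typedef struct _node {
    /* Written fields share the first cache line. */
    alignas(CACHE_LINE) graph_value_t encoding;
    unsigned long long sum;
    struct _node* weight_left;
    struct _node* weight_right;

    /* Padding pushes the read-only fields onto the next line. */
    char pad[CACHE_LINE - sizeof(graph_value_t) - sizeof(unsigned long long)
             - 2 * sizeof(struct _node*)];

    /* Read-only fields share the second cache line. */
    graph_value_t value;
    struct _edge* fwd;
    struct _edge* back;
    struct _node* left;
    struct _node* right;
} node, *pnode;

static_assert(offsetof(node, weight_right) + sizeof(pnode) <= CACHE_LINE,
              "written fields fit in the first line");
static_assert(offsetof(node, value) == CACHE_LINE,
              "read-only fields start on the second line");
```

The alignment request only guarantees line-aligned nodes for static and automatic storage; heap allocations would also need an aligned allocator.
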

The final pattern explored in the paper is data alignment. Ignoring the issue of misaligned accesses, let's look at misaligned allocations. Suppose we allocate an array of 48-byte data structures in a multithreaded program. Sometimes accessing an element is one cache miss, but sometimes it is two. The runtime system has packed the data structures together, with 4 fitting in 3 cache lines. In general, individually allocated data structures come with the same alignment as in the array, typically a 16-byte boundary, but this boundary is not guaranteed to be the start of a cache line. The primary solution is to use support calls that change the allocation alignment. This may waste space, but now the allocation uses our expected number of cache lines. And by using the lines we expect, we can tailor the program to the architecture and observe the expected performance characteristics.
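As a rough sketch of such a support call, the C11 function aligned_alloc can request cache-line alignment. The element type, the helper names, and the 64-byte line size are assumptions for illustration; other platforms offer their own equivalents.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* assumed line size */

/* Hypothetical element: 48 bytes of payload, padded by the alignment
   requirement to exactly one cache line. */
typedef struct {
    alignas(CACHE_LINE) double field[6];
} elem;

static_assert(sizeof(elem) == CACHE_LINE, "each element occupies one line");

/* True when p is the first byte of a cache line. */
int starts_on_line(const void *p)
{
    return (uintptr_t)p % CACHE_LINE == 0;
}

/* Allocate count elements, each starting on its own cache line.
   The caller releases the array with free(). */
elem *alloc_elems(size_t count)
{
    elem *p = aligned_alloc(CACHE_LINE, count * sizeof(elem));
    assert(p == NULL || starts_on_line(p));
    return p;
}
```

Each element wastes 16 bytes, but accessing one now touches exactly one line.
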

The patterns are three simple ones that architects and performance-minded programmers have known for years. I am pleased to see them being reiterated, but the response may be like that from the developer after I changed his code per these patterns years ago, "Why can't the compiler just do that for me?!"

Friday, March 4, 2011

Parallel Programming - First Pass

With my time dominated by preparing to take PhD qualifying exams (i.e., quals), I have been even more slack than usual with regard to preparing regular posts. Nonetheless, let's talk a little about parallel programming. In one respect, the parallel paradigm is the future of computer science, even if I remain highly skeptical about what the specifics of this computing will be. But just because its usage in general computing may be occluded, the specific usefulness of parallel computing is not in doubt. This post will serve as an overview of several concepts in parallel programming.

First, let us distinguish between concurrent and parallel execution. Concurrent execution has the possibility or potential for executing simultaneously. Parallel execution is when this potential is realized. Concurrent execution is possible with a single core; however, parallel execution is not.

Synchronization is the main question when writing concurrent code. Synchronization introduces a specific ordering to what was otherwise independent execution. There are two common flavors: exclusion and notification. Exclusion consists of mutexes, spinlocks, and other constructs that guarantee that only a single instance of concurrent execution performs a specific set of operations at a time. With notification, concurrent executions establish information with respect to each other, for example that every instance has reached a specific point (e.g., a barrier).
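Exclusion can be sketched with a POSIX mutex, as below. The account structure and function names are hypothetical; the point is only that the two updates between lock and unlock are performed by one thread at a time.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical shared state: both fields must change together. */
typedef struct {
    pthread_mutex_t lock;
    long balance;
    long transactions;
} account;

static account shared = { PTHREAD_MUTEX_INITIALIZER, 0, 0 };

/* Exclusion: only one thread at a time runs the two updates. */
void deposit(long amount)
{
    assert(amount > 0);
    pthread_mutex_lock(&shared.lock);
    shared.balance += amount;
    shared.transactions += 1;
    pthread_mutex_unlock(&shared.lock);
}

/* Readers take the same lock so they never see a half-finished update. */
long read_balance(void)
{
    pthread_mutex_lock(&shared.lock);
    long b = shared.balance;
    pthread_mutex_unlock(&shared.lock);
    return b;
}

long read_transactions(void)
{
    pthread_mutex_lock(&shared.lock);
    long t = shared.transactions;
    pthread_mutex_unlock(&shared.lock);
    return t;
}
```

Notification would use a different construct, such as a barrier or condition variable, and is not shown here.
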

An ongoing quest in synchronization research is transactional memory (TM). TM provides the ability to make a set of memory updates atomically. Processors provide the ability to make simple updates atomic (see Compiler Intrinsics), yet a series of updates requires the explicit exclusion guarantee provided by spinlocks, etc. TM brings the exclusion to the memory address itself, rather than the abstract object protected by the spinlock, and allows an arbitrary set of accesses to be encapsulated in the atomic operation. However, TM is not presently feasible.
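The "simple update" case can be sketched with C11 atomics, shown here in place of compiler-specific intrinsics. The counter and function names are hypothetical. A single counter can be updated atomically like this, while updating several related fields together would still need a lock.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical shared counter; static storage starts at zero. */
static atomic_long hits;

/* One atomic read-modify-write; returns the value after the increment. */
long record_hit(void)
{
    return atomic_fetch_add(&hits, 1) + 1;
}

long current_hits(void)
{
    return atomic_load(&hits);
}

/* Single-threaded sanity check: one call raises the count by one. */
void check_hits(void)
{
    long before = current_hits();
    record_hit();
    assert(current_hits() == before + 1);
}
```

No lock is taken here; the hardware makes the single increment indivisible.
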

Parallel patterns are formed based on the observation that parallel programs and algorithms can be classified into several distinct groups (i.e., patterns). An assembly line is a parallel operation and fits the "pipelined" pattern. When the programmer recognizes the pattern, certain common errors can be avoided. With the pipeline, the programmer recognizes that the data is to be passed through discrete stages.

Well, that's my prelude to what will likely be many more posts on parallel programming.