Elegant C: performance

Showing posts with label performance. Show all posts

Wednesday, May 2, 2018

Performance of Atomic Operations on NUMA Systems

It is the end of the semester, so time for posters about student projects. I visited two sessions so far with three more to go. I specifically wanted to highlight the results from one poster.

The pair of students wrote a microbenchmark around compare-and-swap, where the value is read, a local update is computed and then compare-and-swap attempts to place the new value into memory iff the old value is present, otherwise fail and retry. Running the code in tight loop with a thread per hardware context, there is clearly going to be significant contention. In this scenario, they had two observations from the results:

If the requesting thread is located on the same node as the memory, it will almost always fail. Implying that accessing NUMA local memory takes a different path than NUMA remote, thereby exhibiting worse performance on contended atomic operations.
The Intel processors had a higher success rate as neighboring threads were more likely to pass along access between each other. The AMD system did not exhibit this behavior.

Caveats: The precise NUMA topology was not known. And the AMD processors were several generations older than the Intel processors.

Monday, July 10, 2017

Wrote my own Calling Convention

I have been listening to the Dungeon Hacks audiobook, and it has both reminded of my past joys of playing Angband, as well as some interesting little "hacks" I did in high school.

In high school, I wrote many programs on my TI-83, often to help me in other classes, which lead to an interesting conversation:
Student: "Teacher, Brian has written programs on his calculator for this class."
Teacher: calls me forward "Brian, is this true?"
Me: "Yes."
Teacher: "Did you write them yourself?"
Me: "Yes."
Teacher: "I do not see what the problem is."

And besides writing useful programs, I also wrote games. All in the TI-83's BASIC. However, the TI-83 only had 24kB of space for programs and data. And eventually I started exceeding this limit. So I started finding interesting hacks that would reduce the space of programs, such as the trailing " of a string is not required if that string ends the line. Each would save 1 byte, but it started to add up.

Now, the TI-BASIC on the 83 was very primitive. Particularly it had no GOSUB instruction, only GOTO. So you cannot write functions / subroutines and reuse code, which would both improve the design and reduce the space required. Now I already knew the basics (no pun intended) of C, so I knew that code should be able to call other functions and get values back. But TI-BASIC would let you call another program and then return to the calling program after the called one finished. That's like a function call, right?

Therefore, I wrote a library program. Variables were global, so the calling program could set specific parameters in the global variables, call the library program. The library would use one parameter to determine which functionality to execute, update the necessary globals and then exit, thus returning. And consequently, I had a library of common functions which sped up my development and reduced the space I needed.

And so it was only in listening to the audio book last week, did I realize that long ago I had developed a simple calling convention in high school. And now I teach, among other things, the x86-64 Linux ABI (i.e. calling convention) to college students.

A calling convention just dictates how each register is used when calling a function. Which ones need to be preserved, arguments, return value, etc. And it also dictates the management of the stack.

Wednesday, May 3, 2017

PhD Defense - Meeting Tail Latency SLOs in Shared Networked Storage

Today I went to Timothy Zhu's PhD thesis defense. His work is on achieving better sharing of data center resources to improve performance, and particularly to reduce tail latency. He also TA'd for me last fall.

Workloads are generally bursty, and their characteristics are different. Furthermore, they may have service level objectives (SLOs), and the system needs to meet these different objectives. And the system contains a variety of resources that must be shared in some form. It is not sufficient to just divide the bandwidth. Nor can the system measure the latency and try reacting, particularly as bursty workloads do not give sufficient time to react. While each workload has deadlines, it would be too complex to tag request packets with the deadlines for queuing and routing. However, the deadlines can be used to generate priorities for requests.

The system is architected to have storage and network enforcement components to ensure QoS. There is also a controller that receives an initial trace to characterize each workload, and that workload's SLOs. The controller works through a sequence of analyses to successfully place each workload into the overall system.

Effectively, each workload is assigned a "bucket" of tokens, where the bucket size provides the ability to handle bursts and the rate that tokens are added covers the request rate for the workload. Shorter burstier workloads receive large buckets and low rates, while constant workloads with little bursts have high rates and small buckets. In both cases, only when the bucket is empty, is the workload rate-limited in its requests, and these requests receive the lowest priority. Deterministic Network Calculus (DNC) to model the worst-case queue scenarios. This plots two curves: the requesting flow and the service curve, both plotted as tokens by function of window size (dt). The maximum distance between the curves is the maximum latency.

Using three traces: DisplayAds, MSN, and LiveMaps, they tested three approaches: Cake (reactive approach), earliest deadline first, and Timothy's scheme (PriorityMeister). His scheme did significantly better than the others at meeting the SLOs. However, the DNC analysis was based on achieving 100% and not the SLO's 99% (or other percentile success). Depending on the characteristics, there can be significant differences between these guarantees. To model the latency percentiles, Stochastic Network Calculus (SNC) can achieve this; however, the math is significantly more complex. And the math had not previously been applied to this problem. DNC is still better when assuming that bursts are correlated or the system is in an adversarial setting. Reducing these assumptions (uncorrelated workloads), the SNC-based analysis permitted the system to admit 3x workloads versus the DNC analysis.

Workloads have a curve of satisfying bucket sizes and token rate pairs. Many systems require the user to provide its rate limit. Other systems use simple heuristics to either find the "knee of the curve" or select a rate limit as a multiple of the average rate. However, for an individual workload, all pairs are satisfying, it is only when workloads are combined in a system do the different pairs matter. The configurations of the set of workloads on the system can be solved for using a system of linear equations. Therefore, when placing new workloads, the controlling architecture can find successful placements, while potentially reconfiguring the workloads assigned.

One extension would be addressing failure modes. Currently, the system is assumed to be at degraded performance when components have failed.

Monday, March 13, 2017

Book Review: Optimized C++: Proven Techniques for Heightened Performance

I have spent significant time on performance issues and have been in search of a book that can summarize the diversity of issues and techniques well. I hoped that Optimized C++: Proven Techniques for Heightened Performance would provide some of the guidance I want and
This book is not quite it. There is good material here, yet I found repeatedly thinking that the author was not aware of the past 10(?) years of changes to the field. Not an issue of the book was from the early 2000s, but it was published last year.

A key step in improving the performance of programs is measuring it. There are a variety of techniques for doing so. Tools based on instrumentation and tools based on sampling profiling. I find greater value to using the sampling profiling tools (for measuring performance) due to their lower overhead and ability to pinpoint where in a function this cost exists. Yet the book's focus is limited to gprof-esque approaches. I tell students that this approach is best with deep call trees, which may be a greater issue for C++ programming specifically.

The author is somewhat dismissive to compiler optimizations and emphasizes that his observed benefit has been particularly limited to function inlining. There are many more optimizations, and you should care about them. But again, I wonder if his experience of C++ has been deep call trees that could particularly benefit from inlining.

In a take it or leave it, this work also discourages the use of dynamic libraries. Yes, they impose a performance penalty, but they also provide valuable functionality. It all depends on your use case for whether you should statically or dynamically link your code. Code that is reused by separate executables should be in a dynamic library, as it reduces the memory requirements when running and reduces the effort to patch and update those executables. Components that are only used by a single executable should be statically linked, unless the components are of significant size such that decoupling can still benefit memory usage and the updating process.

The author related that replacing printf with puts to just print a string has performance advantages, due to printf being a complicated "God function". The basic point is valid that printf has significant functionality; however, the anecdote should be taken with a grain of salt. Current compilers will do this optimization (replace printf with puts) automatically.

While most of the work provides small examples, the final chapters on concurrency (?) and memory management do not. The concurrency chapter reads as a reference book, as it lists the various APIs available and what each does. It would be better for the book to assume that the readers are familiar with these calls (as the author does with many other topics) and discuss possible optimizations within this scope.

To conclude, the book is not bad, but I also cannot say it is accurate on every point. Especially with performance, programmers are apt to make prompt design decisions based on "their experience" or "recent publications". Measure your code's performance. Only then can you discern which techniques will provide value.

Thursday, December 8, 2016

Practical TSX

I previously speculated about how Intel's TSX is implemented; however, I did not have access to any machines supporting TSX until this year. I still have not done much testing personally, but I did direct two students who explored and measured this support. As a reminder, TSX is an implementation of hardware transactional memory, which can simplify concurrency designs by avoiding the need for locks. Being a hardware implementation, it has certain fixed costs and constraints.

Mario Dehesa-Azuara and Nick Stanley completed a 6 week project in the spring and the summary below is taken from their final project report. Also, being students, their report may not be available indefinitely, so this link may be broken at some future date.
First, they reviewed the best practices for writing TSX-enabled code. Particularly, there is a problem in that the TSX path and the fallback path (the fallback path is required to ensure that the code can make progress even with aborts) must be mutually exclusive. This can require putting in additional operations versus a pure, transactional approach.

Second, they measured the cost of concurrent data structure updates. Their work noted that writing a transactional implementation was significantly easier than a fine-grained locking approach. However, their measurements revealed some counter-intuitive results. For example, an AVL tree is a self-balancing data structure. The self-balanced nature is a benefit in that fewer memory accesses should be required. Yet, the rotations required to maintain this condition actually increased the set of locations accessed and therefore resulted in a high rate of aborts.

To understand this, we must turn to the actual implementation. We know that TSX can only track a limited number of memory locations (at most, the size of the L1 data cache). As soon as any transactional memory location (i.e., cache line) cannot be stored in the L1, the transaction must abort. Thus limiting the size of the read and write sets of the transaction are vital for completing the transactions. In Mario's and Nick's experiments, they observed that after 5 million insertions into an AVL tree, transactions were at a 50% failure rate (regardless of the tested number of threads). In contrast, a treap with its probabilistic balancing, has relatively constant failure rates that depend on the number of threads (and not the total insertions).

Third, using TSX has an inherent cost that is significantly higher than other atomic operations. It still remains the advice that simple atomic updates should utilize the appropriate instructions. What if you need to perform several of these operations? Again, we turn to the final report. The measurements show that simple transactional operations on consecutive memory locations will be faster than the equivalent atomic operations on those locations when you access at least 6 locations per "transaction". Nonetheless, if the program must obey another constraint, such as update all or none of the elements, then locks or transactions would be required.

It is important to remember that a major benefit of transactional memory is in design and implementation effort, not in the actual runtime of the program.

Wednesday, January 13, 2016

Repost: Avoid Panicking about Performance

In a recent post, another blogger related how a simple attempt to improve performance nearly spiraled out of control. The lesson is that always measure and understand your performance problem before attempting any solution. Now, the very scope of your measurements and understanding can vary depending on the complexity of your solution. And when your "optimizations" have caused the system to go sideways, it is time to take a careful appraisal of whether to revert or continue. I have done both, and more often have I wished that I reverted rather than continued. Afterall, it is better for the code to work slowly rather than not work.

Again, always measure before cutting.

Wednesday, December 16, 2015

PhD Defense - Diagnosing performance limitations in HPC applications

Kenneth Czechowski defended his dissertation work this week.

He is trying to develop a science to the normal art of diagnosing low-level performance issues, such as processing a sorted array and i7 loop performance anomaly. I have much practice with this art, but I would really appreciate having more formalism to these efforts.

One effort is to try identifying the cause of performance issues using the hardware performance counters. These counters are not well documented and so the tools are low-level. Instead, develop a meta tool to intelligently iterate over the counters thereby conducting a hierarchical event-based analysis, starts with 6 performance counters and then iterates on more detailed counters that relate to the performance issue. Trying to diagnose why the core is unable to retire the full bandwidth of 4 micro-ops per cycle.

Even if a tool can provide measurements of specific counters that indicate "bad" behavior, the next problem is that observing certain "bad" behaviors, such as bank conflicts, do not always correlate to performance loss, as the operation must impact the critical path.

The final approach is to take the program and build artificial versions of the hot code, such as removing the memory or compute operations from the loop body. For some applications, several loops account for most of the time. Then the loops can be perturbed in different ways that force certain resources to be exercised further. For example, the registers in each instruction are scrambled so that the dependency graph is changed to either increase or decrease the ILP while the instruction types themselves are unchanged.

Monday, April 13, 2015

Book Review: Structured Parallel Programming

At SIGCSE this year, I spoke with several publishers about possible textbooks. Primarily, for ones that would work in the different classes that I might teach. As well as those that relate to my present research. In this case, I received a copy of Structured Parallel Programming: Patterns for Efficient Computation from the publisher to review and consider for possible future use in a course.

This work is in three parts: the basics of parallelism and performance, the common patterns in which parallelism is expressed, and example implementations of several algorithms. The second part is the core of the work. To show maps, reduce, scatter and gather, stencil, fork-join, and pipeline. But before we learned those details, we would come to key quotes for all that I do:

You cannot neglect the performance of serial code, hoping to make up the difference with parallelism.

And:

[The] performance of scalar processing is important; if it is slow it can end up dominating performance.

Therefore, parallelism is a method for improving the performance of already efficient code.

With both the common patterns, as well as the example implementations, the authors generally provide the source code for each pattern and implementation using Cilk, TBB, and OpenMP. This source is not for casual readers. More involved implementations can stretch for several pages, as the initial implementation and then subsequent refinements are explored. While it serves well as a reference, it may have worked better to focus on one parallelism approach for each section and therefore give further explanation to the code, especially the language features used. And thereby retain the pattern itself rather than becoming a practitioners' reference.

The example implementations (the third part) are perhaps the least interesting for the classroom and potentially the most interesting for practitioners. Clearly, if I was trying to write code similar to one of these problems, I would have an excellent reference and starting point. However, that is quite rarely the case for myself and I suspect most people as well.

If I was teach a parallel programming course, I might consider using this work (although I still have other, similar textbooks to review); however, were I to do so I would be confining my teaching to the first two parts and may even to just 1 parallel programming paradigm. Yes, I will admit that the last parallel programming course I took covered a diversity of paradigms (Cilk, vectorization, GPUs, OpenMP, MPI), yet I would have preferred to focus more on what one or two paradigms are capable of rather than just the taste of many. Parallel programming takes a lot of work to learn and this book is one piece in that effort.

Monday, February 23, 2015

Compilers and Optimizations, should you care?

Compiler optimizations matter. One time in helping a fellow Ph.D. student improve a simulator's performance, I did two things: first, I replaced an expensive data structure with a more efficient one. Second, I turned on compiler optimizations. Together, the simulator ran 100x faster.

A question posted in the stackexchange system asked, "Why are there so few C compilers?" The main answer pointed out that any C compiler needs to be optimizing. Lots of optimizations are occurring on every compilation, and each one gaining tiniest increments in performance. While I enjoy discussing them in detail, I generally wave my hands and tell of how they are good, yet make debugging difficult. These optimizations are lumped together as the "optimization level".

In "What Every Programmer Should Know About Compiler Optimizations", we revisit optimizations. First, the compiler is no panacea and cannot correct for inefficient algorithms or poor data structure choices (although I am party to research on the later). The article then suggests four points to assist the compiler in its efforts at optimizing the code.

"Write understandable, maintainable code." Please do this! Usually, the expensive resource is the programmer. So the first optimization step is to improve the programmer's efficiency with the source code. Remember Review: Performance Anti-patterns and do not start optimizing the code until you know what is slow.

"Use compiler directives." Scary. Excepting the inline directive, I have used these less than a half dozen times in almost as many years of performance work. Furthermore, the example of changing the calling convention is less relevant in 64-bit space where most conventions have been made irrelevant.

"Use compiler-intrinsic functions." (see Compiler Intrinsics the Secret Sauce) This can often dovetail with the first point by removing ugly bit twiddling code and putting in clean function calls.

"Use profile-guided optimization (PGO)." This optimization is based on the dynamic behavior of the program. Meaning that if you take a profile of the program doing X, and later the program does Y; executing Y can be slower. The key is picking good, representative examples of the program's execution.

So you have dialed up the optimization level, and written understandable code sprinkled with intrinsics. Now what? The next step (with which I agree) is to use link time optimizations (LTO) / Link-Time Code Generation (LTCG). This flag delays many optimizations until the entire program is available to be linked. One of the principles of software performance is that the more of the program available to be optimized, the better it can be optimized. (This principle also applies in computer architecture). Thus, by delaying many optimization until the entire program is available, the linker can find additional and better opportunities than were present in individual components.

The article notes, "The only reason not to use LTCG is when you want to distribute the resulting object and library files." And alas, I have fought several battles to overcome this point, as my work requires the use of LTO. Perhaps in the next decade, LTO will be standard.

Monday, July 14, 2014

Book Review: The Practice of Programming

(contains Amazon affiliate link)
I recently found, The Practice of Programming (Addison-Wesley Professional Computing Series), sitting on the shelf at my local library. I am generally skeptical when it comes to programming books, and particularly those from different decades, but I trusted the name "Brian Kernighan" so I checked the book out.

And I am so glad that I did. From the first chapter that discussed style, I wanted to read more. And the only reason to ever stop reading was to pull out a computer and put these things into practice. I didn't even mind that it wasn't until chapter 7 that performance was discussed. Still, I will readily acknowledge that I disagree with some of statements in the book. Furthermore, there are some parts of the text that are clearly dated, like discussing current C / C++ standards.

I'd like to conclude with a brief code snippet from the work. This code is part of a serializer / deserializer. Such routines are always a pain to write and particularly if you have many different classes / structs that need them. Thus the authors suggest using vargs and writing a single routine that can handle this for you. Here is the unpack (i.e., deserialize) routine:

/* unpack: unpack packed items from buf, return length */
int unpack(uchar *buf, char *fmt, ...)
{
    va_list args;
    char *p;
    uchar *bp, *pc;
    ushort *ps;
    ulong *pl;

    bp = buf;
    va_start(args, fmt);
    for (p = fmt; *p != '\0'; p++) {
        switch (*p) {
        case 'c': /* char */
            pc = va_arg(args, uchar*);
            *pc = *bp++;
            break;
         case 's': /* short */
             ps = va_arg(args, ushort*);
             *ps = *bp++ << 8;
             *ps |= *bp++;
             break;
         case 'l': /* long */
             pl = va_arg(args, ulong*);
             *pl = *bp++ << 24;
             *pl |= *bp++ << 16;
             *pl |= *bp++ << 8;
             *pl |= *bp++;
         default: /* illegal type character */
             va_end(args);
             return -1;
         }
     }
     va_end(args);
     return bp - buf;
}

So now we have a little language for describing the format of the data in the buffer. We invoke unpack with a string like "cscl" and pointers to store the char, short, char and long. Hah! That's it. Anytime we add new types, we just to call the pack / unpack.

Does it matter that the variables are only sequences like "pl" or "bp"? No. Variable names should be meaningful and consistent. "i" can be fine for a loop iterator.

We have given up some performance (*gasp*), but gained in the other parts that matter like readability and maintainability. I plan on using this in my current research (but only the unpack, as my serializers are already highly optimized). All in all, I approve of this book and may even someday require it as a textbook for students.

Monday, July 22, 2013

Review: Patterns for Cache Optimizations on Multi-processor Machines (ParaPLoP '10)

In order to optimize any code, know that the optimizations "depend on the exact architecture of the machine (processor, memory, etc), the exact version of the compiler, the exact version of the operating system and the particular configuration of the program that [you] are trying to optimize." At minimum, you should establish what your workload is and how its performance is measured, and then have some gauge of what its current performance is (including the distribution of measurements). All of which is covered in Patterns for cache optimizations on multi-processor machines before they are touch on any optimization. And I applaud them for doing so.

In this work, they explore three patterns of cache (mis-)use on modern processors. The first pattern is termed "loop interchange", named for its solution. In this pattern, the program does not access data with spatial locality. Instead of accessing every element in a cache line, the program has a different ordering and only touches a subset of the cache line, while later (after the line has been evicted) it accesses other subsets. In the example below, assume that N and M are both quite large (say 1+ million), so this code will likely have significant cache misses (at minimum L1 misses), while switching the "i" and "j" for loops (i.e. interchange) will considerably reduce the number of cache misses.

int X[N][M];
for (j = 0; j < M; j++)
for (i = 0; i < N; i ++)
f(X[i][j]); // Any set of operations applied to this element.

The next pattern is false sharing. Threads in a program intentionally and unintentionally share data. Data structures are written by programmers to logically group data; however, the grouping and structuring of the data is often made by the programmer and not for algorithmic need. The hardware is expecting locality from arrays and data structures. When multithreaded, the cache line is the unit by which the hardware tracks sharing of data. So if different threads write to different data in the same cache line, then hardware treats the writes as being made to the same thing, which precludes it from caching. The usual recommendation for solving this problem is to pad the data, so that the software notion (int) and hardware notion (cache line) are the same size.

int X[N];
void* thread_work(int tid)
{
for (int i = 0; i < N; i++)
    if (i % num_threads == tid)
    X[i] = do_work(X[i]);
}

This second example goes beyond the paper's scope for false sharing. Common data structures may also have different sharing patterns for each element. For example in this data structure, the following fields are written to: encoding, sum, weight_left, and weight_right. The rest are read-only. Currently the data structure uses two cache lines (as all fields are 8-bytes in size).   If the structure was rearranged so that the written fields were in one cache line and the read-only fields in the second line, then updates by any thread would only result in one cache miss rather than two. Padding may be required, but the key insight here is arranging data by sharing pattern, which is a generalization of the previous paragraph.

typedef struct _node {
    graph_value_t value, encoding;
    unsigned long long sum;
    struct _edge* fwd;
    struct _edge* back;

    // tree sorted by value
    struct _node* left;
    struct _node* right;

    // tree sorted by encoding
    struct _node* weight_left;
    struct _node* weight_right;
} node, *pnode;

The final pattern explored in the paper is data alignment. Ignoring the issue of misaligned accesses, let's look at misaligned allocations. Suppose we allocate an array of 48-byte data structures in a multithreaded program. Sometimes accessing an element is one cache miss, but sometimes it is two. The runtime system has packed the data structures together, with 4 fitting in 3 cache lines. In general, when you allocate data structures, they come with the same alignment as in the array, made to a 16-byte boundary, but this boundary is not guaranteed to be the start of a cache line. The primary solution is to use support calls that change the allocation alignment. This may waste space, but now the allocation comes using our expected number of cache lines. And by using the lines we expect, we can tailor the program to the architecture and observe the expected performance characteristics.

The patterns are three simple ones that architects and performance minded programmers have known for years. I am pleased to see them being reiterated, but the response may be like that from the developer after I changed his code per these patterns years ago, "Why can't the compiler just do that for me?!"

Friday, June 21, 2013

What is Time: Keeping Time in Software

In a prior post, Measuring Performance, I discussed some techniques for how a developer might measure the performance of his or her application. However, sometimes what you want is to measure an interval of time. We can generally agree to do this by taking two measurements: the time before and the time after. Then find the difference, which is the interval of time elapsed. Great! But how exactly are we going to measure time?

A common theme to the questions I answer on stackoverflow is proper measurement of time (see for example, Analysis of CPU caches, Function Pointers versus Inline Functions, or Windows system time with millisecond resolution). A programmer starts with the tools most readily available, usually a call to gettimeofday or GetTickCount (for example). These functions provide time in (roughly) millisecond resolution, like having a stopwatch. For its own timekeeping purposes, an OS sets up hardware to send an interrupt every ~16ms (i.e., 64Hz via the PIT). This interrupt primarily denotes a scheduling quantum, but for little additional overhead, the OS updates its clock. Some applications need finer-grained scheduling and increase the frequency of the interrupt (i.e., decrease the quantum), which increases overhead and power usage. Using these calls, the programmer can measure events that take at least 1 centisecond with reasonable accuracy.

But, you object, I want better accuracy! Let's turn to the HPET then. A hardware device that internally operates at 10+MHz (or 100ns ticks). Thanks to this device, the system has a microsecond time source. For this resolution, a programmer may use clock_gettime or QueryPerformanceCounter. And both come also with supplemental routines that indicate the resolution of the time returned. For example:

#include <stdio.h>
#include <time.h>

int main(int argc, char** argv)
{

    struct timespec n, o;
    clockid_t t = CLOCK_MONOTONIC;

    clock_getres(t, &n);
    printf("%d - %d\n",n.tv_sec, n.tv_nsec);
    clock_gettime(t, &n);
    clock_gettime(t, &o);
    printf("%d - %d\n",n.tv_sec, n.tv_nsec);
    printf("%d - %d\n",o.tv_sec, o.tv_nsec);

    return 0;
}

% ./t_test
0 - 999848
13390932 - 798720095
13390932 - 798720095

This time source has a reported resolution of approximately 1ms; however, sometimes the printed "n,o" pair actually shows a one microsecond difference, which suggests that the return from clock_getres could be smaller. Thus on my test machine, I could measure time by microseconds.

But, you object, I want better accuracy! So let's move on to the timestamp counter then. This counter is per-processor and runs at the processor's frequency (1+ GHz or <1ns per tick). clock_gettime and QueryPerformanceCounter may use this time source. So we can query it a nanosecond accurate time measurement. Done, or are we?

But, I object, it takes time to do this measurement; perhaps a very small amount of time, but time nonetheless. If we switch the code above to use CLOCK_PROCESS_CPUTIME_ID, the measurements show a 400 - 500ns difference between n and o, even though the reported "resolution" is 1ns. Did the function lie about its resolution? No. Imagine calling NIST or CERN to get the time from their clock. Their clock is accurate to better than 1ns, but it takes time to call them. In the code example, each call is taking about 400ns to make the function call(s), query the timestamp and then convert into the appropriate format. Note that this overhead was also present in the earlier experiments, which is why the earlier output occasionally differed by 1 microsecond, as the 400ns of overhead caused the measurements to fall in different microsecond ticks.

But, you object, I really need to know how many nanoseconds this something takes. There are two approaches to remedy this: first, we could write some code does the above repeatedly to determine the overhead and then subtract it from the time difference, or find a different time source.

These function calls are querying the timestamp counter, but architects have exposed the counter to programmers for years, so why make the call? In x86, it is the instruction RDTSC or RDTSCP, which provides the 64-bit timestamp value. A single instruction is about as small of overhead as we're going to get (although, we can also look at register pressure, cycles required, etc).

Hah! You exclaim, now I can know which instruction sequence is fastest, because I can measure CPU cycles. But at this timescale, your program's measurement has left the "classical" architecture model of executing one instruction at a time and is drifting into the reality of many 10s of instructions executing simultaneously. When you toss in your RDTSC, there is no guarantee where in this mix of instructions it will execute and complete (with RDTSCP, you can guarantee that all previous instructions have completed). So it is recommended that even if you are measuring a "small instruction sequence" like some of the cited stackoverflow questions, that you perform thousands of operations and measure the time of this aggregate. Of course, architectural behaviors like branch prediction, cache misses, prefetchers, etc will change over the course of the experiment.

As if this wasn't bad enough, the timestamp counter (including calls to CLOCK_PROCESS_CPUTIME_ID) is specific to each processor. The counter is not global system time, so taking measurements on different processors (say different threads or a thread that migrated) will not necessarily give meaningful results. Many modern OSes try to set the counters so that the measurements are within a small margin of each other. In practice, I have observed deltas from low 100s to 10s of millions, so be warned.

Even worse, older processors really did count CPU cycles as the timestamp, so if the processor slowed down, then a "cycle" meant more nanoseconds. Modern chips have invariant TSC, per Intel architecture manual:

The invariant TSC will run at a constant rate in all ACPI P-, C-. and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource.

As bonus, all of the clocks can be affected by thermal drift, as their frequency is related to their temperature. Others have investigated this behavior in greater detail, and while it can be accounted for, in practice I've happily ignored it (although I should note in future writing that I have not controlled for this possible measurement bias).

In conclusion, I use RDTSC for most of my measurements and try to control for the clock difference that exists between processors. In my current PhD research, I am still working to try to control and properly account for this difference so that I can measure the behavior that I am interested in. So be warned and be cautious, and be as precise as you need to be.

Wednesday, June 5, 2013

Atomics versus Locks

I've written in the past about using intrinsics to improve performance. And I've discussed using this functionality to build your own synchronization. I'm not alone in this knowledge, and I want to further expand the answer to a stackoverflow question. But sometimes a little knowledge is a dangerous thing. The trouble came from a paper that looked at, among other things, the performance of incrementing a value using locks versus using interlocked compare-and-swap for two threads.

Alright, compare-and-swap (CAS) should be faster, as one can approximate lock acquire and release as this operation. So let's test and see how the results turn out. First, the base code:

unsigned long long counter = 0;

int main(int argc, char** argv)
{
    unsigned long long start, end, o;

    start = rdtsc();
    do {
        counter++;
    } while (counter < MAX);
    end = rdtsc();

    printf("%lld\n", (end - start));
    return 0;
}

This version is the "Single Thread" result. But the first variation is to switch counter to be volatile, as this variable needs this property when we move to multithreaded. By doing so, the compiler has to read, modify, then write the result to memory on every loop iteration. Overhead, versus keeping the value in a register, like the assembly that follows:

loop: add    $0x1,%rax
      cmp    $0x1dcd64ff,%rax
      jbe    loop

And the volatile code adds: mov    0x1429(%rip),%rax and a corresponding write after. Already this is enough to measurably impact performance. Let's move forward and switch to performing our increment using compare-and-swap.

    do {
        unsigned long long new, old = counter;

        if (old < MAX)
        {
            new = old + 1;
            __sync_val_compare_and_swap(&counter, old, new);
        }
        else
            i = 0;

    } while (i == 1);

Now, we only want to increment when we haven't reached max. If we have, then terminate. But the assembly is now unpleasant:

loop: mov    0x144b(%rip),%rax        # 401a90 <counter>
      cmp    $0x1dcd64ff,%rax
      jbe    swap
...
swap: lea    0x1(%rax),%rcx
      lock cmpxchg %rcx,0x140b(%rip)        # 401a90 <counter>
      jmp    loop

First, the optimized code has split this into two pieces. The check and the increment, and these have been split. A fun exercise would be to convince the compiler to switch the jbe into ja and combine the pieces back together. But the fact is, the impact of using the atomic operation looks like 6x.

As stated before, a lock is approximately twice the cost of the atomic operation and that result is roughly observed (along with some further overhead). Each version is then modified to spawn a second thread that does the same work as the first. All of my results are cycles per increment and follow (taken from a single run, reproducibility not guaranteed):

TEST - -O3 - -O0

Single Thread - 1 cyc / inc - 7 cyc / inc

Single Thread volatile - 6 cyc / inc - 16 cyc / inc

Single Thread w/ CAS - 38 cyc / inc - 46 cyc / inc

Single Thread w/ lock - 90 cyc / inc - 92 cyc / inc

Two Threads w/ CAS - 250 cyc / inc - 251 cyc / inc

Two Threads w/ lock - 730 cyc / inc - 630 cyc / inc

What can we conclude? First, adding -O3 (optimized) had almost no effect on results when the synchronization mechanisms are introduced. Yes, unoptimized code is bad, but it doesn't matter when all of your work is synchronization and the effects that it imposes on the great processor performance tricks.

Second, but why did unoptimized lock work better? Because there was a different thread interleaving. Perhaps the OS put one of threads asleep and the other thread ran happily for a while. The point is that if you are synchronizing, then that is your overhead.

Third, and here is the dangerous knowledge, you might conclude that you want to use CAS instead of the lock. Yet, this test was highly artificial. The synchronized operation took just a couple of cycles optimized, so we could observe the overhead of synchronization. Suppose your work is now 1250 cycles, now using CAS is only roughly a 33% improvement (1250 + 250 versus 1250 + 730) and not a 3x (250 versus 730). Furthermore, how often is your synchronization a single variable?

To conclude, IF what you want to synchronize is a single variable, then using CAS can work. Have I done this? Yes. Do I normally? No. I have this tool in my programming toolbox, for those times that I really need it.

Anyway, why not use atomic increment instead? Because we want to ensure that the final value is exactly MAX, no more and no less.

Wednesday, April 3, 2013

Architecture Abstraction Leaks

As a computer architect, I like to think that the hardware abstraction is generally opaque. If a programmer really needs performance or certain effects, then it is possible to reach through the layer, but in general the computer gives you what you want. Sometimes, a programmer can do simple things that pull the covers back and reveal the architecture in all of its gritty detail. One example would be following row versus column indexing when looping over large pieces of data.

In this post, I wanted to draw your attention to a question posed on stackoverflow, Why is processing a sorted array faster than an unsorted array? In this question, we are ignoring the time to sort the array. The operation being applied to each array element is relatively simple and is conditional on the value of the element. By sorting the array, the same operations could be applied to contiguous elements. And as the same operations are being applied, the code in question was having a clear win from branch prediction.

But the branch predictor is abstracted hardware, so once again the programmer just learned about the architecture when he / she didn't intend to.

Monday, March 4, 2013

Maps and Red-Black Trees

When cooperating on a prior work of research, Brainy: effective selection of data structures, I learned about the cost of selecting particular data structures. I expect that every Computer Scientist knows the general cost of using standard data structures like trees and linked lists. Now, the extension that Brainy provided was to recognize that a "tree" could have different implementations and these implementations can have different costs for a given workload. I, and I expect others, learned about AVL, splay, red-black and other tree types.

All this is good until your type hides its implementation and you reason about it in abstract. A map is commonly misinterpreted by its implementation. A map is a key-value store, where a "key" has an associated "value". An address book can be understood as a map of name to address / phone number / etc.

Now, the map is assumed to be implemented as a hash table, to give O(1) look-up cost. However in the C++ standard library, map uses a red-black tree for O(log n). Before you go and request that the libraries be changed, understand that experimental measurements suggest that the red-black implementation wins out when n > 500, as the hash collisions dominate the idealized O(1) cost. This is the problem that Brainy attempts to solve: the programmer selects the functionality, e.g., map and Brainy selects the best implementation given your program and architecture.

So go ahead and use a map, but recognize that you may not have O(1) cost and definitely won't as n grows increasingly large.

Thursday, May 31, 2012

Know your "N"

I once was working on some performance analysis and came to the developers asking how their code was consuming a disproportionate amount of time. At low load, the code was say 5% of non-idle CPU usage, but at high load, the code was now 50%. The developers noted that the code contained an algorithm that was O(n^2), but they had only run n=1,2,...10. My high load was n=1000, which exposed the scaling issues of their code. The developers had done everything right, except knowing the scope of the problem (and to be fair, their code was originally for desktops and I was testing servers).

Another time a developer came and discussed memory limitations, due to using 64-bit counters. I asked about how large the input was and learned it was under a billion elements. Well, a billion fits into 32-bit counters and now the memory consumption is cut by half. (Maybe I need more stories, as this was also in Bit Packing.)

The first step to any good performance analysis is selecting an algorithm, based on the problem (including its size). Bubble sort an array of 10 elements, but not 10,000. In fact, ask whether the problem really requires sorting at all.

I will continue to strongly urge developers to measure the performance of their code, before trying to tune their code. But part of this measurement is knowing the "size" of the inputs.

Friday, December 23, 2011

Measuring Performance

Starting with a primer of Understanding processing time, let's delve into some methods for how we might measure the performance of software. Along the way, you might gain a better understanding into how the operating system (OS) tracks time. So let's take different methods in order from least overhead to greatest.

First, the wall clock. I stare at my watch and see the second-hand move. With each tick of the hand, another second has passed. Eventually the program completes, and I can record how many ticks have elapsed. (Yes, you can use time or another program to do this). I am constrained in knowing the time by my watch, and the computer is no different; it has an internal clock that ticks only so fast (see my comment on stackoverflow). We now know how long the program takes to run, a measure of performance, but not what it was doing (perhaps it sleeps for 20min then prints "complete!").

The operating system gives some metric of every program's CPU usage, viewable with top, taskmgr, etc. On every scheduling interval, the operating system receives an interrupt and must decide whether to multitask and switch a different thread into execution. Also, the OS will record that the past quantum of time was spent doing whatever was just interrupted. So we view the percentages of one second, where this chunk of 16ms (the default Windows quantum) was spent executing Word, that chunk was spent by the kernel, and all those were idle. There are still inaccuracies, but it is low overhead. As a note, the OS may actually record cycles elapsed on every task switch and thus compute a higher accuracy percentage. (Of course, power management can play havoc with elapsed cycles and elapsed wall clock time).

So we now know that our application was using the CPU during its entire execution, but what was it doing?! Often the next step is to use a profiler. A profiler works just like the OS's CPU usage in the previous paragraph, except it does so on an application granularity (although some profilers like xperf can do the whole system). Now on each interrupt (which also may be more frequent, like 1ms with corresponding overhead), the profiler records the instruction pointer (IP) of the running process and again ascribes the quantum of time to the function containing that IP. Optionally, the profiler might record a call stack on the interrupt, so calling functions can be identified, which is particularly useful when shared routines are invoked (like stl container classes). Some profilers will break a function's usage down by the IP, allowing the analysis to understand the specific assembly instructions that are being executed frequently.

But the profiler is still just relying on probability and sampling to give an estimate for the CPU usage of the program. If we increase the overhead significantly more, we can know precisely what the program is executing. I've used systems where the compiler can put callbacks around every function call. Each callback logs the function being invoked and the current time (in cycles). Now, one has a complete function trace of the running system, along with the cycles required for each segment of code. Only one problem, the overhead is at least 100% (so performance is halved or worse).

Tangential to this entire discussion is the fact that modern processors also provide hardware performance counters, so that the analysis can see cache misses, branch predictions and other metrics of how efficiently the code is executing on the processor. And I have not discussed IO, although my general rule is asynchronous and nonexistent.

Again, always measure performance before cutting (i.e., optimizing).

Friday, November 11, 2011

Memory Mapped File IO

How often do you have to write a loop to read a file? Do you issue fscanf, fread, or perhaps create a stream via some object-oriented language?

Operating systems also provide support for easily reading in the entire file. A couple of function calls concluding with mmap() or MapViewOfFileEx() provides a char* pointer to the file's data. And with that, the file has been read into the program. So let's take a few moments to understand what the operating system is doing and use that to identify the advantages (and disadvantages) that memory mapping file IO provides.

When a program opens a file and performs IO via calls like fscanf, the program is using buffered IO. The operating system opens the file and sends IO operations based on the disk's granularity, which is at least 512B and often more. On a call like, fscanf(fp, "%d", &value), the program is requesting somewhere between 1 and 16 bytes, which is far less than the disk's block size. The operating system will read in the entire disk block and then copy the appropriate bytes to the program, while retaining the remained in its file cache. (By the way, the standard libraries also perform buffering, where the previous fscanf call might request 128 bytes from the OS and then process this data before returning to the application itself). Now, in making this call, how many copies of the data now exist in the system? There is one on disk, one in the OS's file cache, one in the standard library, and one in the application.

If the application requested the file via mmap, nothing happens except the OS reserves a range of the application's virtual address space for the file's data. Then when the application accesses an address in this range, the page fault will read in the data from the disk into the file cache and the virtual address will point to this same data.

If your program needs to make multiple passes through the file, or has a regular structure (such that a data structure could be defined, perhaps a future post on this), then memory mapped file IO can save time and space. However, if there are many small files, the file data only needs to be read once, or significant processing is required, then using a more "traditional" IO mechanism would be advisable.

Tuesday, June 21, 2011

Performance is a Feature

Performance is a Feature discusses recent performance work for the site, stackoverflow. I found the post interesting, albeit somewhat removed from my own work due to its web focus. As always with performance, there are two things I highlight. First, improve the biggest contributors first. Second, after a certain threshold, further improvements may not matter. For example, high frequency trading depends on the microseconds required, rendering a web page does not.

Tuesday, September 21, 2010

Review: Performance Anti-Patterns

Being a performance engineer, I always appreciate when others give solid guidance for how to write performant code. One such work is Performance Anti-Patterns. While you can read this article, there are a couple of things I'll reiterate here.

First, performance should be part of the entire development process. Trying to optimize code at the end of the cycle limits the options that are available. Conversely, adding micro-optimizations early is potentially wasteful as future development may negate the effect.

Second, benchmarks are key to proper measurement of performance. Establish what are the important scenarios for the application. And the metrics of interest for the benchmark. Is the goal to increase throughput? Or perhaps cycles per request? Power consumption at a given load? Without a specific measurement of performance, a benchmark just serves as an application that is run regularly.

This also defines the common case. Take a storage driver and three operations: read, write, and recovery. In all likelihood, the read and write operations occur far more often than recovery. If a disk fails, does it matter if the user is notified 10ms versus 20ms after the event? Does it matter if a read takes 10ms versus 20ms? Spend the effort on optimizations like-wise.

Third, think parallel. This requirement is beginning to sink in; however, just being multi-threaded is not enough. How many threads? One per task, CPU? Is the data properly partitioned? A scientific application will operate very different from a web server.

Finally, there is a cost and trade-off to performance. Reliability and security are obviously paramount. But also the readability of the code and future extensions to the application. Taking these elements into account can distinguish performant code from truly elegant implementations.