Friday, December 23, 2011

Measuring Performance

Starting from the primer Understanding processing time, let's delve into some methods for measuring the performance of software. Along the way, you might gain a better understanding of how the operating system (OS) tracks time. We'll take the different methods in order from least overhead to greatest.

First, the wall clock. I stare at my watch and see the second hand move. With each tick of the hand, another second has passed. Eventually the program completes, and I can record how many ticks have elapsed. (Yes, you can use time or another program to do this). I am constrained in knowing the time by my watch, and the computer is no different; it has an internal clock that ticks only so fast (see my comment on stackoverflow). We now know how long the program takes to run, a measure of performance, but not what it was doing (perhaps it sleeps for 20 minutes and then prints "complete!").
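A minimal sketch of doing the same from inside a program, assuming a POSIX system with clock_gettime (do_work is a hypothetical stand-in for the code being measured):

#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the code being measured. */
static void do_work(void)
{
    volatile long i;
    for (i = 0; i < 100000000; i++) ;
}

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);   /* read the "watch" */
    do_work();
    clock_gettime(CLOCK_MONOTONIC, &end);     /* read it again */

    double secs = (end.tv_sec - start.tv_sec)
                + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("elapsed: %.3f s\n", secs);        /* how long, but not what it was doing */
    return 0;
}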

The operating system gives some metric of every program's CPU usage, viewable with top, taskmgr, etc. On every scheduling interval, the operating system receives an interrupt and must decide whether to multitask and switch a different thread into execution. The OS also records that the past quantum of time was spent doing whatever was just interrupted. So we view the percentages of one second, where this chunk of 16ms (the default Windows quantum) was spent executing Word, that chunk was spent by the kernel, and those others were idle. There are still inaccuracies, but the overhead is low. As a note, the OS may actually record cycles elapsed on every task switch and thus compute a higher-accuracy percentage. (Of course, power management can play havoc with elapsed cycles versus elapsed wall clock time).
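If you want to read that accounting for your own process, here is a sketch using POSIX getrusage (Windows exposes the same idea through GetProcessTimes):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rusage ru;

    /* Ask the OS how much CPU time it has charged to this process so far. */
    if (getrusage(RUSAGE_SELF, &ru) == 0)
    {
        printf("user CPU:   %ld.%06ld s\n",
               (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
        printf("system CPU: %ld.%06ld s\n",
               (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    }
    return 0;
}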

So we now know that our application was using the CPU during its entire execution, but what was it doing?! Often the next step is to use a profiler. A profiler works just like the OS's CPU usage in the previous paragraph, except it does so at application granularity (although some profilers, like xperf, can do the whole system). Now on each interrupt (which may also be more frequent, like 1ms, with corresponding overhead), the profiler records the instruction pointer (IP) of the running process and again ascribes the quantum of time to the function containing that IP. Optionally, the profiler might record a call stack on the interrupt, so calling functions can be identified, which is particularly useful when shared routines are invoked (like STL container classes). Some profilers will break a function's usage down by IP, allowing the analyst to see which specific assembly instructions are executed frequently.
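To make the sampling idea concrete, here is a heavily simplified sketch of a profiling timer on Linux/x86-64; it assumes glibc's ucontext layout (REG_RIP) and just prints the interrupted IP instead of bucketing it by function:

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <ucontext.h>

/* SIGPROF fires each time the profiling timer expires; record where we were.
   (fprintf is not async-signal-safe, but it keeps the sketch short.) */
static void on_sample(int sig, siginfo_t *info, void *ctx)
{
    ucontext_t *uc = (ucontext_t *)ctx;
    void *ip = (void *)uc->uc_mcontext.gregs[REG_RIP];   /* interrupted IP */
    (void)sig; (void)info;
    fprintf(stderr, "sample at %p\n", ip);   /* a real profiler buckets this by function */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sample;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigaction(SIGPROF, &sa, NULL);

    /* Deliver SIGPROF every 1ms of CPU time consumed by this process. */
    struct itimerval it = { {0, 1000}, {0, 1000} };
    setitimer(ITIMER_PROF, &it, NULL);

    volatile unsigned long spin = 0;
    while (spin < 500000000UL) spin++;   /* stand-in for the real workload */
    return 0;
}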

But the profiler is still relying on probability and sampling to give an estimate of the CPU usage of the program. If we increase the overhead significantly more, we can know precisely what the program is executing. I've used systems where the compiler can put callbacks around every function call. Each callback logs the function being invoked and the current time (in cycles). Now one has a complete function trace of the running system, along with the cycles required for each segment of code. The only problem: the overhead is at least 100% (so performance is halved or worse).
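GCC's -finstrument-functions flag is one example of this mechanism; the following is only a sketch (a real tracer would log to a buffer and map the addresses back to names, and the __rdtsc cycle read assumes an x86 target):

/* Build with: gcc -finstrument-functions trace.c
   The compiler inserts these callbacks around every function entry and exit. */
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc(), a cycle counter */

__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *fn, void *call_site)
{
    (void)call_site;
    fprintf(stderr, "enter %p @ %llu cycles\n", fn, (unsigned long long)__rdtsc());
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *fn, void *call_site)
{
    (void)call_site;
    fprintf(stderr, "exit  %p @ %llu cycles\n", fn, (unsigned long long)__rdtsc());
}

static int work(int n) { return n ? n + work(n - 1) : 0; }

int main(void) { return work(3) == 6 ? 0 : 1; }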

Tangential to this entire discussion is the fact that modern processors also provide hardware performance counters, so the analysis can include cache misses, branch mispredictions, and other metrics of how efficiently the code is executing on the processor. And I have not discussed IO, although my general rule is asynchronous and nonexistent.

Again, always measure performance before cutting (i.e., optimizing).

Friday, November 11, 2011

Memory Mapped File IO

How often do you have to write a loop to read a file? Do you issue fscanf, fread, or perhaps create a stream via some object-oriented language?

Operating systems also provide support for easily reading in an entire file. A couple of function calls, concluding with mmap() or MapViewOfFileEx(), provide a char* pointer to the file's data. And with that, the file has been read into the program. So let's take a few moments to understand what the operating system is doing and use that to identify the advantages (and disadvantages) that memory mapped file IO provides.

When a program opens a file and performs IO via calls like fscanf, the program is using buffered IO. The operating system opens the file and sends IO operations based on the disk's granularity, which is at least 512B and often more. On a call like fscanf(fp, "%d", &value), the program is requesting somewhere between 1 and 16 bytes, which is far less than the disk's block size. The operating system will read in the entire disk block and then copy the appropriate bytes to the program, while retaining the remainder in its file cache. (By the way, the standard libraries also perform buffering, where the previous fscanf call might request 128 bytes from the OS and then process this data before returning to the application itself). Now, in making this call, how many copies of the data exist in the system? There is one on disk, one in the OS's file cache, one in the standard library, and one in the application.

If instead the application requests the file via mmap, nothing happens at first except that the OS reserves a range of the application's virtual address space for the file's data. Then, when the application accesses an address in this range, the resulting page fault reads the data from disk into the file cache, and the virtual address is made to point at this same data.
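A minimal POSIX sketch of that path (error handling trimmed, "input.dat" is a stand-in name; the Windows flow with CreateFileMapping / MapViewOfFileEx is analogous):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("input.dat", O_RDONLY);   /* hypothetical file name */
    struct stat st;
    fstat(fd, &st);

    /* Reserve address space for the whole file; no data is read yet. */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    /* First touch of each page faults the corresponding block into memory. */
    long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += data[i];
    printf("checksum: %ld\n", sum);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}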

If your program needs to make multiple passes through the file, or the file has a regular structure (such that a data structure could be defined over it, perhaps a future post on this), then memory mapped file IO can save time and space. However, if there are many small files, if the file data only needs to be read once, or if significant processing is required, then a more "traditional" IO mechanism would be advisable.

Thursday, November 3, 2011

Repost Are you a Programmer

Consider the question: what percentage of programs are released to the public (either commercial or open source)? Even excluding student assignments, this percentage may be far lower than one might think. My expectations are biased by having many friends at Apple, Facebook, Google, and Microsoft. But thinking further, I spent 10-20% of my time as a salaried employee writing just such programs - little tools that were never released. These coding projects were intended to help me do my job faster, which was contributing to a commercial software product.

This is my introduction to a piece of career advice regarding being a Computer Scientist (*cough* programmer), Don't Call Yourself a Programmer.

Wednesday, October 5, 2011

Repost Presentation Prep

So much of what I and many other Computer Scientists do is give presentations. When I worked full time, I prepared at least one presentation a month, often more, and wrote regular reports of my work for upper management. As a grad student, I give fewer presentations, but still write many papers.

Not to say I'm anywhere near perfect at presenting, but I have come a long way in learning how to write and present, skills that I once thought had no relation to my profession. So let me again refer to another blog that outlines the time and steps required to prepare these presentations. Many people think they are ready to present with just a set of slides and skimp on these steps. I've sat through "dry runs" where the presenter hadn't even practiced, and actual presentations where the presenter clearly wasn't ready. So take the time and practice; otherwise the rest of us walk back to our desks and laugh about the latest speaker we didn't understand.

Tuesday, September 27, 2011

Repost Heartbeat

I still do all the daily stuff: read emails, commute to campus, research new ideas for conferences like POPL or PLDI, and imagine that I'll someday have something worth posting. So when my usual routine led me to another programming blog I read posting about POPL research and the like, it seemed I should copy them - a heartbeat. The blog is still alive.

Monday, July 25, 2011

Book Review: Masterminds of Programming

Several years ago I read a couple of interesting interviews with language designers, so my curiosity was piqued by Masterminds of Programming: Conversations with the Creators of Major Programming Languages (Theory in Practice (O'Reilly)). While the title is pretentious, the premise was interesting: here are 17 well-used languages, so what was each designer thinking? Why did he make the choices that he did?

Alas, it was not quite to be. Interesting, yes, but the discussions were often more general, ranging over all of computer science. Yet nearly every interview had memorable lines and claims, a few of which I shall reproduce here:

"C is a reasonably good language for compilers to generate, but the idea that humans beings should program in it is completely absurd." - Bertrand Meyer, Eiffel

"One of our programming languages guys proposed a competition in which people would program the same program in all three of those [languages] and we'd see .... It turned out the only thing we figured out is it all depended on where the brightest programmer went...." - Charles Geschke, PostScript

"A good programmer writes good code quickly. Good code is correct, compact and readable. 'Quickly' means hours to days."
and
"An operating system does absolutely nothing for you. As long as you had something - a subroutine called a disk driver, a subroutine called some kind of communication support, in the modern world, it doesn't do anything else." - Chuck Moore, FORTH

"I do recommend [C++] and not everybody is reluctant. In fact, I don't see much reluctance in [system software or embedded systems] beyond the natural reluctance to try something new in established organizations. Rather, I see steady and significant growth in C++ use."
and
"I have never seen a program that could be written better in C than in C++. I don't think such a program could exist." - Bjarne Stroustrup, C++

So now I go to Amazon to remove the book from my to-read list and record a score of 3 out of 5.

Saturday, July 16, 2011

Parallel Programming - Synchronization

Having submitted a paper this week to a conference, I have been lax on posting any updates.  With the submission behind me, I shall present another topic that has arisen recently in research and conversations - scalable synchronization.  For this post, I presume everyone has some knowledge of synchronization constructs like mutual exclusion, and knowledge of cache coherence is also valuable.  The story of this post begins 20 years ago when Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors was published.

This work studied the cost of mutexes / spinlocks and barriers.  The authors presented several different algorithms that scale well with increasing core counts.  When I read a two-decade-old work in Computer Science, at best I might think, "Ah, here is where this idea was first proposed."  That is true of the above work, but I also sit scratching my head, wondering why these ideas haven't diffused into common knowledge.

This post will focus on a single type of spinlock, the queued spinlock.  I'll discuss the hidden costs and dangers of common spinlocks and why it is sometimes worth using a queued spinlock.  Some of the authors were so good as to propose several adaptations in Scalable queue-based spin locks with timeout, which I will also be discussing.  For this post, we will use an atomic_swap operation, which is available (in one form or another) on modern architectures.

T atomic_swap(T* loc, T val) { T old = *loc; *loc = val; return old; }

A basic (test-and-set) spinlock tries swapping in the value "locked" until it gets "unlocked" back, as follows.

void acquire_spinlock(int* lock)
{
    int value;
    do {
        value = atomic_swap(lock, LOCKED);
    } while (value == LOCKED);
}

Programmers soon realized that this implementation has some drawbacks.  Each core is constantly making requests for the same cache line and each one is trying to modify it.  The poor cache line is therefore going from processor to processor without anything useful happening.  So two changes were made: test-and-test-and-set and exponential backoff.  The following acquire routine has been extended with these changes, see "//".

void acquire_spinlock(int* lock)
{
    int value, delay = 1;
    value = atomic_swap(lock, LOCKED);
    while (value == LOCKED)
    {
        pause(delay); // take time before retrying
        delay = delay * 2;
        if (*lock != LOCKED) // test before swapping
            value = atomic_swap(lock, LOCKED);
    }
}

By delaying some time before attempting the lock again, each processor reduces its load on the memory system.  However, the core can be unlucky and be delaying when the lock is available, if it was released just after the test.  "Test-and-test-and-set" (TTS) is a simple paradigm where the code attempts an inexpensive check, like "*lock != LOCKED" before the expensive operation "atomic_swap".

However, TTS can still have many threads of execution enter the same expensive region.  The spinlock here is safe, but a programmer might have:
if (global_buffer == NULL) { global_buffer = malloc(SPACE); }
This sequence could have several threads attempting to allocate the same large buffer and therefore leaking memory, or worse.

But back to spinlocks.  With TTS, many threads have all requested read permissions for the lock's cache line.  Each sees that the lock is UNLOCKED and requests write permissions.  The cores are sending a storm of requests that will conclude with one core acquiring the lock and the rest spinning again.

Skipping over some intermediary research, the state of the art has generally settled on the queued spinlock.  The idea is that each waiter will spin on a local variable rather than the "lock" itself.  While both referenced works give code for the lock, it is relatively simple to derive once you know it exists (a fellow student and I worked out the algorithm in less than 30min), but I'll provide a sketch here.

struct entry
{
    volatile entry* next;   // successor in the waiter queue
    volatile int state;     // LOCKED while this thread must keep spinning
};

// "lock" is the queue's tail pointer; "me" is this thread's queue entry.
void acquire_queued_spinlock(entry** lock, entry* me)
{
    me->next = NULL;
    me->state = UNLOCKED;
    entry* prev = atomic_swap(lock, me);  // atomically become the new tail
    if (prev == NULL) return;             // queue was empty: lock acquired
    me->state = LOCKED;                   // mark waiting before linking in
    prev->next = me;                      // let the predecessor find us
    while (me->state == LOCKED) ;         // spin on our own entry
}

The queued spinlock itself is only a tail pointer.  Each thread has a pointer to its own entry, and new entries are inserted only at the tail.  After the swap, the thread holds a unique pointer to the "old" tail of the queue, which it extends by linking itself in.  The thread then spins on the local variable "me->state".  Furthermore, on release, only one waiting thread is woken, which minimizes the memory traffic from the cores fighting over the lock.  However, the queued lock is more expensive to acquire when there is no contention, and it can be a greater pain to debug.  A sketch of the release path follows.
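The referenced papers give the full algorithm; below is only my sketch of the matching release path.  It additionally assumes a compare-and-swap primitive, atomic_compare_and_swap(loc, expected, new), that returns the old value (not introduced above), and it must handle the window where a successor has swapped itself onto the tail but not yet linked prev->next:

void release_queued_spinlock(entry** lock, entry* me)
{
    if (me->next == NULL)
    {
        // No visible successor: try to swing the tail back to "empty".
        if (atomic_compare_and_swap(lock, me, NULL) == me)
            return;                   // queue really was empty; lock released
        while (me->next == NULL) ;    // a successor swapped in but hasn't linked yet
    }
    me->next->state = UNLOCKED;       // hand the lock to exactly one waiter
}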

Reviewing the measurements in Algorithms..., the simple test-and-set spinlock performs best at very low levels of contention.  After a modest amount of lock contention, the TTS lock with exponential backoff performs best.  But as the number of requests continues to increase, the queued spinlock dominates from then on.

As with all performance work, measure twice, cut once.  The correct spinlock for your parallel code will depend on the level of contention and the specific architecture of the system.

Tuesday, June 21, 2011

Performance is a Feature

Performance is a Feature discusses recent performance work for the site, stackoverflow.  I found the post interesting, albeit somewhat removed from my own work due to its web focus.  As always with performance, there are two things I highlight.  First, tackle the biggest contributors first.  Second, past a certain threshold, further improvements may not matter.  For example, high frequency trading depends on the microseconds required; rendering a web page does not.

Monday, June 20, 2011

Book Review: Write Great Code Volume 1 (part 4 of 4)

As I stated in part 3, I found some erroneous information in chapter 11 about NUMA.  While this book was published in 2004, I cannot excuse the mistakes given that the research dates to the 90s (cf. The SGI Origin: A ccNUMA Highly Scalable Server - ISCA'97).  Fortunately, the mistakes are relatively small (only a couple of paragraphs); however, the topic is valuable to know.  I've spent a fair amount of my time trying to educate programmers on good practices with NUMA (cf. Topology information under Windows), so I don't appreciate books providing bad information.

NUMA, or non-uniform memory access (or architecture), describes a system that has multiple sets of main memory.  For programmers, this usually leads to varying access times, as some memory is "closer" than other memory.  Fortunately, modern operating systems (Windows and Linux) will give applications "close" memory when they allocate it, which works well most of the time.
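For the rare case where the default placement policy is not enough, Linux exposes explicit placement through libnuma; here is a hedged sketch (link with -lnuma; the node number and buffer size are arbitrary choices of mine, and Windows offers VirtualAllocExNuma for the same idea):

#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0)
    {
        printf("not a NUMA system (or libnuma unsupported)\n");
        return 0;
    }

    /* Explicitly place a 64 MB buffer on node 0, regardless of which
       node the calling thread happens to be running on. */
    size_t size = 64u * 1024 * 1024;
    char *buf = numa_alloc_onnode(size, 0);
    if (buf != NULL)
    {
        memset(buf, 0, size);    /* touch the pages */
        numa_free(buf, size);
    }
    return 0;
}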

In Write Great Code, the author is apparently unaware that main memory can have varying access times and therefore writes, "the term NUMA is used to describe blocks of memory that are electronically similar to main memory but, for one reason or another, operate significantly slower than main memory."  And furthermore provides the examples of graphics card memory and flash devices.

The last chapter, 12, discusses I/O.  Of all the chapters, this one is most tied to "current" hardware and therefore is the most dated.  Here are three topics that I would revise if there were a second edition: AGP versus PCI-E, polling versus interrupts, and synchronous versus asynchronous I/O.  Accelerated Graphics Port (AGP) was a late-90s / early-00s connection that has been phased out for the more general PCI Express (PCI-E) interface; however, PCI-E wasn't standardized until 2004 and therefore missed being included.  Second, at the highest performance levels, polling of devices can perform faster than interrupts.  For example, there is always more data available on a 10Gb/s connection, so polling always returns successfully; however, interrupts are still better at lower transfer rates (although I don't know the crossover point).  Finally, using asynchronous I/O is at the top of my list of things to know about devices.  In all, this chapter stood at 75 pages and is effectively a rehash of what can now be found on Wikipedia.

Overall, most of this work was subsumed by my undergraduate Computer Science coursework (particularly the intro systems course).  It is therefore hard for me to know who the appropriate audience would be, so I have a neutral take on it.  But this isn't the last, as volume 2 has also been checked out from the library.

Thursday, June 16, 2011

Book Review: Write Great Code Volume 1 (part 3 of 4)

With Parts 1 and 2 behind us, the text can race forward through character sets, present memory usage / layout (like my Pointers on Pointers series), discuss Boolean logic (which I only learned 6? times in college), and settle into CPU architecture, which is chapter 9.  This chapter would roughly serve as a syllabus for my undergraduate and graduate computer architecture classes: the reader is told, "Hey, processors are capable of doing this... somehow," and each topic maps to the class where one could learn "this is how."

In learning about CPU architectures, one of the main points comes 30 pages into the chapter, to quote: "The techniques up to this point in this chapter can be treated as if they were transparent to the programmer."  The paragraph continues by stating that programmers can achieve further improvements by knowing the architecture; however, the huge caveat is whether the changes required are "worth it" (cf. Performance Anti-Patterns).  Nonetheless, there are two things to note.  First, branches can significantly affect processor performance; changing their prevalence and particular direction can have benefits.  Second, shorter instructions are usually faster, as they save cache space and therefore memory accesses.

Chapter 10 presents the other half of architecture, the instruction set (ISA).  Let's just say that I've been working off and on with the Intel instruction set for years, and I still had difficulty with this chapter.  At one level ISAs are simple to explain, and so this chapter does in its introduction.  But modern ISAs (e.g., x86) have many complexities that make understanding difficult, as this work took whole pages to print the charts for particular opcode bits.  So I reached the end of the chapter knowing what the author intended to convey, but not actually having gained that knowledge.

I had hoped to conclude volume 1 in three parts, but chapter 11 (memory architecture) warrants its own post as it has wrong information.

Tuesday, June 7, 2011

Conference Time: HotPar-3

A little over a week ago, I was at Berkeley for the 3rd annual Workshop on Hot Topics in Parallelism.  The exciting thing was that researchers from a variety of areas were there, unlike my impression of conferences where everyone is in the same area.  Computational scientists, computer architects, programming language and compiler folk, and more were present to discuss their work on parallelism.

There are a couple of works that I would like to highlight (none of which are my own) from the complete list available at HotPar11.  First, "Parallel Programming with Inductive Synthesis" had some of the coolest and most unexpected ideas that I heard.  Their work provides a programmer with the ability to sketch a solution.  A simple example is factoring: given an initial function f(x) = x^2 - 2 * x - 3, the programmer requests that the system provide an equivalent function in the form g(x) = (x - ??) * (x - ??).  (?? is the notation for a value the system should synthesize; here it would fill in 3 and -1.)

Factoring is a known problem, so the system just factors, right?  Wrong!  Rather (based on my memory of the presentation), the system converts the two functions into circuits, then sets them equal to each other and finds a satisfying assignment to all variables.  Therefore, the system can create the synthesized functions in the general case.

The second interesting work explored data races.  In "How to Miscompile Programs with "Benign" Data Races", the attendees saw how even benign data races could cause errors given certain compiler optimizations.  Unfortunately, the examples are more involved and therefore I refer you to the paper.

The final interesting result that I wish to highlight is that given current technology trends, matrix multiply will be memory bound in 10 years.  In "Balance Principles for Algorithm-Architecture Co-Design", the authors derive simplified formulas for specific functions that can show how their performance relates to current / future technology.  The intent is to enable architects, etc to understand where to best focus the design "dollar", in order to achieve the best performance for that dollar.

I've selected these works primarily on the basis of the results staying with me.  I found many others professionally interesting, but the three papers here are the ones that made me sit up, scratch my beard, and go "hmm".

Thursday, May 19, 2011

Book Review: Write Great Code Volume 1 (part 2 of 4)

+/- 1.F x 2^(Exp)

That's the modern floating point representation (with a few caveats).  It is stored bitwise as Sign | Exp | F.  As this representation uses a variable exponent, the point floats to different positions.  By doing so, the computer can store a much greater range of numbers than integer representations, but with a small loss of precision.  In chemistry I first learned of "significant digits," and floating point similarly limits the user to a fixed number of significant decimal digits (roughly 7 for a 32-bit float and 15 - 16 for a 64-bit double, since each decimal digit costs a little over 3 binary digits).  For this post and subsequent posts in this series, I'll be writing a combination of what I knew prior to reading, as well as information from the book's text.

There are four official formats for floating point numbers, using 16, 32 (a C float), 64 (a C double), and 128 bits respectively.  Primarily, the larger representations devote more bits to the fraction, which provides increased precision.  Intel also has an 80-bit representation, which has 64 bits for the fraction, thereby enabling existing integer arithmetic support when the exponents are the same.  Furthermore, ordering the bits as sign, exponent, fraction also enables integer comparisons between floats.
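A small sketch for poking at the 32-bit layout directly (assumes an IEEE 754 float; the union trick is the usual practical approach, though technically implementation-defined):

#include <stdio.h>
#include <stdint.h>

/* View a float's bits as sign | exponent | fraction. */
int main(void)
{
    union { float f; uint32_t u; } v;
    v.f = -6.25f;

    uint32_t sign = v.u >> 31;            /* 1 bit  */
    uint32_t exp  = (v.u >> 23) & 0xFF;   /* 8 bits, biased by 127 */
    uint32_t frac = v.u & 0x7FFFFF;       /* 23 bits: the F of 1.F */

    printf("%f = sign %u, exponent %u (unbiased %d), fraction 0x%06X\n",
           v.f, sign, exp, (int)exp - 127, frac);
    return 0;
}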

Given the precision limitations (even with 128-bit floats), there are some gotchas with arithmetic.  First, equality is hard to define.  Equality tests need to respect the level of error present in the floating point values: equality becomes finding that two numbers are within some error tolerance, as follows.

#include <math.h>       // for fabsf
#include <stdbool.h>

bool Equal(float a, float b)
{
    const float error = 1e-6f;         // tolerance; application-specific

    bool ret = (a == b);               // Don't do this

    ret = (fabsf(a - b) < error);      // Test like this

    return ret;
}

Another gotcha is preserving precision.  With addition and subtraction, the floats need to be converted to have the same exponent, which can result in loss of precision.  Therefore, the text recommends multiplication and division first.  This seems reasonable, and wasn't something I had heard before.
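A tiny illustration of the exponent-alignment loss (the values are chosen so the effect shows up in single precision):

#include <stdio.h>

int main(void)
{
    float big = 1.0e8f;      /* needs a large exponent */
    float small = 1.0f;      /* shifted out entirely when aligned to big */

    /* Adding first throws "small" away; the later subtraction cannot recover it. */
    printf("(big + small) - big = %f\n", (big + small) - big);   /* prints 0 */
    printf("(big - big) + small = %f\n", (big - big) + small);   /* prints 1 */
    return 0;
}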

Friday, May 13, 2011

Book Review: Write Great Code Volume 1 (part 1 of 4)

What is great code?  What characteristics would you describe it as having?  These questions come to mind as I am reading the first of a four-volume series on writing great code.  The first work is subtitled "Understanding the Machine."  Before I delve into what I've "learned" by reading this volume, what should one hope to gain?  What understanding of modern machines is required for writing code, especially great code?

First, I'd emphasize that great code is more about maintainability and understanding than efficiency.  Correctness over performance.  And therefore the text is more about how understanding the machine can clean up the code, simplify designs, and ensure correctness.

Second, in the context of the previous post, most programs do not need to "understand" the machine.  Most applications will not be constrained by execution time, and therefore the programmer effort should be directed elsewhere.  Yet programmers, myself included, reach their own insights into the application / computer interaction and modify code accordingly, and usually these insights are irrelevant to the application.

Third, there are specific aspects of modern computer architecture / machine design that are still worth knowing.  For example, I find that most programmers have a limited understanding of branch prediction, cache usage, NUMA hierarchies, and superscalar processors.  (Based on my reading so far, I would also add floating point to this list).

What else should a programmer know?  What should I be looking for while I read this book?

Monday, April 18, 2011

Is vs Ought in Computer Science

In philosophy, we debate what something is versus what it ought to be.  And often the usage is confounded.  In thinking about functional programming, I recalled this debate.

A functional program declares what ought to occur.  The computer has large leeway in how it will successfully execute this program.  In contrast, an imperative program declares what will occur.  The computer has little leeway in executing this program.

In programming imperatively, we can more clearly reason about the specific execution time / space required, as we can know the fullest details about the program.  Each step and instruction is clear, along with the specific allocation points, so the memory usage can be comprehended.  Yet this model effectively went away over 20 years ago, with superscalar processors and overall program complexity.

A common fallacy that programmers make is to try to reason about the details of execution time.  Questions are pondered like should an increment be "++" or "+= 1"?  Would it make you feel better to know I can construct cases where one executes faster than the other?  Most programmers who want to reason about their program are not sufficiently equipped to do so.

Lest we discard this benefit of imperative programming entirely, there are other decisions that are still within reason.  For example, the specific layout of the data in a structure can have an impact on execution time.  And while the layout is doable by hand, the response I received from other programmers is "can't the compiler do this for me?"
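A simple illustration of what "layout" means here, using field ordering and the padding it implies (the byte counts assume a typical 64-bit ABI, so treat them as illustrative):

#include <stdio.h>

/* Fields ordered carelessly: the compiler inserts padding after each
   small member to align the following larger one. */
struct padded
{
    char   flag;     /* 1 byte + 7 bytes padding */
    double value;    /* 8 bytes */
    char   tag;      /* 1 byte + 7 bytes padding */
};

/* Same fields, largest first: less padding, more instances per cache line. */
struct reordered
{
    double value;    /* 8 bytes */
    char   flag;     /* 1 byte */
    char   tag;      /* 1 byte + 6 bytes padding */
};

int main(void)
{
    printf("careless order: %zu bytes\n", sizeof(struct padded));
    printf("largest first:  %zu bytes\n", sizeof(struct reordered));
    return 0;
}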

Fine.  Let's only program with what the application ought to be doing.  The compiler / run-time / operating system all serve to take these statements of ought and make them be.  Still someone must reason about the is of programs, unless we have a processor that directly takes the functional statements.

I shall continue to be skeptical about programming functionally; however, I advocate further support for stating the "ought" of program statements to aid the verification, etc that is so highly lauded.  Imperative programs need to contain more details about what the program ought to be doing, which can be used in interesting ways beyond mere verification of correct execution.  This direction is part of my research, which I look forward to discussing at HotPar 2011 - "Parallel Pattern Detection for Architectural Improvement".  The specifics of which I'll wait to discuss with the proceedings.

Saturday, March 26, 2011

Functional vs Imperative Programming

I was recently provided a link to another "programming" blog.  The blog, Existential Type, is written by a professor from my undergrad institution, Carnegie Mellon.  In it, he appears to be chronicling his efforts teaching functional programming to freshman computer science students.  I have been deeply engrossed in reading the posts thus far, and they have revived memories of past debates about language choice, etc.  You may look forward to several future posts delving deeper into this issue.  But today is Saturday and I have more research to do.

Monday, March 7, 2011

A Rare Error

In the course of my research, I encountered the following error and was unable to find any reference to solutions.  However, this error is likely confined to programmers writing their own types using reflection or perhaps a buggy compiler.

PEVerify - "call to .ctor only allowed to initialize this pointer ..."

This error signifies that a constructor (.ctor) should call the constructor for itself or for the type it is extending; however, it is directly calling a different routine.  The following code is a rough sketch of the error condition.

class baseT  // implicitly derives from Object
{
    baseT() { }
}

class derT : baseT
{
    derT()
    {
        Object();  // calling a different constructor than base class
    }
}

For my work, changing the call from Object() to baseT() fixed the error.

Friday, March 4, 2011

Parallel Programming - First Pass

With my time dominated by preparing for PhD qualifying exams (i.e., quals), I have been even more slack than usual with regard to preparing regular posts.  Nonetheless, let's talk a little about parallel programming.  In one sense, the parallel paradigm is the future of computer science, even if I remain highly skeptical about what the specifics of that computing will be.  But even if its role in general computing remains unclear, the usefulness of parallel computing in specific domains is not in doubt.  This post will serve as an overview of several concepts in parallel programming.

First, let's distinguish between concurrent and parallel execution.  Concurrent execution has the possibility or potential of executing simultaneously; parallel execution is when this potential is realized.  Concurrent execution is possible with a single core; parallel execution is not.

Synchronization is the main question when writing concurrent code.  Synchronization introduces a specific ordering to what was otherwise independent execution.  There are two common flavors: exclusion and notification.  Exclusion consists of mutexes, spinlocks, and other constructs that guarantee that only a single instance of concurrent execution performs a specific set of operations.  With notification, concurrent executions establish information with respect to each other, for example that every instance has reached a specific point (e.g., a barrier).
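A hedged sketch of both flavors using POSIX threads (the names and thread count are mine, not from any referenced code; build with -pthread):

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t phase_barrier;
static long shared_sum = 0;

static void* worker(void* arg)
{
    long id = (long)arg;

    /* Exclusion: only one thread at a time updates the shared counter. */
    pthread_mutex_lock(&sum_lock);
    shared_sum += id;
    pthread_mutex_unlock(&sum_lock);

    /* Notification: nobody proceeds until every thread reaches this point. */
    pthread_barrier_wait(&phase_barrier);

    if (id == 0)
        printf("all threads arrived, sum = %ld\n", shared_sum);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    pthread_barrier_init(&phase_barrier, NULL, NUM_THREADS);

    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void*)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    pthread_barrier_destroy(&phase_barrier);
    return 0;
}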

An ongoing quest in synchronization research is transactional memory (TM).  TM provides the ability to make a set of memory updates atomically.  Processors provide the ability to make simple updates atomic (see Compiler Intrinsics), yet a series of updates requires the explicit exclusion guarantee provided by spinlocks, etc.  TM brings the exclusion to the memory addresses themselves, rather than the abstract object protected by the spinlock, and allows an arbitrary set of accesses to be encapsulated in the atomic operation.  However, TM is not presently feasible.
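To make the single-update / multi-update distinction concrete, here is a sketch using GCC's __sync builtins (the transfer example is mine, not from the post):

#include <pthread.h>

static long counter = 0;
static long account_a = 100, account_b = 0;
static pthread_mutex_t transfer_lock = PTHREAD_MUTEX_INITIALIZER;

void bump(void)
{
    /* One word: the processor can do this atomically on its own. */
    __sync_fetch_and_add(&counter, 1);
}

void transfer(long amount)
{
    /* Two words that must change together: without TM, we fall back to
       mutual exclusion so no thread sees a half-finished update. */
    pthread_mutex_lock(&transfer_lock);
    account_a -= amount;
    account_b += amount;
    pthread_mutex_unlock(&transfer_lock);
}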

Parallel patterns are formed from the observation that parallel programs and algorithms can be classified into several distinct groups (i.e., patterns).  An assembly line is a parallel operation and fits the "pipelined" pattern.  By recognizing the pattern, the programmer can avoid certain common errors.  With the pipeline, the programmer recognizes that the data is to be passed through discrete stages.

Well, that's my prelude to what will likely be many more posts on parallel programming.

Thursday, February 24, 2011

VC++ Concurrency Runtime

I was not aware of any particular concurrency support in the VC++ environment, so I was delighted when a friend of mine posted about the VC++ Concurrency Runtime and lambda expressions.  So while the article was about using lambda expressions, I learned about the concurrency support, and especially about the Parallel Patterns Library.  Lambda expressions intrigue me, not because I actually use them, but because I persist in imagining how cool they are.  I shall save lambdas for another post.

Wednesday, February 9, 2011

Compiler Intrinsics - The Secret Sauce

Compiler intrinsics are one of the things that C programmers are often unaware of.  Sometimes they know that such functionality should exist but suspect it to be cloaked in assembly.  Other times, the intrinsic does something otherwise unimagined by the developer.  For this article, I will confine my notation and functionality to that provided by Visual Studio (see MSDN); however, I have every expectation that other platforms / compilers operate in a similar manner.

Intrinsics provide two things to a programmer.  First, they expose compiler-supported functionality that might otherwise be unavailable.  Second, and related, the compiler understands what each intrinsic does.  The intrinsics come in three flavors: obscured functionality, replicated functionality, and compiler insights.  In discussing these flavors, I think the two provisions will be better understood.

Architectures provide many instructions that are not immediately accessible to the average programmer.  A simple example is counting the bits in a variable.  Such functionality is quite simple for hardware and already provided by the processor, yet I am aware of no language that has notation for such an operation.  Instead, a programmer can invoke __popcnt(var) and be done.  Invoking this intrinsic provides two benefits: first, performance and second correctness.  On Stack Overflow, the equivalent C implementation is suggested as:


int NumberOfSetBits(int i)
{
    i = i - ((i >> 1) & 0x55555555);                        // count bits within each 2-bit field
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);         // sum adjacent pairs into 4-bit fields
    return ((i + (i >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;  // sum nibbles; total gathers in the top byte
}

Replicated functionality steps up one level in complexity.  These operations may not have simple assembly equivalents, yet there still may be advantages from using the intrinsics.  An example is memcpy (in a prior job, I worked extensively to improve the implementation of memcpy so this is particularly dear).  There is a version of memcpy in the C runtime library, which is a good general purpose implementation.  Yet, if the program is not copying variable length data, perhaps inlining the call can provide benefit.  Beyond just avoiding the function call, having a fixed copy size also enables the compiler to generate instructions without the loops, etc that are in the memcpy implementation.  By utilizing the memcpy intrinsic, the compiler can do all of this, whereas without the intrinsic the linker is limited to linking in the function from the separate library.

Many times I've heard someone cite the halting problem and correctly assert that a compiler cannot fully intuit the operation of a program.  Beyond this, even limited analysis of source code is a difficult problem.  To this end, there exist a limited set of intrinsics that provide the compiler with further information regarding program state.  In particular, there is __assume.  This intrinsic posits that some expression is true, which can result in further optimization of the subsequent code.  It is recommended that this intrinsic is paired with asserts, as follows.

#ifdef DEBUG
#define INVARIANT(e) assert(e)
#else
#define INVARIANT(e) __assume(e)
#endif

Compiler intrinsics provide the programmer the opportunity to improve application performance by both simplifying algorithms with better hardware support and by guiding the compiler with a greater understanding of the code's operation.

EDIT: The intrinsics are also termed "builtins" for GCC, see here.

Wednesday, February 2, 2011

Ray Tracing Performance and Practice Part 3

When last we left the ray tracer, it could read in PLY files and then render a scene.  But while it rendered, you'd probably want to get dinner, read some blogs, and write some code.  Very slow.  Improving performance is very straightforward: either reduce the work done or find a way to do the same work faster.

At the time of original development, I knew that there existed further performance opportunities via Oct Trees.  The idea behind these and other similar structures is to partition the scene into sub-scenes that can be treated in the aggregate.  Let's work through a simple example.

While the actual ray tracer is in 3D, we'll reason through a 2D tracer.  We'll have lots of circles and triangles in standard XY coordinates.  We can start partitioning the scene into quadrants: +X +Y, +X -Y, etc.  Now, when we draw a line (i.e., the ray) through the scene, rather than checking whether the ray is intersecting every circle and triangle, we can first check whether it is passing through any quadrant.  If the ray doesn't pass through the quadrant, then the ray will also not pass through any object in the quadrant.  The Oct Trees work very similarly, except besides X and Y, the scene is also split by +/- Z.

If the scene is split just once, this can likely provide some benefit, but for real practicality each octant of the tree must also be split repeatedly, until only a few scene objects (i.e., spheres and triangles) remain in each octant.  This is all well and good; however, four particular problems arise.  First, where should each octant be split?  Second, how many scene objects should be in each octant?  Third, how should objects that straddle octant boundaries be handled?  Finally, how does one practically render with an Oct Tree?

Where?  Two possibilities were tried: splitting in the "center" of the objects and splitting in the center of the octant.  If the split point is based on the objects, then each octant should have a similar number of objects.  This bounds the depth of the Oct Tree by making each octant have a subset of the parent octant.  However, this also forces every ray to traverse several levels of the Oct Tree, as every octant should contain scene objects.  Instead, if each octant is split at its center, then many octants may be empty, but the ones with scene objects may have significant number of levels in the tree before reaching individual objects.

How many?  Ideally, this number would be reached based on considerable experimentation and analysis.  At present, the code just splits an octant when it contains more than 16 scene objects.  Given the 50k triangle scene from the previous post, this would create roughly 3000 final octants, 375 parent octants, 47 grandparents, 6 great grandparents, and 1 root.  Ideally.

Yet the split points are never ideal, and so the octants are imbalanced, especially as some scene objects will cross the octant boundaries.  For example, a sphere placed at the origin is present in all 8 octants.  So the Oct Tree places a scene object into every octant that it is present within.  Rather than computing whether the object is precisely within the octant, the Oct Tree creates a bounding cube around the object.  Back in the 2D space, draw a triangle from (1,1), (-1, 1), and (-1, -1).  A square around this triangle would also include (1, -1), and therefore the triangle would also be "part" of the +X -Y quadrant.  I would be happy to switch to computing whether the object intersects the octant boundary planes; however, that will be when I understand the math involved.
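A sketch of that bounding-box test (2D for brevity, to match the quadrant example; the 3D version just adds a Z interval):

#include <stdbool.h>

typedef struct { float min_x, min_y, max_x, max_y; } box2d;

/* Bounding box of a triangle: the per-axis min/max of its vertices. */
static box2d triangle_bounds(const float x[3], const float y[3])
{
    box2d b = { x[0], y[0], x[0], y[0] };
    for (int i = 1; i < 3; i++)
    {
        if (x[i] < b.min_x) b.min_x = x[i];
        if (y[i] < b.min_y) b.min_y = y[i];
        if (x[i] > b.max_x) b.max_x = x[i];
        if (y[i] > b.max_y) b.max_y = y[i];
    }
    return b;
}

/* Conservative membership: the object goes into any quadrant whose box
   overlaps its bounding box, even if the object itself does not reach it. */
static bool boxes_overlap(box2d a, box2d b)
{
    return a.min_x <= b.max_x && b.min_x <= a.max_x &&
           a.min_y <= b.max_y && b.min_y <= a.max_y;
}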

The final issue is how to put the Oct Tree into practice.  In the immediate, the intersect code follows:

    if (No Octants)    // leaf octant: test the scene objects directly
    {
        foreach SceneObject in OctTree
            Compute Distance for Ray to Intersect Object
        Return closest object or none
    }
    foreach Octant in OctTree
    {
        Compute Distance for Ray to intersect Octant
    }
    Sort Distances
    Select closest octant
    do {
        Recurse on octant
        if (Object returned) return object
        Select next closest octant
    } while (Octants remain)
    return none

A potential optimization is to handle shadows, reflected rays, etc from the lowest octant rather than starting at the root.  However, implementing such an optimization requires significant state tracking, which also impedes making the ray tracer multi-threaded.

In a future part 4, I'll explore the hurdles of making the tracer multi-threaded rather than its current implementation of one thread.

Wednesday, January 19, 2011

Review: Measurement Bias (ASPLOS'09)

Given that I read 150 - 200 research papers a year, it seems only reasonable that I point out papers that I find particularly interesting.  One of the first papers that I read in grad school (and still one of the most interesting) is the subject of this post, Producing wrong data without doing anything obviously wrong!

We know of many sources of variance in our measurements, like whether other applications or background processing are running, or second-order items like where the data is laid out on disk (inner versus outer tracks) and which specific pages of memory are allocated (as that can influence caching).  But these variations are (usually) different from run to run, so by taking many measurements we can see an accurate picture of performance in which the events above occur with some frequency.

The paper tests the following comparison: what benefit does -O3 provide over -O2 in gcc?  Beyond the variations above, what items that we aren't aware of may affect performance, particularly items that don't vary from run to run?  The danger is that these artifacts can produce "wrong data" without our knowing it.  Two artifacts are analyzed in the paper: linking order and environment size.  Taking these in order.

The authors found that changing the order that the libraries are linked in the applications showed a performance variation of 15%.  On further analysis, they found that certain performance critical sections of code would have different alignments depending on the linking order.  Sometimes the code would be in one cache line, other times two.  This bias persists in both gcc and ICC (Intel's compiler).

Environment size also has unpredictable effects on application performance.  On the UNIX systems tested, environment variables are loaded onto the stack before the call into the application's main() function.  As the variables increase in size, the alignment of the stack changes and this causes performance effects throughout the code.  Some applications have minimal effects, others will vary by +/- 10%.

While these are but two cautionary tales of how small changes to program state can have significant performance impacts, the main take-away is that these effects will always be present.  They call us to action in preparing more diverse experimental setups that can address these biases, just as conducting multiple runs addresses interrupts, context switches, etc.

Wednesday, January 12, 2011

From C to C#

My recent work has led me to port several programs that I'd previously developed in C into C#.  In a previous post, I explained some of my preference for C, but work being work (or research)....  After these efforts, I thought I'd expand upon some of the joys and sorrows experienced along the way.

First, overall the process was surprisingly easy.  Most code required little to no changes.  Often, entire code bodies were copied between projects with a single find / replace: "." in place of "->".

Second, the minor changes were basically those within the object-based model.  Fields of a struct become public (or private).  And global functions need to be assigned to a class.

Third, some types and constructs do not have an easy conversion.  While not difficult, it was nonetheless annoying to have to redo arrays as having run-time defined length.  Or to change types to UInt32.

The final aspect of porting between languages is the run-time / OS / system interfaces, which do change.  So while the internals of the magical program box remain relatively constant, the externals of how it interacts with the system to take in / send out data do change.  Fortunately, for my porting tasks, this code was relatively modest.  (One of the more difficult parts of this step is covered in Ray Tracing ... Part 2).