Tuesday, June 21, 2011

Performance is a Feature

Performance is a Feature discusses recent performance work on Stack Overflow.  I found the post interesting, albeit somewhat removed from my own work due to its web focus.  As always with performance, I highlight two things.  First, improve the biggest contributors first.  Second, beyond a certain threshold, further improvements may not matter.  For example, high-frequency trading depends on shaving microseconds; rendering a web page does not.

Monday, June 20, 2011

Book Review: Write Great Code Volume 1 (part 4 of 4)

As I stated in part 3, I found some erroneous information in chapter 11 about NUMA.  While this book was published in 2004, I cannot excuse the mistakes given that the research dates to the 90s (cf. The SGI Origin: A ccNUMA Highly Scalable Server - ISCA'97).  Fortunately, the errors are relatively small (only a couple of paragraphs); however, the topic is valuable to know, and I've spent a fair amount of my time trying to educate programmers on good practices with NUMA (cf. Topology information under Windows), so I don't appreciate books providing bad information.

NUMA, or non-uniform memory access (or architecture), describes a system that has multiple sets of main memory.  For programmers, this usually leads to varying access times, as some memory is "closer" to a given processor than other memory.  Fortunately, modern operating systems (Windows and Linux) will give applications "close" memory when they allocate it, which works well most of the time.
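To make the "close" memory point concrete, here is a minimal sketch (mine, not the book's) of explicitly asking Windows for memory on a particular NUMA node; it assumes the Vista-era NUMA APIs and simply prefers node 0.

/* Minimal sketch: query the NUMA topology and request memory that is
 * preferentially placed on node 0.  Assumes Windows Vista or later. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG highest_node = 0;
    if (!GetNumaHighestNodeNumber(&highest_node)) {
        printf("NUMA information unavailable\n");
        return 1;
    }
    printf("highest NUMA node: %lu\n", highest_node);

    /* Ordinary allocations are generally satisfied from a node "close" to
     * the thread that uses them; this call makes the preference explicit. */
    SIZE_T size = 1 << 20;  /* 1 MB */
    void *buffer = VirtualAllocExNuma(GetCurrentProcess(), NULL, size,
                                      MEM_RESERVE | MEM_COMMIT,
                                      PAGE_READWRITE, 0 /* preferred node */);
    if (buffer == NULL) {
        printf("VirtualAllocExNuma failed: %lu\n", GetLastError());
        return 1;
    }

    /* ... touch the buffer from threads running on or near node 0 ... */
    VirtualFree(buffer, 0, MEM_RELEASE);
    return 0;
}

Threads running on (or near) node 0 then see the lower, local access times rather than the remote ones.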

In Write Great Code, the author is apparently unaware that main memory itself can have varying access times and instead writes, "the term NUMA is used to describe blocks of memory that are electronically similar to main memory but, for one reason or another, operate significantly slower than main memory."  He then offers graphics card memory and flash devices as examples.

The last chapter, 12, discusses I/O.  Of all the chapters, this one is most tied to "current" hardware and therefore is the most dated.  Here are three topics that I would revise if there were a second edition: AGP versus PCI-E, polling versus interrupts, and synchronous versus asynchronous I/O.  Accelerated Graphics Port (AGP) was a late-90s / early-00s connection that has been phased out in favor of the more general PCI Express (PCI-E) interface; however, PCI-E wasn't standardized until 2004 and therefore missed being included.  Second, at the highest performance levels, polling of devices can perform faster than interrupts.  For example, there is always more data available on a 10Gb/s connection, so polling always returns successfully; however, interrupts are still better at lower transfer rates (although I don't know the crossover point).  Finally, using asynchronous I/O is at the top of my list of things to know about devices (a small sketch follows).  In all, this chapter stood at 75 pages and is effectively a rehash of what can now be found on Wikipedia.
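As a quick illustration of the asynchronous point, here is a minimal sketch (my own, not the book's) using Windows overlapped I/O; the file name "data.bin" is just a placeholder.

/* Minimal sketch of asynchronous (overlapped) I/O on Windows: issue a read,
 * overlap it with other work, then collect the result. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE file = CreateFileA("data.bin", GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (file == INVALID_HANDLE_VALUE) return 1;

    char buffer[4096];
    OVERLAPPED ov = {0};
    ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

    /* The read is issued and control returns immediately (ERROR_IO_PENDING). */
    if (!ReadFile(file, buffer, sizeof buffer, NULL, &ov) &&
        GetLastError() != ERROR_IO_PENDING) {
        return 1;
    }

    /* ... do useful computation here while the device transfers the data ... */

    DWORD bytes_read = 0;
    GetOverlappedResult(file, &ov, &bytes_read, TRUE /* wait for completion */);
    printf("read %lu bytes\n", bytes_read);

    CloseHandle(ov.hEvent);
    CloseHandle(file);
    return 0;
}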

Overall, most of this material was subsumed by my undergraduate Computer Science coursework (particularly the intro systems course).  It is therefore hard for me to know who the appropriate audience would be, so I have a neutral take on the book.  But this isn't the last review, as I've also checked out volume 2 from the library.

Thursday, June 16, 2011

Book Review: Write Great Code Volume 1 (part 3 of 4)

With Parts 1 and 2 behind us, the text races forward through character sets, presents memory usage / layout (like my Pointers on Pointers series), discusses Boolean logic (which I only learned 6? times in college), and settles into CPU architecture in chapter 9.  This chapter would roughly serve as a syllabus for my undergraduate and graduate computer architecture classes: the reader is told "Hey, processors are capable of doing this... somehow," and those classes are where one learns "this is how."

In learning about CPU architectures, one of the main points comes 30 pages into the chapter, to quote: "The techniques up to this point in this chapter can be treated as if they were transparent to the programmer."  The paragraph continues by stating that programmers can achieve further improvements by knowing the architecture; however, the huge caveat is whether the changes required are "worth it" (cf. Performance Anti-Patterns).  Nonetheless, there are two things to note (a small sketch of the first follows).  First, branches can significantly affect processor performance, and changing their prevalence and particular direction can have benefits.  Second, shorter instructions are usually faster, as they save cache space and therefore memory accesses.
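As a toy example of the branch point (mine, not the book's), consider summing only the non-negative elements of an array.  On random data the branch is unpredictable; a branchless rewrite trades the branch for a little arithmetic, which often wins when mispredictions are frequent.

#include <stddef.h>

long sum_branchy(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] >= 0)              /* unpredictable on random input */
            sum += a[i];
    }
    return sum;
}

long sum_branchless(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* all-ones when a[i] >= 0, all-zeros otherwise (assumes the usual
         * arithmetic right shift for signed ints) */
        int mask = ~(a[i] >> 31);
        sum += a[i] & mask;
    }
    return sum;
}

Whether the rewrite is "worth it" is, of course, exactly the caveat above: measure first.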

Chapter 10 presents the other half of architecture, the instruction set architecture (ISA).  Let's just say that I've been working off and on with the Intel instruction set for years and I still had difficulty with this chapter.  At one level ISAs are simple to explain, and the chapter's introduction does so.  But modern ISAs (e.g., x86) have many complexities that make understanding difficult; this work took whole pages just to print the charts for particular opcode bits (a tiny decoding example follows).  So I reached the end of the chapter knowing what the author intended to convey, but without actually having gained that knowledge.
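To give a flavor of what those charts encode, here is a tiny hand-decoding sketch (my own) of the two-byte instruction 89 C8, i.e., "mov eax, ecx": opcode 0x89 is MOV r/m32, r32, and the second byte is a ModRM byte packing three fields.

#include <stdio.h>

int main(void)
{
    static const char *reg32[8] = {
        "eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"
    };
    unsigned char modrm = 0xC8;       /* second byte of 89 C8 */

    unsigned mod = (modrm >> 6) & 3;  /* 11b: register-to-register form */
    unsigned reg = (modrm >> 3) & 7;  /* source register field */
    unsigned rm  = modrm & 7;         /* destination register field */

    if (mod == 3)
        printf("mov %s, %s\n", reg32[rm], reg32[reg]);  /* mov eax, ecx */
    return 0;
}

And that is the easy case; other mod values pull in displacement bytes, SIB bytes, prefixes, and so on, which is why the tables run for pages.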

I had hoped to conclude volume 1 in three parts, but chapter 11 (memory architecture) warrants its own post as it has wrong information.

Tuesday, June 7, 2011

Conference Time: HotPar-3

A little over a week ago, I was at Berkeley for the 3rd annual Workshop on Hot Topics in Parallelism.  The exciting thing was that researchers from a variety of areas were there, unlike my impression of most conferences, where everyone works in the same area.  Computational scientists, computer architects, programming language and compiler folk, and more were present to discuss their work on parallelism.

There are a few works that I would like to highlight (none of which are my own) from the complete list available at HotPar11.  First, "Parallel Programming with Inductive Synthesis" had some of the coolest and most unexpected ideas that I heard.  Their work provides a programmer with the ability to sketch a solution.  A simple example is factoring.  Given an initial function f(x) = x^2 - 2 * x - 3, the programmer requests that the system provide an equivalent function of the form g(x) = (x - ??) * (x - ??), where ?? is the notation for a value the system should synthesize.

Factoring is a known problem, so the system just factors, right?  Wrong!  Rather (based on my memory of the presentation), the system converts the two functions into circuits, sets them equal to each other, and finds a satisfying assignment to all variables.  Therefore, the system can create the synthesized functions in the general case, not just for problems with known algorithms.  A rough sketch of the flavor of the idea appears below.
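This brute-force sketch (mine, not the authors' tool) enumerates hole values and checks agreement on a range of test inputs; the real system instead hands the equality of the two circuits to a solver.

#include <stdio.h>

static long f(long x)                   { return x * x - 2 * x - 3; }
static long g(long x, long h1, long h2) { return (x - h1) * (x - h2); }

int main(void)
{
    for (long h1 = -16; h1 <= 16; h1++) {
        for (long h2 = h1; h2 <= 16; h2++) {
            int ok = 1;
            for (long x = -16; x <= 16 && ok; x++)
                ok = (f(x) == g(x, h1, h2));
            if (ok)
                printf("?? = %ld, ?? = %ld\n", h1, h2);  /* finds -1 and 3 */
        }
    }
    return 0;
}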

The second interesting work explored data races.  In "How to Miscompile Programs with 'Benign' Data Races", the attendees saw how even benign data races can cause errors under certain compiler optimizations.  Unfortunately, the paper's examples are more involved, so I refer you to the paper; the classic flag-spinning case below gives the flavor.
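A classic illustration in the same spirit (not necessarily one of the paper's examples) is spinning on a plain flag:

int done = 0;   /* set to 1 by another thread when the work is finished */

void wait_for_done(void)
{
    while (!done) {
        /* The race looks benign: we only ever read a stale or fresh value of
         * a single flag.  But since nothing in the loop writes 'done', the
         * compiler may hoist the load out of the loop, effectively producing
         *     if (!done) for (;;) { }
         * which never observes the other thread's write.  Declaring 'done'
         * as a proper atomic forbids that transformation. */
    }
}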

The final interesting result that I wish to highlight is that, given current technology trends, matrix multiply will be memory bound in 10 years.  In "Balance Principles for Algorithm-Architecture Co-Design", the authors derive simplified formulas for specific computations that show how their performance relates to current / future technology.  The intent is to enable architects, etc., to understand where to best focus the design "dollar" in order to achieve the best performance for that dollar.
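As a back-of-the-envelope sketch in the spirit of those formulas (my numbers and simplifications, not theirs): blocked n-by-n matrix multiply performs about 2n^3 flops while moving on the order of n^3 / sqrt(Z) words through memory, so it stays compute bound only while 2*sqrt(Z) exceeds the machine's flops-per-word ratio.

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Hypothetical machine: 1 Tflop/s peak, 100 GB/s bandwidth, 8 MB cache. */
    double F = 1e12;                     /* peak flops per second            */
    double B = 100e9 / 8.0;              /* 8-byte words per second          */
    double Z = 8.0 * 1024 * 1024 / 8.0;  /* cache capacity in words          */

    double machine_balance   = F / B;          /* flops needed per word      */
    double algorithm_balance = 2.0 * sqrt(Z);  /* flops supplied per word    */

    printf("machine needs %.0f flops/word, matmul supplies %.0f flops/word\n",
           machine_balance, algorithm_balance);
    puts(algorithm_balance >= machine_balance ? "compute bound" : "memory bound");
    return 0;
}

Since peak flop rates have been growing faster than bandwidth and cache capacity, the machine side of that inequality eventually wins, which is (roughly) how one arrives at the memory-bound prediction.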

I've selected these works primarily on the basis of the results staying with me.  I found many others professionally interesting, but the three papers here are the ones that made me sit up, scratch my beard, and go "hmm".