A discussion of how to do Computer Science well, particularly writing code and architecting program solutions.
Tuesday, June 21, 2011
Performance is a Feature
Performance is a Feature discusses recent performance work on the site Stack Overflow. I found the post interesting, albeit somewhat removed from my own work due to its web focus. As always with performance, I highlight two things. First, improve the biggest contributors first. Second, beyond a certain threshold, further improvements may not matter. For example, high-frequency trading depends on the microseconds required; rendering a web page does not.
Monday, June 20, 2011
Book Review: Write Great Code Volume 1 (part 4 of 4)
As I stated in part 3, I found some erroneous information about NUMA in chapter 11. Although the book was published in 2004, I cannot excuse the mistakes, given that the research dates to the 90s (cf. The SGI Origin: A ccNUMA Highly Scalable Server - ISCA '97). Fortunately, the errors are relatively small (only a couple of paragraphs); however, the topic is valuable to know. I've spent a fair amount of my time trying to educate programmers on good NUMA practices (cf. Topology information under Windows), so I don't appreciate books providing bad information.
NUMA, or non-uniform memory access (or architecture), describes a system that has multiple pools of main memory. For programmers, this usually leads to varying access times, as some memory is "closer" to a given processor than the rest. Fortunately, modern operating systems (Windows and Linux) will give an application "close" memory when it allocates, which works well most of the time.
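To make the allocation point concrete, here is a minimal, hedged sketch of explicitly asking Windows for memory on a particular node (the node number and the 1 MB size are arbitrary illustrations; normally the default "local" allocation is already what you want):

#include <windows.h>
#include <cstdio>
#include <cstring>

int main()
{
    ULONG highestNode = 0;
    GetNumaHighestNodeNumber(&highestNode);     // highest node number on this machine
    printf("Highest NUMA node: %lu\n", highestNode);

    // Reserve and commit 1 MB, preferentially placed on node 0.
    void* p = VirtualAllocExNuma(GetCurrentProcess(), nullptr, 1 << 20,
                                 MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                 0 /* preferred node */);
    if (p != nullptr)
    {
        memset(p, 0, 1 << 20);                  // touching the pages backs them with physical memory
        VirtualFree(p, 0, MEM_RELEASE);
    }
    return 0;
}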
In Write Great Code, the author is apparently unaware that main memory itself can have varying access times and therefore writes, "the term NUMA is used to describe blocks of memory that are electronically similar to main memory but, for one reason or another, operate significantly slower than main memory." He furthermore offers graphics card memory and flash devices as examples.
The last chapter, 12, discusses I/O. Of all the chapters, this one is most tied to "current" hardware and is therefore the most dated. Here are three topics that I would revise if there were a second edition: AGP versus PCI-E, polling versus interrupts, and synchronous versus asynchronous I/O. First, Accelerated Graphics Port (AGP) was a late-90s / early-00s connection that has since been phased out in favor of the more general PCI Express (PCI-E) interface; however, PCI-E wasn't standardized until 2004 and therefore missed being included. Second, at the highest performance levels, polling of devices can be faster than interrupts. For example, with a 10Gb/s connection there is always more data available, so polling always returns successfully; however, interrupts are still better at lower transfer rates (although I don't know the crossover point). Finally, using asynchronous I/O is at the top of my list of things to know about devices. In all, this chapter ran 75 pages and is effectively a rehash of what can now be found on Wikipedia.
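To illustrate the asynchronous point, here is a minimal, hedged sketch of overlapped I/O on Windows (the file name and buffer size are arbitrary, and error handling, including the ERROR_IO_PENDING check, is omitted):

#include <windows.h>
#include <cstdio>

int main()
{
    HANDLE file = CreateFileA("data.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);
    if (file == INVALID_HANDLE_VALUE)
        return 1;

    char buffer[4096];
    OVERLAPPED ov = {};                                      // read from offset 0
    ReadFile(file, buffer, sizeof(buffer), nullptr, &ov);    // returns immediately

    // ... the program can do other useful work while the device fills the buffer ...

    DWORD bytesRead = 0;
    GetOverlappedResult(file, &ov, &bytesRead, TRUE);        // TRUE = block until the read completes
    printf("read %lu bytes\n", bytesRead);
    CloseHandle(file);
    return 0;
}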
Overall, most of this material was covered by my undergraduate Computer Science coursework (particularly the intro systems course), so it is hard for me to know who the appropriate audience would be. I therefore have a neutral take on it. But this isn't the last review, as volume 2 has also been checked out from the library.
Thursday, June 16, 2011
Book Review: Write Great Code Volume 1 (part 3 of 4)
With Parts 1 and 2 behind us, the text races forward through character sets, presents memory usage / layout (like my Pointers on Pointers series), discusses Boolean logic (which I learned only 6? times in college), and settles into CPU architecture, which is chapter 9. This chapter would roughly serve as a syllabus for my undergraduate and graduate computer architecture classes, whereby the reader is told "Hey, processors are capable of doing this... somehow" and a full architecture class is where one learns "this is how."
In learning about CPU architectures, one of the chapter's main points arrives 30 pages in, to quote: "The techniques up to this point in this chapter can be treated as if they were transparent to the programmer." The paragraph continues by stating that programmers can achieve further improvements by knowing the architecture; however, the huge caveat is whether the changes required are "worth it" (cf. Performance Anti-Patterns). Nonetheless, there are two things to note. First, branches can significantly affect processor performance, so changing their prevalence and particular direction can have benefits. Second, shorter instructions are usually faster, as they save cache space and therefore memory accesses.
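Here is a minimal, hedged sketch of the branch point (my example, not the book's): the same loop runs noticeably faster over sorted data, because the comparison becomes predictable.

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

static int64_t SumAbove(const std::vector<int>& data, int threshold)
{
    int64_t sum = 0;
    for (int x : data)
        if (x >= threshold)          // unpredictable when the data is in random order
            sum += x;
    return sum;
}

int main()
{
    std::vector<int> data(1 << 20);
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(0, 255);
    for (int& x : data)
        x = dist(gen);

    int64_t a = SumAbove(data, 128);             // random order: many mispredictions
    std::sort(data.begin(), data.end());
    int64_t b = SumAbove(data, 128);             // sorted: the branch predictor does well
    printf("%lld %lld\n", (long long)a, (long long)b);
    return 0;
}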
Chapter 10 presents the other half of architecture, the instruction set architecture (ISA). Let's just say that I've been working off and on with the Intel instruction set for years, and I still had difficulty with this chapter. At one level ISAs are simple to explain, and the chapter's introduction does so. But modern ISAs (e.g., x86) have many complexities that make understanding difficult, as this work took whole pages to print the charts for particular opcode bits. So I reached the end of the chapter knowing what the author intended to convey, but not actually having absorbed it.
I had hoped to conclude volume 1 in three parts, but chapter 11 (memory architecture) warrants its own post as it has wrong information.
Tuesday, June 7, 2011
Conference Time: HotPar-3
A little over a week ago, I was at Berkeley for the 3rd annual Workshop on Hot Topics in Parallelism. The exciting thing was that researchers from a variety of areas were there, unlike my impression of most conferences, where everyone works in the same area. Computational scientists, computer architects, programming language and compiler folk, and more were present to discuss their work on parallelism.
There are a few works that I would like to highlight (none of which are my own) from the complete list available at HotPar11. First, "Parallel Programming with Inductive Synthesis" had some of the coolest and most unexpected ideas that I heard. Their work provides a programmer with the ability to sketch a solution. A simple example is factoring: given an initial function f(x) = x^2 - 2 * x - 3, the programmer requests that the system provide an equivalent function of the form g(x) = (x - ??) * (x - ??), where ?? is the notation for a value the system should synthesize.
Factoring is a known problem, so the system just factors, right? Wrong! Rather (based on my memory of the presentation), the system converts the two functions into circuits, then sets them equal to each other and finds a satisfying assignment to all variables. Therefore, the system can create the synthesized functions in the general case.
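To give a flavor of the idea, here is a toy, hedged restatement of the hole-filling (my sketch, not the authors' system, which encodes the problem as constraints rather than brute force): search for hole values that make g agree with f on every tested input.

#include <cstdio>

int f(int x) { return x * x - 2 * x - 3; }
int g(int x, int h1, int h2) { return (x - h1) * (x - h2); }

int main()
{
    // Try every pair of hole values in a small range.
    for (int h1 = -10; h1 <= 10; ++h1)
        for (int h2 = h1; h2 <= 10; ++h2)
        {
            bool matches = true;
            for (int x = -100; x <= 100 && matches; ++x)
                matches = (f(x) == g(x, h1, h2));
            if (matches)
                printf("holes: %d and %d\n", h1, h2);   // expect -1 and 3
        }
    return 0;
}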
The second interesting work explored data races. In "How to Miscompile Programs with "Benign" Data Races", the attendees saw how even benign data races could cause errors given certain compiler optimizations. Unfortunately, the examples are more involved and therefore I refer you to the paper.
The final interesting result that I wish to highlight is that, given current technology trends, matrix multiply will be memory bound in 10 years. In "Balance Principles for Algorithm-Architecture Co-Design", the authors derive simplified formulas for specific functions that show how their performance relates to current / future technology. The intent is to enable architects, etc., to understand where best to focus the design "dollar" in order to achieve the best performance for that dollar.
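As a rough, hedged restatement of that style of argument (my back-of-the-envelope version, not the paper's actual formulas): model the execution time as

Time ~ max(W / P, Q / B)

where W is the flop count, P the peak flop rate, Q the number of words moved to and from memory, and B the memory bandwidth. The computation stays compute bound only while W / Q >= P / B. For blocked n x n matrix multiply, W = 2 * n^3 and Q is roughly n^3 / sqrt(Z) for a cache of Z words, so staying compute bound requires roughly 2 * sqrt(Z) >= P / B. Peak flop rates have been growing faster than memory bandwidth, so the inequality eventually fails and the multiply becomes memory bound.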
I've selected these works primarily on the basis of the results staying with me. I found many others professionally interesting, but the three papers here are the ones that made me sit up, scratch my beard, and go "hmm".
Thursday, May 19, 2011
Book Review: Write Great Code Volume 1 (part 2 of 4)
+/- 1.F x 2^(Exp)
That's the modern floating point representation (with a few caveats). It is stored bitwise as SignExpF (sign, then exponent, then fraction). As this representation uses a variable exponent, the point floats to different positions. By doing so, the computer can store a much greater range of numbers than integer representations can, but with a small loss of precision. I first learned of "significant digits" in chemistry, and floating point limits the user to 5 - 10 significant decimal digits (which corresponds to roughly 15 - 30 binary digits). For this post and subsequent posts in this series, I'll be writing a combination of what I knew prior to reading and information from the book's text.
There are four official formats for floating point numbers, using 16, 32 (a C float), 64 (a C double), and 128 bits respectively. The larger representations primarily devote their additional bits to the fraction, which provides increased precision. Intel also has an 80-bit representation, which uses 64 bits for the fraction, thereby enabling existing integer arithmetic hardware to operate on the fractions when the exponents are the same. Furthermore, ordering the bits as sign, exponent, fraction also enables integer comparisons between floats.
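A minimal, hedged sketch of that layout for a 32-bit float (my example, not from the book): the three fields can be pulled apart with ordinary bit operations.

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    float f = -6.25f;                             // -1.5625 x 2^2
    uint32_t bits;
    memcpy(&bits, &f, sizeof(bits));              // reinterpret the float's storage

    uint32_t sign = bits >> 31;                   // 1 bit
    uint32_t exponent = (bits >> 23) & 0xFF;      // 8 bits, biased by 127
    uint32_t fraction = bits & 0x7FFFFF;          // 23 bits, with an implicit leading 1

    printf("sign=%u exponent=%u (unbiased %d) fraction=0x%06X\n",
           sign, exponent, (int)exponent - 127, fraction);
    return 0;
}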
Given the precision limitations (even with 128-bit floats), there are some gotchas with arithmetic. First, equality is hard to define. Equality tests need to respect the level of error present in the floating point values; in practice, two numbers are "equal" when they are within this error, as follows.
bool Equal(float a, float b)
{
    const float error = 0.00001f;    // tolerance appropriate to the computation at hand
    bool ret = (a == b);             // Don't do this
    ret = (fabsf(a - b) < error);    // Test like this (fabsf is from <cmath>)
    return ret;
}
Another gotcha is preserving precision. With addition and subtraction, the operands must be converted to have the same exponent, which can result in a loss of precision. Therefore, the text recommends doing multiplication and division first. This seems reasonable, and wasn't something I had heard before.
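A small, hedged demonstration of why this matters (my example, not the book's): adding a small value to a much larger one can lose the small value entirely at 32-bit precision.

#include <cstdio>

int main()
{
    float big = 100000000.0f;        // 1e8 needs a large exponent
    float small = 1.0f;
    float sum = big + small;         // small is shifted out of the 24-bit fraction

    printf("%s\n", (sum == big) ? "the 1.0f was lost" : "the 1.0f survived");
    return 0;
}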
Friday, May 13, 2011
Book Review: Write Great Code Volume 1 (part 1 of 4)
What is great code? What characteristics would you describe it as having? These questions come to mind as I am reading the first of a four-volume series on writing great code. The first work is subtitled "Understanding the Machine." Before I delve into what I've "learned" by reading this volume, what should one hope to gain? What understanding of modern machines is required for writing code, especially great code?
First, I'd emphasize that great code is more about maintainability and understanding than efficiency. Correctness over performance. Therefore, the text is more about how understanding the machine can clean up code, simplify designs, and ensure correctness.
Second, in the context of the previous post, most programs do not need to "understand" the machine. Most applications will not be constrained by execution time, and therefore programmer effort should be directed elsewhere. Yet programmers reach their own insights into the application / computer interaction and modify code accordingly, and usually (myself included) these insights are irrelevant to the application.
Third, there are specific aspects of modern computer architecture / machine design that are still worth knowing. For example, I find that most programmers have limited understanding of branch prediction, cache usage, NUMA hierarchies, and superscalar processors. (Based on my reading so far, I would also add floating point to this list.)
What else should a programmer know? What should I be looking for while I read this book?
Monday, April 18, 2011
Is vs Ought in Computer Science
In philosophy, we debate what something is versus what it ought to be, and the two usages are often confounded. In thinking about functional programming, I recalled this debate.
A functional program declares what ought to occur. The computer has large leeway in how it will successfully execute this program. In contrast, an imperative program declares what will occur. The computer has little leeway in executing this program.
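A minimal, hedged illustration of the distinction (my example): the first version states what the result ought to be and leaves the steps to the library, which since C++17 is even permitted to reorder them; the second dictates exactly what will happen.

#include <cstdio>
#include <numeric>
#include <vector>

int main()
{
    std::vector<int> v = {1, 2, 3, 4, 5};

    // "Ought": declare the result as a reduction; the order of operations is unspecified.
    int declared = std::reduce(v.begin(), v.end(), 0);

    // "Is": dictate each step, each load, and each add.
    int dictated = 0;
    for (size_t i = 0; i < v.size(); ++i)
        dictated += v[i];

    printf("%d %d\n", declared, dictated);
    return 0;
}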
In programming imperatively, we can more clearly reason about the specific execution time / space required, as we can know the fullest details about the program. Each step and instruction is explicit, as are the specific allocation points, so memory usage can be comprehended. Yet this model effectively went away over 20 years ago, with superscalar processors and overall program complexity.
A common fallacy is for programmers to try to reason about the details of execution time. Questions get pondered like: should an increment be "++" or "+= 1"? Would it make you feel better to know that I can construct cases where one executes faster than the other? Most programmers who want to reason about their program at this level are not sufficiently equipped to do so.
Lest we discard this benefit of imperative programming entirely, there are other decisions that are still within reason. For example, the specific layout of the data in a structure can have an impact on execution time, as sketched below. And while this is doable by hand, the response I have received from other programmers is "can't the compiler do this for me?"
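A minimal, hedged sketch of the layout point (my example): ordering fields largest-first removes interior padding, so the structure shrinks and more instances fit in each cache line.

#include <cstdio>

// Careless ordering: padding is inserted to keep the double 8-byte aligned.
struct Padded
{
    char flag;      // 1 byte + 7 bytes of padding
    double value;   // 8 bytes
    char tag;       // 1 byte + 7 bytes of tail padding
};

// Same fields, largest first: only tail padding remains.
struct Packed
{
    double value;   // 8 bytes
    char flag;      // 1 byte
    char tag;       // 1 byte + 6 bytes of tail padding
};

int main()
{
    printf("%zu %zu\n", sizeof(Padded), sizeof(Packed));   // typically 24 and 16 on a 64-bit target
    return 0;
}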
Fine. Let's program only with what the application ought to be doing. The compiler / run-time / operating system all serve to take these statements of ought and make them be. Still, someone must reason about the is of programs, unless we have a processor that directly executes the functional statements.
I shall continue to be skeptical about programming functionally; however, I advocate further support for stating the "ought" of program statements to aid the verification, etc., that is so highly lauded. Imperative programs need to contain more details about what the program ought to be doing, details which can be used in interesting ways beyond mere verification of correct execution. This direction is part of my research, which I look forward to discussing at HotPar 2011 - "Parallel Pattern Detection for Architectural Improvement". I'll wait to discuss the specifics until the proceedings are out.