A discussion of how to do Computer Science well, particularly writing code and architecting program solutions.
Tuesday, July 10, 2012
Meta Post
It took 677 days, but I've now received 1000 views. Just over 50% of views are from the United States and 60% of views are from machines running Windows. Now, I'd say that I'm going to get back to research, but I'm taking today off to run errands with my wife.
Thursday, May 31, 2012
Know your "N"
I was once working on some performance analysis and went to the developers asking why their code was consuming a disproportionate amount of time. At low load, the code was, say, 5% of non-idle CPU usage, but at high load it was 50%. The developers noted that the code contained an algorithm that was O(n^2), but they had only ever run n = 1, 2, ..., 10. My high-load test was n = 1000, which exposed the scaling issues in their code. The developers had done everything right, except knowing the scope of the problem (and, to be fair, their code was originally written for desktops and I was testing servers).
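To illustrate (a minimal sketch; the nested-loop routine, the timing harness, and the input sizes are hypothetical stand-ins, not the actual code from that analysis), an O(n^2) routine that looks instantaneous at n = 10 becomes the dominant cost by n = 1000:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical O(n^2) routine: count duplicate values with nested loops. */
static int count_duplicates(const int *a, size_t n) {
    int dups = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (a[i] == a[j])
                dups++;
    return dups;
}

int main(void) {
    size_t sizes[] = { 10, 100, 1000, 10000 };
    for (size_t s = 0; s < sizeof(sizes) / sizeof(sizes[0]); s++) {
        size_t n = sizes[s];
        int *a = malloc(n * sizeof(*a));
        for (size_t i = 0; i < n; i++)
            a[i] = rand() % 1000;

        clock_t start = clock();
        int dups = count_duplicates(a, n);
        double ms = 1000.0 * (clock() - start) / CLOCKS_PER_SEC;

        printf("n = %5zu: %d duplicates, %.3f ms\n", n, dups, ms);
        free(a);
    }
    return 0;
}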
Another time a developer came and discussed memory limitations due to using 64-bit counters. I asked how large the input was and learned it was under a billion elements. Well, a billion fits into a 32-bit counter, and the memory consumed by the counters is now cut in half. (Maybe I need more stories, as this one also appears in Bit Packing.)
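The back-of-the-envelope check (the array size here is a hypothetical round number, not the developer's actual data): a uint32_t holds values up to about 4.29 billion, so counts that stay under a billion fit with room to spare, at half the memory.

#include <stdint.h>
#include <stdio.h>

#define NUM_COUNTERS 1000000000ULL   /* hypothetical: one counter per element */

int main(void) {
    /* UINT32_MAX is 4,294,967,295, so a count that never exceeds ~10^9 fits easily. */
    printf("64-bit counters: %llu bytes\n",
           (unsigned long long)(NUM_COUNTERS * sizeof(uint64_t)));
    printf("32-bit counters: %llu bytes\n",
           (unsigned long long)(NUM_COUNTERS * sizeof(uint32_t)));
    return 0;
}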
The first step in any good performance analysis is selecting an algorithm based on the problem, including its size. Bubble sort an array of 10 elements, but not 10,000. In fact, ask whether the problem really requires sorting at all.
I will continue to strongly urge developers to measure the performance of their code before trying to tune it. But part of that measurement is knowing the "size" of the inputs.
Thursday, May 10, 2012
The Minimum of Computer Science
The NY Times ran an article at the start of last month on what CS education is required for non-majors, Computer Science for Non-Majors. Is the knowledge of how to write a (simple) program so vital to modern work that it should rank with multiplication tables, English grammar, etc? And if so, how much knowledge and in what language(s) should this be taught?
Should someone learn the basics of JavaScript, so they can add to webpages? Perhaps a little Python, to do simple processing of data? But probably not C, as that language expresses how the computer is to do something rather than what should be done.
Edit: A strong response to this general question here.
Tuesday, April 3, 2012
Rules of Computer Science
As I've spent these many years learning and practicing computer science, I've come to have a couple of "rules" (i.e. guidelines) of programming.
1) If a computer can execute something, you can write it in C. (aka Turing-completeness)
1a) Still, it is probably faster to write the program in a high-level language.
2) The computer does what it is told to do. There is no magic.
2a) If a behavior is not what you expect, then you don't know what you told the computer to do.
I am sure there are other such "rules", so please share if you have other maxims from your time and experience.
Friday, March 2, 2012
Usage of NULL
I read an argument recently about whether a test for NULL should be:
if (ptr)
- or -
if (ptr != NULL)
Did you know that the assembly for these two sequences is the same? So the first form has not made the program run any faster. It does save 8 characters, which mattered long ago, but now we have plenty of disk space. So always write the test for NULL by actually testing against NULL, or I'll come take off style points the next time I'm grading. (If you are writing one-use code that no one will ever see, then go ahead.)
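You can check this yourself (a sketch; the exact instructions depend on compiler, target, and optimization level, but gcc and clang at -O2 on x86-64 emit identical code for both functions). Compile with -O2 -S and compare the two bodies:

#include <stddef.h>

void use(int *p);

void f(int *p) {
    if (p)              /* typically: test rdi, rdi ; je <skip> */
        use(p);
}

void g(int *p) {
    if (p != NULL)      /* identical: test rdi, rdi ; je <skip> */
        use(p);
}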
Separately, the operating system provides explicit support for NULL. It sets up every process's virtual memory so that virtual addresses 0x0 - 0x3ff (and possibly more) have no mapping, so any dereference of NULL will never be valid.
Tuesday, February 14, 2012
Bit Packing: A World Gone Mad
Let's start with a mythical program. It has some classes (structs, objects, etc) that are instantiated. Each field has a meaning given to it by the code. Many fields can hold far more data than is required of them. In our mythical program, these unused bits are of no concern, as memory is plentiful. But then the mythical program is run on a real machine and bad things happen. Maybe it runs slowly, or crashes when malloc() returns NULL. Now what?!
In this post, I'm going to discuss some tricks and ideas for reducing the memory usage of a program. Before trying these steps, you should know the program and its workload (using profilers, etc.), and the program should be verified to have no memory leaks.
First example. A program was running out of memory. This program went through a memory trace and counted where each memory request went. A count existed for every N locations of memory, and each count was an unsigned 64-bit integer (uint64_t). With 1 billion memory operations, the worst case is a single count of 1 billion (roughly 2^30); with a fairly uniform distribution, each of a billion counters would hold 10, perhaps 100. Switching to 32-bit integers reduced the program's footprint by 4GB, allowing work to continue. Other proposals included using a two-level counter scheme, where overflow of a small counter (16-bit) results in using a full-sized (64-bit) counter, or tracking only a subset of the counters via random sampling.
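A sketch of that two-level idea (the slot count and types are hypothetical, and a real implementation would keep the overflow counters in a sparse structure, since few slots ever overflow):

#include <stdint.h>

#define NUM_SLOTS (1u << 20)             /* hypothetical number of tracked locations */

static uint16_t small_count[NUM_SLOTS];  /* 2 bytes per location, the common case */
static uint64_t overflow[NUM_SLOTS];     /* sketch only; in practice a sparse map */

/* Count one access, spilling into the wide counter every 2^16 increments. */
static void count_access(uint32_t slot) {
    if (++small_count[slot] == 0)        /* wrapped past 65535 */
        overflow[slot]++;
}

static uint64_t read_count(uint32_t slot) {
    return ((uint64_t)overflow[slot] << 16) | small_count[slot];
}

int main(void) {
    for (int i = 0; i < 70000; i++)
        count_access(7);
    return (int)(read_count(7) != 70000);   /* exits 0 if the count is exact */
}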
The lesson learned is that having some spare bits is acceptable, but counting to, say, 100 in a location that can hold up to 1.84467441 × 10^19 might be excessive.
Second example. I've probably written about this before, and I'll likely do so again. Locks are often wastes of space. Consider the following:
void atomic_add(int* loc, int val) {
    lock();
    *loc += val;
    unlock();
}
Looks fine, right? Consider that the lock is 4 bytes of space, of which 1 bit indicates whether it is acquired. Two compactions are possible. First, the lock could be integrated into the value itself; this reduces the range of the value, as one bit is now dedicated to holding the lock state. Or second, many architectures (like x86) support the above operation as a single instruction, so using that instruction removes the need for a lock bit / word entirely.
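For example (a sketch using the GCC/Clang __atomic builtins, not whatever locking library the original code used), the entire locked section collapses to one atomic read-modify-write, which on x86 becomes a single lock-prefixed add:

/* No lock word at all: the hardware makes the increment itself atomic. */
void atomic_add(int* loc, int val) {
    __atomic_fetch_add(loc, val, __ATOMIC_SEQ_CST);
}

(C11's <stdatomic.h> offers the same operation as atomic_fetch_add on _Atomic variables.)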
Why don't we always replace locks with atomic instructions? Two reasons. First, if the locked section is any more complex, the atomic sequence becomes vastly more complex, as only the simplest operations (add, subtract, or, and, etc.) are supported with atomicity. Second, determining the appropriate transformation in the compiler is exceedingly complex and may not be solvable in the general case. Since the compiler can't do this in general, it falls to the programmer to introduce the change.
Third example. Storing lots of pointers to long lists of numbers (these could be strings, or perhaps lists of edges in a graph). On a 64-bit system, each pointer is a 64-bit number, or is it? Current x86-64 architectures are limited to 48 bits of virtual address, to reduce the overhead of tracking the virtual memory space. So every pointer gives us 48 bits of information and 16 bits of zeros. (Many pointers are also aligned, which zeroes some of the low-order bits.) And whenever the programmer knows a value at compile time, it is a constant!
A common operation is to ask for the length of a list. Ideally, this value is stored, but where? If the leading pointer has 16 available bits, then the length could be made part of the pointer. But what if the length could be greater than 65536 (2^16)? A hybrid floating-point design could be used:
[I/F:1][LENGTH:15][ADDR:48]
The first bit indicates whether an integer or floating-point length is stored. The next 15 bits store the length. If it is a float, those bits split into an exponent and a fractional component: 6 exponent bits cover the full range of a 64-bit number (note that these floats only need to represent lengths greater than 2^15, since anything smaller uses the precise integer representation), or 4 bits would cover 2^16 through 2^31. The remainder of the 15 bits hold the fraction.
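A sketch of the simpler integer case (the field widths follow the layout above, but the helper names are hypothetical, and real code would also account for future growth of the virtual address space): the length rides in the upper bits and is masked off before the pointer is used.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define ADDR_BITS 48
#define ADDR_MASK ((1ULL << ADDR_BITS) - 1)

/* Pack a length (< 2^15, leaving the top I/F bit 0 for "integer") into the
   unused upper bits of an x86-64 user-space pointer. */
static uint64_t pack(const void *ptr, uint32_t len) {
    uint64_t addr = (uint64_t)(uintptr_t)ptr;
    assert(len < (1u << 15));              /* integer form only */
    assert((addr & ~ADDR_MASK) == 0);      /* upper 16 bits must already be zero */
    return ((uint64_t)len << ADDR_BITS) | addr;
}

static void *unpack_ptr(uint64_t packed) {
    return (void *)(uintptr_t)(packed & ADDR_MASK);
}

static uint32_t unpack_len(uint64_t packed) {
    return (uint32_t)(packed >> ADDR_BITS) & 0x7FFF;   /* drop the I/F flag bit */
}

int main(void) {
    int data[5] = { 1, 2, 3, 4, 5 };
    uint64_t packed = pack(data, 5);
    printf("len=%u first=%d\n", unpack_len(packed), *(int *)unpack_ptr(packed));
    return 0;
}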
Most programs will not need bit packing. Every pack and unpack costs extra instructions. But sometimes you will just need to save the space, even if it means making the design more complex.
Friday, December 23, 2011
Measuring Performance
Starting from the primer Understanding processing time, let's delve into some methods for measuring the performance of software. Along the way, you might gain a better understanding of how the operating system (OS) tracks time. Let's take the different methods in order, from least overhead to greatest.
First, the wall clock. I stare at my watch and see the second hand move. With each tick of the hand, another second has passed. Eventually the program completes, and I can record how many ticks have elapsed. (Yes, you can use time or another program to do this.) I am constrained in knowing the time by my watch, and the computer is no different; it has an internal clock that ticks only so fast (see my comment on stackoverflow). We now know how long the program takes to run, a measure of performance, but not what it was doing (perhaps it sleeps for 20 minutes and then prints "complete!").
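In code, the wall-clock approach is just two timestamps around the work (a minimal sketch using POSIX clock_gettime; the work() function is a hypothetical stand-in for the program being measured):

#include <stdio.h>
#include <time.h>

/* Stand-in for the real program: burn some CPU time. */
static void work(void) {
    volatile double x = 0;
    for (long i = 0; i < 50000000; i++)
        x += i * 0.5;
}

int main(void) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    work();
    clock_gettime(CLOCK_MONOTONIC, &end);

    double secs = (end.tv_sec - start.tv_sec)
                + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("elapsed: %.6f s\n", secs);   /* how long, but not what it was doing */
    return 0;
}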
The operating system gives some metric of every program's CPU usage, viewable with top, taskmgr, etc. On every scheduling interval, the operating system receives an interrupt and must decide whether to multitask and switch a different thread into execution. The OS also records that the past quantum of time was spent doing whatever was just interrupted. So we view the percentages of one second: this chunk of 16ms (the default Windows quantum) was spent executing Word, that chunk was spent by the kernel, and the rest was idle. There are still inaccuracies, but the overhead is low. As a note, the OS may actually record the cycles elapsed on every task switch and thus compute a higher-accuracy percentage. (Of course, power management can play havoc with the relationship between elapsed cycles and elapsed wall-clock time.)
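A process can read its own share of that accounting back from the OS. A sketch using the POSIX getrusage call (top and taskmgr present the same kind of data for every process on the system):

#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    /* ... run the interesting part of the program here ... */

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);

    /* CPU time charged to this process, split into user and kernel time. */
    printf("user:   %ld.%06ld s\n", (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
    printf("system: %ld.%06ld s\n", (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    return 0;
}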
So we now know that our application was using the CPU during its entire execution, but what was it doing?! Often the next step is to use a profiler. A profiler works just like the OS's CPU accounting in the previous paragraph, except it does so at the granularity of a single application (although some profilers, like xperf, can cover the whole system). Now on each interrupt (which may also be more frequent, like every 1ms, with corresponding overhead), the profiler records the instruction pointer (IP) of the running process and again ascribes the quantum of time to the function containing that IP. Optionally, the profiler might record a call stack on the interrupt, so calling functions can be identified, which is particularly useful when shared routines are invoked (like STL container classes). Some profilers will break a function's usage down by IP, allowing the analyst to understand which specific assembly instructions are being executed frequently.
But the profiler is still just relying on probability and sampling to give an estimate of the CPU usage of the program. If we accept significantly more overhead, we can know precisely what the program is executing. I've used systems where the compiler can put callbacks around every function call. Each callback logs the function being invoked and the current time (in cycles). Now one has a complete function trace of the running system, along with the cycles required for each segment of code. Only one problem: the overhead is at least 100% (so performance is halved, or worse).
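GCC and Clang expose this style of instrumentation directly (a sketch; a real tool would log to an in-memory buffer rather than stderr and translate the addresses back to function names). Compile with -finstrument-functions and the compiler inserts calls to these two hooks on every function entry and exit:

#include <stdio.h>

/* The hooks themselves must not be instrumented, or they would recurse. */
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *fn, void *call_site) {
    fprintf(stderr, "enter %p (called from %p)\n", fn, call_site);
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *fn, void *call_site) {
    fprintf(stderr, "exit  %p (called from %p)\n", fn, call_site);
}

static int square(int x) { return x * x; }

int main(void) {
    return square(3) - 9;    /* the trace shows main and square enter/exit */
}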
Tangential to this entire discussion is the fact that modern processors also provide hardware performance counters, so that the analysis can see cache misses, branch predictions and other metrics of how efficiently the code is executing on the processor. And I have not discussed IO, although my general rule is asynchronous and nonexistent.
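On Linux, these counters are exposed to programs through the perf_event_open system call; tools such as perf and PAPI are wrappers around it. A sketch counting cache misses for a stretch of code (based on the call's standard usage; which events exist varies by processor, and a real tool adds error handling and multiplexing):

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>

int main(void) {
    /* Ask the kernel for a hardware cache-miss counter scoped to this process. */
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... the code being measured goes here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    if (read(fd, &misses, sizeof(misses)) == sizeof(misses))
        printf("cache misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}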
Again, always measure performance before cutting (i.e., optimizing).