Wednesday, April 22, 2020
Lightweight Preemptable Functions (LPF)
Sol Boucher presented his thesis proposal today via Zoom. It was well attended, and I think many of us are already looking forward to his defense.
For a program to know whether it can afford to invoke a function during time-critical execution, it needs some knowledge of that function's cost.
In the current space of multitasking, three existing approaches are reviewed first: futures, threads, and processes. Futures rely on the runtime to provide asynchronous execution, but preemption requires yield points; not all execution can support these points, nor is it always clear what a given future will do. With threads, execution is fairly independent, but minimal information is provided toward scheduling, and cancellation is hard. Processes provide good support for cancellation, except that on fork(), only the calling thread is carried over into the new process; the other threads are effectively canceled, which can result in inconsistent state.
The thesis proposal is to introduce a novel abstraction for preemption at the granularity of a synchronous function call, one that includes a timeout.
launch(function, timeout, argument) returns a continuation. If the timeout elapses, the continuation is returned, and it is then the caller's choice whether to resume it or cancel it. The LPF executes within the same thread context as the caller, thereby reducing overhead; however, to resume the LPF, it needs a new stack. To support the timeout, the runtime relies on a timer signal that can fire as often as every ~5 microseconds. Launch and resume have overhead comparable to this interval, significantly better than fork or pthread_create. Cancel, however, is extremely expensive.
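To make the control flow concrete, here is a minimal sketch of how the described operations might look from C. The names (lpf_continuation_t, lpf_launch, lpf_resume, lpf_cancel) are hypothetical, chosen only to mirror the launch/resume/cancel operations from the talk, not the actual library's interface.

#include <stddef.h>

/* Hypothetical handle representing a paused lightweight preemptable function. */
typedef struct lpf_continuation lpf_continuation_t;

/* Run fn(arg) synchronously on the caller's thread, preempting it after
 * timeout_us microseconds. Returns NULL if fn ran to completion, or a
 * continuation representing the paused call otherwise. */
lpf_continuation_t *lpf_launch(void (*fn)(void *), unsigned timeout_us, void *arg);

/* Grant the paused call more time (this needs a fresh stack under the hood),
 * or abandon it entirely (expensive, per the talk). */
lpf_continuation_t *lpf_resume(lpf_continuation_t *cont, unsigned timeout_us);
void lpf_cancel(lpf_continuation_t *cont);

void handle_request(void *req);   /* application work with an unknown cost */

void serve_one(void *req)
{
    /* Give the handler a 50us budget before control returns to us. */
    lpf_continuation_t *cont = lpf_launch(handle_request, 50, req);
    if (cont != NULL) {
        /* Timed out: the caller chooses between resuming and cancelling. */
        cont = lpf_resume(cont, 50);
        if (cont != NULL)
            lpf_cancel(cont);
    }
}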
LPFs also have an issue with calling functions that are non-reentrant, similar to the rules governing signal handlers. To address this, the runtime provides selective relinking to capture what the LPF calls via the global offset table (GOT): some GOT entries point into dynamic libraries, while other entries initially point at the dynamic linker itself. The runtime support also needs to intercept thread-local variables. This interception imposes about 10ns of overhead, only slightly above the cost of the function call itself.
Microservice approaches have significant latency, often tending toward hundreds of microseconds, primarily due to the requirement of creating a sufficient container, often via processes or virtualization. If the microservice were instead written safely and used LPFs, the latency could be reduced toward the hardware bound, as measured by communicating between VMs or committing transactions.
Cancellation cleanup is difficult in languages, such as C, that require explicit cleanup. In other languages, adding a new exception path for timeout and cancellation could invoke the necessary destructors; nonetheless, this can be expensive (perhaps single milliseconds).
Other possible future work:
- Another cancellation cost is unloading the mirrored library; could the runtime instead track the changes made and then determine whether to roll back or discard them?
- Is it possible to reduce the overhead of the preemption signals or improve their granularity?
Monday, April 8, 2019
Presentation: The Quest for Energy Proportionality in Mobile & Embedded Systems
This is a summary of the presentation on "The Quest for Energy Proportionality in Mobile & Embedded Systems" by Lin Zhong.
We want mobile and other systems to be energy efficient, and particularly to use energy in proportion to the intensity of the required operation. However, processor architectures have only limited regions where these are in proportion, given certain physical and engineering constraints on the design. ARM's big.LITTLE gives a greater range in efficiency by placing two different core types (sharing an ISA) onto the same chip; however, it is constrained by the need to keep the cores cache coherent.
The recent TI SoC boards also contained another ARM core, running the Thumb ISA for energy efficiency. This additional core was hidden behind a TI driver (originally to support MP3 playback), but was recently exposed, allowing further designs to utilize it as part of the computation. However, this core is not cache coherent with the main core on the board.
So Linux was extended to be deployed onto both cores (compiled for the two different ISAs), while maintaining the data structures, etc., in the common, shared memory space. An application can then run on and migrate between the cores, based on application hints as to the required intensity of operations. On migration, one of the core domains is put to sleep and releases the memory to the other core. This design avoids synchronization between the two domains, which simplifies the code, and the concurrency demands in the mobile space are low enough to permit it. This was also a rare demonstration of software-managed cache coherence.
Overall, DVFS provides about a 4x change in power, big.LITTLE another 5x, and the hidden Thumb core an additional 10x reduction for low-intensity tasks, such as mobile sensing. Together, this covers a significant part of the energy / computation space.
However, this does not cover the entire space of computation. At the lowest end, there is still an energy-intensive ADC component (analog-to-digital conversion), equivalent to tens of thousands of gates. Many computations, however, could be pushed into the analog domain, which saves power both because a simpler result is computed for digital consumption and because the computation can be performed on lower-quality input (tolerating noise), which reduces the energy demand.
Wednesday, October 17, 2018
Thesis Defense: Practical Concurrency Testing
Ben Blum defended his dissertation work today on Practical Concurrency Testing. What follows are the notes from that defense.
The goal is to prove that a program is correct across arbitrary concurrent interleavings. There are three common testing approaches: unit testing of the most likely interleavings, stress testing that is not systematic, and verification that requires separate tools and techniques to describe the program.
Landslide is a proposed technique based on stateless model checking (Godefroid '97), which tests a different execution interleaving on every iteration. However, naive interleaving provides O(2^n) states to test. [Flanagan '05] identified equivalent interleavings and [Musuvathi '08] proposed heuristic orderings to identify possible bugs faster. These approaches often require annotations, so adoption requires automated instrumentation. The field is addressing further concurrency problems such as weak memory models, but hardware transactional memory is still open.
This instrumentation requires preemption points. Finer-grained points find more bugs, but increase the number of states to test. Bugs and failures follow certain patterns, such as use-after-free, deadlocks, assertion failures, and invalid memory accesses. Dynamic data-race analysis can help inform the necessary preemption points.
As a reminder, a data race occurs when two threads access the same memory location and all of the following hold (a minimal example follows the list):
- at least one of the accesses is a write;
- the threads are not holding the same mutex;
- there is no other ordering requirement between them (a condition variable, etc.).
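As a concrete illustration (my own minimal example, not from the defense), the following C program has two threads incrementing a shared counter while holding no common mutex; the unsynchronized writes satisfy all three conditions above, and lost updates usually show up in the final count.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;            /* shared location */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                  /* read-modify-write with no mutex: data race */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* Expected 2000000, but lost updates typically make it smaller. */
    printf("counter = %ld\n", counter);
    return 0;
}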
Quicksand applies this analysis to select different, smaller problem spaces using subsets of the possible preemption points. Each subset also represents a smaller part of the larger possible problem space. If all of these subsets are verified, that constitutes a full verification of the program. Prior work explored using APIs such as mutex_lock/unlock, or using every shared-variable access, as the preemption points.
This tester is deployed in OS courses at CMU, PSU, and U Chicago. Manual annotation is not viable for students, especially the struggling students for whom the traces would be most valuable. That said, students regularly deploy ad-hoc synchronization, such as while (!ready) yield();, which requires heuristics, as naive model checking must test every possible count of yields and their interleavings.
When used by students, about 75% of the tested kernels / libraries had bugs that the testing framework could identify. For the tested submissions (7 semesters) of students at CMU, there is an improvement in grades, but it is not statistically significant when correcting for the opt-in bias. Most students are then able to fix the bugs found by the tool.
Hardware transactional memory poses a separate challenge for model checking. An aborted transaction is observationally equivalent to a transaction that fails immediately. Furthermore, all transactions must be assumed to be abortable, as there are many possible causes of aborts. As prior posts covered, this fact requires that every transaction have a valid abort path, and this abort path requires most of the verification effort.
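To make the abort-path requirement concrete, here is a minimal sketch (my own illustration, not code from the defense) of the usual pattern with Intel's RTM intrinsics: the transactional path is just one branch, and the fallback path must be a complete, correct implementation on its own, because the hardware may abort at any time.

#include <immintrin.h>    /* _xbegin/_xend/_xabort; compile with gcc -mrtm */
#include <stdatomic.h>

static atomic_int fallback_lock = 0;   /* 0 = free, 1 = held */
static long shared_value = 0;

static void lock_fallback(void)
{
    while (atomic_exchange(&fallback_lock, 1))
        ;                              /* spin */
}

static void unlock_fallback(void)
{
    atomic_store(&fallback_lock, 0);
}

void increment_with_htm(void)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        /* Reading the lock adds it to the read set: if another thread is in
         * the fallback path, abort rather than race with it. */
        if (atomic_load(&fallback_lock) != 0)
            _xabort(0xff);
        shared_value++;
        _xend();
        return;
    }
    /* Abort path: may be taken at any time (conflicts, capacity, interrupts),
     * so it must be a complete, correct implementation by itself. */
    lock_fallback();
    shared_value++;
    unlock_fallback();
}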
Landslide was tested using hand-written tests, transactional data structures, and a TSX-based spinlock, where each set of tests has a concurrency or performance bug in the implementation. What about demonstrating that there are no bugs in an implementation? With 10 hours of CPU time, verification is only possible for small cases on complex code. That said, practical testing so far has required fewer than 4 preemptions to create the buggy scenario; there can be other bugs requiring an increasingly complex ordering, but generally those are very rare.
Abstraction reduction [Simsa '13] works by reducing primitives within implementations to verified components, such as mutual exclusion, etc. Using this technique then allows Landslide to verify the complex HTM implementations at higher thread counts.
In attendance were the recent instructors of Operating Systems and the TAs.
Tuesday, February 27, 2018
Conference Attendance SIGCSE 2018
I have just finished attending SIGCSE 2018 in Baltimore. In contrast to my earlier conference attendance, this time I have had higher involvement in its execution.
On Wednesday I went to the New Educator's Workshop (NEW). Even having been faculty for two years, there were still a number of things that were either new or good reminders, such as including or discussing learning objectives with each lecture and assignment, or being careful about increasing one's level of service. As a new faculty member, each service request seems exciting, as no one has asked me before! But many senior faculty emphasized that this is the time in which they are protecting us from lots of service opportunities so that we can spend time on our teaching and research.
On Thursday morning, I presented my recent work updating a programming assignment in Introduction to Computer Systems, from which we saw improvements in student exam scores. We did not investigate which specific change was responsible, and are therefore left with two theories. First, the improvement could be from using better style in the starter code and emphasizing this style in submissions. Second, we redesigned the traces to require submissions to address different cases and thereby implement different features. I lean toward the former, but have no data-driven basis for this hypothesis.
Let's discuss active learning briefly. I attended (or ran) several sessions focused on this class of techniques. The basic idea is that students achieve better engagement and learning by actively participating in class, and there are a variety of techniques that help increase student activity. On Thursday afternoon, Sat Garcia of USD presented Improving Classroom Preparedness Using Guided Practice, which showed how student learning improved from participating in Peer Instruction, which particularly requires students to come to class prepared. Shortly afterward, Cynthia Taylor joined Sat and me in organizing a Birds of a Feather (BoF) session on using active learning in systems courses. We had about 30-40 attendees there, split into two groups, discussing techniques they have used and problems they have observed. Five years ago, a similar BoF had attendance around 15-20, so we are making progress as a field.
On Friday, I spoke with Brandon Myers, who has done work on using POGIL in Computer Organization and Architecture. In POGIL, students work in groups of 3-4 with specific roles through guided inquiry, which leads them to discover the concepts themselves. We had a nice conversation and may be merging our draft resources. This last point is often the tricky part of using active learning: developing reasonable materials is both time intensive and requires several iterations.
The Friday morning keynote was given by Tim Bell, who spoke about K-12 computer science education. This topic is rather distant from my own work and research, so I was skeptical; yet I came out quite enthused. It was interesting to think about presenting Computer Science concepts in non-traditional ways, based initially on having to explain your field at an elementary school when the other presenters are a cop and a nurse (his example). How could you get 6-year-olds to sort? Or see the advantage of binary search as the data grows?
In the afternoon, I was a session chair for the first time. I moderated the session on Errors, so obviously the AV system stopped working for a short duration. Beyond that incident, the session seemed to go well.
I always like going to SIGCSE; it is rejuvenating and exhausting. There are so many teachers to speak with about courses, curriculum, and other related topics, and then you find that you've been social for 16 or so hours.
Tuesday, September 6, 2016
When is it signaled?
Signals are a mechanism for notifying a process of a simple event. Most programmers and programs can ignore them, as the default behaviors are reasonable and taking steps to handle them would greatly increase the program complexity. But when teaching future computer scientists, we want the programmers to know about these mechanisms and therefore properly understand the functioning of the system.
In working with signals, developing programmers are often exposed to their first dose of concurrency: the idea that execution can happen in simultaneous, arbitrary orders except when the program takes action to constrain it. With signals, a program can do several things (a brief sketch of these calls follows the list):
- Provide and install a handler for one or more signals
- Block the receipt of one or more signals
- Send a signal to one or more processes
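For reference, here is a minimal sketch of those three operations using the standard POSIX calls (sigaction, sigprocmask, and kill); it is illustrative boilerplate, separate from the experiment code later in this post.

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t got_usr1 = 0;

static void on_usr1(int signo)
{
    (void)signo;
    got_usr1 = 1;                     /* only async-signal-safe work in a handler */
}

int main(void)
{
    /* 1. Provide and install a handler for SIGUSR1. */
    struct sigaction sa;
    sa.sa_handler = on_usr1;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);

    /* 2. Block SIGUSR2: its delivery is deferred until it is unblocked. */
    sigset_t block;
    sigemptyset(&block);
    sigaddset(&block, SIGUSR2);
    sigprocmask(SIG_BLOCK, &block, NULL);

    /* 3. Send a signal (here, to ourselves). */
    kill(getpid(), SIGUSR1);

    while (!got_usr1)
        ;                             /* spin until the handler has run */
    printf("received SIGUSR1\n");
    return 0;
}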
We are going to consider the problem of sending and receiving a signal: after a signal is sent, when does the other process receive it? Students assume that once the sending function (i.e., kill) has returned, the signal has been sent *and received*. However, I have found no text that explicitly guarantees this condition, so I prepared a simple program (source at the end of this post) to test it.
Since signals are communicated through the operating system, we want a different mechanism for measuring simultaneity, in this case shared memory. The experiment program creates and sets up a small region of shared memory between two processes. Next, it waits until both processes are running in a known state (i.e., a barrier). Then one (the parent) signals the other (the child), while measuring how long it takes to send, as well as to receive, the signal. Finally, it runs this experiment a million times.
On my current Ubuntu box with Skylake processors, the experimental measurements show that 80% of the time the child keeps running after kill returns, for an average of 40 cycles and a maximum of almost 1200 cycles. Assuming that the skew between each core's clock is small, this effectively means that the child can continue to run even after the sending function has returned.
Source code follows: (compiled with gcc version 4.8.3, with gcc -O3 -lrt)
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/stat.h> /* For mode constants */
#include <fcntl.h> /* For O_* constants */
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <stdint.h>
#include "rdtsc.h"
static const int ITER = 1000 * 1000;

int main(int argc, char** argv)
{
    int shm_fd = shm_open("squig", O_CREAT | O_RDWR, (S_IREAD | S_IWRITE));
    int count = 0, i;
    uint64_t zero = 0, min = ~0x0, max = 0, sum = 0;
    uint64_t minN = ~0x0, maxN = 0, sumN = 0;

    write(shm_fd, &zero, 8); // Give the shared file "space"
    void* msh = mmap(NULL, 4 * 1024, PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
    volatile uint64_t* tsc = msh;

    for (i = 0; i < ITER; i++)
    {
        *tsc = 0;
        pid_t chd = fork();
        if (chd == 0)
        {
            // Child: repeatedly publish its current timestamp until killed
            while (zero < ~0x0) { *tsc = rdtsc(); zero++; }
        }
        else
        {
            // Wait for the child to reach the known state (barrier)
            while (*tsc == 0) {}
            uint64_t st = *tsc;   // Child's last observed timestamp (unused below)

            kill(chd, SIGINT);    // Send SIGINT to the child
            uint64_t k = rdtsc(); // Returned from kill
            wait(NULL);           // Reap

            uint64_t delta = 0;
            uint64_t en = *tsc;   // When did the child end?

            // k > en implies that kill returned after the child terminated
            if (k > en)
            {
                count++;
                delta = k - en;
                if (delta < minN) minN = delta;
                if (delta > maxN) maxN = delta;
                sumN += delta;
            }
            else
            {
                delta = en - k;
                if (delta < min) min = delta;
                if (delta > max) max = delta;
                sum += delta;
            }
        }
    }

    printf("Min: %lx, Max: %lx, Avg: %lx\n", min, max, (sum / (ITER - count)));
    printf("Min: %lx, Max: %lx, Avg: %lx\n", minN, maxN, (sumN / (count)));
    printf("Percent Parent After: %lf\n", (count / (double)ITER));
    return 0;
}
Update: Results also hold when sending SIGKILL.
Friday, June 21, 2013
What is Time: Keeping Time in Software
In a prior post, Measuring Performance, I discussed some techniques for how a developer might measure the performance of his or her application. However, sometimes what you want is to measure an interval of time. We can generally agree to do this by taking two measurements: the time before and the time after. Then find the difference, which is the interval of time elapsed. Great! But how exactly are we going to measure time?
A common theme to the questions I answer on stackoverflow is proper measurement of time (see for example, Analysis of CPU caches, Function Pointers versus Inline Functions, or Windows system time with millisecond resolution). A programmer starts with the tools most readily available, usually a call to gettimeofday or GetTickCount (for example). These functions provide time in (roughly) millisecond resolution, like having a stopwatch. For its own timekeeping purposes, an OS sets up hardware to send an interrupt every ~16ms (i.e., 64Hz via the PIT). This interrupt primarily denotes a scheduling quantum, but for little additional overhead, the OS updates its clock. Some applications need finer-grained scheduling and increase the frequency of the interrupt (i.e., decrease the quantum), which increases overhead and power usage. Using these calls, the programmer can measure events that take at least 1 centisecond with reasonable accuracy.
But, you object, I want better accuracy! Let's turn to the HPET then: a hardware device that internally operates at 10+ MHz (i.e., 100ns ticks). Thanks to this device, the system has a microsecond time source. For this resolution, a programmer may use clock_gettime or QueryPerformanceCounter, and both come with supplemental routines that indicate the resolution of the time returned. For example:
#include <stdio.h>
#include <time.h>

int main(int argc, char** argv)
{
    struct timespec n, o;
    clockid_t t = CLOCK_MONOTONIC;

    // Report the clock's advertised resolution, then take two back-to-back readings
    clock_getres(t, &n);
    printf("%ld - %ld\n", (long)n.tv_sec, n.tv_nsec);

    clock_gettime(t, &n);
    clock_gettime(t, &o);
    printf("%ld - %ld\n", (long)n.tv_sec, n.tv_nsec);
    printf("%ld - %ld\n", (long)o.tv_sec, o.tv_nsec);
    return 0;
}
% ./t_test
0 - 999848
13390932 - 798720095
13390932 - 798720095
This time source has a reported resolution of approximately 1ms; however, sometimes the printed "n, o" pair actually shows a one-microsecond difference, which suggests that the actual resolution is finer than clock_getres reports. Thus, on my test machine, I could measure time in microseconds.
But, you object, I want better accuracy! So let's move on to the timestamp counter. This counter is per-processor and runs at the processor's frequency (1+ GHz, or <1ns per tick). clock_gettime and QueryPerformanceCounter may use this time source, so we can query it for a nanosecond-accurate time measurement. Done, or are we?
But, I object, it takes time to do this measurement; perhaps a very small amount of time, but time nonetheless. If we switch the code above to use CLOCK_PROCESS_CPUTIME_ID, the measurements show a 400 - 500ns difference between n and o, even though the reported "resolution" is 1ns. Did the function lie about its resolution? No. Imagine calling NIST or CERN to get the time from their clock. Their clock is accurate to better than 1ns, but it takes time to call them. In the code example, each call is taking about 400ns to make the function call(s), query the timestamp and then convert into the appropriate format. Note that this overhead was also present in the earlier experiments, which is why the earlier output occasionally differed by 1 microsecond, as the 400ns of overhead caused the measurements to fall in different microsecond ticks.
But, you object, I really need to know how many nanoseconds this something takes. There are two approaches to remedy this: first, we could write some code that does the above repeatedly to determine the overhead and then subtract it from the time difference; or second, we could find a different time source.
These function calls are querying the timestamp counter, but architects have exposed this counter to programmers for years, so why make the call at all? On x86, the instructions RDTSC and RDTSCP provide the 64-bit timestamp value directly. A single instruction is about as small an overhead as we're going to get (although we can also look at register pressure, cycles required, etc.).
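For reference, here is a minimal sketch of reading the counter directly via the compiler intrinsics (__rdtsc/__rdtscp from GCC and Clang's x86intrin.h), and of amortizing the measurement over many iterations as recommended below; assume a recent x86-64 compiler.

#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp (GCC/Clang, x86 only) */

/* Work whose cost we want to estimate. */
static unsigned long work(unsigned long x) { return x * 2654435761UL + 12345; }

int main(void)
{
    enum { N = 100000 };
    unsigned aux;
    volatile unsigned long sink = 1;

    unsigned long long start = __rdtsc();     /* may reorder with neighbors */
    for (int i = 0; i < N; i++)
        sink = work(sink);
    unsigned long long end = __rdtscp(&aux);  /* waits for prior instructions */

    /* Report the aggregate and the per-iteration average; the fixed overhead
     * of the two reads is amortized across N iterations. */
    printf("total: %llu cycles, per iteration: %.2f cycles\n",
           end - start, (double)(end - start) / N);
    return 0;
}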
Hah! you exclaim, now I can know which instruction sequence is fastest, because I can measure CPU cycles. But at this timescale, your measurement has left the "classical" architecture model of executing one instruction at a time and drifted into the reality of many tens of instructions executing simultaneously. When you toss in your RDTSC, there is no guarantee where in this mix of instructions it will execute and complete (with RDTSCP, you can at least guarantee that all previous instructions have completed). So it is recommended that, even if you are measuring a "small instruction sequence" like in some of the cited stackoverflow questions, you perform thousands of operations and measure the time of the aggregate. Of course, architectural behaviors like branch prediction, cache misses, prefetchers, etc., will change over the course of the experiment.
As if this weren't bad enough, the timestamp counter (including calls using CLOCK_PROCESS_CPUTIME_ID) is specific to each processor. The counter is not global system time, so taking measurements on different processors (say, on different threads or on a thread that migrated) will not necessarily give meaningful results. Many modern OSes try to set the counters so that the measurements are within a small margin of each other; in practice, I have observed deltas from the low hundreds to tens of millions of cycles, so be warned.
Even worse, older processors really did count CPU cycles as the timestamp, so if the processor slowed down, then a "cycle" meant more nanoseconds. Modern chips have an invariant TSC, per the Intel architecture manual:
"The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource."
As a bonus, all of these clocks can be affected by thermal drift, as their frequency is related to their temperature. Others have investigated this behavior in greater detail, and while it can be accounted for, in practice I've happily ignored it (although I should note in future writing that I have not controlled for this possible measurement bias).
In conclusion, I use RDTSC for most of my measurements and try to control for the clock difference that exists between processors. In my current PhD research, I am still working to try to control and properly account for this difference so that I can measure the behavior that I am interested in. So be warned and be cautious, and be as precise as you need to be.
Friday, December 23, 2011
Measuring Performance
Starting with a primer of Understanding processing time, let's delve into some methods for how we might measure the performance of software. Along the way, you might gain a better understanding into how the operating system (OS) tracks time. So let's take different methods in order from least overhead to greatest.
First, the wall clock. I stare at my watch and see the second-hand move. With each tick of the hand, another second has passed. Eventually the program completes, and I can record how many ticks have elapsed. (Yes, you can use time or another program to do this). I am constrained in knowing the time by my watch, and the computer is no different; it has an internal clock that ticks only so fast (see my comment on stackoverflow). We now know how long the program takes to run, a measure of performance, but not what it was doing (perhaps it sleeps for 20min then prints "complete!").
The operating system gives some metric of every program's CPU usage, viewable with top, taskmgr, etc. On every scheduling interval, the operating system receives an interrupt and must decide whether to multitask and switch a different thread into execution. Also, the OS will record that the past quantum of time was spent doing whatever was just interrupted. So we view the percentages of one second, where this chunk of 16ms (the default Windows quantum) was spent executing Word, that chunk was spent by the kernel, and all those were idle. There are still inaccuracies, but it is low overhead. As a note, the OS may actually record cycles elapsed on every task switch and thus compute a higher accuracy percentage. (Of course, power management can play havoc with elapsed cycles and elapsed wall clock time).
So we now know that our application was using the CPU during its entire execution, but what was it doing?! Often the next step is to use a profiler. A profiler works just like the OS's CPU usage in the previous paragraph, except it does so at an application granularity (although some profilers, like xperf, can do the whole system). Now on each interrupt (which also may be more frequent, like 1ms, with corresponding overhead), the profiler records the instruction pointer (IP) of the running process and again ascribes the quantum of time to the function containing that IP. Optionally, the profiler might record a call stack on the interrupt, so calling functions can be identified, which is particularly useful when shared routines are invoked (like STL container classes). Some profilers will break a function's usage down by the IP, allowing the analysis to understand the specific assembly instructions that are being executed frequently.
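To make the sampling mechanism concrete, here is a minimal sketch (my own illustration, Linux/x86-64 specific) of the core of such a profiler: a periodic SIGPROF timer interrupts the program, and the handler records the interrupted instruction pointer, which an offline step would then ascribe to functions.

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <ucontext.h>

#define MAX_SAMPLES 4096
static volatile size_t nsamples = 0;
static unsigned long samples[MAX_SAMPLES];   /* sampled instruction pointers */

static void on_prof(int signo, siginfo_t *info, void *uc)
{
    (void)signo; (void)info;
    /* Record the interrupted instruction pointer (RIP on x86-64 Linux); a real
     * profiler would map these addresses back to functions via the symbol table. */
    if (nsamples < MAX_SAMPLES)
        samples[nsamples++] =
            (unsigned long)((ucontext_t *)uc)->uc_mcontext.gregs[REG_RIP];
}

static double burn(void)                      /* something worth profiling */
{
    double x = 0.0;
    for (long i = 1; i < 50 * 1000 * 1000; i++)
        x += 1.0 / i;
    return x;
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_prof;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    /* Fire SIGPROF every 1ms of CPU time consumed by this process. */
    struct itimerval it = { { 0, 1000 }, { 0, 1000 } };
    setitimer(ITIMER_PROF, &it, NULL);

    double r = burn();
    printf("result = %f, collected %zu samples\n", r, (size_t)nsamples);
    return 0;
}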
But the profiler is still relying on probability and sampling to give an estimate of the CPU usage of the program. If we increase the overhead significantly more, we can know precisely what the program is executing. I've used systems where the compiler can put callbacks around every function call; each callback logs the function being invoked and the current time (in cycles). Now one has a complete function trace of the running system, along with the cycles required for each segment of code. Only one problem: the overhead is at least 100% (so performance is halved or worse).
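GCC and Clang expose this style of instrumentation directly via -finstrument-functions; here is a minimal sketch (my own, not the specific systems mentioned above) where the injected callbacks log every function entry and exit along with a timestamp.

/* Build with: gcc -finstrument-functions -O2 trace.c
 * The compiler inserts calls to the two __cyg_profile_* hooks around every
 * function entry and exit in this translation unit. */
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc (x86) */

#define NO_TRACE __attribute__((no_instrument_function))

/* The hooks themselves must not be instrumented, or they would recurse. */
void NO_TRACE __cyg_profile_func_enter(void *fn, void *call_site)
{
    (void)call_site;
    fprintf(stderr, "enter %p @ %llu\n", fn, (unsigned long long)__rdtsc());
}

void NO_TRACE __cyg_profile_func_exit(void *fn, void *call_site)
{
    (void)call_site;
    fprintf(stderr, "exit  %p @ %llu\n", fn, (unsigned long long)__rdtsc());
}

static long fib(long n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }

int main(void)
{
    printf("fib(10) = %ld\n", fib(10));   /* every call appears in the trace */
    return 0;
}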
Tangential to this entire discussion is the fact that modern processors also provide hardware performance counters, so that the analysis can see cache misses, branch predictions and other metrics of how efficiently the code is executing on the processor. And I have not discussed IO, although my general rule is asynchronous and nonexistent.
Again, always measure performance before cutting (i.e., optimizing).