
Wednesday, April 22, 2020

Thesis Proposal - Lightweight Preemptable Functions

Sol Boucher presented his thesis proposal today via Zoom.  It was well attended and I think many of us are already looking forward to his defense.

Lightweight Preemptable Functions (LPF)
To decide whether a function can safely be invoked from time-critical code, a program needs some knowledge of that function's cost.

In the current space of multitasking, the talk first reviewed three existing approaches: futures, threads, and processes.  Futures rely on the runtime to provide asynchronous execution, but preemption requires yield points: not all code can support these points, nor is it clear in advance what a future will do.  Threads execute fairly independently, but they expose minimal information for scheduling, and cancellation is hard.  Processes provide good support for cancellation, except that on fork(), only the calling thread is carried over into the new process.  The other threads simply vanish, which can leave shared state (such as held locks) inconsistent.

The thesis proposal: introduce a novel abstraction for preemption at the granularity of a synchronous function call, bounded by a timeout.

launch(function, timeout, argument) returns a continuation.  If the timeout elapses, the continuation is returned to the caller, which then chooses whether to resume or cancel it.  The LPF executes within the same thread context as the caller, thereby reducing overhead; to resume the LPF, however, a new stack is needed.  The timeout is enforced by a timer signal that can fire as often as every ~5 microseconds.  Launch and resume have overhead comparable to this granularity, significantly better than fork or pthread_create.  Cancel, however, is extremely expensive.

LPFs also have an issue with calling functions that are non-reentrant, similar to the rules governing signal handlers.  To address this, the runtime provides selective relinking to capture what the LPF is calling via the global offset table (GOT).  Some GOT entries point to dynamic libraries, while others initially point to the dynamic linker.  This runtime support also needs to intercept thread-local variables.  The interception imposes about 10ns of overhead, only slightly above the cost of the function calls themselves.

Microservice approaches have significant invocation latency, often hundreds of microseconds, primarily due to the need to create a sufficient container, often via processes or virtualization.  If the microservice were instead written safely and using LPFs, the latency could be reduced toward the hardware bound, as measured by communicating between VMs or committing transactions.

Cancellation cleanup is difficult in languages, such as C, that require explicit cleanup.  In other languages, adding a new exception path for timeout and cancellation could invoke the necessary destructors.  Nonetheless, this can be expensive (perhaps single-digit milliseconds).

Other possible future work:

  • Another cancellation cost is unloading the mirrored library; could the runtime instead track the changes made and then decide whether to roll them back or discard them?
  • Is it possible to reduce the overhead of the preemption signals or to improve their granularity?

Tuesday, September 6, 2016

When is it signaled?

Signals are a mechanism for notifying a process of a simple event.  Most programmers and programs can ignore them, as the default behaviors are reasonable and taking steps to handle them would greatly increase program complexity.  But when teaching future computer scientists, we want programmers to know about these mechanisms so that they properly understand how the system functions.

In working with signals, developing programmers are often exposed to their first dose of concurrency: the idea that execution can happen in arbitrary, interleaved orders except where the program takes explicit action.  With signals, a program can do several things:

  • Provide and install a handler for one or more signals
  • Block the receipt of one or more signals
  • Send a signal to one or more processes

We are going to consider the problem of sending and receiving a signal.  After a signal is sent, when does the other process receive it?  Students make the assumption that when the sending function (i.e., kill) has returned, the signal has been sent *and received*.  However, I have found no text that explicitly guarantees this condition.  Instead, I prepared a simple program (source at the end of this post) to test this condition.

As signals are communicated through the operating system, we want a different mechanism for measuring simultaneity, in this case shared memory.  The experiment program creates and sets up a small space of shared memory between two processes.  Next, it waits until both processes are running in a known state (i.e., a barrier).  Then one (the parent) signals the other (the child), while measuring how long the signal takes to send as well as to be received.  Finally, the experiment is repeated a million times.

On my current Ubuntu box with a Skylake processor, the experimental measurements show that 80% of the time, the child continues running after kill returns, for an average of 40 cycles; the maximum is almost 1200 cycles.  Assuming the skew between the cores' timestamp counters is small, this effectively means that the child can continue to run even after the sending function has returned.

Source code follows (compiled with gcc 4.8.3: gcc -O3 -lrt):
#include <sys/mman.h>
#include <sys/types.h>

#include <sys/wait.h>
#include <sys/stat.h> /* For mode constants */
#include <fcntl.h> /* For O_* constants */
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <stdint.h>

#include "rdtsc.h"

static const int ITER = 1000 * 1000;

int main(int argc, char** argv)
{
    int shm_fd = shm_open("squig", O_CREAT | O_RDWR, (S_IREAD | S_IWRITE));
    int count = 0, i;
    uint64_t zero = 0, min = ~0x0, max = 0, sum = 0;
    uint64_t minN = ~0x0, maxN = 0, sumN = 0;
    write(shm_fd, &zero, 8);  // Give the shared file "space"
    void* msh = mmap(NULL, 4 * 1024, PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
    volatile uint64_t* tsc = msh;
    
    for (i = 0; i < ITER; i++)
    {
        *tsc = 0;
        pid_t chd = fork();
        
        if (chd == 0)
        {
            // Give the compiler something to think about
            while(zero < ~0x0) { *tsc = rdtsc(); zero++;}
        }
        else
        {
            // Wait for the child
            while (*tsc == 0) {}
            
            uint64_t st = *tsc; // When is it for the child?
            kill(chd, SIGINT); // Send SIGINT to child
            uint64_t k = rdtsc(); // Returned from kill
            wait(NULL); // Reap
            uint64_t delta = 0;
            uint64_t en = *tsc; // When did the child end?
            
            // k > en implies that kill returned after the child terminated
            if (k > en)
            {
                count ++;
                delta = k - en;
                if (delta < minN) minN = delta;
                if (delta > maxN) maxN = delta;
                sumN += delta;
            }
            else
            {
                delta = en - k;
                if (delta < min) min = delta;
                if (delta > max) max = delta;
                sum += delta;
            }
        }
    }
    
    printf("Min: %llu, Max: %llu, Avg: %llu\n",
           (unsigned long long)min, (unsigned long long)max,
           (unsigned long long)(count < ITER ? sum / (ITER - count) : 0));
    printf("Min: %llu, Max: %llu, Avg: %llu\n",
           (unsigned long long)minN, (unsigned long long)maxN,
           (unsigned long long)(count ? sumN / count : 0));
    printf("Percent Parent After: %lf\n", (count / (double)ITER));
    
    shm_unlink("squig");  // Remove the shared memory object
    return 0;
}

Update: Results also hold when sending SIGKILL.