Is there a theoretical foundation for why exponential backoff is a good design? Exponential backoff was developed as a practical algorithm, without an underlying theory explaining its effectiveness.
To develop such a foundation, we need a model of time; however, requests are asynchronous and do not follow a single time source. To address this, we model time with adversarial scheduling. In this model, a request is subject to three sources of delay:
- self-delay: backoff, sleep, local computation
- system-delay: interrupts, context switches
- contention-delay: delay caused by other processes contending for the same resource
Given this model, the adversary decides, to a limited degree, when a request that has passed from self-delay into system-delay then moves into contention-delay and is ultimately completed.
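The self-delay component is the only one a process controls, and exponential backoff is simply a policy for choosing it. A minimal sketch (the `try_request` stand-in for a contended operation, the success probability, and the initial delay are all illustrative assumptions, not from the source):

```python
import random
import time

def try_request() -> bool:
    """Hypothetical stand-in for one attempt at a contended operation
    (e.g., a CAS on a shared variable)."""
    return random.random() < 0.3  # succeeds with some fixed probability

def request_with_backoff(max_attempts: int = 10) -> bool:
    delay = 0.001  # initial self-delay, in seconds (illustrative)
    for _ in range(max_attempts):
        # system-delay and contention-delay happen inside try_request
        # and are outside the process's control.
        if try_request():
            return True
        time.sleep(delay)  # self-delay: the backoff
        delay *= 2         # exponential growth of the self-delay
    return False
```

The doubling spreads retries out over time, which is what reduces the contention-delay the adversary can impose on other requests.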
In BBlelloch'17, this model was applied to measure the work performed by different approaches:
- With no backoff, the work is Ω(n³).
- Exponential backoff reduces this to a Θ(n² log n) bound on work.
- The paper also proposes a new algorithm that performs O(n²) work with high probability.
The second phase of work is developing simple and efficient algorithms for systems with non-volatile memory (NVRAM). With NVRAM, the contents of memory persist across a crash or system failure, surviving a reboot (or other restore). This permits the system to restore running programs to a finer degree than current techniques such as auto-saves allow. However, systems also have caches, which are not persistent. Caches are managed by hardware, which decides when to write contents back to memory. Algorithms must therefore work with the caches to ensure that results are safely in memory at selected points of execution. There are a variety of approaches for how to select these points.
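A common persist-point pattern is to flush the data, then flush a "valid" flag, so a post-crash recovery can tell whether the data is complete. As a sketch only, a file-backed mmap with explicit `flush()` calls stands in for NVRAM plus cache write-back (real NVRAM code would use cache-line flush instructions; the file name and layout here are made up):

```python
import mmap
import os
import struct

PATH = "state.bin"  # hypothetical backing file standing in for NVRAM

# Initialize 16 bytes of "persistent memory".
with open(PATH, "wb") as f:
    f.write(b"\x00" * 16)

with open(PATH, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 16)
    mm[0:8] = struct.pack("<q", 42)  # write the payload
    mm.flush()                       # persist point 1: payload is durable
    mm[8:9] = b"\x01"                # write the "valid" flag
    mm.flush()                       # persist point 2: flag is durable
    mm.close()
```

Ordering the two flushes is the essential part: if a crash lands between them, recovery sees the flag unset and discards the partial payload.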
The third phase of work is modeling RDMA (remote direct memory access) systems. Can the different parts of such a system, the memory, the NIC (network interface card), and the CPU, be modeled separately? With such a model, one can then explore contention as well as the possible failures in the system.
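One reason to model the components separately is that they fail independently: a one-sided RDMA read is served by the remote NIC directly from remote memory, so it can succeed even when the remote CPU has crashed. A toy sketch of that separation (the class and field names are illustrative assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Each component of a node can fail independently.
    memory: dict = field(default_factory=dict)
    nic_alive: bool = True
    cpu_alive: bool = True

def rdma_read(remote: Node, key):
    """A one-sided RDMA read: handled by the remote NIC directly from
    remote memory, with no involvement from the remote CPU."""
    if not remote.nic_alive:
        raise ConnectionError("remote NIC failed")
    return remote.memory.get(key)

node = Node(memory={"x": 7})
node.cpu_alive = False        # crash the remote CPU
print(rdma_read(node, "x"))   # the value is still readable: prints 7
```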
One scheme is for every process to also be able to send messages on behalf of its shared-memory neighbors, so that even if a process fails, it can still participate in algorithms such as consensus.
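The scheme above can be sketched as a toy simulation: each process's pending messages sit in shared memory where its neighbors can see them, so a live neighbor forwards on behalf of a failed process (the process names, message contents, and topology are made up for illustration):

```python
# Outboxes live in shared memory visible to neighbors.
processes = {
    "p1": {"alive": True, "neighbors": ["p2"], "outbox": ["vote:1"]},
    "p2": {"alive": True, "neighbors": ["p1"], "outbox": ["vote:1"]},
}
network = []  # messages actually delivered

def deliver_all():
    for name, p in processes.items():
        if p["alive"]:
            network.extend(p["outbox"])       # sends its own messages
        else:
            for n in p["neighbors"]:          # a live neighbor sends on
                if processes[n]["alive"]:     # the failed one's behalf
                    network.extend(p["outbox"])
                    break

processes["p1"]["alive"] = False  # p1 crashes before sending
deliver_all()
print(len(network))               # both votes still delivered: prints 2
```

Even with p1 crashed, its vote reaches the network, which is what lets consensus proceed despite the failure.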
As this is a proposal, ongoing work will build instantiations of these algorithms to measure their practical performance.