Showing posts with label performance modeling.

Thursday, September 12, 2019

Thesis Proposal: Theoretical Foundations for Modern Multiprocessor Hardware

Naama Ben-David gave her proposal this morning on Theoretical Foundations for Modern Multiprocessor Hardware.

Is there a theoretical foundation for why exponential backoff is a good design?  Exponential backoff is an algorithm developed through practice, without such a theoretical foundation.

To develop such a foundation, we need a model of time; however, requests are asynchronous and do not proceed according to a single time source.  To address this, time is modeled with adversarial scheduling.  Thus when performing a request, there are three sources of delay:
  • self-delay: backoff, sleep, local computation
  • system-delay: interrupts, context switches
  • contention-delay: delay caused by contention
Given this model, the adversary can, to a limited degree, decide when a request that has passed from self-delay into system-delay then moves into contention-delay and is ultimately completed.
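
To make the setting concrete, here is a minimal sketch (my illustration, not from the talk) of the kind of algorithm being analyzed: contending on a shared test-and-set flag with randomized exponential backoff.  The sleep is the self-delay in the model above; the scheduler and the other processes supply the system- and contention-delays.

```cpp
#include <atomic>
#include <chrono>
#include <random>
#include <thread>

// Acquire a shared test-and-set flag, backing off exponentially after each
// failed attempt.  The randomized sleep is "self-delay"; when and how the
// attempts interleave is left to the (adversarial) scheduler.
void acquire_with_backoff(std::atomic<bool>& flag) {
    std::mt19937 rng(std::random_device{}());
    long max_delay_us = 1;                              // backoff window, doubles per failure
    while (flag.exchange(true, std::memory_order_acquire)) {
        std::uniform_int_distribution<long> dist(0, max_delay_us);
        std::this_thread::sleep_for(std::chrono::microseconds(dist(rng)));
        max_delay_us *= 2;                              // exponential backoff
    }
    // Flag acquired; the caller later releases it with
    // flag.store(false, std::memory_order_release).
}
```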

In BBlelloch'17, this model was applied to measure the work of different approaches:
  • With no backoff, there is Ω(n³) work.
  • Exponential backoff reduces this to a Θ(n² log n) bound on work.
  • The paper also proposes a new algorithm that achieves O(n²) work with high probability.
The second phase of work is developing simple and efficient algorithms for systems that have non-volatile memory (NVRAM).  With NVRAM, on a crash or system failure, the contents in memory persist across reboot (or other restore).  This permits the system to restore the running program(s) to a finer degree than happens from auto-saves or other current techniques.  However, systems also have caches, which are not persistent.  Caches are presently managed by hardware and make decisions as to when to write contents back to memory.  Algorithms must work with the caches to ensure that results are safely in memory at selected points of execution.  There are a variety of approaches for how to select these points.
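
On x86 hardware, for example, such a point is typically created by explicitly writing the relevant cache lines back and fencing before treating the data as persistent.  A minimal sketch (assuming a CPU with the CLWB instruction, compiled with -mclwb, and data that actually resides in an NVRAM-backed region):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Push a range of bytes out of the cache hierarchy so it is durably in
// (persistent) memory before the caller proceeds past this point.
void persist_range(const void* addr, std::size_t len) {
    const std::uintptr_t line = 64;                                  // cache-line size
    std::uintptr_t p = reinterpret_cast<std::uintptr_t>(addr) & ~(line - 1);
    const std::uintptr_t end = reinterpret_cast<std::uintptr_t>(addr) + len;
    for (; p < end; p += line)
        _mm_clwb(reinterpret_cast<void*>(p));                        // write back each line
    _mm_sfence();                                                    // order the write-backs
}
```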

The third phase of work is modeling RDMA (remote direct memory access) systems.  Can there be a model of the different parts of such a system: memory, NIC (network interface card), and CPU?  Such a model could then be used to explore contention, as well as possible failures, in the system.

One scheme is for every process to also be able to send messages on behalf of its shared-memory neighbors, so that even if a process fails, it can still participate in algorithms such as consensus.

As this is a proposal, the ongoing work will build instantiations of these algorithms to measure their practical performance.

Tuesday, July 31, 2018

Book Review: The Art of Application Performance Testing

The Art of Application Performance Testing covers what its title says.  The book starts with concepts general to any performance testing, which was interesting to me.  Most of the text, though, focuses on the Application part of the title.  The applications here are primarily web-based or other client-server setups, not the generic "application" referring to any program.  That said, I do not work on such applications, so the remainder of the text was of less value to me.

In testing applications, a performance analyst needs to establish a representative workload, which includes the actions to perform and their combined load.  For example, most users logging in to their bank will view their account balance, while others might transfer money or pay a bill.  Combined, these actions might represent most of the work from users.  Then, for each unit of server capacity, determine how many users should be able to perform a mix of those actions; this forms the load.
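
As a hypothetical illustration of such a mix (the actions and percentages are mine, not the book's), a load generator could sample each simulated user's next action from a weighted distribution:

```cpp
#include <iostream>
#include <random>
#include <string>
#include <vector>

int main() {
    // Hypothetical banking workload mix: action names and their share of requests.
    std::vector<std::string> actions = {"view_balance", "transfer_money", "pay_bill"};
    std::discrete_distribution<int> mix({80, 12, 8});    // 80% / 12% / 8%

    std::mt19937 rng(std::random_device{}());
    for (int i = 0; i < 10; ++i)                          // ten simulated user actions
        std::cout << actions[mix(rng)] << '\n';
}
```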

After establishing the workload, the analyst needs to implement the described workload, which requires a tool that generates the load (either by driving the application itself or replaying a synthetic trace of the load).  For those tools, what additional hardware is required to deploy this load?  Does the deployment take into account geographic and other user variations (so that the load generation is representative of the user base)?  Finally, what tooling and methodology exists for profiling and recording the execution of the workload for present and future analysis?

So I appreciated the content of the book and would recommend it to individuals focusing on testing of user-facing applications.

Wednesday, May 3, 2017

PhD Defense - Meeting Tail Latency SLOs in Shared Networked Storage

Today I went to Timothy Zhu's PhD thesis defense.  His work is on achieving better sharing of data center resources to improve performance, and particularly to reduce tail latency.  He also TA'd for me last fall.

Workloads are generally bursty, and different workloads have different characteristics.  Furthermore, they may have service level objectives (SLOs), and the system needs to meet these different objectives.  The system also contains a variety of resources that must be shared in some form.  It is not sufficient to just divide the bandwidth.  Nor can the system measure the latency and react to it, particularly as bursty workloads do not give sufficient time to react.  While each workload has deadlines, it would be too complex to tag request packets with the deadlines for queuing and routing.  However, the deadlines can be used to generate priorities for requests.

The system is architected to have storage and network enforcement components to ensure QoS.  There is also a controller that receives an initial trace to characterize each workload, and that workload's SLOs.  The controller works through a sequence of analyses to successfully place each workload into the overall system.

Effectively, each workload is assigned a "bucket" of tokens, where the bucket size provides the ability to handle bursts and the rate at which tokens are added covers the request rate for the workload.  Short, bursty workloads receive large buckets and low rates, while constant workloads with few bursts have high rates and small buckets.  In both cases, only when the bucket is empty is the workload rate-limited in its requests, and these requests receive the lowest priority.  Deterministic Network Calculus (DNC) is used to model the worst-case queueing scenarios.  This plots two curves: the arrival curve of the requesting flow and the service curve, both as functions of the window size (dt).  The maximum horizontal distance between the curves is the maximum latency.
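
As a minimal sketch of the rate-limiting side (my illustration, not PriorityMeister's code), a token bucket admits a request at normal priority while tokens remain and otherwise marks it rate-limited; and for a token-bucket arrival curve r·t + b served by a rate-latency service curve R·(t − T), the standard network-calculus delay bound is T + b/R:

```cpp
#include <algorithm>
#include <chrono>

// Token bucket: `rate` tokens per second accumulate, up to `bucket_size`.
// A request is served at normal priority if a token is available; otherwise
// it is rate-limited and gets the lowest priority.
class TokenBucket {
public:
    TokenBucket(double rate, double bucket_size)
        : rate_(rate), size_(bucket_size), tokens_(bucket_size),
          last_(std::chrono::steady_clock::now()) {}

    // Returns true for normal priority, false if the request is rate-limited.
    bool admit() {
        auto now = std::chrono::steady_clock::now();
        std::chrono::duration<double> dt = now - last_;
        last_ = now;
        tokens_ = std::min(size_, tokens_ + rate_ * dt.count());
        if (tokens_ >= 1.0) { tokens_ -= 1.0; return true; }
        return false;
    }

private:
    double rate_, size_, tokens_;
    std::chrono::steady_clock::time_point last_;
};

// DNC worst-case latency: the maximum horizontal gap between the arrival
// curve (r*t + b) and a rate-latency service curve (R*(t - T)).
double dnc_delay_bound(double r, double b, double R, double T) {
    return (R >= r) ? T + b / R : -1.0;   // unbounded if the service rate is too low
}
```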

Using three traces (DisplayAds, MSN, and LiveMaps), they tested three approaches: Cake (a reactive approach), earliest deadline first, and Timothy's scheme (PriorityMeister).  His scheme did significantly better than the others at meeting the SLOs.  However, the DNC analysis was based on meeting the latency target 100% of the time, not the SLO's 99th percentile (or other percentile target).  Depending on the workload characteristics, there can be significant differences between these guarantees.  Stochastic Network Calculus (SNC) can model the latency percentiles; however, the math is significantly more complex, and it had not previously been applied to this problem.  DNC is still better when assuming that bursts are correlated or that the system is in an adversarial setting.  Relaxing these assumptions (uncorrelated workloads), the SNC-based analysis permitted the system to admit 3x the workloads versus the DNC analysis.

Workloads have a curve of satisfying (bucket size, token rate) pairs.  Many systems require the user to provide a rate limit.  Other systems use simple heuristics to either find the "knee of the curve" or select a rate limit as a multiple of the average rate.  However, for an individual workload, all pairs are satisfying; the different pairs only matter when workloads are combined in a system.  The configurations of the set of workloads on the system can be solved for using a system of linear equations.  Therefore, when placing new workloads, the controlling architecture can find successful placements, while potentially reconfiguring the workloads already assigned.
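
As a toy illustration of that placement step (my simplification; the actual controller solves the real analyses as a system of linear equations), one can search for one satisfying (rate, bucket) pair per workload such that the totals fit within the shared resource:

```cpp
#include <cstddef>
#include <vector>

struct Config { double rate; double bucket; };   // one satisfying (token rate, bucket size) pair

// Toy placement check: choose one satisfying configuration per workload so the
// summed rates and bucket sizes fit the shared capacity.  Brute-force with
// backtracking; the real controller formulates this with linear equations.
bool place(const std::vector<std::vector<Config>>& curves, std::size_t i,
           double rate_left, double bucket_left, std::vector<Config>& chosen) {
    if (i == curves.size()) return true;                         // every workload placed
    for (const Config& c : curves[i]) {
        if (c.rate <= rate_left && c.bucket <= bucket_left) {
            chosen.push_back(c);
            if (place(curves, i + 1, rate_left - c.rate, bucket_left - c.bucket, chosen))
                return true;
            chosen.pop_back();                                   // backtrack, try the next pair
        }
    }
    return false;                                                // no feasible combination
}
```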

One extension would be addressing failure modes.  Currently, the system is assumed to be at degraded performance when components have failed.