Wednesday, May 3, 2017

PhD Defense - Meeting Tail Latency SLOs in Shared Networked Storage

Today I went to Timothy Zhu's PhD thesis defense.  His work is on achieving better sharing of data center resources to improve performance, and particularly to reduce tail latency.  He also TA'd for me last fall.

Workloads are generally bursty, and their characteristics differ from one workload to the next.  Furthermore, each may have its own service level objectives (SLOs), and the system needs to meet these different objectives.  The system also contains a variety of resources that must be shared in some form.  It is not sufficient to just divide the bandwidth.  Nor can the system measure latency and react to it, particularly as bursty workloads do not give it sufficient time to react.  While each workload has deadlines, it would be too complex to tag request packets with those deadlines for queuing and routing.  However, the deadlines can be used to generate priorities for requests, as sketched below.
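
As a concrete illustration of that last step, here is a minimal sketch (my own, not from the talk) that maps each workload's SLO deadline to a fixed priority, with tighter deadlines receiving higher priority.  The deadline values are invented for illustration:

    # Hypothetical per-workload latency SLOs (deadlines in milliseconds).
    workload_deadlines_ms = {
        "DisplayAds": 50,
        "MSN": 200,
        "LiveMaps": 500,
    }

    # Tighter deadline -> higher priority (priority 0 is served first).
    priority_of = {
        name: prio
        for prio, (name, _) in enumerate(
            sorted(workload_deadlines_ms.items(), key=lambda kv: kv[1])
        )
    }

    print(priority_of)  # {'DisplayAds': 0, 'MSN': 1, 'LiveMaps': 2}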

The system is architected to have storage and network enforcement components to ensure QoS.  There is also a controller that receives an initial trace characterizing each workload, along with that workload's SLOs.  The controller works through a sequence of analyses to place each workload into the overall system.

Effectively, each workload is assigned a "bucket" of tokens, where the bucket size provides the ability to absorb bursts and the rate at which tokens are added covers the request rate of the workload.  Bursty workloads receive large buckets and low rates, while steady workloads with few bursts receive high rates and small buckets.  In both cases, only when the bucket is empty is the workload rate-limited in its requests, and those requests receive the lowest priority.  Deterministic Network Calculus (DNC) is used to model the worst-case queueing scenarios.  The analysis plots two curves: the arrival curve of the requesting flow and the service curve, both giving tokens as a function of the time window (dt).  The maximum horizontal distance between the curves is the maximum latency.
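
Below is a minimal token-bucket sketch of this idea, with my own interface and parameter names (none of this code is from the thesis).  A request that finds the bucket empty is not dropped; it is merely demoted to the lowest priority.  The second function computes the standard DNC delay bound for token-bucket arrivals through a rate-latency server, which is exactly the maximum horizontal distance described above:

    import time

    class TokenBucket:
        """The bucket size absorbs bursts; the rate covers the average load."""

        def __init__(self, rate, size):
            self.rate = rate                  # tokens added per second
            self.size = size                  # maximum tokens (burst capacity)
            self.tokens = size
            self.last = time.monotonic()

        def admit(self, cost=1.0):
            """Charge a request; return False if it should be demoted to
            the lowest priority because the bucket is empty."""
            now = time.monotonic()
            self.tokens = min(self.size,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False

    def dnc_delay_bound(b, r, R, T=0.0):
        """Worst-case delay for token-bucket arrivals (burst b, token rate r)
        through a rate-latency server (rate R, latency T), assuming r <= R:
        the maximum horizontal distance between arrival and service curves."""
        assert r <= R, "arrival rate must not exceed service rate"
        return T + b / R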

Using three traces (DisplayAds, MSN, and LiveMaps), they tested three approaches: Cake (a reactive approach), earliest deadline first, and Timothy's scheme (PriorityMeister).  His scheme did significantly better than the others at meeting the SLOs.  However, the DNC analysis guarantees the latency bound 100% of the time, while the SLOs only require a percentile such as 99%.  Depending on the workload characteristics, there can be significant differences between these guarantees.  Stochastic Network Calculus (SNC) can model the latency percentiles; however, the math is significantly more complex, and it had not previously been applied to this problem.  DNC is still better when bursts are assumed to be correlated or the system is in an adversarial setting.  Relaxing these assumptions (uncorrelated workloads), the SNC-based analysis permitted the system to admit 3x as many workloads as the DNC analysis.
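
To make the gap between a 100% guarantee and a 99th-percentile SLO concrete, here is a toy simulation of my own construction (the arrival parameters are invented): a bursty workload served FIFO at a fixed rate.  The worst observed latency, which a DNC-style bound must cover, can far exceed the 99th percentile that the SLO actually demands:

    import random

    random.seed(1)
    service_rate = 1000.0             # requests/second the server can handle
    service_time = 1.0 / service_rate

    # Bursty arrivals: idle gaps punctuated by back-to-back bursts of requests.
    t, arrivals = 0.0, []
    while t < 100.0:
        t += random.expovariate(5.0)  # an average of 5 bursts per second
        arrivals.extend(t for _ in range(random.randint(1, 200)))

    # FIFO queue draining at the fixed service rate.
    latencies, free_at = [], 0.0
    for a in arrivals:
        start = max(a, free_at)
        free_at = start + service_time
        latencies.append(free_at - a)

    latencies.sort()
    p99 = latencies[int(0.99 * len(latencies))]
    print(f"max: {latencies[-1] * 1000:.1f} ms, p99: {p99 * 1000:.1f} ms")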

Each workload has a curve of satisfying (bucket size, token rate) pairs.  Many systems require the user to provide the rate limit.  Other systems use simple heuristics, either finding the "knee of the curve" or selecting a rate limit as a multiple of the average rate.  However, for an individual workload, all pairs are equally satisfying; it is only when workloads are combined in a system that the choice of pair matters.  The configuration of the set of workloads on the system can be solved for using a system of linear equations.  Therefore, when placing new workloads, the controller can find successful placements, potentially reconfiguring the workloads already assigned.
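
As an illustration of that curve, here is a small sketch (my own; it assumes each request costs one token and uses an O(n^2) scan for clarity) that computes, for a given bucket size, the minimum token rate under which a request trace never exhausts the bucket.  Sweeping the bucket size traces out the satisfying pairs:

    def min_rate_for_bucket(timestamps, bucket):
        """Smallest token rate such that a (rate, bucket) token bucket never
        empties on this trace: over every window of requests, the number of
        requests must fit within bucket + rate * window_length."""
        rate = 0.0
        n = len(timestamps)
        for i in range(n):
            for j in range(i, n):
                work = j - i + 1                  # requests in the window
                dt = timestamps[j] - timestamps[i]
                if work > bucket and dt > 0:
                    rate = max(rate, (work - bucket) / dt)
        return rate

    # Made-up trace: a tight burst of four requests, then sparse arrivals.
    trace = [0.0, 0.001, 0.002, 0.003, 0.5, 1.0, 1.5]
    for b in (1, 2, 4):
        print(b, min_rate_for_bucket(trace, b))   # larger bucket, lower rate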

One extension would be addressing failure modes.  Currently, the system is simply assumed to run at degraded performance when components have failed.

Monday, May 1, 2017

Book Review: Multicore and GPU Programming: An Integrated Approach

I wanted to like this book, Multicore and GPU Programming: An Integrated Approach, and be able to use it in the classroom.  I teach a class that covers OpenMP, MPI, Cilk, and CUDA.  Yet I would not use this book, and I have told the publisher as much; they acknowledged my reasoning.

First, the author elects to use QtThreads as the base parallel programming approach.  He notes that he had considered pthreads and C++11 thread support, and comments that he rejected C++11 threads because its support was incomplete at the time; pthreads is left without comment.  That may be, but I have heard of and used those options, while QtThreads is something I had never considered.

Second, the book serves as an admirable API reference.  For a given function call, or set of calls, significant space is dedicated to how the call works and to examples illustrating it.  Structured Parallel Programming also covers many APIs, yet it maintains the feel of being about parallel programming in general rather than about the calls specifically.  However, that work covers a different set of APIs, so the two are not directly comparable.

Third, and this issue is really for the editors rather than the author, the typesetting on each page is poor.  Significant white space is left bordering the text on each page.  Furthermore, the code wraps, often not because the code itself is long, but because of long comments and comments placed on the same line as the code.  I understand that in programming these are stylistic choices; however, line wraps caused by long comments leave the text looking unprofessional.  I must assume that the author wrote the code separately and provided it for inclusion in the book, and that the editor failed to make allowances for typesetting it.

In conclusion, I wanted to like and use the book.  Whenever I speak with the publisher, they always direct me to it; I just have to hope for something else to come along.  Use it as a reference, perhaps, but I am cautious in my recommendation.

(This book was provided free by the publisher to review for possible use in the classroom.)