
Wednesday, May 3, 2017

PhD Defense - Meeting Tail Latency SLOs in Shared Networked Storage

Today I went to Timothy Zhu's PhD thesis defense.  His work is on achieving better sharing of data center resources to improve performance, and particularly to reduce tail latency.  He also TA'd for me last fall.

Workloads are generally bursty, and their characteristics differ.  Furthermore, each may have service level objectives (SLOs), and the system needs to meet these different objectives while sharing a variety of resources.  It is not sufficient to simply divide the bandwidth, nor can the system measure latency and react to it, as bursty workloads do not leave enough time to react.  While each workload has deadlines, it would be too complex to tag every request packet with a deadline for queuing and routing; instead, the deadlines can be used to assign priorities to requests.

The system is architected to have storage and network enforcement components to ensure QoS.  There is also a controller that receives an initial trace to characterize each workload, and that workload's SLOs.  The controller works through a sequence of analyses to successfully place each workload into the overall system.

Effectively, each workload is assigned a "bucket" of tokens, where the bucket size provides the ability to absorb bursts and the token refill rate covers the workload's request rate.  Short, bursty workloads receive large buckets and low rates, while steady workloads with few bursts receive small buckets and high rates.  In both cases, only when the bucket is empty is the workload rate-limited in its requests, and those requests receive the lowest priority.  Deterministic Network Calculus (DNC) is used to model the worst-case queuing scenarios.  This plots two curves, the arrival (request) curve and the service curve, both as tokens as a function of the time window (dt); the maximum horizontal distance between the curves gives the maximum latency.
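As a rough illustration of the token-bucket mechanism described above (the class and parameters are my own sketch, not code from the thesis):

    import time

    class TokenBucket:
        """Token bucket: the size absorbs bursts, the rate covers the steady request rate."""
        def __init__(self, size, rate):
            self.size = size                  # maximum tokens (burst allowance)
            self.rate = rate                  # tokens added per second
            self.tokens = size
            self.last = time.monotonic()

        def admit(self, cost=1):
            """True -> request keeps its normal priority; False -> bucket is empty,
            so the request is rate-limited and dropped to the lowest priority."""
            now = time.monotonic()
            self.tokens = min(self.size, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False

    bursty = TokenBucket(size=1000, rate=50)   # bursty workload: large bucket, low rate
    steady = TokenBucket(size=20, rate=500)    # steady workload: small bucket, higher rate

For a token-bucket arrival curve b + r*t served by a rate-latency curve R*(t - T) with r <= R, that horizontal gap, and hence the worst-case delay, works out to T + b/R.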

Using three traces (DisplayAds, MSN, and LiveMaps), they compared three approaches: Cake (a reactive approach), earliest deadline first, and Timothy's scheme (PriorityMeister).  His scheme did significantly better than the others at meeting the SLOs.  However, the DNC analysis guarantees 100% of requests rather than the SLO's 99% (or other percentile), and depending on the workload characteristics there can be a significant difference between these guarantees.  Stochastic Network Calculus (SNC) can model latency percentiles; however, the math is significantly more complex, and it had not previously been applied to this problem.  DNC is still better when bursts are assumed to be correlated or the system is in an adversarial setting.  Relaxing those assumptions (uncorrelated workloads), the SNC-based analysis allowed the system to admit 3x as many workloads as the DNC analysis.
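The gap between a worst-case (100%) bound and a 99th-percentile target is easy to see on a bursty latency distribution; a toy illustration with synthetic numbers (not the actual traces):

    import numpy as np

    rng = np.random.default_rng(0)
    latencies = np.concatenate([
        rng.exponential(1.0, 99_000),    # typical requests (ms)
        rng.exponential(50.0, 1_000),    # requests caught behind a burst (ms)
    ])
    print("99th percentile:", np.percentile(latencies, 99))  # what a 99% SLO must cover
    print("maximum:        ", latencies.max())               # what a worst-case bound must cover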

Each workload has a curve of satisfying (bucket size, token rate) pairs.  Many systems require the user to provide the rate limit, while others use simple heuristics, such as finding the "knee of the curve" or selecting a rate limit as a multiple of the average rate.  However, for an individual workload all pairs on the curve are satisfying; the choice only matters when workloads are combined on a shared system.  The configuration for the set of workloads on the system can be solved for as a system of linear equations.  Therefore, when placing a new workload, the controller can find a successful placement, potentially reconfiguring the workloads already assigned.
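A sketch of the kind of feasibility check the controller could run, assuming each workload's (rate, bucket) trade-off curve has been linearized as b_i >= a_i - c_i*r_i and the shared system caps total rate and total buffering; the coefficients, capacities, and use of an LP solver here are my own illustration:

    from scipy.optimize import linprog

    # Linearized per-workload trade-off: bucket_i >= a_i - c_i * rate_i (illustrative numbers).
    workloads = [
        {"a": 1200.0, "c": 4.0},   # bursty workload: needs a big bucket unless given a high rate
        {"a":  300.0, "c": 0.5},   # steady workload
    ]
    RATE_CAP, BUCKET_CAP = 900.0, 1500.0       # assumed shared-system capacities

    n = len(workloads)
    c = [1.0] * n + [0.0] * n                  # variables: [rates..., buckets...]; minimize total rate
    A_ub, b_ub = [], []
    for i, w in enumerate(workloads):
        row = [0.0] * (2 * n)
        row[i] = -w["c"]
        row[n + i] = -1.0
        A_ub.append(row)                       # -c_i*r_i - b_i <= -a_i  (i.e. b_i >= a_i - c_i*r_i)
        b_ub.append(-w["a"])
    A_ub.append([1.0] * n + [0.0] * n); b_ub.append(RATE_CAP)     # sum of token rates
    A_ub.append([0.0] * n + [1.0] * n); b_ub.append(BUCKET_CAP)   # sum of bucket sizes

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * n))
    print("placement feasible" if res.success else "cannot place this mix", res.x)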

One extension would be addressing failure modes; currently, the system is simply assumed to run at degraded performance when components have failed.

Monday, March 21, 2016

PhD Defense - Simple DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Subsystems

Vivek Seshadri gave his PhD defense today, covering how different memory resources have different granularities of access and therefore need different management techniques.  These techniques come out of understanding what the hardware could do, without necessarily identifying common features in existing applications that would require or benefit from them.

Page Overlays / Overlay-on-Write: provide the ability to assign physical addresses at sub-page granularity (these sub-page mappings are called overlays).  This reduces the cost of sparse copy-on-write: in effect, assign a fresh sub-page unit and copy only to that location.  On every access, the overlay table is checked in parallel with the normal translation to determine whether to use the normal mapping or the overlay location.
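A toy model of that check, assuming cache-line-granularity overlays and a per-page bit vector consulted alongside the normal translation (the structure is illustrative, not the exact hardware design):

    CACHE_LINE = 64
    LINES_PER_PAGE = 4096 // CACHE_LINE        # 64 sub-page units in a 4 KB page

    class Page:
        def __init__(self, base_frame):
            self.base_frame = base_frame                       # normal physical frame
            self.overlay_bits = [False] * LINES_PER_PAGE       # which lines are overlaid
            self.overlay_addr = [None] * LINES_PER_PAGE        # where they were redirected

        def overlay_on_write(self, offset, fresh_line_addr):
            """Sparse copy-on-write: redirect only the written cache line."""
            line = offset // CACHE_LINE
            self.overlay_bits[line] = True
            self.overlay_addr[line] = fresh_line_addr

        def translate(self, offset):
            """Checked on every access: overlay location if present, else normal mapping."""
            line = offset // CACHE_LINE
            if self.overlay_bits[line]:
                return self.overlay_addr[line] + offset % CACHE_LINE
            return self.base_frame + offset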

Gather-Scatter DRAM: provide support for requesting only a subset of a cache line.  First, shuffle the data within a cache line so that the same subset of multiple cache lines maps to different chips in DRAM.  Second, introduce additional logic (just a few gates) in DRAM that computes a modified address, where the default pattern (stride 1) is the normal, unmodified access.
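An illustrative shuffle with the property that matters: both a normal stride-1 access to one cache line and a strided access to the same field of several cache lines touch every chip exactly once (the actual mapping logic in the thesis differs in detail):

    NUM_CHIPS = 8   # e.g., one 8-byte word of a 64-byte cache line per chip

    def chip_for(cacheline_index, field_index):
        """Rotate the fields of each cache line by its index."""
        return (field_index + cacheline_index) % NUM_CHIPS

    # Default stride-1 access: all fields of one cache line land on distinct chips.
    assert sorted(chip_for(3, f) for f in range(NUM_CHIPS)) == list(range(NUM_CHIPS))

    # Gather: field 2 of 8 consecutive cache lines also lands on distinct chips,
    # so the subset can be fetched with one access per chip instead of 8 full lines.
    assert sorted(chip_for(j, 2) for j in range(NUM_CHIPS)) == list(range(NUM_CHIPS))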

RowClone + BuddyDRAM: can DRAM speed up memcpy (and other bulk memory operations)?  First, by opening one row after another, the bitlines take the value of the first row and then write it into the second row.  More complex is opening three rows simultaneously, which results in a bit-wise operation across them: final = C(A | B) | ~C(A & B).  By controlling C, bulk bitwise operations are possible.  Using these techniques, the system can exceed the memory bandwidth for these operations.
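A quick check of that identity, with Python integers standing in for rows of bits (my own illustration):

    import random
    random.seed(1)
    MASK = (1 << 64) - 1
    A, B, C = (random.getrandbits(64) for _ in range(3))

    def triple_activate(a, b, c):
        """final = C(A|B) | ~C(A&B): each bit settles to the majority of the three rows."""
        return ((c & (a | b)) | (~c & (a & b))) & MASK

    assert triple_activate(A, B, MASK) == A | B                     # control row all ones  -> OR
    assert triple_activate(A, B, 0) == A & B                        # control row all zeros -> AND
    assert triple_activate(A, B, C) == (A & B) | (A & C) | (B & C)  # bitwise majority in general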

Dirty-Block Index: the problem is that if the source row is dirty in the cache, it needs to be written back before the previous techniques can be used.  The DBI provides a fast lookup to determine whether, and where, any blocks of a row are dirty.
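A toy sketch of the lookup this enables, with dirty bits grouped by DRAM row so a single query answers whether a source row still needs any writebacks (simplified, not the actual structure):

    class DirtyBlockIndex:
        """Track dirty cache blocks grouped by DRAM row."""
        def __init__(self, blocks_per_row=128):
            self.blocks_per_row = blocks_per_row
            self.rows = {}                         # dram_row -> bit vector of dirty blocks

        def mark_dirty(self, dram_row, block):
            self.rows[dram_row] = self.rows.get(dram_row, 0) | (1 << block)

        def dirty_blocks(self, dram_row):
            """One lookup before an in-DRAM copy: which blocks must be written back first?"""
            bits = self.rows.get(dram_row, 0)
            return [b for b in range(self.blocks_per_row) if (bits >> b) & 1]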

These techniques are interesting, but as the candidate noted, they are in effect solutions in search of a problem.  And with DRAM being commodity hardware, it is difficult to envision these techniques being adopted without further work.

Wednesday, December 16, 2015

PhD Defense - Diagnosing performance limitations in HPC applications

Kenneth Czechowski defended his dissertation work this week.

He is trying to develop a science around the normal art of diagnosing low-level performance issues, such as the classic processing-a-sorted-array example or an i7 loop performance anomaly.  I have much practice with this art, and I would really appreciate having more formalism behind these efforts.

One effort tries to identify the cause of performance issues using the hardware performance counters.  These counters are not well documented, and the existing tools are low-level.  Instead, he developed a meta-tool that intelligently iterates over the counters, conducting a hierarchical event-based analysis: it starts with six top-level performance counters and then drills into more detailed counters related to the observed issue, for example to diagnose why the core is unable to retire the full bandwidth of 4 micro-ops per cycle.
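A rough sketch of what such a hierarchical drill-down could look like, driven here by Linux perf with generic events; the event hierarchy, thresholds, and the ./app_under_test binary are my own placeholders, not the tool from the defense:

    import subprocess

    # Which detailed events to collect next, keyed by the dominant top-level symptom (illustrative).
    HIERARCHY = {
        "branch-misses": ["branch-instructions", "branch-misses"],
        "cache-misses":  ["cache-references", "LLC-loads", "LLC-load-misses"],
    }

    def perf_stat(events, cmd):
        """Run `perf stat` in CSV mode and return {event: count}."""
        out = subprocess.run(
            ["perf", "stat", "-x", ",", "-e", ",".join(events), "--"] + cmd,
            capture_output=True, text=True,
        ).stderr                                   # perf stat reports on stderr
        counts = {}
        for line in out.splitlines():
            fields = line.split(",")
            if len(fields) > 2 and fields[0].strip().isdigit():
                counts[fields[2]] = int(fields[0])
        return counts

    cmd = ["./app_under_test"]                     # hypothetical workload
    top = perf_stat(["cycles", "instructions", "branch-misses", "cache-misses"], cmd)
    if top["instructions"] / top["cycles"] < 2.0:  # assumes both counters are supported
        worst = max(("branch-misses", "cache-misses"), key=lambda e: top.get(e, 0))
        print(worst, perf_stat(HIERARCHY[worst], cmd))   # drill into the dominant symptom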

Even when a tool can report the specific counters that indicate "bad" behavior, the next problem is that observing certain "bad" behaviors, such as bank conflicts, does not always correlate with performance loss, since the affected operations must be on the critical path to matter.

The final approach takes the program and builds artificial versions of its hot code, for example removing the memory or compute operations from a loop body.  For some applications, a few loops account for most of the execution time.  These loops can then be perturbed in ways that force certain resources to be exercised further.  For example, the registers in each instruction are scrambled so that the dependency graph changes, increasing or decreasing the ILP while the instruction mix itself is unchanged.
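A small illustration of why register scrambling changes ILP while leaving the instruction mix alone: compute the longest dependency chain for a toy encoding of a loop body before and after scrambling (not the actual tool):

    def ilp(instructions):
        """instructions: list of (dest_reg, [src_regs]).
        ILP = instruction count / length of the longest dependency chain."""
        depth, longest = {}, 0
        for dest, srcs in instructions:
            d = 1 + max((depth.get(s, 0) for s in srcs), default=0)
            depth[dest] = d
            longest = max(longest, d)
        return len(instructions) / longest

    # Original body: one serial chain through r1.
    serial = [("r1", ["r1", "r2"]), ("r1", ["r1", "r3"]),
              ("r1", ["r1", "r4"]), ("r1", ["r1", "r5"])]
    # Same four adds with scrambled registers: independent chains, higher ILP.
    scrambled = [("r1", ["r6", "r2"]), ("r7", ["r8", "r3"]),
                 ("r9", ["r0", "r4"]), ("r10", ["r11", "r5"])]

    print(ilp(serial), ilp(scrambled))   # 1.0 vs 4.0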

Wednesday, July 15, 2015

PhD Defense - Yacin Nadji - Understanding DNS-based Criminal Infrastructure for Informing Takedowns

Yacin Nadji, a PhD candidate in security at Georgia Tech, successfully defended his dissertation work today.

How does one disable a botnet?  It is difficult to identify and repair individually infected machines.  Therefore, targeting the command and control servers can instead break the linkage between the infected machines and the malicious controller.

Manual identification is time-consuming and can lead to collateral damage.  Automation is required to enumerate the machines, evaluate the threat, identify the takedown mechanism, and determine the potential collateral damage from the takedown.  The tools were tested against a dataset of DNS registrations over time (from Damballa), which serves as a sample of the Internet.

APTs (advanced persistent threats) are particularly troublesome, as these machines persist and change their presence over time at the direction of the botnet controller.  The C&C machines also attempt to go dark by changing their IP resolution to localhost (127.0.0.1), minimizing their observable signature by only generating network traffic when an attack is required.  This leads to a suite of detection features for identifying the actual C&C machines, such as short-lived IP addresses, domain names that resolve to NULL or localhost, and IP addresses that vary across a diverse set of infrastructure and geographic locations.
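A sketch of how such features might be computed from passive DNS records; the record schema, domain, and feature definitions here are illustrative:

    from collections import defaultdict
    from datetime import datetime

    # Passive DNS records: (domain, resolved_ip, asn, country, first_seen) -- assumed schema.
    records = [
        ("evil-c2.example", "203.0.113.7",  64500, "RU", datetime(2015, 1, 3)),
        ("evil-c2.example", "127.0.0.1",        0, "--", datetime(2015, 1, 9)),
        ("evil-c2.example", "198.51.100.2", 64501, "BR", datetime(2015, 2, 1)),
    ]

    def features(rs):
        ips = [r[1] for r in rs]
        days = sorted(r[4] for r in rs)
        lifetimes = [(b - a).days for a, b in zip(days, days[1:])]
        return {
            "short_lived_ips":  sum(1 for d in lifetimes if d < 7),    # IP churn
            "parked_localhost": sum(ip == "127.0.0.1" for ip in ips),  # "gone dark"
            "asn_diversity":    len({r[2] for r in rs}),               # infrastructure spread
            "geo_diversity":    len({r[3] for r in rs}),               # geographic spread
        }

    by_domain = defaultdict(list)
    for r in records:
        by_domain[r[0]].append(r)
    print({d: features(rs) for d, rs in by_domain.items()})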

A machine learning classifier is then developed, initially from a ground truth of 50k APT records.  The features are scored and then run through different models using a 90/10 train/test split on the ground-truth dataset.  The following results are only approximate, as I was trying to copy them during the presentation (a sketch of this kind of evaluation follows the table).

Model                       Accuracy (%)   True Positive Rate (%)   False Positive Rate (%)
Naive Bayes                    70                 91                      40
General Linear Regression      98                 93                       1
Random Forest                  99                 97                       0.04
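A minimal sketch of that evaluation setup (90/10 split, a standard classifier, TPR/FPR from the confusion matrix), using scikit-learn on placeholder features rather than the actual ground-truth data:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix

    # Placeholder features/labels standing in for the 50k-record APT ground truth.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 4))
    y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=5000) > 0).astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.10, random_state=0)
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    pred = model.predict(X_te)

    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    print("accuracy:", accuracy_score(y_te, pred))
    print("TPR:", tp / (tp + fn), "FPR:", fp / (fp + tn))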

The classifier is then applied to the full dataset of 300 million records.  These are grouped into 1.1 million clusters, of which ~700 are above 0.8 confidence of being APTs.  At 90% confidence, the clusters all contain fewer than 1000 domain names.

How then do botnets attempt to evade detection?  The infected machines generally use DNS to look up their C&C machines; however, the lookups can be occasionally spurious or directed at legitimate IPs.  The machines could instead communicate peer-to-peer, but this requires active connections that are often blocked or restricted by networks (as they are for "legitimate" uses such as BitTorrent).

The suite of tools also operates on the malware running in VMs, where it works through possible takedown mechanisms and observes the infection's response to each takedown, thereby identifying other, possibly unused, communication approaches.  For most infections this enumeration takes on the order of hours; however, some can take days.

Open Problems:

  • Attributing the botnet to physical entities
  • Targeting P2P-based botnets