Wednesday, July 15, 2015

PhD Defense Yacin Nadji - Understanding DNS-based Criminal Infrastructure for Informing Takedowns

Yacin Nadji, a PhD candidate in security at Georgia Tech, successfully defended his dissertation work today.

How does one disable a botnet?  It is difficult to identify and repair individually infected machines.  Therefore, targeting the command and control servers can instead break the linkage between the infected machines and the malicious controller.

Manual identification is time-consuming and can lead to collateral damage.  Automation is required to enumerate the machines, evaluate the threat, identify the takedown mechanism, and determine the potential collateral damage of the takedown.  The tools were tested on a dataset of DNS registrations over time (from Damballa), which serves as a sample of the Internet.

APTs (advanced persistent threats) are particularly troublesome, as they are machines that persist and change their presence over time according to the botnet controller.  The C&C machines also attempt to go dark by changing their IP resolution to localhost (127.0.0.1), thereby minimizing their observed signature by only having network traffic when an attack is required.  This leads to a suite of detection features that can identify the actual C&C machines, such as having short-lived IP addresses, changing the domain name to NULL or localhost, and varying the IP address across a diverse set of infrastructure and geographic locations.
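To make those features concrete, here is a minimal sketch of how they might be computed from a domain's resolution history.  The record layout (`Resolution`), the field names, and the three features chosen are my own assumptions for illustration; the talk did not show the actual feature extraction.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical record: one observed resolution of a domain to an IP address.
struct Resolution {
    std::string ip;
    std::int64_t first_seen;  // assumed to be seconds since some epoch
    std::int64_t last_seen;
};

// Hypothetical feature vector for a single domain.
struct DomainFeatures {
    std::size_t distinct_ips;      // how widely the IP address varies
    bool parked_on_localhost;      // ever resolved to 127.0.0.1 ("going dark")
    double mean_lifetime_seconds;  // short lifetimes suggest short-lived IPs
};

DomainFeatures extract_features(const std::vector<Resolution>& history) {
    std::set<std::string> ips;
    bool localhost = false;
    double total_lifetime = 0.0;
    for (const auto& r : history) {
        ips.insert(r.ip);
        if (r.ip == "127.0.0.1") localhost = true;
        total_lifetime += static_cast<double>(r.last_seen - r.first_seen);
    }
    const double mean =
        history.empty() ? 0.0 : total_lifetime / static_cast<double>(history.size());
    return {ips.size(), localhost, mean};
}
```

A real system would also need the geographic and infrastructure diversity of the addresses, which this sketch leaves out.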

The next step is to develop a machine learning algorithm, initially with a ground truth of 50k records of APTs.  The features are scored and then run through different models using a 90/10 split of the ground truth dataset.  The following results are only approximate, as I was trying to copy them during the presentation.

Model                      Accuracy (%)  True Positive Rate (%)  False Positive Rate (%)
Naive Bayes                70            91                      40
General Linear Regression  98            93                      1
Random Forest              99            97                      0.04
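As a reminder of what the three columns mean, here is a small sketch that derives them from confusion-matrix counts.  The struct and function names are my own, and any counts passed in are made up for illustration; they are not from the talk.

```cpp
// Counts from comparing predictions against ground truth.
struct ConfusionCounts {
    long true_pos;
    long false_pos;
    long true_neg;
    long false_neg;
};

struct Rates {
    double accuracy;             // fraction of all records classified correctly
    double true_positive_rate;   // fraction of actual positives that were flagged
    double false_positive_rate;  // fraction of actual negatives that were flagged
};

Rates compute_rates(const ConfusionCounts& c) {
    const double tp = static_cast<double>(c.true_pos);
    const double fp = static_cast<double>(c.false_pos);
    const double tn = static_cast<double>(c.true_neg);
    const double fn = static_cast<double>(c.false_neg);
    return {(tp + tn) / (tp + fp + tn + fn), tp / (tp + fn), fp / (fp + tn)};
}
```

Note that a model can have a high true positive rate and still be unusable if its false positive rate is high, which is why all three numbers are reported.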

The models are then applied to the full dataset of 300 million records.  These are grouped into 1.1 million clusters, of which ~700 are above 0.8 confidence of being APTs.  At 90% confidence, the clusters all contain fewer than 1,000 domain names.

How then do botnets attempt to evade detection?  The infected machines generally use DNS to look up their C&C machines; however, the lookups can occasionally be spurious or resolve to legitimate IPs.  The machines could be peer-to-peer, but this requires active connections that are often blocked or restricted by networks (against "legitimate" uses such as BitTorrent).

The suite of tools also operates on the malware running in VMs, whereby it works through possible takedown mechanisms and then observes the response of the infection to each takedown, thereby identifying other, possibly unused, communication approaches.  For most infections, it takes on the order of hours to enumerate the approaches; however, some can take days.

Open Problems:

  • Attributing the botnet to physical entities
  • Targeting P2P-based botnets

Wednesday, June 17, 2015

Conference Attendance FCRC - Day 5 - Plenary Summary

Today featured a plenary talk, which pulls together all of the conference attendees.  Sunday's talk was based in databases, with Michael Stonebraker speaking on his Turing Award-winning work.  Monday's talk discussed interdisciplinary work, primarily centered in CS theory, and was given by Andrew Yao (a prior Turing Award winner).  On Tuesday, Olivier Temam discussed neural networks in hardware, which focused on his work and efforts to better model or mimic the capabilities of the brain.

The F# Path to Relaxation -
There are opportunities to introduce new work toward relaxing and improving.  Or perhaps create opposing camps.  Thesis <-> Antithesis ==> synthesis.  Or Functional <=> Interop.  Back in 2003, functional languages were isolated, non-interoperable, using their own VMs.  F# (along with Scala, Swift, ...) instead seeks to have an exosystem, being the external industry-standard runtimes.  Another tension is between Enterprise and Openness.  So F# is open and cross-platform.  Tools are available for Android and iOS, as well as packages for Linux.

Functional <=> Objects
Thus embrace objects, without being object-oriented.  Some cases in the cross-product of the expected features for objects and functions require particular care for synthesis.

Circularities and Modularity in the Wild
Lambdas, generics, etc. are clearly being embraced in modern language design.  However, circular type dependencies are unfortunately also widely present.  Languages need to enforce acyclicity.

Pattern Matching <=> Abstraction
How does the language support the functional concept of pattern matching, when you want to include type abstraction?  Alas, the speaker skipped quickly past the solution.

Code <=> Data
Most development is directed toward providing tools for the information revolution.  There is exponential growth in open APIs for accessing data from the internet.  This data then comes with dynamic types, where the types are only known once the data (or schema) has been accessed.  The type creation can also enable blending code for other languages into the F# environment.  For example, the support can allow opening CSV or JSON files and having types for the data.  This feature is, by far, the most exciting and interesting of the presentation.  Not quite worth the price of admission, but clearly a great development.

Applied PL design comes from the synthesis at the heart of these contradictions.  This tension also is part of the proliferation of languages.

Conference Attendance FCRC - Day 4 - PLDI

PLDI starts off this morning with Concurrency.  As a student volunteer, I worked this session and was limited as to what I could note about the content itself.

Composing Concurrency Control - Introducing more diverse and finer-grained locking mechanisms.  The tool works to develop a locking strategy that will guarantee serializability, abort-safety, opacity, and deadlock-freedom.  It particularly works to integrate both locking schemes as well as transactional memory.

In the afternoon, I can dive into the semantics of the C language.

A Formal C Memory Model Supporting Integer-Pointer Casts - What optimizations are possible in the presence of pointers, pointer arithmetic, and integer-pointer casts?  For example, can constants be propagated or is their location potentially targetable by a pointer?  Other optimizations are explored in their paper.  In practice, as code can generate arbitrary addresses, how can the compiler reason about any specific location in memory?

Defining the Undefinedness of C - This work extends their prior work that gave semantics to the defined behavior of C programs, an extension that required doubling the rules describing the semantic behavior.  Fundamentally, any instance of undefined behavior that will definitely be encountered in an execution invalidates that entire execution.  For example, a program that divides by zero after a printf may validly crash before the printf.  The following code example is also undefined.
return (x = 1) + (x = 2);
Many of these cases are dependent on runtime behavior, and therefore a tool that can help identify them is valuable.

Monday, June 15, 2015

Conference Attendance FCRC - Day 3 - PLDI / ISCA

PLDI itself began this morning and after the welcome, we had three distinguished papers.  I am excited that two of these works focused on code performance and compilers, rather than higher-level programming language issues:

Automatically Improving the Accuracy of Floating Point Expressions - How do you address rounding error in your code?  Using formal numerical methods, an expert can reduce the errors.  But rather than requiring you to be an expert, they wrote a tool that uses heuristics to apply these methods.  For example, what error do you have when evaluating the quadratic formula?  Based on just the value of b, there are different expressions that have much lower error.
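To see the quadratic formula problem for yourself, here is a minimal sketch comparing the textbook expression with the classic conjugate rewrite for one root.  This is the standard numerical-methods transformation, shown for illustration; I am not claiming it is the exact output of the tool, and the function names are mine.

```cpp
#include <cmath>

// Textbook form of the "plus" root of a*x^2 + b*x + c = 0.
// When b > 0 and b*b is much larger than 4*a*c, -b and the square root
// nearly cancel, so most of the significant digits are lost.
double root_naive(double a, double b, double c) {
    return (-b + std::sqrt(b * b - 4 * a * c)) / (2 * a);
}

// Algebraically equivalent form, obtained by multiplying numerator and
// denominator by the conjugate.  For b > 0 the denominator adds two values
// of the same sign, so there is no cancellation.
double root_rewritten(double a, double b, double c) {
    return (2 * c) / (-b - std::sqrt(b * b - 4 * a * c));
}
```

With a = 1, b = 1e8, c = 1, the true root is very close to -1e-8; the rewritten form lands near it while the textbook form is visibly off.  For b < 0 the situation flips, which is why the choice of expression depends on the value of b.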

The tool, Herbie, estimates the accuracy of the expression and then attempts to improve it using algebraic transformations (from a database of 120 rules).  Having generated many candidate expressions, the tool then uses dynamic programming to select an appropriate set of expressions across the input space.  First, it matches the example cases from Hamming's Numerical Methods for Scientists and Engineers.  Furthermore, it has found bugs in existing projects.

Diagnosing Type Errors with Class - SHErrLoc works to identify the likely cause of type errors.  Expressions are given constraints.  These constraints form a graph, which is analyzed for failing paths in the graph.  The tool then attempts to localize the failure and identify the minimal change to the constraints to satisfy the graph.  Even though it is not Haskell specific, it is more accurate at detecting type errors in Haskell programs than related work.

Provably Correct Peephole Optimizations with Alive - Compilers are buggy.  For example, InstCombine is an LLVM pass that exploits the LLVM IR to improve performance, and it contains many hand-rolled transformations.  The authors propose a DSL that describes peephole optimizations, where the DSL is basically a simplified LLVM IR annotated with preconditions for the transformation.  The expression describing the transformation is then passed through constraint checkers to verify that it is correct.  Finally, C++ code is generated for that transformation.

To be correct, a transformation must not introduce new undefined behaviors, must still produce the same result, and must properly update the memory state.  Initially, they proved the optimizations in InstCombine correct or identified bugs, and eventually could replace the pass with the generated version.  Furthermore, Alive was able to strengthen the post-conditions for many instructions (for example, identifying whether an operation will overflow).
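To illustrate why preconditions matter, here is a toy sketch that checks a rewrite by brute force over all 8-bit signed values.  This is only a stand-in I wrote for illustration: Alive uses constraint solving rather than enumeration, and it also reasons about undefined behavior and memory, which this sketch does not.  All names here are hypothetical.

```cpp
#include <functional>

using Fn = std::function<int(int)>;
using Pred = std::function<bool(int)>;

// Compares `before` and `after` on every 8-bit signed input that satisfies
// the precondition `pre`.  Returns false on the first mismatch.
bool rewrite_is_valid(const Pred& pre, const Fn& before, const Fn& after) {
    for (int x = -128; x <= 127; ++x) {
        if (pre(x) && before(x) != after(x)) return false;
    }
    return true;
}

// Candidate rewrite: replace a signed division by two with a right shift.
int div_by_two(int x) { return x / 2; }
// Right-shifting a negative int is implementation-defined before C++20
// (arithmetic shift on mainstream compilers); either way it differs from
// division, which truncates toward zero.
int shift_right_one(int x) { return x >> 1; }

bool always(int) { return true; }
bool non_negative(int x) { return x >= 0; }
```

Calling `rewrite_is_valid(always, div_by_two, shift_right_one)` rejects the rewrite because of negative inputs such as -1, while the same rewrite guarded by `non_negative` passes.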

In the afternoon, I was visiting the other side of the conference center with ISCA.  One paper of note there:

A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing - They showed that (somehow, as I missed this point) by integrating simple cores closer to the memory system, they pushed past the memory bandwidth of HMC (640GB/s) to about 2.3TB/s. They focused on two pieces: an updated programming model for graphs and a prefetching system within their cores. The model introduced async remote procedure calls that are sent to the Tesseract core near the data. These messages accumulate in a queue until either a barrier is reached or the queue is full. While they accumulate, the prefetcher is requesting the appropriate data so that when the function fires, the data is available. The prefetcher is able to operate on two separate streams: the local processing that is sequential and generating the remote requests, and the remote requests received at this node.
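As I understood the queueing behavior, it can be sketched in software roughly as follows.  This is purely my own illustration with hypothetical names (`RemoteCallQueue`, `enqueue`, `barrier`); it models only the "accumulate until full or barrier" rule and not the hardware or the prefetcher.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-core queue of deferred remote calls.
class RemoteCallQueue {
public:
    explicit RemoteCallQueue(std::size_t capacity) : capacity_(capacity) {}

    // Queue a call; once the queue is full, run the whole batch.
    void enqueue(std::function<void()> call) {
        pending_.push_back(std::move(call));
        if (pending_.size() >= capacity_) drain();
    }

    // A barrier forces any calls still queued to run.
    void barrier() { drain(); }

    std::size_t pending() const { return pending_.size(); }

private:
    void drain() {
        // Swap the batch out first so the queue is empty while calls run.
        std::vector<std::function<void()>> batch;
        batch.swap(pending_);
        for (auto& call : batch) call();
    }

    std::size_t capacity_;
    std::vector<std::function<void()>> pending_;
};
```

The delay between enqueueing and firing is the window in which a prefetcher could fetch the data each call will touch.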

Sunday, June 14, 2015

Conference Attendance FCRC - Day 1 - WCAE / SPAA

I am in Portland for the next 5 days attending the Federated Computing Research Conference, which is a vast co-location of the top ACM conferences.  For my part, this includes ISCA and PLDI.  Following registration and checking in as a student volunteer, I ducked into the Workshop on Computer Architecture Education (WCAE).  There were a couple of presentations on different tools being used to teach architectural concepts.

Following the morning break, it was time for the keynote for SPAA, given by Hans-J Boehm, titled, "Myths and Misconceptions about Threads".  For example,
#include "foo"
f() {
    foo_t x, a;
    ...
    x = a; // Is this atomic?
}
This led to the discussion of 'is assignment atomic?' and the audience tossed out increasingly complex examples of how it is not.  Fundamentally, the programming model is becoming "data-race free", and the specifications can treat races as "undefined behavior".  In general, a sequential program will view its execution as following the sequential consistency model, even if the hardware is executing the code with a weaker model.

What then should the programming language provide for the atomics / synchronization?  Recall that the compiler has considerable flexibility for emitting the final program.  With data-race free code, the compiler is treating anything that is not an atomic as part of sequential code and therefore subject to any reordering that would still be valid sequentially.  The following example is how this can go awry.  x is a global, and the compiler could substitute x anyplace tmp is, because the model assumes "there are no races on x".  And if the program does happen to modify x in a racy manner, then the behavior is undefined.
bool tmp = x;
if (tmp) f = new ...
...
if (tmp) f->foo();
Gah!  The programmer wanted to take a snapshot of the global value, but ended up with a different result.  So the atomics are becoming more than just the "hacker's" way to quickly update shared values, and instead can be seen as annotations to the compiler to clearly encapsulate the shared state.  This means the type of x is not bool, but atomic<bool>.  Then the compiler knows the programmer's (likely) intent of this code.  And this then rolls back to a deeper question of my research, "What could the system do more efficiently if it knew more about the programmer's intent?"
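Here is a minimal sketch of the atomic version of that snippet.  The `Widget` type and the function name are hypothetical stand-ins for the elided parts of the original example.

```cpp
#include <atomic>
#include <memory>

// Shared flag that other threads may write; declaring it atomic tells the
// compiler it cannot assume "there are no races on x".
std::atomic<bool> x{false};

// Hypothetical stand-in for whatever f pointed to in the original snippet.
struct Widget {
    int calls = 0;
    void foo() { ++calls; }
};

// Returns how many times foo() ran (0 or 1).
int use_snapshot() {
    // One atomic load.  The compiler may not re-read x in place of tmp, so
    // both checks below are guaranteed to see the same value.
    const bool tmp = x.load();
    std::unique_ptr<Widget> f;
    if (tmp) f = std::make_unique<Widget>();
    // ... other work; x may change concurrently, tmp does not ...
    if (tmp) f->foo();
    return f ? f->calls : 0;
}
```

With the plain bool, a second read of x could see true after the first saw false, and f->foo() would dereference a null pointer.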

Wednesday, May 13, 2015

Course Design Series (Post 1 of N): Why Study Programming Languages

Obviously, there are some practitioners and researchers who base their living on programming language design.  But what of the rest of us?  The Communications of the ACM ran a short article on why: Teach foundational language principles.  Language design is increasingly focused on programmer productivity and correctness.  As programmers, are we aware of the new features and programming paradigms?

Particularly, let's look at three aspects of programming languages: contracts, functional languages, and type systems.  By introducing each to budding programmers, we improve their habits, just as I presently include comments and other style components in project grades.  First, contracts provide a well-defined mechanism for specifying the requirements of each component in a system.  Not just commenting each component and interface, but doing so in a manner that permits the compiler / runtime system to verify and enforce it.
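As a rough illustration of the contract idea, here is a sketch that uses plain assert as a stand-in for real contract support.  This only gives runtime checking in debug builds, and the function is a made-up example, but it shows the requirement being stated in a form the system can enforce rather than in a comment.

```cpp
#include <cassert>
#include <vector>

// Contract (hypothetical example):
//   precondition:  values is non-empty
//   postcondition: the result is >= every element of values
int max_checked(const std::vector<int>& values) {
    assert(!values.empty() && "precondition: values must be non-empty");

    int best = values.front();
    for (int v : values) {
        if (v > best) best = v;
    }

    for (int v : values) {
        assert(best >= v && "postcondition: result bounds every element");
    }
    return best;
}
```

A caller that passes an empty vector is stopped at the precondition in a debug build, instead of silently reading past the end.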

Functional languages are well known, and whether or not you may ever use one, they teach valuable techniques and mental models regarding programming.  A programmer should use pure and side-effect free procedures, rather than interleaving logic across different components.  Other practices, such as unit testing, also trend toward clean interfaces; however, being forced to obey these rules via the underlying language is great practice.

Type systems are treated by the authors as a panacea, as they call for language designers to be "educated in the formal foundations of safe programming languages - type systems."  Panacea or not, reasoning about functionality within the bounds of types leads to code that is clearer and more maintainable.  Even in a weak language, such as C, one can use enums rather than everything being "int".
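For instance, here is a small sketch of the enum point, written with a C++ scoped enum (which is stricter than a plain C enum).  The `Signal` type and `describe` function are invented for illustration.

```cpp
#include <string>

// A dedicated type instead of "0 means red, 1 means yellow, 2 means green".
enum class Signal { Red, Yellow, Green };

// The parameter type documents and enforces what may be passed in:
// describe(2) does not compile, while describe(Signal::Green) does.
std::string describe(Signal s) {
    switch (s) {
        case Signal::Red:    return "stop";
        case Signal::Yellow: return "caution";
        case Signal::Green:  return "go";
    }
    return "unknown";  // only reachable if a value was forced in by a cast
}
```

Many compilers can also warn when a switch over the enum misses a case, which an int-based version cannot offer.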

As these aspects are being increasingly used in various forms in current language design, programmers need to be knowledgeable of them in order to be effective and use the languages to their full potential.  It is therefore incumbent on me, and other educators, to appropriately include these aspects when we teach about programming languages.

Which leads me back to writing a course description and learning objectives for my fall course.  Maybe later this month once I figure out why one of the textbooks still hasn't arrived.

Friday, April 17, 2015

Repost: Code Quality

xkcd had a great comic today about the code quality of self-taught programmers.  While there are technically trained programmers that write poor quality code, my impression is that this is more common with self-taught programmers as well as programmers who are only taught programming itself.  Basically as part of the technical training, someone learns more than just programming.  They learn about CS theory, data structures, multiple programming languages, and are more often exposed to well written / designed programs.  Learning to program at a high quality is similar to learning a natural language, in that you study grammar and spelling, you read great works of literature and analyze them, and you practice reading / writing / speaking.

Each month I understand better what it takes to be competent in my field and also respect more the idea that curriculum is established by experts for a purpose, whether or not I may like certain topics.