Friday, October 2, 2015

SCS Lecture Series: Solver-Aided Declarative Programming

Intersection of Programming Languages, Databases, and Analytics (Machine Learning, etc.)

Businesses are big systems that can be modeled: financial, data, supply chain, consumer demand, etc.  Simple models can be analytical formulas or spreadsheets prepared by the domain expert; larger ones can be an enterprise system of database + Java layer + ....  Consider the thousands or millions of SKUs across thousands of stores.  The goal is for the domain expert to be able to build and maintain the model.

For example, one client had 234 billion weekly forecasts (500k SKUs x 9000 stores x 52 weeks).  Consider the impact of BOGO deals and placing items on endcaps: how much does each improve sales?  The predictions had lower error than competing forecasts, which resulted in millions of dollars in savings by optimizing inventory.  The customer also benefits from using cloud computing rather than having to buy the necessary computing resources.  Customers increasingly trust the cloud to be secure, as companies such as Target, Home Depot, and T-Mobile have suffered data breaches.  Who do you think has more resources to secure the data: Google or Target?

Recall that Excel is the world's most popular IDE with the most popular programming language.  Thus the audience for this work is the business people, not the computer scientists.  So they have a programming language, LogiQL, that inherits from Datalog.  The language is not Turing-complete, but this restriction enables many valuable analyses that would otherwise be undecidable.  It also provides vastly more efficient clique finding in graphs.
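To make the Datalog evaluation model concrete, here is a hypothetical sketch in Python (not LogiQL syntax): rules are applied repeatedly until no new facts are derived, and this fixpoint always terminates because the language is not Turing-complete.

```python
# Hypothetical illustration (not LogiQL): Datalog-style evaluation of
#   reach(x, y) :- edge(x, y).
#   reach(x, z) :- reach(x, y), edge(y, z).
# Facts are derived to a fixpoint, which is guaranteed to terminate.
def transitive_closure(edges):
    reach = set(edges)
    while True:
        new = {(x, z) for (x, y) in reach for (y2, z) in edges if y == y2}
        if new <= reach:          # no new facts: fixpoint reached
            return reach
        reach |= new

edges = {("a", "b"), ("b", "c"), ("c", "d")}
print(transitive_closure(edges))
```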

Integrity constraints give the possible states for the data and the relations; therefore, the database knows what is possible in its facts.  Or, with weighted constraints, it knows which facts violate the fewest constraints.  The language also includes an existential second-order (ESO) operator: rather than asserting that a fact exists, ESO quantifies over the existence of an entire relation.  For example, ESO can be invoked to find a valid color relation for graph coloring.
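A brute-force sketch can make the ESO quantifier concrete.  This is hypothetical Python, not LogiQL: it asks "does there exist a relation color(node, c) such that the integrity constraint holds?" by searching over candidate relations.

```python
from itertools import product

# Hypothetical sketch (not LogiQL): existentially quantify over a whole
# relation, not a single fact.
def find_coloring(nodes, edges, colors):
    for assignment in product(colors, repeat=len(nodes)):
        color = dict(zip(nodes, assignment))     # candidate relation
        # integrity constraint: adjacent nodes must differ in color
        if all(color[u] != color[v] for (u, v) in edges):
            return color
    return None                                  # no such relation exists

triangle = [("a", "b"), ("b", "c"), ("a", "c")]
print(find_coloring(["a", "b", "c"], triangle, ["red", "green"]))   # None
print(find_coloring(["a", "b", "c"], triangle, ["red", "green", "blue"]))
```

A real solver would use constraint propagation rather than enumeration, but the existential question being answered is the same.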

Friday, August 28, 2015

Repost: Incentivizing Active Learning in the Computer Science Classroom

Studies have shown that using active learning techniques improves student learning and engagement.  Anecdotally, students have brought up these points to me from my use of such techniques.  I even published a study at SIGCSE on using active learning, comparing undergraduate and graduate students.  This study brought up an interesting point, to which I will return shortly: undergraduate students prefer these techniques more than graduate students do.

Mark Guzdial, far more senior than me, recently challenged Georgia Tech (where we both are) to incentivize the adoption of active learning.  One of his recent blog posts lists the pushback he received, Active Learning in Computer Science.  Personally, as someone who cares about the quality of my teaching, I support these efforts although I do not get to vote.

Faculty members at R1 institutions, such as Georgia Tech, primarily spend their time on research; however, they are not research scientists, and therefore they are also called upon to teach.  You would expect that they would do this well.  In meeting with faculty candidates, one expressed that the candidate's mission as a faculty member would be to create new superstar researchers.  Classes were irrelevant to this candidate as a student, so there would be no need to teach well: the highest end (telos) of research justifies focusing solely on students who succeed despite their instruction, just as the candidate did.  Mark's blog post suggests that one day Georgia Tech or other institutions may be sued for this sub-par teaching.

What about engagement?  I (along with many students and faculty) attended a visiting speaker talk earlier this week and was able to pay attention to the hour-long talk even though it was effectively a lecture.  And for this audience, it was a good talk.  The audience then has the meta-takeaway that lectures can be engaging; after all, we paid attention.  But we are experts in this subject!  Furthermore, for most of us there, this is our subfield of Computer Science.  Of course we find it interesting; we have repeatedly chosen to study it.

For us, the material we teach has become self-evidently interesting.  I return to the undergraduate and graduate students that I taught.  Which group is closer to being experts?  Who has more experience learning despite the teaching?  Who preferred me to just lecture?  And in the end, both groups learned the material better.

Edit: I am by no means condemning all of the teaching at R1's or even Georgia Tech.  There are many who teach and work on teaching well.  The Dean of the College of Computing has also put some emphasis on this through teaching evaluations.  Mark's post was partially noting that teaching evaluations are not enough, we can and should do more.

Monday, August 17, 2015

Course Design Series (Post 2 of N): Choosing a Textbook

Having now read both Programming Language Pragmatics, Third Edition and Concepts of Programming Languages (11th Edition), I have settled on the former as my textbook for the fall.  I do not find either book ideally suited, and I wish that the Fourth Edition were being released this summer and not in November, which is why it still has not arrived.

The choice of which textbook to use was "Concepts" to lose, and after 11 editions the text should be well revised.  I dislike reading the examples in a book and questioning how the code would compile.  Beyond that, the examples felt quaint and contrived.

(edit) For example, I was reviewing the material on F# and copied in an example from the text:
let rec factorial x =
  if x <= 1 then 1
  else n * factorial(n-1)
Does anyone else notice that the function parameter is x on the first two lines and n on the last?!

Before the Concepts book is written off entirely, there are many valuable aspects.  I enjoyed reading about the history of programming languages, especially for exposing me to Plankalkül.  The work also took a valuable tack in the subject by regularly examining the trade-offs between different designs and features.  This certainly helped inform my teaching of the material.

Fundamentally, when I looked at the price of the two textbooks, the benefits of using the newer Concepts textbook could not outweigh the nearly doubled price tag.  Most of the positive points are small things and can be covered as addenda to the material.

(FCC note - Concepts of Programming Languages was provided free to me by the publisher.)

Wednesday, July 15, 2015

PhD Defense Yacin Nadji - Understanding DNS-based Criminal Infrastructure for Informing Takedowns

Yacin Nadji, a PhD candidate in security at Georgia Tech, successfully defended his dissertation work today.

How does one disable a botnet?  It is difficult to identify and repair individually infected machines.  Therefore, targeting the command and control servers can instead break the linkage between the infected machines and the malicious controller.

Manual identification is time-consuming and can lead to collateral damage.  Automation is required to enumerate the machines, evaluate the threat, identify the takedown mechanism, and determine the potential collateral damage from the takedown.  Using a dataset of DNS registrations over time (from Damballa), the tools were tested across this sample of the Internet.

APTs (advanced persistent threats) are particularly troublesome, as these machines persist and change their presence over time according to the botnet controller.  The C&C machines also attempt to go dark by changing their IP resolution to localhost, thereby minimizing their observed signature by only having network traffic when an attack is required.  This leads to a suite of detection features that can identify the actual C&C machines, such as having short-lived IP addresses, changing the domain name to NULL or localhost, and varying the IP address across a diverse set of infrastructure and geographic locations.

They then developed a machine learning classifier, initially with a ground truth of 50k APT records.  The features are scored and then run through different models using a 90/10 train/test split on the ground-truth dataset.  The following results are only approximate, as I was trying to copy them during the presentation.

Model                      Accuracy   True Positive Rate   False Positive Rate
Naive Bayes                   70              91                   40
General Linear Regression     98              93                    1
Random Forest                 99              97                    0.04
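The three metrics above come from a confusion matrix over the held-out 10%.  A minimal sketch of how they are computed, with illustrative numbers (not the actual data from the talk):

```python
# Illustrative only: compute accuracy, TPR, and FPR (all as percentages)
# from confusion-matrix counts.
def metrics(tp, fp, tn, fn):
    accuracy = 100 * (tp + tn) / (tp + fp + tn + fn)
    tpr = 100 * tp / (tp + fn)          # true positive rate (recall)
    fpr = 100 * fp / (fp + tn)          # false positive rate
    return accuracy, tpr, fpr

# e.g., a classifier that found 97 of 100 APTs with very few false alarms
print(metrics(tp=97, fp=4, tn=9996, fn=3))
```

Note how a low false positive rate matters here: with millions of benign domains, even 1% of false positives would swamp the analysts.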

Then apply to the full dataset of 300 million records.  These are clustered to 1.1 million clusters, of which ~700 are above 0.8 confidence of being APTs.  At 90% confidence, the clusters all contain less than 1000 domain names.

How then do botnets attempt to evade detection?  The infected machines generally use DNS to lookup their C&C machines; however, the lookup can be occasionally spurious or to legitimate IPs.  The machines could be peer to peer, but this requires active connections that are often blocked or restricted by networks (against "legitimate" uses such as bittorrent).

The suite of tools also operates on the malware running in VMs, whereby it works through possible takedown mechanisms and then observes the response of the infection to takedown thereby identifying other, possibly unused, communication approaches.  For most infections, this takes on the order of hours to enumerate through the approaches; however, some can take days.

Open Problems:

  • Attributing the botnet to physical entities
  • Targeting P2P-based botnets

Wednesday, June 17, 2015

Conference Attendance FCRC - Day 5 - Plenary Summary

Plenary Talk today, which pulls together all of the conference attendees.  Sunday's talk was based in databases, with Michael Stonebraker speaking on his Turing-award winning work.  Monday's talk discussed interdisciplinary work, primarily centered in CS theory, and was given by Andrew Yao (a prior Turing Award winner).  On Tuesday, Olivier Temam discussed neural networks in hardware, which focused on his work and efforts to better model or mimic the capabilities of the brain.

The F# Path to Relaxation -
There are opportunities to introduce new work toward relaxing and improving.  Or perhaps create opposing camps: Thesis <-> Antithesis ==> Synthesis.  For example, Functional <=> Interop.  Back in 2003, functional languages were isolated and non-interoperable, using their own VMs.  F# (along with Scala, Swift, ...) instead seeks to have an "exosystem": running on external, industry-standard runtimes.  Another tension is between Enterprise and Openness, so F# is open and cross-platform.  Tools are available for Android and iOS, as well as packages for Linux.

Functional <=> Objects
Thus embrace objects, without being object-oriented.  Some cases in the cross-product of the expected features for objects and functions require particular care to synthesize.

Circularities and Modularity in the Wild
Lambdas, generics, etc are clearly being embraced in modern language design.  However, circular type dependencies are unfortunately also widely present.  Languages need to enforce acyclicity.

Pattern Matching <=> Abstraction
How does the language support the functional concept of pattern matching, when you want to include type abstraction?  Alas, the speaker skipped the solution quickly.

Code <=> Data
Most development is about providing tools for the information revolution.  There is exponential growth in open APIs for accessing data from the Internet.  This data then comes with dynamic types, where the types are only known once the data (or schema) has been accessed.  The type creation can also enable blending code from other languages into the F# environment.  For example, this support allows opening CSV or JSON files and having types for the data.  This feature is, by far, the most exciting and interesting of the presentation.  Not quite worth the price of admission, but clearly a great development.
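F#'s type providers do this at compile time against the real schema.  As a rough runtime analogue (hypothetical Python, not equivalent to the F# feature), one can sample the data and expose typed columns:

```python
import csv
import io

# Hypothetical analogue of a CSV type provider: inspect the data and
# convert each column to its inferred type.  F# does this statically;
# here it is purely dynamic.
def infer_type(values):
    for cast in (int, float):
        try:
            for v in values:
                cast(v)
            return cast
        except ValueError:
            continue
    return str

def load_typed(csv_text):
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    types = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
    return [{col: types[col](r[col]) for col in r} for r in rows]

data = load_typed("sku,price,store\nA1,9.99,42\nB2,4.50,17\n")
print(data[0])   # {'sku': 'A1', 'price': 9.99, 'store': 42}
```

The crucial difference is that F# surfaces these inferred types to the compiler and the IDE, so misspelling a column name is a compile-time error rather than a runtime one.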

Applied PL design comes from the synthesis at the heart of these contradictions.  This tension also is part of the proliferation of languages.

Conference Attendance FCRC - Day 4 - PLDI

PLDI starts off this morning with Concurrency.  As a student volunteer, I worked this session and was limited as to what I could note about the content itself.

Composing Concurrency Control - Introducing more diverse and finer-grained locking mechanisms.  The tool works to develop a locking strategy that will guarantee serializability, abort-safety, opacity, and deadlock-freedom.  It particularly works to integrate both locking schemes as well as transactional memory.

In the afternoon, I can dive into the semantics of the C language.

A Formal C Memory Model Supporting Integer-Pointer Casts - What optimizations are possible in the presence of pointers, pointer arithmetic, and integer-pointer casts?  For example, can constants be propagated, or is their location potentially targetable by a pointer?  Other optimizations are explored in their paper.  In practice, as code can generate arbitrary addresses, how can the compiler reason about any specific location in memory?

Defining the Undefinedness of C - This extends their prior work that gave semantics to the defined behavior of C programs, which required doubling the rules describing the semantic behavior.  Fundamentally, any instance of undefined behavior that will definitely be encountered in an execution invalidates that entire execution.  For example, if a division by zero follows a printf, it is valid for the program to crash before the printf executes.  The following code example is also undefined.
return (x = 1) + (x = 2);
Many of these cases are dependent on runtime behavior, and therefore a tool that can help identify them is valuable.

Monday, June 15, 2015

Conference Attendance FCRC - Day 3 - PLDI / ISCA

PLDI itself began this morning and after the welcome, we had three distinguished papers.  I am excited that two of these works focused on code performance and compilers, rather than higher-level programming language issues:

Automatically Improving the Accuracy of Floating Point Expressions - How do you address rounding error in your code?  Using formal numerical methods, an expert can reduce the errors.  But rather than require an expert, they wrote a tool that uses heuristics to apply these methods.  For example, what error do you have when evaluating the quadratic formula?  Based on just the value of b, there are different expressions that have much lower error.

The tool, Herbie, estimates the accuracy of the expression and then attempts to apply algebraic transformations (from a database of 120 rules).  Having generated many candidate expressions, the tool then uses dynamic programming to select an appropriate set of expressions across the input space.  First, it matches the example cases from Hamming's Numerical Methods book.  Furthermore, it has found bugs in existing projects.
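The quadratic-formula case is a classic instance of the problem: for large b, the small root subtracts two nearly equal numbers and loses most of its digits.  A hand-made illustration of the kind of rewrite Herbie searches for (this is the textbook transformation, not Herbie's actual output):

```python
import math

# Small root of x^2 + bx + c.  For large b, -b + sqrt(b^2 - 4c)
# suffers catastrophic cancellation.
def small_root_naive(b, c):
    return (-b + math.sqrt(b * b - 4 * c)) / 2

# Algebraically equivalent rewrite that avoids the cancellation:
# multiply numerator and denominator by (-b - sqrt(b^2 - 4c)).
def small_root_stable(b, c):
    return (2 * c) / (-b - math.sqrt(b * b - 4 * c))

b, c = 1e8, 1.0            # true small root is almost exactly -1e-8
print(small_root_naive(b, c), small_root_stable(b, c))
```

With b = 1e8, the naive form returns a value off by roughly 25%, while the rewritten form is accurate to machine precision.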

Diagnosing Type Errors with Class - SHErrLoc works to identify the likely cause of type errors.  Expressions are given constraints.  These constraints form a graph, which is analyzed for failing paths in the graph.  The tool then attempts to localize the failure and identify the minimal change to the constraints to satisfy the graph.  Even though it is not Haskell specific, it is more accurate at detecting type errors in Haskell programs than related work.

Provably Correct Peephole Optimizations with Alive - Compilers are buggy.  For example, LLVM's InstCombine is a pass that rewrites the LLVM IR to improve performance and contains many hand-rolled transformations.  They propose a DSL for describing peephole optimizations, where the DSL is basically a simplified LLVM IR annotated with preconditions for the transformation.  The expression describing the transformation is then passed through constraint checkers to verify that it is correct, and C++ code is generated for that transformation.

A correct transformation must not introduce new undefined behavior, must still produce the same result, and must properly update the memory state.  They initially proved the optimizations in InstCombine correct or identified bugs, and eventually could replace the pass with the generated version.  Furthermore, Alive was able to strengthen the post-conditions for many instructions (for example, identifying whether an operation will overflow).
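Alive discharges these checks with an SMT solver over the real LLVM semantics.  As a toy analogue (not Alive's DSL), a peephole rewrite over a small bitwidth can simply be verified exhaustively:

```python
# Toy analogue of verifying a peephole optimization: check the rewrite
#   (x ^ -1) + 1  ==>  -x
# (i.e., ~x + 1 == -x in two's complement) over all 8-bit values.
MASK = 0xFF

def source(x):
    return ((x ^ MASK) + 1) & MASK      # ~x + 1, truncated to 8 bits

def target(x):
    return (-x) & MASK                  # negation mod 2^8

assert all(source(x) == target(x) for x in range(256))
print("rewrite verified for all 8-bit inputs")
```

Exhaustive checking obviously does not scale to 64-bit integers, preconditions, or undefined behavior, which is exactly why Alive encodes the problem for an SMT solver instead.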

In the afternoon, I was visiting the other side of the conference center with ISCA.  One paper of note there:

A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing - They showed that by integrating simple cores closer to the memory system (somehow, as I missed this point), they pushed past the memory bandwidth of HMC (640GB/s) to about 2.3TB/s. They focused on two pieces: an updated programming model for graphs and a prefetching system within their cores. The model introduces asynchronous remote procedure calls that are sent to the Tesseract core near the data. These messages accumulate in a queue until either a barrier is reached or the queue is full. While they accumulate, the prefetcher requests the appropriate data so that when the function fires, the data is available. The prefetcher operates on two separate streams: the local processing that is sequential and generates the remote requests, and the remote requests received at this node.