Monday, February 20, 2017

Repost: Learn by Doing

I want to take a brief moment to link to two of Mark Guzdial's recent posts.  Both touch on an important theme in teaching: students learn best by doing, not by hearing.  Oddly, students commonly hold the opposite misconception.  If I structure our class time so that they are the ones doing something, rather than me "teaching" by speaking, the appraisal can be that I did not teach.  They may not dispute that they learned, but that I failed to teach them.

Students learn when they do, not just hear.  And learning in MOOCs does not take this requirement into account.

I have to review these points regularly.  So much so that I was able to give them to a group of reporters last week (as part of new faculty orientation, but still).

Wednesday, February 8, 2017

Conference Attendance CGO (etc) 2017 - Day 3

Today is the last day of the conference.  I attended several more talks today, and my student took 2nd in the student research competition.  So it has been a good conference: I received some interesting and valuable feedback on my own work, and found some possible collaborations for the future.

Optimistic Loop Optimization
The compiler wants to optimize loops in the program.  However, C and LLVM IR have many complex characteristics that the polyhedral model cannot represent, such as aliasing, integer wrapping, or out-of-bounds accesses.  Rather than simply assuming that these characteristics are not present, the code can be analyzed to determine which violating characteristics may be present.  The resulting assumptions are placed in the code and can then be reduced to a set of preconditions under which the optimized loop can execute.  Should the preconditions fail at runtime, the code instead branches to the original version.  The preconditions themselves can also be simplified (for example, N < 127 implies N < 128).  On SPEC200x, the assumptions fail about 2% of the time and impose 4% runtime overhead.
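
As a hand-written illustration of this versioning (my own sketch, not Polly's actual output; the single alias check stands in for whatever set of preconditions the analysis actually derives):

#include <cstdint>

// Sketch only: assumes the lone optimistic assumption is "A and B do not alias";
// real analyses also derive preconditions for wrapping, bounds, etc.
static void scale_optimized(float *__restrict A, const float *__restrict B, long N) {
    for (long i = 0; i < N; ++i)      // __restrict lets the compiler vectorize freely
        A[i] = 2.0f * B[i];
}

static void scale_original(float *A, const float *B, long N) {
    for (long i = 0; i < N; ++i)      // conservative version, no assumptions
        A[i] = 2.0f * B[i];
}

void scale(float *A, const float *B, long N) {
    std::uintptr_t a = reinterpret_cast<std::uintptr_t>(A);
    std::uintptr_t b = reinterpret_cast<std::uintptr_t>(B);
    if (a + N * sizeof(float) <= b || b + N * sizeof(float) <= a)
        scale_optimized(A, B, N);     // preconditions hold: run the optimized loop
    else
        scale_original(A, B, N);      // assumptions failed: branch to the original version
}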

Software Prefetching for Indirect Memory Accesses
What should we prefetch in software? A[x + N] is easy for hardware, A->next is hard for everyone, while A[B[x + N]] is easy for software and hard for hardware to predict.  So given a loop with this indirect structure (such as exists in the NAS IS benchmark), prefetches can be inserted that will speed up the loop.  Yet, there are three key points for inserting these prefetch instructions (see the sketch after the list).
- You have to prefetch both A[B[]] and B[].  Otherwise, the prefetch will stall on loading B[].
- Having both prefetches requires that they are offset from the actual access as well as from each other.  Too close and they are not parallel; too far and the data is evicted before use.
- The first point implies that there is an actual load (not a prefetch) of B[], which therefore needs to be bounds checked.
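
A sketch of how those three points look in code (my own illustration, not the paper's implementation; __builtin_prefetch is the GCC/Clang intrinsic, the distance is made up, and B[] is assumed to hold valid indices into A):

void sum_indirect(const int *A, const int *B, int n, long *total) {
    const int DIST = 16;                           // prefetch distance: tune per machine and loop
    for (int i = 0; i < n; ++i) {
        if (i + 2 * DIST < n)
            __builtin_prefetch(&B[i + 2 * DIST]);  // prefetch B[] first, farther ahead
        if (i + DIST < n)                          // bounds check: B[i + DIST] is a real load
            __builtin_prefetch(&A[B[i + DIST]]);   // feeding the prefetch of A[]
        *total += A[B[i]];
    }
}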

Tuesday, February 7, 2017

Conference Attendance CGO (etc) 2017 - Day 2

Several of the talks were great and very interesting.  Other talks particularly needed further presentation practice.  Unfortunately, some of this may come from English being a second language.  And so I am torn between wanting presenters to get practice and the field to be open to a wider pool of researchers, while also wanting to be able to easily understand the presentations.

Tapir: Embedding Fork-Join Parallelism into LLVM’s Intermediate Representation
Let's start by considering code that normalizes a vector.  This code takes 0.3s to run.  Then switch the "for" to a "cilk_for", and the execution time degrades to 180s (even with 18 cores).  When the compiler sees "cilk_for" or other parallel keywords, it generally converts them into runtime function calls that take a function pointer to the parallel component (similar to thread-creation routines taking a function to execute).  With the function call in place, many optimization passes cannot cross the call, whereas previously they could cross the "for".
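
The example, reconstructed from my notes (norm() computes the vector's magnitude; do not treat this as the authors' exact code):

#include <cilk/cilk.h>
#include <cmath>

double norm(const double *a, int n) {        // Euclidean norm of the vector
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += a[i] * a[i];
    return std::sqrt(sum);
}

void normalize(double *out, const double *in, int n) {
    cilk_for (int i = 0; i < n; ++i)
        out[i] = in[i] / norm(in, n);        // norm() is loop-invariant, but once the body is
                                             // outlined behind a runtime call it cannot be hoisted
}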

Instead, let's propose three new instructions (detach, reattach, and sync) to include in the LLVM IR.  Supporting these instructions required approximately 6000 lines of changes.  When the updated LLVM compiles a set of parallel programs, most can now reach 99+% work efficiency, which indicates that the parallel overhead is near zero.

Prior work would create parallel tasks symmetrically; for example, each task would represent a separate path in the classic "diamond" CFG.  The problem is that the parallel program is actually taking both paths concurrently, which is not an expected behavior of control flow.  Instead, Tapir's IR is asymmetric, so that compilers can continue to reason about the basic blocks as they would appear in sequential code.

Incremental Whole Program Optimization and Compilation
This covers a feature within Microsoft's Visual Studio compiler.  Each component stores hashes of the components on which it depends.  When a file is changed, it generates different hashes, which the compiler can use to determine that the file's dependents need to be re-analyzed and have code generated again.  These hash changes then either propagate further, if the dependent components change in turn, or the compilation process completes using the cached results.
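
A minimal sketch of that hash-based invalidation, with names and structure entirely of my own invention (the actual Visual Studio implementation is certainly different):

#include <cstddef>
#include <functional>
#include <map>
#include <string>

struct Component {
    std::string contents;                               // source / IR summary for this component
    std::map<const Component*, std::size_t> depHashes;  // hash of each dependency at last build

    std::size_t summaryHash() const {
        return std::hash<std::string>{}(contents);
    }

    bool needsRebuild() const {
        for (const auto &dep : depHashes)
            if (dep.first->summaryHash() != dep.second)
                return true;                            // a dependency changed: re-analyze and re-codegen
        return false;                                   // all hashes match: keep the cached result
    }
};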

Optimizing Function Placement for Large-Scale Data-Center Applications
The common binaries at Facebook are tens to hundreds of MB in size.  These binaries have IPCs below 1.0 (recall that processors can run above 2.0, and higher is better) and experience frequent front-end stalls attributable to iTLB and I$ misses (as high as 60 per 1000 instructions, eww).  Hardware profilers can determine the hot functions.  This information is then processed to determine which hot functions should be clustered together.  These clusters are mapped to separate loader sections that are loaded using huge pages.
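
For intuition, a much-simplified greedy clustering over profiled call edges might look like the following.  This is only a sketch of the general idea, not the paper's actual algorithm, and it omits details such as capping cluster sizes to fit huge pages:

#include <algorithm>
#include <cstdint>
#include <vector>

struct CallEdge { int caller, callee; std::uint64_t weight; };  // weight = profiled call count

// Returns a layout order in which functions joined by hot call edges end up adjacent.
std::vector<int> clusterByHotness(int numFuncs, std::vector<CallEdge> edges) {
    std::sort(edges.begin(), edges.end(),
              [](const CallEdge &a, const CallEdge &b) { return a.weight > b.weight; });

    std::vector<int> clusterOf(numFuncs);
    std::vector<std::vector<int>> clusters(numFuncs);
    for (int f = 0; f < numFuncs; ++f) { clusterOf[f] = f; clusters[f] = {f}; }

    for (const CallEdge &e : edges) {
        int a = clusterOf[e.caller], b = clusterOf[e.callee];
        if (a == b) continue;                        // already in the same cluster
        for (int f : clusters[b]) clusterOf[f] = a;  // merge the callee's cluster into the caller's
        clusters[a].insert(clusters[a].end(), clusters[b].begin(), clusters[b].end());
        clusters[b].clear();
    }

    std::vector<int> layout;                         // final function order for the hot section
    for (const auto &c : clusters)
        layout.insert(layout.end(), c.begin(), c.end());
    return layout;
}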

Minimizing the Cost of Iterative Compilation with Active Learning
There are too many possibilities for optimization.  Let's ask machine learning to figure this out.  The danger is always finding the right level of training to provide valuable insights without overfitting, etc.


Monday, February 6, 2017

Conference Attendance CGO (etc) 2017 - Day 1

I started the conference attendance this time on Saturday, with the LLVM-Performance workshop, at which I presented an extension to my dissertation work.  I received some interesting and useful feedback from the other attendees, as well as saw additional possibilities for its usage and for collaboration.  Now that it is Monday, it is time to attend some conference talks.  This evening, I will be acting as an advisor and watching one of my students present our work, which we practiced today, so it should go great!

Checking Concurrent Data Structures Under the C/C++11 Memory Model
C/C++11 included additional keywords that allow specifying aspects of the memory model, which I have covered previously.  In order to check data structure implementations, the data structures need to be further annotated so as to describe valid and invalid executions.  For example, is a queue required to always return an element, or can it fail if an element was only recently added?  Using these annotations, the authors were able to find bugs and other issues in the data structure implementations.
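
As a reminder of the "keywords" in question, here is a standard C++11 release/acquire message-passing idiom (this is ordinary C++11, not the paper's checker or its annotations):

#include <atomic>

std::atomic<bool> ready{false};
int payload = 0;

void producer() {
    payload = 42;
    ready.store(true, std::memory_order_release);   // publish: earlier writes become visible...
}

bool consumer(int *out) {
    if (ready.load(std::memory_order_acquire)) {    // ...to any thread that acquires the flag
        *out = payload;
        return true;
    }
    return false;
}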

Efficient Abortable-locking Protocol for Multi-level NUMA Systems
The impact of NUMA can be significant.  On the largest shared-memory machines, accessing lock data that is local to an SMT thread can be over 2000x faster than accessing data at the farthest distance.  To avoid this overhead, a hierarchy of locks is created that mirrors the system's topology.  Each level of the hierarchy acts as an MCS-style queue lock.  How then can threads abort from their queue?  Within a single level, threads mark their status as aborted and are then skipped when handing off the lock.  In the hierarchy, the obligation to wait at a specific level is passed along to the appropriate thread, which can be determined by following the lower-level links.
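
A declarative sketch of the kind of per-level queue node this implies (the field names and the four-level array are my guesses based on the talk, not the authors' code); on release, the holder follows the next links and skips any node whose state is Aborted:

#include <atomic>

enum class NodeState { Waiting, Granted, Aborted };

struct QNode {
    std::atomic<NodeState> state{NodeState::Waiting};
    std::atomic<QNode*>    next{nullptr};   // MCS-style successor link within this level
};

// One MCS-style queue per level of the NUMA/NUCA topology
// (e.g., core cluster -> socket -> board -> whole machine).
struct HierarchicalLock {
    std::atomic<QNode*> tail[4]{};          // the evaluation respects 4 levels
};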

The implementation was model checked and then run on an HP Superdome with 576 hardware threads.  The results showed that the lock implementation performs best when it respects all 4 levels of the NUMA (and NUCA) hierarchy.

Thread Data Sharing in Cache: Theory and Measurement
In this work, they collected memory access traces using Pin to determine the thread data sharing in parallel programs.  In particular, they showed how different metrics of data sharing can be derived from a single pass over the trace, in time linear in the trace size.  The concern is that trace-based analysis approaches are very slow and could potentially skew the results.  And when the results are derived from only one trace, there are additional questions about their applicability.


Thursday, December 29, 2016

Book Review: The Art of Readable Code

I really enjoyed reading The Art of Readable Code.  I always enjoy reading books on style, both for English writing and for source code.  The earlier chapters were a greater highlight for me, as they focused on the basics of style and were demonstrated through languages with which I am familiar.  Some of the later chapters instead used examples particular to languages that I do not use, such as illustrating style pitfalls with JavaScript.  The book also had value in showing several approaches and continually refactoring the code to better meet style and readability goals.

While the topic is style, this is really more about fundamentally good practices, which may be implemented in any of several concrete ways (e.g., where to put braces, camel case, etc.) that we term style.  Between this and the examples within the text, I want to start requiring it of students.  I want them to read it and see why style actually matters.  Or maybe we will just have to wait until they experience why it matters and suffer for it.

Thursday, December 8, 2016

Practical TSX

I previously speculated about how Intel's TSX is implemented; however, I did not have access to any machines supporting TSX until this year.  I still have not done much testing personally, but I did direct two students who explored and measured this support.  As a reminder, TSX is an implementation of hardware transactional memory, which can simplify concurrency designs by avoiding the need for locks.  Being a hardware implementation, it has certain fixed costs and constraints.

Mario Dehesa-Azuara and Nick Stanley completed a 6-week project in the spring, and the summary below is taken from their final project report.  Also, being students, their report may not be available indefinitely, so this link may be broken at some future date.
First, they reviewed the best practices for writing TSX-enabled code.  In particular, there is a problem in that the TSX path and the fallback path (the fallback path is required to ensure that the code can make progress even with aborts) must be mutually exclusive.  This can require adding operations versus a purely transactional approach.
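
For reference, the usual shape of that mutual exclusion between the TSX path and the fallback path looks like this (a generic RTM sketch using Intel's intrinsics, not code from their report; it requires a CPU and compiler with RTM support, e.g. building with -mrtm):

#include <immintrin.h>
#include <atomic>

std::atomic<int> fallback_lock{0};               // 0 = free, 1 = held

void locked_increment(long *counter) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        // Reading the fallback lock adds it to the transaction's read set, so any
        // thread taking the lock aborts us: the two paths stay mutually exclusive.
        if (fallback_lock.load(std::memory_order_relaxed) != 0)
            _xabort(0xff);
        ++*counter;
        _xend();
        return;
    }
    // Fallback path (a real implementation would retry the transaction a few times first).
    while (fallback_lock.exchange(1, std::memory_order_acquire) != 0)
        ;                                        // spin until the lock is free
    ++*counter;
    fallback_lock.store(0, std::memory_order_release);
}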

Second, they measured the cost of concurrent data structure updates.  Their work noted that writing a transactional implementation was significantly easier than a fine-grained locking approach.  However, their measurements revealed some counterintuitive results.  For example, an AVL tree is a self-balancing data structure.  The self-balancing nature is a benefit in that fewer memory accesses should be required.  Yet the rotations required to maintain this property actually increased the set of locations accessed and therefore resulted in a high rate of aborts.

To understand this, we must turn to the actual implementation.  We know that TSX can only track a limited number of memory locations (at most, the size of the L1 data cache).  As soon as any transactional memory location (i.e., cache line) cannot be stored in the L1, the transaction must abort.  Thus limiting the size of the read and write sets of the transaction is vital for completing transactions.  In Mario and Nick's experiments, they observed that after 5 million insertions into an AVL tree, transactions were failing at a 50% rate (regardless of the number of threads tested).  In contrast, a treap, with its probabilistic balancing, has relatively constant failure rates that depend on the number of threads (and not the total insertions).

Third, using TSX has an inherent cost that is significantly higher than other atomic operations.  The advice still stands that simple atomic updates should use the appropriate atomic instructions.  What if you need to perform several of these operations?  Again, we turn to the final report.  The measurements show that simple transactional operations on consecutive memory locations will be faster than the equivalent atomic operations on those locations once you access at least 6 locations per "transaction".  Nonetheless, if the program must obey another constraint, such as updating all or none of the elements, then locks or transactions are required regardless.
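
Concretely, the comparison is between something like these two ways of updating n consecutive counters (my own sketch; the crossover around 6 locations is the students' measurement, not something this code demonstrates by itself):

#include <immintrin.h>
#include <atomic>

void bump_each_atomically(std::atomic<long> *c, int n) {
    for (int i = 0; i < n; ++i)
        c[i].fetch_add(1, std::memory_order_relaxed);  // each element is atomic, the group is not
}

bool bump_all_or_nothing(long *c, int n) {
    if (_xbegin() != _XBEGIN_STARTED)
        return false;                                  // caller must fall back (e.g., take a lock)
    for (int i = 0; i < n; ++i)
        ++c[i];                                        // the whole group commits or aborts together
    _xend();
    return true;
}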

It is important to remember that a major benefit of transactional memory is in design and implementation effort, not in the actual runtime of the program.

Friday, October 14, 2016

Conference Attendance Teaching and Learning Summit 2016 - Keynote

Critical Thinking: Why is it so hard to teach? - Dr Daniel T. Willingham

Critical thinking is intertwined with content knowledge.  We saw a sequence of four examples (if vowel, then even number; if drinking alcohol, then 21; if gin, then haddock; if entering the country, then cholera vaccine).  For each example, there is a claim about a set of cards of the form "if X then Y"; given four cards, verify the claim.  When the problems were framed as permissions, the success rate was high.  Each problem is technically P -> Q, yet having just completed a semester of logic has no impact on the results.

Scientific reasoning is taught in two pieces: scientific concepts and scientific method.  So consider designing a learning experiment.  The group is split into intervention and control.  How do you know that the random sample is valid?  Background knowledge is required to determine the appropriateness of the split.

Critical thinking comes from learning at the deep level.  The surface story is, say, "tumors and rays".  The deep structure is whether the problem is modus ponens, Newton's third law, etc.  However, memory latches onto the surface facts, and recall is based on those components.

Why not teach the deep structure immediately?  Abstractions are hard to understand.  Instead, learners have to see lots of surface structures all overlaying the same deep structure.

Sometimes failures in critical thinking are actually failures in basic knowledge.  Furthermore, there are also innate biases, such as words refer to objects and attributes, and the world is full of agents and purposes.

Takeaway 1: Most of critical thinking is domain-specific.
Takeaway 2: In each domain, faculty should identify what they consider the important critical thinking skills.
Takeaway 3: Select content with an eye toward teaching these skills.  Teach the critical thinking in the context of the content.
Takeaway 4: Critical thinking is a curricular issue.  These skills require more than 1 semester to acquire.
Takeaway 5: Certain foundational concepts may run counter to the mind's biases.  Students have functional knowledge that has worked so far.  For example, "equals sign means put answer here".

Q. How do domain skills translate to interdisciplinary work?
A. Don't know.  Probably one needs to know enough of the skills in the home domain to be able to explore the other domain.

Q. If critical thinking is domain specific, how specific are domains?
A. Domains are nested.  Proper application requires domain knowledge.  Moving from cognitive psychology to social psychology leaves [the speaker] less skilled, but still better than average.  Moving into clinical psychology, there is a common basis, but limited ability to apply it.