
Thursday, March 7, 2019

Talk: Concurrent Data Structures for Non-Volatile Memory

Today, Michal Friedman gave a talk on Concurrent Data Structures for Non-Volatile Memory.

Future systems will contain non-volatile memory: memory that exhibits normal DRAM characteristics, but maintains its contents even across power failures.  In current systems, caches update memory on either evictions or flushes.  Flushes, however, impose overhead due to the memory access time and because they override the write-back nature of most caches.

Linearizability is one correctness definition for concurrency, governing how operations may be observed.  It can be extended to durable linearizability for a durable system: data is flushed before it becomes globally visible (initialization), operations flush the prior operations they depend on (dependence), and operations persist before they complete (completion).  But a further extension is required to know when a sequence of operations is complete, beyond just taking snapshots of the memory state.

The work presents relaxed, durable, and log versions of a lock-free queue that extend Michael and Scott's baseline queue implementation.  Each version provides stronger guarantees: relaxed is the existing queue augmented with a sync operation to snapshot state, durable preserves the data structure across failures, and log identifies the specific state.  The main guarantee is that the data structure will be consistent for any set of thread crashes, which is stronger than the lock-free guarantee.

This is done by extending the prior lock-free versions to include memory flushes of key state, such that later updates which see volatile state will flush that state before completing their own operations.  This meets durable linearizability.  It can be extended further by keeping a log of operations that is updated and persisted before the operations themselves execute.  These logs are per-thread, so they are unordered and individually stateful.
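
As a rough sketch of that flush discipline, here is what a durable enqueue might look like in C++ on x86, using the clflush/sfence intrinsics; this is my illustration of the pattern, not the implementation from the talk.

#include <atomic>
#include <immintrin.h>  // _mm_clflush, _mm_sfence

template <typename T>
struct Node {
    T value;
    std::atomic<Node<T>*> next{nullptr};
};

// Write a cache line back to (non-volatile) memory and order the flush
// before any later stores.
static void persist(const void* p) {
    _mm_clflush(p);
    _mm_sfence();
}

template <typename T>
void durable_enqueue(std::atomic<Node<T>*>& tail, Node<T>* node) {
    persist(node);  // initialization: persist the node before it is visible
    while (true) {
        Node<T>* last = tail.load();
        Node<T>* next = last->next.load();
        if (next == nullptr) {
            if (last->next.compare_exchange_weak(next, node)) {
                persist(&last->next);  // completion: persist the link first
                tail.compare_exchange_weak(last, node);
                return;
            }
        } else {
            persist(&last->next);  // dependence: flush the volatile link we saw
            tail.compare_exchange_weak(last, next);  // help advance the tail
        }
    }
}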

The relaxed version implements sync by creating a special object that indicates a snapshot is occurring.  If other concurrent operations find this object, they take over the snapshot and persist the remaining state before completing their own operations.  Thus a snapshot does not block other operations, but still occurs at that point in the sequence of operations.

Based on performance measurements, the relaxed version performs similarly to the baseline implementation, while the durable and log-based implementations run slower than the relaxed version but are comparable to each other.

Finally, TSO provides a guarantee that stores will reach the cache in the desired order, without requiring flushes between writes.

Friday, March 1, 2019

Conference Attendance: SIGCSE 2019 - Day 1.5

Back at SIGCSE again, this one the 50th to be held.  Much of my time is spent dashing about and renewing friendships.  That said, I made it to several sessions.  I've included at least one author and linked to their paper.

Starting on day 2, we begin with the keynote from Mark Guzdial.

"The study of computers and all the phenomena associated with them." (Perlis, Newell, and Simon, 1967).  The early uses of Computer Science were proposing its inclusion in education to support all of education (1960s).  For example, given the equation "x = x0 + v*t + 1/2 a * t^2", we can also teach it as a algorithm / program.  The program then shows the causal relation of the components.  Benefiting the learning of other fields by integrating computer science.

Do we have computing for all?  Most high school students have no access to CS courses, and even when they do, few take the classes.

Computing is a 21st century literacy.  What is the core literacy that everyone needs?  C.f. K-8 Learning Trajectories Derived from Research Literature: Sequence, Repetition, Conditionals.  Our goal is not teaching Computer Science, but rather supporting learning.

For example, let's learn about acoustics.  Mark explains the straight physics.  Then he brings up a program (in a block-based language) that can display the sound reaching the microphone.  So the learning came from the program, demonstration, and prediction.  Not from writing and understanding the code itself.  Taking data and helping build narratives.

We need to build more, try more, and innovate.  To meet our mission, "to provide a global forum for educators to discuss research and practice related to the learning and teaching of computing at all levels."

Now for the papers from day 1:

Lisa Yan - The PyramidSnapshot Challenge

The core problem is that we only view student work through the completed snapshots.  They extended Eclipse with a plugin to record every compilation, giving about 130,000 snapshots from 2600 students.  From those snapshots, they needed to develop an automated approach to classifying the intermediate states.  They tried autograders and abstract syntax trees, but those could not capture the full space.  But!  The output is an image, so why not try image classification?  Of the 138,531 snapshots, they generated 27,220 images.  Lisa then manually labeled 12,000 of those images into 16 labels that are effectively four milestones in development.  Then a neural network classifier classified the images.  Plotting the milestones using a spectrum of colors (blue being start, red being perfect): good students quickly reach the complete milestones; struggling students are often in early debugging stages; tinkering students (~73rd percentile on exams) take a lot of time, but mostly spend it on later milestones.  From these, we can review assignments and whether students hit the declared milestones, or if a different assignment structure is required.

For the following three papers, I served as the session chair.

Tyler Greer - On the Effects of Active Learning Environments in Computing Education

Replication study on the impact of using an active learning classroom versus traditional room.  Using the same instructor to teach the same course, but using different classrooms and lecture styles (traditional versus peer instruction).  The most significant factor was the use of active learning versus traditional, with no clear impact from the type of room used.

Yayjin Ham, Brandon Myers - Supporting Guided Inquiry with Cooperative Learning in Computer Organization

Taking a computer organization course with peer instruction and guided inquiry, can the peer instruction be traded for cooperative learning to further emphasize engagement and learning?  Students explore a model (program, documentation), then invent concepts (building an understanding), then apply the learned concepts to a new problem, and reflect on the learning at the end of each "lecture".  In back-to-back semesters, they measured the learning gains from this intervention, as well as surveyed other secondary items (such as engagement and peer support).  However, the students in the intervention group did worse, most of which is explained by prior GPA.  And across the other survey points, students in the intervention group rated lower.  The materials used are available online.

Aman, et al - POGIL in Computer Science: Faculty Motivation and Challenges

Faculty try implementing POGIL in the classroom: starting with training, then implementing in the classroom, and continuing to innovate.  Faculty want to see more motivation, better retention of the material, and students staying in the course (as well as in the program).  Students have a mismatch between their learning and their perceived learning.  There are many challenges and concerns from faculty about the costs of adoption.

Thursday, November 29, 2018

Seminar Talk: Computer Science Pedagogy - Miranda Parker

This week I have been co-hosting Miranda Parker, from Georgia Tech, as part of our Colloquium on Computer Science Pedagogy.  Her work is titled, Barriers to Computing: What Prevents CS for All.

The question is what are the barriers to accessing Computer Science at the high school level.  Nation-wide, ~35% of high schools offer at least one CS course, although this is self-reported CS.  In Indiana, the most popular courses are taken by ~5000 students, state-wide.  In Austin, Texas, about 6000 students take at least one CS course, out of ~110,000.

How do we know if students are successful (i.e., did they learn)?

For this, we need a validated assessment to ensure that we are measuring what we want to measure.  An initial assessment, FCS1, worked fairly well across multiple introductory programming languages; however, the assessment had limits on its use, which led to the SCS1.  This assessment correlated well with FCS1, so it can stand in; however, it was an hour long.  Assessments should cover the content, vary in difficulty, and have good discrimination (so individual scores are good predictors of overall performance).  In analysis, most of the SCS1 questions were hard, and few provided good discrimination.  The assessment was then adapted to focus on the medium-difficulty problems that discriminate (in test scores), resulting in a 30-minute test.

With a measurement of whether students are succeeding in Computer Science, we can walk back to investigate what things influence students to succeed in Computer Science.

Among previously studied factors, we know that students do better in CS if they have more prior experience with Computer Science, and students do better in CS (as well as STEM and general education) with higher socioeconomic status (SES).  There are also known links between SES and access to computing (whether formal courses or informal), and between SES and spatial reasoning.  And both of these latter components are linked to CS achievement.

An exploratory study (at a large, public university with mainly high-SES students) showed a statistical correlation from spatial reasoning to CS achievement.  There was no correlation from access to achievement.  In the interest of time, this point was not presented further.

Barriers to computing

There are three main sources of access to Computer Science courses: state policies, geography (such as, circumstances, industry, or neighboring schools with CS), and resources (such as, time and money or rural versus urban).

In partnership with the Georgia Department of Education, she obtained CS enrollment data from 2012-2016, plus other public data (such as characteristics of the county and of the school), for each of the 181 school districts in Georgia, where each district has at least one high school.  CS courses are defined as those that count toward the graduation requirement in Georgia as computer science, versus other computing courses such as web design or office proficiency.  In Georgia, out of 500,000 students, about 6000 took a CS course in a given year, and about 50% of schools offered computer science during at least one year in that time frame.  Note that the 6000 students are actually student course events, where a student taking two CS courses would count twice.

For those schools, CS enrollment, total high school enrollment, and median income contribute to whether CS will be offered in the next year.

This work is still ongoing, where the next steps are to visit schools and collect further data on these factors, such as why a school discontinued a course offering, or now offers one.  Or what do the students who take these courses go on to do?  An audience question wondered whether the CS courses offered relate to the course offerings in other parts of STEM.

Wednesday, October 17, 2018

Thesis Defense: Practical Concurrency Testing

Ben Blum defended his dissertation work today on Practical Concurrency Testing.  What follows are the notes from that defense.

The goal is to show that a program is correct across arbitrary concurrency.  There are three common testing approaches: unit testing of the most likely interleavings, stress testing that is not systematic, and verification that requires separate tools and techniques to describe the program.

Landslide is a proposed technique based on Stateless Model Checking (Godefroid '97), which tests a different execution interleaving on every iteration.  However, naive interleaving gives O(2^n) states to test.  [Flanagan '05] identified equivalent interleavings and [Musuvathi '08] proposed heuristic orderings to identify the possible bugs faster.  This approach can often require annotations, so adoption requires automated instrumentation.  The field is addressing further concurrency problems such as weak memory models, but hardware transactional memory is still open.

This instrumentation requires preemption points.  Finer-grained points find more bugs, but increase the states to test.  Bugs / failures follow certain cases, such as use-after-free, deadlocks, assertion failures, and invalid memory accesses.  Dynamic data-race analysis can help inform the necessary preemption points.

As a reminder, a data race occurs when:
- at least one of the accesses is a write
- the threads are not holding the same mutex
- nor is there another ordering requirement (condition variable, etc)
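
As a minimal illustration (mine, not from the defense), this program meets all three conditions:

#include <thread>

int counter = 0;  // shared data, no mutex or other ordering

void work() {
    for (int i = 0; i < 100000; i++)
        counter++;  // unsynchronized read-modify-write: one access is a write
}

int main() {
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    // counter is frequently less than 200000 due to lost updates
    return 0;
}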

Quicksand applies this analysis to select smaller problem spaces using subsets of possible preemption points.  Each subset also represents a smaller part of the larger possible problem space.  If these subsets are all satisfied, then together they represent a full verification of the program.  Prior work explored using APIs such as mutex_lock/unlock, or using every shared variable access, as preemption points.

This tester is deployed in OS courses at CMU, PSU, and U Chicago.  Manual annotation is not viable for students, especially the struggling students for whom the traces would be most valuable.  That said, students regularly deploy ad-hoc synchronization, such as while (!ready) yield();, which requires heuristics, as naive model checking must test every possible count of yields and their interleavings.

When used by students, about 75% of tested kernels / libraries have identifiable bugs from the testing framework.  For the tested submissions (7 semesters) of students at CMU, there is an improvement in grades, but it is not statistically significant when correcting for the opt-in bias.  Most students are then able to fix their bugs found by the tool.

Hardware transactional memory poses a separate challenge for model checking.  Aborted transactions are observationally equivalent to immediately-failed transactions.  Furthermore, all transactions must be assumed to be abortable, as there are many possible causes of aborts.  As prior posts covered, this fact requires that any transaction have a valid abort path.  And this abort path requires most of the verification.
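
As a sketch of why the abort path matters, the standard Intel TSX (RTM) idiom pairs every transaction with a fallback lock; this is the generic pattern, not code from the defense:

#include <immintrin.h>  // _xbegin, _xend, _xabort; compile with -mrtm

static bool fallback_locked = false;  // toy fallback lock

void critical_section() {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        if (fallback_locked)
            _xabort(0xff);  // do not run concurrently with the fallback path
        /* ... transactional work ... */
        _xend();
    } else {
        // The abort path: any transaction can abort (conflict, capacity,
        // interrupt), so this path must perform the same work correctly.
        while (__atomic_test_and_set(&fallback_locked, __ATOMIC_ACQUIRE)) {}
        /* ... the same work, non-transactionally ... */
        __atomic_clear(&fallback_locked, __ATOMIC_RELEASE);
    }
}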

Landslide was tested using hand-written tests, transactional data structures, and a TSX-based spinlock.  Each set of tests has a concurrency or performance bug in the implementations.  What about demonstrating that there are no bugs in an implementation?  With 10 hours of CPU time, verification is only possible for small cases on complex code.  That said, practical testing so far only requires <4 preemptions to create the buggy scenario.  There can be other bugs requiring an increasingly complex ordering, but generally those are very rare.

Abstraction reduction [Simsa '13] works to reduce primitives within implementations to verified components, such as mutual exclusion, etc.  Using this technique then allows Landslide to verify the complex HTM implementations at higher thread counts.

In attendance are the recent instructors of Operating Systems and the TAs.


Wednesday, October 10, 2018

NASA Talk: Making Miracles Happen

Dr. Thomas H. Zurbuchen, the Associate Administrator of NASA's Science Mission Directorate, gave a talk today about some key lessons he has learned and observed from his years with NASA, as well as the years before.  The following are my quick notes from his talk.

Planning for the future, Voyager launched with the plan to visit Jupiter and Saturn.  But the scientists still had plans for the possibility to visit Uranus and Neptune based on that launch window.  And also, that in the lifetime of the probes, they may leave the Heliosphere.  Imagine that over 40 years after launch, there are still papers to be written in Nature or Science from the mission.  Voyager 2 is nearing this mark, which Voyager 1 already crossed.

The solar probe problem was to explore near to the Sun.  The desire had been there for 60 years.  Eventually, from decades of other work, the technology was there to support components such as solar panels that can operate close to the Sun.  Once the technology was available, the proposal was for a $4 billion probe, which was outside the budget.  But NASA offered to design / launch for $1 billion.  This forced a complete redesign.  The original mission used a single Jupiter flyby to bring the probe to 4 solar radii.  For $1 billion, the flight path would instead use Venus for repeated flybys, eventually lowering the perihelion to 9 solar radii.  While 9 > 4, the Venus flight path provides many flybys of the Sun, which provides a richer dataset from each flyby.  This probe is the Parker Solar Probe, also exceptionally named for the still-living scientist Eugene Parker.

Cassini reinforced the need for teams to have diverse backgrounds and training.  Greater success is possible from having great teams.

Hubble provided the basis, in the last year alone, for ~1000 publications.  On the initial launch, there was the well-known flaw in the mirror, when NASA had been predicting great images.  After studying the results, a technician worked out that the mirror was slightly out of position, and with a small shift of the sensor, everything worked.

Shortly after finishing his PhD, Dr. Z came up with a new sensor in '98 that would eventually be part of Messenger at its 2004 launch, having been added to and removed from the craft multiple times.  And even after the launch, it took Messenger 7 years to complete the orbital path and shed sufficient energy so that the probe could enter orbit of Mercury.  This requires patience.

When talking about working in teams, he told a story from his time in the Swiss army.  Barbed wire had to be placed around the camp, a job given to those in trouble.  But he talked with the soldiers on this duty and helped.  Eventually, rather than just being the trouble job, the team was a good team, and some soldiers wanted to work on it.  Better to be on a good team doing a bad task, than a bad team doing a good task.

The National Academy of Science sets the science priorities (this process from priority to mission I read significantly about in Chasing New Horizons), but ultimately the specific mission to meet the science priority is decided by Dr. Z.  Then that mission moves through multiple steps and reviews.  And one of the key steps is that while Dr. Z makes the decision, these decisions are based on the good advice from the experts.

For the missions flown, Dr. Z has about 5 failures for every success.  These failures are things like being removed from the program or the mission being canceled.  And sometimes it is for technical failures of the device or the probe.  Things will fail.  At every launch, he goes there to watch and has a large mission plan.  Most of that mission plan covers what to do if things go wrong.  Plan for the failure.

Wednesday, August 29, 2018

Thesis Defense: Responsive Parallel Computation

We can consider parallelism strategies as either competitive (such as pthreads) or cooperative (such as Cilk).  Cooperative systems assume that the parallel computation is equally distributed across the threads, such that other tasks are not prioritized.

Into this, Stefan Muller presented a Responsive Parallel Computation approach where the two models can be unified.  Or put another way, "We can extend existing cooperative language and cost models to account for competitive threading constructs, and to design and implement efficient and responsive scheduling algorithms."

Cooperative systems, such as Cilk, rely on task queues and work stealing to implement the parallelism, but when IO is involved, the common implementations use blocking system calls.  In the cooperative system, this blocking call then blocks one of the worker threads, rather than just the working task.

For operating system scheduling, there are many threads within the myriad of processes in the system that may need to run.  The OS often uses a priority associated with any thread / process to determine these scheduling decisions.  When programmers assign priorities (if they do at all), they often are trying to express a relative ordering between different tasks rather than absolute priorities.  As systems provide anywhere from a handful of priorities to as many as 100, the mapping of priority classes to numeric levels can be unclear.

Propose an extension to ML, PriML, with two new types:
t cmd[p] - a command at priority p returning t
t thread[p] - a thread at priority p returning t

priority and order declarations enable defining the priorities and specifying the partial order between them.  spawn[priority] {expression} creates a thread.  sync retrieves the result.  Or poll returns an option that can be matched, so that we can select from multiple threads.  For example, the following PriML sequence spawns two threads to sort emails with different methods and selects the faster one.

t1 <- spawn[p] {do (qsort date emails)};
t2 <- spawn[p] {do (mergesort date emails)};
do (let fun choose () =
  cmd[p]
  {
    v1 <- poll t1;
    v2 <- poll t2;
    do (case (v1, v2) of
      (SOME l, _) =>
        cmd[p] {cancel t2; ret l}
      | (_, SOME l) =>
        cmd[p] {cancel t1; ret l}
      | (NONE, NONE) => choose ())
    }
  in
    choose ()
  end)

These features then allow the language to enforce the priority constraints and avoid priority inversion, as sync must always be to an equal or higher priority, while poll can be to any priority.

Returning to parallel program analysis, we represent programs with DAGs; however, the DAG must be extended with priority annotations and the work / span calculations.  This analysis can explore different schedules, such that adding fairness for all thread priorities would extend the bound on time.
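
For reference, the classic greedy-scheduler bound that such cost models extend is (standard background, not a result specific to this defense):

$T_P \le W/P + S$

where $W$ is the total work, $S$ is the span (critical-path length) of the DAG, and $P$ is the number of processors.  Priority and fairness constraints add terms to this bound.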

Friday, April 27, 2018

Thesis Defense: Systems Support for Intermittent Computing

Today I attended Alexei Colin's thesis defense titled, Systems Support for Intermittent Computing.

For small, embedded devices, batteries are expensive / difficult, so energy can be harvested from RF, light, temp gradients, motion, et cetera.  In such a device, this direct energy source is insufficient to power the device, so a small capacitor (or other storage medium) retains this energy until the device can be powered for a short time.  The discharge provides an intermittent period of execution before the power source drops below the threshold for execution.  Programs can be annotated with latches or other progress points, such that execution after power failure can then resume at this point after the power is again available.

To model the computation, the program is decomposed into tasks, where each task can only transfer control to other tasks, but contains arbitrary code.  Tasks communicate through channels.  The channels provide the memory model, such that any internal updates within a task are ultimately exposed via the channels.  However, while this model reduces the overhead required to execute the tasks, it requires a greater quantity of non-volatile memory.
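
A minimal sketch of the execution model (my own C++ illustration, assuming a hypothetical linker-placed non-volatile region; this is not the system from the defense):

#include <cstdint>

// Assume the linker places this struct in non-volatile memory, so it
// survives power failures; cur_task acts as the progress latch.
struct NVState {
    uint32_t cur_task;
    int32_t  channel;  // task-to-task communication lives in NVM
};
static NVState nv;

static int32_t read_sensor() { return 42; }  // stand-in for a device read
static void transmit(int32_t) {}             // stand-in for a radio send

static void task0() { nv.channel = read_sensor(); }
static void task1() { transmit(nv.channel); }
static void (*tasks[])() = { task0, task1 };

int main() {  // execution restarts here after every power failure
    while (nv.cur_task < 2) {
        tasks[nv.cur_task]();  // arbitrary code within the task
        nv.cur_task++;         // commit: a real system flushes this
                               // update to NVM before continuing
    }
    return 0;
}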

How do we then get the tasks and thus the latches?  Given a control flow graph (CFG), task boundaries need to be inserted between specific basic blocks of the graph.  The compiler can be extended to model (or receive model results for) the energy requirements of each block, and thereby estimate which path segments will have sufficient energy to execute completely.  Each block actually has not a single energy cost, but a PDF based on possible microarchitectural effects.  The model then combines these PDFs to compute the CDF that determines the probability that a given path will successfully execute given a specific amount of available energy.  Note that each task boundary imposes overhead both in time and therefore energy, so we want the set of task boundaries that minimizes overhead, while also accounting for the energy wasted by task failures.  This compiler pass produces better task decompositions than are achieved via the manual programmer annotations provided by prior work.
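
One way to read this model (my paraphrase): if the blocks along a path have independent energy costs $X_1, \ldots, X_n$ with PDFs $f_i$, then the path cost $X = X_1 + \cdots + X_n$ has a PDF $f_X$ given by the convolution of the $f_i$, and a path completes on stored energy $E$ with probability

$\Pr[X \le E] = F_X(E) = \int_0^{E} f_X(x)\,dx$

so task boundaries are placed where this probability remains high.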

There are other system support issues.  The system should also be able to dynamically select the stored energy necessary for task execution, a change that first requires splitting the energy storage device into multiple banks in hardware.  Also, debugging issues in the system is difficult, particularly where the device is expected to regularly "fail", so a new debugger was prepared that can provide the program state of traditional debuggers while still supporting the device being intermittent.  Such devices will also need further design for intermittent networking stacks, and then to be built into a larger IoT hierarchy.

In conclusion, energy-harvesting embedded computers will form the edge of the IoT hierarchy.  And the system stack will form the basis for support.

Tuesday, February 27, 2018

Conference Attendance SIGCSE 2018

I have just finished attending SIGCSE 2018 in Baltimore.  In contrast to my earlier conference attendance, this time I have had higher involvement in its execution.

On Wednesday I went to the New Educator's Workshop (NEW).  Even being faculty for two years, there were still a number of things that were either new or good reminders, such as including or discussing learning objectives with each lecture and assignment, or being careful with increasing one's level of service.  As a new faculty member, each service request seems exciting, as no one has asked me before!  But many senior faculty emphasized that this is the time in which they are protecting us from lots of service opportunities so that we can spend time on our teaching and research.

On Thursday morning, I presented my recent work that updated a programming assignment in Introduction to Computer Systems, and from which we saw improvements in student exam scores.  We did not research the specific cause, and are therefore left with two theories.  First, the improvement could be from using better style in the starter code and emphasizing this style in submissions.  Second, we redesigned the traces to require submissions to address different cases and thereby implement different features.  I lean toward the former, but have no data-driven basis for this hypothesis.

Let's discuss active learning briefly.  I attended (or ran) several sessions focused on this class of techniques.  The basic idea is that students have better engagement and learning by actively participating in class.  There are a variety of techniques that help increase student activity.  On Thursday afternoon, Sat Garcia of USD presented Improving Classroom Preparedness Using Guided Practice, which showed how student learning improved from participating in Peer Instruction, which particularly requires students to come to class prepared.  Shortly after, Cynthia Taylor joined Sat and me in organizing a Birds of a Feather (BoF) session on using Active Learning in Systems Courses.  We had about 30-40 attendees split into two groups discussing techniques they have used and problems they have observed.  5 years ago, a similar BoF had attendance around 15-20, so we are making progress as a field.

On Friday, I spoke with Brandon Myers, who has done work on using POGIL in Computer Organization and Architecture.  In POGIL, students work in groups of 3-4 with specific roles through a guided inquiry, leading students to discover the concepts themselves.  We had a nice conversation and may be merging our draft resources.  This last point is often the tricky part of using active learning, in that developing reasonable materials can be both time intensive and require several iterations.

The Friday morning keynote presentation was given by Tim Bell, who spoke about K-12.  This topic is rather distant from my own work and research, so I was skeptical.  Yet, I came out quite enthused.  It was interesting to think about presenting Computer Science concepts in non-traditional ways, based initially on having to explain your field at elementary school when the other presenters are a cop and a nurse (his example).  How could you get 6 year olds to sort?  Or see the advantage of binary search as the data grows?

In the afternoon, I was a session chair for the first time.  I moderated the session on Errors, so obviously the AV system stopped working for a short duration.  Beyond that incident, the session seemed to go well.

I always like going to SIGCSE.  It is rejuvenating and exhausting.  So many teachers to speak with about courses, curriculum, and other related topics.  And then you find that you've been social for 16 hours or so.

Tuesday, October 24, 2017

PhD Defense - Low-level Concurrent Programming Using the Relaxed Memory Calculus

Today, I went to Michael Sullivan's thesis defense; he passed.  The work was at a delightful intersection of my interests.

We want better (more usable, etc) semantics for low-level operations, those below the std::atomic<> and similar designs.  Perhaps this is achievable with ordering constraints.  Given the following simple example, what constraints are required?

int data, flag;

void send(int msg) {
  data = msg;
  flag = 1;
}

int recv() {
  while (!flag) continue;
  return data;
}

Two constraints: data ("visible") before flag, and flag ("executed") before data.  These constraints are explicitly programmer-specified, and it is contended that this is practical.

rmc::atomic<T> - a variable that can be concurrently accessed
L(label, expr) - labels an expression
VEDGE and XEDGE - specify orders between labeled expressions; effectively V is write-visibility ordering and X is read-execution ordering
rmc::push() or PEDGE - pushes have a total order, and provide orderings between reads and writes, which is not possible with just V and X.
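
Putting these constructs together, the earlier send / recv example might be annotated as follows (my sketch of the RMC style, not code from the talk):

rmc::atomic<int> data, flag;

void send(int msg) {
    VEDGE(wdata, wflag);   // the data write is visible before the flag write
    L(wdata, data = msg);
    L(wflag, flag = 1);
}

int recv() {
    XEDGE(rflag, rdata);   // the flag read executes before the data read
    while (!L(rflag, flag)) continue;
    return L(rdata, data);
}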

In a more advanced space, do we need to add constraints to spinlock_lock and spinlock_unlock?  Let's add two special labels: pre and post.  These serve as interface boundaries to denote that everything before this point has executed, or is visible.

The next problem is loop iterations.  Do the constraints need to hold within a single iteration or across every iteration?  The order specifiers are extended, so in the following, the ordering constraint is just within the iteration, whereas the constraint outside the iteration (without "_HERE") is also between iterations.

for (i = 0; i < 2; i++) {
  VEDGE_HERE(before, after);
  L(before, x = i);
  L(after, y = i + 10);
}

The code extends LLVM and is on GitHub.  The compiler takes the RMC extensions and puts the appropriate fence instructions into the IR, and then the existing compiler lowers this to assembly.  The compiler uses an SMT solver to determine the minimal set of locations that need the necessary fences (or other instructions).  Then the lowering to assembly can better take advantage of the specific constraints required.  Overall, the performance is better than the C++11 model on ARMv7 and Power, and comparable on ARMv8.  I suspect that x86's TSO model is not as interesting for finding performance benefits.

Usable / practical - Can Seqlocks Get Along With Programming Language Memory Models?  argues that C++11 would require acquire semantics on unlock.  Here it is stated that RMC is much more straightforward.  Further, students in 15-418 found gains from RMC versus the C11 model.

Other future items include the exploration of whether there are additional consistency instructions that might provide a better nuance for the compiler to inform hardware about required orderings.  Recall that the coarsest grained instruction is the full memory fence.

Thursday, July 13, 2017

PhD Defense - Automated Data-Driven Hint Generation for Learning Programming

Kelly Rivers defended her PhD work this afternoon.  She will be returning to CMU this fall as a teaching professor.

Student enrollment is increasing, so more work is needed to automate the support, as TAs / instructors are not scaling.  Prior work (The Hint Factory) developed models based on prior student submissions, such that a current student's work can be located within the model, thus providing suggestions for how to proceed.  However, programming may not fit within this model due to the larger and more varied space in which students can solve the problems.

First, student code proceeds through a series of canonicalization steps - AST construction, anonymization, and simplification - such that the following Python code is transformed:

import string
def any_lowercase(s):
  lst = [string.ascii_lowercase]
  for elem in s:
    if (elem in lst) == True:
      return True
    return False

Becomes

import string
def any_lowercase(p0):
  for v1 in p0:
    return (v1 in string.ascii_lowercase)

The studies then covered 41 different problems with hundreds of correct solutions and thousands of incorrect solutions.  The model can then generate the edits and chain these hints as necessary.  In more than 99.9% of cases, the model could successfully generate a hint chain to reach a correct solution.

To further test this model and approach, a model seeded with the empty space (just the teacher's solution) was compared against the final model.  Ideally, the final model will propose fewer edits than the initial model.  For 56% of problems, this was true.  40% of problems were already optimal.  And 3% are opportunities for improvement to the model.

Next, given this model exists, how do the hints impact student learning?  Half of the students were selected to have optional access to the hint model.  Using a pre / post assessment, the measurement was a wash.  Instead, a second study was designed that required the students to use the system within a two-hour OLI module.  Hints would be provided with every submission, either before or after the midtest in the OLI module.  Only half of the students actually proceeded through the module in order.  However, most learning occurred within the pretest->practice->midtest sequence, so adding those students increased the population.  The results show that the hints reduce the time required to learn an equal amount.

From interviews with students: students need and want targeted help on their work.  However, the hints generated thus far were not always useful.  Another study was proposed based on different styles of hints: location, next-step, structure, and solution.  This study found that participants with lower expertise wanted more detailed hints.  Hint usage would sometimes be for what is wrong versus how to solve it.  And often, students know what to do, and just need a reference (via example / prior work) for how to do it, rather than a hint about what to do.

Saturday, March 11, 2017

Conference Time: SIGCSE 2017 - Day 2

I started my morning by attending my regular POGIL session.  I like the technique and using it in the classroom.  However, I should probably make the transition, attend the (all / multi-day) workshop, and perhaps get one of those "ask me about POGIL" pins.

Lunch was then kindly provided by the CRA for all teaching-track faculty in attendance.  There is the start of an effort to ultimately prepare a memo to departments for how to best support / utilize us (including me).  One thing for me is the recognition of how to evaluate the quality of teaching / learning.

Micro-Classes: A Structure for Improving Student Experience in Large Classes - How can we provide the personal interactions that are valuable, when enrollments are large / increasing?  We have a resource that is scaling - the students.  The class is partitioned into microclasses, with clear physical separation in the lecture room, and each microclass has a dedicated TA / tutor.  Did this work in an advanced (soph / junior) class on data structures?

Even though the same instructor taught both the micro and the control class, the students reported higher scores for the instructor for preparedness, concern for students, etc.  Yet, there was no statistical difference in learning (as measured by grades).

Impact of Class Size on Student Evaluations for Traditional and Peer Instruction Classrooms - How can we compare the effectiveness of peer instruction being used in courses of varying class sizes?  For dozens of courses, the evaluation scores for PI and non-PI classes were compared.  There was a statistical difference between the two sets, particularly for evaluating the course and instructor.  This difference exists even when splitting by course.  This difference does not stem from the frequency of the course, nor the role of the instructor (teaching, tenure, etc).

Thursday, March 9, 2017

Conference Attendance SIGCSE 2017 - Day 1

Here in Seattle, where I used to live, attending SIGCSE 2017.

Exposed! CS Faculty Caught Lecturing in Public: A Survey of Instructional Practices - Postsecondary Instructional Practices Survey (24 items), 7000 CS faculty invited, about 800 responses. If the evidence is clear that active learning is better for instruction, then we should be doing that more. The overall split for CS was equal between student-centered and instructor-centered (exactly the same average, 61.5). The survey showed clear differences between non-STEM (student-centered) and STEM (instructor-centered). So CS is doing better than its overall group.

Now, to dig into the differences in the demographics. The major differences among instructors are women, and those with 15 years of experience versus 30, both showing a 5+ point difference between student- and instructor-centered. For those who are strongly committed, there are about 20% for each style, while the remaining 60% are "whatevers" who are not strongly committed.

Investigating Student Plagiarism Patterns and Correlations to Grades - What are some of the patterns of plagiarism, such as copying parts or all, and how do students try to obfuscate their "work"? Data came from 2400 students taking a sophomore-level data structures course. After discarding those assignments with insufficient solution space, four assignments remained from six semesters. A plagiarism detector was used to find likely cases of cheating.

First, even though the assignments remained unchanged, the rate of cases stayed constant. Most cases involved work from prior semesters. About two thirds of students who cheated did so on only one assignment. Second, the rate of cheating on the individual assignments was similar to the partner assignment. Third, while students who cheated did better on those assignments, they did not receive perfect scores, and those cheating did worse in the course than those who did not. Those who took the follow-on course showed a larger grade difference (p=0.00019). Fourth, the analysis used the raw gradebook data, which is independent of the detection and the result of that detection.

Six detectors were used: a lazy detector (common case: no comments or whitespace); token-based detectors (all names become generic, functions sorted by token length): identical token stream, modified token edit distance, and an inverted token index (compute 12-grams and inversely weight how common these are); a "weird variable name" detector (lowercase, removed underscores); and an obfuscation detector (all on one line, long variable names, etc). Fraction of total cases found by each detector: 15.69%, 18.49%, 49.71%, 72.77%, 67.35%, 0.38%.
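
As a sketch of the simplest of these, a lazy detector can normalize away comments and whitespace and then compare fingerprints (my illustration, not the paper's code):

#include <cctype>
#include <functional>
#include <string>

// Strip // and /* */ comments plus all whitespace, then hash the rest;
// matching fingerprints flag likely verbatim copies.
std::size_t lazy_fingerprint(const std::string& src) {
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '/' && i + 1 < src.size() && src[i + 1] == '/') {
            while (i < src.size() && src[i] != '\n') ++i;  // line comment
        } else if (src[i] == '/' && i + 1 < src.size() && src[i + 1] == '*') {
            i += 2;  // block comment
            while (i + 1 < src.size() && !(src[i] == '*' && src[i + 1] == '/')) ++i;
            ++i;
        } else if (!std::isspace(static_cast<unsigned char>(src[i]))) {
            out += src[i];
        }
    }
    return std::hash<std::string>{}(out);
}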

Monday, February 6, 2017

Conference Attendance CGO (etc) 2017 - Day 1

I started the conference attendance this time on Saturday, with the LLVM-Performance workshop at which I presented an extension to my dissertation work.  I received some interesting and useful feedback from the other attendees, as well as saw additional possibilities of its usage and collaboration.  Now that it is Monday, it is time to attend some conference talks.  In the evening today, I will be an advisor, watching one of my students present our work, which we practiced today, so it should go great!

Checking Concurrent Data Structures Under the C/C++11 Memory Model
C/C++11 included additional keywords that allow specifying features of the memory model, previously covered.  In order to check data structure implementations, the data structures need to be further annotated so as to describe valid and invalid executions.  For example, is a queue required to always return an element, or can it fail if an element was recently added?  Using these annotations, the authors were able to find issues in the data structures.

Efficient Abortable-locking Protocol for Multi-level NUMA Systems
The impact of NUMA can be significant.  On the largest shared-memory machines, the difference between accessing lock data that is local to an SMT thread and data at the farthest distance is over 2000x.  To avoid this overhead, a hierarchy of locks is created that mirrors the system's topology.  Each level of the hierarchy acts as an MCS-style queue lock.  How then can threads abort from their queue?  In a single level, threads mark their status as aborted and are then skipped when handing off the lock.  In the hierarchy, the requirement of waiting on the specific level is passed along to the appropriate thread, as can be determined by using the lower-level links.
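
A single-level sketch of such an abortable queue lock (my simplification; the real multi-level protocol must also handle races between aborting and the handoff):

#include <atomic>

enum Status { WAITING, GRANTED, ABORTED };

struct QNode {
    std::atomic<QNode*> next{nullptr};
    std::atomic<Status> status{WAITING};
};

struct AbortableMCSLock {
    std::atomic<QNode*> tail{nullptr};

    void lock(QNode* me) {
        me->next.store(nullptr);
        me->status.store(WAITING);
        QNode* prev = tail.exchange(me);
        if (prev == nullptr) return;             // lock was free
        prev->next.store(me);
        while (me->status.load() == WAITING) {}  // spin on our own node
    }

    // A waiter gives up by marking itself aborted; the releaser skips it.
    void abort_wait(QNode* me) { me->status.store(ABORTED); }

    void unlock(QNode* me) {
        QNode* succ = me->next.load();
        if (succ == nullptr) {
            QNode* expected = me;
            if (tail.compare_exchange_strong(expected, nullptr)) return;
            while ((succ = me->next.load()) == nullptr) {}  // wait for linker
        }
        // Hand off, skipping waiters that marked themselves aborted
        // (simplified: assumes each aborted waiter is already linked).
        while (succ != nullptr && succ->status.exchange(GRANTED) == ABORTED)
            succ = succ->next.load();
    }
};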

The paper's implementation was model checked and then run on an HP Superdome with 576 hardware threads.  The results showed that the lock implementation performs best when it respects all 4 levels of the NUMA (and NUCA) hierarchy.

Thread Data Sharing in Cache: Theory and Measurement
In this work, they collected memory access traces using Pin to determine thread data sharing in parallel programs.  In particular, they worked to show how the different metrics of data sharing can be derived from a single pass of the trace, in time linear in trace size.  The concern is that trace-based analysis approaches are very slow and could potentially skew the results.  And when the results are derived from only one trace, there is an additional question about their applicability.


Monday, January 18, 2016

Conference Attendance HiPEAC - Day 1 - MULTIPROG

It is once again, conference time.  For North Americans, this might seem rather early as I am writing from Prague, Czech Republic (or at least when I started 12 hours ago).  I am attending HiPEAC, which is the premier European computer architecture conference.  HiPEAC is a dual-track conference.  Throughout the three days there is the paper-track, where the accepted papers to TACO (such as mine) are presented.  And simultaneously there are workshops.  For the first day, I am starting with the MULTIPROG workshop, which is on Programmability and Architectures for Heterogeneous Multicores.

Let's start with the keynote, given by David Kaeli of Northeastern University.
- Concurrent execution of compute kernels
- Scheduling of kernels, deadlines
- Sharing / access to host memory (i.e., RAM)

The current model of using a GPGPU is that it runs one computation kernel; however, there are many problems that would better decompose into several separate kernels.  It would also be valuable if there were further examples of these problems (i.e., benchmarks).  Now, whenever you try running multiple anything on a computational resource, there is a runtime scheduling problem: which should run to best complete the overall problem?  A follow-on research question explores this in a cloud-based environment where the GPU may be shared across entirely independent compute kernels.  This requires the kernels to be tagged with IDs to ensure that their memory is kept separate.  All of this sounds as if we need an OS for the GPU.

Following the late-morning break, we heard next from MECCA (MEeting the Challenges in Computer Architecture) - 3Ps: parallelism, power, and performance.  Consider parallel program annotations for describing the concurrency, and runtime management of caches that uses the annotations to indicate the flow of data: transfer the data before it is required, with the appropriate coherence states, and indicate when a block is dead and can be evicted from the cache.

Then there was lunch, resting from my flights, then networking, especially the part where I stood by my poster and discussed my research for 3 hours.  Now to rest for day 2.

Wednesday, December 16, 2015

PhD Defense - Diagnosing performance limitations in HPC applications

Kenneth Czechowski defended his dissertation work this week.

He is trying to develop a science for the normal art of diagnosing low-level performance issues, such as processing a sorted array or the i7 loop performance anomaly.  I have much practice with this art, but I would really appreciate having more formalism to these efforts.

One effort is to try identifying the cause of performance issues using the hardware performance counters.  These counters are not well documented, and the tools are low-level.  Instead, he developed a meta-tool that intelligently iterates over the counters, thereby conducting a hierarchical event-based analysis: it starts with 6 performance counters and then iterates on more detailed counters that relate to the performance issue, trying to diagnose why the core is unable to retire the full bandwidth of 4 micro-ops per cycle.

Even if a tool can provide measurements of specific counters that indicate "bad" behavior, the next problem is that certain observed "bad" behaviors, such as bank conflicts, do not always correlate to performance loss, as the operation must impact the critical path.

The final approach is to take the program and build artificial versions of the hot code, such as removing the memory or compute operations from the loop body.  For some applications, several loops account for most of the time.  Then the loops can be perturbed in different ways that force certain resources to be exercised further.  For example, the registers in each instruction are scrambled so that the dependency graph is changed to either increase or decrease the ILP while the instruction types themselves are unchanged.  

Tuesday, November 10, 2015

CSE Distinguished Lecture - Professor David Patterson - Instruction Sets want to be Free: The Case for RISC-V

Similar to every other talk on Computer Architecture, first we need to revisit history.  Only by knowing from where we came, do we envision where to go.

History of ISA:
IBM/360 was proposed to unify the diverse lines of mainframes.  Slowly the ISAs started adding more instructions to support more things (see below).

Intel 8086 was a crash ISA design program to cover for their original ISA design that was delayed.  Then IBM wanted to adopt the Motorola 68000, but the chip was late, so the IBM PC used 8088s.

In the 1980s, a study found that if compiled code used only simple instructions, then programs ran faster than when using all of the instructions.  Why not design a simple ISA?

RISC (Reduced Instruction Set Computing) was that ISA.  The processor is simpler and faster.  Secretly, all processors are now RISC (internally), for example, the Intel and AMD processors translate from their x86 ISAs into their internal RISC ISA.

Maybe several simple instructions together could execute together, so the architecture could be simplified further and the compiler can find these instructions rather than spending time and energy when the program is running.  This ISA is VLIW (very long instruction word), where many simple instructions are merged into the long instruction.

Open ISA:
Computer Architecture is reaching certain limits such that processor gains will soon come from custom and dedicated logic.  IP issues limit the ability to do research on ISAs.  We are forced to write simulators that may or may not mimic actual hardware behavior.

Proprietary ISAs are continuing to grow in size, about 2 instructions per month.  This provides the marketing reason to purchase the new cores, rather than just the architectural improvements.

Instead, let's develop a modular ISA using many of the ideas from existing designs.  For example, atomic instructions for both fetch-and-op, as well as load link / store conditional.  Or, compressed instruction format so certain instructions can use a 16-bit format rather than 32-bits (see ARM).

RISC-V has support for the standard open-source software: compilers (gcc, LLVM), Linux, etc.  It also provides synthesizable core designs, simulators, etc.

Tuesday, November 3, 2015

PhD Defense Samantha Lo - Design and Evaluation of Virtual Network Migration Mechanisms on Shared Substrate

PhD Candidate Samantha Lo defended her dissertation work today.

Virtual networks (VNs) may require migration due to maintenance, resource balancing, or hardware failures.  Migration occurs when the assignment of virtual to physical network resources changes.  VN assignment has two aspects: policy of where to assign, and mechanism of how to do so.  Similar questions exist for migration or changing the assignment.

This dissertation work focuses on this last piece: the mechanism of how to change the assignment.  When migrating virtual network nodes, the policy aspect has already identified that a migration should occur and where the nodes should now be placed.

Chapter 3 explored scheduling algorithms in a simulated environment, where layers 2 and 3 can be changed.  The goal is to determine a migration schedule that will minimize the overhead / disruption of the migration and the time required to perform the migration.  For example, Local Minimum Cost First (LMCF) selects one node at a time to migrate.  In contrast, Maximal Independent Set tries to identify multiple nodes to move at once to reduce the time to migrate.

Chapter 4 explored an actual implementation in PlanetLab, where there is access to layer 3.  Virtual networks are placed within PlanetLab.  When the network migrated, it experienced up to 10% packet loss.  However, if the gateways for the VNs are synchronized to migrate closer in time, then the loss is lessened.

Chapter 5 addressed the performance issues raised in the previous work through transport and application layer changes.  When a VN migrates, the new location may have different physical characteristics.  Analysis of the TCP traffic showed that on migration, packet transmission rates dropped dramatically as the window size fell.  How can this be avoided?

1) Controller notifies the applications to delay packet transmission to avoid packet loss.
2) Gateway pauses and buffers traffic.

Under the latter scheme, the gateway fools the user into thinking that the TCP connection is still working when it is instead being buffered.  Furthermore, the network is also using Split TCP, such that each "->" is a separate connection in user->gateway->gateway->user.  The Split TCP hides the RTT from the user, which potentially permits the gateway to send data faster on resume.

After the command to pause data transmission is sent, the system must wait a small amount of time before actually migrating the network.  Otherwise, there are packets in flight that will be lost as the network migrates.  These packets will force TCP to attempt to retransmit using its exponential backoff.  This backoff can then delay the resumption of data transmission after migration imposing additional overhead.  By placing a delay between pausing and migrating, the network is quiesced and will resume more quickly.

Friday, October 2, 2015

SCS Lecture Series: Solver-Aided Declarative Programming

Intersection of Programming Languages, Databases, and analytics (Machine Learning, etc)

Businesses are big systems that can then be modeled: financial, data, supply chain, consumer demand, etc.  Simple models can be analytical formulas or spreadsheets prepared by the domain expert.  Or an enterprise system of database + Java layer + ....  Consider the thousands or millions of SKUs plus thousands of stores.  The goal is that the domain expert can build and maintain the model.

For example, one client had 234 billion weekly forecasts (500k SKUs x 9000 stores x 52 weeks).  Consider the impact of BOGO deals and placing items on endcaps: how much do these improve sales?  The accuracy of predictions had lower error than competing forecasts, resulting in millions of dollars in savings by optimizing inventory.  There is also benefit to the customer from using cloud computing versus buying the necessary computing resources.  Customers trust the cloud to be secure, as companies such as Target, Home Depot, T-Mobile, ... have suffered data breaches.  Who do you think has more resources to secure the data: Google or Target?

Recall that Excel is the world's most popular IDE, with the most popular programming language.  Thus the audience for this work is business people, not computer scientists.  So they have a programming language, LogiQL, inheriting from Datalog.  The language is not Turing-complete, but this enables many valuable analyses that would otherwise be undecidable, and provides vastly more efficient clique finding in graphs.

Integrity constraints give the possible states for the data and the relations; therefore, the database knows what is possible in its facts.  Or, with weighted constraints, which facts violate the fewest constraints.  They include an operator (ESO - existential second order) that is quantified by a relation existing, rather than a fact.  For example, ESO can be invoked to find a valid color relation for graph coloring.

Wednesday, July 15, 2015

PhD Defense Yacin Nadji - Understanding DNS-based Criminal Infrastructure for Informing Takedowns

Yacin Nadji, a PhD candidate in security at Georgia Tech, successfully defended his dissertation work today.

How does one disable a botnet?  It is difficult to identify and repair individually infected machines.  Therefore, targeting the command and control servers can instead break the linkage between the infected machines and the malicious controller.

Manual identification is time-consuming and can lead to collateral damage.  Automation is required to enumerate the machines, evaluate the threat, identify the takedown mechanism, and determine the potential collateral damage of the takedown.  Using a dataset of DNS registrations over time (from Damballa), the tools were tested across this sample of the Internet over time.

APTs (advanced persistent threats) are particularly troublesome, as they are machines that persist and change their presence over time according to the botnet controller.  The C&C machines also attempt to go dark by changing their IP resolution to localhost (127.0.0.1), thereby minimizing their observed signature by only having network traffic when an attack is required.  This leads to a suite of detection features that can identify the actual C&C machines, such as having short-lived IP addresses, changing the domain name to NULL or localhost, and varying the IP address across a diverse set of infrastructure and geographic locations.

They then developed a machine learning algorithm, initially with a ground truth of 50k records of APTs.  The features are scored and then run through different models using a 90/10 split on the ground truth dataset.  The following results are only approximate, as I was trying to copy them during the presentation.

Model                       Accuracy   True Positive Rate   False Positive Rate
Naive Bayes                 70         91                   40
General Linear Regression   98         93                   1
Random Forest               99         97                   0.04

Then the models were applied to the full dataset of 300 million records.  These are clustered into 1.1 million clusters, of which ~700 are above 0.8 confidence of being APTs.  At 90% confidence, the clusters all contain fewer than 1000 domain names.

How then do botnets attempt to evade detection?  The infected machines generally use DNS to look up their C&C machines; however, the lookups can be occasionally spurious or to legitimate IPs.  The machines could use peer to peer, but this requires active connections that are often blocked or restricted by networks (against "legitimate" uses such as BitTorrent).

The suite of tools also operates on the malware running in VMs, whereby it works through possible takedown mechanisms and then observes the response of the infection to takedown thereby identifying other, possibly unused, communication approaches.  For most infections, this takes on the order of hours to enumerate through the approaches; however, some can take days.

Open Problems:

  • Attributing the botnet to physical entities
  • Targeting P2P-based botnets

Wednesday, June 17, 2015

Conference Attendance FCRC - Day 5 - Plenary Summary

Plenary Talk today, which pulls together all of the conference attendees.  Sunday's talk was based in databases, with Michael Stonebraker speaking on his Turing-award winning work.  Monday's talk discussed interdisciplinary work, primarily centered in CS theory, and was given by Andrew Yao (a prior Turing Award winner).  On Tuesday, Olivier Temam discussed neural networks in hardware, which focused on his work and efforts to better model or mimic the capabilities of the brain.

The F# Path to Relaxation -
There are opportunities to introduce new work toward relaxing and improving.  Or perhaps create opposing camps: Thesis <-> Antithesis ==> Synthesis.  Or Functional <=> Interop.  Back in 2003, functional languages were isolated and non-interoperable, using their own VMs.  F# (along with Scala, Swift, ...) instead seeks to have an exosystem, building on the external industry-standard runtimes.  Another tension is between Enterprise and Openness.  So F# is open and cross-platform.  Tools are available for Android and iOS, as well as packages for Linux.

Functional <=> Objects
Thus F# embraces objects, without being object-oriented.  Some cases in the cross-product of the expected features for objects and functions require particular care for synthesis.

Circularities and Modularity in the Wild
Lambdas, generics, etc are clearly being embraced in modern language design.  However, circular type dependencies are unfortunately also widely present.  Languages need to enforce acyclicity.

Pattern Matching <=> Abstraction
How does the language support the functional concept of pattern matching, when you want to include type abstraction?  Alas, the speaker skipped the solution quickly.

Code <=> Data
Most development is toward providing tools for the information revolution.  There is exponential growth in open APIs for accessing data from the internet.  This data then comes with dynamic types, where the types are only known once the data (or schema) has been accessed.  The type creation can also enable blending code from other languages into the F# environment.  For example, the support can allow opening csv or json files and having types for the data.  This feature is, by far, the most exciting and interesting part of the presentation.  Not quite worth the price of admission, but clearly a great development.

Applied PL design comes from the synthesis at the heart of these contradictions.  This tension also is part of the proliferation of languages.