Tuesday, January 19, 2016

Conference Attendance HiPEAC - Day 2 - Papers

Another conference day.  Much of my time is spent talking with other attendees and doing "work", such as preparing my presentation, send emails, etc.  However, I do take some time to actually sit in on other presentations, so here are two highlights:

PARSECs - This work explores rewriting some of the PARSEC benchmarks to use a task-based parallelism (OpenMP tasks), rather than pthreads.  For many workloads, these changes provide improved scaling.  For almost all workloads, the code size was reduced as the original thread pools, job queues, etc could be removed.  In the near future, these revised versions should be released.

HRF-Relaxed - The original OpenCL had no memory model; however, many vendors implemented one.  Now, C++ and other languages use SC for DRF (sequential consistency for data-race-free programs).  Unfortunately, if you use this consistency model in OpenCL, you will lose performance.  Instead, this work proposes a hierarchical race-free model, where the races are only checked at a certain scope of the program.

Monday, January 18, 2016

Conference Attendance HiPEAC - Day 1 - MULTIPROG

It is once again, conference time.  For North Americans, this might seem rather early as I am writing from Prague, Czech Republic (or at least when I started 12 hours ago).  I am attending HiPEAC, which is the premier European computer architecture conference.  HiPEAC is a dual-track conference.  Throughout the three days there is the paper-track, where the accepted papers to TACO (such as mine) are presented.  And simultaneously there are workshops.  For the first day, I am starting with the MULTIPROG workshop, which is on Programmability and Architectures for Heterogeneous Multicores.

Let's start with the keynote, given by David Kaeli of Northeastern University.
- Concurrent execution of compute kernels
- Scheduling of kernels, deadlines
- Sharing / access to host memory (i.e., RAM)

The current model of using a GPGPU is that it runs 1 computation kernel; however, there are many problems that would better decompose into several separate kernels.  It would also be valuable if there were further examples of these problems (i.e., benchmarks).  Now, whenever you try running multiple anything on a computational resource, there is a runtime scheduling problem.  Which should run to best complete the overall problem.  A follow-on research question explores this question a cloud-based environment where the GPU may be shared across entirely independent compute kernels.  This requires the kernels to be tagged with IDs to ensure that their memory is kept separate.  All of this sounds as if we need an OS for the GPU.

Following the late-morning break, we heard next from MECCA (MEeting the Challenges in Computer Architecture) - 3Ps: parallelism, power, and performance.  Consider parallel program annotations for describing the concurrency, runtime management of caches using the annotations to indicate the flow of data and transfer the data before it is required and with the appropriate coherence states and indicate when a block is dead and can be evicted from the cache.

Then there was lunch, resting from my flights, then networking, especially the part where I stood by my poster and discussed my research for 3 hours.  Now to rest for day 2.

Wednesday, January 13, 2016

Repost: Avoid Panicking about Performance

In a recent post, another blogger related how a simple attempt to improve performance nearly spiraled out of control.  The lesson is that always measure and understand your performance problem before attempting any solution.  Now, the very scope of your measurements and understanding can vary depending on the complexity of your solution.  And when your "optimizations" have caused the system to go sideways, it is time to take a careful appraisal of whether to revert or continue.  I have done both, and more often have I wished that I reverted rather than continued.  Afterall, it is better for the code to work slowly rather than not work.

Again, always measure before cutting.

Thursday, December 17, 2015

Teaching Inclusively in Computer Science

When I teach, I want everyone to succeed and master the material, and I think that everyone in the course can.  I only have so much time to work with and guide the students through the material, so how should I spend this time?  What can I do to maximize student mastery?  Are there seemingly neutral actions that might impact some students more than others?  For example, before class this fall, I would chat with the students who were there early, sometimes about computer games.  Does those conversations create an impression that "successful programmers play computer games"?  To these questions, I want to revisit a pair of posts from the past year about better including the students.

The first is a Communications of the ACM post from the beginning of this year.  It listed several seemingly neutral decisions that can bias against certain groups.  Maintain a tone of voice that suggests every question is valuable and not "I've already explained that so why don't you get it".  As long as they are doing their part in trying to learn, then the failure is on me the communicator.

The second is a Mark Guzdial post on Active Learning.  The proposition is that using traditional lecture-style advantages the privileged students.  And a key thing to remember is that most of us are the privileged, so even though I and others have "succeeded" in that setting, it may have been despite the system and not because of the teaching.  Regardless of the instructor, the teaching techniques themselves have biases to different groups.  So if we want students to master the material, then perhaps we should teach differently.

Active learning has a growing body of research that shows using these teaching techniques help more students to succeed at mastering a course, especially the less privileged students.  Perhaps slightly less material is "covered", but students will learn and retain far more.  Isn't that better?

Wednesday, December 16, 2015

PhD Defense - Diagnosing performance limitations in HPC applications

Kenneth Czechowski defended his dissertation work this week.

He is trying to develop a science to the normal art of diagnosing low-level performance issues, such as processing a sorted array and i7 loop performance anomaly.  I have much practice with this art, but I would really appreciate having more formalism to these efforts.

One effort is to try identifying the cause of performance issues using the hardware performance counters.  These counters are not well documented and so the tools are low-level.  Instead, develop a meta tool to intelligently iterate over the counters thereby conducting a hierarchical event-based analysis, starts with 6 performance counters and then iterates on more detailed counters that relate to the performance issue.  Trying to diagnose why the core is unable to retire the full bandwidth of 4 micro-ops per cycle.

Even if a tool can provide measurements of specific counters that indicate "bad" behavior, the next problem is that observing certain "bad" behaviors, such as bank conflicts, do not always correlate to performance loss, as the operation must impact the critical path.

The final approach is to take the program and build artificial versions of the hot code, such as removing the memory or compute operations from the loop body.  For some applications, several loops account for most of the time.  Then the loops can be perturbed in different ways that force certain resources to be exercised further.  For example, the registers in each instruction are scrambled so that the dependency graph is changed to either increase or decrease the ILP while the instruction types themselves are unchanged.  

Tuesday, December 1, 2015

Course Design Series (3 of N) - Features versus Languages

One problem I had in preparing the curriculum for this fall is how to balance teaching programming languages versus the features that exist in programming languages.  Some analogs to my course have generally been surveys of programming languages, for example - Michael Hewner's course.  In such a course, study is focused strongly on learning a diverse set of languages, particularly pulling from logic and functional programming paradigms.

In general, the problem is that the each feature needs examples from programming languages to illustrate how it is used; however, many of these languages have never been previously taught to the students.  Therefore, I could spend the first month teaching the basics of Ada, Fortran, Pascal, et cetera.  But in doing this, essentially the class starts as a language class.  I had considered this approach; however, the non-survey courses and the textbooks do not begin with the languages.  These examples all begin with the features.  Furthermore, I have taken the further approach to avoid focusing on specific syntax of languages and instead devote my time to teaching features and design.

Having then accepted that the textbooks had a purpose and were rational in design, I followed their structure.  And in retrospect I can see that having upper-level students provides a base of knowledge about programming languages such that the need to cover specifics is avoided.  I have still taken some class periods to delve into specific languages; however, this is the exception rather than a rule.  I do note that in the future, I would spend some time teaching / requiring specific languages.  Without this requirement, I have been faced with grading a myriad of languages and find myself unable to assign problems / questions based on specific languages.

Tuesday, November 10, 2015

CSE Distinguished Lecture - Professor David Patterson - Instruction Sets want to be Free: The Case for RISC-V

Similar to every other talk on Computer Architecture, first we need to revisit history.  Only by knowing from where we came, do we envision where to go.

History of ISA:
IBM/360 was proposed to unify the diverse lines of mainframes.  Slowly the ISAs started adding more instructions to support more things (see below).

Intel 8086 was a crash ISA design program to cover for their original ISA design that was delayed.  Then IBM wanted to adopt the Motorola 68000, but the chip was late, so the IBM PC used 8088s.

In the 1980s, did a study that found that if the compiled code only used simple instructions, then the programs ran faster than using all of the instructions.  Why not design a simple ISA?

RISC (Reduced Instruction Set Computing) was that ISA.  The processor is simpler and faster.  Secretly, all processors are now RISC (internally), for example, the Intel and AMD processors translate from their x86 ISAs into their internal RISC ISA.

Maybe several simple instructions together could execute together, so the architecture could be simplified further and the compiler can find these instructions rather than spending time and energy when the program is running.  This ISA is VLIW (very long instruction word), where many simple instructions are merged into the long instruction.

Open ISA:
Computer Architecture is reaching certain limits such that processor gains will soon come from custom and dedicated logic.  IP issues limit the ability to do research on ISAs.  We are forced to write simulators that may or may not mimic actual hardware behavior.

Proprietary ISAs are continuing to grow in size, about 2 instructions per month.  This provides the marketing reason to purchase the new cores, rather than just the architectural improvements.

Instead, let's develop a modular ISA using many of the ideas from existing designs.  For example, atomic instructions for both fetch-and-op, as well as load link / store conditional.  Or, compressed instruction format so certain instructions can use a 16-bit format rather than 32-bits (see ARM).

RISC-V has support for the standard open-source software: compilers (gcc, LLVM), Linux, etc.  It also provides synthesizable core designs, simulators, etc.