
Friday, September 14, 2018

Is C low level?

A recent ACM article, C Is Not a Low-Level Language, argues that for all of our impressions that C is close to hardware, it is not actually "low-level".  The argument is as follows: C was written for the PDP-11 architecture, and at the time, it was a low-level language.  As architectures have evolved, C's machine model has diverged from the hardware, which has forced processor designs to add new features to attain good performance with C, such as superscalar execution for ILP and extensive branch prediction.
Processors must also maintain caches to hide memory latency, which requires significant logic to maintain coherence and the illusion that memory is shared between the threads of a process.  Furthermore, the compiler is also called upon to find optimization opportunities, which may be unsound and definitely require programmer-years to implement.

The author repeatedly contrasts C with GPUs, while noting that they require very specific problems, or come "at the expense of requiring explicitly parallel programs".  If you were not keeping track, a GPU requires thousands of threads to match the CPU's performance.  The author calls for, "A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model."  That description generally sounds like a GPU.

I appreciate the callouts to C's shortcomings, which it certainly has.  The notion that C has driven processor design is odd, yet it does reflect the fact that processors are designed to run current programs fast.  And with those programs written in either C or a language built on C, many programs are forced into particular patterns.  I even spent some time in my PhD studies considering a version of this problem: how do you design a new "widget" into the architecture when no existing programs are written to use widgets?

All to say, I think C is a low-level language, and while its distance from hardware may be growing, there is nothing else beneath it.  That gap needs to be addressed, and by a language with explicit parallel support.

Monday, February 6, 2017

Conference Attendance CGO (etc) 2017 - Day 1

I started my conference attendance this time on Saturday, with the LLVM-Performance workshop, where I presented an extension of my dissertation work.  I received interesting and useful feedback from the other attendees, and saw additional possibilities for its usage and for collaboration.  Now that it is Monday, it is time to attend conference talks.  This evening, I will be the advisor, watching one of my students present our work; we practiced today, so it should go great!

Checking Concurrent Data Structures Under the C/C++11 Memory Model
C/C++11 added keywords that allow specifying features of the memory model (covered previously).  In order to check data structure implementations, the data structures need to be further annotated so as to describe which executions are valid and which are invalid.  For example, is a queue required to always return an element, or can it fail if an element was only recently added?  Using these annotations, the authors were able to find issues in the data structure implementations.
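
For reference, the keywords in question are the memory-order arguments to the C11/C++11 atomic operations.  Here is a toy release/acquire example (my own illustration, not the authors' annotation language):

#include <stdatomic.h>

_Atomic int ready = 0;
int data = 0;

void producer(void)
{
    data = 42;
    // Release: all writes above become visible to any thread whose
    // acquire load observes this store.
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int consumer(void)
{
    // Acquire: once ready == 1 is observed, data == 42 is guaranteed.
    while (!atomic_load_explicit(&ready, memory_order_acquire)) { }
    return data;
}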

Efficient Abortable-locking Protocol for Multi-level NUMA Systems
The impact of NUMA can be significant.  On the largest shared-memory machines, accessing lock data local to an SMT thread can be over 2000x faster than accessing it across the farthest distance in the machine.  To avoid this overhead, the authors create a hierarchy of locks that mirrors the system's topology, where each level of the hierarchy acts as an MCS-style queue lock.  How then can threads abort from their queue?  In a single level, threads mark their status as aborted and are then skipped when handing off the lock.  In the hierarchy, the obligation to wait at a specific level is passed along to the appropriate thread, which can be determined by following the lower-level links.
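
To make the single-level case concrete, here is a minimal sketch of an abortable MCS-style queue lock in C11 atomics.  This is my own illustration of the idea as described in the talk, not the authors' code, and it glosses over the hierarchy and the reclamation of aborted nodes:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { WAITING, GRANTED, ABORTED };

typedef struct qnode {
    _Atomic(struct qnode*) next;
    _Atomic int status;
} qnode;

typedef struct { _Atomic(qnode*) tail; } mcs_lock;

// Enqueue and spin; give up after `patience` iterations.
bool mcs_acquire(mcs_lock* lk, qnode* me, long patience)
{
    atomic_store(&me->next, NULL);
    atomic_store(&me->status, WAITING);
    qnode* pred = atomic_exchange(&lk->tail, me);
    if (pred == NULL) return true;     // lock was free
    atomic_store(&pred->next, me);     // link in behind the predecessor
    while (patience-- > 0)
        if (atomic_load(&me->status) == GRANTED) return true;
    // Timed out: try to mark ourselves aborted; if the hand-off already
    // granted us the lock, we own it after all.
    int expect = WAITING;
    return !atomic_compare_exchange_strong(&me->status, &expect, ABORTED);
}

void mcs_release(mcs_lock* lk, qnode* me)
{
    qnode* cur = me;
    for (;;) {
        qnode* n = atomic_load(&cur->next);
        if (n == NULL) {
            // No visible successor: try to swing the tail to empty.
            qnode* tail_expect = cur;
            if (atomic_compare_exchange_strong(&lk->tail, &tail_expect, NULL))
                return;
            // An enqueuer is racing; wait for its link to appear.
            while ((n = atomic_load(&cur->next)) == NULL) { }
        }
        // Hand off: grant if still waiting; otherwise skip the aborted node.
        int expect = WAITING;
        if (atomic_compare_exchange_strong(&n->status, &expect, GRANTED))
            return;
        cur = n;
    }
}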

The implementation was model checked and then run on an HP Superdome with 576 hardware threads.  The results showed that the lock performs best when it respects all four levels of the NUMA (and NUCA) hierarchy.

Thread Data Sharing in Cache: Theory and Measurement
In this work, the authors collected memory access traces using Pin to determine the thread data sharing in parallel programs.  In particular, they showed how several different metrics of data sharing can be derived in a single pass over the trace, in time linear in the trace size.  The concern is that trace-based analysis approaches are very slow and could potentially skew the results.  And when the results are derived from only one trace, there are additional questions about their applicability.
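
For a flavor of what a single-pass derivation can look like (a toy metric of my own, not the paper's), one linear scan over a trace of (thread, address) records can count how many cache lines are touched by more than one thread:

#include <stdint.h>

#define LINE_SHIFT 6          // 64-byte cache lines
#define MAX_LINES (1 << 20)   // toy bound on distinct lines (collisions ignored)

typedef struct { int tid; uint64_t addr; } access_t;

// One pass, O(n) in the trace length: track a bitmask of threads per
// line and count each line the first time a second thread touches it.
long count_shared_lines(const access_t* trace, long n)
{
    static uint32_t sharers[MAX_LINES]; // supports up to 32 threads
    long shared = 0;
    for (long i = 0; i < n; i++) {
        uint64_t line = (trace[i].addr >> LINE_SHIFT) % MAX_LINES;
        uint32_t before = sharers[line];
        sharers[line] |= 1u << trace[i].tid;
        // A second, distinct thread arrived if the mask changed and the
        // old mask had exactly one bit set.
        if (before != sharers[line] && before && (before & (before - 1)) == 0)
            shared++;
    }
    return shared;
}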


Tuesday, September 6, 2016

When is it signaled?

Signals are a mechanism for notifying a process of a simple event.  Most programmers and programs can ignore them, as the default behaviors are reasonable and handling them would greatly increase program complexity.  But when teaching future computer scientists, we want programmers to know about these mechanisms and thereby properly understand the functioning of the system.

In working with signals, developing programmers are often exposed to their first dose of concurrency: the idea that execution can happen simultaneously and in arbitrary orders, except where the program takes explicit action.  With signals, a program can do several things (sketched in the code after this list):

  • Provide and install a handler for one or more signals
  • Block the receipt of one or more signals
  • Send a signal to one or more processes
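
As a minimal sketch of all three operations (the signal choices are arbitrary):

#include <signal.h>
#include <unistd.h>

static void on_usr1(int sig)
{
    (void)sig;
    // Handlers should only use async-signal-safe calls, like write().
    const char msg[] = "got SIGUSR1\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
}

int main(void)
{
    // 1. Provide and install a handler for SIGUSR1.
    struct sigaction sa = {0};
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);

    // 2. Block SIGINT; it stays pending until unblocked.
    sigset_t block;
    sigemptyset(&block);
    sigaddset(&block, SIGINT);
    sigprocmask(SIG_BLOCK, &block, NULL);

    // 3. Send a signal (here, to ourselves).
    kill(getpid(), SIGUSR1);
    return 0;
}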

We are going to consider the problem of sending and receiving a signal.  After a signal is sent, when does the other process receive it?  Students assume that when the sending function (i.e., kill) has returned, the signal has been sent *and received*.  However, I have found no text that explicitly guarantees this condition.  So I prepared a simple program (source at the end of this post) to test it.

As signals are communicated through the operating system, we want a different mechanism for measuring simultaneity; in this case, shared memory.  The experiment program creates and sets up a small space of shared memory between two processes.  Next, it waits until both processes are running in a known state (i.e., a barrier).  Then one (the parent) signals the other (the child), while measuring how long the signal takes to send and to be received.  Finally, this experiment is run a million times.

On my current Ubuntu box with a Skylake processor, the measurements show that 80% of the time, the child runs for an average of 40 cycles after kill returns, with a maximum of almost 1200 cycles.  Assuming the clock skew between cores is small, this effectively means that the child can continue to run even after the sending function has returned.

Source code follows (compiled with gcc version 4.8.3, using gcc -O3 -lrt; the local header rdtsc.h provides the rdtsc() timestamp function):
#include <sys/mman.h>
#include <sys/types.h>

#include <sys/wait.h>
#include <sys/stat.h> /* For mode constants */
#include <fcntl.h> /* For O_* constants */
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <stdint.h>

#include "rdtsc.h"

static const int ITER = 1000 * 1000;

int main(int argc, char** argv)
{
    int shm_fd = shm_open("/squig", O_CREAT | O_RDWR, (S_IREAD | S_IWRITE)); // POSIX shm names start with '/'
    int count = 0, i;
    uint64_t zero = 0, min = ~0x0, max = 0, sum = 0;
    uint64_t minN = ~0x0, maxN = 0, sumN = 0;
    write(shm_fd, &zero, 8);  // Give the shared file "space"
    void* msh = mmap(NULL, 4 * 1024, PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, 0);
    volatile uint64_t* tsc = msh;
    
    for (i = 0; i < ITER; i++)
    {
        *tsc = 0;
        pid_t chd = fork();
        
        if (chd == 0)
        {
            // Child: spin, continually publishing a timestamp, until killed
            while (zero < ~0x0) { *tsc = rdtsc(); zero++; }
        }
        else
        {
            // Wait for the child
            while (*tsc == 0) {}
            
            kill(chd, SIGINT);    // Send SIGINT to terminate the child
            uint64_t k = rdtsc(); // Timestamp just after kill returns
            wait(NULL); // Reap
            uint64_t delta = 0;
            uint64_t en = *tsc; // When did the child end?
            
            // k > en implies that kill returned after the child terminated
            if (k > en)
            {
                count ++;
                delta = k - en;
                if (delta < minN) minN = delta;
                if (delta > maxN) maxN = delta;
                sumN += delta;
            }
            else
            {
                delta = en - k;
                if (delta < min) min = delta;
                if (delta > max) max = delta;
                sum += delta;
            }
        }
    }
    
    printf("Min: %lx, Max: %lx, Avg: %lx\n", min, max, (sum / (ITER - count)));
    printf("Min: %lx, Max: %lx, Avg: %lx\n", minN, maxN, (sumN / (count)));
    printf("Percent Parent After: %lf\n", (count / (double)ITER));
    
    return 0;
}

Update: Results also hold when sending SIGKILL.


Monday, February 23, 2015

Compilers and Optimizations, should you care?

Compiler optimizations matter.  Once, while helping a fellow Ph.D. student improve a simulator's performance, I did two things: first, I replaced an expensive data structure with a more efficient one; second, I turned on compiler optimizations.  Together, these changes made the simulator run 100x faster.

A question posted on Stack Exchange asked, "Why are there so few C compilers?"  The main answer pointed out that any C compiler needs to be optimizing.  Many optimizations run on every compilation, each gaining the tiniest increment in performance.  While I enjoy discussing them in detail, I generally wave my hands and say that they are good, yet make debugging difficult.  These optimizations are lumped together as the "optimization level".

In "What Every Programmer Should Know About Compiler Optimizations", we revisit optimizations.  First, the compiler is no panacea and cannot correct for inefficient algorithms or poor data structure choices (although I am party to research on the later).  The article then suggests four points to assist the compiler in its efforts at optimizing the code. 

"Write understandable, maintainable code."  Please do this!  Usually, the expensive resource is the programmer.  So the first optimization step is to improve the programmer's efficiency with the source code.  Remember Review: Performance Anti-patterns and do not start optimizing the code until you know what is slow.

"Use compiler directives."  Scary.  Excepting the inline directive, I have used these less than a half dozen times in almost as many years of performance work.  Furthermore, the example of changing the calling convention is less relevant in 64-bit space where most conventions have been made irrelevant.

"Use compiler-intrinsic functions."  (see Compiler Intrinsics the Secret Sauce)  This can often dovetail with the first point by removing ugly bit twiddling code and putting in clean function calls.

"Use profile-guided optimization (PGO)."  This optimization is based on the dynamic behavior of the program.  Meaning that if you take a profile of the program doing X, and later the program does Y; executing Y can be slower.  The key is picking good, representative examples of the program's execution.

So you have dialed up the optimization level and written understandable code sprinkled with intrinsics.  Now what?  The next step (with which I agree) is to use link-time optimization (LTO), also known as Link-Time Code Generation (LTCG).  This flag delays many optimizations until the entire program is available to be linked.  One of the principles of software performance is that the more of the program that is available, the better it can be optimized (this principle also applies in computer architecture).  Thus, by delaying many optimizations until link time, the linker can find additional and better opportunities than were present in the individual components.
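
With GCC, for example, LTO is enabled by passing -flto at both compile and link time:

gcc -O2 -flto -c a.c
gcc -O2 -flto -c b.c
gcc -O2 -flto a.o b.o -o prog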

The article notes, "The only reason not to use LTCG is when you want to distribute the resulting object and library files."  And alas, I have fought several battles to overcome this point, as my work requires the use of LTO.  Perhaps in the next decade, LTO will be standard.

Tuesday, September 16, 2014

Atomic Weapons in Programming

In parallel programming, most of the time locks are good enough for the application.  And when they are not, you may need to resort to atomic weapons.  While I can and have happily written my own lock implementations, it's like the story of the lawyer who redoes his kitchen himself: it is not a good use of the lawyer's time unless he enjoys it.

That said, I have had to use atomic weapons against a compiler, which happily reordered several memory operations in an unsafe way.  Using fence operations, I was able to prevent this reordering, while seeing no fence instructions in the resulting assembly.  I still wonder if there was some information I was failing to provide.
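
My guess, in hindsight, is that what was needed was a compiler-only fence: it constrains the compiler's reordering but emits no machine instruction.  A sketch in C11 (the GCC-specific asm volatile("" ::: "memory") barrier behaves similarly):

#include <stdatomic.h>

int a, b;

void publish(void)
{
    a = 1;
    // Forbids the compiler from reordering memory accesses across this
    // point, yet emits no fence instruction (unlike atomic_thread_fence,
    // which may emit one).
    atomic_signal_fence(memory_order_seq_cst);
    b = 1;
}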

Regardless, the weapons are useful!  And I can thank the following presentation, Atomic Weapons, for pointing me to the particular weapon that was needed.  I have reviewed earlier work by Herb Sutter and he continues to garner my respect (not that he is aware).  Nonetheless, I suggest any low-level programmer be aware of the tools that are available, as well as the gremlins that lurk in these depths and might necessitate appropriate weaponry.

Monday, March 3, 2014

Ongoing Goto Wars

Recently, Apple announced a security fix for their SSL implementation.  I would have taken little note here, except that WIRED's summary indicated the cause was the use of goto.  Ah, goto, how useful you are.  How abused you are.

In practice, I would also have used goto if I were writing a series of conditionals with only a single error block.  The code could be restructured to chain all of the conditionals into a single clause while retaining the specific error code for the check that failed, but I would really dislike to see the following:
if ((e = foo) != 0 && (e = bar) != 0 && ...) goto fail;
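
For contrast, the goto version keeps each check on its own line while funneling every failure to a single error block (foo and bar stand in for the real checks, as in the one-liner above):

if ((e = foo) != 0) goto fail;
if ((e = bar) != 0) goto fail;
/* ... further checks ... */
return 0;
fail:
    /* single cleanup block; e still holds the failing check's code */
    return e;
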
So what then?  Do we call upon the software engineering folks to better analyze programs and find these mistakes?  Or write style guidelines that require braces for every if and ban every goto, with the employee's manager required to sign off on every exception?  These are some of the many proposals floated for addressing this, often made by programmers touting the superiority of their proposal and how this mistake would never happen to them.

Leave me to shrug.  I'll use my goto, sparingly.  And I'll continue to impress on students that they, and all of us programmers, will make mistakes, so write elegant code to help find the bugs.  Until then, if anyone has a good construction for preserving the error code that doesn't require goto, let me know.

Tuesday, April 16, 2013

When whitespace matters

In the process of grading student programming projects, I discovered that whitespace can matter in how C code is compiled.  Some programmers may be aware that adjacent string literals are concatenated:

const char* t = "this string" " is one";

The whitespace between the literals is ignored.  Furthermore, programmers may encode strings as macros and replace each instance with the macro:

#define A_STRING " is one"

Now, in recent versions of g++, C++11 support means that the token following a string literal can be treated as either a macro or a user-defined literal suffix.  The compiler makes the determination based on whether there is whitespace:

const char* t = "this string"A_STRING; // compiler error
const char* t = "this string" A_STRING; // expected behavior

I like consistent use of whitespace in code for stylistic reasons, but introducing this singular dependency on whitespace is odd for C/C++, in contrast to Python, which is highly whitespace-dependent.