A discussion of how to do Computer Science well, particularly writing code and architecting program solutions.
Saturday, March 26, 2011
Functional vs Imperative Programming
I was recently provided a link to another "programming" blog. The blog, Existential Type, is written by a professor from my undergrad institution, Carnegie Mellon. In it, he appears to be chronicling his efforts to teach functional programming to freshman computer science students. I have been deeply engrossed in reading the posts thus far, and they have revived memories of past debates about language choice, etc. You may look forward to several future posts delving deeper into this issue. But today is Saturday and I have more research to do.
Monday, March 7, 2011
A Rare Error
In the course of my research, I encountered the following error and was unable to find any reference to solutions. However, this error is likely confined to programmers emitting their own types via reflection, or perhaps to those dealing with a buggy compiler.
PEVerify - "call to .ctor only allowed to initialize this pointer ..."
This error signifies that a constructor (.ctor) must call either its own class's constructor or that of the type it extends; here, it is directly calling a different routine instead. The following code is a rough sketch of the error condition.
class baseT // implicitly derives from Object
{
    baseT() { }
}

class derT : baseT
{
    derT()
    {
        Object(); // calling a different constructor than base class
    }
}
For my work, changing the call from Object() to baseT() fixed the error.
Friday, March 4, 2011
Parallel Programming - First Pass
With my time dominated by preparing for PhD qualifying exams (i.e., quals), I have been even more lax than usual about preparing regular posts. Nonetheless, let's talk a little about parallel programming. In one sense, the parallel paradigm is the future of computer science, even if I remain highly skeptical about what the specifics of this computing will be. But even if its role in general-purpose computing remains unclear, the usefulness of parallel computing for specific problems is not in doubt. This post will serve as an overview of several concepts in parallel programming.
First, let's distinguish between concurrent and parallel execution. Concurrent execution has the possibility, or potential, of executing simultaneously. Parallel execution is when this potential is realized. Concurrent execution is possible with a single core; parallel execution is not.
Synchronization is the main question when writing concurrent code. Synchronization imposes a specific ordering on what would otherwise be independent execution. There are two common flavors: exclusion and notification. Exclusion consists of mutexes, spinlocks, and other constructs that guarantee that only one instance of concurrent execution performs a specific set of operations at a time. With notification, concurrent executions establish information about each other's progress, for example that every instance has reached a specific point (e.g., a barrier).
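As a minimal sketch of the two flavors (illustrative only, using C++11 threading primitives; the counter, thread count, and names are my own, not from any particular library):

#include <condition_variable>
#include <mutex>

// Exclusion: only one thread at a time may perform the protected update.
std::mutex counter_lock;
long shared_counter = 0;

void increment()
{
    std::lock_guard<std::mutex> guard(counter_lock);
    ++shared_counter;
}

// Notification: a one-shot barrier; every thread waits until all have arrived.
std::mutex barrier_lock;
std::condition_variable barrier_cv;
int arrived = 0;
const int kThreads = 4;   // assumed number of participating threads

void barrier_wait()
{
    std::unique_lock<std::mutex> guard(barrier_lock);
    if (++arrived == kThreads)
        barrier_cv.notify_all();              // the last thread releases the others
    else
        barrier_cv.wait(guard, [] { return arrived == kThreads; });
}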
An ongoing quest in synchronization research is transactional memory (TM). TM provides the ability to make a set of memory updates atomically. Processors provide the ability to make simple updates atomic (see Compiler Intrinsics), yet a series of updates requires the explicit exclusion guarantee provided by spinlocks, etc. TM moves the exclusion to the memory addresses themselves, rather than the abstract object protected by the spinlock, and allows an arbitrary set of accesses to be encapsulated in one atomic operation. However, TM is not yet practical.
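To make the distinction concrete, here is a small sketch of my own (not a TM implementation): a single-word update that the hardware can already make atomic, and a two-word transfer that today still needs a lock, which is exactly the pattern TM aims to turn into one atomic transaction.

#include <atomic>
#include <mutex>

std::atomic<unsigned int> flags(0);       // a single word: atomic read-modify-write in hardware

void set_bit(int bit)
{
    flags.fetch_or(1u << bit);            // one atomic instruction, no lock needed
}

// A series of updates (debit one account, credit another) is not atomic by itself,
// so it is guarded with a lock; TM would let the pair run as one transaction.
int from_balance = 100;
int to_balance = 0;
std::mutex accounts_lock;

void transfer(int amount)
{
    std::lock_guard<std::mutex> guard(accounts_lock);
    from_balance -= amount;
    to_balance += amount;
}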
Parallel patterns are based on the observation that parallel programs and algorithms can be classified into several distinct groups (i.e., patterns). An assembly line is a parallel operation and fits the "pipeline" pattern. By recognizing the pattern, the programmer can avoid certain common errors. With the pipeline, the programmer recognizes that the data is to be passed through discrete stages.
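As a rough sketch of the pipeline pattern (hypothetical stages and a hand-rolled queue, purely for illustration): two stages run on separate threads, and work items flow between them through a small thread-safe channel, like parts moving between stations.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// A minimal thread-safe queue connecting two pipeline stages.
template <typename T>
class Channel
{
    std::queue<T> items;
    std::mutex lock;
    std::condition_variable ready;
public:
    void put(T value)
    {
        { std::lock_guard<std::mutex> guard(lock); items.push(value); }
        ready.notify_one();
    }
    T get()
    {
        std::unique_lock<std::mutex> guard(lock);
        ready.wait(guard, [this] { return !items.empty(); });
        T value = items.front();
        items.pop();
        return value;
    }
};

// Stage 1 produces work items; stage 2 consumes them, like stations on an assembly line.
void produce(Channel<int>& out) { for (int i = 0; i < 100; ++i) out.put(i * i); }
void consume(Channel<int>& in)  { for (int i = 0; i < 100; ++i) { int item = in.get(); (void)item; } }

int main()
{
    Channel<int> stage_link;
    std::thread stage1(produce, std::ref(stage_link));
    std::thread stage2(consume, std::ref(stage_link));
    stage1.join();
    stage2.join();
    return 0;
}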
Well, that's my prelude to what will likely be many more posts on parallel programming.
Thursday, February 24, 2011
VC++ Concurrency Runtime
I was not aware of any particular concurrency support in the VC++ environment, so I was delighted when a friend of mine posted about the VC++ Concurrency Runtime and lambda expressions. So while the article was about using lambda expressions, I learned about the concurrency support, and especially about the Parallel Patterns Library. Lambda expressions intrigue me, not because I actually use them, but because I persist in imagining how cool they are. I shall save lambdas for another post.
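For a taste of what the Parallel Patterns Library looks like with a lambda (a minimal sketch of my own, not the code from the article):

#include <ppl.h>
#include <vector>

int main()
{
    std::vector<double> data(1000, 1.0);

    // parallel_for from the PPL, with a lambda expression as the loop body;
    // each iteration is independent, so iterations may run on different cores.
    concurrency::parallel_for(0, static_cast<int>(data.size()), [&data](int i)
    {
        data[i] = data[i] * 2.0 + 1.0;
    });

    return 0;
}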
Wednesday, February 9, 2011
Compiler Intrinsics - The Secret Sauce
Compiler intrinsics are one of the things that C programmers are often unaware of. Sometimes they know that such functionality should exist, but suspect it is cloaked in assembly. Other times, the intrinsic does something otherwise unimagined by the developer. For this article, I will confine my notation and functionality to that provided by Visual Studio (see MSDN); however, I have every expectation that other platforms / compilers operate in a similar manner.
The intrinsic provides two things to a programmer. First, compiler-supported functionality that might otherwise be unavailable. Second, and related, the compiler understands what each intrinsic does. The intrinsics come in three flavors: obscured functionality, replicated functionality, and compiler insights. In discussing these flavors, I think their benefits will become clearer.
Architectures provide many instructions that are not immediately accessible to the average programmer. A simple example is counting the set bits in a variable. Such functionality is quite simple for hardware and already provided by the processor, yet I am aware of no language that has notation for such an operation. Instead, a programmer can invoke __popcnt(var) and be done. Invoking this intrinsic provides two benefits: performance and correctness. On Stack Overflow, the equivalent C implementation is suggested as:
int NumberOfSetBits(int i)
{
    i = i - ((i >> 1) & 0x55555555);                        // sum bits in pairs
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);         // sum pairs into nibbles
    return ((i + (i >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;  // sum nibbles into bytes; total lands in the top byte
}
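For comparison, invoking the MSVC intrinsic directly looks like the following (a small sketch; __popcnt is declared in <intrin.h> and requires a processor with the POPCNT instruction):

#include <intrin.h>
#include <cstdio>

int main()
{
    unsigned int var = 0xF0F0F0F0u;
    unsigned int bits = __popcnt(var);    // typically compiles to a single popcnt instruction
    std::printf("set bits: %u\n", bits);  // prints 16
    return 0;
}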
Replicated functionality steps up one level in complexity. These operations may not have simple assembly equivalents, yet there may still be advantages to using the intrinsics. An example is memcpy (in a prior job, I worked extensively on improving the implementation of memcpy, so this one is particularly dear to me). There is a version of memcpy in the C runtime library, which is a good general-purpose implementation. Yet, if the program is not copying variable-length data, inlining the call can provide a benefit. Beyond just avoiding the function call, having a fixed copy size also enables the compiler to generate instructions without the loops, etc., that are in the memcpy implementation. By utilizing the memcpy intrinsic, the compiler can do all of this, whereas without the intrinsic the linker is limited to linking in the function from the separate library.
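A small sketch of the fixed-size case (the struct and function are hypothetical; #pragma intrinsic asks MSVC to expand memcpy inline rather than call the library routine):

#include <string.h>
#pragma intrinsic(memcpy)

struct Packet { char bytes[64]; };

void copy_packet(Packet* dst, const Packet* src)
{
    // The size is a compile-time constant, so the compiler can emit
    // straight-line move instructions instead of the general-purpose loop.
    memcpy(dst, src, sizeof(Packet));
}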
Many times I've heard someone cite the halting problem and correctly assert that a compiler cannot fully intuit the operation of a program. Beyond this, even limited analysis of source code is a difficult problem. To this end, there exists a limited set of intrinsics that provide the compiler with further information regarding program state. In particular, there is __assume. This intrinsic posits that some expression is true, which can result in further optimization of the subsequent code. It is recommended that this intrinsic be paired with asserts, as follows.
#ifdef DEBUG
#define INVARIANT(e) assert(e)
#else
#define INVARIANT(e) __assume(e)
#endif
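A common use beyond the macro above (again a sketch, with hypothetical values) is telling the optimizer that a switch default can never be reached, letting it drop the range check:

int classify(int kind)
{
    switch (kind)
    {
    case 0: return 10;
    case 1: return 20;
    case 2: return 30;
    default: __assume(0);   // MSVC: this path is unreachable, so no code is generated for it
    }
}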
Compiler intrinsics provide the programmer the opportunity to improve application performance, both by simplifying algorithms with better hardware support and by giving the compiler a greater understanding of the code's operation.
EDIT: The intrinsics are also termed "builtins" for GCC, see here.
Wednesday, February 2, 2011
Ray Tracing Performance and Practice Part 3
When we last left the ray tracer, it could read in PLY files and then render a scene. But while it rendered, you'd probably want to get dinner, read some blogs, and write some code. Very slow. Improving performance is straightforward: either reduce the work done or find a way to do the same work faster.
At the time of original development, I knew that there existed further performance opportunities via Oct Trees. The idea behind these and other similar structures is to partition the scene into sub-scenes that can be treated in the aggregate. Let's work through a simple example.
While the actual ray tracer is in 3D, we'll reason through a 2D tracer. We'll have lots of circles and triangles in standard XY coordinates. We can start partitioning the scene into quadrants: +X +Y, +X -Y, etc. Now, when we draw a line (i.e., the ray) through the scene, rather than checking whether the ray is intersecting every circle and triangle, we can first check whether it is passing through any quadrant. If the ray doesn't pass through the quadrant, then the ray will also not pass through any object in the quadrant. The Oct Trees work very similarly, except besides X and Y, the scene is also split by +/- Z.
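To make the "does the ray even reach this region?" check concrete, here is a sketch of the standard slab test for a ray against an axis-aligned box (my own illustration, not the tracer's actual code; it assumes the ray direction has no zero components):

#include <algorithm>

struct Vec3 { double x, y, z; };
struct Box  { Vec3 min, max; };   // an axis-aligned octant (or quadrant, ignoring z)

// Returns true if origin + t*dir passes through the box for some t >= 0.
bool RayHitsBox(const Vec3& origin, const Vec3& dir, const Box& box)
{
    const double o[3]  = { origin.x, origin.y, origin.z };
    const double d[3]  = { dir.x, dir.y, dir.z };
    const double lo[3] = { box.min.x, box.min.y, box.min.z };
    const double hi[3] = { box.max.x, box.max.y, box.max.z };

    double tmin = 0.0, tmax = 1e30;
    for (int axis = 0; axis < 3; ++axis)
    {
        double t0 = (lo[axis] - o[axis]) / d[axis];
        double t1 = (hi[axis] - o[axis]) / d[axis];
        if (t0 > t1) std::swap(t0, t1);
        tmin = std::max(tmin, t0);
        tmax = std::min(tmax, t1);
        if (tmin > tmax) return false;   // the slabs do not overlap: the ray misses the box
    }
    return true;                          // if the ray misses, every object inside can be skipped
}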
If the scene is split just once, this can provide some benefit, but for real practicality each octant of the tree must also be split repeatedly until only a few scene objects (i.e., spheres and triangles) remain in each octant. This is all well and good; however, four particular problems arise. First, where should each octant be split? Second, how many scene objects should be in each octant? Third, how should objects that straddle octant boundaries be handled? Finally, how does one practically render with an Oct Tree?
Where? Two possibilities were tried: splitting at the "center" of the objects and splitting at the center of the octant. If the split point is based on the objects, then each octant should have a similar number of objects. This bounds the depth of the Oct Tree by making each octant hold a subset of the parent octant. However, it also forces every ray to traverse several levels of the Oct Tree, as every octant should contain scene objects. Instead, if each octant is split at its center, then many octants may be empty, but the ones with scene objects may have a significant number of levels in the tree before reaching individual objects.
How many? Ideally, this number would be reached based on considerable experimentation and analysis. At present, the code just splits an octant when it contains more than 16 scene objects. Given the 50k triangle scene from the previous post, this would create roughly 3000 final octants, 375 parent octants, 47 grandparents, 6 great grandparents, and 1 root. Ideally.
Yet the split points are never ideal, and so the octants are imbalanced, especially as some scene objects cross octant boundaries. For example, a sphere placed at the origin is present in all 8 octants. So the Oct Tree places a scene object into every octant in which it is present. Rather than computing whether the object is precisely within the octant, the Oct Tree creates a bounding cube around the object. Back in the 2D space, draw a triangle from (1,1), (-1, 1), and (-1, -1). A square around this triangle would also include (1, -1), and therefore the triangle would also be "part" of the +X -Y quadrant. I would be happy to switch to computing whether the object actually intersects the octant boundary planes; however, that will be when I understand the math involved.
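As a sketch of that conservative test (hypothetical types, mirroring the Vec3/Box shapes in the earlier sketch): compute the axis-aligned bounding box of a triangle, then register the triangle with every octant whose box overlaps it.

#include <algorithm>

struct Vec3 { double x, y, z; };
struct Box  { Vec3 min, max; };
struct Triangle { Vec3 a, b, c; };

Box BoundsOf(const Triangle& t)
{
    Box b;
    b.min = { std::min({ t.a.x, t.b.x, t.c.x }),
              std::min({ t.a.y, t.b.y, t.c.y }),
              std::min({ t.a.z, t.b.z, t.c.z }) };
    b.max = { std::max({ t.a.x, t.b.x, t.c.x }),
              std::max({ t.a.y, t.b.y, t.c.y }),
              std::max({ t.a.z, t.b.z, t.c.z }) };
    return b;
}

// Conservative: the boxes can overlap even when the triangle itself stays
// outside the octant, which is exactly the (1, -1) corner case above.
bool Overlaps(const Box& p, const Box& q)
{
    return p.min.x <= q.max.x && q.min.x <= p.max.x &&
           p.min.y <= q.max.y && q.min.y <= p.max.y &&
           p.min.z <= q.max.z && q.min.z <= p.max.z;
}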
The final issue is how to put the Oct Tree into practice. For the moment, the intersect routine, in pseudocode, follows:
Intersect(Ray, OctTree):
    if (OctTree has no child octants)          // leaf: test the scene objects directly
    {
        foreach SceneObject in OctTree
        {
            Compute distance for Ray to intersect SceneObject
        }
        return closest object, or none
    }
    foreach Octant in OctTree                  // interior node: test the child octants
    {
        Compute distance for Ray to intersect Octant
    }
    Sort distances
    Select closest octant
    do {
        Recurse on octant
        if (object returned) return object
        Select next closest octant
    } while (octants remain)
    return none
A potential optimization is to handle shadows, reflected rays, etc., from the lowest octant rather than starting at the root. However, implementing such an optimization requires significant state tracking, which also impedes making the ray tracer multi-threaded.
In a future part 4, I'll explore the hurdles of making the tracer multi-threaded rather than its current implementation of one thread.
Wednesday, January 19, 2011
Review: Measurement Bias (ASPLOS'09)
Given that I read 150 - 200 research papers a year, it seems only reasonable that I point out papers that I find particularly interesting. One of the first papers that I read in grad school (and still one of the most interesting) is the subject of this post, Producing wrong data without doing anything obviously wrong!
We know of many sources of variance in our measurements, such as whether other applications or processing are running. Or second-order items like where the data is laid out on disk (inner versus outer tracks) or which specific pages of memory are allocated (as this can influence caching). But these variations are (usually) different from run to run, so by taking many measurements we can obtain an accurate picture of performance in which the events above occur with some natural frequency.
The paper tests the following comparison: what benefit does -O3 provide over -O2 in gcc? Beyond the variations above, what other items may affect performance without our awareness, particularly those that don't vary from run to run? The danger is that these artifacts can produce "wrong data" without our knowing it. Two artifacts are analyzed in the paper: linking order and environment size. Taking these in order.
The authors found that changing the order in which libraries are linked into an application produced a performance variation of 15%. On further analysis, they found that certain performance-critical sections of code had different alignments depending on the linking order. Sometimes the code would fit in one cache line, other times two. This bias persists in both gcc and ICC (Intel's compiler).
Environment size also has unpredictable effects on application performance. On the UNIX systems tested, environment variables are loaded onto the stack before the call into the application's main() function. As the variables increase in size, the alignment of the stack changes, and this causes performance effects throughout the code. Some applications show minimal effects; others vary by +/- 10%.
While these are but two cautionary tales of how small changes to program state can have significant performance impacts, the main take-away is that such effects will always be present. They call us to action in preparing more diverse experimental setups that can address these biases, just as conducting multiple runs addresses interrupts, context switches, etc.