Thursday, February 24, 2011

VC++ Concurrency Runtime

I was not aware of any particular concurrency support in the VC++ environment, so I was delighted when a friend of mine posted about the VC++ Concurrency Runtime and lambda expressions.  So while the article was about using lambda expressions, I learned about the concurrency support and especially about the Parallel Patterns Library.  Lambda expressions intrigue me, not so much for actually using them as for imagining how cool they are.  I shall save lambdas for another post.

Wednesday, February 9, 2011

Compiler Intrinsics - The Secret Sauce

Compiler intrinsics are one of those things that C programmers are often unaware of.  Sometimes they suspect that such functionality should exist, but assume it is cloaked in assembly.  Other times, the intrinsic does something the developer never imagined.  For this article, I will confine my notation and functionality to that provided by Visual Studio (see MSDN); however, I have every expectation that other platforms and compilers operate in a similar manner.

An intrinsic provides two things to a programmer.  First, it exposes compiler-supported functionality that might otherwise be unavailable.  Second, and related, the compiler understands what each intrinsic does.  Intrinsics come in three flavors: obscured functionality, replicated functionality, and compiler insights.  Discussing each flavor should make these benefits better understood.

Architectures provide many instructions that are not immediately accessible to the average programmer.  A simple example is counting the set bits in a variable.  Such functionality is quite simple for hardware and already provided by the processor, yet I am aware of no language that has notation for the operation.  Instead, a programmer can invoke __popcnt(var) and be done.  Invoking this intrinsic provides two benefits: performance and correctness.  On Stack Overflow, the equivalent C implementation is suggested as:


int NumberOfSetBits(int i)
{
    i = i - ((i >> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
    return ((i + (i >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
}

Replicated functionality steps up one level in complexity.  These operations may not have simple assembly equivalents, yet there can still be advantages to using the intrinsics.  An example is memcpy (in a prior job, I worked extensively on improving the implementation of memcpy, so this one is particularly dear to me).  The C runtime library provides a version of memcpy, which is a good general-purpose implementation.  Yet if the program is not copying variable-length data, inlining the call can provide a benefit.  Beyond avoiding the function call, a fixed copy size also enables the compiler to generate instructions without the loops and checks that are in the library implementation.  With the memcpy intrinsic, the compiler can do all of this; without it, the linker is limited to linking in the function from the separate library.

Many times I've heard someone cite the halting problem and correctly assert that a compiler cannot fully intuit the operation of a program.  Beyond this, even limited analysis of source code is a difficult problem.  To this end, there exists a limited set of intrinsics that provide the compiler with further information regarding program state.  In particular, there is __assume.  This intrinsic posits that some expression is true, which can enable further optimization of the subsequent code.  It is recommended that this intrinsic be paired with asserts, as follows.

#ifdef DEBUG
#define INVARIANT(e) assert(e)
#else
#define INVARIANT(e) __assume(e)
#endif

Compiler intrinsics give the programmer the opportunity to improve application performance, both by simplifying algorithms with better hardware support and by granting the compiler a greater understanding of the code's operation.

EDIT: These intrinsics are termed "builtins" in GCC; see the GCC documentation on builtins.

Wednesday, February 2, 2011

Ray Tracing Performance and Practice Part 3

When last we left the ray tracer, it could read in PLY files and then render a scene.  But while it rendered, you'd have time to get dinner, read some blogs, and write some code.  Very slow.  Improving performance is straightforward: either reduce the work done or find a way to do the same work faster.

At the time of original development, I knew that further performance opportunities existed via Oct Trees (more commonly written "octrees").  The idea behind these and similar structures is to partition the scene into sub-scenes that can be treated in the aggregate.  Let's work through a simple example.

While the actual ray tracer is in 3D, we'll reason through a 2D tracer.  We'll have lots of circles and triangles in standard XY coordinates.  We can start by partitioning the scene into quadrants: +X +Y, +X -Y, etc.  Now, when we draw a line (i.e., the ray) through the scene, rather than checking whether the ray intersects every circle and triangle, we can first check which quadrants it passes through.  If the ray doesn't pass through a quadrant, then it also cannot hit any object in that quadrant.  The Oct Tree works very similarly, except that besides X and Y, the scene is also split by +/- Z.

If the scene is split just once, this likely provides some benefit, but for real practicality each octant of the tree must also be split repeatedly until only a few scene objects (i.e., spheres and triangles) remain in each octant.  This is all well and good; however, four particular problems arise.  First, where should each octant be split?  Second, how many scene objects should be in each octant?  Third, how should objects that straddle octant boundaries be handled?  Finally, how does one practically render with an Oct Tree?

Where?  Two possibilities were tried: splitting at the "center" of the objects and splitting at the center of the octant.  If the split point is based on the objects, then each child octant should receive a similar number of objects.  This bounds the depth of the Oct Tree, since each octant holds a proper subset of its parent's objects.  However, it also forces every ray to traverse several levels of the tree, as every octant contains scene objects.  If instead each octant is split at its geometric center, then many octants may be empty, but the ones with scene objects may require a significant number of levels before reaching individual objects.

How many?  Ideally, this number would be chosen based on considerable experimentation and analysis.  At present, the code simply splits an octant when it contains more than 16 scene objects.  Given the 50k-triangle scene from the previous post, this would create roughly 3000 leaf octants, 375 parent octants, 47 grandparents, 6 great-grandparents, and 1 root.  Ideally.

Yet the split points are never ideal, so the octants end up imbalanced, especially because some scene objects cross octant boundaries.  For example, a sphere placed at the origin is present in all 8 octants.  The Oct Tree therefore places a scene object into every octant that it is present within.  Rather than computing whether the object is precisely within the octant, the Oct Tree creates a bounding cube around the object.  Back in the 2D space, draw a triangle with vertices (1,1), (-1,1), and (-1,-1).  A square around this triangle would also include (1,-1), and therefore the triangle would also be "part of" the +X -Y quadrant.  I would be happy to switch to computing whether the object intersects the octant boundary planes; however, that will have to wait until I understand the math involved.

The final issue is how to put the Oct Tree into practice.  For now, the intersect code follows, in pseudocode:

    if (no octants) {
        foreach (SceneObject in OctTree) {
            Compute distance for ray to intersect object
        }
        return closest object, or none
    }
    foreach (Octant in OctTree) {
        Compute distance for ray to intersect octant
    }
    Sort distances
    Select closest octant
    do {
        Recurse on octant
        if (object returned) return object
        Select next closest octant
    } while (octants remain)
    return none

A potential optimization is to trace shadow rays, reflected rays, etc. from the lowest containing octant rather than starting at the root.  However, implementing such an optimization requires significant state tracking, which also impedes making the ray tracer multi-threaded.

In a future part 4, I'll explore the hurdles of making the tracer multi-threaded, rather than its current single-threaded implementation.