Thursday, December 30, 2010

A Pointer on Pointers Part 2

In the second part to the series on pointers, we'll be covering what they point to and how it relates to reality.  Let's consider four pointers: void*, char[], struct*, and (void *)f(int) (i.e., a function pointer).  With each pointer, we'll learn something further about C, assembly and how a processor interacts with memory.

To begin, there is the all purpose void* pointer (and yes, void* means pointer to void, but I'll keep writing void* pointer for emphasis).  A C type that means a pointer to nothing, or anything.  This pointer is important for allocating space and casting (changing the type of) the space.  This second piece is something that only has representation in the programming language, in assembly every pointer has no type information.  So by casting, the programmer tells the compiler that this block of memory is now to be treated differently.  Therefore, void* pointers are used when the type is not known, (e.g., pthread_create(..., void* arg) or CreateThread(..., LPVOID lpParameter, ...)) or to discard existing type information.

A char[] is the next C type that will be discussed here.  Every array in C is a pointer.  Learn this fact.  Internalize it.  They are so identical that you can use them interchangeably, like so:

char* foo = (char*) malloc(1024 * sizeof(char));
foo[0] = '\0'; // 1
*(foo + 0) = '\0'; // 2

Line 1 and 2 are equivalent.  So a pointer to many chars is an array of chars.  And we can access any array offset with the pointer.  Or we can use pointer arithmetic to access specific elements.  Now, next time you see char*, you can think of it as an array of characters.  (In a later post, we'll cover char** and other more complex pointers).

Working with the third type, a struct* pointer.  Structs are merely logical arrangements of data that a programmer specifies.  So accessing this data via a pointer will set up assembly to be at some offset from the base pointer.  If you want to ignore this compiler support, such things are your purview.  And in seeing how you can do this, we'll learn what the compiler is doing and what a struct* pointer is.  We want a struct with two shorts, followed by a long.  And we'll assign 'A', 'B', and 1024 to the three fields respectively.

typedef struct _pop2 {
    short a, b;
    long c;
} POP2, *PPOP2;

// Option 1
PPOP2 structP = (PPOP2) malloc(sizeof(POP2));
structP->a = 'A';
structP->b = 'B';
structP->c = 1024;

// Or option 2
char* no_struct = (char*) malloc(8);
*(short*)(no_struct + 0) = 'A';
*(short*)(no_struct + 2) = 'B';
*(long*)(no_struct + 4) = 1024;

You might be scratching your head, but option 1 and 2 do the same thing.  Option 1 is how a programmer should write the code.  Option 2 is roughly what the computer will actually be executing.  So just remember that structs and struct* pointers are just logical arrangements of the data for your conveince.

Lastly, the function pointer cannot be ignored.  A vital tool in modern architectures is the ability to call functions.  Most calls use built in addresses (established at compiler time), but sometimes where the code wants to go isn't known until runtime.  Function pointers are what enables inheritance in OO development, dynamic linking, and often finds its use in switch statements and other constructs.  (Especially fun to use them with runtime code generation, but that's another post).  And at their core, a programmer is merely telling the computer that execution should continue at some address.

To summarize, memory is just sequences of bytes that are given meaning by the types and usage that exist in the source code and consequently the program itself.  To be different types of pointers is to look at memory through different lenses.

Sunday, December 19, 2010

Ray Tracing Performance and Practice Part 2

In this post, we shall explore creating a scene worth rendering.  In my prior work, my ray tracers were creating simple assemblages of spheres and triangles, with some having textures and others mirrored surfaces, etc.  This time, I wanted to have a scene that others might enjoy.  So for my first step, I turned to the Stanford 3D Scanning Repository and selected the angel model, Lucy.

The first problem in rendering is that the model is about 116 million triangles, or over 500MB in size.  Perhaps slightly more detail than is required for all but the largest of images.  Some time was spent searching for a tool that would provide a simplified model, and I settled on MeshLab.  However, the GUI version would crash trying to load the initial model, so I had to create a script of the required transformations and have the command-line application create the reduced model.  This reduced down to a 250k model, and subsequent transformations in the GUI gave a 50k triangle model and oriented appropriately for the ray tracer.

The second problem in rendering is that the model is stored in the ply file format.  I extended my scene parser to handle ply files without really understanding them. While I would like to have a generalized parser, I confined my work to the specific format that MeshLab was generating and did not worry about whether there were triangles or rectangles, normals present, colors specified, etc.  Yet, even this single format took some work.

Initially, the parser took in a text version of the ply file; however, the time to parse 100k lines of text and convert to the appropriate representations was approaching the time to render the scene.  Switching to a binary representation seemed a reasonable step; however, how does one take a stream of bytes and convert to a specific type?  In C, you could do the following:

Vec3f v;  // Struct of three floats
fread(&v, sizeof(float), 3, sceneFile);

But in C#?  I eventually settled on reading each value and converting via the "BitConverter" class.

Support.Assert(4 == pFile.BaseStream.Read(data, 0, 4), 
    "INVALID Ply - Fail to parse vertex");
newV.x = BitConverter.ToSingle(data, 0);

Great!  The statue is now in the tracer.  However, 50k triangles is too much for my naive ray tracer.  So there was performance work to be done.  At the time, I would render a 5k triangle version as a test model, as seen below (pending while in flight).


Thursday, December 16, 2010

Ray Tracing Performance and Practice Part 1

As the MICRO conference came to a close, my thoughts were stimulated toward new projects and undertakings.  One task in particular was developing a ray tracer as an additional workload for my research.  The first cause for specifically a ray tracer is that I developed one in C many years ago that used now deprecated libraries.  Secondly, prior to my last paper deadline, another professor suggested ray tracers as having many useful characteristics.  So I've been porting my old code from C to C#.  This series of posts will discuss some design decisions relating to the porting, additional optimizations that have been introduced, and other interesting headaches / pitfalls.

On the first order, porting the implementations was fairly straight forward.  The original code had some level of object-oriented (OO) design with triangles and spheres both encapsulated within "ray objects" (see Object Oriented C Part1).  But much of the implementation was a mangled mess.  For example, both colors and positions were represented by float triplets.  And so, some glue logic was required to convert from the ARGB color mode into the float space used by the tracer.  But being object-oriented, I could setup type conversions and operator overloads that significantly reduced the inelegance of the original implementation.

Now with a reasonable working port of the original code, the question arose: how to make it perform well?  The original used all sorts of tricks some useful like bounding spheres around the triangles and some not, like preallocating the intermediate space.  As always, there are a variety of levels of abstraction to target for performance work.  My original implementation only focused on micro optimizations like avoiding memory allocations, etc.  (And performance profiles showed that the largest single contributor was sqrt, but I didn't have a "fix" for that.)  However, especially within a higher-level language, many of these choices are denied and so my improvements have taken first to the algorithmic level.

Two (or more) posts will yet follow.  The first on finding and handling a sufficiently interesting scene to render.  And the second will be focused on reducing the computations required to render this scene.

Wednesday, December 8, 2010

Conference Time: MICRO-43 Day 2

I went to several talks yesterday, but my main purpose was achieved by meeting and discussing my research with the other attendees.  Most conversations are positive and constructive; however, one person remained thoroughly convinced that my research is without purpose.  Nonetheless, these opinions provide me with further guidance for how to present my work to others and provide better motivation for my work.

Monday, December 6, 2010

Conference Time: MICRO-43 Morning Day 1

As of noon, I've spent most of my morning as a volunteer at MICRO; however, I did slip in to hear the keynote and AMD's talk about an ISA extension for supporting transactional memory.  Transactional memory (TM) is a technology of which I am deeply skeptical.  But given that AMD's talk was from the ISA perspective rather than the HW itself, I gave it a go.

Their approach was to try and find the minimal set of instructions required to provide some meaningful TM functionality for programs.  SPECULATE, COMMIT, and LOCK MOV are the three instructions.  Furthermore, in providing minimal support, programs are not burdened by unnecessary overhead.  For example, SPECULATE begins a speculative region (i.e., transaction), but only checkpoints the instruction and stack pointers.

Within this region, only the memory operations that are made with "LOCK MOV" are given transactional semantics.  All other operations are non-speculative and therefore have normal semantics.  This enables a speculative region to only incur the overhead from required operations and not from incidental work.  And the HW support for transactions is reduced as fewer operations need to be tracked to provide sufficient support for applications.

COMMIT closes the speculative region and clears the transaction bits from the pending memory operations.  If there is a conflict during the region, then the HW rolls back to the old RIP which has a trampoline to determine the appropriate fail behavior, like retry, etc.

This approach had far greater appeal to me than previous works.  I think the minimalist approach is favorable from a performance standpoint.  This approach also provides greater flexibility to use the TM support for other applications.

For more details see: AMD's ASF or the MICRO 43 proceedings (not yet available)

Wednesday, December 1, 2010

Elegance in Structure

Recently, I came across a request for an elegant implementation of linked lists (again on stack overflow).  And now that I'm finally through conference deadlines, etc, let's consider this request in further detail.

Specifically, the issue is a singly linked, circular list.  I'd love to hear the benefit of making a singly linked list circular, but I'll assume that there is a reason.  Therefore, we'll explore three design questions.
  1. Layout of the structure
  2. Inserting elements
  3. Deleting elements
Structure layout revolves around a key issue that will decide the rest of the design.  Is the head element tracked in a particular way?  To specifically track the head element, the data structure would need two pieces: a container and individual elements.  This is the questioner's design.  Alternatively, one would treat the head implicitly as the first element, and solely use the "node" structure.  The advantage is that there is only one data structure and the necessary code for managing the special "head" is already in place.  However, with an implicit constraint, the data structure can break if it isn't obeyed.  This constraint is the container structure, but without the compiler's support / guarantees.

The second issue is preparing an elegant insertion of elements.  Two design questions are present: whether to order, and if the list is unordered then is it a queue or a stack.  Based on the poster's design, we have an unordered queue.  Unfortunately, this gives an unpleasant O(n) insertion cost, unless we can include a tail pointer.  Since elegance encompasses more than simplicity, elegant tail insertion should have a pointer whenever possible.  The potential inelegance of the implementations (for insertion and removal) is dealing with the three different insertion cases: zero elements, one elements, many elements.  Can we do it in fewer?

typedef List struct list;
typedef Node struct node;
bool Insert(List* L, int value)
{
    Node* N = (Node*) malloc(sizeof(Node));

    if (N == NULL) return false;

    N->value = value;

    if (L->head == NULL)  // case zero elements
    {
        L->head = L->tail = N->next = N;
    }
    else  // cases 1+ elements
    {
        L->tail->next = N;
        L->tail = N;
        N->next = L->head;  // keep circular
    }

    return true;
}

The final elegance piece rests on element removal. Again, we're trying to reduce the inelegance by combining the different cases into fewer tests and code.  The original implementation has three cases: remove only element, remove head, and remove element.  The distinguishing step about removing the head element is that the head pointer needs to be updated.  Additionally, when any element is removed, the previous element needs to be updated.  However, we already have the previous element for the head, the tail!  Thus, element removal has a common case of finding the element, and then a couple of quick checks based on that element.

void Remove(List* L, int value)
{
    Node* T, E, P;

    if (L == NULL) return;

    T = L->head;
    P = E = L->tail;

    if (T == NULL) return;

    do {
        if (T->value == value) break;
        P = T;
        T = T->next;
    } while (T != E);

    // Ideally, the break would skip into this block
    if (T->value == value)
    {
        P->next = T->next;   // Remove elem
        if (T == E) L->tail = P;  // Removed tail
        if (P == E) L->head = P->next;  // Removed head
        if (T == P) {L->head = L->tail = NULL;} // Removed only elem
        free(T);  // Cleanup
    }
}

Initially, I wanted to treat the list implicitly, but I find myself liking the inclusion of a tail pointer. Yet on small lists, the trade-offs may be different.  I wouldn't give this as the be all and end all of circular lists, but I would present this as my generic implementation.