Tuesday, February 14, 2012

Bit Packing: A World Gone Mad

Let's start with a mythical program. It has some classes (structs, objects, etc) that are instantiated. Each field has a meaning given to it by the code. Many fields can hold far more data than is required of them. In our mythical program, these unused bits are of no concern, as memory is plentiful. But then the mythical program is run on a real machine and bad things happen. Maybe it runs slowly, or crashes when malloc() returns NULL. Now what?!

In this post, I'm going to discuss some tricks and ideas for reducing the memory usage of a program. Before trying any of them, you should understand the program and its workload using profilers and similar tools, and you should verify that it has no memory leaks.

First example. A program was running out of memory. It walked a memory trace and counted where each memory request went. There was one counter for every N locations of memory, and each counter was an unsigned 64-bit integer (uint64_t). With 1 billion memory operations, the worst case is a single counter reaching 1 billion (roughly 2^30). With a fairly uniform distribution, each of a billion counters would hold 10, perhaps 100. Switching to 32-bit integers reduced the program's footprint by 4GB, allowing work to continue. Other proposals included using a two-level counter scheme, where overflowing a small (16-bit) counter promotes it to a full-sized (64-bit) counter, or tracking only a subset of the counters via random sampling.
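
As a rough sketch of the two-level idea (the names, sizes, and fixed spill table here are my assumptions for illustration, not the original program's; a real version might use a hash map for the overflow counters):

#include <stdint.h>
#include <stddef.h>

#define NUM_COUNTERS (1u << 20)   /* assumed number of tracked locations */
#define SPILL_SLOTS  1024         /* assume few counters ever saturate */

/* The true count for a location is small_counts[idx] plus its spill entry. */
static uint16_t small_counts[NUM_COUNTERS];
static struct { size_t idx; uint64_t count; } spill[SPILL_SLOTS];
static size_t spill_used;

static void count_access(size_t idx)
{
    if (small_counts[idx] < UINT16_MAX) {
        small_counts[idx]++;          /* common case: 2 bytes per location */
        return;
    }
    /* Rare case: the 16-bit counter saturated; spill to a 64-bit counter. */
    for (size_t i = 0; i < spill_used; i++) {
        if (spill[i].idx == idx) { spill[i].count++; return; }
    }
    if (spill_used < SPILL_SLOTS) {
        spill[spill_used].idx = idx;
        spill[spill_used].count = 1;
        spill_used++;
    }
}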

The lesson learned is that having some spare bits is acceptable, but counting to, say, "100" when the counter can hold up to 1.84467441 × 10^19 (2^64 - 1) might be excessive.

Second example. I've probably written about this before, and I'll likely do so again. Locks are often a waste of space. Consider the following:
void atomic_add(int* loc, int val)
{
    lock();        /* acquire the lock guarding *loc */
    *loc += val;
    unlock();      /* release it */
}

Looks fine, right? Consider that the lock is 4 bytes of space, of which 1 bit indicates whether it is acquired. Two compactions are possible. First, the lock could be integrated into the value itself. This would reduce the range of the value, as one bit is now dedicated to holding the lock state. Or second, many architectures (like x86) support the above operation as a single instruction, so using that instruction removes the need for a lock bit / word entirely.
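
For example, with C11 atomics (a sketch; the pseudocode above doesn't name a language, so I'm assuming C here), the lock word disappears entirely and x86-64 emits a single locked add:

#include <stdatomic.h>

/* The hardware performs the read-modify-write atomically; no lock word
   is needed.  On x86-64 this compiles to a single locked add. */
void atomic_add(_Atomic int *loc, int val)
{
    atomic_fetch_add(loc, val);
}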

Why don't we always replace locks with atomic instructions? Two reasons: first, if the locked section is any more complex, then the atomic sequence becomes vastly more complex, as only the simplest of operations (add, subtract, or, and, etc) are supported with atomicity. Second, determining the appropriate transformation in the compiler is exceedingly complex and may not be solvable in the general case. Since the compiler can't do this in general, it falls to the programmer to introduce the change.

Third example. Storing lots of pointers to long lists of numbers (these could be strings, or perhaps lists of edges in a graph). On a 64-bit system, each pointer is a 64-bit number, or is it? Current x86-64 architectures are limited to 48 bits of virtual address, to reduce the overhead of tracking the virtual memory space. So every user-space pointer gives us 48 bits of information and 16 bits of zeros. (Many pointers are also aligned, which zeros some of the low-order bits.) And any bits whose values the programmer knows at compile time are effectively constants that don't need to be stored.

A common operation is to ask for the length of a list. Ideally, this value is stored. But where? If the leading pointer has 16 available bits, then the length could be made part of the pointer. But what if the length could exceed 65535 (2^16 - 1)? A hybrid floating-point design could be used:

[I/F:1][LENGTH:15][ADDR:48]

The first bit indicates whether an integer or a floating-point length is stored. If it is an integer, the next 15 bits hold the exact length. If it is a float, those 15 bits split into an exponent and a fractional component. Six exponent bits can cover any 64-bit length (note that the float form only needs to represent lengths of 2^15 and above, since smaller lengths get the exact integer representation). Alternatively, four exponent bits covering 2^16 to 2^31 might suffice. Whatever remains of the 15 bits holds the fraction.
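
Here is a minimal sketch of the simpler, exact-integer case (the macro names and the assumption that the high 16 bits of a user-space pointer are zero are mine); the hybrid layout above would additionally reserve the top bit as the integer/float tag and decode an exponent and fraction when that bit is set:

#include <stdint.h>
#include <assert.h>

#define ADDR_BITS 48
#define ADDR_MASK ((1ULL << ADDR_BITS) - 1)

/* Pack a length (< 2^16) into the otherwise-unused high bits of an
   x86-64 user-space pointer. */
static inline uint64_t pack(void *ptr, uint16_t len)
{
    uint64_t addr = (uint64_t)ptr;
    assert((addr & ~ADDR_MASK) == 0);    /* high 16 bits must be zero */
    return ((uint64_t)len << ADDR_BITS) | addr;
}

static inline void *unpack_ptr(uint64_t packed)
{
    return (void *)(packed & ADDR_MASK);
}

static inline uint16_t unpack_len(uint64_t packed)
{
    return (uint16_t)(packed >> ADDR_BITS);
}

Note that every access now pays a mask or shift, which is exactly the instruction cost discussed below.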

Most programs will not need bit packing. Every pack and unpack costs extra instructions. But sometimes you just need to save the space, even if it means making the design more complex.