Thursday, May 19, 2011

Book Review: Write Great Code Volume 1 (part 2 of 4)

+/- 1.F x 2^(Exp)

That's the modern floating point representation (with a few caveats).  It is stored bitwise as SignExpF.  As this representation uses a variable exponent, the point floats to different positions.  By doing so, the computer can store a much greater range of numbers than integer representations, but with a small loss of precision.  I first learned about "significant digits" in chemistry, and floating point similarly limits the user to 5 - 10 significant decimal digits (which require roughly 15 - 30 binary digits).  For this post and subsequent posts in this series, I'll be writing a combination of what I knew prior to reading and information from the book's text.
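To make the layout concrete, here's a minimal sketch of my own (assuming the standard IEEE 754 single precision layout: 1 sign bit, 8 exponent bits biased by 127, and 23 fraction bits) that pulls the three fields out of a float:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    float f = -6.25f;
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);       // reinterpret the float's storage as raw bits

    uint32_t sign     = bits >> 31;            // 1 bit:  sign
    uint32_t exponent = (bits >> 23) & 0xFFu;  // 8 bits: biased exponent (bias = 127)
    uint32_t fraction = bits & 0x7FFFFFu;      // 23 bits: the F in 1.F

    std::printf("sign=%u exponent=%u (unbiased %d) fraction=0x%06X\n",
                sign, exponent, (int)exponent - 127, fraction);
    return 0;
}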

There are four official formats for floating point numbers, using 16, 32 (a C float), 64 (a C double), and 128 bits respectively.  Primarily, the larger formats devote their extra bits to the fraction, which provides increased precision.  Intel also has an 80-bit representation, which uses 64 bits for the fraction, thereby enabling existing integer arithmetic support when the exponents are the same.  Furthermore, ordering the bits as sign, exponent, fraction also enables integer comparisons between floats.
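Here's a small sketch of that integer-comparison trick (my own illustration, not the book's code); it assumes non-negative, non-NaN single precision values, since negative values compare in reverse under the sign-magnitude layout:

#include <cstdint>
#include <cstring>

// Because the fields are ordered sign, exponent, fraction, a larger bit
// pattern means a larger value for non-negative, non-NaN floats.
bool LessThanViaBits(float a, float b)
{
    uint32_t ua, ub;
    std::memcpy(&ua, &a, sizeof ua);
    std::memcpy(&ub, &b, sizeof ub);
    return ua < ub;   // plain integer comparison of the bit patterns
}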

Given the precision limitations (even with 128-bit floats), there are some gotchas with arithmetic.  First, equality is hard to define.  Equality tests need to respect the level of error present in the floating point values.  For example, two numbers can be considered equal if they are within this error, as follows.

#include <cmath>   // for std::fabs

bool Equal(float a, float b)
{
    const float error = 1e-6f;         // acceptable error; the right value depends on the application

    bool ret = (a == b);               // Don't do this: exact comparison ignores rounding error

    ret = (std::fabs(a - b) < error);  // Test like this: equal if within the acceptable error

    return ret;
}

Another gotcha is preserving precision.  With addition and subtraction, the floats need to be converted to have the same exponent, which can result in loss of precision.  Therefore, the text recommends performing multiplication and division before addition and subtraction.  This seems reasonable, and wasn't something I had heard before.
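To make the exponent-alignment loss concrete, here's a quick sketch (assuming 32-bit IEEE 754 floats with a 24-bit significand):

#include <cstdio>

int main()
{
    // To add these, the smaller operand's significand is shifted right until
    // the exponents match, and its low-order bits fall off the end.
    float big   = 1.0e8f;  // the exact sum 100000001 needs 27 significand bits,
    float small = 1.0f;    // but a float only holds 24
    std::printf("%f\n", big + small);   // prints 100000000.000000: the 1.0 is lost
    return 0;
}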
