What’s in a Number?

Credit: Jorge Franganillo

If we program in a low-level language like C, we have access to all sorts of numeric types of various sizes from unsigned char to long double. In JavaScript we have just the one: the IEEE 754 double precision floating point number (double for short). If we are to be limited to just the one number type, this is a pretty good one: it can represent whole numbers, fractional numbers, tiny numbers, enormous numbers and infinity.

A double is a 64-bit value arranged thus:

  • 1 sign bit
  • 11 bits of exponent
  • 53 bits of significand

If you’re sharp, you’ll notice that this sums up to 65 bits. This is possible because the numbers are normalized. As to what “normalized” means, consider the difference between 1.25×10³ and 125×10¹. They both represent the same number but the first is normalized while the second is a bit weird. In base 10, the first digit of a normalized number is between 1 and 9 inclusive. In binary it’s always 1. Since it’s always 1, we can leave it out without losing information. Hence, we get 53 bits of precision despite only having 52 bits to play with.

An 11-bit exponent gives a range of 0 to 2047. Since we need to be able to represent small numbers, the actual exponent is obtained by subtracting 1023 from this, giving a range of -1023 to 1024. As we’ll see, the maximum and minimum exponent values are reserved so the usable range is -1022 to 1023.

In C it’s easy to get at the internals of a double:

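#include <stdint.h>
#include <string.h>
...
double pi = 3.141592653589793;
uint64_t bits;
/* memcpy sidesteps the undefined behaviour of reading a double through a uint64_t pointer */
memcpy(&bits, &pi, sizeof bits);
/* Twiddle the bits in bits */
...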

In JavaScript, we can’t access the internals directly, but a little bit of trickery opens them right up. Let’s see whether this is possible. Tell me, computer, will you allow me to peek inside the internals of a double?

Assuming that the required runtime support is available, the following function will split a double into its constituent parts as human-readable strings:

window.showNum = function(n) {
    var buf = new ArrayBuffer(8),
        dv = new DataView(buf);
    dv.setFloat64(0, n, true);
    var m = dv.getInt32(0, true),
        i = 0, j = 1, p = "", ret = {};
    for (; i < 64; ++i, j <<= 1) {
        if (i === 32) {
            // Done with the first half
            j = 1;
            m = dv.getInt32(4, true);
        }
        else if (i === 52) {
            // Built up the significand
            ret.m = p;
            p = "";
        }
        else if (i === 63) {
            // Built up the exponent
            ret.e = p;
            p = "";
        }
        if (m & j) {
            p = "1" + p;
        }
        else {
            p = "0" + p;
        }
    }
    // Set the sign bit
    ret.s = p;

    // Calculate the represented value as a
    // base-2 exponential value
    var sig = 1, f = 0.5, e = -1023;
    for (i = 0; i < 52; ++i, f /= 2) {
        if ('1' === ret.m.charAt(i)) {
            sig += f;
        }
    }
    j = 1;
    for (i = 10; i >= 0; --i, j <<= 1) {
        if ('1' === ret.e.charAt(i)) {
            e += j;
        }
    }
    ret.t = sig + "e" + e;
    if ('1' === ret.s) {
        ret.t = '-' + ret.t;
    }

    return ret;
};

Here’s how it works. An ArrayBuffer is a low-level generic data container that behaves roughly like a byte array. We get at the internals of the buffer by overlaying either a typed array (such as Int32Array) or a DataView. I chose a DataView because then I don’t have to worry about endianness. The true arguments to setFloat64 and getInt32 mean “little-endian, please”. We stick the double into the buffer and then access it as two 32-bit integers. We may not be able to get at the bits in a double from JavaScript but we can do bitwise operations on an int. Had we used false as the endianness flag, we’d have to read out the ints in reverse order. Without a DataView, I would have to do some additional work to determine the endianness of your system.

When we’re done calculating the bit strings, we build up a base-2 exponential representation that’s a bit more human-readable. The bits of the significand are all fractional with the first bit representing ½, the second bit ¼ and so on. The exponent is calculated working right-to-left along the bit string, remembering to take the bias of 1023 into account.

The return property for the significand is “m” which stands for mantissa. Technically, this is wrong because the strict definition of “mantissa” is the fractional part of a logarithm. However, “s” is already used for the sign so “m” it is.
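For comparison, here is a rough sketch of the “additional work” a typed-array version would need. The names isLittleEndian and halves are my own inventions for illustration, and the sketch assumes that the platform stores doubles in the same byte order as 32-bit integers.

```javascript
// Hypothetical helper: detect the platform byte order by writing a
// 32-bit integer and looking at which byte ends up first in memory.
function isLittleEndian() {
    var buf = new ArrayBuffer(4);
    new Uint32Array(buf)[0] = 1;          // stored in platform byte order
    return new Uint8Array(buf)[0] === 1;  // low byte first => little-endian
}

// Hypothetical helper: return the two 32-bit halves of a double as
// [high, low], whatever the platform byte order happens to be.
function halves(n) {
    var f = new Float64Array(1),
        u = new Uint32Array(f.buffer);    // same 8 bytes, viewed as two ints
    f[0] = n;
    return isLittleEndian() ? [u[1], u[0]] : [u[0], u[1]];
}

var h = halves(1);   // high half carries the sign, exponent and top fraction bits
```

With a DataView none of this is necessary because the byte order is stated explicitly on every read and write.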

Let’s try it out. Tell me, computer, what does 1 look like internally?

That’s not a very interesting number apart from the exponent pattern which shows the 1023 bias. Tell me, computer, what does -√2 look like internally?
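As a sketch of what to expect for those two values, here is a compact variant that builds the same three bit strings. The name splitDouble is hypothetical (it is not the showNum function above) and it reads the buffer big-endian so that the sign bit comes first.

```javascript
// Hypothetical helper: split a double into sign, exponent and
// significand bit strings using two unsigned 32-bit reads.
function splitDouble(n) {
    var dv = new DataView(new ArrayBuffer(8));
    dv.setFloat64(0, n, false);           // big-endian layout
    var hi = dv.getUint32(0, false),      // sign, exponent, top 20 fraction bits
        lo = dv.getUint32(4, false);      // low 32 fraction bits

    // Left-pad a binary string with zeros to a fixed width
    function pad(s, width) {
        while (s.length < width) {
            s = "0" + s;
        }
        return s;
    }

    return {
        s: String(hi >>> 31),
        e: pad(((hi >>> 20) & 0x7ff).toString(2), 11),
        m: pad((hi & 0xfffff).toString(2), 20) + pad(lo.toString(2), 32)
    };
}

var one = splitDouble(1);
// one.s is "0", one.e is "01111111111" (1023, the bias), one.m is 52 zeros

var minusRoot2 = splitDouble(-Math.SQRT2);
// minusRoot2.s is "1" and minusRoot2.e is again "01111111111",
// because √2 lies between 1 and 2 so its true exponent is 0
```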

Limits

Let’s have a look at the limits of what is representable. Tell me, computer, what does the largest absolute value that you can handle look like internally?

In base 10, this is 1.7976931348623157×10³⁰⁸. What about the smallest?

That last one may be surprising if you recall that I said the minimum usable exponent was -1022. This value is denormalized and is, in fact, the smallest denormalized value. This may (or may not) be familiar to C++ programmers as std::numeric_limits<double>::denorm_min(). An exponent field of all zeros (which showNum reports as -1023) signifies a subnormal number: the significand must be interpreted as 0.f rather than 1.f and the effective exponent is -1022 rather than -1023. The significand of the smallest number therefore represents 2⁻⁵², and multiplying that by 2⁻¹⁰²² gives 2⁻¹⁰⁷⁴, which is the absolute limit of representability that is not zero. Using bits from the significand as additional exponent bits allows for gradual underflow at the cost of precision: at the limit, there is only a single bit. In base 10, it translates (very approximately) to 5×10⁻³²⁴.

The more familiar value to C/C++ programmers is DBL_MIN. Tell me, computer, what does DBL_MIN look like internally?

In base 10, this is 2.2250738585072014×10⁻³⁰⁸.
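These limits can be checked without looking at any bits. The sketch below simply halves 1 until the result would underflow to zero; the variable names are mine, and it assumes the standard Number.MIN_VALUE and Number.MAX_VALUE properties are available.

```javascript
// Halve 1 repeatedly until one more halving would give zero.
// Halving a power of two is exact, even in the subnormal range.
var x = 1, halvings = 0, dblMin;
while (x / 2 > 0) {
    x /= 2;
    halvings += 1;
    if (halvings === 1022) {
        dblMin = x;       // 2^-1022, the smallest normalized value (DBL_MIN in C)
    }
}

// x is now the smallest subnormal, reached after 1074 halvings
var reachedMinValue = (x === Number.MIN_VALUE);

// Going past the other limit overflows instead
var overflowed = (Number.MAX_VALUE * 2 === Infinity);
```

The loop runs 52 steps beyond the smallest normalized value, one for each fraction bit given up to gradual underflow.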

Infinity and NaN

A value with all 1s in its exponent signifies an abnormal condition. The first condition is overflow. Tell me, computer, what does the product of 10¹⁵⁵ and 10¹⁵⁴ look like internally?

This value is commonly represented as Infinity. The sign bit can be set and this represents minus infinity. The characteristic of the two infinities is that all exponent bits are 1 and all significand bits are 0.

The second condition is where the numeric representation of a value is undefined. Tell me, computer, what does 0/0 look like internally?

This value is commonly represented as NaN (aka, not a number). The characteristic of NaN is that all exponent bits are set along with 1 or more significand bits.
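The two bit-pattern rules can be sketched as a small classifier. The names words and classify are hypothetical and exist only for this illustration; the sketch deliberately does not assume which significand bits a particular engine sets in its NaN.

```javascript
// Hypothetical helper: the two unsigned 32-bit halves of a double, high first
function words(n) {
    var dv = new DataView(new ArrayBuffer(8));
    dv.setFloat64(0, n, false);
    return [dv.getUint32(0, false), dv.getUint32(4, false)];
}

// Hypothetical helper: apply the rules above to the raw bits
function classify(n) {
    var w = words(n),
        exponent = (w[0] >>> 20) & 0x7ff,
        fractionIsZero = (w[0] & 0xfffff) === 0 && w[1] === 0;
    if (exponent !== 0x7ff) {
        return "finite";                       // exponent not all 1s
    }
    return fractionIsZero ? "infinity" : "nan";
}

var overflow = classify(1e155 * 1e154);   // "infinity"
var minusInf = classify(-Infinity);       // "infinity", only the sign bit differs
var undefinedResult = classify(0 / 0);    // "nan"
var ordinary = classify(1);               // "finite"
```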

Zero

If the significand in a double is to be interpreted as 1.f, there is no possible representation of zero since there is no y such that 1.f×2ʸ = 0. The solution is to reserve a pattern to represent 0. From the representation of the smallest denormalized number above, you’d be correct in guessing that this pattern is all exponent bits set to 0 along with all significand bits. However, the sign bit can be set or unset which means that there are two zero values: 0 and -0. -0 will be the result, for example, of truncating a value between 0 and -1.
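The two zeros are hard to tell apart because they compare as equal. A minimal sketch of how to distinguish them, assuming the runtime provides Math.trunc and Object.is:

```javascript
// Truncating a value between 0 and -1 yields negative zero
var negZero = Math.trunc(-0.5);

// The ordinary comparison cannot see the sign bit
var looksEqual = (negZero === 0);                  // true

// Object.is can, and so can dividing by the zero
var isNegZero = Object.is(negZero, -0);            // true
var isPosZero = Object.is(negZero, 0);             // false
var dividesToMinusInf = (1 / negZero === -Infinity);   // true
```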

Epsilon

Although we can represent minuscule numbers by reducing the exponent, there is a hard limit to the fractional values that can be represented by adjusting the significand. Since a double has 52 fraction bits, this value may trivially (and precisely) be calculated as Math.pow(2, -52). This value is known as machine epsilon and is denoted by 𝜀. It is the gap between 1 and the next larger representable double, which makes it the smallest power of two for which the following inequality is true:
1 + 𝜀 > 1
Assuming that your browser supports it, the machine epsilon value is in Number.EPSILON.
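A quick check of that claim, assuming Number.EPSILON is available. Half of epsilon is too small to register when added to 1, because the sum falls exactly halfway between two representable values and is rounded back down.

```javascript
var eps = Math.pow(2, -52);

var matchesBuiltIn = (eps === Number.EPSILON);   // true
var epsRegisters = (1 + eps > 1);                // true
var halfEpsVanishes = (1 + eps / 2 === 1);       // true
```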

Largest integer values

The width of the significand also determines the limits on the maximum size of a whole number that can be represented without loss of precision. The largest values have all significand bits set to 1 and the exponent field set to 10000110011b (that is, 1075, or 52 once the bias is subtracted). In other words, ±(2⁵³ − 1). Note the qualifier “without loss of precision”. Larger integer values can be represented but not uniquely. To illustrate:

2⁵³ - 2 → 9007199254740990
2⁵³ - 1 → 9007199254740991
2⁵³     → 9007199254740992
2⁵³ + 1 → 9007199254740992

Assuming browser support, the largest and smallest integers that can be represented in a double without loss of precision are given by the MAX_SAFE_INTEGER and MIN_SAFE_INTEGER properties of Number.
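The collision at 2⁵³ is easy to reproduce. A small sketch, assuming the runtime has Number.MAX_SAFE_INTEGER and Number.isSafeInteger:

```javascript
var limit = Math.pow(2, 53);                     // 9007199254740992

// 2^53 - 1 is the last integer with a unique representation
var matchesBuiltIn = (limit - 1 === Number.MAX_SAFE_INTEGER);   // true

// 2^53 + 1 cannot be stored and lands back on 2^53
var collides = (limit + 1 === limit);            // true

var belowIsSafe = Number.isSafeInteger(limit - 1);   // true
var limitIsSafe = Number.isSafeInteger(limit);       // false
```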

Playing

Assuming you saw the answer “Of course” to the first question I asked the computer, you can type numbers into the following input and the component parts will be displayed in the box below. Large and small numbers can be entered using ‘e’ notation, for example, 1.23e-50. You can also type properties of Math and Number, such as PI, E, SQRT2, MAX_SAFE_INTEGER, EPSILON, etc.

