Category Archives: Programming

Language Showdown

‘Pollice Verso’ by Jean-Léon Gérôme

Programming languages are the medium through which we turn our intentions into actions. Ideally, the choice of language to use should be a neutral decision. In particular, the language itself shouldn’t get in the way: easy things should be easy, obvious things should be obvious and it shouldn’t be too idiosyncratic. I am going to write a simple program in five different programming languages and comment on how easy and pleasant or otherwise it is to work with them.

The five languages chosen are, in alphabetical order: C++, Java, Node.js, Perl and Python. The program is a simplified version of the head utility. head prints the first few lines of its input or one or more files passed on the command line. Our utility will have some restrictions compared to the real version: it will only process a single file and will not have the ability to print all but the first few lines. The problem is what I’d call bread-and-butter programming: reading files, processing them line-by-line and producing output. I don’t think that this is a problem designed to favour one particular language over another – this is the sort of activity that any language should be suited for. At the end, I will also compare performance of the different solutions, just for fun.

I will point out that if I were to write a real implementation of head, I wouldn’t read input one line at a time. I’d read input in conveniently sized chunks, search for newline characters and do whatever needed to be done. However, in the more general case, we want to treat a file as a sequence of lines rather than a bucket of bytes so this is what I’ll do in the different programs.
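
For the curious, a rough sketch of that chunked approach is below. It is illustrative only: head_chunked is a hypothetical helper, the buffer size is arbitrary and error handling is elided.

#include <cstdio>
#include <cstring>

// Illustrative sketch: print the first `limit` lines of `fp` by reading
// fixed-size chunks and scanning them for newline characters.
void head_chunked(std::FILE *fp, int limit) {
    char buf[65536];
    std::size_t n;
    while (limit > 0 && (n = std::fread(buf, 1, sizeof(buf), fp)) > 0) {
        const char *p = buf;
        const char *end = buf + n;
        while (limit > 0) {
            const char *nl = static_cast<const char *>(
                std::memchr(p, '\n', end - p));
            if (!nl) {
                std::fwrite(p, 1, end - p, stdout);  // partial line; keep reading
                break;
            }
            std::fwrite(p, 1, (nl - p) + 1, stdout); // the line plus its newline
            p = nl + 1;
            --limit;
        }
    }
}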

The Basics

The default behaviour of head is to print the first ten lines of the file passed on the command line so this is the behaviour that we’ll implement as a first cut. File errors will be reported on stderr and will cause an abnormal program exit. I have plans for when no arguments are passed on the command line so this condition will result in a normal program exit.

C++

#include <iostream>
#include <fstream>
#include <string>
#include <cstring>
using namespace std;

const int lines = 10;

int main(int argc, char **argv) {
    if (argc < 2) {
        return 0;
    }
    ios_base::sync_with_stdio(false);
    ifstream file(argv[1]);
    if (!file.is_open()) {
        cerr << "Unable to open " << argv[1] <<
            ": " << strerror(errno) << "\n";
        return 1;
    }
    string line;
    for (int i = 0; i < lines && getline(file, line); ++i) {
        cout << line;
        if (!file.eof()) {
            cout << "\n";
        }
    }
    return 0;
}

C++ is, by a small margin, the second most verbose of the solutions (585 bytes vs. 580 bytes for the Java version). The first few lines are fixed overhead, however, so I expect this to change. Note the following:

  • Line 5: I wouldn’t do this in a complex project but in a simple program, not having to qualify everything as std::cout, std::ifstream and so on is a hassle-saver
  • Line 13: Unless you explicitly disable the feature, C++ allows you to freely mix calls to the C++ native I/O objects, cin and cout, and legacy C stdio functions, scanf and printf. To make this possible, a lot of work is done under the hood to keep different stream pointers synchronized, which kills performance unless you turn this questionable feature off. I’ll show the performance benefit of this call at the end, but for the moment, -1 for requiring obscure knowledge and for having what is arguably a non-sensible default
  • Lines 14-19: On the other hand, C++ does make easy things easy – opening a file and checking for an error condition is straightforward. We are also able to access the operating system error and we have full control of how the error is reported
  • Line 21: We use the free getline function from the <string> header. ifstream objects have a getline method as well but this does not handle arbitrarily long lines
  • Lines 23-25: The action of getline is to read up to a newline and then advance the read pointer past it, discarding the newline. This means that we have to take care not to emit a newline character that does not exist when we reach an unterminated final line. The real head utility doesn’t take this care, and I would consider that a bug
  • Line 24: "\n" rather than endl, since endl guarantees a flush of the underlying output handle, which might hurt performance. Otherwise, endl has no benefit, portability included, over a plain newline character, which is translated to the system’s line terminator automatically.
  • Line 27: Although it’s not obvious, C++ file objects close the underlying handle in their destructors when they go out of scope. This means that we don’t have to explicitly call the close method. This pattern is known as RAII (Resource Acquisition Is Initialization) and is a nice feature of well-written C++ classes; a minimal sketch follows below.
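
To make the RAII point concrete, here is a minimal sketch (process is a hypothetical function; the comment marks where the handle is released):

#include <fstream>

void process(const char *path) {
    std::ifstream file(path);
    // ... read from file ...
}   // file goes out of scope here; its destructor closes the handle,
    // even if an exception propagates out of the function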

Java

import java.io.*;

class Head {
    private static final int lines = 10;

    public static void main(String[] args) {
        if (0 == args.length) {
            System.exit(0);
        }
        try {
            BufferedReader reader =
                new BufferedReader(new FileReader(args[0]));
            String line = null;
            for (int i = 0; (line = reader.readLine()) != null
                && i < lines; ++i) {
                System.out.println(line);
            }
        }
        catch(Exception e) {
            System.err.println(e);
            System.exit(1);
        }
    }
}

Although the Java solution has fewer lines than the C++ one, this is down to lower fixed overhead (one import statement as opposed to four include statements) since otherwise it is more verbose and long-winded. Note the following:

  • The compiled binary cannot be run directly from the command line. Instead, I have to invoke it like this:
    $ java Head /path/to/file
  • Line 4: private static final; that’s a very long-winded way to say const
  • Line 7: Java is the only language of my acquaintance that doesn’t treat 0 or null as false in an if statement. Hence, the explicit test
  • Line 12: Java is the only language that does unbuffered file I/O by default which means that we need to wrap an object that reads files inside an object that does buffering. Wrapping a thing inside another thing to do useful work is an unpleasant aspect of programming in Java
  • Line 16: Never mind the long-winded way of saying “print”, we have a bug that we cannot easily fix in the Java version. readLine behaves like the C++ getline in that it discards the newline and moves the file pointer. However, there is no way, as far as I can tell, to detect EOF other than readLine returning null by which time we’ve already emitted an erroneous newline. We could get around this by using read and finding the newline characters ourselves, but then Java would fail the “easy things should be easy” requirement utterly. -1 for no way to detect EOF condition other than attempting to read from the handle again
  • Line 20: Error comes out as something like “java.io.FileNotFoundException: doesnotex.ist (No such file or directory)” which is not very nice. Calling getMessage does not improve things greatly. -1 for limited control over error reporting
  • Line 23: Although it’s not obvious, objects that hold filehandles do not close them automatically, at least not in a deterministic manner. When the object goes out of scope, it will be marked for garbage collection. When the object is garbage collected, the file will be closed but that might not happen for a long time. As it happens, exiting the program closes all filehandles anyway.

Exiting the program means that filehandles are closed, but if we had to close files explicitly, Java would force a horrible pattern on us:

BufferedReader reader = null;
try {
    reader = new BufferedReader(...);
    // Do stuff and then close when we're done
    reader.close();
    reader = null;
}
catch (IOException e) {
    // Handle error
}
finally {
    // Clean up if necessary
    if (reader != null) {
        try {
            reader.close();
        }
        catch (IOException e) {
            // close threw an error; what exactly am
            // I supposed to do with it?
        }
    }
}

When I regularly programmed in Java, close throwing a checked exception used to drive me nuts!

Node.js

#!/usr/bin/env node

if (process.argv.length < 3) {
    process.exit();
}

var fs = require('fs'),
    readline = require('readline'),
    strm = fs.createReadStream(process.argv[2]),
    lines = 10,
    eof = false;

strm.on("error", function(e) {
    console.error(e.message);
    process.exit(1);
});
strm.on("end", function() {
    eof = true;
});

var rd = readline.createInterface({
    input: strm,
    terminal: false
});

rd.on("line", function(line) {
    if (!eof) {
        line += "\n";
    }
    process.stdout.write(line);
    if (--lines === 0) {
        rd.close();
    }
});

rd.on("close", function() {
    process.exit();
});

Node.js isn’t a programming language in itself, but rather a JavaScript engine taken out of the browser context and run as a script interpreter. This was by some measure the most difficult program to write, which is reflected in both the greatest number of lines and the largest file size. There were also a couple of surprises and when it comes to software, surprising is never good. The twisted program structure is down to the way that Node.js works. It seems that file operations are done on a separate thread so that the main program thread is not blocked, with results delivered to our callbacks as events. Note the following:

  • Line 3: First surprise. Conventionally, arguments to your script or program start at offset 0 (for example, Java and Perl) or offset 1 (for example, C/C++ and Python). For a script invoked by Node.js, “node” is the first argument followed by the script name so that the arguments for your script start at offset 2
  • Line 13: This is rather a convoluted way of getting at a file error. -1 for being neither easy nor obvious
  • Line 14: The error comes out as something like “ENOENT: no such file or directory, open 'doesnotex.ist'” which is detailed but not very attractive. -1 for no control over the output
  • Line 26: Again, strange but this seems to be the Node.js way. line does not include the newline
  • Lines 27-29: This is how a non-existent newline is suppressed. The “end” event on the stream object fires before the last “line” event on the reader so that the eof flag gets set before we receive the last line
  • Line 30: Long-winded way of saying “print”
  • Line 32: As far as I can tell, this statement doesn’t actually close anything, since the “line” events keep on coming in. What it does do is cause the “close” event to be fired. This is extremely surprising behaviour and not terribly useful so -1
  • Line 37: Need to exit the program to stop reading the file.

Perl

#!/usr/bin/perl

use strict;
use warnings qw/all/;

my $lines = 10;
exit 0 unless (@ARGV);
open(my $fh, "<", $ARGV[0]) or die("Unable to open $ARGV[0]: $!\n");
while (defined(my $line = <$fh>) && $lines--) {
    print($line);
}

I must confess to having a dog in this particular fight since I enjoy programming in Perl. This sample should show why since it is the shortest and simplest solution by a long way, even taking into account a couple of lines of fixed overhead. Note the following:

  • Line 7: This is idiomatic Perl. If you don’t like the “unless” idiom, you can rewrite this as if (!@ARGV)
  • Line 8: open ... or die is also idiomatic. We can access the system error via the special variable $! and have full control over the error output
  • Line 9: Perl has a dedicated operator, <>, to read lines of text from a filehandle. The line returned includes any newline character. Note the subtle trap here: a final line consisting of just “0” with no trailing newline would evaluate to false, so we wrap the read in defined to keep the loop from ending early
  • Line 10: Now that’s how to name a print function! The Perl print function doesn’t do anything funny to its input like adding a newline
  • Line 11: It’s not obvious, but filehandles in Perl are closed when the last reference to them goes out of scope. This means that we do not need to explicitly call close.

Python

#!/usr/bin/python
import sys

if len(sys.argv) < 2:
    sys.exit(0)
lines = 10
try:
    with open(sys.argv[1]) as f:
        for line in f:
            sys.stdout.write(line)
            lines -= 1
            if not lines:
                break
except IOError, e:
    sys.stderr.write("Unable to read %s: %s\n" % (sys.argv[1], e.strerror))
    sys.exit(1)

I’m not such a fan of programming in Python, although it’s perfectly pleasant to do so. The syntax is very different from that of the other four since C++, Java, JavaScript and Perl are all C-like languages while Python is somewhat idiosyncratic. Note the following:

  • Line 7: Python forces a try...except structure (the equivalent of Java’s try...catch) since I/O operations throw rather than return error values. If you don’t catch exceptions, the error output is very ugly
  • Line 8: Python objects are garbage collected which means that their destruction is non-deterministic. However, Python also offers an RAII pattern: the with keyword opens a “controlled execution” block which scopes the file object f. This ensures that the file is closed without having to explicitly close it
  • Line 9: File objects are iterable which makes one-line-at-a-time processing extremely easy. The lines include a trailing newline character when present
  • Line 10: Python does have a print statement but that adds a newline character to the output which is behaviour that we don’t want
  • Line 15: We have full control of error output and can access the actual system error conveniently.

Part 2: Command-line Options

Well behaved programs can have their behaviour changed by reading options from the command line. Options can be short form (-?) or long form (--help). When there are more than a couple of options, parsing them becomes laborious and tedious so we’ll want to use a library routine to do so when available. The options we’ll accept are:

  • --help / -?: Print a usage message and exit
  • --count / -n <number>: Print the first <number> lines of the file instead of 10

In addition, the real head utility can take, for example, -2 as an option in place of -n 2 or --count 2 so we’ll do the same.

C++

#include <iostream>
#include <fstream>
#include <string>
#include <cstring>
#include <regex>
#include <getopt.h>
using namespace std;

const int deflines = 10;

void usage(const char *name, const char *msg) {
    if (msg) {
        cerr << msg << "\n";
    }
    cerr << "Usage:\n  " << name << " [--count|-n <lines>] [FILE]\n";
}

int main(int argc, char **argv) {
    int lines = deflines, opt;
    const char *err = NULL;
    bool needHelp = false;
    string countOpt;
    regex numeric("\\d+");
    struct option lOpts[] = {
        { "help", no_argument, NULL, '?' },
        { "count", required_argument, NULL, 'n' },
        { NULL, 0, NULL, 0 }
    };
    if (1 < argc && argv[1][0] == '-' && regex_match(&argv[1][1], numeric)) {
        countOpt = &argv[1][1];
        argv[1] = argv[0];
        ++argv;
        --argc;
    }
    while ((opt = getopt_long(argc, argv, "n:?", lOpts, NULL)) != -1) {
        switch (opt) {
            case 'n':
                if (regex_match(optarg, numeric)) {
                    countOpt = optarg;
                }
                else {
                    err = "Bad count argument";
                    needHelp = true;
                }
                break;
            case '?':
            default:
                needHelp = true;
                break;
        }
    }
    if (needHelp) {
        usage(argv[0], err);
        return 1;
    }
    if (argc == optind) {
        return 0;
    }
    if (!countOpt.empty()) {
        lines = stoi(countOpt);
    }

    ios_base::sync_with_stdio(false);
    ifstream file(argv[optind]);
    if (!file.is_open()) {
        cerr << "Unable to open " << argv[optind] <<
            ": " << strerror(errno) << "\n";
        return 1;
    }
    string line;
    for (int i = 0; i < lines && getline(file, line); ++i) {
        cout << line;
        if (!file.eof()) {
            cout << "\n";
        }
    }
    return 0;
}

Ouch! Our program has just grown 3x bigger! GNU getopt is the standard for parsing command-line options and it’s part of the standard C/C++ library, libc. Windows developers will have a harder time since Windows has a different option syntax (for example, /n rather than -n) and there is no standard library routine that I know of for parsing options. We could also have used argp, which provides additional bells and whistles (such as generating a usage message for you), but the overhead is higher and the learning curve somewhat steeper. Having paid the high cost of entry for option parsing, the cost of adding additional options is low – typically one variable, one entry in the options array and one case statement; a sketch of this follows the notes below. Note the following:

  • Lines 11-16: Using getopt requires us to write a usage function. If our program grew to take many options, the cost of using a more complex argument parser like popt or boost::program_options that takes care of generating a help screen may be worthwhile
  • Line 23: C++ 11 gives us native regular expressions. Hurrah! This regex is used to test for numeric options
  • Lines 24-28: Program options; note the “NULL terminator” at the end
  • Line 29: Check for an option that looks like “-number“. Note that getopt will choke on this so we need to remove it from the arguments array
  • Lines 30-33: Pointers for the win! While this is the sort of stuff that programmers unfamiliar with C/C++ find maddening, it is very much idiomatic and natural once you’re familiar with the basics. The result is that we splice the non-standard argument from the front of argv
  • Line 35: getopt_long returns -1 when it has processed all the options that it knows how to. The index of the first unprocessed option is available via optind
  • Lines 56-58: I still have plans for when no file argument is supplied, so exit normally if this is the case
  • Line 60: Override lines. stoi throws if fed bad input but we’ve already checked that input is good, so we don’t need to catch the exception
  • Line 64: Our file argument is at optind, not 1.
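
To illustrate that low marginal cost, here is a minimal, self-contained sketch. The --quiet/-q flag is hypothetical and not part of our head program:

#include <cstdio>
#include <getopt.h>

int main(int argc, char **argv) {
    bool quiet = false;                           // 1. the variable
    int opt;
    struct option lOpts[] = {
        { "quiet", no_argument, NULL, 'q' },      // 2. the table entry
        { NULL, 0, NULL, 0 }
    };
    while ((opt = getopt_long(argc, argv, "q", lOpts, NULL)) != -1) {
        switch (opt) {
            case 'q':                             // 3. the case label
                quiet = true;
                break;
            default:
                return 1;
        }
    }
    if (!quiet) {
        std::printf("chattering away as usual\n");
    }
    return 0;
}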

Java

import java.io.*;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

class Head {
    private static final int deflines = 10;

    private static void usage(String msg) {
        if (!(msg == null || msg.isEmpty())) {
            System.err.println(msg);
        }
        System.err.println("Usage:\n  java Head [--count|-n <lines>] [FILE]");
        System.exit(1);
    }

    public static void main(String[] args) {
        int optIdx = 0, lines = deflines;
        boolean needHelp = false;
        String err = null;
        Pattern customOption = Pattern.compile("\\-(\\d+)"),
            numericOption = Pattern.compile("^\\d+");

        if (args.length > 0 && args[0].startsWith("-")) {
            Matcher m = customOption.matcher(args[0]);
            if (m.find()) {
                lines = Integer.parseInt(m.group(1));
                ++optIdx;
            }
        }
        for (; optIdx < args.length; ++optIdx) {
            if (args[optIdx].equals("--help") || args[optIdx].equals("-?")) {
                needHelp = true;
            }
            else if (args[optIdx].equals("--count") ||
                args[optIdx].equals("-n")) {
                Matcher m = numericOption.matcher(args[optIdx + 1]);
                if (m.find()) {
                    lines = Integer.parseInt(args[optIdx + 1]);
                    ++optIdx;
                }
                else {
                    err = "Bad count argument";
                    needHelp = true;
                    break;
                }
            }
            else {
                break;
            }
        }
        if (needHelp) {
            usage(err);
        }

        if (optIdx == args.length) {
            System.exit(0);
        }
        try {
            BufferedReader reader =
                new BufferedReader(new FileReader(args[optIdx]));
            String line = null;
            for (int i = 0; (line = reader.readLine()) != null
                && i < lines; ++i) {
                System.out.println(line);
            }
        }
        catch(Exception e) {
            System.err.println(e);
            System.exit(1);
        }
    }
}

Java doesn’t come out of this round terribly well. Java lacks a native command-line option parsing routine. Given that it has a library to parse X.509 certificates and given that I have to work with certificates far less often than I have to handle command-line options, one wonders why. There are several dozen option parsing libraries on GitHub that claim to be GNU getopt compatible. A popular choice seems to be args4j but then Java throws another obstacle in our way. We have an option that a standard options parser will choke on. In all the other languages under consideration we can make this a non-issue by modifying the arguments array. In Java, arrays are fixed-length because, clearly, reasons. We could get round this by copying the array members that we want to keep to a container object then turning that back into an array, or we could say that this is too much like pointless busywork for a recreational programming project, parse the options the hard way and mark Java down accordingly. Therefore, -2 for failing the “obvious should be obvious” and “the language itself shouldn’t get in the way” criteria. Even without the fixed overhead of using an options parser, the Java solution has overtaken the C++ one in terms of typing required if not in terms of line count. Note also the following:

  • Lines 20-21: Regular expressions are somewhat painful to use in Java but by now I’m not surprised
  • Lines 30-35: Custom option parsing code. Unlike the C++ solution, this will not scale at all well
  • Line 31: Who in their right minds wouldn’t want ‘==’ to do the obvious thing here? The Java language designers could have made testing a String object against a string literal do the right thing yet they very deliberately chose not to. Instead, the ugly x.equals(y) pattern is required. Interestingly, an ‘==’ test compiles; it just doesn’t work, because it compares object references rather than contents. -1 again for failing “the language itself shouldn’t get in the way” criterion.

Node.js

#!/usr/bin/env node

var lines = 10, i, optIdx = 2, needHelp = false, err = null,
    args = process.argv;
if (args.length > optIdx && /^\-\d+$/.test(args[optIdx])) {
    lines = parseInt(args[optIdx].substr(1));
    args.splice(optIdx, 1);
}

for (; optIdx < args.length; ++optIdx) {
    if (args[optIdx] == "--help" || args[optIdx] == "-?") {
        needHelp = true;
    }
    else if (args[optIdx] == "--count" || args[optIdx] == "-n") {
        if (/^\d+$/.test(args[optIdx + 1])) {
            lines = parseInt(args[optIdx + 1]);
            ++optIdx;
        }
        else {
            err = "Bad count argument";
            needHelp = true;
            break;
        }
    }
    else {
        break;
    }
}

if (needHelp) {
    if (err) {
        console.error(err);
    }
    console.error("Usage:\n  " + args[1] + " [--count|-n <lines>] [FILE]");
    process.exit(1);
}

if (optIdx === args.length) {
    process.exit();
}

var fs = require('fs'),
    readline = require('readline'),
    strm = fs.createReadStream(args[optIdx]),
    eof = false;

strm.on("error", function(e) {
    console.error(e.message);
    process.exit(1);
});
strm.on("end", function() {
    eof = true;
});

var rd = readline.createInterface({
    input: strm,
    terminal: false
});

rd.on("line", function(line) {
    if (!eof) {
        line += "\n";
    }
    process.stdout.write(line);
    if (--lines === 0) {
        rd.close();
    }
});

rd.on("close", function() {
    process.exit();
});

Node.js also lacks a standard option parser, so -1. I tried out yargs, which parses options OK but doesn’t appear to allow you to specify “-n” as a shortened alias for “--count”. Handling short options after yargs did its thing took exactly as many lines as handling the options myself, and there isn’t a lot of point in pulling in a third-party library if it’s not buying you anything. The JavaScript code for handling options is a direct port of the Java code. It is a lot more readable by virtue of, for example, regular expressions being first-class data types and the JavaScript syntax itself being less obtrusive.

Perl

#!/usr/bin/perl

=head1 SYNOPSIS

headpl [--count|-n <lines>] [FILE]

=cut

use strict;
use warnings qw/all/;
use Getopt::Long;
use Pod::Usage;

my ($needHelp, $err, $lines) = (0, "", 10);

if (@ARGV && $ARGV[0] =~ /\-(\d+)$/) {
    $lines = $1;
    shift(@ARGV);
}
GetOptions(
    "help|?" => \$needHelp,
    "count|n=s" => \$lines,
);
unless ($lines =~ /^\d+$/) {
    $err = "Bad count argument";
    $needHelp = 1;
}

pod2usage(
    exitval => 1,
    message => $err
) if ($needHelp);

exit 0 unless (@ARGV);
open(my $fh, "<", $ARGV[0]) or die("Unable to open $ARGV[0]: $!\n");
while (defined(my $line = <$fh>) && $lines--) {
    print($line);
}

The Perl solution remains admirably compact despite being formatted for readability. As you can see, Perl’s reputation as a “write-only” language is undeserved. Readability or otherwise of Perl code is entirely down to the author. Note the following:

  • Lines 3-7: This is POD (plain old documentation). The documentation serves double duty as the usage message which is very Perlish
  • Lines 20-23: The getopt implementation in Perl is very elegant and has a low cost of entry. The function perturbs the ARGV array so that what is left over represents non-option arguments. The arguments to GetOptions are pattern-reference pairs. We could replace the “big-arrow” (=>) operators with “,”, although the former is more idiomatic
  • Line 22: We could have specified this option as “count|n=i” and then GetOptions would discard any non-numeric value with a warning, leaving $lines unmodified. However, the other solutions error on a non-numeric argument and since we need to reject a negative value anyway, I have chosen to check the value myself
  • Line 29: pod2usage makes usage messages easy and your messages are as good as your documentation.

Python

#!/usr/bin/python
import sys
import re
import argparse

def usage():
    return '''Usage:
  headpy [--count|-n <lines>] [FILE]
'''

lines = 10
if len(sys.argv) > 1:
    customOption = re.compile(r"\-(\d+)")
    m = customOption.match(sys.argv[1])
    if m:
        lines = int(m.group(1))
        del sys.argv[1]

parser = argparse.ArgumentParser(
    usage = usage(), add_help = False
)
parser.add_argument("-?", "--help", required = False, action="store_true")
parser.add_argument("-n", "--count", required = False)
parser.add_argument("argv", nargs = argparse.REMAINDER)
args = parser.parse_args()
err = None
needHelp = args.help
if args.count:
    numericOption = re.compile(r"^\d+$")
    m = numericOption.match(args.count)
    if m:
        lines = int(args.count)
    else:
        err = "Bad count argument"
        needHelp = True
if needHelp:
    parser.error(err)    

if not args.argv:
    sys.exit(0)
try:
    with open(args.argv[0]) as f:
        for line in f:
            sys.stdout.write(line)
            lines -= 1
            if not lines:
                break
except IOError, e:
    sys.stderr.write("Unable to read %s: %s\n" % (sys.argv[1], e.strerror))
    sys.exit(1)

Python has a number of facilities for parsing options, including a getopt implementation. The module recommended in the pydoc is argparse. I wouldn’t say that using argparse is easy but it’s usable. Note the following:

  • Line 16: Unlike Perl and JavaScript, Python requires an explicit cast to an integer. This halfway-house to strong typing makes Python a less friendly scripting language
  • Lines 18-20: This is the code that constructs the parser. argparse has the ability to generate a help screen in response to -h/--help but I didn’t like the appearance of that help. These overrides, along with the usage function, make the usage output look similar to the output of the other four solutions
  • Line 22: If we added “type = int” to the argument list, argparse would coerce the count value to an integer. However, as with the Perl version, we also want to reject negative values so I’m choosing to parse the value myself
  • Line 23: This is a remarkably non-obvious way to get the option parser to not barf over non-option arguments, so -1. As coded, non-option arguments will appear in an array property of the parsed arguments called argv

Part 3: Reading from stdin

A well behaved program that takes its input from a file should also be able to read from stdin. This allows it to form part of a pipeline where the output of another program forms our program’s input. For example, we might want to take the output of the sort utility to show the top ten results. Conventionally, no file argument or an argument specified as “-” indicates that we should read from stdin.

C++

#include <iostream>
#include <fstream>
#include <string>
#include <cstring>
#include <regex>
#include <getopt.h>
using namespace std;

const int deflines = 10;

void usage(const char *name, const char *msg) {
    if (msg) {
        cerr << msg << "\n";
    }
    cerr << "Usage:\n  " << name << " [--count|-n <lines>] [FILE]\n";
}

void printstream(istream &in, int lines) {
    string line;
    for (int i = 0; i < lines && getline(in, line); ++i) {
        cout << line;
        if (!in.eof()) {
            cout << "\n";
        }
    }
}

int main(int argc, char **argv) {
    int lines = deflines, opt;
    const char *err = NULL;
    bool needHelp = false;
    string countOpt;
    regex numeric("\\d+");
    struct option lOpts[] = {
        { "help", no_argument, NULL, '?' },
        { "count", required_argument, NULL, 'n' },
        { NULL, 0, NULL, 0 }
    };
    if (1 < argc && argv[1][0] == '-' && regex_match(&argv[1][1], numeric)) {
        countOpt = &argv[1][1];
        argv[1] = argv[0];
        ++argv;
        --argc;
    }
    while ((opt = getopt_long(argc, argv, "n:?", lOpts, NULL)) != -1) {
        switch (opt) {
            case 'n':
                if (regex_match(optarg, numeric)) {
                    countOpt = optarg;
                }
                else {
                    err = "Bad count argument";
                    needHelp = true;
                }
                break;
            case '?':
            default:
                needHelp = true;
                break;
        }
    }
    if (needHelp) {
        usage(argv[0], err);
        return 1;
    }
    if (!countOpt.empty()) {
        lines = stoi(countOpt);
    }

    ios_base::sync_with_stdio(false);

    string fName;
    if (argc > optind) {
        fName = argv[optind];
    }
    if (fName.empty() || "-" == fName) {
        printstream(cin, lines);
    }
    else {
        ifstream file(fName);
        if (!file.is_open()) {
            cerr << "Unable to open " << fName <<
                ": " << strerror(errno) << "\n";
            return 1;
        }
        printstream(file, lines);
    }
    return 0;
}

Nothing difficult here. There’s a slight hindrance in that stream objects cannot be assigned. For example, this wouldn’t work:

istream p;
...
p = cin;

We could use pointers:
istream *p = NULL;
...
p = &cin;

However, I chose to refactor, moving the code that does the actual work into a function that takes an istream reference. It is then a simple matter of calling it with the correct input stream object.

Java

import java.io.*;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

class Head {
    private static final int deflines = 10;

    private static void usage(String msg) {
        if (!(msg == null || msg.isEmpty())) {
            System.err.println(msg);
        }
        System.err.println("Usage:\n  java Head [--count|-n <lines>] [FILE]");
        System.exit(1);
    }

    public static void main(String[] args) {
        int optIdx = 0, lines = deflines;
        boolean needHelp = false;
        String err = null;
        Pattern customOption = Pattern.compile("\\-(\\d+)"),
            numericOption = Pattern.compile("^\\d+");

        if (args.length > 0 && args[0].startsWith("-")) {
            Matcher m = customOption.matcher(args[0]);
            if (m.find()) {
                lines = Integer.parseInt(m.group(1));
                ++optIdx;
            }
        }
        for (; optIdx < args.length; ++optIdx) {
            if (args[optIdx].equals("--help") ||
                args[optIdx].equals("-?")) {
                needHelp = true;
            }
            else if (args[optIdx].equals("--count") ||
                args[optIdx].equals("-n")) {
                Matcher m = numericOption.matcher(args[optIdx + 1]);
                if (m.find()) {
                    lines = Integer.parseInt(args[optIdx + 1]);
                    ++optIdx;
                }
                else {
                    err = "Bad count argument";
                    needHelp = true;
                    break;
                }
            }
            else {
                break;
            }
        }
        if (needHelp) {
            usage(err);
        }

        String fName = "";
        if (optIdx < args.length) {
            fName = args[optIdx];
        }
        try {
            Reader r = null;
            if (fName.isEmpty() || fName.equals("-")) {
                r = new InputStreamReader(System.in);
            }
            else {
                r = new FileReader(fName);
            }
            BufferedReader br = new BufferedReader(r);
            String line = null;
            for (int i = 0; (line = br.readLine()) != null &&
                i < lines; ++i) {
                System.out.println(line);
            }
        }
        catch(Exception e) {
            System.err.println(e);
            System.exit(1);
        }
    }
}

Java placed no obstacles in my way here. We just need to switch the Reader object that we use to construct the BufferedReader used to split the input into individual lines.

Node.js

#!/usr/bin/env node

var lines = 10, i, optIdx = 2, needHelp = false, err = null,
    args = process.argv;
if (args.length > optIdx && /^\-\d+$/.test(args[optIdx])) {
    lines = parseInt(args[optIdx].substr(1));
    args.splice(optIdx, 1);
}

for (; optIdx < args.length; ++optIdx) {
    if (args[optIdx] == "--help" || args[optIdx] == "-?") {
        needHelp = true;
    }
    else if (args[optIdx] == "--count" || args[optIdx] == "-n") {
        if (/^\d+$/.test(args[optIdx + 1])) {
            lines = parseInt(args[optIdx + 1]);
            ++optIdx;
        }
        else {
            err = "Bad count argument";
            needHelp = true;
            break;
        }
    }
    else {
        break;
    }
}

if (needHelp) {
    if (err) {
        console.error(err);
    }
    console.error("Usage:\n  " + args[1] + " [--count|-n <lines>] [FILE]");
    process.exit(1);
}

var fs = require('fs'),
    readline = require('readline'),
    fName = "", strm, eof = false;
if (optIdx < args.length) {
    fName = args[optIdx];
}
if (!fName || fName === "-") {
    strm = process.stdin;
}
else {
    strm = fs.createReadStream(fName);
}
strm.on("error", function(e) {
    console.error(e.message);
    process.exit(1);
});
strm.on("end", function() {
    eof = true;
});

var rd = readline.createInterface({
    input: strm,
    terminal: false
});

rd.on("line", function(line) {
    if (!eof) {
        line += "\n";
    }
    process.stdout.write(line);
    if (--lines === 0) {
        rd.close();
    }
});

rd.on("close", function() {
    process.exit();
});

Again, no problems. JavaScript’s weak typing makes this easier than the Java version since strm is just a reference to a thing that behaves in a stream-like manner.

Perl

#!/usr/bin/perl

=head1 SYNOPSIS

headpl [--count|-n <lines>] [FILE]

=cut

use strict;
use warnings qw/all/;
use Getopt::Long;
use Pod::Usage;

my ($needHelp, $err, $lines) = (0, "", 10);

if (@ARGV && $ARGV[0] =~ /\-(\d+)$/) {
    $lines = $1;
    shift(@ARGV);
}
GetOptions(
    "help|?" => \$needHelp,
    "count|n=s" => \$lines,
);
unless ($lines =~ /^\d+$/) {
    $err = "Bad count argument";
    $needHelp = 1;
}

pod2usage(
    exitval => 1,
    message => $err
) if ($needHelp);

my $fh = *STDIN;
if (@ARGV && $ARGV[0] ne "-") {
    open($fh, "<", $ARGV[0]) or die("Unable to open $ARGV[0]: $!\n");
}
while (defined(my $line = <$fh>) && $lines--) {
    print($line);
}

The Unix model that everything is a file applies to Perl: stdin is simply a filehandle which means that all we have to do is change what $fh refers to. Note the following:

  • Line 34: The syntax may be unfamiliar. STDIN is an entry in the symbol table, but it doesn’t have a defined type (i.e., there isn’t a $STDIN or a %STDIN). *STDIN is called a typeglob and denotes the whole symbol-table entry, in other words “the thing that STDIN refers to”
  • Line 35: Perl treats string values and numeric values interchangeably, depending on the context. Distinguishing numeric equality from string equality therefore requires different operators: ==/!= for numeric comparison, eq/ne for string comparison.

Python

#!/usr/bin/python
import sys
import re
import argparse

def usage():
    return '''Usage:
  headpy [--count|-n <lines>] [FILE]
'''

def printstream(f, lines):
    for line in f:
        sys.stdout.write(line)
        lines -= 1
        if not lines:
            break   

lines = 10
if len(sys.argv) > 1:
    customOption = re.compile(r"\-(\d+)")
    m = customOption.match(sys.argv[1])
    if m:
        lines = int(m.group(1))
        del sys.argv[1]

parser = argparse.ArgumentParser(
    usage = usage(), add_help = False
)
parser.add_argument("-?", "--help", required = False, action="store_true")
parser.add_argument("-n", "--count", required = False)
parser.add_argument("argv", nargs = argparse.REMAINDER)
args = parser.parse_args()
err = None
needHelp = args.help
if args.count:
    numericOption = re.compile(r"^\d+$")
    m = numericOption.match(args.count)
    if m:
        lines = int(args.count)
    else:
        err = "Bad count argument"
        needHelp = True
if needHelp:
    parser.error(err)    

fName = None
if args.argv and args.argv[0] != "-":
    fName = args.argv[0]
try:
    if fName:
        with open(fName) as f:
            printstream(f, lines)
    else:
        printstream(sys.stdin, lines)
except IOError, e:
    sys.stderr.write("Unable to read %s: %s\n" % (sys.argv[1], e.strerror))
    sys.exit(1)

As with the C++ solution, I refactored the Python script to move the core functionality into its own function that takes some kind of iterable object. Unlike with the C++ solution, I don’t see another way to make this work: the controlled execution block that controls the lifetime of the file object is not something that I can change to use sys.stdin, so we’re stuck with having to treat a file and stdin differently. Note the following:

  • Line 47: Things like unfamiliar logic operators make programming in a language more difficult than it needs to be. Given that Python is written in C, would it have killed GvR to have used the familiar “&&”?

Part 4: Relative Performance

This section is for fun. I don’t believe that performance should be the primary criterion for all but a few problem domains. Assuming your algorithms are good, most programs these days are fast enough. The correct target for early optimization is the programmer rather than the CPU, since CPU time, unlike programmer time, is always getting cheaper. Therefore, clearer code that takes a few more microseconds to execute is a worthwhile investment. That said, I/O is something you want to be fast since that is one of the performance limiters of any program.

The software versions are the ones hanging around on my MacBook:

$ clang --version
Apple LLVM version 7.3.0 (clang-703.0.31)
$ java -version
java version "1.6.0_51"
$ node --version
v4.2.2
$ perl --version
This is perl 5, version 18, subversion 2 (v5.18.2) ...
$ python --version
Python 2.7.10

The C++ implementation was built as follows:

$ g++ -o headcpp -O3 -Wall head.cpp

To gauge performance, I ran the following for each implementation:

$ time for x in {1..10}; do <HEADCMD> -200000 </usr/share/dict/words >/dev/null; done

This means that we are reading two million lines with the various getline implementations. Redirecting output to /dev/null means that we are not left waiting on the terminal. For each implementation, I did three runs and took the best of the three.

head

We’ll use the real head implementation for reference. As I said earlier, were I writing a serious implementation, I wouldn’t read one line at a time and neither does the actual head utility. Anyway:

real	0m0.296s
user	0m0.268s
sys	0m0.026s

C++

real	0m4.848s
user	0m3.947s
sys	0m0.872s

Remember that there were a couple of source-level optimizations. Let’s see how performance changes if we undo them. First, using endl instead of “\n”:

real	0m4.860s
user	0m3.978s
sys	0m0.856s

No meaningful difference so not using endl was premature optimization. Now let’s see how keeping C++ streams synchronized with C stdio affects performance:

real	0m4.683s
user	0m3.841s
sys	0m0.819s

Again, no change. However, it did make a big difference on a Linux system, so this is an optimization worth keeping.

Java

real	0m8.358s
user	0m9.526s
sys	0m2.219s

Node.js

real	0m6.597s
user	0m5.592s
sys	0m1.099s

Perl

real	0m1.278s
user	0m1.140s
sys	0m0.106s

Python

real	0m1.448s
user	0m1.196s
sys	0m0.211s

The real surprise here is the performance of the C++ program, which is around three times worse than the Python and Perl programs. Not so surprising is that Java is the worst performer, around twice as slow as the C++ version. This is despite Java compiling to bytecode ahead of time and then using a JIT compiler to turn the bytecode into native code. The Node.js implementation lies somewhere between the C++ and Java implementations, which is not very impressive given that it, too, is JIT-compiled to native code, and considering also the demented event-driven file handling that was imposed on us. The Python and Perl performance is extremely impressive, not a million miles from the reference figures, with Perl just edging out its Dutch rival.

I was so surprised at the terrible performance of the C++ program that I rewrote it as a straight C program using stdio:

#include <stdio.h>
#include <getopt.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

#define DEFLINES 10

void usage(const char *name, const char *msg) {
    if (msg) {
        fprintf(stderr, "%s\n", msg);
    }
    fprintf(stderr, "Usage:\n  %s [--count|-n <lines>] [FILE]\n", name);
}

int numericOption(const char *s) {
    const char *p = s;
    for (; *p; ++p) {
        if (!isdigit(*p)) {
            return -1;
        }
    }
    return atoi(s);
}

int main(int argc, char **argv) {
    int lines = DEFLINES, needHelp = 0, opt, nOpt;
    size_t len = 0;
    ssize_t read;
    const char *err = NULL;
    char *fn = NULL, *line = NULL;
    FILE *fp = stdin;
    struct option lOpts[] = {
        { "help", no_argument, NULL, '?' },
        { "count", required_argument, NULL, 'n' },
        { NULL, 0, NULL, 0 }
    };
    if (1 < argc && argv[1][0] == '-' && argv[1][1]) {  /* skip a bare "-" */
        nOpt = numericOption(&argv[1][1]);
        if (nOpt >= 0) {
            lines = nOpt;
            argv[1] = argv[0];
            ++argv;
            --argc;
        }
    }
    while ((opt = getopt_long(argc, argv, "n:?", lOpts, NULL)) != -1) {
        switch (opt) {
            case 'n':
                nOpt = numericOption(optarg);
                if (nOpt >= 0) {
                    lines = nOpt;
                }
                else {
                    err = "Bad count argument";
                    needHelp = 1;
                }
                break;
            case '?':
            default:
                needHelp = 1;
                break;
        }
    }
    if (needHelp) {
        usage(argv[0], err);
        return 1;
    }
    if (argc > optind && strcmp(argv[optind], "-")) {
        fn = argv[optind];
    }
    if (fn) {
        fp = fopen(fn, "r");
        if (!fp) {
            fprintf(stderr, "Unable to open %s: %s\n", fn, strerror(errno));
            return 1;
        }
    }
    while ((read = getline(&line, &len, fp)) != -1 && lines--) {
        fputs(line, stdout);
    }
    free(line);
    if (fn) {
        fclose(fp);
    }
    return 0;
}

Let’s see how it performs:

$ time for x in {1..10}; do ./headc -200000 </usr/share/dict/words >/dev/null; done

real	0m0.306s
user	0m0.270s
sys	0m0.026s

That’s more like what you’d expect from a compiled binary and isn’t noticeably slower than the native head utility. Importantly, it’s more than ten times faster than the C++ implementation and given that the only real difference between the two is the I/O routines, I have to conclude that the performance of C++ stream I/O is rather dismal. The C++ code is measurably faster when reading the file directly rather than redirecting stdin, which points the finger at cin.

Part 5: Conclusions

C++ is a pleasant language to work with and the syntax is unobtrusive, if a little long-winded. The reward for the extra effort is that the compiled program runs at native speed. One goal of C++ is the “zero overhead principle”, by which it is meant that you won’t be able to get better performance by programming in another language. When it comes to I/O, however, that simply isn’t true. C++ I/O is not only greatly slower than the legacy C stdio but also slower than two scripting languages, Perl and Python. For bread-and-butter programming, plain C would appear to be the better choice.

In a former life, I spent around three years as a Java programmer. At the time it felt like a breath of fresh air, but given that I’d spent some previous years doing Windows development using COM, that’s not surprising. On reflection, the difference between them was like the difference between warm faeces and cold vomit. If forced to express an opinion, you might favour one over the other but you’d rather have neither in your mouth. Java is simply unpleasant to work with: fussy, verbose syntax combining the developer overhead of a statically-typed, compiled language with the poor performance of an interpreted one. Somewhere along the way, Java got entrenched as the “Enterprise” development language but I don’t quite understand how: if it’s no good at the basics, how can it really be any good for the enterprise? The supposed benefit of binary portability is questionable, since Perl and Python are every bit as portable and don’t have Java’s shortcomings. In fairness to Java, I will acknowledge that I am not using the latest and greatest version, although I doubt that a single version bump has suddenly made Java run quickly. I am also aware that Java 1.7 has introduced a “try-with-resources” statement that brings a touch of RAII to the Java world. This would have made working with Java slightly less unpleasant.

Programmers love novelty as much as the next person which is why, I’m guessing, Node.js has gained traction. Otherwise, I can’t quite see what it’s bringing to the table. Event-driven programming makes absolute sense for code running in the browser, since you’re responding to mouse clicks and form submissions and so on. It doesn’t make quite so much sense in a scripting language, adding needless complexity to basic tasks. Proponents would say that the problem domain for Node.js is not basic scripting but scalable network applications and, to be fair, a non-blocking event model makes a lot more sense for sockets than it does for files. However, I would reply that Python (with Twisted) and Perl (with POE) can do scalable network applications just fine and are really good at the basic stuff as well.

I presume that love of novelty is also why Perl has lost mindshare in the last decade or so. To many programmers, Perl is what that crotchety old Unix-guy in the corner reaches for when awk runs out of steam. Perl is, indeed, the Unix philosophy applied to language design. But that’s a good thing, because it helps make Perl economical and elegant. Perl’s manifesto is that easy things should be easy and hard things should be possible, and it succeeds in its aims. Interestingly, the file size of the Perl solution is half that of its nearest rival, which translates to greater programmer productivity: if I have to write less, I’m going to get more done. It gets better. If you’ve ever read ‘Code Complete’, you’ll know that the number of defects per 1,000 lines of code is roughly constant, so fewer lines of code means fewer defects. Get more done with fewer defects: who wouldn’t want that?

Python is also a very solid choice as a general-purpose programming language, being both economical and performant. There is a hidden cost, however: the very idiosyncratic syntax is a barrier to any programmer who has anything like a C background. Python fans have an adjective for idiomatic Python code, “pythonic”. If you’re not “pythonic”, writing in Python can be more of a chore than a pleasure. Given that performance compared to Perl is a wash and that it’s not as economical (file size is twice that of the Perl solution), I won’t be switching to Python any time soon. That said, if I had to use Python for the bulk of my work, I wouldn’t start looking for another job. You can’t say that about Java!

At the start of this article, I said that choice of programming language should be a neutral one. Of course, that is far from the truth. With the possible exception of target platform, choice of language is the single most important engineering decision you can make. Make the wrong choice and you’re halving team productivity while doubling the number of software defects. Don’t go for the latest fashionable craze (which this year is Rust) and try to avoid the sort of mindset that believes “enterprise” means Java. If raw speed is not a primary criterion (and generally, it isn’t), give very serious consideration to a good scripting language. You’ll be a happier engineer for doing so.

Seven Habits of Highly Ineffective Programmers

Credit: The People Speak!

Some programmers are better than others. In fact, there’s a statistical distribution: a few are absolutely brilliant, some are good, most are at least competent, some are barely competent and a few are truly dire. It’s an interesting observation that the Microsofts and Googles of the world will have seriously incompetent people writing parts of their software. And the really bad software can live for years, long after the author has taken his mediocrity with him to another unsuspecting company because one characteristic of bad software is that it is fantastically baroque and no-one else wants to take ownership. So there it stays. It’s almost certainly inefficient and probably insecure but so long as it doesn’t explode, everyone just leaves it the heck alone.

I will point out that bad programmers are not necessarily stupid and are certainly not bad people (or if they are, it’s not related to them also being bad programmers), they’re just individuals who find themselves in a job that they’re unsuited for. That said, the worst programmers share a number of characteristics. There are a lot more than seven, of course, but here’s my list of seven of the worst. Bad programmers…

…Lack basic knowledge

Writing software is an art to be sure, but it’s also a craft and like any craft, there is certain basic knowledge that you need to be able to practise your craft effectively. I don’t expect you to be Donald Knuth but I do expect you to know the basics of data structures and algorithms. I don’t expect you to be able to code a hashtable but I do expect you to know that there is such a thing and why you might use one.
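
To make that concrete, here is a minimal sketch of the sort of problem a hashtable is for; the word-counting task is illustrative:

#include <iostream>
#include <string>
#include <unordered_map>

// Count word frequencies: the hashtable gives O(1) average lookup per
// word, where a tree or a sorted list would cost O(log n) or worse.
int main() {
    std::unordered_map<std::string, int> counts;
    std::string word;
    while (std::cin >> word) {
        ++counts[word];  // insert-or-increment, constant time on average
    }
    for (const auto &entry : counts) {
        std::cout << entry.second << " " << entry.first << "\n";
    }
    return 0;
}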

Not understanding the basics of algorithms and data structures will result in serious performance issues. Imagine a list whose members are in sort order that currently contains 1000 items. A method exists to insert new items into the list. The bad programmer’s implementation will add items to the end of the list and then sort it. This is O(n log2 n), or roughly 10000 comparisons. The competent programmer will do a linear search of the list looking for the insertion point. This is O(n), or roughly 500 comparisons on average. The good programmer will use binary search to find the insertion point. This is O(log2 n), or roughly 10 comparisons. The worst implementation is three orders of magnitude slower than the best implementation. If there are 1,000,000 items in the list, the worst implementation is heading for disaster.
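
For reference, a sketch of the good programmer’s version, assuming a std::vector kept in ascending order; std::lower_bound performs the binary search:

#include <algorithm>
#include <vector>

// Insert x while keeping v sorted. std::lower_bound finds the insertion
// point in O(log n) comparisons; the element shift that insert() then
// performs is a separate cost that all three approaches pay in some form.
void insertSorted(std::vector<int> &v, int x) {
    v.insert(std::lower_bound(v.begin(), v.end(), x), x);
}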

…Fail to understand idioms

Programming languages are the medium through which we express our intentions. Like human languages, computer languages have their idioms and idiosyncrasies. The good programmer will master the idioms while the bad programmer will create anti-patterns.

The anti-patterns are typically irksome rather than disastrous. For example, if you don’t understand the const idiom in C++, you’re likely to write functions like this:

void writeData(char *data) {
    // Do stuff
}

Leaving aside the fact that char * is effectively a promise to modify the data, the API is a pain to use, requiring ugly const_cast invocations:
std::string data("...");
...
writeData(const_cast<char *>(data.c_str()));

If we don’t own the code, we’re stuck with it. If we do own the code, we’re probably still stuck with it for two reasons:

  • If I fix the signature for writeData so that the argument is declared const, I need to fix all existing calls to writeData as well
  • The rest of the code in the file containing writeData is likely to be terribad. If I touch it, I own it and I don’t want it! (The const-correct fix is sketched below)
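
For reference, the const-correct version is a one-word change to the signature, after which the const_cast disappears from every call site:

#include <string>

// The parameter is now a promise not to modify the data.
void writeData(const char *data) {
    // Do stuff
}

void caller() {
    std::string data("...");
    writeData(data.c_str());  // no const_cast required
}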

…Use the wrong tools

Imagine you’re looking at a C++ header file and you see something like this:

typedef struct LIST {
  struct LIST *next;
  void* payload;
} list, *plist;

If you don’t recognize that, because you weren’t around in the 1990s, that’s a linked list struct, probably the same one that the bad programmer lifted from Herb Schildt’s ‘Teach Yourself C’ back in 1995 and which he’s been using ever since. Whatever its origin, it’s a sure sign that the programmer is using the wrong tools. std::list would be an obvious replacement but looking at the code more closely might indicate that std::vector or std::deque would be better suited to the task at hand. One thing is certain, however: the right tool for the job is never going to be a hand-rolled linked list.
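
A sketch of what the replacement might look like; Payload stands in for whatever the void* used to point at:

#include <deque>
#include <list>
#include <vector>

struct Payload { /* whatever the void* used to point at */ };

// The standard containers manage the links (or the contiguous buffer)
// and the memory for us; no hand-rolled next pointers required.
std::list<Payload> linked;        // if you genuinely need a linked list
std::vector<Payload> contiguous;  // usually the better default
std::deque<Payload> bothEnds;     // cheap insertion at front and back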

Windows developers are forced to try and do everything in their graphical apps by the sheer poverty of the operating environment. They can’t reuse tools to do what their application requires because there are none, or none worth using. Unix developers, on the other hand, have more tools to play with than some of them know how to actually use. Have you ever seen something like this?

#!/bin/bash
pattern=...
file=...
...
grep $pattern $file | awk '{ print $2 }'

In case you don’t recognize the syntax, a tool called grep is being used to pick out matching lines from a file and these lines are being piped to another tool called awk to print the second field of each line. As any fule kno, awk can search files for patterns just fine so grep is redundant:

awk "/$pattern/ {print \$2}" $file

…Ignore conventions

Well behaved Unix (and Windows) programs conform to well-established conventions:

  • Input is read (or at least can be read) from a special filehandle called stdin
  • Output is written (or at least can be written) to a special filehandle called stdout
  • Success is indicated by an exit status of 0
  • Failure is indicated by an exit status other than 0
  • Associated error text is written to a special filehandle called stderr
  • Program options are specified by command-line switches that look like -i <FILENAME> or --inputfile <FILENAME>

Following these conventions allows marvelous things to happen. The output of another program, such as grep, can be used as the input to my program, the output of which can be passed to, say, a PDF writer that produces beautifully formatted reports. My program doesn’t need to know how to do pattern matching and it certainly doesn’t need to know how to write PDFs. Instead, it can just get on with doing whatever it is that it does.
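
A minimal sketch of a filter that follows these conventions: input from stdin, output to stdout, error text to stderr and a meaningful exit status:

#include <iostream>
#include <string>

int main() {
    std::string line;
    while (std::getline(std::cin, line)) {   // input read from stdin
        // ... do whatever it is that we do to each line ...
        std::cout << line << "\n";           // output written to stdout
    }
    if (std::cin.bad()) {
        std::cerr << "read error\n";         // error text to stderr
        return 1;                            // non-zero exit indicates failure
    }
    return 0;                                // 0 indicates success
}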

A long-departed colleague wrote dozens of Python scripts and compiled C++ programs that had the following characteristics:

  • The command line options were always two XML documents, one for actual options and one for “parameters”
  • The output was always an XML document on stdout, even in the case of failures
  • The exit status was always 0
  • The actual status was somewhere in the output XML, along with any error text
  • Progress text was written to a magic filehandle with a fileno of 3.

Needless to say, he was a terrible programmer.

…Are too clever by half

I said above that bad programmers are not necessarily stupid. Some are really quite clever. Too damned clever if you ask me. Have you ever seen something like this?

void majorfail(char *s) {
    char buf[256], *p = buf;
    while (*p++ = *s++);
    // Do stuff with buf
}

Bad programmers are insecure programmers as well, by which I don’t mean that they’re secretly afraid that they’re not very good but rather that the software they write has security holes large enough to drive trucks through. That while loop may look 1337 hardcore but if I see something like it, I’m not praising the author’s mastery of the pointer; I’m asking why he’s reinventing strcpy. And this message goes out to all the bad programmers of the world: 256 characters (or 512 or 1024) is not big enough to hold any possible input. The fix is not only safe, but way simpler and more comprehensible:
void minorfail(char *s) {
    std::string str(s);
    // Do stuff with str
}

Sometimes, a piece of clever code relies on some really obscure knowledge to make sense of it. Look at the following JavaScript:

var hardcoreMathFn = function(x) {
    if (x === x) {
        // Do loads of really cool stuff
    }
};

Because I’m a bit of a clever-clogs myself, I recognize that this is a NaN test (NaN is the only value which is not equal to itself). If you, the maintainer of this obscurantist nonsense, are not aware of that piece of arcane knowledge, then the code looks like this:
if (true) {
    // Do loads of really cool stuff
}

The correct fix is:
if (!isNaN(x)) {
    // Do loads of really cool stuff
}

but the more likely one is:
// Do loads of really cool stuff

This sort of uncommented esotericism led to one of the worst bugs ever and I personally hold the smartypants openssl devs responsible, not the well meaning Debian package maintainer.

…Reinvent wheels. Badly

Have you ever seen code that does its own command-line processing? Something like this:

int main(int argc, char **argv) {
    char *file = NULL, *host = NULL, ...
    for (int i = 1; i < argc; i += 2) {
        if (!strcmp(argv[i], "-file")) {
            file = argv[i + 1];
        }
        else if (!strcmp(argv[i], "-host")) {
            host = argv[i + 1];
        }
        ...
    }
    ...
}

This is ugly to read, a ballache to maintain, and accommodating an option that doesn’t take an additional parameter, such as -verbose, will be irksome. Note also the non-standard argument form: -file rather than --file. If you don’t know already, you’ll probably guess that DIY command-line processing is quite unnecessary since the problem has been solved (getopt, for starters). The bad programmer is simply too ignorant to know that this is the case or too lazy to figure out how to do it properly.
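To give a flavour of the solved-problem approach in a language from this blog’s usual stable, here is a minimal sketch in Node.js. It assumes a runtime recent enough to ship the built-in util.parseArgs; the option names are lifted from the example above:

const { parseArgs } = require("node:util");

// Declare the options once and let the library do the parsing;
// flags without arguments and --file=foo forms come for free
const { values } = parseArgs({
    options: {
        file:    { type: "string" },
        host:    { type: "string" },
        verbose: { type: "boolean" }
    }
});

console.log(values.file, values.host, values.verbose);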

Bad programmers don’t just reinvent, they invent as well. A sure sign of bad software design is the invention of (oxymoron alert!) a proprietary network protocol. Imagine that we are implementing a client/server system that receives JSON-encoded requests (no, not XML because this is the 21st century) and emits JSON-encoded responses. An exchange might look like this:

{"request":"Yo!"}
{"response":"Wassup?"}

Now imagine that our “protocol” is simply that each request and response terminates on a newline so that the actual network exchange is this (“\n” signifies an actual newline character rather than an escape sequence):

{"request":"Yo!"}\n{"response":"Wassup?"}\n

Why is this a problem? Because implementing the protocol requires us to work at an uncomfortably low level. We have to get down and dirty with send, recv and select. We have to handle the possibility that the other end might fall over halfway through a message ourselves. If, however, we’d chosen HTTP rather than inventing our own protocol, we could use something like libcurl on the client side, a library that is not only mature and robust but more importantly is maintained by someone else. The server side could be implemented as a simple CGI script.
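To make the contrast concrete, here is a minimal sketch of the same exchange done over HTTP from JavaScript; the URL is invented for illustration. All the framing, partial-message and error-handling worries belong to the library:

fetch("https://example.com/api", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ request: "Yo!" })
})
.then(function(res) { return res.json(); })
.then(function(obj) { console.log(obj.response); }) // "Wassup?"
.catch(function(err) { console.error(err); });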

…Are plain sloppy

Look at my code samples in this article. Even when I’m demonstrating manifestly bad practices, the code is neat and orderly. You may question my opening brace placement since I favour the one, true placement of Kernighan and Ritchie over the deviant form of the heretic Stroustrup but you can’t deny that my coding style is readable and consistent. Bad software isn’t neat. The bad programmer can’t decide whether or not a space should go after keywords like if and while (hint: yes it should). You can generate random numbers from how many spaces go after commas: 0, 1 or other (hint: the correct value is 1). Similar strong randomness seems to govern whether or not there is a space between a function name and its argument list (hint: there shouldn’t be but I’ll let it pass if you’re consistent). And tabs and spaces are intermixed almost whimsically.

You may retort that whitespace isn’t significant, so what’s the problem? I said above that programming is a craft as well as an art. It’s also a discipline, one that requires an orderly thought process and a methodical approach. Code spends most of its time being maintained, not being written. If you don’t take a few minutes to ensure that your indentation matches your nesting, how is the person who comes after you supposed to follow your logic? If you can’t pay attention to details, how are you going to avoid sneaking in defects? In my experience, the probability of untidy code also being good code is close to 0.

Extending JavaScript Built-ins

Credit: Martin Burns

One should neither split infinitives nor end sentences with prepositions. These statements are examples of prescriptive grammar: rules invented by self-appointed experts who presume to tell others how they should speak if they don’t want to sound like louts. As far as I can tell, the origin of these prescriptions is that you couldn’t split infinitives or strand prepositions in Latin. Except we speak English, and English accommodates “to boldly go” just fine; if we were to ask someone “About what are you talking?”, they would be unlikely to compliment us on the quality of our speech.

The Internet is full of self-appointed experts who presume to tell others how to program in a similarly prescriptive way. One of the prescriptions that you may or may not be aware of is that one should never extend the built-in JavaScript classes (i.e. String, Array, Date, etc.). This Stack Overflow post is a good example of the controversy complete with huffing and puffing and stern warnings against the evils of such a filthy habit. Yet the ability to extend the behaviour of classes without subclassing is a core feature of JavaScript, not a bug, so a prescription against such a useful ability had better have a really good rationale (and no, MDN, fulmination is not a valid argument). The arguments against go something like this:

  • The method might be implemented in a future standard in a slightly different way
  • Your method name appears in for...in loops
  • Another library you use might implement the same method
  • There are alternative ways of doing it
  • It’s acceptable practice for certain elites but not for the unwashed masses
  • A newbie programmer might be unaware that the method is non-standard and get confused.

The retort to that last argument is “git gud”, while the retort to the last but one is something rather more robust. We’ll examine the other arguments in turn.

Implemented in a Future Standard

It might happen, particularly if the functionality is obviously useful. I started writing JavaScript early in 2005 and it didn’t take me long to get fed up with doing this:

var a = "...", s = "...";
...
if (a.substr(0, s.length) === s) {
   // a starts with s
}

The obvious solution was to extend the String class:
String.prototype.startsWith = function(s) {
   return (this.substr(0, s.length) === s);
};

This allowed me to express myself rather more clearly:
if (a.startsWith(s)) {
   ...
}

This is obvious functionality implemented in an obvious way. No, my implementation didn’t test for pathological input but then I never fed it pathological input so it didn’t matter. Sometime around 2012, browsers containing a native implementation of startsWith started appearing and the only thing that needed changing was a test for the presence of the method before stomping over it. Otherwise, despite the addition of a feature-in-search-of-a-use-case (a start position other than 0), the official implementations behaved identically to my venerable extension.
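The guarded version of my extension looks something like this (a sketch, following the same presence-test pattern as the shuffle example later in this post):

if (!("startsWith" in String.prototype)) {
    String.prototype.startsWith = function(s) {
        return (this.substr(0, s.length) === s);
    };
}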

For argument’s sake, let’s say I’d called the method beginsWith. The fix would take longer to type than to apply:

$ sed -i -e 's/beginsWith/startsWith/g;' *.js

Even if I’d been stupid enough to do something like return 0/-1 rather than true/false, fixing up my uses of startsWith to use a different return type would have been the work of a few minutes. We call this refactoring, which means that the argument in favour of possible future implementations is actually an argument against refactoring, which is no argument at all.

for…in Pollution

The problem here is that your method name appears in the enumerable properties of an object but this is actually a non-issue. Leaving aside the fact that it has been possible for some years to ensure that your method is not enumerable in the great majority of browsers in the world, would you ever enumerate the properties of a String or Date object? Hint: there are no enumerable properties. Furthermore, if you enumerate over an Array with a for...in loop, you’re doing it wrong.

Implemented by Another Library

The problem here is that the last library included wins and if utility library A behaves differently to utility library B, code may break. But again, this is something of a non-issue if certain safeguards are observed: test for the existence of an existing method and implement obvious functionality in an obvious way. If anything, this is an argument against using the monolithic jqueries of the world in favour of micro frameworks, since micro frameworks tend not to do a thousand and one things that you don't know about.

It’s Unnecessary

True, inasmuch as there is always more than one way to do anything. The question is: are alternatives better? Let’s say that I want a function to shuffle an Array. The function might look like this:

function shuffle(a) {
    var l = a.length, c = l - 1, t, r;
    for (; c > 0; --c) {
        r = Math.floor(Math.random() * l);
        t = a[c];
        a[c] = a[r];
        a[r] = t;
    }
    return a;
}

For the interested, this is the Fisher-Yates algorithm. How are we going to make this function available to our application? We can reject one possibility - subclassing Array - right away. The reason is that arrays are almost always instantiated using the more expressive literal notation:
var a = [1, 2, 3],   // like this almost always
    b = new Array(); // like this almost never

This makes using a MyExtendedArray class unfeasible - the overhead would simply be too great. The same applies to theoretical subclasses of String and Number.

Using a procedural approach is doable but we now come up against the problem of namespacing. As written, the identifier "shuffle" is placed into the global namespace. This isn't as fatal as some would have you believe if it's for private consumption. However, if you plan to create a library for consumption by others you should avoid using the global namespace because collision with identically named entities is a non-negligible risk. One could imagine, for example, another "shuffle" function that works on a "DeckOfCards" object or a "Songlist". Utility functions are problematic because the namespace isn't obvious. One could choose something like "ArrayUtils" but then you're liable to be competing with everyone else's ArrayUtils. So you might end up doing this:

if (!("CCWUtils" in window)) {
    CCWUtils = {};
}
CCWUtils.Array = {
    shuffle: function(a) {
        ...
    }
};

...

var cards = ["AS", "2S", "3S", ... , "JC", "QC", "KC"];
CCWUtils.Array.shuffle(cards);

Remember that we're doing it this way because we believe that adding a "shuffle" method to the native Array class is somehow sinful. If that feels like the tail wagging the dog, compare the sinful approach:
if (!("shuffle" in Array.prototype)) {
    Array.prototype.shuffle = function() {
        ...
    };
}

...

var cards = ["AS", "2S", "3S", ... , "JC", "QC", "KC"];
cards.shuffle();

To my eyes, cards.shuffle() is both more idiomatic and more elegant. Namespacing isn't a problem and I take care to play nicely with any existing shuffle method.

Doing it Right

I believe that extending built-in classes is a valid practice but there are some guidelines that you might follow:

  • Add useful functionality. For example, the concat method of String objects implements the same functionality as the + operator. Don't do something similar
  • Use an obvious name and implement obvious functionality
  • Be careful to avoid overwriting a method with the same name
  • If possible, ensure that your method is non-enumerable, if only to silence critics who might otherwise complain that you're polluting their for...in loops
  • Take some pains to ensure that your method behaves like other methods. For example, methods of built-in objects are typically implemented generically and mutator methods (methods that modify the object) typically return the mutated object as well.

With that in mind, here is the complete shuffle extension:

(function() {
    "use strict";
    var _shuffle = function(a) {
        if (null == a) {
            throw new TypeError("can't convert " + a + " to an object");
        }
        var _a = Object(a), len = _a.length >>> 0, c = len - 1, t, r;
        for (; c > 0; --c) {
            r = Math.floor(Math.random() * len);
            // Swap the item at c with that at r
            t = _a[c];
            _a[c] = _a[r];
            _a[r] = t;
        }
        return _a;
    },
    obj = Array.prototype, mName = "shuffle",
        m = function() { return _shuffle(this); };
    if (!(mName in obj)) {
        try {
            Object.defineProperty(obj, mName, {
                enumerable : false,
                value : m
            });
        }
        catch(e) {
            obj[mName] = m;
        }
    }
})();

Note the following:

  • Lines 4-6: Following other Array methods, raise a TypeError if passed undefined or null. And yes, that is == rather than === since I don't care to distinguish the two
  • Line 7: Our method can work generically on array-like objects (objects with a length and numeric property names) and it won't barf if passed a primitive value. You can pass non-Array objects to the method by invoking it as Array.prototype.shuffle.call(obj). Note, however, that a TypeError will be raised if the method is passed an immutable object, such as a String. That is also true of, say, reverse so it's not a problem
  • Line 15: Our method returns the mutated Object in the manner of similar methods such as sort and reverse
  • Line 17: Use the obvious name "shuffle". If we used something like "$shuffle" to ensure that we didn't conflict with a future implementation, we wouldn't automatically benefit from the native implementation. As it is, if "shuffle" ever becomes a standard method on Array, our extension simply becomes a shim
  • Line 19: Don't overwrite an existing method
  • Lines 21-24: This is how you ensure that the new method is not an enumerable property
  • Line 27: Fallback position for the off-chance that our code is running in Internet Explorer <= 8 or something similarly inadequate.
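For completeness, here is the extension in use; the array-like object is invented for illustration:

var cards = ["AS", "2S", "3S", "4S", "5S"];
cards.shuffle();    // Mutates cards and returns it

// Generic invocation on an array-like object
var alike = { length: 3, 0: "foo", 1: "bar", 2: "baz" };
Array.prototype.shuffle.call(alike);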

But Object.prototype is Forbidden, Right?

I said earlier that forbidding the use of a useful feature needed a good rationale. Well, here goes. Imagine that I wanted a method to count the number of properties of an Object and I implemented it like this:

Object.prototype.count = function() {
    var p, ret = 0;
    for (p in this) {
        if (this.hasOwnProperty(p)) {
            ++ret;
        }
    }
    return ret;
};

The problem is that Object is a first-class data type used to store key-value pairs which means that I can also do this:
var stat = {
    value: 1,
    count: 4   // Oops! count method is now shadowed
};

Rather than adding value, my count method has stolen a perfectly good property name. Even if we think we can do what we want by picking really obscure method names (breaking the "obvious" guideline in the process), Object is the base class of everything and the practice is simply hazard-prone. To illustrate the point, adding methods directly to Object.prototype breaks the "behaves like other methods" guideline as well, since most Object methods are static:
var a = 1;
...
if (Object.is(a, 1)) {
    // Not "if (a.is(1))"
}

So, yes, Object.prototype is forbidden.

Erm, What About "Host" Objects?

By "host" objects, we mean the objects that the browser makes available to your script that represent the visible and not-so-visible parts of your user interface: Document, Element, Node, Event and so on. These were the subject of an essay some years ago. Most of the issues considered are no longer really issues (applet elements, anyone?) and now this is a reasonable strategy:

if (!("Element" in window)) {
    // Don't try and do anything at all
}

With that in place, I can't see that the sky would fall if you did something like this:

Element.prototype.show = function(style) { // e.g. "inline-block"
    style = style || "";
    this.style.display = style;
};
Element.prototype.hide = function() {
    this.style.display = "none";
};

A possible problem might be one of scope. The behaviours that a String object doesn't have that you wish it did are distinctly finite: while you might want it to return a reversed copy of itself, you're probably not hankering for it to calculate a SHA256 hash of itself. The behaviours that we might want to add to objects that represent interface elements are not so limited. For example, we might want a method to set colours. Does our setColor method take two arguments and set the background colour as well? Or does it simply set the foreground colour, leaving the background colour to be specified by a call to setBackgroundColor? What about a method to set position and z-order? You'll quickly find yourself in danger of violating both the "useful" and "obvious" guidelines.

I'm a great believer in the Unix philosophy of doing one thing well. User-interface toolkits have a bad habit of trying to do flipping everything. My personal feeling is that extending interface objects is a bit like pulling a thread on your jumper: at some point you're going to wish you hadn't started. But if it works for you who am I to say you mustn't?

Conclusions

TL;DR: my views on extending built-in classes are:

  • Array, String, Number, Date: Yes; apply common sense
  • Object.prototype: Seriously, no
  • "Host" objects: not to my taste, but if it floats your boat...

Imperfect Forward Secrecy

Credit: Nick Carter

Perfect Forward Secrecy is a desirable characteristic of secure communications protocols (such as TLS, the protocol that puts the “s” in “https://…”) which means that even if I manage to steal your private key, I still can’t read the terabytes of encrypted traffic that I’ve managed to capture. The reason for this is that the private key used to assure your identity was not also used to secure the session keys used to encrypt the data sent back and forth. Instead, some method of key exchange was used which uses clever mathematics to allow you and your correspondent to independently calculate the same session key while I, despite eavesdropping on all your conversations, can’t. The method that I’m going to discuss is called Ephemeral Diffie-Hellman. If you’re not interested in the mathematics, you can skip this bit. Otherwise, the actors in the drama are an Activist (A), Wikileaks (W) and the CIA (C). A is submitting a bunch of top secret documents to W and C would quite like to be able to listen in.

W has a set of fixed Diffie-Hellman parameters (fixed because generating them from scratch takes ages) consisting of a generator, g (which is typically 2 or 5) and a modulus, p, which is a large prime number such that for any integer a that is not a multiple of p, there is another integer, k, with the property that:
g^k=a\bmod p
If you don’t understand the notation, “mod” stands for “modulo”, which you may have learnt at school as “clock arithmetic”: if the time is 22:00 hours and we wind the clock forward 8:00 hours, the clock reads 06:00 hours, not 30:00 hours. This is addition modulo 24.

W now picks a secret random number, x and sends A the following values: p, g and a calculated value w where:
w=g^x\bmod p
In the TLS protocol, W signs this particular message using its private key so that A can verify that the parameters are sound. A now picks a random number, y and sends W a calculated value a where:
a=g^y\bmod p

Now here’s the clever part. W derives a secret key, s:
s=a^x\bmod p
Meanwhile, A derives the same secret value:
s=w^y\bmod p
This works because, modulo p:
w^y\bmod p = (g^x)^y\bmod p = g^{xy}\bmod p = (g^y)^x\bmod p = a^x\bmod p

C, who has been privy to these exchanges, knows p, g, w and a but he can’t do a lot with them because, for example, to obtain W’s secret number x he would have to work backwards from w to get it. This problem is known as the discrete logarithm problem (you’re looking for an exponent and if this weren’t modular arithmetic, you’d simply take the logarithm of w to get the answer) and is fantastically hard to solve when p is a large number.
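If you would like to see the arithmetic in action, here is a toy sketch in JavaScript (using BigInt); the numbers are deliberately tiny for illustration, where real parameters run to thousands of bits:

// Square-and-multiply modular exponentiation
var modPow = function(base, exp, mod) {
    var result = 1n;
    base %= mod;
    while (exp > 0n) {
        if (exp & 1n) {
            result = result * base % mod;
        }
        base = base * base % mod;
        exp >>= 1n;
    }
    return result;
};

var p = 23n, g = 5n;        // W's fixed parameters
var x = 6n;                 // W's secret number
var w = modPow(g, x, p);    // W sends w = 8
var y = 15n;                // A's secret number
var a = modPow(g, y, p);    // A sends a = 19

modPow(a, x, p);            // W derives s = 2
modPow(w, y, p);            // A derives the same s = 2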

So much for the beautiful theory, the devil is in the implementation details! If you’re programming support for ephemeral Diffie-Hellman, you’re likely to be using openssl to do so and a combination of poor documentation and boneheaded behaviour makes it really easy to get this wrong.

The first thing you need to do is generate parameters. Fortunately, openssl makes this easy to do from the command line. For example, to generate a 2048-bit prime (the recommended minimum at the time of writing), you’d use:

$ openssl dhparam -out dh2048.pem 2048

If you do it like this, you need to load the parameters from the file at runtime using the PEM_read_DHparams function. I prefer not to do it this way, because using an external file gives me a bunch of possible error conditions that I need to handle and test. Instead, you can get openssl to write you a nice C-function (actually, it’s a bit ugly) that you can simply paste in your code:

$ openssl dhparam -noout -C 2048

If you go down this route, remember to refresh your parameters every so often. I use a Perl script that generates a header file with fresh parameters whenever I do a clean build.

However you get your parameters you have to tell openssl to use them. One way is to use a fixed-strength set of parameters for every connection:

#include <openssl/ssl.h>

SSL_CTX *ctx = SSL_CTX_new(SSLv23_server_method());
/* Load Diffie-Hellman parameters */
DH *dh2048 = ...
...
if (SSL_CTX_set_tmp_dh(ctx, dh2048) != 1) {
   /* Handle the error */
   ...
}

This is OK, but it’s not very agile. If you look at this table (the relevant numbers are the “L”-values in the FF column), you’ll see that a 2048-bit prime is a match for protecting a 3DES key, since both give you 112 bits of effective security. 112 bits means that you’d expect to perform around 2^112 (or 5×10^33) computations to break the cryptography. Since Sweet32, 3DES is right out. This means that, realistically, you’re using AES (or maybe Camellia if you really don’t want to sell software to the US government) with a minimum strength of 128 bits. The corresponding prime is now 3072 bits. It is quite possible when using TLS to negotiate a cipher suite that uses AES-256 as the data encryption algorithm. The matching prime is a whopping 15360 bits (good luck generating that!). Since using a larger prime than required means doing more work in your overburdened server code, you’d ideally want a way of matching the parameters to the ciphersuite. openssl appears to give you the means to do so via a callback function: SSL_CTX_set_tmp_dh_callback. However, if you use this, there are a couple of serious pitfalls, one that may be of your making and one that most definitely is not.

Since this will be code that you probably don’t revisit terribly often, your code may look something like this:

#include <openssl/ssl.h>
#include <openssl/dh.h>

DH *dh512 ...;
DH *dh1024...;
DH *dh2048...;

DH *DhCallback(SSL *ssl, int isExport, int keyLength)
{
    DH *ret;
    switch (keyLength) {
        case 512:
            /* Oops, we're vulnerable to Logjam */
            ret = dh512;
            break;

        case 1024:
            /* Not good enough since 2010 */
            ret = dh1024;
            break;

        case 2048:
        default:
            /* Just right but never used */
            ret = dh2048;
            break;
    }
    return ret;
}

SSL_CTX *ctx = SSL_CTX_new(SSLv23_server_method());
SSL_CTX_set_tmp_dh_callback(ctx, DhCallback);

If your code does look like this, then that’s a problem of your own making. You have two code paths that don’t do the right thing (and, indeed, one that does a very wrong thing indeed). However, there’s another problem. You might reasonably expect the keyLength parameter to be a “sensible” value for the ciphersuite, but it isn’t. It’s always 1024 (unless isExport is non-zero, but then you have other problems to worry about!). At least the most recent documentation tells you to ignore the parameter, but somewhat older documentation is quite misleading.

If you want the agility, you can base it on the strength of the private key since, referring back to the earlier table, the “FF” and “IF” (meaning RSA keys) columns line up. Your code might look something like this:

/* Use decently strong parameters */
DH *dh2048...;
DH *dh3072...;

DH *DhCallback(SSL *ssl, int isExport, int keyLength)
{
    DH *ret;

    EVP_PKEY *pkey = SSL_get_privatekey(ssl);
    int pBits = pkey ? EVP_PKEY_bits(pkey) : 0;

    if (pBits <= 2048) {
        ret = dh2048;
    }
    else {
        ret = dh3072;
    }

    return ret;
}

SSL_CTX *ctx = SSL_CTX_new(SSLv23_server_method());
SSL_CTX_set_tmp_dh_callback(ctx, DhCallback);

Note the test for a NULL private key. If it’s NULL, you’re using anonymous Diffie-Hellman which means you’re in something of a state of sin!

If you used the former method (SSL_CTX_set_tmp_dh), openssl used to generate a DH key immediately and then use it for every single client connection for the lifetime of your server process unless you explicitly told it not to be silly (the SSL_OP_SINGLE_DH_USE option). Fortunately, it doesn’t do that anymore.

So that’s how to use ephemeral Diffie-Hellman safely in your server-side code. However, there’s a clearly superior alternative, elliptic curve Diffie-Hellman (you can find a gentle introduction to elliptic curve cryptography or ECC here).

Uptake of ECC was hindered until quite recently by Red Hat’s stubborn insistence on stripping ECC functionality out of openssl. I’m not going to question Red Hat’s legal concerns (especially since they got really shirty about it) but their stance badly affected Linux developers while Microsoft, Google, Mozilla, Canonical et al. didn’t seem at all bothered.

There’s another issue too: FUD about the numerology behind it all. To all intents and purposes, the world uses two curves called P-256 and P-384 which provide 128 and 192 bits of effective strength respectively (look at the “EC” column in that table I referred to earlier). The reason that everybody uses these two curves is twofold: their use is required if you want to sell software to the US government and coming up with your own curves is beyond the grasp of regular mortals. The likes of Daniel J. Bernstein can but you and I almost certainly can’t! Those two curves were published by the perfectly benevolent National Institute of Standards and Technology (NIST). Unfortunately, they had assistance from the distinctly less benevolent National Security Agency (NSA) and anything the NSA touches is tainted by association. My take on the FUD is that P-256 and P-384 are part of “suite-B” (algorithms approved for securing up to Top Secret information). If the NSA had a magic back-door for P-384 (the Top Secret one), they would have to assume that the Chinese and Russians knew about it as well. Since presumably the NSA don’t want the Chinese and Russians reading Top Secret US communications, it’s a reasonable assumption that the NIST curves are currently secure from snooping.

Therefore, my parting advice is to ditch all that code supporting DHE ciphersuites and just use elliptic curve cryptography instead. In a novel move, openssl not only makes this easy but also provides a function that does – apparently – the right thing:

#include <openssl/ssl.h>

SSL_CTX *ctx = SSL_CTX_new(SSLv23_server_method());
SSL_CTX_set_ecdh_auto(ctx, 1);

What’s in a Number?

Credit: Jorge Franganillo

If we program in a low-level language like C, we have access to all sorts of numeric types of various sizes from unsigned char to long double. In JavaScript we have just the one: the IEEE 754 double precision floating point number (double for short). If we are to be limited to just the one number type, this is a pretty good one: it can represent whole numbers, fractional numbers, tiny numbers, enormous numbers and infinity.

A double is a 64-bit value arranged thus:

  • 1 sign bit
  • 11 bits of exponent
  • 53 bits of significand

If you’re sharp, you’ll notice that this sums up to 65 bits. This is possible because the numbers are normalized. As to what “normalized” means, consider the difference between 1.25×10^3 and 125×10^1. They both represent the same number but the first is normalized while the second is a bit weird. In base 10, the first digit of a normalized number is between 1 and 9 inclusive. In binary it’s always 1. Since it’s always 1, we can leave it out without losing information. Hence, we get 53 bits of precision despite only having 52 bits to play with. An 11-bit exponent gives a range of 0-2047. Since we need to be able to represent small numbers, the actual exponent is obtained by subtracting 1023 from this, giving a range of -1023 to 1024. As we’ll see, the maximum and minimum exponent values are reserved so the usable range is -1022 to 1023.

In C it’s easy to get at the internals of a double:

#include <stdint.h>
...
double pi = 3.141592653589793;
uint64_t *p = (uint64_t *) &pi;
/* Twiddle bits through *p */
...

In JavaScript, we can’t access the internals directly, but a little bit of trickery opens them right up. Let’s see whether this is possible. Tell me, computer, will you allow me to peek inside the internals of a double?

Assuming that the required runtime support is available, the following function will split a double into its constituent parts as human-readable strings:

window.showNum = function(n) {
    var buf = new ArrayBuffer(8),
        dv = new DataView(buf);
    dv.setFloat64(0, n, true);
    var m = dv.getInt32(0, true),
        i = 0, j = 1, p = "", ret = {};
    for (; i < 64; ++i, j <<= 1) {
        if (i === 32) {
            // Done with the first half
            j = 1;
            m = dv.getInt32(4, true);
        }
        else if (i === 52) {
            // Built up the significand
            ret.m = p;
            p = "";
        }
        else if (i === 63) {
            // Built up the exponent
            ret.e = p;
            p = "";
        }
        if (m & j) {
            p = "1" + p;
        }
        else {
            p = "0" + p;
        }
    }
    // Set the sign bit
    ret.s = p;

    // Calculate the represented value as a
    // base-2 exponential value
    var sig = 1, f = 0.5, e = -1023;
    for (i = 0; i < 52; ++i, f /= 2) {
        if ('1' === ret.m.charAt(i)) {
            sig += f;
        }
    }
    j = 1;
    for (i = 10; i >= 0; --i, j <<= 1) {
        if ('1' === ret.e.charAt(i)) {
            e += j;
        }
    }
    ret.t = sig + "e" + e;
    if ('1' === ret.s) {
        ret.t = '-' + ret.t;
    }

    return ret;
};

Here’s how it works. An ArrayBuffer is a low-level generic data container that behaves roughly like a byte array. We get at the internals of the buffer by overlaying either a typed array (such as Int32Array) or a DataView. I chose a DataView because then I don’t have to worry about endianness. The true arguments to setFloat64 and getInt32 mean “little-endian, please”. We stick the double into the buffer and then access it as two 32-bit integers. We may not be able to get at the bits in a double from JavaScript but we can do bitwise operations on an int. Had we used false as the endianness flag, we’d have to read out the ints in reverse order. Without a DataView, I would have to do some additional work to determine the endianness of your system. When we’re done calculating the bit strings, we build up a base-2 exponential representation that’s a bit more human-readable. The bits of the significand are all fractional with the first bit representing ½, the second bit ¼ and so on. The exponent is calculated working right-to-left along the bit string, remembering to take the bias of 1023 into account. The return property for the significand is “m” which stands for mantissa. Technically, this is wrong because the strict definition of “mantissa” is the fractional part of a logarithm. However, “s” is already used for the sign so “m” it is.

Let’s try it out. Tell me, computer, what does 1 look like internally?
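For reference, the result for 1 looks like this:

showNum(1);
// → { m: "0000...0000",   // 52 zero bits
//     e: "01111111111",   // 1023 biased, i.e. an actual exponent of 0
//     s: "0",
//     t: "1e0" }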

That’s not a very interesting number apart from the exponent pattern which shows the 1023 bias. Tell me, computer, what does -√2 look like internally?

Limits

Let’s have a look at the limits of what is representable. Tell me, computer, what does the largest absolute value that you can handle look like internally?

In base 10, this is 1.7976931348623157×10^308. What about the smallest?

That last one may be surprising if you recall that I said the minimum usable exponent was -1022. This value is denormalized and is, in fact, the smallest denormalized value. This may (or may not) be familiar to C++ programmers as std::numeric_limits<double>::denorm_min(). An exponent of -1023 signifies a subnormal number and the significand must be interpreted as 0.f rather than 1.f. Therefore, the significand of the smallest number represents 2^-51. When you add the exponents together, you get a value of 2^-1074 which is the absolute limit of representability that is not zero. Using bits from the significand as additional exponent bits allows for gradual underflow at the cost of precision – at the limit, there is only a single bit. In base 10, it translates (very approximately) to 5×10^-324.

The more familiar value to C/C++ programmers is DBL_MIN. Tell me, computer, what does DBL_MIN look like internally?

In base 10, this is 2.2250738585072014×10^-308.

Infinity and NaN

A value with all 1s in its exponent signifies an abnormal condition. The first condition is overflow. Tell me, computer, what does the product of 10^155 and 10^154 look like internally?

This value is commonly represented as Infinity. The sign bit can be set and this represents minus infinity. The characteristic of the two infinities is that all exponent bits are 1 and all significand bits are 0.

The second condition is where the numeric representation of a value is undefined. Tell me, computer, what does 0/0 look like internally?

This value is commonly represented as NaN (aka, not a number). The characteristic of NaN is that all exponent bits are set along with 1 or more significand bits.

Zero

If the significand in a double is to be interpreted as 1.f, there is no possible representation of zero since there is no y such that 1.f×2^y = 0. The solution is to reserve a pattern to represent 0. From the representation of the smallest denormalized number above, you’d be correct in guessing that this pattern is all exponent bits and all significand bits set to 0. However, the sign bit can be set or unset which means that there are two zero values: 0 and -0. -0 will be the result, for example, of truncating a value between 0 and -1.

Epsilon

Although we can represent minuscule numbers by reducing the exponent, there is a hard limit to the fractional values that can be represented by adjusting the significand. Since a double has 52 fraction bits, this value may trivially (and precisely) be calculated as Math.pow(2, -52). This value is known as machine epsilon and is denoted by 𝜀. It is the smallest value for which the following inequality is true:
1 + \epsilon > 1
Assuming that your browser supports it, the machine epsilon value is in Number.EPSILON.
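A quick sanity check in the console:

Math.pow(2, -52) === Number.EPSILON;  // true
1 + Number.EPSILON > 1;               // true
1 + Number.EPSILON / 2 === 1;         // true: the sum rounds back to exactly 1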

Largest integer values

The width of the significand also determines the limits on the maximum size of a whole number that can be represented without loss of precision. The largest values have all significand bits set to 1 and the exponent set to 10000110011b (1075, or 52 once the 1023 bias is subtracted). In other words, ±2^53 − 1. Note the qualifier “without loss of precision”. Larger integer values can be represented but not uniquely. To illustrate:

2^53 - 2 → 9007199254740990
2^53 - 1 → 9007199254740991
2^53     → 9007199254740992
2^53 + 1 → 9007199254740992

Assuming browser support, the largest integers that can be represented in a double are given by the MAX_SAFE_INTEGER and MIN_SAFE_INTEGER properties of Number.

Playing

Assuming you saw the answer “Of course” to the first question I asked the computer, you can type numbers into the following input and the component parts will be displayed in the box below. Large and small numbers can be entered using ‘e’ notation, for example, 1.23e-50. You can also type properties of Math and Number, such as PI, E, SQRT2, MAX_SAFE_INTEGER, EPSILON, etc.


Putting the Reporting Into Perl

Credit: Adrian

One of many backronyms for Perl is “Practical Extraction and Reporting Language”. The ability to create reports complete with headers, pagination and justified output is a feature built into the language yet I’m going to hazard a guess that few Perl programmers have ever used it. While you can go a long way with sprintf, producing justified output, in particular, is non-trivial. If you need it, you may as well let Perl do the grunt work for you.

Declaring and using formats

Formats are declared with the format keyword and formatted output is produced using the write function (note: not the same function as the system call; that’s syswrite). A format declaration takes the following form:

format NAME = 
@field @field ...           "picture" line
$variable, $variable ...    argument line
# Comment line
.

There can be any number of picture and argument lines. Each pair defines the pattern for a line of output. The @field is the ‘@’ character followed by one or more ‘<’, ‘>’, ‘|’ or ‘#’ characters. Details (and examples) are:

  • @<<<<<
    Six characters, left justified
  • @>>>>
    Five characters, right justified
  • @|||
    Four characters, centre justified
  • @###.##
    Decimal number with four digits in front of the decimal point and two after.

Note that the ‘@’ character is included in the character count.

The second line is a list of variable names (or constants) that hold the field values. You may see examples that show these space-separated like the field declarations. Follow this example if you like deprecation warnings; otherwise, comma-separate them. The format declaration is terminated by a lone ‘.’ at the start of the line. Here’s a simple script that will print its input centre-justified:

#!/usr/bin/perl -w
use strict;
format STDOUT =
@|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
$_
.
write while (<>);

Note the name “STDOUT”; this is the default name for a format and may be omitted. To define a page header create a format with the name <NAME>_TOP, for example, “STDOUT_TOP”.

Full program

We’re going to create a program that will create a formatted report of the errno definitions for your system. Each line in the output will show the numeric value, right justified, the constant declaration and the error description.

errno definitions

These are found in /usr/include/errno.h:

$ cat /usr/include/errno.h
...License text
#include <sys/errno.h>

Seems like it’s not going to be that easy. sys/errno.h includes a bunch of other files, each of which includes other files and so on. We could do this the hard way or we could note that there is already a program that can handle files which include other files which include yet other files: the C preprocessor. If you are using gcc, the invocation that shows preprocessor output is gcc -E. If you are using the Microsoft compiler this will be cl /E:

$ echo '#include <errno.h>' | gcc -E -
...Some output
# 1 "/usr/include/errno.h" 1 3 4
# 23 "/usr/include/errno.h" 3 4
# 1 "/usr/include/sys/errno.h" 1 3 4
# 72 "/usr/include/sys/errno.h" 3 4
...More output

The interesting pattern is:

# <LINENO> "<FILENAME>" 1 ...

This is emitted when the preprocessor opens a file. Inside each include file, we’re looking for two types of line:

#define EAGAIN  35 /* Blah blah */
...
#define EWOULDBLOCK EAGAIN  /* Blah blah */

We now have enough information to write the bit of our program that gathers data:

#!/usr/bin/perl
# Produces formatted report of errno definitions

use strict;
use warnings qw/all/;

# errno.h is typically a whole bunch of other 
# includes; rather than trying to process
# it ourselves, let's get the C-preprocessor 
# to tell us where the includes are
my $child;
open($child, "-|", 
    "echo '#include <errno.h>' | gcc -E -") 
    // die("Can't exec gcc: $!");
my @includes;
while(<$child>) {
    if (/^# \d* "(\/[^"]*errno[^"]*)" 1/) {
        # For example:
        # 1 "/usr/include/errno.h" 1 3 4
        # 1 after the filename means "new file"
        push(@includes, $1);
    } 
}
close($child);
my @errno;
my %no = ();
my ($fh, $line);

# As we grind over the error definitions, let's 
# keep track of the maximum field length;
# this is for the purpose of constructing 
# formats later
my ($maxN, $maxC, $maxD) = (0, 0, 0);
my ($const, $val, $desc);

for my $inc (@includes) {
    open($fh, "<", $inc) or 
        die("Unable to read $inc: $!");
    while ($line = <$fh>) {
        if ($line =~ /^#define\s*([A-Z]*)\s*(\d*)\s*\/\*\s*([^\*]*)\*/) {
            # For example:
            # define EPERM 1 /* ... */
            ($const, $val, $desc) = 
                ($1, $2 + 0, $3);
            # Lose whitespace from the end 
            # of the description field
            $desc =~ s/\s*$//;
            $no{$const} = $val;
            push(@errno, { 
                const => $const, 
                errno => $val, 
                desc => $desc 
            });
        }
        elsif ($line =~ /^#define\s*([A-Z]*)\s*([A-Z]*)\s*\/\*\s*([^\*]*)\*/) {
            # For example:
            #define EWOULDBLOCK EAGAIN  /* ... */
            ($const, $val, $desc) = ($1, $2, $3);
            if (exists($no{$val})) {
                # Resolve the errno
                $val = $no{$val};
                $desc =~ s/\s*$//;
                push(@errno, { 
                    const => $const, 
                    errno => $val, 
                    desc => $desc 
                });
            }
        }
        else {
            next;
        }
        if (length($val) > $maxN) {
            $maxN = length($val);
        }
        if (length($const) > $maxC) {
            $maxC = length($const);
        }
        if (length($desc) > $maxD) {
            $maxD = length($desc);
        }
    }
    close($fh);
}
@errno = sort { 
    $a->{errno} <=> $b->{errno} 
} @errno;

Apologies for those regular expressions that pick out the #define statements. The principal cause of line noise is that we’re looking for C comments, /* ... */, which means four backslash escapes.

Creating the format definitions

Although we know our field widths, we face a problem: the picture lines in a format declaration require you to know the field widths ahead of time. Fortunately, we can get round this by building up the format declaration programmatically and then eval‘ing it:

# Build up formats; # field is right aligned
my $fmt = '@' . ('>' x ($maxN - 1)) . 
    "  @" . ('<' x ($maxC - 1)) . 
    "  @" . ('<' x ($maxD - 1)) .
    "\n\$val, \$const, \$desc\n.";
    
my $fmtHdr = '@' . ('>' x ($maxN - 1)) . 
    "  @" . ('<' x ($maxC - 1)) . 
    "  @" . ('<' x ($maxD - 1)) .
    "\n'#', '#define', 'Description'\n" . 
    ('-' x ($maxN + $maxC + $maxD + 4)) . 
    "\n.";

eval("format STDOUT =\n$fmt\n");
eval("format STDOUT_TOP =\n$fmtHdr\n");

What we’re eval‘ing looks something like this:

format STDOUT =
@>>  @<<<<<<<<<<<<<<  @<<<<<<<<<<<<<<<<<<<<<<<
$val, $const, $desc
.

The header format that we build up is similar with the addition of an underline.

Producing output

Perl paginates report output, adding a page break and a header after every sixty lines of output unless told otherwise. For our purposes, this is not desirable behaviour. We want a header but not paginated output. We can get round this by changing the variable that controls the number of lines per page, $=:

$= = 2 + scalar(@errno);

We need to add 2 because the header is two lines. Now it’s just a matter of looping over our error information:

for my $errInf (@errno) {
    $val = $errInf->{errno};
    $const = $errInf->{const};
    $desc = $errInf->{desc};
    write();
}

Let’s try it out:

$ ./errno2fmt
  #  #define    Description
-------------------------------------------
  1  EPERM      Operation not permitted
  2  ENOENT     No such file or directory
...
106  ELAST      Must be equal largest errno

The output will be different depending on your operating system. You can download the complete script here.

Finding out more

This post has barely scratched the surface of Perl’s reporting capabilities. For the full lowdown, check out the perldoc.

Naive Implementations

Credit: The Guardian

With ES6, the JavaScript math library gained a level in badass with the addition of seventeen new functions. While I was polishing my UsefulJS library for publication I decided I’d spend a couple of fun evenings implementing shims for them since, if I want to calculate the hypotenuse, I do not want to have to do this:

var x = 3, y = 4,
    r = ("hypot" in Math) ? Math.hypot(x, y) :
        Math.sqrt(x*x + y*y);

The C++ documentation for hypot has an interesting statement in the notes:

POSIX specifies that underflow may only occur when both arguments are subnormal and the correct result is also subnormal (this forbids naive implementations).

By naive implementations, they mean something like this:

(function() {
    if (!("hypot" in Math)) {
        Math.hypot = function(x, y) {
            return Math.sqrt(x*x + y*y);
        };
    }
})();

As it happens, this wouldn’t be up to standard anyway since Math.hypot takes any number of arguments but that’s a detail irrelevant to this discussion. Since this is the formula for Pythagoras’s theorem, what’s naive about it? Tell me, computer, what is the hypotenuse when the two shorter sides are 10^155 and 1?

That’s why it’s naive: even though the correct result (which is 10^155) can be represented, there’s an overflow in the intermediate calculations that causes the wrong value to be returned. Moler and Morrison published a non-naive algorithm for calculating the hypotenuse in 1983. Not only does it avoid underflow and overflow in intermediate products, it also avoids taking a square root (which was presumably an attractive quality back then). Let’s give it a whirl:

var mm = function(x, y) {
    // Use abs values
    if (x < 0) {
        x = -x;
    }
    if (y < 0) {
        y = -y;
    }
    var tmp, r, s;
    // Ensure that x >= y
    if (x < y) {
        tmp = x;
        x = y;
        y = tmp;
    }
    do {
        tmp = y / x;
        r = tmp * tmp;
        s = r / (4 + r);
        x = x + 2 * s * x;
        y *= s;
    }
    // While y is significant
    while (1 + y > 1);
    return x;
};

Tell me, computer, what, according to the Moler-Morrison algorithm, is the hypotenuse when the two shorter sides are 3 and 4?

Damn and blast! Everyone knows that’s 5! Can we do better? Let’s rearrange the formula a little starting from the familiar form:
r = \sqrt{x^2+y^2}
The sum inside the square root can be expressed as:
x^2(1 + (y/x)^2)
Why should we care? Because we can take x2 outside the square root to give:
r = |x|\sqrt{1 + (y/x)^2}
As long as we ensure that x >= y, (y/x)^2 will not overflow. It might underflow, but then the result is simply |x| which to all intents and purposes is correct. Let’s give it a try:

var hypot = function(x, y) {
    var tmp, r;
    if (x < 0) {
        x = -x;
    }
    if (y < 0) {
        y = -y;
    }
    if (x < y) {
        tmp = x;
        x = y;
        y = tmp;
    }
    if (!x) {
        // hypot(0, 0)
        return 0;
    }
    r = y / x;
    return x * Math.sqrt(1 + r * r);
};

The only gotcha here is to avoid the division 0/0 which would result in NaN, but otherwise it’s straightforward. Now tell me, computer, what is the hypotenuse when the two shorter sides are:

  • 3 and 4
  • 0 and 0
  • 10^155 and 10^155

Works like a charm! You may well ask whether there's any point in being able to calculate the hypotenuse of extremely large numbers. I can't see myself ever needing to do it yet I went to the trouble of making it possible. The reason is simple: pride. I do not want someone looking at my code and thinking "n00b!".

Several of the new Math functions, such as expm1 and log1p, can give the wrong result if they're not carefully implemented. Tell me, computer, does your Math library contain naive implementations?

If you're seeing "Dunno", your browser doesn't support ES6 Math functions. "Looks like it" means that the area hyperbolic sine (asinh) of 10^-16 was given as 0. To machine precision, this should be 10^-16. In fairness to the engineers, the deficiency probably lies in the underlying C libraries. "Not as far as I can tell" means that this particular Math function at least has been carefully implemented. The difference between 0 and 10^-16 may seem trivial but it's a 100% loss of accuracy, an example of what is known as catastrophic cancellation.
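You can see the same effect with log1p, assuming your browser provides it:

// Naive implementation: loses everything for tiny x
var naiveLog1p = function(x) { return Math.log(1 + x); };

naiveLog1p(1e-16);   // 0, because 1 + 1e-16 rounds to exactly 1
Math.log1p(1e-16);   // 1e-16, correct to machine precision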

As to how naive or not my own math code is, see for yourself! (Math code starts around line 1250).

I Know Where You Live

Credit: Hope Abrams

With a reasonable chance of success, I can obtain your location to a fair degree of accuracy. Let’s give it a go.

Obtaining your location...

Hopefully, you’ll see a table of scarily-accurate information rather than an error message. If the info is miles out (or even thousands of miles out) then you’re probably reading this at work.

IP addresses are not given out willy-nilly. They are assigned hierarchically: you are assigned an address by your ISP who buy blocks of addresses from a national registrar who has been given larger blocks of addresses by the big regional registrars (for example, ARIN in North America and RIPE NCC in Europe and most of Asia). These assignments are kept in a distributed database called the whois database. If you have a Mac OS X or Linux system, you can query the database yourself:

$ whois <IPADDRESS>

This gives you most of the information you see above. Other information may be obtained by various forms of data mining; for example, when you get a weather forecast for your home town, some organization now has a fair idea of which town is your home town.

Question: is the ability to locate you by your IP address a good thing, a bad thing or just a thing? I’d say the latter. It’s a necessary side-effect of how the Internet works. Data has to be routed from one host on the Internet to another. The problem of how to do this is only tractable because of the hierarchical block assignment of IP addresses. Connecting to a local exchange ties you down further to a fairly narrow radius. On the minus side, disreputable sites may make false promises of women in your area. On the plus side, an attempt to buy goods using your Amazon account from an IP address located in Russia will throw up a red flag.

Technical Details

I’m using a geolocation API provided by ip-api.com (yes, this is an endorsement – their service is excellent and their fair use terms are very generous). They provide a JSONP interface into their database. If you’re not familiar with JSONP, it’s a way to access third-party webservices, taking advantage of the fact that a JSON-encoded object is valid JavaScript and that src attributes in script tags, unlike AJAX requests, are not subject to a same-origin policy. Typically, a JSONP URL will take a callback parameter in its query string and its output will be the encoded JSON object passed as a parameter to your callback function. Specifically, my code looks like this:

<script>
window.geo = function(obj) {
    ...
};
</script>
<script src="http://ip-api.com/json/?callback=window.geo" defer></script>

Callback function details are as follows:

window.geo = function(obj) {
    var gEle = document.getElementById("geoinfo");
    gEle.innerHTML = "";
    // Map obj properties to human readable values
    var pM = ["IP address", "query",
        "Hostname", "reverse",
        "Country", "country", "Region", "regionName",
        "City", "city", "Postcode", "zip",
        "Latitude", "lat", "Longitude", "lon",
        "Organization", "org", "Error", "msg"],
        i, tbl = document.createElement("table"),
        tr, rowIdx = 0, td, cellIdx, n, prop;
    for (i = 0; i < pM.length; i += 2) {
        n = pM[i];
        prop = pM[i + 1];
        if (!obj[prop]) {
            continue;
        }
        tr = tbl.insertRow(rowIdx++);
        cellIdx = 0;
        td = tr.insertCell(cellIdx++);
        td.appendChild(document.createTextNode(n));
        td = tr.insertCell(cellIdx++);
        td.appendChild(
            document.createTextNode(obj[prop]));
    }
    gEle.appendChild(tbl);
    if ("success" === obj.status) {
        var mapCtr = document.getElementById("map");
        mapCtr.style.height = "250px";
        var ll = {lat: obj.lat, lng: obj.lon},
        map = new google.maps.Map(mapCtr, {
            center: ll,
            zoom: 13
        }),
        marker = new google.maps.Marker({
            position: ll,
            map: map
        });
    }
};

The latitude and longitude are fed into Google Maps (whose API is a whole lot easier to use than it is to sign up for) to visualize the data.

Obtaining location information from the browser

Depending on whether your browser supports the geolocation API, whether you click the “Get Location” button, whether you give permission for your location to be used and where you are, you should see accurate information below. This information is obtained from a variety of sources. If your device has a GPS chip, that will be used. Otherwise, something like Google’s map of wi-fi networks may be used to obtain your location by triangulation. If you’re not using wi-fi, then the accuracy will be less impressive.

The API is very easy to use. You basically define a success callback and an error callback and then pass them to navigator.geolocation.getCurrentPosition, like this:

gEle = document.getElementById("geoinfo2");
var pre = document.createElement("pre");
gEle.appendChild(pre);
if (!("geolocation" in navigator)) {
    pre.innerHTML = "Geolocation is unavailable";
    return;
}

var showGeo = function() {
    var locSuccess = function(pos) {
        var lat = pos.coords.latitude,
            lon = pos.coords.longitude,
            tbl, tr, td, rowIdx, cellIdx, mapCtr;
        gEle.innerHTML = "";
        tbl = document.createElement("table");
        rowIdx = 0;
        tr = tbl.insertRow(rowIdx++);
        cellIdx = 0;
        td = tr.insertCell(cellIdx++);
        td.appendChild(
            document.createTextNode("Latitude"));
        td = tr.insertCell(cellIdx++);
        td.appendChild(document.createTextNode(lat));
        tr = tbl.insertRow(rowIdx++);
        cellIdx = 0;
        td = tr.insertCell(cellIdx++);
        td.appendChild(
            document.createTextNode("Longitude"));
        td = tr.insertCell(cellIdx++);
        td.appendChild(document.createTextNode(lon));
        gEle.appendChild(tbl);
        mapCtr = document.getElementById("map2");
        mapCtr.style.height = "250px";
        showLocation(lat, lon, mapCtr);
    },
    locError = function(e) {
        pre.innerHTML = "";
        pre.appendChild(
            document.createTextNode(
            "Unable to obtain location information: " +
            e.message));
    };
    navigator.geolocation.getCurrentPosition(
        locSuccess, locError);
};
var btn = document.createElement("button");
btn.appendChild(document.createTextNode("Get Location"));
btn.onclick = showGeo;
pre.appendChild(btn);

The geolocation API can also give you periodic updates. See the MDN documentation for further information.
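
Periodic updates come from navigator.geolocation.watchPosition, which takes the same success and error callbacks as getCurrentPosition and returns an id that can later be passed to clearWatch. A minimal sketch, reusing callbacks like the locSuccess and locError defined above:

var watchId = navigator.geolocation.watchPosition(
    locSuccess, locError,
    // The third (options) argument is optional
    { enableHighAccuracy: true });
// ...and when we have had enough updates:
navigator.geolocation.clearWatch(watchId);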



Hexadecimal English

Credit: Ninfaj on Flickr

3735928559 is a familiar number among programmers. “Really?” you may be asking. Perhaps you’re more familiar with its hexadecimal representation, 0xDEADBEEF. It’s the sort of number you use to initialize a block of memory so that you’ll recognize it in a debugger or somesuch. This made me wonder what other English words can be formed from hexadecimal digits:

$ grep '^[a-f]*$' /usr/share/dict/words

Meh. I quite like effaced and defaced but there’s nothing with eight characters (8 hex digits => 32-bit int). Let’s extend it a bit. We’ll allow the following letter-to-digit substitutions:

i → 1
s → 5
o → 0

i, s and o are not valid hex digits so we need to convert them to numbers. Also, everything will look more gratifying in uppercase. We can do this with standard command-line tools, no custom scripting required:

$ grep '^[a-fiso]*$' /usr/share/dict/words | sed 's/s/5/g;s/i/1/g;s/o/0/g' | tr 'a-z' 'A-Z'

That’s better. From now on, when I allocate a block of memory, I’ll initialize it to 0xBADA55ED and before freeing it, I’ll overwrite it with 0xDECEA5ED. My code has just become cooler.

Rounding Numbers: How Hard Can It Be?

Rules for rounding half up
Credit: tes.com

How hard? Hard enough, apparently, that your computer gets it badly wrong. Tell me, computer, what is 1.015 rounded to two decimal places?

If you see “1.01”, you are using a standards-compliant browser. An eleven-year-old child knows that “1.01” is wrong, and when a kid can see that something’s wrong, it’s just not good enough! So how hard can it be when you put your mind to it? Let’s see.
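
For anyone who can’t run the live demo, here is the standards-compliant behaviour at a console, together with the reason for it:

(1.015).toFixed(2);        // "1.01"
// The nearest IEEE-754 double to 1.015 is actually
// a little less than 1.015:
(1.015).toPrecision(20);   // "1.0149999999999999023"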

Algorithms

Before implementing a precise rounding algorithm, we need to figure out what we mean by rounding. For most cases, it’s simple: you just round towards the nearest digit. The tricky case is what to do with those pesky halves. Turns out there’s more than one way to handle them.

Round half towards positive infinity

This is the method we learn at school: 1.5 rounds up to 2. But what do we do about negative numbers? If we apply the round-half-up rule consistently we have to round them up too so that -1.5 rounds up to -1.

Round half away from zero

This is what we get if we apply the round-half-up rule symmetrically: positive numbers are rounded up to become more positive, while negative numbers are rounded down and become more negative.

Round half to even

The problem with the round-half-up algorithms is that they introduce a bias: if you round a set of random numbers and take the average, it will not match the average of the originals. This matters a lot in some applications – if you’re a major bank, those half pennies mount up! The solution to this bias is to round down as often as you round up. With round half to even (aka banker’s rounding), halves are rounded so that the last remaining digit is even: 1.5 and 2.5 both round to 2, while 0.5 and -0.5 both round to 0.
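
To make the differences concrete, here is where a few exact halves land under each rule:

Value    Half to +infinity    Half away from 0    Half to even
 2.5             3                    3                 2
 1.5             2                    2                 2
 0.5             1                    1                 0
-0.5             0                   -1                 0
-1.5            -1                   -2                -2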

Other algorithms

There are other algorithms which are (almost) never used. Round half down and round half to zero simply make your results look consistently wrong. Round half to odd has the disadvantage that ±0.5 never round to 0. Alternate rounding requires that we keep track of the last direction, while stochastic rounding (randomly up or down) requires a reliable source of randomness which, as any fule kno, is hard!

Internet Explorer

If you see “1.02” above, you are probably using Internet Explorer. Presumably, some Microsoft engineer looked at the output from the specification and thought “that’s not good enough” as well.

Implementation

When reading the code snippets in this section, assume the following module wrapper:

(function(window) {
    "use strict";
    ...
})(this);

Which algorithm to use? We’ll give the programmer a choice:

var _ralgs = {
    HALF_UP : 0,
    AWAY_FROM_0 : 1,
    TO_EVEN : 2
};

We can note that a naive implementation – scale up, round, scale back down – isn’t going to cut the mustard. The reason for this is the finite precision of your computer hardware. Tell me, computer, what do you get when you multiply 1.015 by 100?
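
At any console you care to try:

1.015 * 100;                      // 101.49999999999999
Math.round(1.015 * 100) / 100;    // 1.01 - not the 1.02 we were after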

We can also note that rounding to n decimal places is not an arithmetic operation, it is a presentation operation. Therefore, our rounding algorithm is going to work on the string representation of the number, a bit like humans do. Like all Objects, numbers in JavaScript have a toString operation. The only question is, is it fit for purpose? Tell me, computer, what is the stringified form of 0.0000001?
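
The answer, sadly:

(0.0000001).toString();   // "1e-7"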

Tarnation! Admittedly, I don’t have to work with very small numbers very often, but an algorithm that didn’t work in the general case would bother the heck out of me! Looks as though we’re going to have to start by implementing toString ourselves, and before we can do that, we have to write a function that will zero-pad a string. This really should be a standard method on String, but it’s not hard to do:

// Padding function
var _pad = function(s, padTo, padWith, right) {
    s = s + '';
    var pad = "",
        toPad = Math.max(padTo, s.length) - s.length;
    while (toPad) {
        if (toPad & 1) {
            pad += padWith;
        }
        toPad >>>= 1;
        if (toPad) {
            padWith += padWith;
        }
    }
    return right ? s + pad : pad + s;
};
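
For example:

_pad("42", 5, '0');        // "00042"
_pad("1.", 4, '0', true);  // "1.00" - padded on the right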

Don’t be too alarmed reading that while loop. String concatenation is relatively expensive (because strings are immutable) so adding one character at a time is sub-optimal. Instead, we treat the pad length as a binary number: padWith doubles on each pass and is appended only when the corresponding bit of toPad is set, so instead of n concatenations we do a number proportional to log₂ n. Now we can re-implement toString:

// Stringifies numbers, paying attention to very large and very small quantities
var _toString = function(n) {
    var parts = n.toExponential().split("e"),
        mag = parseInt(parts[1], 10),
        _n = parts[0];
    parts = _n.split(".");
    var iPart = parts[0],
        dPart = parts[1] || "",
        sig, ret;
    if (n < 0) {
        iPart = iPart.substring(1, iPart.length);
        sig = '-';
    }
    if (mag > 0) {
        ret = iPart + dPart;
        if (ret.length < mag + 1) {
            // Pad to the right with zeroes
            ret = _pad(ret, mag + 1, '0', true);
        }
        else {
            // Need to insert the decimal place at the
            // right point
            iPart = ret.substring(0, 1 + mag);
            dPart = ret.substring(1 + mag, ret.length);
            ret = iPart;
            if (dPart) {
                ret += '.' + dPart;
            }
        }
    }
    else if (mag < 0) {
        mag = -mag;
        // 0.0 followed by mag-1 x '0' followed
        // by iPart + dPart
        ret = _pad("0.", mag + 1, '0', true) +
            iPart + dPart;
    }
    else {
        return _n;
    }
    if (sig) {
        ret = sig + ret;
    }
    return ret;
};

toExponential gives us a consistent string representation of the number, which we can then split to obtain the significand and exponent. Then it’s just a matter of zero-padding the significand accordingly and sticking the decimal point in the right place. Taking a minus sign into account when padding is bothersome, so we remove it and restore it afterwards.
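
A quick check that our version copes where the native one let us down:

(1e21).toString();      // "1e+21"
_toString(1e21);        // "1000000000000000000000"
_toString(0.0000001);   // "0.0000001"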

The next bit we have to implement is the function where we “increment” a string, propagating any carry left:

// The numeric value of the '0' character
var _zero = '0'.charCodeAt(0),
// For suppressing -0
_likeZero = /^0(?:\.0*)?$/,
// Numerically increments a string
_incStr = function(s, pos) {
    if (pos >= 0) {
        var p1 = s.substring(0, pos),
            p2 = s.substring(pos, pos + 1);
        if ('.' === p2.charAt(0)) {
            // Skip past '.'
            return _incStr(p1, pos - 1) + '.';
        }
        var d = s.charCodeAt(pos) - _zero + 1;
        if (d >= 10) {
            d -= 10;
            // Carry 1
            p1 = _incStr(p1, pos - 1);
        }
        p2 = String.fromCharCode(d + _zero);
        return p1 + p2;
    }
    // We've incremented off the left edge of the string
    // so do the final carry
    return '1' + s;
};
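
For example, carries propagate across the decimal point and off the left edge:

_incStr("1.9", 2);   // "2.0"
_incStr("99", 1);    // "100"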

That’s the boilerplate done. Now we can implement our rounding functions. All take a number and the number of desired fractional digits as parameters. First up, half-up:

// Implementation of round half to plus infinity (exact)
var _round2PlusInfinity = function(n, fd) {
    var s = _toString(Math.abs(n)),
        pos = s.indexOf('.');
    // _toString gives x or x.yyy not .yyy
    if (pos >= 1) {
        var nd = s.length, rpos = pos + fd + 1;
        if (rpos < nd) {
            var d = s.charCodeAt(rpos) - _zero,
                // It is only an exact half if no non-zero
                // digit follows the 5
                half = (d === 5) &&
                    !/[1-9]/.test(s.substring(rpos + 1)),
                up = d >= 5;
            if (half && n < 0) {
                // An exact half rounds towards +infinity,
                // which for a negative number means
                // truncating its absolute value
                up = false;
            }
            if (up) {
                // Round up
                s = _incStr(s.substring(0, rpos),
                    rpos - 1);
            }
            else {
                // Simply truncate
                s = s.substring(0, rpos);
            }
        }
    }
    if (n < 0 && !_likeZero.test(s)) {
        s = '-' + s;
    }
    return s;
};
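
A few spot checks:

_round2PlusInfinity(1.25, 1);     // "1.3"
_round2PlusInfinity(-1.25, 1);    // "-1.2" - the tie goes towards +infinity
_round2PlusInfinity(-0.004, 2);   // "0.00"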

Here we see what that mystery _likeZero regex was for. For negative numbers, we get rid of a bothersome minus sign by working with the abs value. If a value rounds to 0 and we stick the minus sign back on the front, we’ll end up with “-0” which just looks weird. Round half towards even is similar with a little bit of added complexity in deciding whether we round up or down:

// Implementation of round half to even (exact)
var _roundTowardsEven = function(n, fd) {
    var neg = false;
    if (n < 0) {
        // Round half to even is symmetric
        neg = true;
        n = -n;
    }
    var s = _toString(n);
    var pos = s.indexOf('.');
    if (pos >= 1) {
        var nd = s.length, rpos = pos + fd + 1;
        if (rpos < nd) {
            var d = s.charCodeAt(rpos) - _zero,
                up = d > 5;
            if (d === 5) {
                if (/[1-9]/.test(s.substring(rpos + 1))) {
                    // More than an exact half:
                    // always round up
                    up = true;
                }
                else {
                    // An exact half: check the previous
                    // digit, skipping the decimal point
                    var c, _rp = rpos, _d;
                    do {
                        c = s.charAt(--_rp);
                    }
                    while (c === '.');
                    _d = s.charCodeAt(_rp) - _zero + 1;
                    // Up if the previous digit will
                    // be even
                    up = (0 === _d % 2);
                }
            }
            if (up) {
                // Round up
                s = _incStr(s.substring(0, rpos),
                    rpos - 1);
            }
            else {
                // Simply truncate
                s = s.substring(0, rpos);
            }
        }
    }
    if (neg && !_likeZero.test(s)) {
        s = '-' + s;
    }
    return s;
};

Implementation of the final algorithm is a doddle:

// Implementation of round half away from zero
var _roundAwayFromZero = function(n, fd) {
    if (n >= 0) {
        return _round2PlusInfinity(n, fd);
    }
    var ret = _round2PlusInfinity(-n, fd);
    if (!_likeZero.test(ret)) {
        ret = '-' + ret;
    }
    return ret;
};

Now we can join them together:

// Rounding function with algorithm specified
var _round = function(n, fd, alg) {
    fd >>>= 0;
    switch (alg) {
        case _ralgs.HALF_UP:
            return _round2PlusInfinity(n, fd);

        case _ralgs.AWAY_FROM_0:
            return _roundAwayFromZero(n, fd);

        default:
            return _roundTowardsEven(n, fd);
    }
};

There’s one last bit of boilerplate before implementing our public API. One thing I dislike about toFixed is that you get trailing zeroes whether you want them or not – you asked for six decimal places? Here you go! Our API will, by default, suppress useless zeroes, but the caller can specify a minimum number as an optional argument. We need a function to make that possible:

// Applies minimum / maximum fractional digits
var _trimFraction = function(s, minFd, maxFd) {
    var splitPoint = s.indexOf('.'), i = s, j, f = "";
    if (splitPoint === -1 && minFd === 0) {
        return s;
    }
    if (null == minFd) {
        minFd = 0;
    }
    if (null == maxFd) {
        maxFd = 2;
    }
    if (splitPoint >= 0) {
        f = s.substring(splitPoint + 1, s.length);
        i = s.substring(0, splitPoint);
    }
    if (minFd) {
        // Add trailing zeroes
        for (j = f.length; j < minFd; j++) {
            f += '0';
        }
    }
    else {
        // Trim excess zeroes
        f = f.substring(
            minFd, maxFd).replace(/0+$/, "");
    }
    if (f) {
        return i + '.' + f;
    }
    return i;
};
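
For example:

_trimFraction("1.500", 0, 2);   // "1.5"
_trimFraction("1.5", 3, 3);     // "1.500"
_trimFraction("2.", 0, 2);      // "2"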

Our public API consists of some properties and a method added to Number.prototype:

// Algorithm specification
window.Rounding = {
    HALF_TO_PLUS_INFINITY : _ralgs.HALF_UP,
    HALF_AWAY_FROM_ZERO : _ralgs.AWAY_FROM_0,
    HALF_TO_EVEN : _ralgs.TO_EVEN,
    // Default rounding method is
    // HALF_TO_EVEN
    algorithm : _ralgs.TO_EVEN
};

// Add a roundTo method to Number
var rt = function(maxFd, minFd) {
    if (!isFinite(this)) {
        return this + '';
    }
    // Require numbers >= 0
    maxFd = Number(maxFd);
    minFd = Number(minFd);
    if (!(isFinite(maxFd) && maxFd >= 0)) {
        maxFd = 0;
    }
    if (!(isFinite(minFd) && minFd >= 0)) {
        minFd = 0;
    }
    if (minFd > maxFd) {
        minFd = maxFd;
    }

    return _trimFraction(
        _round(this, maxFd, window.Rounding.algorithm), minFd, maxFd);
}, mName = "roundTo", obj = window.Number.prototype;
if (!(mName in obj)) {
    try {
        Object.defineProperty(obj, mName, {
             value : rt,
             enumerable : false
        });
    }
    catch(e) {
        // Do it the old fashioned way
        obj[mName] = rt;
    }
}

Because of the way we’ve implemented our rounding function, we don’t have to worry that someone might set Rounding.algorithm to “Dave” – an unrecognized value simply falls through to the default case in _round – so we don’t need a getter / setter for the property. Some Internet experts expend a lot of hot air on the evils of extending built-in types, but Number is absolutely the right place to add our method: we can now use it in place of toFixed, and the ability to extend objects is a central feature of JavaScript anyway. Do note, however, how we go about adding our method – this is the polite way to do it; stomping over someone else’s roundTo method is not!

Let’s try it out. Tell me, computer, what is the result of rounding 1.015 to two decimal places?
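
For anyone reading without the live demo, the call and its result look like this:

(1.015).roundTo(2);   // "1.02"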

Hurrah! Remember that it’s generic enough to handle very small numbers? Tell me, computer, what is 1.25×10⁻²⁶ to 27 decimal places?
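
Working it through by hand:

(1.25e-26).roundTo(27);
// "0.000000000000000000000000012"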

Note, this is not wrong – it is the round half to even algorithm. If we set the algorithm to round half to plus infinity and ask again, we get a slightly different answer:
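
Again worked by hand:

Rounding.algorithm = Rounding.HALF_TO_PLUS_INFINITY;
(1.25e-26).roundTo(27);
// "0.000000000000000000000000013"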

If you’d rather not switch to a different rounding function, there’s no reason why we can’t simply fix toFixed! Here’s a patch:

(function() {
    if ((0.1).toFixed(20) ===
        "0.10000000000000000000") {
        // Not broken
        return;
    }
    var m = function(fd) {
        return this.roundTo(fd, fd);
    }, obj = Number.prototype, mName = "toFixed";
    try {
        Object.defineProperty(obj, mName, {
            enumerable : false,
            value : m
        });
    }
    catch(e) {
        obj[mName] = m;
    }
})();

And the answer to the original question, how hard can it be? Harder than you think!

You can download the complete library or, if you prefer, a minified version.