Language Showdown

‘Pollice Verso’ by Jean-Léon Gérôme

Programming languages are the medium through which we turn our intentions into actions. Ideally, the choice of language to use should be a neutral decision. In particular, the language itself shouldn’t get in the way: easy things should be easy, obvious things should be obvious and it shouldn’t be too idiosyncratic. I am going to write a simple program in five different programming languages and comment on how easy and pleasant or otherwise it is to work with them.

The five languages chosen are, in alphabetical order: C++, Java, Node.js, Perl and Python. The program is a simplified version of the head utility. head prints the first few lines of its input or one or more files passed on the command line. Our utility will have some restrictions compared to the real version: it will only process a single file and will not have the ability to print all but the first few lines. The problem is what I’d call bread-and-butter programming: reading files, processing them line-by-line and producing output. I don’t think that this is a problem designed to favour one particular language over another – this is the sort of activity that any language should be suited for. At the end, I will also compare performance of the different solutions, just for fun.

I will point out that if I were to write a real implementation of head, I wouldn’t read input one line at a time. I’d read input in conveniently sized chunks, search for newline characters and do whatever needed to be done. However, in the more general case, we want to treat a file as a sequence of lines rather than a bucket of bytes so this is what I’ll do in the different programs.
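
For comparison, the chunk-based approach described above might look something like the following in Python. This is only a rough sketch and is not one of the five programs under test: the function name head_chunked and the 64 KiB chunk size are my own illustrative choices, and it assumes Python 3 and a stream opened in binary mode.

```python
def head_chunked(stream, out, lines=10, chunk_size=65536):
    """Copy the first `lines` lines of a binary stream to `out`.

    Reads fixed-size chunks and counts newline bytes instead of
    reading one line at a time. `chunk_size` is an arbitrary choice.
    """
    remaining = lines
    while remaining > 0:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of input before we found enough lines
        pos = 0
        while remaining > 0:
            nl = chunk.find(b"\n", pos)
            if nl == -1:
                pos = len(chunk)  # no more newlines: emit the whole chunk
                break
            pos = nl + 1  # keep everything up to and including the newline
            remaining -= 1
        out.write(chunk[:pos])
```

It could be called as head_chunked(open(path, "rb"), sys.stdout.buffer). Because the bytes are copied through untouched, a final line without a newline is emitted exactly as it appears in the file.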

The Basics

The default behaviour of head is to print the first ten lines of the file passed on the command line so this is the behaviour that we’ll implement as a first cut. File errors will be reported on stderr and will cause an abnormal program exit. I have plans for when no arguments are passed on the command line so this condition will result in a normal program exit.

C++

#include <iostream>
#include <fstream>
#include <string>
#include <cstring>
using namespace std;

const int lines = 10;

int main(int argc, char **argv) {
    if (argc < 2) {
        return 0;
    }
    ios_base::sync_with_stdio(false);
    ifstream file(argv[1]);
    if (!file.is_open()) {
        cerr << "Unable to open " << argv[1] <<
            ": " << strerror(errno) << "\n";
        return 1;
    }
    string line;
    for (int i = 0; i < lines && getline(file, line); ++i) {
        cout << line;
        if (!file.eof()) {
            cout << "\n";
        }
    }
    return 0;
}

C++ is, by a small margin, the second most verbose of the solutions (585 bytes vs. 580 bytes for the Java version). The first few lines are fixed overhead, however, so I expect this to change. Note the following:

  • Line 5: I wouldn’t do this in a complex project but in a simple program, not having to qualify everything as std::cout, std::ifstream and so on is a hassle-saver
  • Line 13: Unless you explicitly disable the feature, C++ allows you to freely mix calls to the C++ native I/O objects, cin and cout, and legacy C stdio functions, scanf and printf. To make this possible, a lot of work is done under the hood to keep different stream pointers synchronized, which kills performance unless you turn this questionable feature off. I’ll show the performance benefit of this call at the end, but for the moment, -1 for requiring obscure knowledge and for having what is arguably a non-sensible default
  • Lines 14-19: On the other hand, C++ does make easy things easy – opening a file and checking for an error condition is straightforward. We are also able to access the operating system error and we have full control of how the error is reported
  • Line 21: We use the free getline function declared in the <string> header, which reads into a std::string. ifstream objects have a getline method as well but this reads into a fixed-size character buffer and so does not handle arbitrarily long lines
  • Lines 23-25: The action of getline is to read up to a newline, and then advance the read pointer past it. This means that we have to take care not to report a newline character that does not exist if we read to the end of the file. The real head utility doesn’t and I would consider doing so a bug
  • Line 24: "\n" rather than endl since endl guarantees a flush of the underlying output handle which might hurt performance. Otherwise, it doesn’t have, say, a portability benefit over a newline character which will be translated to the line terminator pattern of the system automatically.
  • Line 27: Although it’s not obvious, C++ file objects close the underlying handle in their destructors when they go out of scope. This means that we don’t have to explicitly call the close method. This pattern is known as RAII (Resource Acquisition Is Initialization) and is a nice feature of well-written C++ classes.

Java

import java.io.*;

class Head {
    private static final int lines = 10;

    public static void main(String[] args) {
        if (0 == args.length) {
            System.exit(0);
        }
        try {
            BufferedReader reader =
                new BufferedReader(new FileReader(args[0]));
            String line = null;
            for (int i = 0; (line = reader.readLine()) != null
                && i < lines; ++i) {
                System.out.println(line);
            }
        }
        catch(Exception e) {
            System.err.println(e);
            System.exit(1);
        }
    }
}

Although the Java solution has fewer lines than the C++ one, this is down to lower fixed overhead (one import statement as opposed to four include statements) since otherwise it is more verbose and long-winded. Note the following:

  • The compiled binary cannot be run directly from the command line. Instead, I have to invoke it like this:
    $ java Head /path/to/file
  • Line 4: <qualifier> static final; that’s a very long-winded way to say const
  • Line 7: Java is the only language of my acquaintance that doesn’t treat 0 or null as false in an if statement. Hence, the explicit test
  • Line 12: Java is the only language that does unbuffered file I/O by default which means that we need to wrap an object that reads files inside an object that does buffering. Wrapping a thing inside another thing to do useful work is an unpleasant aspect of programming in Java
  • Line 16: Never mind the long-winded way of saying “print”, we have a bug that we cannot easily fix in the Java version. readLine behaves like the C++ getline in that it discards the newline and moves the file pointer. However, there is no way, as far as I can tell, to detect EOF other than readLine returning null by which time we’ve already emitted an erroneous newline. We could get around this by using read and finding the newline characters ourselves, but then Java would fail the “easy things should be easy” requirement utterly. -1 for no way to detect EOF condition other than attempting to read from the handle again
  • Line 20: Error comes out as something like “java.io.FileNotFoundException: doesnotex.ist (No such file or directory)” which is not very nice. Calling getMessage does not improve things greatly. -1 for limited control over error reporting
  • Line 23: Although it’s not obvious, objects that hold filehandles do not close them automatically, at least not in a deterministic manner. When the object goes out of scope, it will be marked for garbage collection. When the object is garbage collected, the file will be closed but that might not happen for a long time. As it happens, exiting the program closes all filehandles anyway.

Exiting the program means that filehandles are closed, but if we had to close files explicitly, Java would force a horrible pattern on us:

BufferedReader reader = null;
try {
    reader = new BufferedReader(...);
    // Do stuff and then close when we're done
    reader.close();
    reader = null;
}
catch (IOException e) {
    // Handle error
}
finally {
    // Clean up if necessary
    if (reader != null) {
        try {
            reader.close();
        }
        catch (IOException e) {
            // close threw an error; what exactly am
            // I supposed to do with it?
        }
    }
}

When I regularly programmed in Java, close throwing a checked exception used to drive me nuts!

Node.js

#!/usr/bin/env node

if (process.argv.length < 3) {
    process.exit();
}

var fs = require('fs'),
    readline = require('readline'),
    strm = fs.createReadStream(process.argv[2]),
    lines = 10,
    eof = false;

strm.on("error", function(e) {
    console.error(e.message);
    process.exit(1);
});
strm.on("end", function() {
    eof = true;
});

var rd = readline.createInterface({
    input: strm,
    terminal: false
});

rd.on("line", function(line) {
    if (!eof) {
        line += "\n";
    }
    process.stdout.write(line);
    if (--lines === 0) {
        rd.close();
    }
});

rd.on("close", function() {
    process.exit();
});

Node.js isn’t a programming language in itself, but rather a JavaScript engine taken out of the browser context and run as a script interpreter. This was by some measure the most difficult program to write which is reflected in both the greatest number of lines and the largest file size. There were also a couple of surprises and when it comes to software, surprising is never good. The twisted program structure is down to the way that Node.js works. It seems that file operations are done in a separate execution thread so that the main program thread is not blocked. Note the following:

  • Line 3: First surprise. Conventionally, arguments to your script or program start at offset 0 (for example, Java and Perl) or offset 1 (for example, C/C++ and Python). For a script invoked by Node.js, “node” is the first argument followed by the script name so that the arguments for your script start at offset 2
  • Line 13: This is rather a convoluted way of getting at a file error. -1 for being neither easy nor obvious
  • Line 14: The error comes out as something like “ENOENT: no such file or directory, open 'doesnotex.ist'” which is detailed but not very attractive. -1 for no control over the output
  • Line 26: Again, strange but this seems to be the Node.js way. line does not include the newline
  • Lines 27-29: This is how a non-existent newline is suppressed. The “end” event on the stream object fires before the last “line” event on the reader so that the eof flag gets set before we receive the last line
  • Line 30: Long-winded way of saying “print”
  • Line 32: As far as I can tell, this statement doesn’t actually close anything, since the “line” events keep on coming in. What it does do is cause the “close” event to be fired. This is extremely surprising behaviour and not terribly useful so -1
  • Line 37: Need to exit the program to stop reading the file.

Perl

#!/usr/bin/perl

use strict;
use warnings qw/all/;

my $lines = 10;
exit 0 unless (@ARGV);
open(my $fh, "<", $ARGV[0]) or die("Unable to open $ARGV[0]: $!\n");
while (defined(my $line = <$fh>) && $lines--) {
    print($line);
}

I must confess to having a dog in this particular fight since I enjoy programming in Perl. This sample should show why since it is the shortest and simplest solution by a long way, even taking into account a couple of lines of fixed overhead. Note the following:

  • Line 7: This is idiomatic Perl. If you don’t like the “unless” idiom, you can rewrite this as if (!@ARGV)
  • Line 8: open ... or die is also idiomatic. We can access the system error via the special variable $! and have full control over the error output
  • Line 9: Perl has a dedicated operator, <>, to read lines of text from a filehandle. The line returned includes any newline character. Note that there’s a subtle bug that could bite us if the input ended in “0” without a newline. This would evaluate to false which means that we need the defined function to avoid this being a problem
  • Line 10: Now that’s how to name a print function! The Perl print function doesn’t do anything funny to its input like adding a newline
  • Line 11: It’s not obvious, but filehandles in Perl are closed when the last reference to them goes out of scope. This means that we do not need to explicitly call close.

Python

#!/usr/bin/python
import sys

if len(sys.argv) < 2:
    sys.exit(0)
lines = 10
try:
    with open(sys.argv[1]) as f:
        for line in f:
            sys.stdout.write(line)
            lines -= 1
            if not lines:
                break
except IOError as e:
    sys.stderr.write("Unable to read %s: %s\n" % (sys.argv[1], e.strerror))
    sys.exit(1)

I’m not such a fan of programming in Python, although it’s perfectly pleasant to do so. The syntax is very different from that of the other four since C++, Java, JavaScript and Perl are all C-like languages while Python is somewhat idiosyncratic. Note the following:

  • Line 7: Python forces a try...except structure, much like Java’s try...catch, since I/O operations throw rather than return error values. If you don’t catch exceptions, the error output is very ugly
  • Line 8: Python objects are garbage collected which means that their destruction is non-deterministic. However, Python also offers an RAII pattern: the with keyword opens a “controlled execution” block which scopes the file object f. This ensures that the file is closed without having to explicitly close it
  • Line 9: File objects are iterable which makes one-line-at-a-time processing extremely easy. The lines include a trailing newline character when present
  • Line 10: Python does have a print function but that adds a newline character to the output which is behaviour that we don’t want
  • Line 15: We have full control of error output and can access the actual system error conveniently.
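
The two points above about iteration and the with block can be checked with a small experiment. This is a sketch of my own rather than part of the head program: it writes a throwaway temporary file whose last line has no newline, reads it back, and records whether the file object was closed once the block ended.

```python
import os
import tempfile

# Write a small file whose last line has no trailing newline.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as tmp:
    tmp.write("one\ntwo\nthree")

# Iterating over the file yields each line with its terminator intact.
with open(path) as f:
    collected = [line for line in f]

# The with block has ended, so the file object is already closed.
file_closed = f.closed
os.remove(path)
```

Here collected holds "one\n", "two\n" and "three", with no newline invented for the last line, and file_closed is True even though close was never called explicitly.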

Part 2: Command-line Options

Well-behaved programs can have their behaviour changed by options read from the command line. Options can be short form (-?) or long form (--help). When there are more than a couple of options, parsing them by hand becomes tedious so we’ll want to use a library routine to do so when available. The options we’ll accept are:

  • --help / -?: Print a usage message and exit
  • --count / -n <number>: Print the first <number> lines of the file instead of 10

In addition, the real head utility can take, for example, -2 as an option in place of -n 2 or --count 2 so we’ll do the same.

C++

#include <iostream>
#include <fstream>
#include <string>
#include <cstring>
#include <regex>
#include <getopt.h>
using namespace std;

const int deflines = 10;

void usage(const char *name, const char *msg) {
    if (msg) {
        cerr << msg << "\n";
    }
    cerr << "Usage:\n  " << name << " [--count|-n <lines>] [FILE]\n";
}

int main(int argc, char **argv) {
    int lines = deflines, opt;
    const char *err = NULL;
    bool needHelp = false;
    string countOpt;
    regex numeric("\\d+");
    struct option lOpts[] = {
        { "help", no_argument, NULL, '?' },
        { "count", required_argument, NULL, 'n' },
        { NULL, 0, NULL, 0 }
    };
    if (1 < argc && argv[1][0] == '-' && regex_match(&argv[1][1], numeric)) {
        countOpt = &argv[1][1];
        argv[1] = argv[0];
        ++argv;
        --argc;
    }
    while ((opt = getopt_long(argc, argv, "n:?", lOpts, NULL)) != -1) {
        switch (opt) {
            case 'n':
                if (regex_match(optarg, numeric)) {
                    countOpt = optarg;
                }
                else {
                    err = "Bad count argument";
                    needHelp = true;
                }
                break;
            case '?':
            default:
                needHelp = true;
                break;
        }
    }
    if (needHelp) {
        usage(argv[0], err);
        return 1;
    }
    if (argc == optind) {
        return 0;
    }
    if (!countOpt.empty()) {
        lines = stoi(countOpt);
    }

    ios_base::sync_with_stdio(false);
    ifstream file(argv[optind]);
    if (!file.is_open()) {
        cerr << "Unable to open " << argv[optind] <<
            ": " << strerror(errno) << "\n";
        return 1;
    }
    string line;
    for (int i = 0; i < lines && getline(file, line); ++i) {
        cout << line;
        if (!file.eof()) {
            cout << "\n";
        }
    }
    return 0;
}

Ouch! Our program has just grown 3x bigger! getopt is the standard way of parsing command-line options on Unix-like systems. The basic getopt function is specified by POSIX, while getopt_long, which adds long options, is a GNU extension provided by glibc and by several other Unix C libraries. Neither is part of the ISO C or C++ standard library, so Windows developers will have a harder time: Windows has a different option syntax (for example, /n rather than -n) and there is no standard library routine that I know of for parsing options. We could also have used argp which provides additional bells and whistles (such as generating a usage message for you), but the overhead is higher and the learning curve somewhat steeper. Having paid the high cost of entry for option parsing, the cost of adding additional options is low – typically one variable, one entry in the options array and one case statement. Note the following:

  • Lines 11-16: Using getopt requires us to write a usage function. If our program grew to take many options, the cost of using a more complex argument parser like popt or boost::program_options that takes care of generating a help screen may be worthwhile
  • Line 23: C++ 11 gives us native regular expressions. Hurrah! This regex is used to test for numeric options
  • Lines 24-28: Program options; note the “NULL terminator” at the end
  • Line 29: Check for an option that looks like “-number”. Note that getopt will choke on this so we need to remove it from the arguments array
  • Lines 30-33: Pointers for the win! While this is the sort of stuff that programmers unfamiliar with C/C++ find maddening, it is very much idiomatic and natural once you’re familiar with the basics. The result is that we splice the non-standard argument from the front of argv
  • Line 35: getopt_long returns -1 when it has processed all the options that it knows how to. The index of the first unprocessed option is available via optind
  • Lines 56-58: I still have plans for when no file argument is supplied, so exit normally if this is the case
  • Line 60: Override lines. stoi throws if fed bad input but we’ve already checked that input is good, so we don’t need to catch the exception
  • Line 64: Our file argument is at optind, not 1.

Java

import java.io.*;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

class Head {
    private static final int deflines = 10;

    private static void usage(String msg) {
        if (!(msg == null || msg.isEmpty())) {
            System.err.println(msg);
        }
        System.err.println("Usage:\n  java Head [--count|-n <lines>] [FILE]");
        System.exit(1);
    }

    public static void main(String[] args) {
        int optIdx = 0, lines = deflines;
        boolean needHelp = false;
        String err = null;
        Pattern customOption = Pattern.compile("^-(\\d+)$"),
            numericOption = Pattern.compile("^\\d+$");

        if (args.length > 0 && args[0].startsWith("-")) {
            Matcher m = customOption.matcher(args[0]);
            if (m.find()) {
                lines = Integer.parseInt(m.group(1));
                ++optIdx;
            }
        }
        for (; optIdx < args.length; ++optIdx) {
            if (args[optIdx].equals("--help") || args[optIdx].equals("-?")) {
                needHelp = true;
            }
            else if (args[optIdx].equals("--count") ||
                args[optIdx].equals("-n")) {
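                // Guard against -n/--count being the last argument
                String count = optIdx + 1 < args.length ? args[optIdx + 1] : "";
                Matcher m = numericOption.matcher(count);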
                if (m.find()) {
                    lines = Integer.parseInt(args[optIdx + 1]);
                    ++optIdx;
                }
                else {
                    err = "Bad count argument";
                    needHelp = true;
                    break;
                }
            }
            else {
                break;
            }
        }
        if (needHelp) {
            usage(err);
        }

        if (optIdx == args.length) {
            System.exit(0);
        }
        try {
            BufferedReader reader =
                new BufferedReader(new FileReader(args[optIdx]));
            String line = null;
            for (int i = 0; (line = reader.readLine()) != null
                && i < lines; ++i) {
                System.out.println(line);
            }
        }
        catch(Exception e) {
            System.err.println(e);
            System.exit(1);
        }
    }
}

Java doesn’t come out of this round terribly well. Java lacks a native command-line option parse routine. Given that it has a library to parse X.509 certificates and given that I have to work with certificates far less often than I have to handle command-line options, one wonders why. There are several dozen option parsing libraries on GitHub that claim to be GNU getopt compatible. A popular choice seems to be args4j but then Java throws another obstacle in our way. We have an option that a standard options parser will choke on. In all the other languages under consideration we can make this a non-issue by removing the offending element from the arguments array. In Java, arrays have a fixed length, so an element cannot be deleted in place because, clearly, reasons. We could get round this by copying the array members that we want to keep to a container object then turning that back into an array or we could say that this is too much like pointless busywork for a recreational programming project, parse the options the hard way and mark Java down accordingly. Therefore, -2 for failing the “obvious should be obvious” and “the language itself shouldn’t get in the way” criteria. Even without the fixed overhead of using an options parser, the Java solution has overtaken the C++ one in terms of typing required if not in terms of line count. Note also the following:

  • Lines 20-21: Regular expressions are somewhat painful to use in Java but by now I’m not surprised
  • Lines 30-35: Custom option parsing code. Unlike the C++ solution, this will not scale at all well
  • Line 31: Who in their right minds wouldn’t want ‘==’ to do the obvious thing here? The Java language designers could have made testing a String object against a string literal do the right thing yet they very deliberately chose not to. Instead, the ugly x.equals(y) pattern is required. Interestingly, an ‘==’ test compiles; it just compares object references rather than string contents, so it doesn’t work. -1 again for failing “the language itself shouldn’t get in the way” criterion.

Node.js

#!/usr/bin/env node

var lines = 10, i, optIdx = 2, needHelp = false, err = null,
    args = process.argv;
if (args.length > optIdx && /^\-\d+$/.test(args[optIdx])) {
    lines = parseInt(args[optIdx].substr(1));
    args.splice(optIdx, 1);
}

for (; optIdx < args.length; ++optIdx) {
    if (args[optIdx] == "--help" || args[optIdx] == "-?") {
        needHelp = true;
    }
    else if (args[optIdx] == "--count" || args[optIdx] == "-n") {
        if (/^\d+$/.test(args[optIdx + 1])) {
            lines = parseInt(args[optIdx + 1]);
            ++optIdx;
        }
        else {
            err = "Bad count argument";
            needHelp = true;
            break;
        }
    }
    else {
        break;
    }
}

if (needHelp) {
    if (err) {
        console.error(err);
    }
    console.error("Usage:\n  " + args[1] + " [--count|-n <lines>] [FILE]");
    process.exit(1);
}

if (optIdx === args.length) {
    process.exit();
}

var fs = require('fs'),
    readline = require('readline'),
    strm = fs.createReadStream(args[optIdx]),
    eof = false;

strm.on("error", function(e) {
    console.error(e.message);
    process.exit(1);
});
strm.on("end", function() {
    eof = true;
});

var rd = readline.createInterface({
    input: strm,
    terminal: false
});

rd.on("line", function(line) {
    if (!eof) {
        line += "\n";
    }
    process.stdout.write(line);
    if (--lines === 0) {
        rd.close();
    }
});

rd.on("close", function() {
    process.exit();
});

Node.js also lacks a standard option parser, so -1. I tried out yargs which parses options OK but doesn’t appear to allow you to specify “-n” as a shortened alias for “--count”. Handling short options after yargs did its thing took exactly as many lines as handling the options myself. There isn’t a lot of point in pulling in a third party library if it’s not buying you anything. The JavaScript code for handling options is a direct port of the Java code. It is a lot more readable by virtue of, for example, regular expressions being first class data types and by the JavaScript syntax itself being less obtrusive.

Perl

#!/usr/bin/perl

=head1 SYNOPSIS

headpl [--count|-n <lines>] [FILE]

=cut

use strict;
use warnings qw/all/;
use Getopt::Long;
use Pod::Usage;

my ($needHelp, $err, $lines) = (0, "", 10);

if (@ARGV && $ARGV[0] =~ /^-(\d+)$/) {
    $lines = $1;
    shift(@ARGV);
}
GetOptions(
    "help|?" => \$needHelp,
    "count|n=s" => \$lines,
);
unless ($lines =~ /^\d+$/) {
    $err = "Bad count argument";
    $needHelp = 1;
}

pod2usage(
    exitval => 1,
    message => $err
) if ($needHelp);

exit 0 unless (@ARGV);
open(my $fh, "<", $ARGV[0]) or die("Unable to open $ARGV[0]: $!\n");
while (defined(my $line = <$fh>) && $lines--) {
    print($line);
}

The Perl solution remains admirably compact despite being formatted for readability. As you can see, Perl’s reputation as a “write-only” language is undeserved. Readability or otherwise of Perl code is entirely down to the author. Note the following:

  • Lines 3-7: This is POD (plain old documentation). The documentation serves double duty as the usage message which is very Perlish
  • Lines 20-23: The getopt implementation in Perl is very elegant and has a low cost of entry. The function perturbs the ARGV array so that what is left over represents non-option arguments. The arguments to GetOptions are pattern-reference pairs. We could replace the “big-arrow” (=>) operators with “,”, although the former is more idiomatic
  • Line 22: We could have specified this option as “count|n=i” and then GetOptions would discard any non-numeric value with a warning, leaving $lines unmodified. However, the other solutions error on a non-numeric argument and since we need to reject a negative value anyway, I have chosen to check the value myself
  • Line 29: pod2usage makes usage messages easy and your messages are as good as your documentation.

Python

#!/usr/bin/python
import sys
import re
import argparse

def usage():
    return '''Usage:
  headpy [--count|-n <lines>] [FILE]
'''

lines = 10
if len(sys.argv) > 1:
    customOption = re.compile(r"-(\d+)$")
    m = customOption.match(sys.argv[1])
    if m:
        lines = int(m.group(1))
        del sys.argv[1]

parser = argparse.ArgumentParser(
    usage = usage(), add_help = False
)
parser.add_argument("-?", "--help", required = False, action="store_true")
parser.add_argument("-n", "--count", required = False)
parser.add_argument("argv", nargs = argparse.REMAINDER)
args = parser.parse_args()
err = None
needHelp = args.help
if args.count:
    numericOption = re.compile(r"^\d+$")
    m = numericOption.match(args.count)
    if m:
        lines = int(args.count)
    else:
        err = "Bad count argument"
        needHelp = True
if needHelp:
    parser.error(err)

if not args.argv:
    sys.exit(0)
try:
    with open(args.argv[0]) as f:
        for line in f:
            sys.stdout.write(line)
            lines -= 1
            if not lines:
                break
except IOError as e:
    sys.stderr.write("Unable to read %s: %s\n" % (args.argv[0], e.strerror))
    sys.exit(1)

Python has a number of facilities for parsing options, including a getopt implementation. The module recommended in the pydoc is argparse. I wouldn’t say that using argparse is easy but it’s usable. Note the following:

  • Line 16: Unlike Perl and JavaScript, Python requires an explicit cast to an integer. This halfway-house to strong typing makes Python a less friendly scripting language
  • Lines 18-20: This is the code that constructs the parser. argparse has the ability to generate a help screen in response to -h/--help but I didn’t like the appearance of the help. These overrides, along with the usage function make the usage output look similar to the output of the other four solutions
  • Line 22: If we added “type = int” to the argument list, argparse would coerce the count value to an integer. However, as with the Perl version, we also want to reject negative values so I’m choosing to parse the value myself
  • Line 23: This is a remarkably non-obvious way to get the option parser to not barf over non-option arguments, so -1. As coded, non-option arguments will appear in an array property of the parsed arguments called argv
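
As an aside to the note on line 22, argparse also accepts an arbitrary callable as the type of an argument, which would be one way to get both the conversion and the rejection of negative values in a single step. The following is a minimal sketch of that alternative and not the approach used in the program above; the helper name non_negative_int is hypothetical and the sample argument list is made up for illustration.

```python
import argparse
import re


def non_negative_int(text):
    """Hypothetical argparse `type=` callable: accept ASCII digits only."""
    if not re.match(r"^[0-9]+$", text):
        # argparse reports this message along with the usage text
        raise argparse.ArgumentTypeError("Bad count argument")
    return int(text)


parser = argparse.ArgumentParser(add_help=False)
parser.add_argument("-n", "--count", type=non_negative_int, default=10)
parser.add_argument("argv", nargs=argparse.REMAINDER)

# Parse a sample command line rather than the real sys.argv.
args = parser.parse_args(["-n", "3", "somefile"])
```

With this sample input, args.count is the integer 3 and args.argv is a list containing "somefile". The trade-off is that argparse then controls the wording and layout of the error output, which is the thing I was trying to avoid.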

Part 3: Reading from stdin

A well-behaved program that takes its input from a file should also be able to read from stdin. This allows it to form part of a pipeline where the output of another program forms our program’s input. For example, we might want to take the output of the sort utility to show the top ten results. Conventionally, no file argument or an argument specified as “-” indicates that we should read from stdin.

C++

#include <iostream>
#include <fstream>
#include <string>
#include <cstring>
#include <regex>
#include <getopt.h>
using namespace std;

const int deflines = 10;

void usage(const char *name, const char *msg) {
    if (msg) {
        cerr << msg << "\n";
    }
    cerr << "Usage:\n  " << name << " [--count|-n <lines>] [FILE]\n";
}

void printstream(istream &in, int lines) {
    string line;
    for (int i = 0; i < lines && getline(in, line); ++i) {
        cout << line;
        if (!in.eof()) {
            cout << "\n";
        }
    }
}

int main(int argc, char **argv) {
    int lines = deflines, opt;
    const char *err = NULL;
    bool needHelp = false;
    string countOpt;
    regex numeric("\\d+");
    struct option lOpts[] = {
        { "help", no_argument, NULL, '?' },
        { "count", required_argument, NULL, 'n' },
        { NULL, 0, NULL, 0 }
    };
    if (1 < argc && argv[1][0] == '-' && regex_match(&argv[1][1], numeric)) {
        countOpt = &argv[1][1];
        argv[1] = argv[0];
        ++argv;
        --argc;
    }
    while ((opt = getopt_long(argc, argv, "n:?", lOpts, NULL)) != -1) {
        switch (opt) {
            case 'n':
                if (regex_match(optarg, numeric)) {
                    countOpt = optarg;
                }
                else {
                    err = "Bad count argument";
                    needHelp = true;
                }
                break;
            case '?':
            default:
                needHelp = true;
                break;
        }
    }
    if (needHelp) {
        usage(argv[0], err);
        return 1;
    }
    if (!countOpt.empty()) {
        lines = stoi(countOpt);
    }

    ios_base::sync_with_stdio(false);

    string fName;
    if (argc > optind) {
        fName = argv[optind];
    }
    if (fName.empty() || "-" == fName) {
        printstream(cin, lines);
    }
    else {
        ifstream file(fName);
        if (!file.is_open()) {
            cerr << "Unable to open " << fName <<
                ": " << strerror(errno) << "\n";
            return 1;
        }
        printstream(file, lines);
    }
    return 0;
}

Nothing difficult here. There’s a slight hindrance in that stream objects cannot be assigned. For example, this wouldn’t work:

istream p;
...
p = cin;

We could use pointers:

istream *p = NULL;
...
p = &cin;

However, I chose to refactor, moving the code that does the actual work into a function that takes an istream reference. It is then a simple matter of calling it with the correct input stream object.

Java

import java.io.*;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

class Head {
    private static final int deflines = 10;

    private static void usage(String msg) {
        if (!(msg == null || msg.isEmpty())) {
            System.err.println(msg);
        }
        System.err.println("Usage:\n  java Head [--count|-n <lines>] [FILE]");
        System.exit(1);
    }

    public static void main(String[] args) {
        int optIdx = 0, lines = deflines;
        boolean needHelp = false;
        String err = null;
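        // Anchor both patterns so that values such as "-5abc" or "12abc"
        // are rejected rather than partially matched
        Pattern customOption = Pattern.compile("^-(\\d+)$"),
            numericOption = Pattern.compile("^\\d+$");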

        if (args.length > 0 && args[0].startsWith("-")) {
            Matcher m = customOption.matcher(args[0]);
            if (m.find()) {
                lines = Integer.parseInt(m.group(1));
                ++optIdx;
            }
        }
        for (; optIdx < args.length; ++optIdx) {
            if (args[optIdx].equals("--help") ||
                args[optIdx].equals("-?")) {
                needHelp = true;
            }
            else if (args[optIdx].equals("--count") ||
                args[optIdx].equals("-n")) {
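                // Guard against the option being the last argument and
                // require the whole value to be numeric
                if (optIdx + 1 < args.length &&
                    numericOption.matcher(args[optIdx + 1]).matches()) {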
                    lines = Integer.parseInt(args[optIdx + 1]);
                    ++optIdx;
                }
                else {
                    err = "Bad count argument";
                    needHelp = true;
                    break;
                }
            }
            else {
                break;
            }
        }
        if (needHelp) {
            usage(err);
        }

        String fName = "";
        if (optIdx < args.length) {
            fName = args[optIdx];
        }
        try {
            Reader r = null;
            if (fName.isEmpty() || fName.equals("-")) {
                r = new InputStreamReader(System.in);
            }
            else {
                r = new FileReader(fName);
            }
            BufferedReader br = new BufferedReader(r);
            String line = null;
            for (int i = 0; (line = br.readLine()) != null &&
                i < lines; ++i) {
                System.out.println(line);
            }
        }
        catch(Exception e) {
            System.err.println(e);
            System.exit(1);
        }
    }
}

Java placed no obstacles in my way here. We just need to switch the Reader object that we use to construct the BufferedReader used to split the input into individual lines.

Node.js

#!/usr/bin/env node

var lines = 10, i, optIdx = 2, needHelp = false, err = null,
    args = process.argv;
if (args.length > optIdx && /^\-\d+$/.test(args[optIdx])) {
    lines = parseInt(args[optIdx].substr(1));
    args.splice(optIdx, 1);
}

for (; optIdx < args.length; ++optIdx) {
    if (args[optIdx] == "--help" || args[optIdx] == "-?") {
        needHelp = true;
    }
    else if (args[optIdx] == "--count" || args[optIdx] == "-n") {
        if (/^\d+$/.test(args[optIdx + 1])) {
            lines = parseInt(args[optIdx + 1]);
            ++optIdx;
        }
        else {
            err = "Bad count argument";
            needHelp = true;
            break;
        }
    }
    else {
        break;
    }
}

if (needHelp) {
    if (err) {
        console.error(err);
    }
    console.error("Usage:\n  " + args[1] + " [--count|-n <lines>] [FILE]");
    process.exit(1);
}

var fs = require('fs'),
    readline = require('readline'),
    fName = "", strm, eof = false;
if (optIdx < args.length) {
    fName = args[optIdx];
}
if (!fName || fName === "-") {
    strm = process.stdin;
}
else {
    strm = fs.createReadStream(fName);
}
strm.on("error", function(e) {
    console.error(e.message);
    process.exit(1);
});
strm.on("end", function() {
    eof = true;
});

var rd = readline.createInterface({
    input: strm,
    terminal: false
});

rd.on("line", function(line) {
    if (!eof) {
        line += "\n";
    }
    process.stdout.write(line);
    if (--lines === 0) {
        rd.close();
    }
});

rd.on("close", function() {
    process.exit();
});

Again, no problems. JavaScript’s dynamic typing makes this easier than the Java version since strm is just a reference to a thing that behaves in a stream-like manner.

Perl

#!/usr/bin/perl

=head1 SYNOPSIS

headpl [--count|-n <lines>] [FILE]

=cut

use strict;
use warnings qw/all/;
use Getopt::Long;
use Pod::Usage;

my ($needHelp, $err, $lines) = (0, "", 10);

if (@ARGV && $ARGV[0] =~ /^-(\d+)$/) {
    $lines = $1;
    shift(@ARGV);
}
GetOptions(
    "help|?" => \$needHelp,
    "count|n=s" => \$lines,
);
unless ($lines =~ /^\d+$/) {
    $err = "Bad count argument";
    $needHelp = 1;
}

pod2usage(
    exitval => 1,
    message => $err
) if ($needHelp);

my $fh = *STDIN;
if (@ARGV && $ARGV[0] ne "-") {
    open($fh, "<", $ARGV[0]) or die("Unable to open $ARGV[0]: $!\n");
}
while (defined(my $line = <$fh>) && $lines--) {
    print($line);
}

The Unix model that everything is a file applies to Perl: stdin is simply a filehandle which means that all we have to do is change what $fh refers to. Note the following:

  • Line 34 (my $fh = *STDIN;): The syntax may be unfamiliar. STDIN is an entry in the symbol table, but it doesn’t have a defined type (i.e., there isn’t a $STDIN or a %STDIN). *STDIN is called a typeglob and means “the thing that STDIN refers to”.
  • Line 35 (the ne comparison): Perl treats string values and numeric values interchangeably, depending on the context. Distinguishing numeric equality from string equality requires different operators: ==/!= for numeric equality, eq/ne for string equality.

Python

#!/usr/bin/python
import sys
import re
import argparse

def usage():
    return '''Usage:
  headpy [--count|-n <lines>] [FILE]
'''

def printstream(f, lines):
    for line in f:
        sys.stdout.write(line)
        lines -= 1
        if not lines:
            break

lines = 10
if len(sys.argv) > 1:
    customOption = re.compile(r"-(\d+)$")
    m = customOption.match(sys.argv[1])
    if m:
        lines = int(m.group(1))
        del sys.argv[1]

parser = argparse.ArgumentParser(
    usage = usage(), add_help = False
)
parser.add_argument("-?", "--help", required = False, action="store_true")
parser.add_argument("-n", "--count", required = False)
parser.add_argument("argv", nargs = argparse.REMAINDER)
args = parser.parse_args()
err = None
needHelp = args.help
if args.count:
    numericOption = re.compile(r"^\d+$")
    m = numericOption.match(args.count)
    if m:
        lines = int(args.count)
    else:
        err = "Bad count argument"
        needHelp = True
if needHelp:
    parser.error(err)

fName = None
if args.argv and args.argv[0] != "-":
    fName = args.argv[0]
try:
    if fName:
        with open(fName) as f:
            printstream(f, lines)
    else:
        printstream(sys.stdin, lines)
except IOError as e:
    # Report the file we actually tried to read, not whatever is in sys.argv[1]
    sys.stderr.write("Unable to read %s: %s\n" % (fName or "stdin", e.strerror))
    sys.exit(1)

As with the C++ solution, I refactored the Python script to move the core functionality into its own function that takes some kind of iterable object. Unlike the C++ solution, I’m not seeing other ways that I could make this work. The with block that manages the lifetime of the file object is not something that I can change to use sys.stdin, so we’re stuck with having to treat a file and stdin differently. Note the following:

  • Line 47 (the and in the file name test): Things like unfamiliar logic operators make programming in a language more difficult than it needs to be. Given that Python is written in C, would it have killed GvR to have used the familiar “&&”?

Part 4: Relative Performance

This section is for fun. I don’t believe that performance should be the primary criterion for all but a few problem domains. Assuming your algorithms are good, most programs these days are fast enough. The correct target for early optimization is the programmer rather than the CPU, since CPU time, unlike programmer time, is always getting cheaper. Therefore, clearer code that takes a few more microseconds to execute is a worthwhile investment. That said, I/O is something you want to be fast since that is one of the performance limiters of any program.

The software versions are the ones hanging around on my MacBook:

$ clang --version
Apple LLVM version 7.3.0 (clang-703.0.31)
$ java -version
java version "1.6.0_51"
$ node --version
v4.2.2
$ perl --version
This is perl 5, version 18, subversion 2 (v5.18.2) ...
$ python --version
Python 2.7.10

The C++ implementation was built as follows:

$ g++ -o headcpp -O3 -Wall head.cpp

To gauge performance, I ran the following for each implementation:

$ time for x in {1..10}; do <HEADCMD> -200000 </usr/share/dict/words >/dev/null; done

This means that we are reading two million lines with the various getline implementations. Redirecting output to /dev/null means that we are not left waiting on the terminal. For each implementation, I did three runs and took the best of the three.

head

We’ll use the real head implementation for reference. As I said earlier, were I writing a serious implementation, I wouldn’t read one line at a time and neither does the actual head utility. Anyway:

real	0m0.296s
user	0m0.268s
sys	0m0.026s

C++

real	0m4.848s
user	0m3.947s
sys	0m0.872s

Remember that there were a couple of source level optimizations. Let’s see how performance changes if we undo them. First, using endl instead of “\n”:

real	0m4.860s
user	0m3.978s
sys	0m0.856s

No meaningful difference so not using endl was premature optimization. Now let’s see how keeping C++ streams synchronized with C stdio affects performance:

real	0m4.683s
user	0m3.841s
sys	0m0.819s

Again, no change. However, it did make a big difference on a Linux system, so this is an optimization worth keeping.

Java

real	0m8.358s
user	0m9.526s
sys	0m2.219s

Node.js

real	0m6.597s
user	0m5.592s
sys	0m1.099s

Perl

real	0m1.278s
user	0m1.140s
sys	0m0.106s

Python

real	0m1.448s
user	0m1.196s
sys	0m0.211s

The real surprise here is the performance of the C++ program, which is between three and four times slower than the Python and Perl programs. Not so surprising is that Java is the worst performer, taking nearly twice as long as the C++ version. This is despite Java compiling to bytecode and then using a JIT compiler to turn the bytecode into native code. The Node.js implementation lies somewhere between the C++ and Java implementations, which is not very impressive given that it, too, compiles to native code and considering also the demented event-driven file handling that was imposed on us. The Python and Perl performance is extremely impressive, not a million miles from the reference figures, with Perl just edging out its Dutch rival.

I was so surprised at the terrible performance of the C++ program that I rewrote it as a straight C program using stdio:

#include <stdio.h>
#include <getopt.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

#define DEFLINES 10

void usage(const char *name, const char *msg) {
    if (msg) {
        fprintf(stderr, "%s\n", msg);
    }
    fprintf(stderr, "Usage:\n  %s [--count|-n <lines>] [FILE]\n", name);
}

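int numericOption(const char *s) {
    const char *p = s;
    /* An empty string is not a number: a lone "-" must still mean stdin */
    if (!*p) {
        return -1;
    }
    for (; *p; ++p) {
        /* isdigit requires a value representable as unsigned char */
        if (!isdigit((unsigned char)*p)) {
            return -1;
        }
    }
    return atoi(s);
}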

int main(int argc, char **argv) {
    int lines = DEFLINES, needHelp = 0, opt, nOpt;
    size_t len = 0;
    ssize_t read;
    const char *err = NULL;
    char *fn = NULL, *line = NULL;
    FILE *fp = stdin;
    struct option lOpts[] = {
        { "help", no_argument, NULL, '?' },
        { "count", required_argument, NULL, 'n' },
        { NULL, 0, NULL, 0 }
    };
    if (1 < argc && argv[1][0] == '-') {
        nOpt = numericOption(&argv[1][1]);
        if (nOpt >= 0) {
            lines = nOpt;
            argv[1] = argv[0];
            ++argv;
            --argc;
        }
    }
    while ((opt = getopt_long(argc, argv, "n:?", lOpts, NULL)) != -1) {
        switch (opt) {
            case 'n':
                nOpt = numericOption(optarg);
                if (nOpt >= 0) {
                    lines = nOpt;
                }
                else {
                    err = "Bad count argument";
                    needHelp = 1;
                }
                break;
            case '?':
            default:
                needHelp = 1;
                break;
        }
    }
    if (needHelp) {
        usage(argv[0], err);
        return 1;
    }
    if (argc > optind && strcmp(argv[optind], "-")) {
        fn = argv[optind];
    }
    if (fn) {
        fp = fopen(fn, "r");
        if (!fp) {
            fprintf(stderr, "Unable to open %s: %s\n", fn, strerror(errno));
            return 1;
        }
    }
    while ((read = getline(&line, &len, fp)) != -1 && lines--) {
        fputs(line, stdout);
    }
    free(line);
    if (fn) {
        fclose(fp);
    }
    return 0;
}

Let’s see how it performs:

$ time for x in {1..10}; do ./headc -200000 </usr/share/dict/words >/dev/null; done

real	0m0.306s
user	0m0.270s
sys	0m0.026s

That’s more like what you’d expect from a compiled binary and isn’t noticeably slower than the native head utility. Importantly, it’s around fifteen times faster than the C++ implementation and, given that the only real difference between the two is the I/O routines, I have to conclude that the performance of C++ stream I/O is rather dismal. The C++ code is measurably faster when reading the file directly rather than redirecting stdin, which points the finger at cin.

Part 5: Conclusions

C++ is a pleasant language to work with and the syntax is unobtrusive, if a little long-winded. The reward for the extra effort is that the compiled program runs at native speed. One goal of C++ is the “zero overhead principle”, by which it is meant that you won’t be able to get better performance by programming in another language. When it comes to I/O, however, that simply isn’t true. In these tests, C++ stream I/O is not only far slower than the legacy C stdio but also slower than two scripting languages, Perl and Python. For bread-and-butter programming, plain C would appear to be the better choice.

In a former life, I spent around three years as a Java programmer. At the time it felt like a breath of fresh air, but given that I’d spent some previous years doing Windows development using COM, that’s not surprising. On reflection, the difference between them was like the difference between warm faeces and cold vomit. If forced to express an opinion, you might favour one over the other but you’d rather have neither in your mouth. Java is simply unpleasant to work with: fussy, verbose syntax combining the developer overhead of a statically-typed, compiled language with the poor performance of an interpreted one. Somewhere along the way, Java got entrenched as the “Enterprise” development language but I don’t quite understand how: if it’s no good at the basics, how can it really be any good for the enterprise? The supposed benefit of binary portability is questionable, since Perl and Python are every bit as portable and don’t have Java’s shortcomings. In fairness to Java, I will acknowledge that I am not using the latest and greatest version, although I doubt that a single version bump has suddenly made Java run quickly. I am also aware that Java 1.7 has introduced a “try-with-resources” statement that brings a touch of RAII to the Java world. This would have made working with Java slightly less unpleasant.

Programmers love novelty as much as the next person which is why, I’m guessing, Node.js has gained traction. Otherwise, I can’t quite see what it’s bringing to the table. Event-driven programming makes absolute sense for code running in the browser, since you’re responding to mouse clicks and form submissions and so on. It doesn’t make quite so much sense in a scripting language, adding needless complexity to basic tasks. Proponents would say that the problem domain for Node.js is not basic scripting but scalable network applications and, to be fair, a non-blocking event model makes a lot more sense for sockets than it does for files. However, I would reply that Python (with Twisted) and Perl (with POE) can do scalable network applications just fine and are really good at the basic stuff as well.

I presume that love of novelty is also why Perl has lost mindshare in the last decade or so. To many programmers, Perl is what that crotchety old Unix guy in the corner reaches for when awk runs out of steam. Perl is, indeed, the Unix philosophy applied to language design. But that’s a good thing, because it helps make Perl economical and elegant. Perl’s manifesto is that easy things should be easy and hard things should be possible and it succeeds in its aims. Interestingly, the file size of the Perl solution is half that of its nearest rival, which translates to greater programmer productivity: if I have to write less, I’m going to get more done. It gets better. If you’ve ever read ‘Code Complete’, you’ll know that the number of defects per 1000 lines of code is roughly constant, so fewer lines of code means fewer defects. Get more done with fewer defects: who wouldn’t want that?

Python is also a very solid choice as a general-purpose programming language, being both economical and performant. There is a hidden cost, however: the very idiosyncratic syntax is a barrier to any programmer who has anything like a C background. Python fans have an adjective for idiomatic Python code, “pythonic”. If you’re not “pythonic”, writing in Python can be more of a chore than a pleasure. Given that performance compared to Perl is a wash and that it’s not as economical (file size is twice that of the Perl solution), I won’t be switching to Python any time soon. That said, if I had to use Python for the bulk of my work, I wouldn’t start looking for another job. You can’t say that about Java!

At the start of this article, I said that choice of programming language should be a neutral one. Of course, that is far from the truth. With the possible exception of target platform, choice of language is the single most important engineering decision you can make. Make the wrong choice and you’re halving team productivity while doubling the number of software defects. Don’t go for the latest fashionable craze (which this year is Rust) and try to avoid the sort of mindset that believes “enterprise” means Java. If raw speed is not a primary criterion (and generally, it isn’t), give very serious consideration to a good scripting language. You’ll be a happier engineer for doing so.

Seven Habits of Highly Ineffective Programmers

Credit: The People Speak!

Some programmers are better than others. In fact, there’s a statistical distribution: a few are absolutely brilliant, some are good, most are at least competent, some are barely competent and a few are truly dire. It’s an interesting observation that the Microsofts and Googles of the world will have seriously incompetent people writing parts of their software. And the really bad software can live for years, long after the author has taken his mediocrity with him to another unsuspecting company because one characteristic of bad software is that it is fantastically baroque and no-one else wants to take ownership. So there it stays. It’s almost certainly inefficient and probably insecure but so long as it doesn’t explode, everyone just leaves it the heck alone.

I will point out that bad programmers are not necessarily stupid and are certainly not bad people (or if they are, it’s not related to them also being bad programmers), they’re just individuals who find themselves in a job that they’re unsuited for. That said, the worst programmers share a number of characteristics. There are a lot more than seven, of course, but here’s my list of seven of the worst. Bad programmers…

…Lack basic knowledge

Writing software is an art to be sure, but it’s also a craft and like any craft, there is certain basic knowledge that you need to be able to practise your craft effectively. I don’t expect you to be Donald Knuth but I do expect you to know the basics of data structures and algorithms. I don’t expect you to be able to code a hashtable but I do expect you to know that there is such a thing and why you might use one.

Not understanding the basics of algorithms and data structures will result in serious performance issues. Imagine a sorted list that currently contains 1000 items. A method exists to insert new items into the list. The bad programmer’s implementation will add items to the end of the list and then sort it. This is O(n log₂ n) or roughly 10,000 comparisons. The competent programmer will do a linear search of the list looking for the insertion point. This is O(n), averaging n/2 or roughly 500 comparisons. The good programmer will use binary search to find the insertion point. This is O(log₂ n) or roughly 10 comparisons. The worst implementation is three orders of magnitude slower than the best implementation. If there are 1,000,000 items in the list, the worst implementation is heading for disaster.
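
As a minimal sketch of the good programmer’s approach, here is what the binary search insertion might look like in JavaScript. The names insertionPoint and insertSorted are my own inventions for illustration and the list is assumed to hold numbers in ascending order:

```javascript
// Hypothetical helper: returns the index at which item should be
// inserted to keep an ascending array of numbers sorted.
function insertionPoint(list, item) {
    var lo = 0, hi = list.length, mid;
    while (lo < hi) {
        mid = (lo + hi) >>> 1;      // midpoint of the remaining range
        if (list[mid] < item) {
            lo = mid + 1;           // item belongs in the upper half
        }
        else {
            hi = mid;               // item belongs in the lower half
        }
    }
    return lo;
}

// Hypothetical helper: inserts item in place and returns the list.
function insertSorted(list, item) {
    list.splice(insertionPoint(list, item), 0, item);
    return list;
}

var list = [10, 20, 30, 40];
insertSorted(list, 25);
console.log(list.join(","));    // 10,20,25,30,40
```

Each pass through the loop halves the range still to be searched, which is where the figure of roughly 10 comparisons for 1000 items comes from. Note that splice still has to shuffle the later elements along, so the saving here is in comparisons, which is what the figures above are counting.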

…Fail to understand idioms

Programming languages are the medium through which we express our intentions. Like human languages, computer languages have their idioms and idiosyncrasies. The good programmer will master the idioms while the bad programmer will create anti-patterns.

The anti-patterns are typically irksome rather than disastrous. For example, if you don’t understand the const idiom in C++, you’re likely to write functions like this:

void writeData(char *data) {
    // Do stuff
}

Leaving aside the fact that char * is effectively a promise to modify the data, the API is a pain to use, requiring ugly const_cast invocations:
std::string data("...");
...
writeData(const_cast<char *>(data.c_str()));

If we don’t own the code, we’re stuck with it. If we do own the code, we’re probably still stuck with it for two reasons:

  • If I fix the signature for writeData so that the argument is declared const, I need to fix all existing calls to writeData as well
  • The rest of the code in the file containing writeData is likely to be terribad. If I touch it, I own it and I don’t want it!

…Use the wrong tools

Imagine you’re looking at a C++ header file and you see something like this:

typedef struct LIST {
  struct LIST *next;
  void* payload;
} list, *plist;

If you don’t recognize that, because you weren’t around in the 1990s, that’s a linked list struct, probably the same one that the bad programmer lifted from Herb Schildt’s ‘Teach Yourself C’ back in 1995 and which he’s been using ever since. Whatever its origin, it’s a sure sign that the programmer is using the wrong tools. std::list would be an obvious replacement but looking at the code more closely might indicate that std::vector or std::deque would be better suited to the task at hand. One thing is certain, however: the right tool for the job is never going to be a hand-rolled linked list.

Windows developers are forced to try and do everything in their graphical apps by the sheer poverty of the operating environment. They can’t reuse tools to do what their application requires because there are none, or none worth using. Unix developers, on the other hand, have more tools to play with than some of them know how to actually use. Have you ever seen something like this?

#!/bin/bash
pattern=...
file=...
...
grep $pattern $file | awk '{ print $2 }'

In case you don’t recognize the syntax, a tool called grep is being used to pick out matching lines from a file and these lines are being piped to another tool called awk to print the second field of each line. As any fule kno, awk can search files for patterns just fine so grep is redundant:

awk "/$pattern/ {print \$2}" $file

…Ignore conventions

Well-behaved Unix (and Windows) programs conform to well-established conventions:

  • Input is read (or at least can be read) from a special filehandle called stdin
  • Output is written (or at least can be written) to a special filehandle called stdout
  • Success is indicated by an exit status of 0
  • Failure is indicated by an exit status other than 0
  • Associated error text is written to a special filehandle called stderr
  • Program options are specified by command-line switches that look like -i <FILENAME> or --inputfile <FILENAME>

Following these conventions allows marvelous things to happen. The output of another program, such as grep, can be used as the input to my program, the output of which can be passed to, say, a PDF writer that produces beautifully formatted reports. My program doesn’t need to know how to do pattern matching and it certainly doesn’t need to know how to write PDFs. Instead, it can just get on with doing whatever it is that it does.

A long-departed colleague wrote dozens of Python scripts and compiled C++ programs that had the following characteristics:

  • The command line options were always two XML documents, one for actual options and one for “parameters”
  • The output was always an XML document on stdout, even in the case of failures
  • The exit status was always 0
  • The actual status was somewhere in the output XML, along with any error text
  • Progress text was written to a magic filehandle with a fileno of 3.

Needless to say, he was a terrible programmer.

…Are too clever by half

I said above that bad programmers are not necessarily stupid. Some are really quite clever. Too damned clever if you ask me. Have you ever seen something like this?

void majorfail(char *s) {
    char buf[256], *p = buf;
    while (*p++ = *s++);
    // Do stuff with buf
}

Bad programmers are insecure programmers as well, by which I don’t mean that they’re secretly afraid that they’re not very good but rather that the software they write has security holes large enough to drive trucks through. That while loop may look 1337 hardcore but if I see something like it, I’m not praising the author’s mastery of the pointer but rather asking why he’s reinventing strcpy. And this message goes out to all the bad programmers of the world: 256 characters (or 512 or 1024) is not big enough to hold any possible input. The fix is not only safe, but way simpler and more comprehensible:
void minorfail(char *s) {
    std::string str(s);
    // Do stuff with str
}

Sometimes, a piece of clever code relies on some really obscure knowledge to make sense of it. Look at the following JavaScript:

var hardcoreMathFn = function(x) {
    if (x === x) {
        // Do loads of really cool stuff
    }
};

Because I’m a bit of a clever-clogs myself, I recognize that this is an NaN test (NaN is the only value which is not equal to itself). If you, the maintainer of this obscurantist nonsense, are not aware of that piece of arcane knowledge, then the code looks like this:
if (true) {
    // Do loads of really cool stuff
}

The correct fix is:
if (!isNaN(x)) {
    // Do loads of really cool stuff
}

but the more likely one is:
// Do loads of really cool stuff

This sort of uncommented esotericism led to one of the worst bugs ever and I personally hold the smartypants openssl devs responsible, not the well meaning Debian package maintainer.
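
Returning to the x === x test, a small sketch makes the arcane knowledge explicit. The three function names are hypothetical; isNaN and Number.isNaN are standard, the latter since ES2015. One wrinkle worth knowing is that the global isNaN converts its argument to a number before testing it, so it only agrees with x === x when x is already a number:

```javascript
// The obscure self-equality test
function selfEqual(x) {
    return x === x;
}

// The explicit test using the global isNaN, which coerces to a number first
function notNaNGlobal(x) {
    return !isNaN(x);
}

// The explicit test using Number.isNaN, which does not coerce
function notNaNStrict(x) {
    return !Number.isNaN(x);
}

[0, 1.5, Infinity, NaN, "abc"].forEach(function(x) {
    console.log(String(x), selfEqual(x), notNaNGlobal(x), notNaNStrict(x));
});
// 0 true true true
// 1.5 true true true
// Infinity true true true
// NaN false false false
// abc true false true
```

For numeric input, which is what a maths function ought to be getting, all three agree. Whichever you choose, the explicit versions say what they mean and the maintainer doesn’t need to be a clever-clogs to follow them.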

…Reinvent wheels. Badly

Have you ever seen code that does its own command-line processing? Something like this:

int main(int argc, char **argv) {
    char *file = NULL, *host = NULL, ...
    for (int i = 1; i < argc; i += 2) {
        if (!strcmp(argv[i], "-file")) {
            file = argv[i + 1];
        }
        else if (!strcmp(argv[i], "-host")) {
            host = argv[i + 1];
        }
        ...
    }
    ...
}

This is ugly to read and a ballache to maintain, and accommodating an option that doesn’t take an additional parameter, such as -verbose, will be irksome. It also uses a variant argument form, -file rather than --file. If you don’t know already, you’ll probably guess that DIY command-line processing is quite unnecessary as well, since the problem has already been solved. The bad programmer is simply too ignorant to know that this is the case or too lazy to figure out how to do it properly.

Bad programmers don’t just reinvent, they invent as well. A sure sign of bad software design is the invention of (oxymoron alert!) a proprietary network protocol. Imagine that we are implementing a client/server system that receives JSON-encoded requests (no, not XML because this is the 21st century) and emits JSON-encoded responses. An exchange might look like this:

{"request":"Yo!"}
{"response":"Wassup?"}

Now imagine that our “protocol” is simply that each request and response terminates on a newline so that the actual network exchange is this (“\n” signifies an actual newline character rather than an escape sequence):

{"request":"Yo!"}\n{"response":"Wassup?"}\n

Why is this a problem? Because implementing the protocol requires us to work at an uncomfortably low level. We have to get down and dirty with send, recv and select. We have to handle the possibility that the other end might fall over halfway through a message ourselves. If, however, we’d chosen HTTP rather than inventing our own protocol, we could use something like libcurl on the client side, a library that is not only mature and robust but more importantly is maintained by someone else. The server side could be implemented as a simple CGI script.
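
To give a flavour of the bookkeeping involved, here is a minimal sketch of just the framing step for the newline-terminated “protocol”, written in JavaScript. LineFramer and its methods are hypothetical names of my own. It assumes the data arrives as string chunks of arbitrary size and it ignores sockets, timeouts and malformed JSON altogether (JSON.parse will simply throw):

```javascript
// Hypothetical framing helper: collects chunks as they arrive and calls
// onMessage once for every complete newline-terminated JSON message.
function LineFramer(onMessage) {
    this.pending = "";          // bytes received but not yet terminated
    this.onMessage = onMessage;
}

LineFramer.prototype.push = function(chunk) {
    var idx;
    this.pending += chunk;
    while ((idx = this.pending.indexOf("\n")) !== -1) {
        this.onMessage(JSON.parse(this.pending.slice(0, idx)));
        this.pending = this.pending.slice(idx + 1);
    }
};

// True if the other end stopped talking halfway through a message.
LineFramer.prototype.hasPartial = function() {
    return this.pending.length > 0;
};

var framer = new LineFramer(function(msg) {
    console.log(msg.request);
});
framer.push('{"request":');         // nothing complete yet
framer.push('"Yo!"}\n{"requ');      // prints Yo!
console.log(framer.hasPartial());   // true: a fragment is left over
```

That is the easy part. Deciding what to do with the leftover fragment when the connection drops, and everything else that HTTP and its libraries already take care of, would still be ours to write and maintain.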

…Are plain sloppy

Look at my code samples in this article. Even when I’m demonstrating manifestly bad practices, the code is neat and orderly. You may question my opening brace placement since I favour the one, true placement of Kernighan and Ritchie over the deviant form of the heretic Stroustrup but you can’t deny that my coding style is readable and consistent. Bad software isn’t neat. The bad programmer can’t decide whether or not a space should go after keywords like if and while (hint: yes it should). You can generate random numbers from how many spaces go after commas: 0, 1 or other (hint: the correct value is 1). Similar strong randomness seems to govern whether or not there is a space between a function name and its argument list (hint: there shouldn’t be but I’ll let it pass if you’re consistent). And tabs and spaces are intermixed almost whimsically.

You may retort that whitespace isn’t significant, so what’s the problem? I said above that programming is a craft as well as an art. It’s also a discipline, one that requires an orderly thought process and a methodical approach. Code spends most of its time being maintained, not being written. If you don’t take a few minutes to ensure that your indentation matches your nesting, how is the person who comes after you supposed to follow your logic? If you can’t pay attention to details, how are you going to avoid sneaking in defects? In my experience, the probability of untidy code also being good code is close to 0.

Extending JavaScript Built-ins

Credit: Martin Burns

One should neither split infinitives nor end sentences with prepositions. These statements are examples of prescriptive grammar. Prescriptive grammar means rules invented by self-appointed experts who presume to tell others how they should speak if they don’t want to sound like louts. As far as I can tell, the origin of these prescriptions is that you couldn’t split infinitives or strand prepositions in Latin. Except we speak English and English accommodates “to boldly go” just fine and if we were to ask someone “About what are you talking?” they would be unlikely to compliment us on the quality of our speech.

The Internet is full of self-appointed experts who presume to tell others how to program in a similarly prescriptive way. One of the prescriptions that you may or may not be aware of is that one should never extend the built-in JavaScript classes (i.e. String, Array, Date, etc.). This Stack Overflow post is a good example of the controversy complete with huffing and puffing and stern warnings against the evils of such a filthy habit. Yet the ability to extend the behaviour of classes without subclassing is a core feature of JavaScript, not a bug, so a prescription against such a useful ability had better have a really good rationale (and no, MDN, fulmination is not a valid argument). The arguments against go something like this:

  • The method might be implemented in a future standard in a slightly different way
  • Your method name appears in for...in loops
  • Another library you use might implement the same method
  • There are alternative ways of doing it
  • It’s acceptable practice for certain elites but not for the unwashed masses
  • A newbie programmer might be unaware that the method is non-standard and get confused.

The retort to that last argument is “git gud”, while the retort to the last but one is something rather more robust. We’ll examine the other arguments in turn.

Implemented in a Future Standard

It might happen, particularly if the functionality is obviously useful. I started writing JavaScript early in 2005 and it didn’t take me long to get fed up with doing this:

var a = "...", s = "...";
...
if (a.substr(0, s.length) === s) {
   // a starts with s
}

The obvious solution was to extend the String class:

String.prototype.startsWith = function(s) {
   return (this.substr(0, s.length) === s);
};

This allowed me to express myself rather more clearly:

if (a.startsWith(s)) {
   ...
}

This is obvious functionality implemented in an obvious way. No, my implementation didn’t test for pathological input but then I never fed it pathological input so it didn’t matter. Sometime around 2012, browsers containing a native implementation of startsWith started appearing and the only thing that needed changing was a test for the presence of the method before stomping over it. Otherwise, despite the addition of a feature-in-search-of-a-use-case (a start position other than 0), the official implementations behaved identically to my venerable extension.
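As a rough sketch, the guard mentioned above might look like this. The name legacyStartsWith and the sample inputs are introduced here purely for illustration, so that the old implementation can be compared with whatever the runtime provides:

```javascript
// The 2005-era implementation, kept as a named function (hypothetical name)
// so it can still be called once a native startsWith exists.
function legacyStartsWith(s) {
    return this.substr(0, s.length) === s;
}

// Only install the extension when the runtime does not already provide one.
if (!("startsWith" in String.prototype)) {
    String.prototype.startsWith = legacyStartsWith;
}

// Compare the installed method with the legacy one on a few ordinary inputs.
var samples = [["hello", "he"], ["hello", "lo"], ["", ""], ["abc", "abcd"]];
var agree = samples.every(function(pair) {
    return pair[0].startsWith(pair[1]) === legacyStartsWith.call(pair[0], pair[1]);
});
```

On these inputs, agree ends up true whether the guard installed the extension or left a native method in place.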

For argument’s sake, let’s say I’d called the method beginsWith. The fix would take longer to type than to apply:

$ sed -i -e 's/beginsWith/startsWith/g;' *.js

Even if I’d been stupid enough to return 0/-1 rather than true/false, fixing up my uses of startsWith to cope with a different return type would have been the work of a few minutes. We call this refactoring, which means that the argument from possible future implementations is actually an argument against refactoring, which is no argument at all.

for…in Pollution

The problem here is that your method name appears in the enumerable properties of an object but this is actually a non-issue. Leaving aside the fact that it has been possible for some years to ensure that your method is not enumerable in the great majority of browsers in the world, would you ever enumerate the properties of a String or Date object? Hint: a Date has no enumerable properties and a String has only its character indices. Furthermore, if you enumerate over an Array with a for...in loop, you’re doing it wrong.
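The difference between an enumerable and a non-enumerable extension can be seen in a small sketch. The method names demoLast and demoFirst are hypothetical, chosen so as not to collide with anything real:

```javascript
// Plain assignment creates an enumerable property.
Array.prototype.demoLast = function() { return this[this.length - 1]; };

// Object.defineProperty lets us keep the method out of for...in loops.
Object.defineProperty(Array.prototype, "demoFirst", {
    enumerable: false,
    configurable: true,   // only so that this demo can tidy up afterwards
    value: function() { return this[0]; }
});

var seen = [], k;
for (k in ["a", "b"]) {
    seen.push(k);
}
// seen now holds "0", "1" and "demoLast", but not "demoFirst"

var first = ["a", "b"].demoFirst();   // the hidden method still works

// Tidy up the demo methods.
delete Array.prototype.demoLast;
delete Array.prototype.demoFirst;
```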

Implemented by Another Library

The problem here is that the last library included wins and if utility library A behaves differently to utility library B, code may break. But again, this is something of a non-issue if certain safeguards are observed: test for an existing method and implement obvious functionality in an obvious way. If anything, this is an argument against using the monolithic jqueries of the world in favour of micro frameworks, since micro frameworks tend not to do 1001 things that you don’t know about.
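A minimal sketch of the existence test, with two functions standing in for two third-party libraries. The names libraryA, libraryB and demoCollapse are all hypothetical. With the guard in place, the second library leaves the first one's method alone rather than stomping over it:

```javascript
// Two stand-in "libraries" that both want to add the same method.
function libraryA() {
    if (!("demoCollapse" in String.prototype)) {
        String.prototype.demoCollapse = function() {
            return this.replace(/\s+/g, " ");   // collapse runs of whitespace
        };
    }
}
function libraryB() {
    if (!("demoCollapse" in String.prototype)) {
        String.prototype.demoCollapse = function() {
            return this.replace(/\s+/g, "");    // a non-obvious variant
        };
    }
}

libraryA();
var installed = String.prototype.demoCollapse;
libraryB();   // the guard means this does not overwrite library A's method

var unchanged = (String.prototype.demoCollapse === installed);
var result = "a  b".demoCollapse();

delete String.prototype.demoCollapse;   // tidy up the demo method
```

Note that the guard only prevents overwriting. Library B's own code would still misbehave here because its idea of the method differs, which is why the "obvious functionality" half of the safeguard matters as much as the test.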

It’s Unnecessary

True, inasmuch as there is always more than one way to do anything. The question is: are alternatives better? Let’s say that I want a function to shuffle an Array. The function might look like this:

function shuffle(a) {
    var l = a.length, c = l - 1, t, r;
    for (; c > 0; --c) {
        r = Math.floor(Math.random() * (c + 1)); // pick from 0..c, as Fisher-Yates requires
        t = a[c];
        a[c] = a[r];
        a[r] = t;
    }
    return a;
}

For the interested, this is the Fisher-Yates algorithm. How are we going to make this function available to our application? We can reject one possibility - subclassing Array - right away. The reason is that arrays are almost always instantiated using the more expressive literal notation:

var a = [1, 2, 3],   // like this almost always
    b = new Array(); // like this almost never

This makes using a MyExtendedArray class unfeasible - the overhead would simply be too great. The same applies to theoretical subclasses of String and Number.
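To make the overhead concrete, here is a sketch using the class syntax of newer JavaScript versions. MyExtendedArray is the hypothetical subclass named above and its method body is only a placeholder:

```javascript
// Hypothetical subclass; the shuffling logic itself is omitted.
class MyExtendedArray extends Array {
    shuffle() {
        // ...shuffling logic would go here...
        return this;
    }
}

var literal = [1, 2, 3];
var isExtended = literal instanceof MyExtendedArray;   // literals are plain Arrays

// Every array created elsewhere has to be copied into the subclass
// before the method becomes available.
var converted = MyExtendedArray.from(literal);
var convertedIsExtended = converted instanceof MyExtendedArray;
var sameObject = (converted === literal);              // from() makes a copy
```

Array literals, and arrays handed back by other code, never pick up the subclass's methods without an explicit conversion like this one.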

Using a procedural approach is doable but we now come up against the problem of namespacing. As written, the identifier "shuffle" is placed into the global namespace. This isn't as fatal as some would have you believe if it's for private consumption. However, if you plan to create a library for consumption by others you should avoid using the global namespace because collision with identically named entities is a non-negligible risk. One could imagine, for example, another "shuffle" function that works on a "DeckOfCards" object or a "Songlist". Utility functions are problematic because the namespace isn't obvious. One could choose something like "ArrayUtils" but then you're liable to be competing with everyone else's ArrayUtils. So you might end up doing this:

if (!("CCWUtils" in window)) {
    window.CCWUtils = {};   // explicit global; a bare assignment throws in strict mode
}
CCWUtils.Array = {
    shuffle: function(a) {
        ...
    }
};

...

var cards = ["AS", "2S", "3S", ... , "JC", "QC", "KC"];
CCWUtils.Array.shuffle(cards);

Remember that we're doing it this way because we believe that adding a "shuffle" method to the native Array class is somehow sinful. If that feels like the tail wagging the dog, compare the sinful approach:

if (!("shuffle" in Array.prototype)) {
    Array.prototype.shuffle = function() {
        ...
    };
}

...

var cards = ["AS", "2S", "3S", ... , "JC", "QC", "KC"];
cards.shuffle();

To my eyes, cards.shuffle() is both more idiomatic and more elegant. Namespacing isn't a problem and I take care to play nicely with any existing shuffle method.

Doing it Right

I believe that extending built-in classes is a valid practice but there are some guidelines that you might follow:

  • Add useful functionality. For example, the concat method of String objects implements the same functionality as the + operator. Don't do something similar
  • Use an obvious name and implement obvious functionality
  • Be careful to avoid overwriting a method with the same name
  • If possible, ensure that your method is non-enumerable, if only to silence critics who might otherwise complain that you're polluting their for...in loops
  • Take some pains to ensure that your method behaves like other methods. For example, methods of built-in objects are typically implemented generically and mutator methods (methods that modify the object) typically return the mutated object as well.

With that in mind, here is the complete shuffle extension:

(function() {
    "use strict";
    var _shuffle = function(a) {
        if (null == a) {
            throw new TypeError("can't convert " + a + " to an object");
        }
        var _a = Object(a), len = _a.length >>> 0, c = len - 1, t, r;
        for (; c > 0; --c) {
            r = Math.floor(Math.random() * (c + 1)); // 0 <= r <= c
            // Swap the item at c with that at r
            t = _a[c];
            _a[c] = _a[r];
            _a[r] = t;
        }
        return _a;
    },
    obj = Array.prototype, mName = "shuffle",
        m = function() { return _shuffle(this); };
    if (!(mName in obj)) {
        try {
            Object.defineProperty(obj, mName, {
                enumerable : false,
                value : m
            });
        }
        catch(e) {
            obj[mName] = m;
        }
    }
})();

Note the following:

  • Lines 4-6: Following other Array methods, raise a TypeError if passed undefined or null. And yes, that is == rather than === since I don't care to distinguish the two
  • Line 7: Our method can work generically on array-like objects (objects with a length and numeric property names) and it won't barf if passed a primitive value. You can pass non-Array objects to the method by invoking it as Array.prototype.shuffle.call(obj). Note, however, that a TypeError will be raised if the method is passed an immutable object, such as a String. That is also true of, say, reverse so it's not a problem
  • Line 15: Our method returns the mutated Object in the manner of similar methods such as sort and reverse
  • Line 17: Use the obvious name "shuffle". If we used something like "$shuffle" to ensure that we didn't conflict with a future implementation, we wouldn't automatically benefit from the native implementation. As it is, if "shuffle" ever becomes a standard method on Array, our extension simply becomes a shim
  • Line 19: Don't overwrite an existing method
  • Lines 21-24: This is how you ensure that the new method is not an enumerable property
  • Line 27: Fallback position for the off-chance that our code is running in Internet Explorer <= 8 or something similarly inadequate.
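The behaviour described in these notes can be checked with a few lines. The sketch below repeats a condensed version of the extension so that it stands alone; the variable names and sample values are illustrative only:

```javascript
(function() {
    "use strict";
    // Condensed version of the shuffle extension.
    function shuffle(a) {
        if (null == a) {
            throw new TypeError("can't convert " + a + " to an object");
        }
        var o = Object(a), len = o.length >>> 0, c, r, t;
        for (c = len - 1; c > 0; --c) {
            r = Math.floor(Math.random() * (c + 1));   // pick from 0..c
            t = o[c]; o[c] = o[r]; o[r] = t;
        }
        return o;
    }
    if (!("shuffle" in Array.prototype)) {
        Object.defineProperty(Array.prototype, "shuffle", {
            enumerable: false,
            value: function() { return shuffle(this); }
        });
    }
})();

var cards = ["AS", "2S", "3S", "4S"];
var returned = cards.shuffle();            // mutates in place, returns the same array

var arrayLike = { 0: "x", 1: "y", 2: "z", length: 3 };
Array.prototype.shuffle.call(arrayLike);   // works generically on array-like objects

var stringError = null;
try {
    Array.prototype.shuffle.call("abc");   // the characters of a String are read-only
} catch (e) {
    stringError = e;                       // a TypeError, as with reverse
}

var leaked = [], k;
for (k in cards) {
    leaked.push(k);                        // only the indices, never "shuffle"
}
```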

But Object.prototype is Forbidden, Right?

I said earlier that forbidding the use of a useful feature needed a good rationale. Well, here goes. Imagine that I wanted a method to count the number of properties of an Object and I implemented it like this:

Object.prototype.count = function() {
    var p, ret = 0;
    for (p in this) {
        if (this.hasOwnProperty(p)) {
            ++ret;
        }
    }
    return ret;
};

The problem is that Object is a first-class data type used to store key-value pairs which means that I can also do this:

var stat = {
    value: 1,
    count: 4   // Oops! count method is now shadowed
};
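To see the shadowing in action, the following sketch installs the count method, exercises it on two objects and then removes it again. The object named plain is introduced here only for contrast:

```javascript
Object.prototype.count = function() {
    var p, ret = 0;
    for (p in this) {
        if (this.hasOwnProperty(p)) {
            ++ret;
        }
    }
    return ret;
};

var plain = { a: 1, b: 2 };
var stat = { value: 1, count: 4 };

var plainCount = plain.count();     // works: the method is inherited
var shadowed = typeof stat.count;   // the data property hides the method

var threw = false;
try {
    stat.count();                   // stat.count is the number 4, not a function
} catch (e) {
    threw = e instanceof TypeError;
}

delete Object.prototype.count;      // put Object.prototype back as we found it
```

The method works on plain, but on stat the identifier count refers to the stored value, so the call fails.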

Rather than adding value, my count method has stolen a perfectly good property name. Even if we think we can do what we want by picking really obscure method names (breaking the "obvious" guideline in the process), Object is the base class of everything and the practice is simply hazard-prone. Tellingly, adding methods directly to Object.prototype breaks the "behaves like other methods" guideline as well, since most Object methods are static:

var a = 1;
...
if (Object.is(a, 1)) {
    // Not "if (a.is(1))"
}

So, yes, Object.prototype is forbidden.

Erm, What About "Host" Objects?

By "host" objects, we mean the objects that the browser makes available to your script that represent the visible and not-so-visible parts of your user interface: Document, Element, Node, Event and so on. These were the subject of an essay some years ago. Most of the issues considered are no longer really issues (applet elements, anyone?) and now this is a reasonable strategy:

if (!("Element" in window)) {
    // Don't try and do anything at all
}

With that in place, I can't see that the sky would fall if you did something like this:

Element.prototype.show = function(style) { // e.g. "inline-block"
    style = style || "";
    this.style.display = style;
};
Element.prototype.hide = function() {
    this.style.display = "none";
};

A possible problem might be one of scope. The behaviours that a String object doesn't have that you wish it did are distinctly finite: while you might want it to return a reversed copy of itself, you're probably not hankering for it to calculate a SHA256 hash of itself. The behaviours that we might want to add to objects that represent interface elements are not so limited. For example, we might want a method to set colours. Does our setColor method take two arguments and set the background colour as well? Or does it simply set the foreground colour, leaving the background colour to be specified by a call to setBackgroundColor? What about a method to set position and z-order? You'll quickly find yourself in danger of violating both the "useful" and "obvious" guidelines.

I'm a great believer in the Unix philosophy of doing one thing well. User-interface toolkits have a bad habit of trying to do flipping everything. My personal feeling is that extending interface objects is a bit like pulling a thread on your jumper: at some point you're going to wish you hadn't started. But if it works for you who am I to say you mustn't?

Conclusions

TL;DR: my views on extending built-in classes are:

  • Array, String, Number, Date: Yes; apply common sense
  • Object.prototype: Seriously, no
  • "Host" objects: not to my taste, but if it floats your boat...