Formatting Numbers

Formatting numbers (especially tables of numbers) is one of the most frustrating programming tasks. Different programming languages and environments have taken different approaches to this problem, but the result is still unsatisfying. In this post, I will argue that half a century of advances in computing have delivered too little progress in this simple task.

Fortran

I started programming in Fortran (mainly Fortran 66 with a bit of Fortran 77 thrown in), but have not used this language for many decades now, and I cannot even recognize a modern dialect of Fortran like Fortran 2003. Still I remember the Fortran format statement with fondness, and believe after half a century of computing progress, the formatted write of Fortran 66 remains unsurpassed for producing purely numerical tabular output. Three elements make it so powerful:

  1. Reuse/repetition: Fortran allows a format specifier to be preceded by a repetition count. In addition, when Fortran reaches the last outer right parenthesis of the format specification without exhausting the list of numbers to be printed, it moves back to the matching left parenthesis (or to the beginning of the format specification) and repeats that format specification on a new line as many times as needed to exhaust the list of numbers. To take a simple example, if you ask Fortran to print a list of 100 numbers with a format specification of 8F10.4, it will print 8 numbers on a line (each number in a field of width 10 with 4 decimal places) and will keep reusing the 8F10.4 until all 100 numbers have been printed. We will get 12 lines of 8 numbers and a final (13th) line with 4 numbers. Combined with implicit do loops, printing tables of numbers is much simpler and easier than in supposedly more ‘modern’ languages. Just one simple explicit loop over rows is enough to print a 35 by 35 correlation matrix nicely with 8 numbers per line. To do the same thing in most other languages requires two nested loops (over rows and columns) with an ugly if statement inside the inner loop to break the line after every multiple of 8 numbers. Actually Fortran is even more powerful – if the program prints a 35×35 matrix and a 11×20 matrix and a 50×40 matrix, and we decide to print 13 (and not 8) numbers per line (using wider paper), changing one format statement from 8F10.4 to 13F10.4 would make this change for all tables because all these tables can use a single format specification despite their different sizes.

  2. Fixed width is truly fixed: If a number cannot be printed in the width specified, Fortran just fills the space with *’s without disrupting the tabular formatting. The printf in C insists on overriding the user specified width and printing the full number. The C solution makes sense when it comes to printing just one number, but I think Fortran does the sensible thing if a whole table is involved. If a numbers overflows the column width of a table, there are four possibilities: (a) that the column width was wrong, or (b) the number was the result of an erroneous input or an erroneous computation, or (c) the number is uninteresting or meaningless (equivalent in a practical sense to a NaN) or (d) the number is exceptional but correct and displaying its true value is important. In the first 3 cases, the Fortran solution is superior, and even in the last case, the G format discussed below is better than the C approach.

  3. The G format specification: I was an avid user of the G format specification when I was programming in Fortran, but hardly ever use the G format in C printf. In my experience Fortran programmers used the G a lot while C programmers use it very sparingly (Of course, my sample of programmers is very small and possibly biased). This indicates that the Fortran interpretation of G (use F if the width is sufficient, but shift to E if it is not) represents a far more common use case than the printf interpretation which is to use whichever format is shorter.

In short, Fortran 66 was fabulous when it came to printing numbers in tables. It was not very good at printing anything else. Object Oriented Programming has introduced the idea of letting objects decide how they must be printed, and this is a huge advance when it comes to printing complex objects nicely. When I run a sophisticated regression model inside an R script, I know that one print command will produce an elegant page of output containing all important coefficients and statistical tests. Doing this in Fortran would be a nightmare.

C printf

Even if you have never programmed in C, chances are that you would have used the printf format specification. In some form or the other, some version of printf or sprintf is the way to print something exactly the way you want it in a host of languages: C++, Java, JavaScript, PHP, Perl, Python, R, awk, Ruby to name a few. For example, a C++ programmer would typically use the << operator for sending something to an output stream; but, once in a while, the programmer encounters the need to do an sprintf and then send the result to an output stream using <<. In many ways, the printf is a huge advance over the Fortran format specification in terms of the options available. If you want to print just one item at a time, printf will give you a level of control far beyond what is available anywhere else. Correspondingly, the printf library is quite large and I have seen C programmers avoid using it at all to keep their executable file size small.

Great as printf is for formatting single numbers, it is quite inconvenient for tabular output. The lack of a repetition count and the other nice goodies of Fortran 66 make it really painful to print a table nicely.

Microsoft Excel

There is one thing that Excel (and other spreadsheet software) does well when it comes to formatting numbers and that is the % format. Particularly in finance, humans like to write and read parameters like interest rates as 5%, but the computer needs to interpret it as 0.05. In most languages, that means multiplying and dividing by 100 on output and input respectively. This is error prone: in one case, getting this wrong caused losses of $216 million and forced the founder of the company to pay a multi-million dollar fine and resign from the company. In finance, reading and writing percentages correctly is a big pain point, and Excel makes this much easier with its % format. None of the other languages that I use have this facility.

In most other ways, formatting a table of numbers in Excel is a big pain, but its treatment of the % issue merits its mention here.

R data.frames and Python Pandas DataFrame

R data.frame and Python Pandas DataFrame are a way to represent a table as a data structure and manipulate it easily. These data structures can be printed with a single command provided you are content with the default formatting styles. If you want to customize the printing, for example, 3 decimal places in one column and 5 decimal places in another column, it takes a lot of effort. And to think that it would have taken only one line in Fortran 66! Fourth generation languages make it easy to compute on arrays, but in half a century, we have only regressed when it comes to printing the array.

When I really need to print a data.frame nicely I use the xtable package to convert a data.frame to an HTML table or a LaTex table. Or I use the latex function in the Hmisc package to produce LaTex output. Even then, I sometimes I need to use the Siunitx or numprint package in LaTex to fine tune the formatting.

Surely, there ought to be a better way to do all this.