Formatting numbers (especially tables of numbers) is one of the most frustrating programming tasks. Different programming languages and environments have taken different approaches to this problem, but the result is still unsatisfying. In this post, I will argue that half a century of advances in computing have delivered too little progress in this simple task.
I started programming in Fortran (Fortran 66 with a bit of Fortran 77 thrown in), but I have not used the language for many decades now, and I cannot even recognize a modern dialect such as Fortran 2003. Still, I remember the format statement with fondness, and I believe that after half a century of computing progress, the formatted write of Fortran 66 remains unsurpassed for producing purely numerical tabular output. Three elements make it so powerful:
- Repetition counts and format reuse: Fortran allows a format specifier to be preceded by a repetition count. In addition, when Fortran reaches the last outer right parenthesis of the format specification without exhausting the list of numbers to be printed, it moves back to the matching left parenthesis (or to the beginning of the format specification) and repeats that format specification on a new line as many times as needed to exhaust the list of numbers. To take a simple example, if you ask Fortran to print a list of 100 numbers with a format specification of 8F10.4, it will print 8 numbers on a line (each number in a field of width 10 with 4 decimal places) and will keep reusing the 8F10.4 until all 100 numbers have been printed. We will get 12 lines of 8 numbers and a final (13th) line with 4 numbers. Combined with implicit do loops, this makes printing tables of numbers much simpler than in supposedly more ‘modern’ languages. Just one simple explicit loop over rows is enough to print a 35 by 35 correlation matrix nicely with 8 numbers per line. To do the same thing in most other languages requires two nested loops (over rows and columns) with an ugly if statement inside the inner loop to break the line after every multiple of 8 numbers. Actually, Fortran is even more powerful: if the program prints a 35×35 matrix, an 11×20 matrix and a 50×40 matrix, and we decide to print 13 (and not 8) numbers per line (using wider paper), changing one format statement from 8F10.4 to 13F10.4 would make this change for all the tables, because all of them can share a single format specification despite their different sizes.
- Fixed width is truly fixed: If a number cannot be printed in the width specified, Fortran just fills the space with *’s without disrupting the tabular formatting. The printf of C insists on overriding the user-specified width and printing the full number. The C solution makes sense when it comes to printing just one number, but I think Fortran does the sensible thing if a whole table is involved. If a number overflows the column width of a table, there are four possibilities: (a) the column width was wrong, (b) the number was the result of an erroneous input or an erroneous computation, (c) the number is uninteresting or meaningless (equivalent in a practical sense to a NaN), or (d) the number is exceptional but correct and displaying its true value is important. In the first three cases, the Fortran solution is superior, and even in the last case, the G format discussed below is better than the C solution.
- The G format specification: I was an avid user of the G format specification when I was programming in Fortran, but I hardly ever use the %g conversion in printf. In my experience, Fortran programmers used G a lot while C programmers use %g very sparingly (of course, my sample of programmers is very small and possibly biased). This indicates that the Fortran interpretation of G (use F if the width is sufficient, but shift to E if it is not) represents a far more common use case than the printf interpretation, which is to use whichever format is shorter.
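The three elements above can be imitated in a few lines of Python. This is only a rough sketch and not real Fortran semantics: the names fixed, general and write_table are hypothetical helpers invented for this illustration, and general follows the rule as stated above (fixed notation if it fits, exponent notation otherwise) rather than the exact Gw.d rules of the Fortran standard.

```python
def fixed(x, width=10, decimals=4):
    """Mimic Fortran's Fw.d: a field of asterisks when x does not fit."""
    text = f"{x:{width}.{decimals}f}"
    return text if len(text) <= width else "*" * width


def general(x, width=10, decimals=4):
    """Fixed notation if it fits in the field, exponent notation otherwise.

    Assumption: this follows the rule as stated in the text, not the
    exact Gw.d rules of the Fortran standard.
    """
    text = f"{x:{width}.{decimals}f}"
    if len(text) > width:
        text = f"{x:{width}.{decimals}e}"
    return text if len(text) <= width else "*" * width


def write_table(values, per_line=8, cell=fixed):
    """Reuse one cell format until the values run out, per_line per row."""
    values = list(values)
    rows = [values[i:i + per_line] for i in range(0, len(values), per_line)]
    return "\n".join("".join(cell(v) for v in row) for row in rows)


numbers = [i / 7 for i in range(100)]
lines = write_table(numbers).split("\n")
print(len(lines), len(lines[0]), len(lines[-1]))  # 13 80 40
print(repr(fixed(123456.789)))    # '**********'
print(repr(general(123456.789)))  # '1.2346e+05'
```

A matrix of any shape can then be printed with one loop over its rows, calling write_table(row) for each row. Changing the default per_line from 8 to 13 changes every table at once, which is the effect the single shared format statement gives in Fortran.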
Fortran 66 was fabulous when it came to printing numbers in tables. It was not very good at printing anything else.
Object Oriented Programming has introduced the idea of letting objects decide how they must be printed, and this is a huge advance when it comes to printing complex objects nicely. When I run a sophisticated regression model inside an R script, I know that one print command will produce an elegant page of output containing all important coefficients and statistical tests. Doing this in Fortran would be a nightmare.
Even if you have never programmed in C, chances are that you have used the printf format specification. In some form or other, a version of sprintf is the way to print something exactly the way you want it in a host of languages, Ruby among them. For example, a C++ programmer would typically use the << operator to send something to an output stream; but once in a while, the programmer needs to do an sprintf and then send the result to the output stream using <<. In many ways, printf is a huge advance over the Fortran format specification in terms of the options available. If you want to print just one item at a time, printf will give you a level of control far beyond what is available anywhere else. Correspondingly, the printf library is quite large, and I have seen C programmers avoid using it altogether to keep their executable file size small.
But printf is meant for formatting single numbers, and it is quite inconvenient for tabular output. The lack of a repetition count and the other goodies of Fortran 66 makes it really painful to print a table nicely.
There is one thing that Excel (and other spreadsheet software) does well when it comes to formatting numbers, and that is the % format. Particularly in finance, humans like to write and read parameters like interest rates as 5%, but the computer needs to interpret this as 0.05. In most languages, that means multiplying by 100 on output and dividing by 100 on input. This is error prone: in one case, getting this wrong caused losses of $216 million and forced the founder of the company to pay a multi-million dollar fine and resign from the company. In finance, reading and writing percentages correctly is a big pain point, and Excel makes this much easier with its % format. None of the other languages that I use has this facility.
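For what it is worth, the output half of this round trip does have some support outside spreadsheets: Python's format specification has a % presentation type that multiplies by 100 and appends the sign. The input half still needs a small helper. The sketch below shows both; parse_percent and format_percent are hypothetical names chosen for this illustration, not part of any library.

```python
def parse_percent(text):
    """Convert a string such as '5%' or '5.25 %' to a fraction (0.05, 0.0525)."""
    cleaned = text.strip()
    if not cleaned.endswith("%"):
        # Refuse bare numbers so that 5 and 0.05 cannot be confused.
        raise ValueError(f"expected a trailing % sign: {text!r}")
    return float(cleaned[:-1]) / 100


def format_percent(rate, decimals=2):
    """Render a fraction such as 0.05 as '5.00%' using the % presentation type."""
    return f"{rate:.{decimals}%}"


rate = parse_percent("5%")
print(rate)                  # 0.05
print(format_percent(rate))  # 5.00%
```

Insisting on the % sign at input is a deliberate choice in this sketch: the costly mistakes come from not knowing whether a bare 5 means 5 or 0.05.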
In most other ways, formatting a table of numbers in Excel is a big pain, but its treatment of the % issue merits its mention here.
R data.frames and Python Pandas DataFrames are a way to represent a table as a data structure and manipulate it easily. These data structures can be printed with a single command provided you are content with the default formatting styles. If you want to customize the printing (for example, 3 decimal places in one column and 5 decimal places in another), it takes a lot of effort. And to think that it would have taken only one line in Fortran 66! Fourth generation languages make it easy to compute on arrays, but in half a century, we have only regressed when it comes to printing the array.
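To make the "one line" concrete, here is a rough Python sketch of what a per-column format amounts to: one format spec per column, applied to every row. It works on plain tuples rather than a DataFrame, and format_rows is a hypothetical helper written for this illustration. Pandas users can get a similar effect through the formatters argument of DataFrame.to_string.

```python
def format_rows(rows, specs):
    """Format every row with one format spec per column."""
    return "\n".join(
        "".join(format(value, spec) for value, spec in zip(row, specs))
        for row in rows
    )


rows = [(1.0, 0.123456), (2.5, 12.3456789)]
# 3 decimal places in the first column, 5 in the second.
print(format_rows(rows, ["10.3f", "12.5f"]))
#      1.000     0.12346
#      2.500    12.34568
```

The list of specs plays the role of the Fortran format statement: the column layout lives in one place and the data never has to know about it.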
When I really need to print a data.frame nicely, I use the xtable package to convert the data.frame to an HTML table or a LaTeX table. Or I use the latex function in the Hmisc package to produce LaTeX output. Even then, I sometimes need to use the numprint package in LaTeX to fine-tune the formatting.
Surely, there ought to be a better way to do all this.