Responding to Volkswagen: open data and open source

Earlier this week, I wrote a piece in the Mint arguing that when big firms such as Volkswagen use software to cheat their customers, the regulatory response should focus on open data and open source so that consumers can verify whatever the big firms are telling them. Since writing this piece, I have been wondering whether it will ultimately be necessary to rely on smart contracts residing on a blockchain to deter such frauds. I have not fully thought this through. In the meantime, below is my Mint piece:

Distrust and cross-check

The implications of big firms such as Volkswagen using software to cheat their customers go far beyond a few million diesel cars

The Volkswagen emissions scandal challenges us to move beyond Ronald Reagan’s favourite Russian proverb “trust but verify” to a more sceptical attitude: “distrust and cross-check”.

A modern car is reported to contain a hundred million lines of code to deliver optimised performance. But we learned last month that all this software can also be used to cheat. Volkswagen had installed cheating software in its diesel cars so that the cars appeared to meet emission standards in the lab while switching off the emission controls to deliver fuel economy on the road.

The shocking thing about Volkswagen is that (unlike, say, Enron) it is not perceived to be a significantly more unethical company than its peers. Perhaps the interposition of software makes the cheating impersonal, and allows managers to psychologically distance themselves from the crime. Individuals who might hesitate to cheat personally might have fewer compunctions in authorizing the creation of software that cheats.

The implications of big corporations using software to cheat their customers go far beyond a few million diesel cars. We are forced to ask whether, after Volkswagen, any corporate software can be trusted. In this article, I explore the implications of distrusting the software used by big corporations in the financial sector:

Can you trust your bank’s software to calculate the interest on your checking account correctly? Or might the software be programmed to check your Facebook and LinkedIn profiles to deduce that you are not the kind of person who checks bank statements meticulously, and then switch on a module that computes the interest due to you at a lower rate?

Can you be sure that the stock exchange is implementing price-time priority rules correctly or might the software in the order matching engine be programmed to favour particular clients?

Can you trust your mutual fund’s software to calculate Net Asset Value (NAV) correctly? Or might the software be programmed to understate the NAV on days when there are lots of redemptions (and the mutual fund is paying out the NAV) while overstating the NAV on days of large inflows when the mutual fund is receiving the NAV?

Can you be sure that your credit card issuer has not programmed the software to deliberately add surcharges to your purchases? Perhaps, if you complain, the surcharges will be promptly reversed, but the issuer makes a profit from those who do not complain.

Can you trust the financials of a large corporation? Or could the accounting software be smart enough to figure out that it is the auditor who has logged in, and accordingly display a set of numbers different from what the management sees?

After Volkswagen, these fears can no longer be dismissed as mere paranoia. The question today is how we, as individuals, can protect ourselves against software-enabled corporate cheating. The answer lies in open source software and open data. Computing is cheap, and these days each of us walks around with a computer in our pocket (though we choose to call it a smartphone instead of a computer). Each individual can, therefore, well afford to cross-check every computation if (a) the requisite data is accessible in machine-readable form, and (b) the applicable rules of computation are available in the form of open source software.

Financial sector regulations today require both the data and the rules to be disclosed to the consumers. What the rules do not do is require the disclosures to be computer friendly. I often receive PDF files from which it is very hard to extract data for further processing. Even where a bank allows me to download data as a text or CSV (comma-separated value) file, the column order and format change often, and the processing code needs to be modified every time this happens. This must change. It must be mandatory to provide data in a standard format or in an extensible format like XML. Since the data comes from a computer database anyway, the bank or financial firm can provide machine-readable data to the consumer at negligible cost.
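To illustrate why a stable, named format matters, here is a minimal Python sketch of statement-processing code that looks columns up by header name rather than by position, so that a reordered or widened CSV file does not break it. The column names (date, description, amount) and the function read_statement are hypothetical; no real bank's export format is implied.

```python
import csv
import io
from decimal import Decimal

# Hypothetical column names; a real statement format would define its own.
REQUIRED = ("date", "description", "amount")


def read_statement(text):
    """Parse a CSV statement by header name, ignoring column order and extras."""
    rows = csv.reader(io.StringIO(text))
    header = [name.strip().lower() for name in next(rows)]
    missing = [name for name in REQUIRED if name not in header]
    if missing:
        raise ValueError(f"statement is missing columns: {missing}")
    # Map each required column name to its position in this particular file.
    index = {name: header.index(name) for name in REQUIRED}
    return [
        {
            "date": row[index["date"]],
            "description": row[index["description"]],
            "amount": Decimal(row[index["amount"]]),
        }
        for row in rows
        if row  # skip blank lines
    ]
```

Two files that carry the same transactions in different column orders parse to the same list of records. Code like this only helps when the header names themselves stay stable, which is exactly what a mandated standard format would guarantee.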

When it comes to rules, disclosure is in the form of several pages of fine print legalese. Since the financial firm has to implement the rules in computer code anyway, there is little cost to requiring that the computer code be made freely available to the consumer. It could be Python code, as the US SEC proposed five years ago in the context of mortgage-backed securities, or it could be in any other open source language that does not require the consumer to buy an expensive compiler to run the code.
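As a rough idea of what a rule published as code might look like, the Python sketch below recomputes the interest on an account and compares it with the amount the bank credited. The rule itself (simple interest on each day's closing balance, a 365-day year, rounding to the nearest cent) and all function names are assumptions made for illustration; an actual bank's rule would be whatever its published code says.

```python
from datetime import date, timedelta
from decimal import Decimal, ROUND_HALF_UP


def daily_balances(opening, transactions, start, end):
    """Yield the closing balance for each day from start to end inclusive."""
    by_day = {}
    for day, amount in transactions:
        by_day[day] = by_day.get(day, Decimal("0")) + amount
    balance = opening
    day = start
    while day <= end:
        balance += by_day.get(day, Decimal("0"))
        yield balance
        day += timedelta(days=1)


def expected_interest(opening, transactions, start, end, annual_rate):
    """Assumed rule: simple interest on daily closing balances, 365-day year."""
    total = sum(daily_balances(opening, transactions, start, end), Decimal("0"))
    interest = total * annual_rate / Decimal("365")
    return interest.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def cross_check(credited, expected, tolerance=Decimal("0.01")):
    """Return True if the credited interest matches our own computation."""
    return abs(credited - expected) <= tolerance
```

A consumer would feed in the opening balance and transactions from the machine-readable statement, call expected_interest for the period, and pass the result to cross_check together with the interest actually credited. A False result is a reason to ask the bank questions, not proof of cheating.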

In the battle between the consumer and the corporation, the computer is the consumer’s best friend. Of course, the big corporation has far more powerful computers than you and I do, but it needs to process data of millions of consumers in real time. You and I need to process only one person’s data and that too at some leisure and so the scales are roughly balanced if only the regulators mandate that corporate computers start talking to consumers’ computers.

Volkswagen is a wake-up call for all financial regulators worldwide. I hope they heed the call.


Saying no to spreadsheets

The spreadsheet was the first and most important business application of the personal computer. Corporate America started buying PCs 35 years ago to run VisiCalc; they paid a couple of thousand dollars for an Apple computer only to run the hundred-dollar spreadsheet program. VisiCalc gave way to Lotus 1-2-3, which in turn was supplanted by Microsoft Office (Excel). Of all the software that runs on the PC, the spreadsheet is the hardest to supplant. Those who try to sell Linux computers to businesses find that the biggest stumbling block is Excel. LibreOffice provides most of the functionality of Microsoft Office; LibreOffice Calc can work seamlessly with most simple Excel spreadsheets, but its macro languages are not compatible with Excel’s Visual Basic, and many businesses have tens of thousands of lines of Visual Basic code.

Yet spreadsheets are under attack even in the business world. They are notoriously error-prone and hard to debug and maintain. There is a whole website devoted to spreadsheet horror stories. Research has shown that over 5% of spreadsheet formulas contain errors, and that about 95% of all spreadsheets contain at least one error. The error rate in spreadsheet formulas is roughly twice the error rate in conventional programming. More importantly, programming errors are corrected through heavy testing: often a third or more of programming effort goes to testing. Spreadsheets are not only subjected to much less testing, they are also very hard to test. Spreadsheets are not easy to document, and they do not have good version control and collaboration tools. All this means that large spreadsheets are almost impossible to test and maintain. In fact, when Section 404 of the Sarbanes-Oxley Act in the US required companies to document, test and disclose weaknesses in their internal control processes, many organizations started giving up the use of spreadsheets for most financial reporting purposes.

Personally, what drove me to get rid of spreadsheets was the difficulty of documentation and version control. I had often needed to revisit a disused spreadsheet developed five years earlier, and had found this to be an absolute nightmare because of poor documentation. I had also observed that I often had a dozen variants of the same spreadsheet, and it was very difficult to keep track of what had changed from one version to the next. Neither of these problems exists in my programming source code, which is usually well documented and version controlled.

A few years ago, I started migrating my spreadsheets to the open source programming language R. R has a data structure called a data frame, with row and column names, that provides much of the functionality of a spreadsheet table. (This is where R scores over Python-NumPy, which I also use a lot in other contexts.) Operations in R are implicitly vectorized (they act on whole columns of a data frame) without explicit loops. This parallels (but is much more powerful than) the spreadsheet idea of copying a formula across several rows or columns instead of writing an explicit loop. R has many more mathematical and statistical functions than any spreadsheet software, and its graphing abilities (especially with packages like ggplot2) would put any spreadsheet to shame. Numerous open source R packages extend the capabilities of the language in many different directions.

Another big benefit of using R is the seamless integration with documents and presentations using knitr. It amazes me that MS Office and LibreOffice (which are designed to integrate spreadsheets, documents and presentations in a single suite) do a rather bad job of this integration. On the other hand, R, LaTeX, Beamer and Markdown were designed completely independently of each other, yet knitr integrates all of them splendidly.

My migration plan was very simple. First, I stopped adding to the stock of spreadsheets by deciding that no new spreadsheets would be created; all new development would be in R. Second, to deal with the legacy of old spreadsheets, I adopted a rule that old spreadsheets would not be modified; all modifications would happen only after conversion to R. As I began migrating, I realized that some financial functions related to present values and bond valuation were not readily available in R, and I wrote my own package for this. I also converted my Visual Basic code for other financial functions (for example, Black-Scholes option pricing) to R. This package can be installed from CRAN (Comprehensive R Archive Network); the source code is hosted on GitHub.
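To give a flavour of the kind of functions involved, here is a minimal sketch of a present value calculation and the textbook Black-Scholes formula for a European call on a non-dividend-paying stock. It is written in Python rather than R, it is not the code of the package mentioned above, and the function names are made up for this illustration.

```python
from math import erf, exp, log, sqrt


def present_value(cashflows, rate):
    """Discount (time_in_years, amount) pairs at an annually compounded rate."""
    return sum(amount / (1 + rate) ** t for t, amount in cashflows)


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))


def black_scholes_call(spot, strike, rate, sigma, maturity):
    """European call price; rate is continuously compounded, no dividends."""
    d1 = (log(spot / strike) + (rate + 0.5 * sigma ** 2) * maturity) / (
        sigma * sqrt(maturity)
    )
    d2 = d1 - sigma * sqrt(maturity)
    return spot * norm_cdf(d1) - strike * exp(-rate * maturity) * norm_cdf(d2)


def black_scholes_put(spot, strike, rate, sigma, maturity):
    """European put price obtained from the call via put-call parity."""
    call = black_scholes_call(spot, strike, rate, sigma, maturity)
    return call - spot + strike * exp(-rate * maturity)
```

For example, present_value([(1, 5.0), (2, 105.0)], 0.05) prices a two-year 5% annual coupon bond at a 5% yield, which should come out at par. Unlike a spreadsheet cell formula, each function here can be documented, put under version control and tested in isolation.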

Since switching from spreadsheets to R a couple of years ago, I have not missed spreadsheets at all. On the other hand, on the few occasions that I am forced to use a spreadsheet (in the classroom for example), I miss the power of R. Over a period of time, I have tried to bring R or Python to the classroom, but that is a much more challenging task requiring a change in the computing habits of many other people. In the last couple of years, a few of my friends and colleagues have also switched to R, and none of them have switched back. Rather, they now wonder how they ever managed to do their work with spreadsheets.