Migrating from R to Python

Many years ago, I shifted from Microsoft Excel and LibreOffice Calc to R data frames as my primary spreadsheet tool. This was one of the earliest steps in my ongoing move from bloat to minimalism (see my three blog posts on this process). Shifting to R yielded many benefits:

  • Greater readability and maintainability
  • Version control
  • Reusable code
  • Dynamic generation of reports and presentations from computed data using LaTeX and knitr
  • Production-quality graphics and charts using plain R graphics and, more importantly, ggplot
  • Access to a comprehensive library of statistical and quantitative finance tools written in R

Over the last few months, I have been shifting from R to Python for most of my work. The primary reason for this change is that Python is a full-fledged programming language, whereas R is primarily a statistical language that has been extended to do a lot of other things. A few years ago (when I first shifted to R), Python was totally unsuitable for use as a spreadsheet because the language was primarily designed to work with scalars rather than vectors and matrices. But in recent years, the Python tool set (NumPy, SciPy, pandas, matplotlib, statsmodels, scikit-learn) has developed rapidly and now goes beyond the capabilities of R in many respects. Jake VanderPlas’s keynote talk at the SciPy 2015 conference is an excellent introduction to this entire set of tools. Overall, I am very happy with the pandas implementation of data frames based on NumPy arrays; the best features of R have been preserved.
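
As a rough illustration of what using a pandas data frame as a spreadsheet can look like, here is a minimal sketch. The column names and figures are invented purely for the example; only the pandas calls themselves are real.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
sheet = pd.DataFrame(
    {
        "item": ["A", "B", "C"],
        "quantity": [10, 4, 7],
        "unit_price": [2.5, 10.0, 1.2],
    }
)

# A computed column, like a spreadsheet formula filled down a column.
sheet["amount"] = sheet["quantity"] * sheet["unit_price"]

# A column total, like a SUM() cell at the bottom of the column.
total = sheet["amount"].sum()

print(sheet)
print("Total:", round(total, 2))
```

Unlike a spreadsheet formula, the calculation lives in a plain text file that can be read, diffed and put under version control.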

Many of the original reasons for moving from spreadsheets to R now favour Python over R.

  • Readability and maintainability: Readability is a subjective judgement, but in my view, Python easily outshines R in this respect. Maintainability of Python code is much greater because of excellent refactoring tools like rope.
  • Reusable code: I find Python imports superior. To do the equivalent in R, one has to convert the R script into a package, and this is worthwhile only for fairly large pieces of code. Most of the time, in R, I ended up using source with its attendant pollution of the global namespace. With Python, I am often importing small (say 15 lines) Python modules because any Python file can be treated as a module by putting (or symlinking) it somewhere in the module search path.
  • Dynamic document generation: I am quite satisfied with Pweave. It is not as elaborate as knitr, but it meets my needs well, and is often faster and more elegant. It is true that knitr supports other languages including Python, but I find Pweave much better.
  • High quality graphics: Matplotlib is far harder to learn than ggplot, but it is also much more powerful. In interactive use, Matplotlib is superior (for example, interactively rotating 3D surface plots).
  • Statistical modelling: Here of course, R is far ahead. It is not very often that I am running a complex cointegration model, but when I do, I must turn to R. The nice thing is I can do this without leaving Python by using the rpy2 module.
  • Quantitative finance tools: Here I think R still has the edge (Rmetrics for example) though the Python interface to QuantLib is better than the corresponding R interface.
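
The point about small modules can be sketched as follows. The module name tiny_discount_tools and its function are hypothetical, and the temporary directory merely stands in for whatever directory on the module search path one would actually use; in practice the file would simply be saved (or symlinked) there rather than written from code.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for a personal directory of small helper modules (an assumption).
module_dir = Path(tempfile.mkdtemp())

# A tiny, hypothetical module: one plain Python file is all it takes.
(module_dir / "tiny_discount_tools.py").write_text(
    "def discount_factor(rate, years):\n"
    "    return (1 + rate) ** -years\n"
)

# Put the directory on the module search path and import the file by name.
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()
import tiny_discount_tools

# The function stays inside the module's namespace, so nothing leaks into
# the global namespace the way it does with R's source().
value = tiny_discount_tools.discount_factor(0.10, 2)
print(value)
```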

What I miss in Python is a whole lot of syntactic sugar. For example, to compute the discount factors at 10% interest rate for years 1 to 4, the R code is short and clear:

1.1^-(1:4)

The Python equivalent is longer and somewhat obscure:

import numpy as np
1.1**-np.arange(1,5) 

The problem is that vector arithmetic is an afterthought in Python, and while NumPy provides the requisite functionality, it cannot change the syntax of the underlying language. Most of the time, operator overloading can provide the illusion of native functionality, but once in a while, the illusion breaks and NumPy notation becomes clumsy.
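
The post does not say which cases it has in mind, so the following is only a sketch of two places where I would guess the illusion breaks: a plain Python list does not support vector arithmetic at all, and the keywords and/or cannot be overloaded, so NumPy has to fall back on & and | with extra parentheses.

```python
import numpy as np

years = [1, 2, 3, 4]  # a plain Python list, not a NumPy array

# Case 1: arithmetic on a list fails; it must first become an array.
try:
    1.1 ** -years
    list_error = None
except TypeError as exc:
    list_error = exc

factors = 1.1 ** -np.asarray(years)

# Case 2: "and" cannot be overloaded, so it fails on arrays.
try:
    selected = factors[(factors > 0.7) and (factors < 0.9)]
    and_error = None
except ValueError as exc:
    and_error = exc

# The NumPy spelling uses "&" and needs the parentheses.
selected = factors[(factors > 0.7) & (factors < 0.9)]

print(type(list_error).__name__, type(and_error).__name__)
print(selected)
```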

This is a small price to pay for the benefits of using a full-fledged programming language with a comprehensive set of libraries for everything from web scraping to natural language processing. Moreover, Python is faster and copes better with large data sets. Finally, when it comes to interactivity using the web browser as a front end, I find Jupyter notebooks easier than R Shiny.
