Working toward open and reproducible calculations

In Communication, Data, General by Nat

Figure: The spectrum of reproducibility. Taken from Peng (2011), DOI: 10.1126/science.1213847

Last post, Andree wrote about archiving data with NODC. I’d like to share something in a similar vein that I think is important not just for polar scientists, but for researchers of all disciplines. Much of modern science relies heavily on computation, whether for analysing data, evaluating analytical results, or running full-scale general circulation models. One problem with this reliance on complex computer codes and large datasets is that published results can be very difficult to reproduce. I’ve spent a lot of time trying to repeat existing experiments, and even when there is a carefully written methods section, doing so is often time-consuming and sometimes nearly impossible.
In theory, these difficulties can be reduced by providing well-documented code and data along with publications. In practice, there are a number of obstacles:

  • Licenses on data or software may prohibit distribution
  • Researchers may be embarrassed by their code
  • Code may be poorly documented, difficult to run, and challenging to distribute

Of these problems, I think that there has been important progress in addressing the last. Recently, the journal Nature hosted a neat example showing how code can be provided over the internet in an executable format. They did this by using an IPython Notebook, which is a piece of software developed by volunteers and academics to combine program code and annotations as a document.
An IPython Notebook allows you to write code combined with explanations, links, and LaTeX formulas through a web browser. Notebooks can easily be backed up, e-mailed, and potentially published to show exactly how a particular computation was done. What’s interesting about the Nature demonstration is that it shows how a Notebook can be hosted remotely and made available for anyone to experiment with.
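Part of the reason Notebooks are so easy to back up and e-mail is that a Notebook file is just a JSON text document listing its cells. Here is a minimal sketch of that idea using only the Python standard library. It assumes the version 4 notebook format used by current Jupyter releases, and the helper names (make_markdown_cell, make_code_cell, make_notebook) are my own for illustration, not part of any library.

```python
import json


def make_markdown_cell(text):
    """Build a cell holding explanation, links, or LaTeX (hypothetical helper)."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def make_code_cell(code):
    """Build a cell holding code that has not been run yet (hypothetical helper)."""
    return {
        "cell_type": "code",
        "execution_count": None,  # None means the cell has not been executed
        "metadata": {},
        "outputs": [],
        "source": code,
    }


def make_notebook(cells):
    """Wrap a list of cells in the top-level notebook structure (format 4.4 assumed)."""
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


notebook = make_notebook([
    make_markdown_cell("The area of a circle is $A = \\pi r^2$."),
    make_code_cell("import math\nprint(math.pi * 2.0 ** 2)"),
])

# Serialise to text: this string is what would be saved as a .ipynb file.
notebook_text = json.dumps(notebook, indent=1)

# Reading it back gives the same structure, which is why the file travels well.
restored = json.loads(notebook_text)
print([cell["cell_type"] for cell in restored["cells"]])
```

Because the whole document is plain text, it can be attached to an e-mail or kept under version control like any other source file.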
I think it’s pretty cool to imagine an IPython Notebook being included with a future paper that (for example) analyses regional climate model output over Greenland and draws some conclusions. The Notebook would download the necessary data and show exactly the calculations involved. This would allow other scientists or the general public to interactively experiment and build upon the results in the published paper.
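To make that a bit more concrete, here is a rough sketch of what the first cells of such a Notebook might do: check that the input data are exactly the published ones, then run the calculation in the open. Everything specific here is an assumption for illustration. The CSV contents are made-up numbers standing in for a downloaded file, and the column name and function names are hypothetical.

```python
import csv
import hashlib
import io
import statistics

# Made-up stand-in for a downloaded data file (illustrative values only).
RAW_CSV = b"year,melt_days\n2001,10\n2002,14\n2003,12\n"


def sha256_hex(data):
    """Return the SHA-256 checksum of some bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def load_verified(data, expected_sha256):
    """Parse CSV bytes into rows, refusing data that do not match the checksum."""
    actual = sha256_hex(data)
    if actual != expected_sha256:
        raise ValueError("checksum mismatch: got " + actual)
    reader = csv.DictReader(io.StringIO(data.decode("utf-8")))
    return list(reader)


def mean_of(rows, column):
    """Average one numeric column across all rows."""
    return statistics.mean(float(row[column]) for row in rows)


# In a real Notebook the expected checksum would be a constant published
# with the paper; here it is computed from the stand-in data instead.
EXPECTED_SHA256 = sha256_hex(RAW_CSV)

rows = load_verified(RAW_CSV, EXPECTED_SHA256)
print(mean_of(rows, "melt_days"))  # prints 12.0
```

The checksum step is a design choice rather than anything the Notebook requires: it means a reader who gets a different answer can quickly tell whether the data or the code is responsible.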
As a final note, the IPython Notebook was originally built by the scientific Python community. Its newest incarnation is being renamed Jupyter to reflect the fact that it now works with other languages common in science, such as R, Julia, and Perl.
Look for a future post from Clark describing how this problem has been approached by the R community.

Related reading:
Reproducible Research in Computational Science
The case for open computer programs