And For My Next Trick...

This post originally appeared on the Software Carpentry website.

One of the things people have voted right up to the top of our poll is reproducible research. The phrase means different things to different people, but part of it must be accurately tracking the provenance of every bit of data a scientist touches: where it came from, what was done to it, what was done to the results, and so on. Once I've done the next episode of the Make lecture (on macros), I'd like to spend 10 minutes showing people how to do some of this by:

embedding Subversion keywords like $Id:$ in original files, and
having tools copy those IDs, plus their own version numbers and settings, into the files they generate.

For example, if the command line to generate summary-1.dat is:

stats.py --alpha 3.5 data-1-1.dat data-1-2.dat > summary-1.dat

then data-1-1.dat should contain the line:

$Id: data-1-1.dat 138 2010-09-08 21:30:43Z cdarwin $

which Subversion will update each time the file is checked in (provided the svn:keywords property is set on the file). data-1-2.dat should have a similar line, as should stats.py; in stats.py, the ID should be embedded in a string so that it can be printed:

# inside stats.py
version = "$Id: stats.py 71 2010-08-17 08:13:17Z aturing $"
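Because an expanded ID has a regular layout (file name, revision, date, time, author), a tool can pull the pieces apart if it wants to report them separately. Here is a minimal sketch of that; the names ID_PATTERN and parse_id are made up for illustration, and the pattern assumes the fields contain no spaces.

```python
import re

# Layout of an expanded Subversion Id keyword:
#   $Id: <filename> <revision> <date> <time> <author> $
# Assumption: no field contains whitespace.
ID_PATTERN = re.compile(
    r"\$Id: (?P<filename>\S+) (?P<revision>\d+) "
    r"(?P<date>\S+) (?P<time>\S+) (?P<author>\S+) \$"
)


def parse_id(text):
    """Return the fields of the first expanded Id keyword in text, or None."""
    match = ID_PATTERN.search(text)
    return match.groupdict() if match else None


# The ID string from stats.py in the example above.
version = "$Id: stats.py 71 2010-08-17 08:13:17Z aturing $"
fields = parse_id(version)
print(fields["filename"], fields["revision"], fields["author"])
```

An unexpanded keyword such as $Id:$ does not match the pattern, so parse_id returns None for a file that has never been committed with keyword substitution turned on.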

When stats.py runs, it copies its own ID string, its parameters (--alpha 3.5), and the ID strings of data-1-1.dat and data-1-2.dat into summary-1.dat. If some other program then processes summary-1.dat to create something else, it copies all of that information again, so that each file has a header with its complete ancestry. Yes, a proper provenance tool would be more robust and more flexible, but this technique will convey the idea, and implementing it is a good medium-sized exercise in Python.
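To make the idea concrete, here is a rough sketch of the header-building step. It is one possible shape for the exercise, not a finished design: the "# provenance: " marker, the function names collect and build_header, and the ID shown for data-1-2.dat are all invented for illustration, and the sketch assumes the data files treat lines starting with "#" as comments.

```python
import re

# Example ID string from the post; Subversion would rewrite it on each commit.
version = "$Id: stats.py 71 2010-08-17 08:13:17Z aturing $"

ID_PATTERN = re.compile(r"\$Id:[^$]*\$")  # an expanded $Id: ... $ keyword
TAG = "# provenance: "                    # hypothetical header-line marker


def collect(lines):
    """Return the provenance records found in one input file's lines."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(TAG):
            # Header line written by an earlier tool: carry it forward as-is.
            records.append(line[len(TAG):])
        else:
            match = ID_PATTERN.search(line)
            if match:
                # The file's own Subversion ID line.
                records.append(match.group(0))
    return records


def build_header(tool_id, args, inputs):
    """Build header lines: the tool's ID, its arguments, then each input's records."""
    header = [TAG + tool_id, TAG + "args: " + " ".join(args)]
    for lines in inputs:
        header.extend(TAG + record for record in collect(lines))
    return header


if __name__ == "__main__":
    # In-memory stand-ins for the two data files; the second ID is made up.
    data_1 = ["$Id: data-1-1.dat 138 2010-09-08 21:30:43Z cdarwin $\n", "1.0 2.0\n"]
    data_2 = ["$Id: data-1-2.dat 139 2010-09-08 21:31:02Z cdarwin $\n", "3.0 4.0\n"]
    args = ["--alpha", "3.5", "data-1-1.dat", "data-1-2.dat"]
    for line in build_header(version, args, [data_1, data_2]):
        print(line)
```

Because collect carries forward any line that already starts with the marker, a second tool that reads summary-1.dat and calls build_header with its own ID ends up with its own two lines followed by everything stats.py wrote, which is the "complete ancestry" described above.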

Thoughts?

(And if you're interested in reproducible research, you'll probably enjoy a recently-published declaration of principles drawn up by Victoria Stodden and others.)