CROSS-CUTTING

Version control for research

Systematic recording of changes to files over time, tracking who changed what and when, with recovery of any previous state. Git is the standard tool; it brings reproducibility to the computational side of research.

Extended definition

Version control is the systematic recording of changes to a set of files over time, keeping track of who changed what and when, so that any previous state can be recovered. In research, this means treating code, data, figures, notes, and the manuscript itself as a traceable history, rather than as files overwritten with each save. The standard tool is Git, a distributed system in which each copy of the repository contains the full history; platforms such as GitHub and GitLab host these repositories and add collaboration features on top. Blischak and colleagues (2016) provide a standard introduction to the basic workflow: recording changes in commits with descriptive messages, branching to experiment without breaking the main line of work, and synchronizing with a remote repository. Ram (2013) argues that Git is especially suited to science because it gives a lightweight, robust framework for managing the whole set of research outputs, linking version control directly to reproducibility and transparency.
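
The basic workflow described above (commit, branch, commit again) can be sketched in Python by driving the Git command line through `subprocess`. This is a minimal illustration, not a recommended way of working: it assumes Git is installed and on the PATH, and the file name `analysis.py`, the branch name `try-robust-model`, and the author identity are hypothetical placeholders. Synchronizing with a remote appears only as a comment, since it needs a hosted repository.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Hypothetical identity and settings so the sketch runs without global Git config.
OPTIONS = [
    "-c", "user.name=Example Researcher",
    "-c", "user.email=researcher@example.org",
    "-c", "commit.gpgsign=false",
]


def git(repo: Path, *args: str) -> str:
    """Run one Git command inside `repo` and return its trimmed output."""
    result = subprocess.run(
        ["git", *OPTIONS, *args],
        cwd=repo, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def demo_workflow() -> dict:
    """Create a throwaway repository, commit, branch, and commit again."""
    if shutil.which("git") is None:
        return {"available": False}

    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git(repo, "init")

        # First commit: record the initial analysis script with a clear message.
        script = repo / "analysis.py"
        script.write_text("print('baseline model')\n")
        git(repo, "add", "analysis.py")
        git(repo, "commit", "-m", "Add baseline analysis script")

        # Branch to experiment without touching the main line of work.
        git(repo, "checkout", "-b", "try-robust-model")
        script.write_text("print('robust model')\n")
        git(repo, "commit", "-am", "Switch to robust model on experimental branch")

        # With a hosted repository configured as `origin`, the branch could be
        # shared with: git push -u origin try-robust-model

        return {
            "available": True,
            "branch": git(repo, "rev-parse", "--abbrev-ref", "HEAD"),
            "messages": git(repo, "log", "--format=%s").splitlines(),
        }
```

Calling `demo_workflow()` returns the current branch and the commit messages, newest first, which is the traceable history the definition refers to.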

When it applies

Version control applies to any research with a computational component, from the analysis script to the article’s text. It applies to reproducibility: recording the exact state of the code that produced a result is what lets another researcher, or yourself months later, redo the analysis. Sandve and colleagues (2013) include version control of all custom scripts among their ten simple rules for reproducible computational research. It applies to collaboration, where several people edit the same files without overwriting each other’s work, with conflicts resolved explicitly. It applies to transparency, by allowing the history to be published alongside the article, and to error recovery, by making it trivial to return to a state that worked. It also applies to data and notes, not only code.
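
The reproducibility point above, recording the exact state of the code that produced a result, can be sketched as follows. This is one possible approach under stated assumptions: the analysis runs inside a Git working copy, and the function names `code_state` and `stamp_result` and the example value are hypothetical. Outside a repository, or without Git installed, the sketch falls back to `"unknown"` rather than failing.

```python
import json
import subprocess
from datetime import datetime, timezone


def run_git(*args: str):
    """Return the output of a Git command, or None if Git cannot answer."""
    try:
        out = subprocess.run(
            ["git", *args], check=True, capture_output=True, text=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip()


def code_state() -> dict:
    """Describe the code version: commit hash plus a flag for unsaved edits."""
    commit = run_git("rev-parse", "HEAD")
    status = run_git("status", "--porcelain")
    return {
        "commit": commit or "unknown",
        # A commit hash only identifies the code if nothing is left uncommitted.
        "uncommitted_changes": None if status is None else bool(status),
    }


def stamp_result(result: dict) -> dict:
    """Attach the code state and a UTC timestamp to an analysis result."""
    return {
        "result": result,
        "code": code_state(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Usage: store the stamped record next to the output it describes.
record = stamp_result({"estimate": 0.42})  # 0.42 is a made-up value
print(json.dumps(record, indent=2))
```

Saving this record with each figure or table makes it possible to check out the same commit later and rerun the analysis from the same code.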

When it does not apply

Version control with Git does not apply well to large binary files, such as high-resolution images or bulky datasets, which bloat the repository; in those cases, extensions such as Git LFS or dedicated data repositories are the better route. It does not apply as a substitute for backup: a history kept in a single place is still lost if the disk fails, and versioning is not replication. It does not apply without commit discipline: large, infrequent commits with vague messages hollow out the traceability benefit. It does not apply as a tool for managing sensitive data without care, since committed secrets and personal data stay in the history even after being deleted from the file. And it does not apply as an end in itself: versioning does not make research reproducible on its own; it is one of the conditions, not the full guarantee.
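
Two of the limits above, large files and committed secrets, are easier to prevent than to undo. The following is a rough sketch of a check that could be run on files before they are committed, for example from a Git pre-commit hook. The 5 MB threshold, the patterns, and the function name `check_file` are illustrative assumptions; the patterns will miss many real secrets and should not be read as a complete scanner.

```python
import re
from pathlib import Path

# Assumed size limit; pick one that suits the project.
MAX_BYTES = 5 * 1024 * 1024

# Illustrative patterns only: a private key header and simple key=value secrets.
SECRET_PATTERNS = [
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"(?i)\b(password|api[_-]?key|secret|token)\b\s*[:=]\s*\S+"),
]


def check_file(path: Path, max_bytes: int = MAX_BYTES) -> list[str]:
    """Return a list of reasons why `path` should not be committed."""
    size = path.stat().st_size
    if size > max_bytes:
        # Large files are reported by size alone and not scanned as text.
        return [f"{path.name}: {size} bytes exceeds the {max_bytes}-byte limit"]

    problems = []
    text = path.read_text(errors="ignore")
    for line_number, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            problems.append(f"{path.name}:{line_number}: possible secret")
    return problems
```

A hook built on this idea would refuse the commit whenever any staged file returns a non-empty list, which keeps the problem out of the history instead of trying to remove it afterwards.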

Applications by field

  • Computational research and data science: versioning of analysis code, linking each result to the state that produced it.
  • Bioinformatics: management of pipelines and scripts, a field where the practice consolidated early.
  • Quantitative social sciences: control of cleaning and modeling scripts, with a publishable history.
  • Collaborative writing: versioned manuscripts and technical documents, with parallel editing and no overwriting.

Common pitfalls

The first pitfall is confusing versioning with backup: Git records the history but does not protect against the loss of the single place where it lives. The second is committing large or binary files to the main repository, bloating the history. The third is a lack of commit discipline: infrequent commits and vague messages cancel the traceability that justifies the tool. The fourth is committing secrets or sensitive data, which remain in the history even after being removed from the current file. The fifth is treating version control as a guarantee of reproducibility, when it is only one of the pillars: the environment, data, and dependencies must also be tracked for the result to actually be reproduced.
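
The fifth pitfall points to what version control leaves out: the environment, the data, and the dependencies. One way to capture them alongside the code is a small manifest that can itself be committed. This is a sketch under stated assumptions, using only the Python standard library; the function names and the manifest layout are hypothetical, and dedicated environment or data-versioning tools cover the same ground more thoroughly.

```python
import hashlib
import platform
import sys
from importlib import metadata
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large datasets do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def environment_manifest(data_files) -> dict:
    """Describe the interpreter, installed packages, and input data files."""
    packages = sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip distributions with broken metadata
    )
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": packages,
        # A checksum identifies the data without storing it in the repository.
        "data": {str(p): sha256_of(Path(p)) for p in data_files},
    }
```

Writing the manifest to a small text file and committing it with the analysis ties each result to the code, the package versions, and the exact input files, even when the data itself lives outside the repository.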
