Why should I start using a source code repository?
A user story
Alan is a principal investigator in a university. He has created a suite of R scripts to perform the statistical analysis on his data. He has three computers (work, home and laptop), and each has its own copy of his R scripts. He spends 2 hours one evening tracking down a difference in results between his home and work PCs. He has recently submitted a paper to a journal, and the reviewer would like some further analysis. Alan spends 2 days trying to recreate the results, as he has not documented his process or which version of the scripts he used. Alan has a post-doc researcher, Beth. Beth has taken a copy of Alan's scripts and is now modifying them. Beth and Alan spend, on average, 1 hour a week trying to ensure that the copy each of them has is up to date. Alan's directory looks like this:
varianceDataSet1.R
varianceDataSetUS.R
varianceDataSetUK.R
varianceDataSetHomev1.R
varianceDataSetHomeBroken.R
varianceDataSet2v2.R
varianceDataSet2v3.R
varianceDataSet2v4.R
There are many articles on getting started with a source code repository but, in my experience, there is a bigger question to answer first.
What is a source code repository, and why should I spend any effort learning how to use one?
I intend to discuss this question in this post, and hope that readers will see that they do indeed need a source code repository and understand why they need one.
Target Audience
- Academics and researchers writing code
- Those whose primary job is not writing code, but who still create and maintain source code
- Managers of the above
Source code and repositories
Source code can be any sort of program code: C, R, FORTRAN, Perl, Python, MATLAB and so on. Documents such as DocBook XML files and LaTeX files can also be considered source code.
A source code repository is a specialised piece of software designed to assist with the maintenance of source code. There are a great many open source and commercial systems available.
git is a popular open source version control system which, after much exploration, Certus Technology has adopted as its preferred version control system. There is also a Wikipedia entry on git.
On top of git, it is possible to run a web-based repository manager. These systems typically provide user management, access control and a web portal for viewing repositories and their metadata. GitLab and gogit are open source systems that you can host yourself. GitHub and Bitbucket are commercial systems that offer free and paid accounts.
Some of the features you can expect from a source code repository are:
- tracking changes to files, including the time of change and the author
- controlling access to the repository
- comparing different revisions of files
- tagging files at a certain point
Some things that are simplified by using a source code repository are:
- developing code on more than one computer
- more than one person developing the same code at the same time
- marking a version of the code, for example as the one used for a journal paper, or as the last known working copy
- reverting changes that now break your analysis
- recording changes over time
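To make these ideas a little more concrete, here is a minimal sketch in Python of what a repository records. It is a toy in-memory model of a single file's history, not how git works internally; the class names, method names, tag name and the R snippets inside the strings are all made up for illustration. In git itself, the corresponding everyday commands are `git commit`, `git log`, `git diff`, `git tag` and `git revert`.

```python
import difflib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Revision:
    """One recorded change: who made it, when, why, and the file content."""
    number: int
    author: str
    timestamp: datetime
    message: str
    content: str


@dataclass
class ToyRepository:
    """Hypothetical history of a single file (illustration only, not git)."""
    revisions: list = field(default_factory=list)
    tags: dict = field(default_factory=dict)

    def commit(self, author, message, content):
        """Record a new revision together with its author and time of change."""
        rev = Revision(len(self.revisions) + 1, author,
                       datetime.now(timezone.utc), message, content)
        self.revisions.append(rev)
        return rev.number

    def get(self, ref):
        """Look up a revision by its number or by a tag name."""
        number = self.tags.get(ref, ref)
        return self.revisions[number - 1]

    def tag(self, name, number):
        """Mark a revision, e.g. the one used for a journal paper."""
        self.tags[name] = number

    def diff(self, old, new):
        """Compare two revisions line by line."""
        a = self.get(old).content.splitlines()
        b = self.get(new).content.splitlines()
        return list(difflib.unified_diff(a, b, str(old), str(new), lineterm=""))

    def revert_to(self, ref, author):
        """Undo later changes by recording the old content as a new revision."""
        old = self.get(ref)
        return self.commit(author, f"Revert to revision {old.number}", old.content)


# Example usage with made-up script contents.
repo = ToyRepository()
first = repo.commit("Alan", "Initial variance script",
                    "x <- read.csv('data.csv')\nvar(x$value)\n")
repo.tag("paper-submission", first)   # the version used for the paper
second = repo.commit("Beth", "Try standard deviation instead",
                     "x <- read.csv('data.csv')\nsd(x$value)\n")
changes = repo.diff("paper-submission", second)   # what changed since the paper?
third = repo.revert_to("paper-submission", "Alan")  # go back to the tagged version

for line in changes:
    print(line)
```

Note that reverting does not delete anything: the broken revision stays in the history, and the restored content is simply recorded as a further revision. That is why Alan would no longer need file names like varianceDataSetHomeBroken.R.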
Costs
- You have to learn how to use a repository
- You have to choose a repository and live with this choice. It may be hard to move to another one later.
- You may have to pay for software, but there are many open source and free products
- You may have to pay for hosting, although there are systems that allow you to host your own repositories
Benefits
- You are now using an industry standard approach to looking after source code. Even if you don't see the point yet, this fact should help spur you on.
- Alan's problems mostly evaporate
- You can spend more time on your code, and less on thinking about which version of which file you need.
An improved user story
Carol is a principal investigator in a university. She has been on an introductory course run by her faculty and has an account on the faculty GitLab deployment. She organises her work into git repositories and, after a bit of practice, has become familiar enough with git to be comfortable using it for all her coding. She has learnt about writing good commit messages, and she tags her repositories at important events. Her new post-doc researcher, Derrick, was able to start contributing to Carol's code base very quickly, as he was already familiar with git. Carol and Derrick's meetings are spent productively on what the code does, rather than on checking who has which version of the code. Carol and Derrick publish their repository on GitHub so anyone can access their code. They also generate a DOI for their code by publishing it as a research object, and they cite their code in journal articles.