git for data analysis – part I: putting data under version control?
In the last post, I already mentioned some of the advantages of version control in general, and of git in particular. However, git was originally developed to facilitate collaboration on large-scale open-source projects, and hence is not explicitly designed for data analysis. Its central underlying concept, namely that every file you ever put under version control remains in the repository forever, can work against you in data analysis if you do not adjust your workflow accordingly. Imagine a project with data files of significant size that change frequently. If you put these files under version control, you effectively store not only the latest version of your data, but implicitly also all previous versions. As a consequence, the disk space required by your project will soon become bloated. In the next posts, we will therefore gradually derive a git workflow that is particularly tailored to data analysis.
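To see this effect in isolation, here is a small throw-away sketch (repository and file names are made up): committing two versions of a dummy data file leaves both versions inside the .git directory.

```shell
# Hypothetical throw-away repository: every committed version of a
# data file remains stored inside .git.
mkdir bloat-demo && cd bloat-demo
git init -q
git config user.email "you@example.com"
git config user.name "You"
head -c 1000000 /dev/urandom > data.bin   # ~1 MB of dummy "data"
git add data.bin
git commit -q -m "data v1"
head -c 1000000 /dev/urandom > data.bin   # the data changes completely
git commit -q -am "data v2"
git count-objects -vH                     # .git now stores both versions
```

Even deleting data.bin from the working directory and committing again would not shrink the repository: both blobs remain part of its history.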
1 Putting data under version control?
As mentioned, data of significant size that changes frequently can bloat your disk usage under version control. A first way to cope with this problem is to synchronize some of the data outside of the repository, for example with a separate synchronization tool like Dropbox. Depending on your needs – the frequency of data changes and the size of your data – you have to decide to what degree data finds its way into the repository. However, I strongly encourage you to commit your data at least at some landmark points of your project. For example, imagine that you use the same underlying code base for two different publications: a working paper, and a larger project like a PhD thesis. After finishing the working paper, you may want to update your data to more recent observations for the larger project. At the same time, however, you need to be able to roll back your project to the point where you published the working paper, in case you need to make some further refinements there as well. Such a refinement could be the inclusion of a new model as an additional benchmark. As text, graphics and tables of the old working paper are all tailored to the old state of your dataset, you would rather build these refinements on the old dataset – something that can easily be done through git.
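One way to mark such a landmark point is a git tag. The following sketch (all file and tag names are hypothetical) tags the state of the working paper and later restores the old dataset from that tag:

```shell
mkdir paper-demo && cd paper-demo
git init -q
git config user.email "you@example.com"
git config user.name "You"
echo "obs1" > data.csv                 # data as used in the working paper
git add data.csv
git commit -q -m "data for the working paper"
git tag working-paper                  # landmark: the published state

echo "obs2" >> data.csv                # update to more recent observations
git commit -q -am "extend data for the thesis"

# roll the dataset back to the working-paper state for further refinements
git checkout working-paper -- data.csv
cat data.csv                           # contains only the old observations
```

The rest of the repository stays at its current state; only data.csv is taken back to the tagged version.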
A somewhat higher degree of version control is to exclude only temporary data associated with current development. As the underlying code for this data is still unstable, the data is bound to change frequently at this stage. But as soon as it reaches some degree of stability, it gets added to the repository.
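In practice, this can be implemented with a .gitignore entry for a directory that holds temporary data. A minimal sketch (the directory names are assumptions, not a prescribed layout):

```shell
mkdir -p ignore-demo/data/tmp && cd ignore-demo
git init -q
printf 'data/tmp/\n' > .gitignore      # exclude temporary data only
echo "stable"  > data/final.csv        # stable data: versioned
echo "scratch" > data/tmp/draft.csv    # unstable data: ignored for now
git add .
git status --short                     # draft.csv does not show up
# once draft.csv stabilizes: move it out of data/tmp/ and add it
```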
In contrast to that, a second approach is to manage and synchronize all data exclusively through git, while adapting the workflow to the special needs of data files. Conceptually, this is clearly the more demanding approach, as it requires users to be familiar with some of git’s more sophisticated (and more dangerous) commands. So why would one opt for this solution? The answer: to comply with the requirements of the incredibly useful tool GNU Make.
In case you have never heard of GNU Make: it lets you define a hierarchical structure for the interrelations and dependencies of your project files. For each “target” file you specify the “dependencies” that are required in order to “build” (or simply: create) this target. For example, if your target is a pdf graphic, you usually need a script to produce it, together with a file containing the data to be visualized. This implicitly defines a very important relationship between your files: whenever either the underlying data or the visualization script changes, the graphic has to be re-created in order to reflect these changes. GNU Make exploits exactly this knowledge in order to automatically update every “target” whose dependencies have changed. This way, you only need to embed the individual steps of your project into one large sequence of rules, and GNU Make will automatically handle any updates to your project.
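A minimal sketch of such a rule (file names invented for illustration): the shell session below writes a one-rule Makefile in which figure.txt is the target and plot.sh and data.csv are its dependencies, then lets make rebuild the target after the data changes.

```shell
mkdir make-demo && cd make-demo
# one rule: target, its dependencies, and the recipe to build it
# (recipe lines in a Makefile must start with a tab, hence the \t)
printf 'figure.txt: plot.sh data.csv\n\tsh plot.sh > figure.txt\n' > Makefile
echo '1,2,3'        > data.csv
echo 'cat data.csv' > plot.sh
make             # first run: figure.txt is built
make             # nothing changed, so make does nothing
touch data.csv   # pretend the data was updated
make             # figure.txt is rebuilt automatically
```

In a real project, plot.sh would be your visualization script and figure.txt a pdf graphic, but the dependency logic is exactly the same.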
In order to detect updates of files automatically, GNU Make relies on their timestamps, which capture the point in time at which a file was last “touched”. In most cases, “touching” is equivalent to modifying a file, although strictly speaking it only updates the timestamp. The important thing about git is that it re-creates files in your working directory, and thereby “touches” them, whenever you check them out or pull them from an external repository. In this respect it differs from Dropbox, which leaves the timestamps of synchronized files unchanged. Hence, combining git for code and Dropbox for data synchronization can artificially mess up the chronological order of your files. For example, assume that you write a script that produces a dataset, save it, execute it, and then commit and push it. On your machine, the timestamps are in the correct order, with the timestamp of your dataset being more recent than that of the underlying script. If you now pull the project on a second machine, however, git re-creates the script with the current time, while Dropbox delivers the dataset with its original, older timestamp. Suddenly, the script has become the more recent file, so that GNU Make would needlessly recompute the dataset on the next run. Hence, if you want to make use of the powerful automatic build capabilities of GNU Make, you should put all of your data under version control, in order to avoid possibly time-consuming re-computations of your data.
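The effect can be reproduced in a small sketch (file names hypothetical): deleting a committed script and checking it out again simulates what a pull on a second machine does, namely re-creating the file with the current time.

```shell
mkdir ts-demo && cd ts-demo
git init -q
git config user.email "you@example.com"
git config user.name "You"
echo 'echo 42' > build.sh
git add build.sh
git commit -q -m "add build script"
sh build.sh > data.csv      # dataset is now newer than the script
sleep 1                     # make the timestamp difference unambiguous
rm build.sh
git checkout -- build.sh    # git re-creates the file with the current time
[ build.sh -nt data.csv ] && echo "script is newer: make would rebuild data"
```

Even though nothing about the script’s content changed, its timestamp now postdates the dataset it produced.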
Now that we have chosen to handle all data within git, we want to take a more detailed look at how to do this without unnecessarily bloating disk usage. For that, it is important to first get a solid understanding of how to eliminate unnecessary data from git, which we will treat in the next post. Once we are familiar with this operation, we will derive a complete git workflow that fits all our needs in data analysis.