git for data analysis – part IV: Subdivide Workflow for data analysis
In this final part of the series on git for data analysis, we draw on the experience gained in the previous posts. Putting the individual pieces together, we derive a robust git workflow that allows for effective research and collaboration in data analysis. We will call it the Subdivide Workflow.
The workflow introduced here is heavily inspired by the gitflow workflow, originally described by Vincent Driessen. It involves working with multiple branches that can basically be divided into two kinds: development branches for quick and dirty code development, and branches that keep track of all results and code after a basic clean-up. Temporarily splitting off current development from the stable, working code base is a routine strategy in software development. This way, new features can be tested in a sandbox environment first, before they get integrated back into the main code base. In data analysis, however, there is yet another aspect of this strategy that we primarily benefit from: development branches also give us the ability to rewrite history before merging into the main branch. This way, unnecessary data that was included while experimenting can be completely removed from the history again. As we have seen in the previous posts, this is a crucial step in data analysis in order to avoid bloating the project’s disk usage.
At the core of the project, the master branch comprises the more stable code that gets merged from development branches. At this point, unnecessary data should already have been filtered out, and commits should have been cleaned up into meaningful steps. Since the master branch represents the very core of the project, any modifications to it should be made available to collaborators as soon as possible. This way, other people can branch off from the most recent version of master when they start developing new features on their own. As master always resides on the remote repository, accessible to others to build their own work on top of it, rewriting its history is a very dangerous operation. Hence, to avoid messing up other people’s work, rewriting history should be forbidden on the master branch!
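One possible way to enforce this rule, assuming you control the remote repository yourself, is git’s `receive.denyNonFastForwards` setting, which makes the remote refuse any push that is not a fast-forward. The sketch below uses throwaway directories and hypothetical file names. Note that this setting applies to every branch on that remote, so development branches whose history is meant to be rewritten would have to stay local or live in a different remote.

```shell
set -eu

# Throwaway working area with a bare "remote" and a local clone-like repository.
work="$(mktemp -d)"
git init -q --bare "$work/remote.git"

# Server-side setting: refuse pushes that are not fast-forwards, even forced ones.
git -C "$work/remote.git" config receive.denyNonFastForwards true

git init -q "$work/local"
cd "$work/local"
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
git config user.name "Example Analyst"    # hypothetical identity for this sketch
git config user.email "analyst@example.com"

echo "fit model" > analysis.R             # hypothetical file
git add analysis.R
git commit -q -m "Add analysis script"
git remote add origin "$work/remote.git"
git push -q origin master
published="$(git rev-parse master)"

# Rewrite the published commit locally, then try to force it onto the remote.
git commit -q --amend -m "Rewritten analysis commit"
if git push -q --force origin master 2>/dev/null; then
    push_result="accepted"
else
    push_result="rejected"
fi
echo "forced push was $push_result"
```

With the setting in place, the forced push is refused and the remote master keeps pointing at the originally published commit.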
In a smaller project with only one main publishing goal, this one stable master branch should already be enough to meet most needs. In a larger project that splits up into several subparts, however, we need to extend the workflow further. The reason is that we probably want to be able to publish individual subparts along the way. And, as every initial publication usually leaves some room for improvement, we then need to be able to incorporate further refinements into already published subparts as well. Therefore, we need to be able to temporarily roll back the larger project to the exact state of a previous publication, so that the refinements can be included. As a first step in this direction, the workflow is extended with tags. This way, important landmarks of the project get marked with a descriptive name, so that they can be retrieved more easily later on. Using tags, updating old content of a given subpart becomes rather easy. In order to roll back the project to the state it was in at publication time, we just need to find the respective tag and check out the associated commit. Then we branch off of this commit, so that we are free to incorporate any updates without the risk of breaking code from the larger project.
Branches in software development are, I think, usually perceived as only a temporary separation. From the very moment they are created, they are supposed to get merged back into the main branch at some point. After all, software, once stable, usually tries to integrate all of its features. Individual branches are therefore only rarely expected to contain mutually exclusive components. In our case, however, this is an explicit design goal of the workflow. Once you allow subparts of your project to be published at different points in time, they may rely on different states of the same common dependencies. The actual state of a dependency can vary over time, since it is determined by the surrounding larger project. For example, in data analysis, two different subparts may share the same underlying database. Nevertheless, the state of the database could have been different for the two subparts at their respective publishing times. Additionally, some descriptive passages of each publication (some text, tables and figures) may rely on the exact state of the database at publishing time. Hence, you probably do not want to break this link between a publication and the state of the underlying database, as you would otherwise be required to update all of these passages. In brief, in version control you usually have multiple branches that evolve differently but eventually share a common end point through merging. With individually published subparts, however, some branches share an identical evolution, and what differs is their starting point. As a consequence, some of your branches will remain forever, and they will never get merged into the main branch again.
To make this a bit less theoretical, here is what it entails for our workflow. Whenever we publish a finished subpart of the project, we only need to associate a tag with it. However, by the time we want to modify an already published subpart, the project has usually evolved further. We therefore need to make sure that the underlying dependencies are kept in the same state as they were at the time of publishing. To do so, we roll back the project to its old state and start modifying the subpart in a separate branch that we call update_release_subpart_name. Modifications follow the same routine as with the main branch: you branch off a separate development branch, and only merge it after its history has been cleaned up and rewritten, so that no unnecessary data files make it into the stable branch update_release_subpart_name. However, the modifications you made are an improvement to the subpart, and you quite naturally want them to be included in the main branch as well. Hence, this temporary development branch needs to be merged into your main branch, too. In effect, your master branch and your update_release_subpart_name branch will now share an identical evolution for a short while, which is given by the new modifications. Nevertheless, these evolutions build on different states of the underlying dependencies. Once the modifications are merged into the update_release_subpart_name branch, you can publish the results as a new release of the old subpart. In the end, you will have one additional branch for each subpart that gets updated. These branches should have the same properties as your master branch: they should be made publicly available as soon as possible, and hence rewriting their history should not be allowed.
We now illustrate this workflow with an example. Let’s assume we start a new project. We initialize a git repository, and make a first commit to the master branch.
Now, we want to add some fancy software feature. As we know that this will almost certainly involve some experimenting, we factor it out into a separate development branch called dev_feature1. This way, we can easily clean up our experimental results and data before we merge it into the main branch. In the course of development, we add four commits of quick and dirty code. The backward-pointing arrows in the graphics point to the parent of each commit.
Since this ad-hoc implementation involved some unnecessary steps and experiments, we now want to rewrite the commit history before merging it into the main branch. This way, we get rid of any unnecessarily committed data files, and keep the commit history clean and clear for our collaborators. The simplest way to achieve this is probably to use git reset --soft and gradually re-commit in clusters of related content. Alternatively, one could also use git rebase -i in order to modify the history more interactively.
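The clean-up with git reset --soft could look roughly like the following sketch. It runs in a throwaway directory, and all file names and commit messages are hypothetical; the only assumption taken from the example is one commit on master followed by four quick and dirty commits on dev_feature1.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
git config user.name "Example Analyst"    # hypothetical identity for this sketch
git config user.email "analyst@example.com"

# First commit on master.
echo "id,value" > data.csv
git add data.csv
git commit -q -m "Add initial dataset"

# Four quick and dirty commits on the development branch.
git checkout -q -b dev_feature1
echo "load data" > analysis.R
git add analysis.R
git commit -q -m "wip: load data"
echo "scratch output" > scratch_results.csv
git add scratch_results.csv
git commit -q -m "wip: dump intermediate results"
echo "fit model" >> analysis.R
git add analysis.R
git commit -q -m "wip: try a model"
echo "plot results" > plots.R
git add plots.R
git commit -q -m "wip: plots"

# Move the branch pointer back to master; working tree and index keep all changes.
git reset -q --soft master
# Unstage everything so that files can be re-committed in related clusters.
git reset -q

git add analysis.R
git commit -q -m "Add analysis script"
git add plots.R
git commit -q -m "Add plotting script"

# The experimental data file was never re-committed and can simply be deleted.
rm scratch_results.csv

git log --oneline master..dev_feature1
```

After this, dev_feature1 is two clean commits ahead of master and the scratch file no longer appears in the branch history. The original four commits remain reachable through the local reflog until git expires and garbage-collects them.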
After cleaning up the history of the development branch, we are left with only two of the original four commits. These clean changes can now be merged into the master branch. Since our master branch has not evolved further yet, the merge will be a simple fast-forward merge at this step, which is depicted as a bold, forward-pointing arrow in the graphics.
These commits shall already constitute the first subpart of our project, so we publish our results so far. To do so, we mark the current snapshot of the project with the tag release1, so that we can find it more easily afterwards. In contrast, the development branch in its current state is not needed anymore, so it can be removed from the repository.
The minute we publish our first subpart, we notice that some essential feature is missing. Hence, the next step is to fix this in order to update the publicly available code as soon as possible. However, we already expect this fix to take some time, so we cannot be sure that the master branch will still be in the same state once we want to merge our modified code. Hence, we create a new branch update_release1, which starts at the exact state of our project at the time of publishing. This way, any modifications will always be based on the initial three commits in master only.
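The last three steps (fast-forward merge, tagging the release, and creating the stable update branch) can be sketched as follows. The starting history is rebuilt in a throwaway directory with hypothetical file names, so that the snippet runs on its own.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
git config user.name "Example Analyst"    # hypothetical identity for this sketch
git config user.email "analyst@example.com"

# Assumed starting point: one commit on master, two cleaned-up commits on dev_feature1.
echo "id,value" > data.csv
git add data.csv
git commit -q -m "Add initial dataset"
git checkout -q -b dev_feature1
echo "fit model" > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"
echo "plot results" > plots.R
git add plots.R
git commit -q -m "Add plotting script"

# master has not moved, so the merge is a plain fast-forward.
git checkout -q master
git merge -q --ff-only dev_feature1

# Mark the published state and drop the development branch.
git tag release1
git branch -q -d dev_feature1

# Stable branch for all future updates of this subpart, starting at the release.
git branch update_release1 release1

git log --oneline --decorate
```

Using --ff-only makes git abort instead of silently creating a merge commit if master has moved in the meantime, which is a cheap sanity check at this step.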
From now on, this new branch will be the stable branch for all future modifications to this subpart, and it will remain until the end of our large project. As this branch is considered to be a stable branch, experiments again should not be made here directly. Therefore, we branch off an additional development branch update_release, where the actual development is conducted and rewriting history is allowed.
Once we are finished with the modifications, we clean up our development history and merge into the subpart’s release branch. As update_release1 is still in the same state, this will be a fast-forward merge again. Now we simply need to publish our modifications and mark the current snapshot with the tag release1_v2 for future refinements.
In addition, these refinements should also be made available to the project core, so we additionally merge the modifications into master. Since master has not evolved further so far, this merge will also be a fast-forward merge.
Now that the refinements have been merged into both master and update_release1, we can remove the development branch update_release again.
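A minimal sketch of this first update cycle, again in a throwaway repository with hypothetical file contents, might look like this:

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
git config user.name "Example Analyst"    # hypothetical identity for this sketch
git config user.email "analyst@example.com"

# Assumed starting point: three commits on master, published as release1.
echo "id,value" > data.csv
git add data.csv
git commit -q -m "Add initial dataset"
echo "fit model" > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"
echo "plot results" > plots.R
git add plots.R
git commit -q -m "Add plotting script"
git tag release1
git branch update_release1 release1

# Development of the fix happens on a temporary branch off the stable update branch.
git checkout -q -b update_release update_release1
echo "handle missing values" >> analysis.R
git add analysis.R
git commit -q -m "Handle missing values in analysis"

# Fast-forward the stable update branch and tag the new release of the subpart.
git checkout -q update_release1
git merge -q --ff-only update_release
git tag release1_v2

# master has not moved either, so it can be fast-forwarded as well.
git checkout -q master
git merge -q --ff-only update_release

# The temporary development branch is no longer needed.
git branch -q -d update_release
```

At this point master, update_release1 and the tag release1_v2 all point to the same commit, while release1 still marks the originally published state.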
Meanwhile, development of two new features has started, and the branches dev_feature1 and dev_feature2 were both branched off from the same commit of master. Support for this kind of simultaneous development of multiple features is one of the main strengths of git.
When development of the individual features is finished, their changes shall be made available on the master branch. Before that, we first clean up the history on both development branches. Then, dev_feature1 is merged into master. Since master has evolved in the meantime (it was extended with the refinements of the released subpart), merging requires a three-way merge this time, which is indicated by a red forward-pointing arrow. If you want the history of master to be linear, you could instead run git rebase on the dev_feature1 branch beforehand, so that all of its commits are replayed on top of the most recent commit of master. Here, we are satisfied with the three-way merge. Finally, we also merge dev_feature2 by three-way merge, and publish the second subpart of the project. Again, for clarity, we mark the commit associated with the publication with the tag release2.
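The two three-way merges can be reproduced with a small sketch. The history below is a simplified, hypothetical stand-in for the one in the example: both feature branches start at the same commit of master, and master receives one further commit before the features are merged.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
git config user.name "Example Analyst"    # hypothetical identity for this sketch
git config user.email "analyst@example.com"

echo "id,value" > data.csv
echo "fit model" > analysis.R
git add data.csv analysis.R
git commit -q -m "Add dataset and analysis"

# Both feature branches start from the same commit of master.
git branch dev_feature1
git branch dev_feature2

# master evolves in the meantime (stand-in for the merged subpart refinements).
echo "handle missing values" >> analysis.R
git commit -q -am "Refine released analysis"

git checkout -q dev_feature1
echo "feature 1" > feature1.R
git add feature1.R
git commit -q -m "Add feature 1"

git checkout -q dev_feature2
echo "feature 2" > feature2.R
git add feature2.R
git commit -q -m "Add feature 2"

# master and the feature branches have diverged, so each merge creates a merge commit.
git checkout -q master
git merge -q --no-edit dev_feature1
git merge -q --no-edit dev_feature2
git tag release2

git log --oneline --graph --decorate
```

The graph printed at the end shows two merge commits on master, each with two parents, and the tag release2 on the most recent one.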
Finally, we want to look again at how an already published subpart can be further refined, in order to assure ourselves that the workflow works as expected. To do so, let’s think about the subpart in terms of a more concrete example. At the time of first publication, the project consisted of three commits. Let’s assume that the very first commit contained a database, and commits two and three contained some analysis of the data. Thus, the data is a prerequisite to the analysis. Any future modification of this prerequisite must not affect the originally published analysis, and hence must happen on the main branch only. In contrast, any modification of the analysis itself has to be made on the update_release1 branch, and only afterwards be merged into the main branch, too. With this in mind, the current state of our project is as follows. The analysis of the first subpart is based on the very first version of the database, and has already been updated once. Meanwhile, the master branch has evolved as well, so that the database at the current point is probably no longer the same as at the initial commit. It is now clear that any new modifications of the analysis have to start from the update_release1 branch, so that they are still based on the original version of the dataset. Hence, the development branch update_release branches off of update_release1.
After cleaning up, the modifications on update_release are merged into the stable branch for the analysis: update_release1. This is easily achieved through a fast-forward merge. At this point, the snapshot of the update_release1 branch contains all modifications to the originally published analysis, but still the very first version of the dataset. This is published as an update of the subpart, and marked with a tag again.
When merging the modifications into the master branch, we will need a three-way merge. update_release now contains modifications that are not yet contained in master (the modifications that we want to merge), and master contains modifications that are not contained in update_release: probably some modifications on the dataset. Hence, after merging, master will again have the most recent versions of both data and analysis included.
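To check that the two branches really end up with different states of the data but the same analysis, the situation can be replayed in a throwaway repository. File names, contents and the tag name release1_v3 are hypothetical choices for this sketch.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
git config user.name "Example Analyst"    # hypothetical identity for this sketch
git config user.email "analyst@example.com"

# Commit one holds the data, commits two and three hold the analysis.
echo "id,value" > data.csv
git add data.csv
git commit -q -m "Add initial dataset"
echo "fit model" > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"
echo "plot results" > plots.R
git add plots.R
git commit -q -m "Add plotting script"
git tag release1
git branch update_release1 release1

# The dataset is only ever changed on master.
echo "1,3.14" >> data.csv
git commit -q -am "Extend dataset"

# The analysis is refined on a development branch off the stable update branch.
git checkout -q -b update_release update_release1
echo "robust standard errors" >> analysis.R
git commit -q -am "Use robust standard errors"

# Fast-forward the stable update branch and tag the updated subpart.
git checkout -q update_release1
git merge -q --ff-only update_release
git tag release1_v3                        # hypothetical tag name

# master has diverged (new data), so this is a three-way merge.
git checkout -q master
git merge -q --no-edit update_release
git branch -q -d update_release

git show update_release1:data.csv          # still the original dataset
git show master:data.csv                   # the extended dataset
```

Both branches now contain the refined analysis, while only master carries the extended dataset; update_release1 still holds the data exactly as it was at the first publication.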
In order to facilitate effective research, we place a number of requirements on our workflow. The most important ones are listed below, together with some remarks about motivation and implementation through git, GNU Make and the Subdivide Workflow:
- robust collaboration
  - solved through version control with git
- easy outsourcing of computational tasks to “the cloud”
  - using git and GNU Make, you only need to type the following commands on the server:
    - git pull
    - git commit
    - git push
- splitting large projects into subparts
  - subparts can be published more timely, and re-used more easily by other researchers
- re-creatable subparts
  - in order to be extendable, subparts of the project need to be re-creatable in exactly the same state that they were in at publishing; this is achieved through Subdivide Workflow.
- efficient data handling to not bloat storage usage
  - in version control only possible through re-writing of history
- minimal danger of re-writing git history
  - re-writing history will only be allowed on temporary development branches
- reproducible research
  - complete project can be replicated with GNU Make and retraced with version control
- easy updates for data changes
  - automatic re-computation with GNU Make