git for data analysis – part III: git together with GNU Make

In the first part of this blog post series on git for data analysis, we decided to handle all data files within git for one reason: to comply with the requirements of GNU Make. Without the intention to use GNU Make, it would also be appropriate and probably even more convenient to synchronize data files externally, with different software like Dropbox.

Since we tailor our workflow specifically to the requirements of GNU Make, we first want to make sure that git and GNU Make really do work together seamlessly. In particular, we want to test whether git manages timestamps such that re-computation with GNU Make works as expected.

In order to test this under realistic conditions, this time we also want to set up a remote repository such that we can synchronize data across multiple computers. At this point I will assume that you are already familiar with remote repositories, so I will not go into further detail. I will use an account on Bitbucket to set up the remote repository, since it offers free hosting of private repositories. We start by setting up a git repository locally:

mkdir ~/git_workflow/
cd ~/git_workflow/
git init

Initialized empty Git repository in /home/user/git_workflow/.git/

Then, we create an R script that will plot a simple polynomial function into a pdf file:

grid <- seq(1:10)
yVals <- grid^2 - 8
pdf(file ="./polynomial_plot.pdf")
plot(grid, yVals)
dev.off()

The script will be stored as create_plot.R.

Instead of manually executing the script, we want it to be executed only through GNU Make. GNU Make should run it whenever the output pdf graphic is missing from our directory or older than the script. To this end, we initialize our Makefile as follows:

./polynomial_plot.pdf: ./create_plot.R
	R CMD BATCH ./create_plot.R

At this point, let me briefly explain the meaning of the Makefile for those of you who are not yet familiar with GNU Make. The basic syntax is such that “target” files are listed before a colon, while the dependencies necessary for the creation of the target file follow after it. In this case: whenever create_plot.R has a more recent timestamp than polynomial_plot.pdf, or polynomial_plot.pdf does not exist yet, GNU Make will automatically re-compute the target. Now we only need to tell GNU Make how the target can be created: this is specified in the next line. The syntax demands that the line containing the command for creation be indented by a TAB! Hence, make sure that your text editor does not automatically replace TABs with spaces.
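Since a missing TAB is such a common stumbling block, here is a minimal sketch of writing the same Makefile from the shell, so that the TAB does not depend on any editor setting. The temporary directory is an assumption made purely for illustration; in the actual project the file would live in ~/git_workflow/.

```shell
# Work in a throwaway directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir"

# printf expands \t to a literal TAB, independent of editor settings.
printf '%s\n\t%s\n' \
  './polynomial_plot.pdf: ./create_plot.R' \
  'R CMD BATCH ./create_plot.R' > Makefile

# Count the lines that really start with a TAB character.
tab_lines=$(grep -c "^$(printf '\t')" Makefile)
echo "recipe lines starting with TAB: $tab_lines"
```

If the count printed at the end is 0, the recipe line was indented with spaces and GNU Make would reject the file.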

Now, in order to create the output graphic, we simply need to type make into the command line:

cd ~/git_workflow/
make

GNU Make tells us that the following command is executed:

R CMD BATCH ./create_plot.R

Indeed, our directory now contains the desired pdf graphic:

ls -l

total 20
-rw-rw-r-- 1 user user  102 Oct  1 19:10 create_plot.R
-rw-rw-r-- 1 user user  896 Oct  1 19:21 create_plot.Rout
-rw-rw-r-- 1 user user   69 Oct  1 19:20 Makefile
-rw-rw-r-- 1 user user 4594 Oct  1 19:21 polynomial_plot.pdf

If you wonder about the file with .Rout extension: this file is always produced when R executes a script in batch mode.

Comparing the timestamps of the files, one can see that the output file polynomial_plot.pdf has a more recent timestamp than its dependency, create_plot.R. Hence, a repeated call of make will not lead to a re-computation:

make

make: `polynomial_plot.pdf' is up to date.
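The comparison behind this decision can be imitated directly in the shell. The following is only a rough sketch of the rule, not of GNU Make's actual implementation: it recreates the two timestamps from the listing above in a temporary directory (the year 2013 and the empty stand-in files are assumptions) and applies the test operator -nt, "newer than".

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Empty stand-in files carrying the timestamps from the listing.
touch -t 201310011910 create_plot.R
touch -t 201310011921 polynomial_plot.pdf

# The target is rebuilt only if the dependency is newer than the target.
if [ create_plot.R -nt polynomial_plot.pdf ]; then
  decision="rebuild"
else
  decision="up to date"
fi
echo "$decision"
```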

With our first results in hand, now is a good time to add a first commit to our git repository.

git add create_plot.R
git add Makefile
git add polynomial_plot.pdf
git commit -m "Makefile, create_plot.R and pdf output committed"

[master (root-commit) ab6561f] Makefile, create_plot.R and pdf output committed
Your name and email address were configured automatically based
on your username and hostname. Please check that they are accurate.
You can suppress this message by setting them explicitly:

    git config --global user.name "Your Name"
    git config --global user.email you@example.com

After doing this, you may fix the identity used for this commit with:

    git commit --amend --reset-author

 3 files changed, 9 insertions(+)
 create mode 100644 Makefile
 create mode 100644 create_plot.R
 create mode 100644 polynomial_plot.pdf

Checking the timestamps of our files, we can see that simply adding and committing files to the repository does not change their timestamp.

ls -l

total 20
-rw-rw-r-- 1 user user  102 Oct  1 19:10 create_plot.R
-rw-rw-r-- 1 user user  896 Oct  1 19:21 create_plot.Rout
-rw-rw-r-- 1 user user   69 Oct  1 19:20 Makefile
-rw-rw-r-- 1 user user 4594 Oct  1 19:21 polynomial_plot.pdf
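The same observation can be checked without reading listings by eye. The sketch below assumes a throwaway repository in a temporary directory and a placeholder identity passed inline; it keeps a reference file with the same timestamp and compares the two after the commit.

```shell
workdir=$(mktemp -d)
cd "$workdir"
git init --quiet

echo 'grid <- seq(1:10)' > create_plot.R
# Give the file a known, older timestamp and copy it to a reference file.
touch -t 201310011910 create_plot.R
touch -r create_plot.R reference_stamp

git add create_plot.R
# Identity is set inline so the sketch does not depend on a global git config.
git -c user.name="Example User" -c user.email="user@example.com" \
  commit --quiet -m "add create_plot.R"

# If neither file is newer than the other, the timestamp is unchanged.
if [ ! create_plot.R -nt reference_stamp ] &&
   [ ! reference_stamp -nt create_plot.R ]; then
  result="timestamp unchanged"
else
  result="timestamp changed"
fi
echo "$result"
```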

Now, let’s see how this works with a remote repository. To this end, we set up a new repository at Bitbucket, which will be called git_workflow as well. Since our project has already been started, we choose the option "I have an existing project to push up" on the webpage. Then, we just need to copy the code provided by Bitbucket in order to push up all local files via ssh. Note, however, that this will only work if you have already set up ssh authentication for your computer. You can read more about this in the Bitbucket Documentation.

git remote add origin ssh://git@bitbucket.org/username/git_workflow.git
git push -u origin --all # pushes up the repo and its refs for the first time
git push -u origin --tags # pushes up any tags

Counting objects: 5, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (5/5), done.
Writing objects: 100% (5/5), 4.39 KiB, done.
Total 5 (delta 0), reused 0 (delta 0)
To ssh://git@bitbucket.org/username/git_workflow.git
 * [new branch]      master -> master
Branch master set up to track remote branch master from origin.
Everything up-to-date

Next, we want to clone the repository into a different folder on our local machine. To do so, we create a new folder ~/git_workflow_experiment/, into which we clone the repository from the remote host. The URL of the repository can be found on the repository's homepage at Bitbucket.

mkdir ~/git_workflow_experiment/
cd ~/git_workflow_experiment/
git clone git@bitbucket.org:username/git_workflow.git .

Cloning into '.'...
remote: Counting objects: 5, done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 5 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (5/5), 4.39 KiB, done.

Now, we can check the timestamps of the files in our second directory:

ls -l

total 16
-rw-rw-r-- 1 user user  102 Oct  2 09:16 create_plot.R
-rw-rw-r-- 1 user user   69 Oct  2 09:16 Makefile
-rw-rw-r-- 1 user user 4594 Oct  2 09:16 polynomial_plot.pdf

As you can see, all files now show the same timestamp. This timestamp no longer reflects when the files were last modified. Instead, it refers to the time at which git wrote them into the working directory, which here is the time of cloning. And with equal timestamps, GNU Make will treat target files as being up to date:

make

make: `polynomial_plot.pdf' is up to date.
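To see the loss of the original timestamps in isolation, one can clone a small local repository instead of a remote one. This is only a sketch under assumptions: the names origin_repo, cloned_repo and clone_marker are hypothetical, the pdf is replaced by a text placeholder, and the identity is set inline.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A small source repository whose files carry old timestamps.
git init --quiet origin_repo
echo 'grid <- seq(1:10)' > origin_repo/create_plot.R
echo 'placeholder for the pdf' > origin_repo/polynomial_plot.pdf
touch -t 201310011910 origin_repo/create_plot.R
touch -t 201310011921 origin_repo/polynomial_plot.pdf
git -C origin_repo add .
git -C origin_repo -c user.name="Example User" -c user.email="user@example.com" \
  commit --quiet -m "source and target"

# Marker created just before cloning, used as a point of comparison.
touch clone_marker
git clone --quiet origin_repo cloned_repo

# The cloned files are not older than the marker: the old dates are gone.
if [ ! cloned_repo/create_plot.R -ot clone_marker ] &&
   [ ! cloned_repo/polynomial_plot.pdf -ot clone_marker ]; then
  result="files carry checkout time"
else
  result="files kept original time"
fi
echo "$result"
```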

Hence, so far everything works consistently and as expected. However, by equalizing all the timestamps, we automatically also lose some information. Previously, we could tell which file is more recent. Now, however, both files suddenly appear as being modified simultaneously, which always causes GNU Make to treat the target as up to date. It is not difficult to come up with an example where this will lead to an unwanted result: we just need to commit changes to both source and target, where the target, nevertheless, is not up to date. Let’s take a look at an example.

We first make some changes to the underlying graphic’s script, where we exchange the exponent such that the graph will now show a square root function:

grid <- seq(1:10)
yVals <- grid^0.5 - 8
pdf(file ="./polynomial_plot.pdf")
plot(grid, yVals)
dev.off()

Since our pdf target depends on this script, GNU Make will automatically update the graphics on execution:

make

R CMD BATCH ./create_plot.R

Looking at the repository status, we can see that both files have been modified by now.

git status

On branch master
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)

modified:   create_plot.R
modified:   polynomial_plot.pdf

Untracked files:
(use "git add <file>..." to include in what will be committed)

.RData
create_plot.Rout
no changes added to commit (use "git add" and/or "git commit -a")

Hence, if we committed the files, pushed them to the repository, and pulled them again from our second local repository, both files would show an updated and equal timestamp. Now, in order to mess up the sequence of GNU Make, we change the source file again, and only afterwards we commit our changes. Hence, we add an additional minus sign to the function that is plotted, and add the value of 100.

grid <- seq(1:10)
yVals <- -grid^0.5 + 100
pdf(file ="./polynomial_plot.pdf")
plot(grid, yVals)
dev.off()

Checking the timestamps of our files, we already see that GNU Make would update the target again, as create_plot.R is more recent than polynomial_plot.pdf.

ls -l

total 20
-rw-rw-r-- 1 user user  122 Oct  2 10:05 create_plot.R
-rw-rw-r-- 1 user user  913 Oct  2 09:59 create_plot.Rout
-rw-rw-r-- 1 user user   69 Oct  2 09:16 Makefile
-rw-rw-r-- 1 user user 4603 Oct  2 09:59 polynomial_plot.pdf

This can be seen even better if we call GNU Make with the option --dry-run. GNU Make then only displays the steps that it would carry out on the next execution, without actually running them.

make --dry-run

R CMD BATCH ./create_plot.R

As expected, GNU Make would update the pdf graphic.

Now, we commit both files,

git add create_plot.R
git add polynomial_plot.pdf
git commit -m "updated source and target, but committed in not updated state"

[master 0fa24d6] updated source and target, but committed in not updated state
Your name and email address were configured automatically based
on your username and hostname. Please check that they are accurate.
You can suppress this message by setting them explicitly:

    git config --global user.name "Your Name"
    git config --global user.email you@example.com

After doing this, you may fix the identity used for this commit with:

    git commit --amend --reset-author

 2 files changed, 5 insertions(+), 5 deletions(-)

and push the new commit to the remote repository:

git push origin master

Counting objects: 7, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 1.11 KiB, done.
Total 4 (delta 1), reused 0 (delta 0)
To git@bitbucket.org:username/git_workflow.git
   ab6561f..0fa24d6  master -> master

To get the changes into our other local repository, the original ~/git_workflow/, we switch directory and pull from the remote repository:

cd ~/git_workflow/
git pull origin master

remote: Counting objects: 7, done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 4 (delta 1), reused 0 (delta 0)
Unpacking objects: 100% (4/4), done.
From ssh://bitbucket.org/username/git_workflow
 * branch            master     -> FETCH_HEAD
Updating ab6561f..0fa24d6
Fast-forward
 create_plot.R       |   10 +++++-----
 polynomial_plot.pdf |  Bin 4594 -> 4603 bytes
 2 files changed, 5 insertions(+), 5 deletions(-)

Although we did commit in a state where GNU Make would update on execution, the files in the directory now exhibit equal timestamps.

ls -l

total 20
-rw-rw-r-- 1 user user  122 Oct  2 10:10 create_plot.R
-rw-rw-r-- 1 user user  896 Oct  1 19:21 create_plot.Rout
-rw-rw-r-- 1 user user   69 Oct  1 19:20 Makefile
-rw-rw-r-- 1 user user 4603 Oct  2 10:10 polynomial_plot.pdf

Hence, GNU Make will not update the target, although the pdf graphic in its current state does not match its underlying script.

make

make: `polynomial_plot.pdf' is up to date.

Summing up, git was explicitly designed to handle timestamps in a manner that is consistent with GNU Make. The combination of git and GNU Make works robustly, as long as you never perform the one dangerous sequence of actions: committing in a state that requires updating, when both source and target were modified! A simple rule to avoid this would be: “always update your targets with make before committing”.
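One way to follow this rule without relying on memory is to ask GNU Make itself whether anything is out of date. The sketch below uses the option --question, which runs no commands and only reports through its exit status whether the target is up to date. The Makefile here is a stand-in that copies the script instead of calling R, so that the example runs without R installed; the timestamps mimic the stale state from the example above.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in Makefile: the recipe copies the script instead of calling R.
printf '%s\n\t%s\n' \
  'polynomial_plot.pdf: create_plot.R' \
  'cp create_plot.R polynomial_plot.pdf' > Makefile

echo 'grid <- seq(1:10)' > create_plot.R
make --silent    # build the target once

# Recreate the stale state: source edited after the target was built.
touch -t 201310020959 polynomial_plot.pdf
touch -t 201310021005 create_plot.R

# --question executes nothing; exit status 0 means "up to date".
if make --question; then
  verdict="safe to commit"
else
  verdict="run make before committing"
fi
echo "$verdict"
```

One possible use is to call such a check from a git pre-commit hook, which aborts the commit when it exits with a non-zero status.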

Posted on 2013/10/02, in tools.
