git for data analysis – part II: removing data files from history

This is the second part of the series on data analysis with git, where we take a more detailed look at how data files are handled. First, we convince ourselves of the problems that arise from large data files. Afterwards, we will see how a data file, once added, can be completely removed from the repository again.

First, we set up a new git repository to experiment with. To do so, we create a new folder in the home directory and then run git init inside it to initialize the repository.

But before we start, one short remark: by now I am a thoroughly convinced Linux user. Hence, all path names in the examples follow Linux conventions, and git is run, as originally intended, from the Linux command line.

mkdir ~/git_workflow/
cd ~/git_workflow/
git init

Git confirms the initialization of an empty repository:

Initialized empty Git repository in /home/user/git_workflow/.git/

As the next step, we want to add a file to the repository. To avoid typing some content manually, we simply write a listing of the home directory into a file called list_of_files.txt.

ls -l ~ > list_of_files.txt

When listing all files in the current directory,

ls -l

we now can see the newly created file:

total 4
-rw-rw-r-- 1 user user 2165 Oct  1 15:03 list_of_files.txt

This file shall now be added to the repository. To do so, we first stage the file with git add, so that git tracks it, and then we commit it.

git add list_of_files.txt
git commit -a -m "list of files in home directory added"

Git displays the following output:

[master (root-commit) 694c86f] list of files in home directory added
Your name and email address were configured automatically based
on your username and hostname. Please check that they are accurate.
You can suppress this message by setting them explicitly:

    git config --global user.name "Your Name"
    git config --global user.email you@example.com

After doing this, you may fix the identity used for this commit with:

    git commit --amend --reset-author

 1 file changed, 35 insertions(+)
 create mode 100644 list_of_files.txt

In the first lines git tells you that it has set the author of the commit automatically, based on the user name and host name of the system. At the bottom, you can see that list_of_files.txt has been added, together with the number of lines that it contains (35).
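The identity message can be avoided by setting name and e-mail address before the first commit. The following is only a minimal sketch in a scratch repository inside a temporary directory; the name "Jane Doe", the e-mail address and the file name notes.txt are placeholders and not taken from this post.

```shell
# Create a scratch repository so that the real one is left untouched.
scratch=$(mktemp -d)
cd "$scratch"
git init -q

# Set the identity for this repository only (it is stored in .git/config).
# "Jane Doe" and the e-mail address are placeholder values.
git config user.name "Jane Doe"
git config user.email "jane@example.com"

# Commits made from now on use this identity and print no identity hint.
echo "some content" > notes.txt
git add notes.txt
git commit -q -m "notes added"

# Show the author that git recorded for the commit.
git log -1 --pretty='%an <%ae>'
```

Without the --global flag the setting only applies to the current repository, which is convenient for experiments like the ones in this post.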

Besides the manually created file list_of_files.txt, the directory additionally contains a hidden directory .git. This is where git stores all its files.

ls -la

total 24
drwxrwxr-x  3 user user  4096 Oct  1 15:03 .
drwxr-xr-x 58 user user 12288 Oct  1 15:03 ..
drwxrwxr-x  8 user user  4096 Oct  1 15:03 .git
-rw-rw-r--  1 user user  2165 Oct  1 15:03 list_of_files.txt

In order to see how git handles data files, we now want to create a relatively large csv file. We do this by simply creating a matrix of random numbers in R.

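# Work inside the repository folder
setwd("~/git_workflow")
# Dimensions of the data matrix: 5000 observations of 1000 variables
nVars <- 1000
nObs <- 5000
# Draw uniformly distributed random numbers and arrange them as a matrix
observations <- runif(nVars*nObs)
largeMatrix <- matrix(observations, nObs, nVars)
# Write the matrix to disk (one header line plus one line per observation)
write.csv(largeMatrix, "unifMatrix.csv")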

Listing the files in the directory, one can now see the new data file unifMatrix.csv, together with its file size, which is approximately 86 MB.

ls -lh

total 86M
-rw-rw-r-- 1 user user 2.2K Oct  1 15:03 list_of_files.txt
-rw-rw-r-- 1 user user  86M Oct  1 15:04 unifMatrix.csv

The following shell command displays the size of the complete directory:

du -sh ~/git_workflow/

87M   /home/user/git_workflow/

As shown, the data file has been created successfully, so we can now add it to our repository.

git add unifMatrix.csv
git commit -a -m "large data matrix commited"

From the message displayed by git we can see that the data file contains 5001 lines (5000 observations plus the header line):

[master 2534a91] large data matrix commited
Your name and email address were configured automatically based
on your username and hostname. Please check that they are accurate.
You can suppress this message by setting them explicitly:

    git config --global user.name "Your Name"
    git config --global user.email you@example.com

After doing this, you may fix the identity used for this commit with:

    git commit --amend --reset-author

 1 file changed, 5001 insertions(+)
 create mode 100644 unifMatrix.csv

Instead of merely referencing the data file, git immediately stores a compressed copy of it for internal use. Hence, the complete directory has already grown substantially in size:

du -sh ~/git_workflow/

128M  /home/user/git_workflow/

You can also see the compressed copy of the data in git’s folder:

du -sh ~/git_workflow/.git

42M   /home/user/git_workflow/.git
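Besides du, git itself can report what it has stored. The following is a rough sketch in a scratch repository, where a file of 100,000 numbers stands in for the csv file; the file name numbers.csv and the identity are placeholders. It uses git count-objects for the disk usage of the object store and git cat-file -s for the uncompressed size of a stored file.

```shell
# Scratch repository with one moderately sized file (placeholder data).
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "Jane Doe"            # placeholder identity
git config user.email "jane@example.com"

# 100,000 lines of numbers stand in for the csv file.
seq 1 100000 > numbers.csv
git add numbers.csv
git commit -q -m "data file added"

# Size of the file in the working directory, in bytes.
wc -c < numbers.csv

# Number of objects git stores internally and their disk usage (in KiB).
git count-objects -v

# Uncompressed size of the copy that git keeps for the committed file.
git cat-file -s HEAD:numbers.csv
```

The last two numbers should agree with each other, while the disk usage reported by git count-objects is smaller, because git stores the content in compressed form.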

Now, we want to update the data file. Again, we simply store a matrix of random numbers to disk. This time, however, the matrix will be much smaller.

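# Work inside the repository folder
setwd("~/git_workflow")
# This time only 50 observations of a single variable
nVars <- 1
nObs <- 50
observations <- runif(nVars*nObs)
# The variable name is kept from above, although the matrix is now small
largeMatrix <- matrix(observations, nObs, nVars)
# Overwrite the large csv file with the small one
write.csv(largeMatrix, "unifMatrix.csv")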

Listing the files of the directory, the reduced size of the data file can be seen. It is now only slightly larger than one kilobyte.

ls -lha

total 28K
drwxrwxr-x  3 user user 4.0K Oct  1 15:03 .
drwxr-xr-x 58 user user  12K Oct  1 15:03 ..
drwxrwxr-x  8 user user 4.0K Oct  1 15:04 .git
-rw-rw-r--  1 user user 2.2K Oct  1 15:03 list_of_files.txt
-rw-rw-r--  1 user user 1.2K Oct  1 15:04 unifMatrix.csv

A short remark at this point: you are not able to see the real size of sub-directories with the ls command. The 4.0K shown for .git is only the size of the directory entry itself, not of its contents. In order to get the size of the complete sub-directory, use du:

du -sh ~/git_workflow/.git

42M   /home/user/git_workflow/.git

Now, the important point is that you cannot easily get rid of the compressed copy of the old data file once it is under version control.

Let’s commit the new data file, which is much smaller now.

git commit -a -m "large data matrix is now small"

[master 9350bb0] large data matrix is now small
Your name and email address were configured automatically based
on your username and hostname. Please check that they are accurate.
You can suppress this message by setting them explicitly:

    git config --global user.name "Your Name"
    git config --global user.email you@example.com

After doing this, you may fix the identity used for this commit with:

    git commit --amend --reset-author

 1 file changed, 51 insertions(+), 5001 deletions(-)

As git reports, we have removed all 5001 lines of the old, large data file and replaced them with only 51 lines of new data.

To see that the old large data file still resides on disk, we display the size of the directory again.

du -sh ~/git_workflow/

42M   /home/user/git_workflow/

Hence, always keep in mind that, by default, any file added to your repository will occupy some disk space permanently.
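That the old content is really still there can be checked by reading it back from an earlier commit. The following is a small sketch in a scratch repository, not the repository of this post: data.csv and the identity are placeholder names, and seq output stands in for the random matrices.

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "Jane Doe"            # placeholder identity
git config user.email "jane@example.com"

# First commit: a "large" file with 100,000 lines.
seq 1 100000 > data.csv
git add data.csv
git commit -q -m "large data file committed"

# Second commit: the same file name, now with only 50 lines.
seq 1 50 > data.csv
git commit -q -a -m "data file is now small"

# The working directory only holds the small version ...
wc -l < data.csv

# ... but the large version can still be read from the previous commit.
git show HEAD~1:data.csv | wc -l
```

As long as some commit in the history contains the large version, git has to keep a copy of it.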

Now, we want to try to get around this property. Our first guess is to simply remove the file from the repository.

Let’s first see which files are in the repository at this point:

git ls-tree --full-tree -r HEAD

100644 blob 7d7e1f4bbda784750b2ddf5886fe2d555c8f48f5  list_of_files.txt
100644 blob e6f10d9b519b0159b66b50ffd0baca7091ee27f2  unifMatrix.csv

Now, we remove the large data matrix from the repository:

git rm --cached unifMatrix.csv

Git will confirm this by displaying:

rm 'unifMatrix.csv'

Now, we make a new commit without the original data file:

git commit -a -m "unifMatrix removed from git repo"

[master b1a5770] unifMatrix removed from git repo
Your name and email address were configured automatically based
on your username and hostname. Please check that they are accurate.
You can suppress this message by setting them explicitly:

    git config --global user.name "Your Name"
    git config --global user.email you@example.com

After doing this, you may fix the identity used for this commit with:

    git commit --amend --reset-author

 1 file changed, 51 deletions(-)
 delete mode 100644 unifMatrix.csv

As can be seen, unifMatrix.csv is not in the current state of the repository anymore:

git ls-tree --full-tree -r HEAD

100644 blob 7d7e1f4bbda784750b2ddf5886fe2d555c8f48f5  list_of_files.txt

However, it is crucial to note that this only removed the file from the current HEAD of the branch! The old versions of the file are still part of the history of the repository:

du -sh ~/git_workflow/

42M   /home/user/git_workflow/

Hence, this attempt did not work, because we simply added another commit in which the file is deleted from the current state only. We can see the additional commit in the repository log:

git log --pretty=oneline

b1a57708133c28f71195e833ba09f3b876a391ab unifMatrix removed from git repo
9350bb0d1f6e937ef29622258eb5ac947e9b2054 large data matrix is now small
2534a91ee8f4518278976d05b8a23d56dbd5c7b6 large data matrix commited
694c86f98fae627869d405348ae5ed3ab3e8633d list of files in home directory added

Since this didn’t work, we now want to completely undo the last attempt. De facto, this is already the first time that we rewrite history: the last commit will not merely be reverted (which would add yet another commit), but removed from the history of the branch altogether. There are two options for the reset that matter here. Option --hard moves the branch back to the given commit and also immediately sets the files of the working directory to the state associated with that commit. In contrast, option --soft moves the branch back but keeps all changes to the files in the working directory and in the staging area, so that they can be committed again. Here, we will use the --hard option:

git reset --hard HEAD~1

Git informs us about the new state of our repository:

HEAD is now at 9350bb0 large data matrix is now small

As the log shows, we are now back at the state that we were in before:

git log --oneline

9350bb0 large data matrix is now small
2534a91 large data matrix commited
694c86f list of files in home directory added

Still, we need to get rid of the old, big version of our data file in the git history. To do so, we reset the repository to a state where the data file had not yet been committed to git. This time, however, we use option --soft, in order to keep all changes in our working directory.

git reset --soft HEAD~2

As ls shows, the working directory still reflects the combined result of the two commits that we just reset. Hence, unifMatrix.csv is already the small version,

ls -lah

total 28K
drwxrwxr-x  3 user user 4.0K Oct  1 15:04 .
drwxr-xr-x 58 user user  12K Oct  1 15:03 ..
drwxrwxr-x  8 user user 4.0K Oct  1 15:05 .git
-rw-rw-r--  1 user user 2.2K Oct  1 15:03 list_of_files.txt
-rw-rw-r--  1 user user 1.2K Oct  1 15:04 unifMatrix.csv

and it is not part of the commit that HEAD now points to (it is still staged, though, which is why it can be committed right away):

git ls-tree --full-tree -r HEAD

100644 blob 7d7e1f4bbda784750b2ddf5886fe2d555c8f48f5  list_of_files.txt
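The difference between the two reset modes can be reproduced in isolation. The following is a minimal sketch in a scratch repository; notes.txt, the commit messages and the identity are made up for the illustration.

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "Jane Doe"            # placeholder identity
git config user.email "jane@example.com"

echo "version 1" > notes.txt
git add notes.txt
git commit -q -m "first version"

echo "version 2" > notes.txt
git commit -q -a -m "second version"

# --soft: HEAD moves back, but file and staging area keep "version 2".
git reset -q --soft HEAD~1
after_soft=$(cat notes.txt)
staged=$(git diff --cached --name-only)
echo "after soft reset: '$after_soft', staged: $staged"

# Commit again, so that there is something to reset a second time.
git commit -q -m "second version, committed again"

# --hard: HEAD moves back and the file is overwritten with the old state.
git reset -q --hard HEAD~1
after_hard=$(cat notes.txt)
echo "after hard reset: '$after_hard'"
```

After the soft reset the change is still staged and can be committed directly, whereas the hard reset discards it from the working directory as well.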

We can now commit only the final version of our data file:

git commit -a -m "small data matrix added directly"

[master c505438] small data matrix added directly
Your name and email address were configured automatically based
on your username and hostname. Please check that they are accurate.
You can suppress this message by setting them explicitly:

    git config --global user.name "Your Name"
    git config --global user.email you@example.com

After doing this, you may fix the identity used for this commit with:

    git commit --amend --reset-author

 1 file changed, 51 insertions(+)
 create mode 100644 unifMatrix.csv

As the log shows, our history comprises only two commits:

git log --oneline

c505438 small data matrix added directly
694c86f list of files in home directory added

So, did we really get rid of the large data file stored in git? Let’s check the size of the repository:

du -sh ~/git_workflow/

42M   /home/user/git_workflow/

What the … ?

Don’t worry, this is just an additional safety net provided by git. The data is not immediately removed from the repository, so you can still retrieve the deleted data in case the rewriting of history happened unintentionally. You can reference the old, removed content through the hash codes shown by git reflog:

git reflog

c505438 HEAD@{0}: commit: small data matrix added directly
694c86f HEAD@{1}: reset: moving to HEAD~2
9350bb0 HEAD@{2}: reset: moving to HEAD~1
b1a5770 HEAD@{3}: commit: unifMatrix removed from git repo
9350bb0 HEAD@{4}: commit: large data matrix is now small
2534a91 HEAD@{5}: commit: large data matrix commited
694c86f HEAD@{6}: commit (initial): list of files in home directory added

However, these blobs of content will only be accessible for a limited amount of time, and they will not be synced to remote repositories either. For most needs this is effectively enough: the useless old dataset will no longer be in the way.
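How such a recovery through the reflog might look is sketched below in a scratch repository. The file names and the identity are placeholders; HEAD@{1} refers to the position of HEAD before the most recent move, as listed by git reflog.

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "Jane Doe"            # placeholder identity
git config user.email "jane@example.com"

echo "base" > base.txt
git add base.txt
git commit -q -m "base"

echo "important result" > result.txt
git add result.txt
git commit -q -m "result added"

# Rewrite history by accident: the last commit disappears from the branch.
git reset -q --hard HEAD~1
git log --oneline

# The reflog still lists the removed commit as HEAD@{1}.
git reflog

# Read a single file from the removed commit ...
git show 'HEAD@{1}:result.txt'

# ... or move the branch back to it completely.
git reset -q --hard 'HEAD@{1}'
cat result.txt
```

This only works as long as the reflog entry exists and the objects have not been pruned.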

However, you can also get rid of the old data immediately. To do so, you first have to ensure that nothing points to the data anymore. As we have seen, the data is still referenced by the reflog. Hence, we remove all reflog entries that point to commits which are no longer reachable from the current tip of the branch.

git reflog expire --all --expire-unreachable=0

If we look at the reflog again, we can see that only entries belonging to the current commit history are left:

git reflog

c505438 HEAD@{0}: commit: small data matrix added directly
694c86f HEAD@{1}: commit (initial): list of files in home directory added

Now we only need to call git’s garbage collection in order to remove all unreferenced objects:

git repack -A -d
git gc --prune=now

Counting objects: 6, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (5/5), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 0), reused 0 (delta 0)
Counting objects: 6, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (5/5), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 0), reused 6 (delta 0)

Checking the size of our directory, we can see that the unwanted data file has now been completely removed:

du -sh ~/git_workflow/

156K  /home/user/git_workflow/
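Instead of relying on du alone, one can also ask git directly whether the old content still exists. The following sketch replays the whole procedure in a scratch repository with placeholder data (seq output instead of the random matrices, made-up file names and identity). It uses --expire-unreachable=now rather than 0, which is an assumption chosen so that reflog entries created within the same second are expired as well.

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "Jane Doe"            # placeholder identity
git config user.email "jane@example.com"

echo "file list" > list.txt
git add list.txt
git commit -q -m "list added"

seq 1 100000 > data.csv
git add data.csv
git commit -q -m "large data file committed"
# Remember the object id of the large version to check on it later.
big_blob=$(git rev-parse HEAD:data.csv)

seq 1 50 > data.csv
git commit -q -a -m "data file is now small"

# Rewrite history: go back before the large commit, keep the small file staged.
git reset -q --soft HEAD~2
git commit -q -m "small data file added directly"

# The large version is unreachable now, but still stored.
git cat-file -e "$big_blob" && echo "before cleanup: large version still stored"

# Drop the reflog entries that point to it, then prune unreferenced objects.
git reflog expire --all --expire-unreachable=now
git repack -q -A -d
git gc -q --prune=now

if git cat-file -e "$big_blob" 2>/dev/null; then
  blob_state="present"
else
  blob_state="removed"
fi
echo "after cleanup: large version $blob_state"
```

git cat-file -e exits with a non-zero status if the object does not exist, which makes it suitable for this kind of check.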

Now that we know how to remove data completely and irreversibly from a git repository, let me end this part with some words of warning. Git was originally designed to make it possible to retrieve any old state of a given project. Hence, deleting files from the repository irreversibly runs, in a way, contrary to the underlying purpose of version control. Many computer scientists would urgently advise you never to mess with the history of your repository.

I am convinced that occasionally removing files from the repository is a necessary step in data analysis, because the requirements of the field differ from those of computer science. Nevertheless, we should take these warnings very seriously, and only deviate from the recommendations if really necessary. For example, you should never rewrite the history of branches that are already accessible to other people. This is extremely dangerous, since they could have based their own work on some of the parts that you are about to remove!

Hence, collaborative projects need rules that are binding for everybody, in order to avoid possibly damaging actions. One such rule could be: never rewrite history on publicly available parts of the repository, but only on privately used development branches.
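One possible way to check whether a commit has already been published is to ask which remote-tracking branches contain it. The following is only a sketch: a bare repository in a temporary directory stands in for the shared server, and the branch name shared, the file names and the identity are hypothetical.

```shell
scratch=$(mktemp -d)
# A bare repository stands in for the shared, publicly available repository.
git init -q --bare "$scratch/shared.git"
git init -q "$scratch/work"
cd "$scratch/work"
git config user.name "Jane Doe"            # placeholder identity
git config user.email "jane@example.com"
git remote add origin "$scratch/shared.git"

# This commit is pushed and therefore visible to other people.
echo "published" > a.txt
git add a.txt
git commit -q -m "published commit"
git push -q origin HEAD:refs/heads/shared

# This commit only exists locally.
echo "private" > b.txt
git add b.txt
git commit -q -m "private commit"

# List the remote-tracking branches that contain each commit.
published_on=$(git branch -r --contains HEAD~1)
private_on=$(git branch -r --contains HEAD)
echo "HEAD~1 is on: ${published_on:-no remote branch}"
echo "HEAD   is on: ${private_on:-no remote branch}"
```

A commit that shows up on a remote-tracking branch should be treated as public. Note that the answer is only as current as the last fetch or push.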

Posted on 2013/10/01, in tools.
