Category Archives: tools
Software tools for research in statistics and data analysis.
In this blog post I want to describe how to set up a running IJulia server on an Amazon EC2 instance. Although there already are some useful resources on the topic out there, I didn’t find a comprehensive tutorial yet. Hence, I decided to write down my experiences, in case they are of help to others.
1 What for?
At this point, you might think: “okay – but what for?”. Well, what you get with Amazon AWS is a setting where you can easily load a given image on a computer that can be chosen from a range of different memory and processing characteristics (you can find a list of available instances here). In addition, you can also store some private data on a separate volume with Amazon’s Elastic Block Store. So taking both components together, this allows fast setup of the following workflow:
- choose a computer with optimal characteristics for your task
- load your ready-to-go working environment with all relevant software components pre-installed
- mount an EBS volume with your private data
And your customized computer is ready for work!
So, what you get is a flexible environment that you can either use only for selected tasks with high computational requirements, or that you use steadily with a certain baseline level of CPU performance and only scale up when needed. This way, for example, you could outsource computations from your slow netbook to the Amazon cloud whenever the weather allows you to do your work outside in your garden. An easy-to-transport and lightweight netbook already provides you with an entrance to huge computational power in the Amazon cloud. However, compared to a moderately good local computer at home, I doubt that the Amazon cloud is a cost-efficient alternative for permanent work yet. Besides, you always have the disadvantage of having to rely on a good internet connection, which becomes especially problematic when traveling.
Anyways, I am sure you will find your own use cases for IJulia in the cloud, so let’s get started with the setup.
2 Setting up an Amazon AWS account
The first step, of course, is setting up an account at Amazon web services. You can either log in with an existing Amazon account, or just create a completely new account from scratch. At this step, you will already be required to enter your credit card details. However, Amazon automatically provides you with a free usage tier, comprising (among other benefits) 750 hours of free EC2 instance usage, as well as 30 GB of EBS storage. For more details, just take a look at Amazon’s pricing webpage.
3 Setting up an instance
Logging in to your account will take you to the AWS Management Console: a graphical user interface that allows managing all of your AWS products. Should you plan on making some of your products available to other users as well (employees, students, …), you could easily assign rights to groups and users through Amazon’s Identity and Access Management (IAM), so that you remain the only person with access to the full AWS Management Console.
Now let’s create a new instance. In the overview, click on EC2 to get to the EC2 Dashboard, where you can manage and create Amazon computing instances. You need to specify the region where your instance should physically be created, which you can select in the upper right corner. Then, click on instances on the left side, and select
Launch Instance to access the step by step instructions.
In the first step you need to select the basic software equipment that your instance should use. This determines your operating system, but can also comprise additional preinstalled software. You can select from a range of Amazon Machine Images (AMIs), which are images put together either by Amazon, by other members of the community, or by yourself. At the end of the setup, we will save our preinstalled IJulia server as an AMI as well, so that we can easily select it as a ready-to-go image for future instances. For now, however, we simply select an image with a basic operating system installed. For the instructions described here, I select Ubuntu Server 14.04 LTS.
In the next step we now need to pick the exact memory and processing characteristics for our instance. For this basic setup, I will use a t2.micro instance type, as it is part of the free usage tier in the first 12 months. If you wanted to use Amazon EC2 for computationally demanding tasks, you would select a type with better computational performance.
In the next two steps,
Configure Instance and
Add Storage, I simply stick with the default values. Only in step 5 I manually add some tags to the instance. Tags can be added as key-value pairs. So, for example, we could use the following tags to describe our instance:
- task = ijulia,
- os = ubuntu,
- cpu = t2.micro
By now, all characteristics of the instance itself are defined, and we only need to determine the traffic that is allowed to reach our instance. In other words: we have defined a virtual computer in Amazon’s cloud – but how can we communicate with it?
In general, there are several ways to access a remote computer. The standard way on Linux computers is through a terminal with an SSH connection. This way, we can execute commands on the remote computer, which allows the installation of software, for example. In addition, however, we might want to communicate with the remote computer through channels other than the terminal. For example, in our case we want to access the computing power of the remote machine through an IJulia tab in our browser. Hence, we need to determine all ways through which communication with the remote should be allowed. For security reasons, you want to keep these communication channels as sparse and restricted as possible. Still, you need to allow some of these channels in order to use your instance in the manner that you want to. In our case, we add a new rule that allows access to our instance through the browser, and hence select
All TCP as its type. Be warned, however, that I am definitely not an expert on computer security issues, so this rule might be overly permissive. Ultimately, you only want to permit communication through https on a given port (8998 in our example), and there might be better ways to achieve this. Still, I will add an additional layer of protection by restricting permission to my current IP only. If you do this, however, you will most likely need to adapt your settings each time you restart your computer, as your IP address might change. Anyways, in many cases your instance will only live temporarily, and will be deleted after completion of a certain task. Your security concerns for such a short-lived instance hence might be rather limited. Either way, you can store your current settings as a new security group, so that you can select the exact same settings for some of your future instances. I will choose “IJulia all tcp” as the name of the group.
Once you are finished with your security settings, click
Review and Launch and confirm your current settings with
Launch. This will bring up a dialogue that lets you exchange ssh keys between remote computer and local computer. You can either create a new key pair or choose an already existing pair. In any case, make sure that you will really have access to the chosen key afterwards, because it will be the only way to get full access to your remote. If you create a new pair, the key will automatically be downloaded to your computer. Just click
Launch Instance afterwards, and your instance will be available in a couple of minutes.
4 Installing and configuring IJulia
Once your instance is displayed as running, you can connect to it through ssh. For this, we need to use the ssh key that we got during the setup, which I stored in my download folder under the name aws_ijulia_key.pem. First, however, we need to change the permissions of the key, so that it cannot be accessed by everyone. This step is mandatory – otherwise the ssh connection will fail.
chmod 400 ~/Downloads/aws_ijulia_key.pem
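If you are curious what this mode actually does, here is a small rehearsal on a stand-in file (the file name is made up):

```shell
# Create a stand-in for the downloaded key file
touch demo_key.pem
chmod 400 demo_key.pem
# The file is now readable by the owner only; ssh refuses to use
# key files with looser permissions
ls -l demo_key.pem
```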
Now we are ready to connect. For this, we need to point to the ssh key through the option
-i. The address of your instance is composed of two parts. First, the username, which is
ubuntu for the case of an Ubuntu operating system. And second, we need to copy the Public DNS of the instance, which we can find listed in the description when we select the instance in the AWS Management Console.
ssh -i ~/Downloads/aws_ijulia_key.pem ubuntu@[Public DNS]
Now that we have access to a terminal within our instance, we can use it to install all required software components for IJulia. First, we install Julia from the julianightlies repository.
sudo add-apt-repository ppa:staticfloat/julia-deps
sudo add-apt-repository ppa:staticfloat/julianightlies
sudo apt-get update
sudo apt-get -y install julia
We could now already use Julia through the terminal. Next, we need to install a recent version of IPython, which we can do using pip.
sudo apt-get -y install python-dev ipython python-pip
sudo pip install --upgrade ipython
IJulia also requires some additional packages:
sudo pip install jinja2 tornado pyzmq
At last, we need to install the IJulia package:
julia -e 'Pkg.add("IJulia")'
At this step, we have installed all required software components, which we now only need to configure correctly.
The first thing we need to do is set up a secure way of connecting to IJulia through our browser. For this, we need to set a password that we will be required to type in when we log in. We will create this password within ipython. To start ipython on your remote, type
ipython
Within ipython, type the following
from IPython.lib import passwd
passwd()
Now type in your password. For this demo, I chose ‘juliainthecloud’. IPython then prints out the encrypted version of this password, which in my case is
sha1:32c75aa1c075:338b3addb7db1c3cb2e10e7b143fbc60be54c1be
You will need this hashed password later, so you should temporarily store it in your editor.
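In case you wonder what this string encodes: it has the form sha1:&lt;salt&gt;:&lt;digest&gt;. The following Python sketch mimics the scheme for illustration only – it is not IPython’s actual implementation:

```python
import hashlib
import random

def passwd_sketch(passphrase):
    """Sketch of the 'sha1:<salt>:<digest>' password format.
    A random hex salt is appended to the passphrase before hashing,
    so the same password yields a different string every time."""
    salt = '%012x' % random.getrandbits(48)  # random 12-digit hex salt
    digest = hashlib.sha1((passphrase + salt).encode('utf-8')).hexdigest()
    return ':'.join(('sha1', salt, digest))

print(passwd_sketch('juliainthecloud'))
```

When you later log in, the server repeats the hash with the stored salt and compares digests, so your plain password is never kept on disk.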
We can now exit ipython again:
exit()
As a second layer of protection, we need to create a certificate which our browser will automatically use when it communicates with our IJulia server. This way, we will not have to send our password in plain text to the remote computer, but the connection will be encrypted right away. In your remote terminal, type:
mkdir ~/certificates
cd ~/certificates
openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
You are asked to add some additional information to your certificate, but you can also skip all fields by pressing enter.
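If you want to rehearse this step without touching your real certificate, you can generate a throwaway one non-interactively (the names here are made up; -subj pre-fills the fields so no questions are asked) and inspect what the browser will later be shown:

```shell
# Throwaway self-signed certificate, key and cert in one file
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -keyout demo_cert.pem -out demo_cert.pem -subj "/CN=ijulia-demo"
# Show the subject and the validity window of the certificate
openssl x509 -in demo_cert.pem -noout -subject -dates
```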
Finally, we now only need to configure IJulia. For this, we add the following lines to the julia profile within ipython. This profile should already have been created through the installation of the IJulia package, so that we can open the respective file (typically ~/.ipython/profile_julia/ipython_notebook_config.py) with vim. Press
i for insert mode, and after
c = get_config()
add the following lines (which I myself did get from here):
# Kernel config
c.IPKernelApp.pylab = 'inline'  # if you want plotting support always
# Notebook config
c.NotebookApp.certfile = u'/home/ubuntu/certificates/mycert.pem'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:32c75aa1c075:338b3addb7db1c3cb2e10e7b143fbc60be54c1be'
# It is a good idea to put it on a known, fixed port
c.NotebookApp.port = 8998
Make sure that you did set
c.NotebookApp.password to YOUR hashed password from before.
To leave vim again, press escape to end insert mode and
:wq to save and quit.
5 Starting and connecting to IJulia server
We are done with the configuration, so that we now can connect to our IJulia server. To start the server on the remote, type
sudo ipython notebook --profile=julia
# alternatively, keep the server running after logout:
# nohup ipython notebook --profile julia > ijulia.log &
And to connect to your server, open a browser on your local machine and go to
https://[Public DNS]:8998
This will pop up a warning that the connection is untrusted. Don’t worry, this is just because we are using the encryption that we set up ourselves, and this certificate is not officially registered. Once you allow the connection, an IJulia logo appears, and you will be prompted for your password.
Note, however, that the Public DNS of your instance changes whenever you stop your instance from running. Hence, you need to type a different URL when you connect to IJulia the next time.
6 Store as AMI
Now that we have done all this work, we want to store the exact configuration of our instance as our own AMI, to be able to easily load it on any EC2 instance type in the future. For this, we simply stop our running instance in the AWS Management Console (after we also stopped the running IJulia server, of course). Right-click on it as soon as the Instance State says stopped, select
Create Image, choose an appropriate name for your configuration, and you’re done. The next time you launch a new instance, you can simply select this AMI instead of the Ubuntu 14.04 AMI that we used previously.
However, creating an image of your current instance requires you to store a snapshot of your volume in your Elastic Block Store, which might cause additional costs depending on your AWS profile. Using Amazon’s free usage tier, you have 1 GB of free snapshot storage at your disposal. Due to compression, your snapshot should be smaller in storage size than the original 8 GB of your volume. Nevertheless, I do not yet know where to look up the overall storage size of my snapshots (let alone the size of individual snapshots, if you have multiple). I can only say that Amazon has not billed me for anything yet, so I assume that the snapshots’ size is smaller than 1 GB.
7 Add volume
In order to have your data available for multiple instances, you should store it on a separate volume, which then can be attached to any instance you like. This also allows easy duplication of your data, so that you can attach one copy of it to your steadily running t2.micro instance, while the other one gets attached to some temporary instance with greater computational power in order to perform some computationally demanding task.
For creation and mounting of a data volume we largely follow the steps shown in the documentation.
On the left side panel, click on
Volumes, then select
Create Volume. As size we choose 5 GB, as this should leave us with enough free storage size in order to simultaneously run two instances, each with its own copy of personal data mounted. Also, we choose to not encrypt our volume, as t2.micro instances do not support encrypted volumes.
Now that the volume is created, we need to attach it to our running instance. Right-click on the volume and select
Attach Volume. Type in the ID of your running instance, and as your device name choose
/dev/sdf as proposed for linux operating systems.
Connect to your running instance with ssh, and display all volumes with the
lsblk command in order to find the volume name. In my case, the new attached volume displays as
xvdf. Now, since we created a new volume from scratch, we additionally need to format it. Be careful to not conduct this step if you attach an already existing volume with data, as it would delete all of your data! For my volume name, the command is
sudo mkfs -t ext4 /dev/xvdf
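Since a wrong device name in this step is destructive, it can be worth rehearsing it on a loopback image file first, where a typo cannot harm real data (the file name is made up):

```shell
# Create a small 16 MB image file as a stand-in for the EBS device
dd if=/dev/zero of=ebs_demo.img bs=1M count=16 status=none
# Format it with ext4; -F allows writing to a regular file
mkfs -t ext4 -F -q ebs_demo.img
# Inspect the fresh filesystem without mounting it (mounting needs root)
dumpe2fs -h ebs_demo.img 2>/dev/null | head -n 4
```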
The device can now be mounted anywhere in your file system. Here we will mount it as folder
research in the home directory.
sudo mkdir ~/research
sudo mount /dev/xvdf /home/ubuntu/research
And, in case you need it at some point, you can unmount it again with
sudo umount /dev/xvdf
Well, now we know how to get an IJulia server running. To be honest, however, this really is only half of the deal if you want to use it for more than just some on-the-fly experimenting. For more elaborate work, you need to be able to also conveniently perform the following operations at your Amazon remote instance:
- editing files on the remote
- copying files to and from the remote
There are basically two ways to edit files on the remote. First, you use some way of editing files directly through the terminal. For example, you could open a vim session in your terminal on the remote itself. Second, you work with a local editor that allows you to access files on a remote computer via ssh. As an emacs user, I stick with this approach, as emacs tramp allows working with remote files as if they were local.
For synchronization of files to the remote I use git, as I host all of my code on github and bitbucket. However, I also want my remote to communicate with my code hosting platforms through ssh without any additional password query. In principle, one could achieve this through the standard procedure: creating a key pair on the remote, and copying the public key to the respective file hosting platform. However, this could become annoying, given that we have to perform it whenever we launch a new instance in the future. There is a much easier way to enable communication between remote server and github / bitbucket: using SSH agent forwarding. What this does, in simple terms, is passing your local ssh keys on to your remote computer, which is then allowed to use them for authentication itself. To set this up, we have to add some lines to
~/.ssh/config (take a look at this post for some additional things that you could achieve with your
config file). And since we are already editing the file, we also take the chance to set an alias aws_julia_instance for our server and automatically link to the required login ssh key file. Our
config file now is
Host aws_julia_instance
    HostName ec2-54-76-169-215.eu-west-1.compute.amazonaws.com
    User ubuntu
    IdentityFile ~/Downloads/aws_ijulia_key.pem
    ForwardAgent yes
Remember: you have to change the HostName in your
config file whenever you stop your instance, since the Public DNS changes.
Given the settings in our
config file, we can now access our remote through ssh with the following simple command:
ssh aws_julia_instance
And, given that we have set up ssh access to github and bitbucket on our local machine, we can use these keys on the remote as well.
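Agent forwarding can only pass on keys that your local ssh agent actually holds. A small rehearsal with a throwaway key (the file name is made up) shows how to check this:

```shell
# Start a local agent and load a throwaway key into it
eval "$(ssh-agent -s)" > /dev/null
ssh-keygen -q -t ed25519 -N "" -f demo_agent_key
ssh-add demo_agent_key 2> /dev/null
# This list of identities is what a forwarded remote session would see
ssh-add -l
```

Run ssh-add -l on the remote after logging in: if your keys show up there, forwarding works and git operations against github / bitbucket will authenticate with them.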
As a last convenience for the synchronization of files on your remote I recommend the use of myrepos, which allows you to easily keep track of all modifications in a whole bunch of repositories. You can install it with
mkdir ~/programs
cd ~/programs
git clone https://github.com/joeyh/myrepos.git
cd /usr/bin/
sudo ln -s ~/programs/myrepos/mr
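myrepos is driven by a .mrconfig file in your home directory. A minimal sketch, with hypothetical repository names:

```
[research/project1]
checkout = git clone git@github.com:yourname/project1.git

[research/project2]
checkout = git clone git@bitbucket.org:yourname/project2.git
```

With this in place, mr update pulls all listed repositories at once, and mr status shows uncommitted changes across all of them.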
Nowadays, a lot of everyday research time is spent in front of computers. Especially in data analysis, of course, computers are an elementary part of science. Nevertheless, most researchers still seem to have never received real training in computer science, and tend to just develop their own manners for getting the job done.
Greg Wilson, together with the other members of the software training group Software Carpentry, devotes his time to promoting best practices of the computer science community into other fields of the scientific community. I highly recommend his newly published paper Best Practices for Scientific Computing, in which he lists a number of recommendations for an improved workflow in scientific computing. Also, make sure to check the Software Carpentry homepage, which provides a number of short video tutorials for a bunch of topics that are fundamental to any data analysis.
Each time you log into one of your countless online accounts, you probably also hear that voice in your head, nagging because of your poor internet security practices. Finally, however, I decided to get rid of it, once and for all. And this is how it’s done: password managers.
In this final part on git for data analysis, we now want to draw from the experiences that we have made in the previous posts. Putting the individual parts together, we want to derive a robust workflow with git that allows for effective research and collaboration in data analysis. This workflow will be called Subdivide Workflow.
In the first part of this blog post series on git for data analysis, we decided to handle all data files within git for one reason: to comply with the requirements of GNU Make. Without the intention to use GNU Make, it would also be appropriate and probably even more convenient to synchronize data files externally, with different software like Dropbox.
Hence, when particularly tailoring our workflow to the requirements of GNU Make, we of course first want to assure ourselves that git and GNU Make really do work together seamlessly. Especially, we want to test whether git manages timestamps such that re-computation with GNU Make works as expected.
This is the second part on data analysis with git, where we want to take a more detailed look at how data files are handled. First, we want to assure ourselves of the problems that arise from large data files. Afterwards, we will see how a once added data file could be completely removed from the repository again.
In the last post, I already mentioned some of the advantages of version control in general, and git in particular. However, git was originally developed in order to facilitate collaboration at large scale open source projects, and hence is not explicitly designed for data analysis. Thus, its main underlying concept (every file you ever put under version control will always remain there forever) could become slightly adverse in data analysis if we do not adjust our workflow accordingly. Imagine a situation in data analysis where you have a project with some data files of significant size that change frequently. If you put these data files under version control, you will effectively store not only the latest and most updated version of your data, but implicitly also all of your previous versions. Hence, you will soon bloat the required disk space of your project! Accordingly, in the next posts we will gradually derive a workflow for git which is particularly tailored for data analysis.
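A quick shell experiment illustrates the problem (all file and repository names are made up). Two versions of a 2 MB binary file are committed, and the repository keeps both:

```shell
# Fresh repository with a binary data file
git init -q demo_repo && cd demo_repo
dd if=/dev/urandom of=data.bin bs=1M count=2 status=none
git add data.bin
git -c user.name=demo -c user.email=demo@example.org commit -qm "data v1"
# Replace the file completely and commit again
dd if=/dev/urandom of=data.bin bs=1M count=2 status=none
git add data.bin
git -c user.name=demo -c user.email=demo@example.org commit -qm "data v2"
# Roughly 4 MB: random data does not compress, and both versions live on
du -sh .git
```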
If you have never used version control software for your projects, you might ask yourself, what benefits there possibly could be. The main characteristic of version control is that everything you ever have written in your project will be stored forever. Hence, at any time you could easily roll back to any prior point of your project, and check out the exact state that your project previously was in. Still, of course, it might not be immediately clear why this should be of any great use. I mean, who wants to roll back the project to some long past unfinished step in the middle of development? Well, let’s see…
To me, the main benefit of version control lies in collaboration. Compared, for example, with simple synchronization software like Dropbox, version control is just a lot more robust for collaboration. The reason lies in the conceptual design of synchronizing software. While Dropbox is incredibly powerful and easy to use when it comes to synchronizing and sharing content amongst collaborators, it is simply not designed for working on the same file simultaneously! Just assume two people applying changes to a given line simultaneously – how would Dropbox know which changes should be kept? Such conflicting changes will always require human interaction in order to be resolved satisfactorily. The goal of any software for collaboration, hence, must be to make the operation of resolving conflicts as cheap as possible. And this is exactly what git does, since it was explicitly built for this purpose. Linus Torvalds, the chief architect of the Linux kernel, created git in order to improve collaboration on this open source project with participation of thousands of individual developers.
However, don’t be fooled at this point into thinking that version control is for multi-person projects only. There is a different kind of multi-dimensionality involved in nearly every single-person project as well: using multiple computers. So far, I myself use git simply to keep my research synchronized across my work computer, private computer and netbook. Of course, this is something that Dropbox could do quite efficiently as well, as the number of conflicting changes should be rather negligible (basically, in single-person projects you only run into conflicting changes when you accidentally start editing a file on the next computer while Dropbox still hasn’t finished synchronizing – something that usually happens with either low bandwidth or large files). However, even without conflicting changes, you will still profit from some of the other features of version control.
Through branching, for example, git allows development of new code in a sandbox environment: without a risk of breaking functioning code in the current stable project version. For example, this becomes important whenever you choose to improve the performance of an already working part of code. You do not want to lose your old version, just to have a safety net in case that the experiment goes wrong. This becomes especially useful whenever you work out improvements on some code base which simultaneously is required to be in a reliable state for a second project as well. With branches, you can switch to the stable old version in a second. Or, if you want to extend some project that is currently under revision, just set a mark at the state of the project when you did hand it in. When you get back the paper with remarks on refinements, you can easily work in the refinements in a second branch, while your project already continues to evolve in some different direction on the main branch.
So, in my opinion, the largest benefit of version control is that it allows robust and easy collaboration. And even if you do not participate in a collaborative project right now, it might be a good idea to already get prepared for when the time comes. Maybe getting accustomed to a workflow with version control will even lower the barriers for future collaborations.
But why exactly should you prefer git? What about Mercurial, CVS, Subversion and the like? Well, honestly speaking, I might not be the best person to answer this question, as I have only worked with git so far. However, from what I have heard, git allegedly is the unbeaten champion when it comes to features like branching. It is extremely fast and reliable, and heavily used in open source software development.
But if you still need some further persuasion, and probably some more expertise as well, you better listen to the argumentation of the inventor of git himself. There exists a quite entertaining Google Tech Talk, where Linus Torvalds (“a man of strong opinions”, as he describes himself) alternately describes the developers of other version control software as either “morons” or “incredibly stupid people”.
Now that you are convinced: you can find plenty of resources to get started with git in the documentation section on the official git homepage. Furthermore, you can have free public repositories hosted by github, and free repositories for up to 5 simultaneous users at bitbucket. Using these external storage providers, git immediately becomes an efficient backup system as well. After you are sufficiently accustomed to the basic commands, there is plenty of inspiration for how to make the most of git in your daily workflow. Either take a look at the different git workflows presented on the Atlassian homepage, or read the blog post on A successful Git branching model.