Category Archives: science

How should science be? Best practices, publishing, reproducible research, …

Germany most likely to win Euro 2016

After World Cup 2014 we finally are facing the next spectacular football event now: Euro 2016. With billions of football fans spread all over the world, football still seems to be the single most popular sport. Might have something to do with the fact that football is a game of underdogs: David could beat Goliath any day. Just take a look at the marvelous story of underdog Leicester City in this year’s Premier League season. It is this high uncertainty in future match outcomes that keeps everybody excited and puzzled about the one question: who is going to win?

A question, of course, that just feels like a perfectly designed challenge for data science, with an ever increasing wealth of football match statistics and other related data that is freely available nowadays. It comes as no surprise, hence, that Euro 2016 also puts statistics and data mining into the spotlight. Next to “who is going to win?”, the second question is: “who is going to make the best forecast?”

Besides the big players of the industry, whose predictions traditionally get the most of the media attention, there also is a less known academic research group that already had remarkable success in forecasting in the past. Andreas Groll and Gunther Schauberger from Ludwig-Maximilians-University, together with Thomas Kneib from Georg-August-University, again did set out to forecast the next Euro champion, after they already were able to predict the champion of Euro 2012 and World Cup 2014 correctly.

Based on publicly available data and the gamlls R-package they built a model to forecast probabilities of win, tie and loss for any game of Euro 2016 (Actually, they even get probabilities on a more precise level with an exact number of goals for both teams. For more details on the model take a look at their preliminary technical report).

This is what their model deems as most likely tournament evolution this time:

em_results_group em_results_tree

The model not only takes into account individual team strengths, but also the group constellations that were randomly drawn and also have an influence on the tournament’s outcome. This is what their model predicts as probabilities for each team and each possible achievement in the tournament:

em_results

So good news for all Germans: after already winning World Cup 2014, “Die Mannschaft” seems to be getting its hands on the next big international football title!

Well, do they? Let’s see…

Mainstream media usually only picks up on the prediction of the Euro champion – the “point forecast”, so to speak. Keep in mind that although this single outcome may well be the most likely one, it still is quite unlikely itself with a probability of 21.1% only. So from a statistical point of view, you basically should not judge the model only on grounds of whether it is able to predict the champion again, as this would require a good portion of luck, too. Just imagine the probability of the most likely champion was 30%, then getting it correctly three times in a row merely has a probability of (0.3)³=0.027 or 2.7%. So in order to really evaluate the goodness of the model you need to check its forecasting power on a number of games and see whether it consistently does a good job, or even outperforms bookmakers’ odds. Although the report does not list the probabilities for each individual game, you still can get a quite good feeling about the goodness of the model, for example, by looking at the predicted group standings and playoff participants. Just compare them to what you would have guessed yourself – who’s the better football expert?

 

Advertisements

spotted elsewhere: SlideRule

Being a big Massive Open Online Course (MOOC) and Coursera fan already for quite some time, I stumbled upon another internet platform that promises to bring video education to you just today: SlideRule. It searches several online course providers and “helps you discover the world’s best online courses in every subject”. In extension, there also is iversity, which is not yet searched by SlideRule. Have fun studying!

spotted elsewhere: best practices for scientific computing

Nowadays, a lot of time of everyday research is spent in front of computers. Especially in data analysis, of course, computers are an elementary part of science. Nevertheless, most researchers still seem to have not gotten a real training in computer science, but tend to just develop their own manners for getting the job done.

Greg Wilson, together with the other members of the software training group Software Carpentry, devotes his time to promoting best practices of the computer science community into other fields of the scientific community. I highly recommend his newly published paper Best Practices for Scientific Computing, in which he lists a number of recommendations for an improved workflow in scientific computing. Also, make sure to check the Software Carpentry homepage, which provides a number of short video tutorials for a bunch of topics that are fundamental to any data analysis.

spotted elsewhere: animated education videos

When it comes to education, I believe in two simple things:

  • the more entertaining and exciting we make education, the more people will focus while learning, and the better they will concentrate
  • the more senses we manage to involve, the better people will remember what they have learned

One simple way of dealing with both issues is through animated videos. Such animations can be a quite entertaining way to transmit knowledge, while simultaneously addressing both auditory and visual senses.

For a prime example of such animated videos, take a look at the RSA Animate video on Changing Education Paradigms, which is only one of a series of many animated videos of the Royal Society for the encouragement of Arts, Manufactures and Commerce.

Even complex economic topics can be presented quite entertaining this way. Simply take a look at the animated video on How the Economic Machine Works by Ray Dalio:

spotted elsewhere: how science goes wrong

A friend of mine recently pointed out to me an article from The Economist that lists some of the deficiencies of the current scientific system. As stated in the article, a large fraction of the results published in scientific journals could not be verified through reproduction:

Last year researchers at one biotech firm, Amgen, found they could reproduce just six of 53 “landmark” studies in cancer research. Earlier, a group at Bayer, a drug company, managed to repeat just a quarter of 67 similarly important papers.

As explanation, the article identifies a combination of two factors. First, it blames the current “publish or perish” system, where solely publications matter, and “verification does little to advance a researcher’s career”. And second, it points out the bias of scientific journals to publish positive results:

“Negative results” now account for only 14% of published papers, […]

Given these circumstances,

Careerism also encourages exaggeration and the cherry-picking of results.

spotted elsewhere: websites that make you cleverer

These days, I stumbled upon a page on Inktank.fi that contains an extensive list of websites with educational content. Of course, I didn’t have the time to check out all of them yet. However, as some of my favorite websites are on the list, I assume that the others could be worth a glance, too. To give you a short teaser, some websites on the list are:

khanacademy.org
Watch thousands of micro-lectures on topics
ranging from history and medicine to chemistry and computer science.
coursera.org
Educational site that works with universities to get
their courses on the Internet, free for you to use.
ted.com
Collection of TED (Technology, Entertainment and Design)
talks in which knowledgeable speakers address a variety
of topics in short videos (< 18 minutes)
codecademy.com
Website packed with introductory courses for
various programming languages and web
technologies.

spotted elsewhere: social media and open science

A presentation of Titus Brown about using social media and open science to sharpen your academic profile.

Excellent way to enhance your academic career; network to find, discuss, and explore alternative career options; and build a life you find to be worth living

Some of the sites that he calls attention to:

Programming style guidelines: R and MATLAB

“Any fool can write code that a computer can understand. Good programmers write code that humans can understand.” (Martin Fowler)

This post addresses the issue of programming style guidelines. It highlights some of the most important recommendations found in commonly accepted sources on style guidelines, and tries to find a compromise between style conventions of R and MATLAB communities.

1 Motivation

I think one of the differences between statisticians and computer scientists is that many statisticians tend to think about coding just as a means to an end. Once equipped with theories and models, their curiosity lies with the validation of their ideas on data – no matter how this will be achieved exactly. And if pen and paper weren’t such a cumbersome method on the large data-sets of today, it probably still would be a most highly regarded approach in statistics, too.

Of course, this description of statisticians so far is quite a bit exaggerated. However, as a statistician myself, I sometimes find myself behaving very similar to what I did describe so overstated above: the data analysis part – stirring my curiosity – increasingly becomes the center of my attention, while implementation details get put further into the background. But only until I need to step through some older parts of code again, and suddenly realize, that I have problems understanding even my own code. Not to mention the effort required whenever some bug needs to be tracked down in uncommented and unstructured code. It is always in these situations that one gets a reminder on one of the most important lessons in data analysis: there is more to good coding than just coming up with a solution! What we should be striving for is producing “code that is more likely to be correct, understandable, sharable and maintainable” (Richard Johnson). Or, using the words of Martin Fowler: “Any fool can write code that a computer can understand. Good programmers write code that humans can understand.”

Against this background, I want to share some sources and recommendations about programming style conventions that I have stumbled upon so far. Thereby, I will pick up only on those conventions in more detail, where I think my own coding style requires the most improvement. Furthermore, I will also try to come up with some compromise on these recommendations, in order to guarantee a minimal consistency between the two programming languages that I use for statistical applications the most: R and MATLAB. In addition, it might be worthwhile for you to even go beyond this post and – at least once – take a look into one of the original and more elaborate programming style guidelines yourself:

R guideline resources:

R Coding Conventions by Henrik Bengtsson

Google’s R Style Guide

MATLAB guideline resources:

MATLAB Programming Style Guidelines by Richard Johnson

MATLAB Programming Style Guide Wiki

2 Variable / function names

For variable and function names, I would recommend to follow a mixed case convention, starting with lowercase, such as sortMatrixColumns. I strongly discourage the use of sort.matrixcolumns in R, since a “.” is not allowed for names in MATLAB, and even in R it can be confused with the method of an object. So far, I was always using underscore to decompose names into meaningful parts (e.g. sort_matrix_columns). However, I found that both R and MATLAB style guides discourage this style – although I must admit, that I did not encounter any problems with it so far. The only drawback I could come up with is that any TeX-based interpreter could take underscore as indicator of subindices. For example, some editors like emacs might display the part after “_” slightly displaced downwards. Opposed to that, however, it is a lot easier to customize an editor to allow deletion of word parts separated by underscore. Replacing daily_returns through monthly_returns without touching the second word part requires less tweaking than for the case of dailyReturns and monthlyReturns. In other words, you can easily customize an operation “word-deletion” to only remove word parts separated by underscore gradually.

In addition, however, I originally would have liked to see some difference between function and variable names. While both can be distinguished rather easily in R and Julia, since round brackets are used for function arguments and squared brackets for variable indices, it could provide additional visual help in MATLAB, where round brackets are used for both functions and variables. Nevertheless, this is in no way common convention in any of the communities, so that I will not opt for taking this approach alone by myself.

Pluralization of variable names should be noted more clearly as through the appending of “s”. Distinguishing between date and dates is far less obvious than it is between date and dateArray. However, I must admit that I do not intend to go for the long extension with …Array, but I rather will use an abbreviation like …Arr or …Matr for …Matrix. Most style guides, however, explicitly advise to refrain from using such abbreviations.

Negated logical operators should not be used: isFound is preferred over isNotFound.

3 Comments

Most users are probably very aware of the importance of comments for the understandability of code that should be shared with others. In addition to that, comments should also comply with some minimum standards.

Important variables should be introduced and commented near the start of the file. If you initialize a variable stockReturns, you might want to give it some further explanation:

stockReturns <- matrix(NA, ncol = 200, nrow = 1000)
## captures logarithmic daily returns in percent, columns
## corresponding to individual stocks, rows to dates

The very bare minimum of any function documentation should contain one line on the function’s main task, and documentation of arguments (inputs) and returns (outputs). For example, from Google’s R Style Guide:

CalculateSampleCovariance <- function(x, y, verbose = TRUE) {
## Computes the sample covariance between two vectors.
##
## Args:
##   x: One of two vectors whose sample covariance is to be calculated.
##   y: The other vector. x and y must have the same length, greater than one,
##      with no missing values.
##   verbose: If TRUE, prints sample covariance; if not, not. Default is TRUE.
##
## Returns:
##   The sample covariance between x and y.

Also, if the function implies any side effects (actions besides its output: plotting, printing, creating files), these need to be mentioned in its description.

4 Control Flow

R: “If” statements should have a white space in front of the condition, and the “else” part should be surrounded with braces:

if (condition) {
  one or more lines
} else {
  one or more lines
}

MATLAB: “switch” variables should always be a string. Furthermore, any switch block should include an otherwise expression that captures all remaining cases:

switch switch_expression
   case case_expression
      statements
   case case_expression
      statements
    :
   otherwise
      statements
end

Also, end lines in MATLAB can have comments.

5 White space

R: The google style guidelines advise placing spaces around all binary operators (=, +, -, <-). The only exception could be spaces around “=” when it refers to passing arguments in a function call. Commas should always be followed by space, while you never should put a space in front of a comma.

MATLAB: Logical operators like “=”, “&” and “|” should always be surrounded by white spaces. Conventional operators and commas generally could be followed by spaces. In my opinion, this is something that usually provides additional clarity, so that I tend to use spaces most of the time. Especially for the case of commas, I stick with the same behavior that is recommended in the R style guidelines.

6 Errors

Errors should be raised with stop() in R.

7 Summary

Like I said, this list is far from being exhaustive, and mainly captures the conventions that were unknown to me so far. There are a lot of additional conventions in the original sources which I find very helpful and recommendable. In order to see which points deserve the most importance in MATLAB according to my opinion, you could also take a look at my annotated version of Richard Johnson’s MATLAB Programming Style Guidelines.

The nine circles of scientific hell