Programming style guidelines: R and MATLAB

“Any fool can write code that a computer can understand. Good programmers write code that humans can understand.” (Martin Fowler)

This post addresses programming style guidelines. It highlights some of the most important recommendations from commonly accepted style guides and tries to find a compromise between the style conventions of the R and MATLAB communities.

1 Motivation

I think one of the differences between statisticians and computer scientists is that many statisticians tend to think of coding as merely a means to an end. Once equipped with theories and models, their curiosity lies in validating their ideas on data, no matter how exactly this is achieved. And if pen and paper weren’t such a cumbersome method for today’s large data sets, it would probably still be a highly regarded approach in statistics, too.

Of course, this description of statisticians is quite exaggerated. However, as a statistician myself, I sometimes catch myself behaving much like the caricature above: the data analysis part, which stirs my curiosity, increasingly becomes the center of my attention, while implementation details are pushed further into the background. But only until I need to step through some older parts of code again and suddenly realize that I have problems understanding even my own code. Not to mention the effort required whenever some bug needs to be tracked down in uncommented and unstructured code. It is always in these situations that one is reminded of one of the most important lessons in data analysis: there is more to good coding than just coming up with a solution! What we should be striving for is producing “code that is more likely to be correct, understandable, sharable and maintainable” (Richard Johnson). Or, in the words of Martin Fowler: “Any fool can write code that a computer can understand. Good programmers write code that humans can understand.”

Against this background, I want to share some sources and recommendations on programming style conventions that I have come across so far. I will go into detail only on those conventions where I think my own coding style needs the most improvement. Furthermore, I will try to find a compromise between these recommendations, in order to guarantee a minimum of consistency between the two programming languages that I use most for statistical applications: R and MATLAB. It may also be worthwhile for you to go beyond this post and, at least once, take a look at one of the original and more elaborate programming style guidelines yourself:

R guideline resources:

R Coding Conventions by Henrik Bengtsson

Google’s R Style Guide

MATLAB guideline resources:

MATLAB Programming Style Guidelines by Richard Johnson

MATLAB Programming Style Guide Wiki

2 Variable / function names

For variable and function names, I recommend a mixed-case convention starting with lowercase, such as sortMatrixColumns. I strongly discourage names like sort.matrixcolumns in R, since a “.” is not allowed in MATLAB names, and even in R it can be confused with a method of an object. So far, I have always used underscores to split names into meaningful parts (e.g. sort_matrix_columns). However, I found that both R and MATLAB style guides discourage this style, although I must admit that I have not encountered any problems with it so far. The only drawback I could come up with is that any TeX-based interpreter may take an underscore as an indicator of a subscript. For example, some editors like Emacs might display the part after “_” slightly lowered. On the other hand, it is a lot easier to customize an editor to delete word parts separated by underscores. Replacing daily_returns with monthly_returns without touching the second word part requires less tweaking than for dailyReturns and monthlyReturns. In other words, you can easily customize a “word deletion” operation so that it removes one underscore-separated word part at a time.
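As a small illustration of the naming advice above, the following R sketch defines a function with a lowerCamelCase name and shows how a dotted name can be misread: R's S3 system interprets names of the form generic.class as methods. The helper sortMatrixColumns, the class name matrixcolumns and the sample data are all made up for this example.

```r
## lowerCamelCase name, as recommended above (hypothetical helper)
sortMatrixColumns <- function(x) {
  ## sorts each column of a numeric matrix in increasing order;
  ## assumes that x contains no missing values
  apply(x, 2, sort)
}

returns <- matrix(c(3, 1, 2, 9, 7, 8), nrow = 3)
sortedReturns <- sortMatrixColumns(returns)

## a dotted name is read by R as "method sort for class matrixcolumns"
## (the class name is made up for this illustration)
sort.matrixcolumns <- function(x, decreasing = FALSE, ...) {
  "S3 method called"
}

obj <- structure(list(), class = "matrixcolumns")
dispatchResult <- sort(obj)  # dispatches to sort.matrixcolumns
```

So a function that was only meant to carry a descriptive name can end up being called implicitly by sort(), which is one more reason to prefer the mixed-case form.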

In addition, I originally would have liked to see some visual difference between function and variable names. Both can be distinguished rather easily in R and Julia, since round brackets are used for function arguments and square brackets for variable indices. In MATLAB, however, round brackets are used for both function calls and variable indexing, so a naming distinction could provide additional visual help there. Nevertheless, this is in no way a common convention in any of the communities, so I will not adopt this approach on my own.

Plural variable names should be marked more clearly than by appending an “s”. The difference between date and dates is far less obvious than the one between date and dateArray. However, I must admit that I do not intend to use the long suffix …Array; I will rather use abbreviations like …Arr, or …Matr for …Matrix. Most style guides, however, explicitly advise against such abbreviations.

Negated boolean variable names should be avoided: isFound is preferred over isNotFound.
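The following R sketch, with made-up dates, illustrates why the positive name reads more easily: as soon as the negated variable is itself negated, the reader has to untangle a double negative.

```r
## hypothetical data for illustration
dateArr <- as.Date(c("2013-08-14", "2013-08-15", "2013-08-16"))
targetDate <- as.Date("2013-08-15")

## positive name: the condition reads directly
isFound <- targetDate %in% dateArr
if (isFound) {
  message("date found")
}

## negated name: "not found" has to be negated again
isNotFound <- !(targetDate %in% dateArr)
if (!isNotFound) {
  message("date found")
}
```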

3 Comments

Most users are probably well aware of how important comments are for the understandability of code that is shared with others. Beyond that, comments should also meet some minimum standards.

Important variables should be introduced and commented near the start of the file. If you initialize a variable stockReturns, you might want to give it some further explanation:

stockReturns <- matrix(NA, ncol = 200, nrow = 1000)
## captures logarithmic daily returns in percent, columns
## corresponding to individual stocks, rows to dates

At the very least, any function documentation should contain one line on the function’s main task, plus documentation of its arguments (inputs) and return values (outputs). For example, from Google’s R Style Guide:

CalculateSampleCovariance <- function(x, y, verbose = TRUE) {
  ## Computes the sample covariance between two vectors.
  ##
  ## Args:
  ##   x: One of two vectors whose sample covariance is to be calculated.
  ##   y: The other vector. x and y must have the same length, greater than one,
  ##      with no missing values.
  ##   verbose: If TRUE, prints sample covariance; if not, not. Default is TRUE.
  ##
  ## Returns:
  ##   The sample covariance between x and y.
  ##
  ## Minimal body, added as a sketch so that the example is complete:
  covariance <- cov(x, y)
  if (verbose) {
    cat("Covariance = ", round(covariance, 4), ".\n", sep = "")
  }
  return(covariance)
}

Also, if the function has any side effects (actions besides returning its output: plotting, printing, creating files), these need to be mentioned in its description.
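One possible way to document side effects is an additional section in the header comment, in the same layout as the Args and Returns sections. The following is only a sketch: the function saveColumnMeans, its arguments and the sample data are hypothetical, and the “Side effects” heading is my own choice rather than something prescribed by the style guides.

```r
saveColumnMeans <- function(x, path) {
  ## Computes column means of a numeric matrix and writes them to a CSV file.
  ##
  ## Args:
  ##   x: Numeric matrix, columns corresponding to individual stocks.
  ##   path: Path of the CSV file to be written.
  ##
  ## Returns:
  ##   Named numeric vector of column means (invisibly).
  ##
  ## Side effects:
  ##   Creates or overwrites the file given by path.
  columnMeans <- colMeans(x)
  write.csv(data.frame(stock = names(columnMeans), meanReturn = columnMeans),
            file = path, row.names = FALSE)
  invisible(columnMeans)
}

## usage with made-up data and a temporary file
stockReturns <- matrix(c(0.5, -0.2, 0.1, 0.3), nrow = 2,
                       dimnames = list(NULL, c("stockA", "stockB")))
outputPath <- tempfile(fileext = ".csv")
columnMeans <- saveColumnMeans(stockReturns, outputPath)
```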

4 Control Flow

R: “if” statements should have a space between “if” and the opening parenthesis of the condition, and “else” should be surrounded by braces on the same line:

if (condition) {
  one or more lines
} else {
  one or more lines
}
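A concrete instance of this template might look as follows; the function classifyReturn is a hypothetical example. Note that keeping “else” on the same line as the closing brace is more than a matter of taste: at the top level of a script, R treats the if statement as complete at the closing brace, so an “else” on a line of its own is a syntax error there.

```r
classifyReturn <- function(dailyReturn) {
  ## Labels a single daily return as "gain" or "loss" (zero counts as gain).
  if (dailyReturn >= 0) {
    label <- "gain"
  } else {
    label <- "loss"
  }
  return(label)
}

positiveLabel <- classifyReturn(0.4)
negativeLabel <- classifyReturn(-1.2)
```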

MATLAB: the “switch” variable should usually be a string. Furthermore, any switch block should include an “otherwise” branch that captures all remaining cases:

switch switch_expression
   case case_expression
      statements
   case case_expression
      statements
    :
   otherwise
      statements
end

Also, end lines in MATLAB can carry a comment that names the block they close, which helps in long or nested blocks.
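R offers a switch() function as well, and both recommendations carry over to it. The following is a minimal sketch with a hypothetical function summarizeReturns: the switch variable is a string, and the final unnamed argument acts as the default branch, the counterpart of MATLAB’s otherwise.

```r
summarizeReturns <- function(x, method = "mean") {
  ## Summarizes a numeric vector by the statistic named in method.
  switch(method,
         mean = mean(x),
         median = median(x),
         ## default branch, the counterpart of MATLAB's otherwise
         stop("Unknown method: ", method))
}

meanReturn <- summarizeReturns(c(1, 2, 6), "mean")
medianReturn <- summarizeReturns(c(1, 2, 6), "median")
```

Without the default branch, switch() would silently return NULL for an unknown string, which is easy to overlook.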

5 White space

R: Google’s style guide advises placing spaces around all binary operators (=, +, -, <-). The only exception is that spaces around “=” are optional when passing arguments in a function call. A comma should always be followed by a space and never preceded by one.

MATLAB: The assignment operator “=” and the logical operators “&” and “|” should always be surrounded by spaces. Other conventional operators can be surrounded by spaces, and commas can be followed by a space. In my opinion, this usually provides additional clarity, so I tend to use spaces most of the time. For commas in particular, I follow the same rule that the R style guidelines recommend.
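The effect of these spacing rules is easiest to judge side by side. The following R sketch uses made-up data; all variants compute the same values and differ only in readability.

```r
dailyReturns <- c(0.5, -0.2, 0.1)

## recommended: spaces around binary operators, space after commas
scaledReturns <- 100 * (dailyReturns - mean(dailyReturns))
returnMatrix <- matrix(dailyReturns, nrow = 1)

## also accepted: no spaces around "=" inside a function call
returnMatrixCompact <- matrix(dailyReturns, nrow=1)

## discouraged: same result, but harder to read
scaledReturnsDense<-100*(dailyReturns-mean(dailyReturns))
```

In R, spacing can even change the meaning: x<-3 assigns the value 3 to x, whereas x < -3 compares x with -3.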

6 Errors

Errors should be raised with stop() in R.
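As a minimal sketch of this recommendation, the following hypothetical function computeLogReturns validates its input and raises errors with stop(); the caller can then handle the error with tryCatch(). The function name, the checks and the messages are assumptions made for illustration.

```r
computeLogReturns <- function(prices) {
  ## Computes logarithmic returns from a vector of prices.
  if (!is.numeric(prices) || length(prices) < 2) {
    stop("prices must be a numeric vector with at least two elements")
  }
  if (any(is.na(prices)) || any(prices <= 0)) {
    stop("prices must be positive and must not contain missing values")
  }
  diff(log(prices))
}

## the caller can catch the error and inspect its message
result <- tryCatch(
  computeLogReturns(c(100, -5)),
  error = function(e) conditionMessage(e)
)
```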

7 Summary

As mentioned at the beginning, this list is far from exhaustive and mainly captures the conventions that were new to me. The original sources contain many additional conventions that I find very helpful and recommendable. To see which points I consider most important in MATLAB, you can also take a look at my annotated version of Richard Johnson’s MATLAB Programming Style Guidelines.


Posted on 2013/08/16, in R, science. Comments:

  1. Welcome to WordPress :-). I look forward to reading some more remarks in the section ERRORS. In my opinion, some “computational” statisticians and most computer scientists often spend more time DEBUGGING a program than actually writing it.

    • Well, speaking for the statistics community, I am quite sure that we could improve efficiency in programming a lot if we just focused a little bit more on learning fundamental programming skills upfront (writing clean code, avoiding code duplication, testing code,…). However, I am afraid that the stochastic nature of many applications in statistics will set an upper bound on these gains, as strategies like unit testing lose a lot of their power when randomness is involved.

      Also, right now, I am not only worried about the large amount of time spent on debugging. It is rather the bugs that have NOT been found that I am worried about…

