Coping with missing stock price data

1 Missing stock price data

When downloading historical stock price data, it happens quite frequently that some observations in the middle of the sample are missing. Hence the question: how should we cope with this? There are several ways to process the data, each with its own advantages and disadvantages, and in this post we want to compare some of the most common approaches.

In any case, however, we want to treat missing values as NA and not as Julia’s built-in NaN (a short justification of why NA is more suitable can be found here). Hence, the data is best handled through DataFrames or – if it comes with time information – through type Timenum from the TimeData package. In the following, we will use these packages to show some common approaches to dealing with missing stock price data, using an artificially made-up data set that represents logarithmic prices.

The reason why we are looking at logarithmic prices and returns instead of normal prices and net returns is simply that logarithmic returns are defined as the difference between the logarithmic prices of successive days:

\displaystyle r_{t}^{\log}=\log(P_{t}) - \log(P_{t-1})

This way, our calculations simply involve nicer numbers, and all results equally hold for normal prices and returns as well.
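
As a quick sanity check that differencing log prices really gives log returns, here is a tiny snippet in plain Julia (no packages required; written in the vectorized 0.3-era syntax used throughout this post, with a made-up price path):

P = [100.0, 105.0, 110.0]                     # made-up price path
logRets   = diff(log(P))                      # log(P_t) - log(P_{t-1})
grossRets = log(P[2:end] ./ P[1:(end-1)])     # log(P_t / P_{t-1})
maximum(abs(logRets - grossRets)) < 1e-12     # true, up to rounding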

We will use the following example data set of logarithmic prices to compare the different approaches:

using TimeData
using Dates
using Econometrics

df = DataFrame()
df[:stock1] = @data([100, 120, 140, 170, 200])
df[:stock2] = @data([110, 120, NA, 130, 150])

dats = [Date(2010, 1, 1):Date(2010, 1, 5)]

originalPrices = Timenum(df, dats)
idx stock1 stock2
2010-01-01 100 110
2010-01-02 120 120
2010-01-03 140 NA
2010-01-04 170 130
2010-01-05 200 150

One possible explanation for such a pattern in the data could be that the two stocks are from different countries, and only the country of the second stock has a holiday on January 3rd.

Quite often in such a situation, people simply stick to the basic calculation formula for logarithmic returns and compute the associated returns as plain differences. This way, however, each missing observation NA leads to two NAs in the return series:

simpleDiffRets = originalPrices[2:end, :] .- originalPrices[1:(end-1), :]
idx stock1 stock2
2010-01-02 20 10
2010-01-03 20 NA
2010-01-04 30 NA
2010-01-05 30 20

For example, this is also the approach followed by the PerformanceAnalytics package in R:

library(tseries)
library(PerformanceAnalytics)

stockPrices1 <- c(100, 120, 140, 170, 200)
stockPrices2 <- c(110, 120, NA, 130, 150)

## combine in matrix and name columns and rows
stockPrices <- cbind(stockPrices1, stockPrices2)
dates <- seq(as.Date('2010-1-1'),by='days',length=5)
colnames(stockPrices) <- c("A", "B")
rownames(stockPrices) <- as.character(dates)
(stockPrices)

returns = Return.calculate(exp(stockPrices), method="compound")
            A  B
2010-01-01 NA NA
2010-01-02 20 10
2010-01-03 20 NA
2010-01-04 30 NA
2010-01-05 30 20

When we calculate returns as the difference between successive closing prices P_{t} and P_{t-1}, a single return simply represents all price movements that happened on day t, including the opening auction that determines the very first price of that day.

Thinking about returns this way, it obviously makes sense to assign a value of NA to each day in the return series on which the stock exchange was closed due to a holiday, since there simply are no stock price movements on that day. But why would we set the next day’s return to NA as well?

In other words, we should distinguish between two different cases of NA values for our prices:

  1. NA occurs because the stock exchange was closed that day, and hence there never were any price movements.
  2. The stock exchange was open that day, and in reality there were some price changes; however, due to a deficiency in our data set, we do not know the price for that day.

For the second case, we really would like to have two consecutive NA values in our return series. Knowing only the prices at t and t+2, there is no way we could reasonably deduce the value that the price took on at t+1: there are infinitely many ways to allocate a given two-day price increase to the two individual returns.

For the first case, however, it seems unnecessarily rigid to force the return series to have two NA values: allocating the full two-day price increase to the day on which the stock exchange was open, and a missing value NA to the day on which it was closed, does not seem too arbitrary.
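
To make the idea concrete, here is a hand-rolled sketch of this single-NA convention for a single column of log prices. The helper is made up purely for illustration and is not the actual package code shown below:

using DataFrames

## illustrative helper: give NA to days without an observed price, and assign
## the full move since the last observed price to the next observed day
function singleNAreturns(prices::DataArray)
    rets = DataArray(Float64, length(prices) - 1)    # starts out as all NA
    lastObs = prices[1]
    for t in 2:length(prices)
        if isna(prices[t])
            rets[t-1] = NA
        else
            rets[t-1] = prices[t] - lastObs          # full move since last observation
            lastObs = prices[t]
        end
    end
    return rets
end

singleNAreturns(@data([110, 120, NA, 130, 150]))     # 10, NA, 10, 20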

This is how returns are calculated by default in the (not yet registered) Econometrics package.

logRet = price2ret(originalPrices, log = true)
idx stock1 stock2
2010-01-02 20 10
2010-01-03 20 NA
2010-01-04 30 10
2010-01-05 30 20

And, the other way round, aggregating the return series back into prices will also keep NAs for the respective days, but otherwise perform the desired aggregation. Without specified initial prices, the aggregated logarithmic prices all start at a value of 0, and hence express something like normalized prices that allow a nice comparison of different stock price evolutions.

normedPrices = ret2price(logRet, log = true)
idx stock1 stock2
2010-01-01 0 0
2010-01-02 20 10
2010-01-03 40 NA
2010-01-04 70 20
2010-01-05 100 40

To regain the complete price series (together with the correct starting date), one simply needs to specify the original starting prices in addition.

truePrices = ret2price(logRet, originalPrices, log = true)
idx stock1 stock2
2010-01-01 100 110
2010-01-02 120 120
2010-01-03 140 NA
2010-01-04 170 130
2010-01-05 200 150

In some cases, however, missing NA values may not be allowed – for example, in a likelihood function that requires real values only, or in some plotting function. For these cases, NAs can easily be removed through imputation. For log price plots, a meaningful way would be:

impute!(truePrices, "last")
idx stock1 stock2
2010-01-01 100 110
2010-01-02 120 120
2010-01-03 140 120
2010-01-04 170 130
2010-01-05 200 150

However, for log returns, the associated transformation would then artificially introduce values of 0:

impute!(logRet, "zero")
idx stock1 stock2
2010-01-02 20 10
2010-01-03 20 0
2010-01-04 30 10
2010-01-05 30 20

As an alternative to replacing NA values, one could also simply remove the respective dates from the data set. Again, there are two ways this could be done.

First, one could remove any missing observations directly in the price series:

originalPrices2 = deepcopy(originalPrices)
noNAprices = convert(TimeData.Timematr, originalPrices2, true)
idx stock1 stock2
2010-01-01 100 110
2010-01-02 120 120
2010-01-04 170 130
2010-01-05 200 150

For the return series, however, we then get a – possibly large – multi-day price change that appears to be a single-day return. In our example, we suddenly observe a return of 50 for the first stock.

logRet = price2ret(noNAprices, log = true)
idx stock1 stock2
2010-01-02 20 10
2010-01-04 50 10
2010-01-05 30 20

A second way to eliminate NAs would be to remove them from the return series.

logRet = price2ret(originalPrices, log = true)
noNAlogRet = convert(TimeData.Timematr, logRet, true)
idx stock1 stock2
2010-01-02 20 10
2010-01-04 30 10
2010-01-05 30 20

However, deriving the associated price series from this processed return series will then lead to deviating end prices:

noNAprices = ret2price(noNAlogRet, originalPrices, log = true)
idx stock1 stock2
2010-01-01 100 110
2010-01-02 120 120
2010-01-04 150 130
2010-01-05 180 150

as opposed to the real end prices

originalPrices
idx stock1 stock2
2010-01-01 100 110
2010-01-02 120 120
2010-01-03 140 NA
2010-01-04 170 130
2010-01-05 200 150

The first stock now suddenly ends with a price of only 180 instead of 200.

2 Summary

The first step when facing a missing price observation is to consider whether it makes sense to treat only one return as missing, assigning the complete price movement to the other return. This is perfectly reasonable for days on which the stock market really was closed. In all other cases, however, it might make more sense to calculate logarithmic returns as simple differences, leading to two NAs in the return series.

Once NA values are present, we can choose among three options.

2.1 Keeping NAs

Keeping NA values might be cumbersome in some situations, since some functions only work on data without NA values.

2.2 Replacing NAs

In cases where NAs may not be present, there sometimes exist ways of replacing them that make perfect sense. However, manually replacing observations means tampering with the original data, and one should be careful not to introduce any artificial patterns this way.

2.3 Removing NAs

Obviously, when dates with NA values for only some variables are eliminated completely, we simultaneously lose data for those variables whose observations originally were present. Furthermore, eliminating observations with NA returns leads to price evolutions that differ from the original prices.

2.4 Overview

Possible prices:

idx          simple diffs   single NA   replace w/ 0   rm NA price   rm NA return
2010-01-01   100, 110       100, 110    100, 110       100, 110      100, 110
2010-01-02   120, 120       120, 120    120, 120       120, 120      120, 120
2010-01-03   140, NA        140, NA     140, 120       -             -
2010-01-04   170, 130       170, 130    170, 130       170, 130      150, 130
2010-01-05   200, 150       200, 150    200, 150       200, 150      180, 150

Possible returns:

idx          simple diffs   single NA   replace w/ 0   rm NA price   rm NA return
2010-01-02   20, 10         20, 10      20, 10         20, 10        20, 10
2010-01-03   20, NA         20, NA      20, 0          -             -
2010-01-04   30, NA         30, 10      30, 10         50, 10        30, 10
2010-01-05   30, 20         30, 20      30, 20         30, 20        30, 20

Element-wise mathematical operators and iterator slides

I recently engaged in a quite elaborate discussion on the julia-stats mailing list about mathematical operators for DataFrames in Julia. Although I still do not agree with all of the arguments that were stated (at least not yet), I once again got a very comforting feeling about the lively and engaged Julia community. Even one of the most active and busiest community members, John Myles White, took the time to explain his point of view in the discussion in detail – and to me, that just might be the greater good. Different opinions will always be part of any community. But it is the transparency of the discussions that tells you how strong a community is.

Still, mathematical operators are important to me, as I quite frequently work with strictly numeric data: no strings, and no columns of categorical IDs. Given Julia’s expressive language, it would be quite easy to implement any desired mathematical operators for DataFrames on my own. However, I decided to follow what seems to be the consensus of the DataFrames developers, and hence refrain from any individual deviations in this direction. Instead, I decided to simply reduce element-wise operators on multi-column DataFrames to DataArray arithmetic, which allows most mathematical operators for individual columns. Viewed from this perspective, element-wise DataFrame operators are nothing other than operators that are successively applied to the individual columns of a DataFrame, which are DataArrays.
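
As a rough illustration of this column-wise view, here is a small sketch of a hypothetical helper (written for the Julia 0.3 / DataFrames era used on this blog, and not the actual DataFrames or TimeData code) that applies a binary operator column by column, leaving the NA handling to DataArray arithmetic:

using DataFrames

## hypothetical helper: apply a binary operator column by column to two
## DataFrames with identical column names; DataArray arithmetic takes care
## of NA values within each column
function elementwise(op::Function, df1::DataFrame, df2::DataFrame)
    names(df1) == names(df2) || error("column names must coincide")
    res = DataFrame()
    for col in names(df1)
        res[col] = op(df1[col], df2[col])
    end
    return res
end

## for example: element-wise addition of two numeric DataFrames
## elementwise(+, dfA, dfB)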

As a consequence, I had to deepen my understanding of iterators, comprehensions and functions like vcat, map and reduce. For future reference, I summed up my insights in a slide deck, which anybody who is interested can find here, or as part of my IJulia notebook collection here.

For those of you who are using the TimeData package, the current road-map regarding mathematical operators is the following: any type that is constrained to numeric values only (including the extension to NA values) will keep providing mathematical operators. These operators perform some minimal checks upfront, in order to minimize the risk of meaningless applications (for example, only adding up columns with equal names and equal dates, …). Furthermore, for any type that allows values other than numeric data, these mathematical operators will not be defined. Hence, anybody in need of element-wise arithmetic for numeric data can easily make use of the Timematr or Timenum types (even if you do not need any time index). If you do, however, make sure not to mix up real numeric data and categorical data: applying mathematical operators or statistical functions like mean to something like customer IDs will most likely lead to meaningless results.

Julia syntax features

In one of my last posts I already tried to point out some advantages of Julia. Two of the main arguments are quite easily made: Julia is comparatively fast, and it is free and open source. In addition, however, Julia also has a very powerful and expressive syntax compared to other programming languages, although this advantage is maybe less obvious. Hence, I recently gave a short talk where I tried to expand a little on this point, while simultaneously showing some of the convenient publishing features of the IJulia backend. I thought I’d share the outcome, just in case anyone else can use the slides to convince people of Julia’s powerful syntax. In addition to the slides, you can also access the presentation rendered as an IJulia notebook here.

Prediction model for the FIFA World Cup 2014

Like a last minute goal, so to speak, Andreas Groll and Gunther Schauberger of Ludwig-Maximilians-University Munich announced their predictions for the FIFA World Cup 2014 in Brazil – just hours before the opening game.

Andreas Groll – already experienced in this field with his successful prediction of the European Championship 2012 – and Gunther Schauberger set out to predict the 2014 World Cup champion based on statistical modeling techniques and R.

A bit surprisingly, Germany is estimated to have the highest probability of winning the trophy (28.80%), exceeding the probability of Brazil (the favorite according to most bookmakers) only marginally (27.65%). You can find all estimated probabilities compared to the respective odds from a German bookmaker in the graphic on their homepage (http://www.statistik.lmu.de/~schauberger/research.html), together with the most likely World Cup evolution simulated from their model. The evolution also shows the neck-and-neck race between Germany and Brazil: they are predicted to meet in the semi-finals, where Germany’s probability of winning the game is a hair’s breadth above 50%. Although a detailed technical report on the results does not exist yet, you can still get some insight into the model as well as the data used through a preliminary summary pdf on their homepage (http://www.statistik.lmu.de/~schauberger/WMGrollSchauberger.pdf).

(Figures on their homepage: estimated winning probabilities compared with bookmakers’ odds, and the simulated tournament tree.)

Last week, I had the chance to attend a presentation of their preliminary results at the research seminar of the Department of Statistics (a home game for both), where they presented an already solid first predictive model based on the glmmLasso R package. They kept refining the model up to the last minute, however, and it has now received its final touch, as they have published their predictions on their homepage.

As they pointed out, statistical prediction of the World Cup champion builds on two separate components. First, you need to reveal the individual team strengths – “who is best?”, so to speak. Afterwards, you need to simulate the evolution of the championship, given the actual World Cup group draw. This accounts for the fact that even quite capable teams might still miss the knockout stage if they were drawn into a group of strong competitors.

Revealing the team strengths turns out to be the hard part of the problem, as there exists no simple linear ranking of teams from best to worst. A team that wins more games on average could still have problems with a less successful team, simply because it fails to adjust to the opponent’s style of play. In other words: tough tackling and fouls could be the skillful players’ undoing.

Hence, Andreas Groll and Gunther Schauberger chose a quite complex approach: they determine the odds of a game through the number of goals that each team is going to score. Again, the likelihood of scoring more goals than the opponent depends on much more than just a single measure of team strength. First, the number of goals a team scores depends on both teams’ capabilities: its own, as well as that of its opponent. As a mediocre team, you score more goals against underdogs than against title aspirants. And second, a team’s strength might be unevenly distributed across its different parts: its defense might be more competitive than its offense, or the other way round. As an example, although Switzerland’s overall strength is not within reach of the most capable teams, their defense during the last World Cup was still so insurmountable that they did not concede a single goal (penalty shoot-out excluded).
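
To get a feeling for this kind of setup, here is a toy Monte Carlo sketch in Julia of the general attack/defense idea with Poisson goal counts. The strength numbers are completely made up and the sketch is in no way the actual Groll/Schauberger model – it merely illustrates how expected goals could depend on one team’s attack and the other team’s defense:

using Distributions

## made-up attack/defense strengths on a log scale (illustrative only)
attackA, defenseA = 0.40, 0.30
attackB, defenseB = 0.35, 0.25
baseline = 0.10                  # log of average goals per team and game

## expected goals depend on own attack and the opponent's defense
lambdaA = exp(baseline + attackA - defenseB)
lambdaB = exp(baseline + attackB - defenseA)

## simulate many games to estimate the probability that team A wins in regular time
nSim = 100000
winsA = 0
for i in 1:nSim
    goalsA = rand(Poisson(lambdaA))
    goalsB = rand(Poisson(lambdaB))
    winsA += goalsA > goalsB ? 1 : 0
end
probAwins = winsA / nSim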

The first preliminary model shown in the research seminar already seemed to do a great job of revealing overall team strength. However, subtleties such as the differentiation between offense and defense were not included yet. The final version, in contrast, now allows such a distinction. Furthermore, the previous random effects model built its prediction mainly on the data of past results itself, referring to explanatory covariates only to a minor extent. Although this in no way indicates any prediction inaccuracies, one would still prefer models with a more interpretable structure: not only knowing WHICH teams are best, but also WHY. Hence, instead of directly estimating team strengths from past results, it is much nicer to have them estimated as the sum of two components: the strength predicted by covariates like FIFA rank, odds, etc., plus a small deviation found by the model from past results itself. As a side effect, the model should also become more robust against structural breaks this way: a team with very poor past performance could still be classified as good if indicators of current team strength (like the number of Champions League players or the current odds) point to higher strength.

When building on explanatory variables, however, the real challenge is the efficient identification of variables with true explanatory power out of a large set of candidates. Hence, instead of throwing in all variables at once, their regularization approach allows the model to be extended gradually by incorporating the variable with the best explanatory power among all variables not yet included. This variable selection seems to me to be the big selling point of their statistical model, and with both Andreas Groll and Gunther Schauberger already having prior publications in the field, they most likely know what they are doing.

From what I have heard, I think we can expect a technical report with a more detailed analysis within the next weeks. I am already quite excited to learn how large the estimated distinction between offense and defense actually turns out to be in their model. Hopefully, we will get these results at a still early stage of the running World Cup. The problem, however, is that some explanatory variables in their model could only be determined completely once all the teams’ actual squads were known, and hence they could start their analysis only very shortly before the beginning of the World Cup. Although this obviously caused some delay for their analysis, it made sure that even possible changes of team strength due to injuries could be taken into account. I am quite sure, however, that they will catch up on the delay during the next few days, as I think they are quite big football fans themselves, and hence most likely as curious about the detailed results as we are…

spotted elsewhere: SlideRule

Having been a big fan of Massive Open Online Courses (MOOCs) and Coursera for quite some time, I just today stumbled upon another internet platform that promises to bring video education to you: SlideRule. It searches several online course providers and “helps you discover the world’s best online courses in every subject”. In addition, there is also iversity, which is not yet searched by SlideRule. Have fun studying!

spotted elsewhere: academic networking on LinkedIn

Although I do not have a LinkedIn account myself yet, I’d like to share the following blog post on How to become an academic networking pro on LinkedIn. In light of this post, LinkedIn really seems to have some potential for letting people get in touch with other researchers.

Julia language: A letter of recommendation

After spending quite some time using Julia (a programming language for technical computing) over the last few months, I am by now confident enough to provide a kind of “letter of recommendation”. Hence, I decided to list some of the features that make Julia appealing to me, while also interspersing some resources on Julia that I found helpful and worth sharing.

Read the rest of this entry

spotted elsewhere: best practices for scientific computing

Nowadays, a lot of everyday research time is spent in front of computers. Especially in data analysis, of course, computers are an elementary part of science. Nevertheless, most researchers still seem never to have received any real training in computer science, and tend to just develop their own ways of getting the job done.

Greg Wilson, together with the other members of the software training group Software Carpentry, devotes his time to bringing best practices from the computer science community into other fields of science. I highly recommend his newly published paper Best Practices for Scientific Computing, in which he lists a number of recommendations for an improved workflow in scientific computing. Also, make sure to check out the Software Carpentry homepage, which provides a number of short video tutorials on topics that are fundamental to any data analysis.

Inheriting type behavior in Julia

In object-oriented programming languages, classes can inherit from classes at a higher level of the class hierarchy. This way, methods of the superclass apply to the subclass as well, provided they are not explicitly re-defined for the subclass. In many regards, super- and subclasses hence behave similarly, allowing the same methods to be applied and offering similar access behavior. Taken together, they build a coherent user interface.

In Julia, such a coherent interface across multiple types requires a little extra work, since Julia does not allow subtyping of composite types. Nevertheless, Julia’s flexibility generally allows composite types to be constructed such that they emulate the behavior of some already existing type. This only requires a little extra coding, and can be implemented efficiently through metaprogramming.
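
To give a rough idea of what such emulation could look like, here is a minimal, hypothetical sketch in the Julia 0.3 syntax used elsewhere on this blog (not the actual TimeData code): wrap an existing type in a field and forward a list of methods to that field via metaprogramming.

import Base: length, sum, minimum, maximum

## toy wrapper type around a plain vector
type MyWrapper
    vals::Vector{Float64}
end

## forward selected Base functions to the wrapped field
for f in (:length, :sum, :minimum, :maximum)
    @eval $f(w::MyWrapper) = $f(w.vals)
end

w = MyWrapper([1.0, 2.0, 3.0])
length(w), sum(w)    # behaves like the wrapped vector for these functions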

Read the rest of this entry

spotted elsewhere: The Setup

In case you sometimes wonder what might be the best tool for a job, just take a look at usesthis.com, and see what other people use to get stuff done. For example, the well-known R programmer and developer of ggplot2, Hadley Wickham, shares a list of his favorite tools.
