# Blog Archives

## Downloading SP500 stock price data with Julia

**[Update]**

In order to provide a more convenient and stable way of getting data, I meanwhile started the Julia package EconDatasets.jl. It basically bundles many download routines for different datasets in a common interface, and is more easy to keep maintain through automated testing on travis. With this package, you also can download the same data with just two lines of code:

Pkg.add("EconDatasets") using EconDatasets

# download data into package subdirectory data getDataset("SP500")

# load data into session workspace data = dataset("SP500")

**Original post**

In a previous post I already described how to download stock price data for all constituents of the SP500 with `R`

. Meanwhile, however, I shifted all of my coding to the new and very promising technical computing language julia. I obviously wanted to also enjoy the convenience of downloading stock price data directly in julia. Hence, in this post I will describe how one could use julia to download stock price data from **Yahoo!Finance** in general, and – as an advanced use case – how to download adjusted closing prices for all SP500 constituents. We also will make use of julia’s parallel computing capabilities, in order to tremendously speed up the downloading and processing steps compared to my original code in `R`

.

Before we start, we first need to make sure that we have all relevant packages installed. As a format for storing time series data I always rely on the TimeData package from the official package repository:

Pkg.add("TimeData")

The functions required for data download are bundled in the not yet registered EconDatasets package, which we can clone from github:

Pkg.clone("https://github.com/JuliaFinMetriX/EconDatasets.jl.git")

## 1 Download data using readYahooFinance

Now that all required packages are installed, we will first demonstrate a simple use case for the low-level function `readYahooFinance`

. This functions allows easy access to the complete stock data provided by **Yahoo!Finance**, comprising fields open, high, low, close, volume and adjusted close. As an example, we now download data for NASDAQ-100, S&P 500 and EURO STOXX 50 Index.

using TimeData using Dates using EconDatasets tickerSymbs = ["^NDX" "^GSPC" "^STOXX50E"] dates = Date(1960,1,1):Date(2014,7,20) indexData = [readYahooFinance(dates, symb) for symb in tickerSymbs] ## display first five dates of NASDAQ indexData[1][1:5, :]

idx | Open | High | Low | Close | Volume | Adj_Close |

1985-10-01 | 221.24 | 224.32 | 221.13 | 224.28 | 153160000 | 112.14 |

1985-10-02 | 224.28 | 225.08 | 221.56 | 221.65 | 164640000 | 110.82 |

1985-10-03 | 221.68 | 222.37 | 220.24 | 221.74 | 147300000 | 110.87 |

1985-10-04 | 221.74 | 221.74 | 219.71 | 220.15 | 147900000 | 110.07 |

1985-10-07 | 220.15 | 220.27 | 216.35 | 216.4 | 128640000 | 108.2 |

Using comprehension, the function `readYahooFinance`

can be applied to all ticker symbols in an `Array`

. The output will be an `Array`

of type `Any`

, with individual entries being of type `Timematr`

.

When we focus on one variable for each stock only, we can store the data more concisely in a single `TimeData`

object. Therefore, we join individual stocks at the their `idx`

entries. We do not want to lose any data at this step, so that we will use an outer join in order to get a row for each date that occurs for at least one of the individual stocks. Missing values will be replaced by `NA`

, so that we now get an object of type `Timenum`

, as `Timematr`

objects are not allowed to contain `NAs`

.

adjCloseData = indexData[1][:Adj_Close] names!(adjCloseData.vals, [symbol(tickerSymbs[1])]) for ii=2:3 nextStock = indexData[ii] |> x -> x[:Adj_Close] names!(nextStock.vals, [symbol(tickerSymbs[ii])]) adjCloseData = joinSortedIdx_outer(adjCloseData, nextStock) end adjCloseData[[1:10; (end-10):end], :]

idx | ^NDX | ^GSPC | ^STOXX50E |

1950-01-03 | NA | 16.66 | NA |

1950-01-04 | NA | 16.85 | NA |

1950-01-05 | NA | 16.93 | NA |

1950-01-06 | NA | 16.98 | NA |

1950-01-09 | NA | 17.08 | NA |

1950-01-10 | NA | 17.03 | NA |

1950-01-11 | NA | 17.09 | NA |

1950-01-12 | NA | 16.76 | NA |

1950-01-13 | NA | 16.67 | NA |

1950-01-16 | NA | 16.72 | NA |

2014-07-04 | NA | NA | 3270.47 |

2014-07-07 | 3910.71 | 1977.65 | 3230.92 |

2014-07-08 | 3864.07 | 1963.71 | 3184.38 |

2014-07-09 | 3892.91 | 1972.83 | 3203.1 |

2014-07-10 | 3880.04 | 1964.68 | 3150.59 |

2014-07-11 | 3904.58 | 1967.57 | 3157.05 |

2014-07-14 | 3929.46 | 1977.1 | 3185.86 |

2014-07-15 | 3914.46 | 1973.28 | 3153.75 |

2014-07-16 | 3932.33 | 1981.57 | 3202.94 |

2014-07-17 | 3878.01 | 1958.12 | 3157.82 |

2014-07-18 | 3939.89 | 1978.22 | 3164.21 |

## 2 Download adjusted closing prices of SP500 constituents

Now that we already have seen a first use case of function `readYahooFinance`

, we now want to try the capabilities of julia and the `EconDatasets`

package with a more challenging task. Hence, we want to download adjusted stock prices for all constituents of the SP500 in a fully automated way. Therefore, we first need to get a list of the ticker symbols of all constituents, which we can get from the S&P homepage. However, this list is stored as an Excel sheet with .xls extension, and we need to read in this binary file with package `Taro`

.

To make `Taro`

work, you first need to make sure that it is able to find Java on your system. If your path deviates from the default settings, just make sure to set the respective `JAVA_LIB`

environment variable in your .bashrc file. In my case, the variable is set as follows:

# 64-bit machine # JAVA_LIB="/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/" # 32-bit machine JAVA_LIB="/usr/lib/jvm/java-7-openjdk-i386/jre/lib/i386/server/" export JAVA_LIB

We can now install and load package `Taro`

:

Pkg.add("Taro")

using Taro Taro.init()

Found libjvm @ /usr/lib/jvm/java-7-openjdk-i386/jre/lib/i386/server/

If something with your `Taro`

configuration is not correct, you will get an error at this step. In this case, you could simply download and export the Excel sheet to .csv manually, which you then can read in with function `readtable`

from the `DataFrames`

package.

Otherwise, you can use `Taro`

to download and read in the respective part of the Excel sheet:

url = "http://us.spindices.com/idsexport/file.xls?hostIdentifier=48190c8c-42c4-46af-8d1a-0cd5db894797&selectedModule=Constituents&selectedSubModule=ConstituentsFullList&indexId=340" filepath = download(url) constituents = Taro.readxl(filepath, "Constituents", "A10:B511") head(constituents)

Constituent | Symbol |

3M Co | MMM |

Abbott Laboratories | ABT |

AbbVie Inc. | ABBV |

Accenture plc | ACN |

ACE Limited | ACE |

Actavis plc | ACT |

We now should have name and ticker symbol of each SP500 constituent stored as a `DataFrame`

. In my case, however, there even is one ticker symbol too much, although I do not know why:

(nTicker, nVars) = size(constituents)

501 |

2 |

An inconsistency that I will not further invest at this point. In addition, however, some of the ticker symbols are automatically read in as boolean values, and we will have to convert them to strings first. Let’s display all constituents with boolean values:

isBoolTicker = [isa(tickerSymbol, Bool) for tickerSymbol in constituents[:Symbol]] constituents[find(isBoolTicker), :]

Constituent | Symbol |

AT&T Inc | true |

Ford Motor Co | false |

The reason for this is that the respective ticker symbols are “T” and “F”, which will be interpreted as boolean values. Once we did correct for this mistake, we transform the array of ticker symbols into an `Array`

of type `ASCIIString`

.

indTrue = find(constituents[2] .== true) indFalse = find(constituents[2] .== false) constituents[indTrue, 2] = "T" constituents[indFalse, 2] = "F" tickerSymb = ASCIIString[constituents[:Symbol]...] tickerSymb[1:5]

MMM |

ABT |

ABBV |

ACN |

ACE |

Now that we already have a list of all ticker symbols, in principle we could apply the same procedure as before: download each stock, extract the adjusted closing prices, and join all individual price series. However, as we have 500 stocks, this procedure would already take approximately 15 minutes if each individual stock took only 2 seconds. Hence, we strive for a much faster result using julia’s parallel computing capabilities, and this is already implemented as function `readYahooAdjClose`

.

Under the hood, `readYahooAdjClose`

uses a map-reduce structure. As the map step, for any given ticker symbol we download the data, extract the adjusted closing prices and rename the column to its ticker symbol. As reduce step we need to specify some operation that combines the individual results of the map step – in our case, this is function `joinSortedIdx_outer`

.

Let’s now set the stage for parallel computation, add three additional processes and load the required packages on each process.

addprocs(3) @everywhere using Dates @everywhere using DataFrames @everywhere using TimeData @everywhere using EconDatasets

3-element Array{Any,1}: 2 3 4

To run the parallelized code, simply call function `readYahooAdjClose`

:

dates = Date(1960,1,1):Date(2014,7,20) ## measure time t0 = time() @time vals = readYahooAdjClose(dates, tickerSymb, :d) t1 = time() elapsedTime = t1-t0 mins, secs = divrem(elapsedTime, 60)

Downloading of all 500 stocks did only take:

println("elapsed time: ", int(mins), " minutes, ", ceil(secs), " seconds")

elapsed time: 3 minutes, 45.0 seconds

Now we convert the data of type `Timedata`

to type `Timenum`

and store the result in the `EconDatasets`

data directory:

valsTn = convert(Timenum, vals) pathToStore = joinpath(Pkg.dir("EconDatasets"), "data", "SP500.csv") writeTimedata(pathToStore, valsTn)

## 3 Visualize missing values

From previous experiences I already know that the saying “you get what you pay for” also holds for the free Yahoo!Finance database: the data comes with a lot of missing values. In order to get a feeling for the data quality, we want to visualize all missing values. Therefore, we sort all assets with respect to the number of missing values.

(nObs, nStocks) = size(valsTn) NAorNot = Array(Int, nObs, nStocks) for ii=1:nStocks NAorNot[:, ii] = isna(vals.vals[ii])*1 end nNAs = sum(NAorNot, 1) p = sortperm(nNAs[:]) tickSorted = tickerSymb[p] NAorNotSorted = NAorNot[:, p]

For the visualization, we rely on plot.ly, as we get an interactive graphic this way. This allows the identification of individual stocks in the graphic by simple mouse clicks. If you want to replicate the figure, you will need to sign in with an own and free plot.ly account.

Pkg.clone("https://github.com/plotly/Plotly.jl") using Plotly

## sign in with your account ## Plotly.signin("username", "authentication") nToPlot = nStocks datsStrings = ASCIIString[string(dat, " 00:00:00") for dat in valsTn.idx] data = [ [ "z" => NAorNotSorted[:, 1:nToPlot], "y" => tickSorted[1:nToPlot], "x" => datsStrings, "type" => "heatmap" ] ] response = Plotly.plot([data], ["filename" => "Published graphics/SP500 missing values", "fileopt" => "overwrite"]) plot_url = response["url"]

Although the resulting graphic easily could be directly embedded into this html file, you will need to follow this link to watch it, since the plot.ly graphic is a large data file and hence takes quite some time to load. Nevertheless, I also exported a .png version of the graphic, which you can find below. Thereby, red dots are representing missing values.

## 4 Get logarithmic returns

So now we already have our closing price data at our hands. However, in financial econometrics we usually analyze return data, as prices are non-stationary almost always. Hence, we now want to derive returns from our price series.

For this step I rely on the also not yet registered Econometrics package:

Pkg.clone("https://github.com/JuliaFinMetriX/Econometrics.jl.git") using Econometrics

The reason for this is that I can make use of function `price2ret`

then, which implements a slightly more sophisticated approach to return calculation. This can best be described through an example, which we can set up using function `testcase`

from the `TimeData`

package:

logPrices = testcase(Timenum, 4)

idx | prices1 | prices2 |

2010-01-01 | 100 | 110 |

2010-01-02 | 120 | 120 |

2010-01-03 | 140 | NA |

2010-01-04 | 170 | 130 |

2010-01-05 | 200 | 150 |

As you can see, the example logarithmic prices have a single missing observation at January 3rd. Straightforward application of the logarithmic return calculation formula hence will result in two missing values in the return series:

logRetSimple = logPrices[2:end, :] .- logPrices[1:(end-1),:]

idx | prices1 | prices2 |

2010-01-02 | 20 | 10 |

2010-01-03 | 20 | NA |

2010-01-04 | 30 | NA |

2010-01-05 | 30 | 20 |

In contrast, function `price2ret`

assumes that single missing values in the middle of a price series are truly non-existent: there is no observation, because the stock exchange was closed and there simply was no trading. For example, this could easily happen if you have stocks from multiple countries with different holidays. Note, that this kind of missingness is different to a case where the stock exchange was open and trading did occur, but we were not able to observe the resulting price (for a more elaborate discussion on this point take a look at my blog post about missing stock price data). Using function `price2ret`

, you ultimately will end up with a stock return series with only one `NA`

for this example:

## get real prices logRetEconometrics = price2ret(logPrices; log = true)

idx | prices1 | prices2 |

2010-01-02 | 20 | 10 |

2010-01-03 | 20 | NA |

2010-01-04 | 30 | 10 |

2010-01-05 | 30 | 20 |

So the final step in our logarithmic return calculation is to apply `price2ret`

to the logarithmic prices, specify the usage of differences for return calculations through `log = true`

, and multiply the results by 100 in order to get percentage returns.

percentLogRet = price2ret(log(valsTn); log = true).*100

Alternatively, you could get discrete returns with:

percentDiscRet = price2ret(valsTn; log = false).*100

## Analyzing volatility patterns for SP500 stocks

## Table of Contents

Now that we already have a cleaned high-dimensional stock return dataset at our hands, we want to use it in order to examine stock market volatility in a high-dimensional context. Thereby, we are searching for patterns in the joint evolution of individual stock volatilities. We have in mind the following goal: to reduce the high-dimensional set of individual volatilities to some low-dimensional common factors.

What we will find is that all of the assets’ volatilities seem to follow some common market trend of volatility. However, the variation explained through this market trend is about 50 percent only. Additionally, we find that, during each financial crisis, the one sector where the crisis originates will show levels of volatility above the market trend. Our sample contains two examples for this: the IT sector during the dot-com crash, and the financial sector during the financial crisis in 2008. However, based on findings of principal component analysis, there still is a large part of variation in volatilities that remains unexplained by common factors. This indicates that we should think about volatility as consisting of two parts: one part arising from common factors, and one part that is specific to the individual asset.

## 1 Motivation

In a portfolio context, individual asset volatility levels are probably still the most important risk factor, together with the components’ dependence structure. However, in large dimensional portfolios, it becomes extremely difficult to keep track of all of the individual volatility levels. Hence, if there was some pattern inherent in the joint evolution of volatilities, this could enable us to simplify risk management approaches through reduction of dimensionality. Whenever multiple stochastic variables show some common patterns, this is caused by dependency. And, given that this dependency is large enough, there is some redundancy in keeping track of all of the variables separately. A better way then would be to relate the components of the high-dimensional problem to some lower dimensional common factors.

So, why should there be any patterns in the joint evolution of individual volatilities? Well, as a first indication for this we can take financial crises, where usually almost every individual stock will exhibit an increased level of volatility. Slightly related to this, we later examine the explanatory power of shifts in mean market volatility. The justification for this is that usually market volatility indices react quite sensitive to financial crises, with index levels high above normal times. And, in turn, volatility indices show large dependency with mean market volatility levels. This way, mean market volatility can be seen as a proxy for financial crises.

When analyzing volatility, we are not dealing with a directly observable quantity. Hence, the first step of the analysis will be to derive some estimates for the individual asset volatility time paths. Of course, any shortcomings that we bring in at this step will directly influence the subsequent analysis. Nevertheless, we will refrain from overly sophisticated approaches for now, since we only want to get a feeling about the data first. Hence, we will derive estimates for individual stock volatilities simply by assuming a GARCH(1,1) process for each individual stock. If necessary, this assumption could easily be relaxed, and volatility paths could be estimated by different volatility models like EGARCH or some multi-dimensional extensions. Or, we could refrain from hard model assumptions, and simply estimate volatility with high-frequency data as realized volatility.

## 2 Estimation of volatility paths with GARCH(1,1)

As already explained, we first need to derive some estimates of the unobservable volatility evolutions over time. Hence, we will now estimate a GARCH(1,1) process with -distributed innovations for each of the individual assets of our S&P500 dataset. Therefore, we make use of the garchFit function from the fGarch package. As we are primarily interested in the derivation of the unobservable volatility paths, we will neither take a look at the estimated parameters, nor save them to disk. The only two things that we extract are the estimated sigma series and the standardized residuals, which are the residuals divided by their respective estimated volatility.

## clean workspace rm(list=ls()) ## load packages library(zoo) library(fGarch) ## read sp500 return data z <- read.zoo(file = "data/all_sp500_clean_logRet.csv", header = T) logRet <- coredata(z) #################### ## fit GARCH(1,1) ## #################### ## get dimensions nAss <- length(logRet[1, ]) nObs <- length(logRet[, 1]) ## preallocation stdResid <- data.frame(matrix(data=NA, nrow=nObs, ncol=nAss)) vola <- data.frame(matrix(data=NA, nrow=nObs, ncol=nAss)) ## fitting procedure for (ii in 1:nAss) { ## display current iteration step print(ii) ## sometimes garchFit failed result <- try(gfit <- garchFit(data=logRet[, ii], cond.dist = "std")) if(class(result) == "try-error") { next } else { ## extract relevant information stdResid[, ii] <- gfit@residuals / gfit@sigma.t vola[, ii] <- gfit@sigma.t } } ## label results colnames(stdResid) <- colnames(logRet) rownames(stdResid) <- index(z) colnames(vola) <- colnames(logRet) rownames(vola) <- index(z) ## round to five decimal places to reduce file size stdResid <- round(stdResid, 5) vola <- round(vola, 5) ################## ## data storage ## ################## write.csv(stdResid, file = "data/standResid.csv", row.names=T) write.csv(vola, file = "data/estimatedSigmas.csv", row.names=T)

## 3 Visualization of volatility paths

Now that we have an estimate for each asset’s volatility path, we want to get a first impression about the data through visualization. Therefore, we now will set up our workspace first, and load required data and packages.

###################### ## set up workspace ## ###################### ## load packages library(zoo) library(reshape) library(ggplot2) library(fitdistrplus) library(gridExtra) library(plyr) ## clean workspace rm(list=ls()) ## source multiplot function source("rlib/multiplot.R") ## load cleaned logarithmic percentage returns as dataframe logRetDf <- read.table(file = "data/all_sp500_clean_logRet.csv", header = T) names(logRetDf)[names(logRetDf) == "Index"] <- "time" ## get dimensions and dates nObs <- nrow(logRetDf) nAss <- ncol(logRetDf)-1 dates <- as.Date(logRetDf$time) ## load volatilities volaDf <- read.csv(file = "data/estimatedSigmas.csv", header = T) names(volaDf)[names(volaDf) == "X"] <- "time"

As first visualization, we now plot all volatility paths that where estimated with GARCH(1,1). Therefore, we define the function **plotLinePaths**, which we can reuse later on.

############################################## ## make multi-line plot of individual paths ## ############################################## ## plot multi-line path of df with column "time" or row.names plotLinePaths <- function(data, alpha=0.2){ "plot dataframe with or without time column" ## test for time column and rename if required if(any(colnames(data) == "time") | any(colnames(data) == "Time")){ if(any(colnames(data) == "Time")){ ## rename to lower case names(data)[names(data)=="Time"] <- "time" } }else{ ## take row.names as x-index data$time <- row.names(data) } ## convert index to date format data$time <- as.Date(data$time) ## melt data to use ggplot dataMt <- melt(data, id.vars = "time") ## create plot p <- (ggplot(dataMt) + geom_line(aes(x=time, y=value, group=variable), alpha = alpha)) } ## calculate mean volatility volaOnly <- as.matrix(volaDf[, -1]) meanVola <- adply(volaOnly, .margins=1, .fun=mean) colnames(meanVola) <- c("time", "value") meanVola$time <- dates plotOnly <- plotLinePaths(volaDf,0.2) p1 <- plotOnly + xlab("time") + ylab("volatility") p1 <- p1 + geom_line(mapping = aes(x=time, y=value), data = meanVola, color = "red")

As can be seen, this graphics is not very enlightening, as the volatility paths remain in a very small interval at the bottom most of the time. Hence, we will immediately proceed by looking at logarithmic volatility paths, so that differences at the bottom will get shown in higher resolution.

############################################# ## plot log-transformation of volatilities ## ############################################# logVola <- data.frame(time=dates, log(as.matrix(volaDf[, -1]))) p2 <- plotLinePaths(logVola, 0.2) + xlab("time") + ylab("logarithmic volatility") + geom_line(mapping = aes(x=time, y=log(value)), data = meanVola, color = "red") p2

What appeared as one big black hose of nearly constant height before, now exhibits some shifts in its level over time. Clearly, times of crises do stand out from the plot: both in the aftermath of the dot-com bubble and during the financial crisis 2008, volatility paths were significantly above average levels. Even more remarkable is the fact that not only most volatility paths and their mean value are affected, but even the minimum of all volatility paths shifts significantly upwards. Hence, this is most likely not only a phenomenon that holds for most assets, but it really seems to affect all assets without exception.

Furthermore, we observe some noisy border at the top, where there are assets that individually enter high-volatility regimes in calm market periods as well. However, this also could be some artificial effect due to the estimation of volatilities with GARCH. Due to the structure of GARCH, a single large return of an asset will always automatically let its volatility estimate spike up. In contrast to the upper border, the lower border of all volatilities seems rather stable, without extreme outliers. This is at large due to the fact that volatility is bounded downwards by zero, and even the logarithmic transformation is not strong enough in order to fully compensate for this.

Still, however, even with logarithmic volatility paths, it is not possible to get more detailed insights into the individual evolutions, as they still overlap too much. This way, it is not possible to draw any conclusion of whether, for example, most paths are rather above the mean volatility, so that volatility levels would be asymmetrically distributed. Therefore, we now will take a look at the empirical quantiles of the volatilities, also. Again, we show them on normal scale, as well as on logarithmic scale.

############################# ## get empirical quantiles ## ############################# ## specify quantiles alphas <- c(0, 0.05, 0.25, 0.5, 0.75, 0.95, 1) nQs <- length(alphas) ## preallocation of empirical quantiles quants <- data.frame(matrix(data=NA, nrow=nObs, ncol=nQs), row.names=as.Date(dates)) names(quants) <- as.character(alphas) ## for each day get empirical quantiles for (ii in 1:nObs) { quants[ii, ] <- quantile(volaDf[ii, -1], probs = alphas) } ## process results for plotting quantsDf <- data.frame(time=as.Date(dates), quants) names(quantsDf) <- c("time", as.character(alphas)) logQuantsDf <- data.frame(time=as.Date(dates), log(quants)) names(logQuantsDf) <- c("time", as.character(alphas)) p3 <- plotLinePaths(quantsDf, 0.99) + ylab("quantiles of volatilities") p4 <- plotLinePaths(logQuantsDf, 0.99) + ylab("quantiles of logarithmic volatilities") p5 <- multiplot(p1, p2, p3, p4, cols=2)

Besides the variation in height, the logarithmic quantiles look quite well-behaved and almost symmetric around the median. The symmetry only seems to fade out at the 5 and 95 percent quantiles. This could still be related to the fact that volatility, on a normal scale, is bounded from below by zero, while it is unbounded from above.

Again, the shift in height due to crises periods also becomes apparent for the empirical quantiles, and again it looks like all quantiles are affected similarly. Looking at this picture, one might be tempted to conclude that all individual volatility paths strongly follow some overall market volatility, which is represented through the median in this case. However, this conclusion is premature at this point. Even if we can get a quite good guess about the overall distribution of volatilities once we know the market volatility, it is still unsure how much this will tell us about any individual stock. We will further clarify what is meant by this in some later part.

For now, we first want to look at an at least somewhat finer resolution than the complete overall distribution of all assets. Therefore, we will split up individual assets into groups. This way, we decrease overlap of volatility paths, as the total number of paths becomes significantly lower. Hence, patterns and changes will become better visible. The division into groups that we will apply here shall follow the classification of assets provided by sector affiliation. Hence, for each of the remaining assets in our SP500 dataset, we need to determine its sector. Usually, you can access this information from any big proprietary data provider. Here, we will rely on the classification provided by Standard & Poor’s, where you can download a list of constituents for almost all sector indices. I already brought this information together in a csv file, where each of the remaining assets of our dataset is listed together with its sector. However, by the time of writing, Standard & Poor’s did not provide lists of constituents for sectors materials, telecommunication services and utilities. Hence, in the csv file, any asset that was not mentioned in one of the other sectors simply bears the label “matTelUtil”.

In order to plot volatility paths for the individual sectoral subgroups, we need to calculate the mean value of each subgroup for each day of our sample period. This task exactly matches the split-apply-combine procedure that Hadley Wickham describes in [Wickham], and that he implemented in the plyr package. Using this package, we now can easily split the data into subgroups determined by date and sector, and apply the **meanValue** function to each group.

## refine volatility for incorporation of sectors volaMt <- melt(volaDf, id.vars="time") names(volaMt) <- c("time", "symbol", "value") ## load sectors sectors <- read.table(file="data/sectorAffiliation.csv", header=T) ## merge volaSectorsMt <- merge(volaMt, sectors, by="symbol") volaSectors <- volaSectorsMt[names(volaSectorsMt) != "symbol"] ## define function to get mean on each subgroup meanValue <- function(data){ meanValue <- mean(data$value) } ## for each day and sector, get mean volatility sectorVola <- ddply(.data=volaSectors, .variables=c("time", "sector"), .fun=meanValue) sectorVola$time <- as.Date(sectorVola$time) p <- ggplot(sectorVola) + geom_line(aes(x=time, y=V1, color=sector)) p

As we can see, at large, all sectors follow some joint market volatility evolution. However, there exist some slight differences over time. From 1995 until 2005, the information technology sector did exhibit the highest volatility of all sectors, while its volatility level is rather average afterwards. Then, in 2005, the energy sector temporarily becomes the sector with highest volatility, before it gets replaced again by the financial sector in 2008. Remarkably, although the financial sector exhibits large volatility in the last third of the sample, its volatility level did reside near the bottom of all sector volatilities for more than the first half. Hence, there clearly is some further variation in volatility levels besides the variation that is caused through up- and downwards shifting market volatility. The relative position of individual sectors changes over time: sectors that where beneath the market volatility in the first half, suddenly show higher levels than average in the second half. Whether this influences the explanatory power of the mean market volatility is what we want to see in the next part.

## 4 Explanatory power of shifting mean

So far, we have seen that there is some large variation in the level of stock volatilities during periods of market turmoil. In these periods, not only the average market volatility increases, but even the minimum of all volatilities increases. Hence, knowledge of the market volatility clearly reveals some information about the current distribution of volatilities. However, the market volatility might still reveal very little information about any given single volatility.

Let’s just assume for a second, that individual logarithmic volatilities are normally distributed around the market volatility. Then, with knowledge of the market volatility, you also know the overall distribution. However, without any further assumptions, you still do not know anything about the individual volatilities within this distribution. Any given single stock still could be anywhere around the mean, provided that it follows the probabilities determined by the normal distribution. You only know that it will be within a range with a given probability, but you still can not predict whether it is above or below the mean. Hence, without further structure, the market volatility might still be a bad estimator for any given single volatility.

However, through imposing some additional structure, one could easily extend this example into a system, where the mean of the normal distribution tells you a lot about the individual stocks as well. Let’s now additionally assume that the individual volatilities follow some ranking. Hence, the volatility of some stock A now will always be larger than the volatility of some stock B. Suddenly, we now will be able to pin down the exact location of any individual volatility very accurately. Let’s just convince ourselves of this effect with a little simulated example.

For that, we will artificially create a first increasing and then decreasing sequence of (logarithmic) market volatility levels. This can be achieved by letting the market volatility follow the first half of a sine oscillation. From this course of oscillation, we extract 100 points in time. Then, for each of the 100 points, we simulate 30 values of a normal distribution with mean value given by the market volatility sequence. This already represents the first case, where only the overall distribution is affected by the market volatility, but where still a large variation of the individual volatilities around the mean exists.

Then, for the second case, we now impose the additional requirement that individual volatilities follow a certain ranking. In order to achieve this, we simply need to sort all the simulated observations at a given point in time. Thereby, we make use of the **plyr** package, so that we can easily apply the **sort** function to each of the 100 rows of the matrix of simulated values. We then make use again of the **plotLinePaths** function in order to plot the resulting 30 simulated volatility paths for both cases. Clearly, the second case imposes much more structure on the evolution of joint volatilities.

library("plyr") ## init sample size and assets nObs <- 100 nAss <- 30 ## create increasing and decreasing market mean sequence mus <- sin(1:nObs/(nObs/pi)) ## specify standard deviation sigma <- 0.5 ## define simulation function simulateForGivenMean <- function(mu){ rnorm(n=nAss, mean=mu, sd=sigma) } ## simulate for each mu values: individual assets in columns simVals <- t(sapply(mus, simulateForGivenMean)) simValsDf <- data.frame(time = 1:nObs, simVals) ## for each day, sort simulated values simValsSorted <- aaply(.data = simVals, .margins=1, .fun=sort) simValsSortedDf <- data.frame(time = 1:nObs, simValsSorted) p1 <- plotLinePaths(simValsDf) p2 <- plotLinePaths(simValsSortedDf) p3 <- multiplot(p1, p2, cols=1)

For the sake of completeness, we now additionally want to examine whether this increased structure really results in higher predictive power of the market volatility. Therefore, we extract the sample correlation between each of the simulated individual volatility paths and the market volatility. Hence, for each of the two market situations we get 30 correlations, which we compare in a kernel density plot. Clearly, the situation where volatility paths are sorted results in larger correlations. The market volatility thus will have greater explanatory power in this case.

######################################## ## get correlations with market trend ## ######################################## corrs <- vector(length = nAss) for(ii in 1:nAss){ corrs[ii] <- cor(simVals[, ii], mus) } corrsSorted <- vector(length = nAss) for(ii in 1:nAss){ corrsSorted[ii] <- cor(simValsSorted[, ii], mus) } corrsDf <- data.frame(index = 1:nAss, unSorted = corrs, sorted = corrsSorted) corrsMt <- melt(corrsDf, id.vars = "index") p <- ggplot(corrsMt) + geom_density(aes(x=value, color=variable)) p

So which of the two situations is closer to the behavior of real world markets? We will try to find out in the next part.

## 5 Principal component analysis

We now want to add some quantitative rigor to the qualitative analysis of the preceding parts. So far, we have narrowed our focus down to financial crises and market volatility as common underlying factors of the evolution of stock market volatilities. The reason for this is twofold. First, the relation with these factors became obvious in the first pictures of joint volatility paths. And second, they have the big advantage of being nicely interpretable. However, with the help of quantitative methods, we now have the chance to extend our search of common factors to uninterpretable factors also. Interpretable factors would only be the best case that we could hope for, and even unobservable and uninterpretable factors would be welcome in risk and asset management applications. Hence, we will now use a more general approach to check volatility paths for commonalities. And, as we will see, the market volatility will play a big role here again.

Without going too much into details, you can think of principal component analysis as a technique that tries to explain the variation of a high-dimensional dataset as efficient as possible. Therefore, you construct synthetic and uncorrelated factors (called principal components) from the original data sample. Each component is a linear combination of the original variables of the dataset. The components are calculated successively, and in each step you choose the orthogonal linear combination that explains as much of the remaining variation as possible. This way, the individual principal components will show decreasing proportions of variation that they explain. Ideally, you will end up with a situation, where most of the variation of the high-dimensional data is already explained through a small number of principal components.

For the computation of the principal components we can simply rely on the implementation of the base R package “stats”. Then, in order to see how much the high-dimensional setting boils down to some factors only, we plot the cumulated variation explained by the first components.

## get volatilities without time for PCA analysis volaNoTime <- volaDf[, -1] ## get first two principal components PCAresults <- princomp(x=volaNoTime) PC1 <- PCAresults$scores[, 1] PC2 <- PCAresults$scores[, 2] ## get cumulative proportion of variance explainedByPCA <- summary(PCAresults) props <- explainedByPCA$sdev^2 / sum(explainedByPCA$sdev^2) ## plot proportions of first k values k <- 10 propsDf <- data.frame(PC=1:k, proportion=props[1:k], cumulated=cumsum(props[1:k])) propsMt <- melt(propsDf, id.vars = "PC") p <- ggplot(propsMt, aes(x=factor(PC), y=value, fill=variable)) + geom_bar(position="dodge") + xlab("Principal Components") + ylab("proportion of variance") p

As the picture shows, the first component already explains roughly 50 percent of the total variation of the data. However, the explanatory power of the following components decreases substantially, with the second component already explaining only 15 percent of the variation. Subsequently, the next three components explain roughly 5 percent each. From the sixth component onwards the explanatory power is already so low that they probably capture already very specific variation that concerns a few assets only. And, even with 10 components, we hardly exceed a variation of 80 percent.

Hence, it looks like if the evolution of volatility paths was driven by too many different factors to be efficiently reduced to a low-dimensional system. One way we could think about this is that asset volatilities roughly do follow one, two or maybe even five common factors. However, at any point in time, there also will be a substantial number of assets that temporarily follow some individual dynamics, deviating from the larger trend. For example, even if the common factors indicate a calm market period, there still will be assets that individually exhibit stressed behavior and high volatility. Or, following a more traditional way of thinking, the evolution of a given asset is driven by two different factors: a common market factor, and an idiosyncratic, firm specific component. Sadly, the idiosyncratic components seem too large to be neglected.

Nevertheless, even though it is clear that we will not achieve the dimensionality reduction that we where striving for, we will now take a more detailed look at the first components. Maybe we can at least give them some meaningful interpretation, so that we get a feeling about when we have to expect a high average market volatility.

The first thing to do is to look at the evolution of the first two principal components. Thereby, we want to plot the path of each principal component individually, as well as the path that they follow when considered together. As ggplot does not support three-dimensional graphics yet, we will express the third dimension, time, through colors. To simplify matters, we do only map numerical indices to the color axes, instead of real dates. However, you can still read off the date-color mapping from the respective one-dimensional paths of the individual components. For reasons that will become obvious in subsequent parts, we also plot the first component with negative sign.

## scatterplot of first two PCs: first one with negative sign pCs <- data.frame(time=1:nObs, cbind(PC1, PC2)) pCsMt <- melt(pCs, id.vars=c("PC1","PC2")) p1 <- ggplot(pCsMt, aes(x=-PC1, y=PC2), colour=value) + geom_point(aes(colour=value)) ## plot individual time series PC1 and PC2 pC3 <- data.frame(time=as.Date(dates), time2=1:nObs, -PC1, PC2) names(pC3) <- c("time", "time2", "-PC1", "PC2") pC3Mt <- melt(pC3, id.vars=c("time", "time2")) ## plot individual time series PC1 and PC2 p4 <- ggplot(pC3Mt, aes(x=time, y=value)) + geom_path(aes(colour=time2)) + facet_grid( variable ~ .) grid.arrange(p1, p4, ncol=2)

Looking at the first principal component, one should already recognize some familiar pattern: it’s peaks coincide with financial crises. As we will show later, this first principal component has an exceptional high correspondence with the market volatility. We can recognize crisis periods also in the path of the second principal component. Here, the component takes on extreme values during both financial crises, too. However, the values differ in sign for the two crises periods: the component shows negative values in the aftermath of the dot-com bubble, while it takes on positive values during the financial crisis of 2008. Hence, when considering both components together, crisis periods can be found in the upper right and lower right corners of a two-dimensional space of both components.

As the first component’s sensitivity to financial crises is equally strong as the market volatility’s sensitivity, they both should also exhibit some mutual relationship. Hence, we now want to examine the strength of the similarity between first component and market volatility. Thereby, we simply use mean and median of all volatility paths as a proxy for the market volatility. Additionally, we also allow the principal component to be linearly transformed (shifted and rescaled) to better match the path of the market volatility. We regress each market volatility proxy on the first component, in order to get a best linear transformation. Afterwards, we choose the market volatility proxy with the largest value, and plot it together with the rescaled version of the first principal component.

In order to get mean volatility and median for each given date, we again use the plyr package of Hadley Wickham. This way, we can easily split the data into subsets, and then apply a function to each part.

library("plyr") ## load volatilities volaDf <- read.csv(file = "data/estimatedSigmas.csv", header = T) names(volaDf)[names(volaDf) == "X"] <- "time" volaNoTime <- volaDf[, -1] ## get first two principal components PCAresults <- princomp(x=volaNoTime) PC1 <- PCAresults$scores[, 1] PC2 <- PCAresults$scores[, 2] ## get mean of volatilities as market volatility proxy volaMatrix <- as.matrix(volaDf[, -1]) marketVola <- aaply(volaMatrix, .margins=1, .fun=mean) marketMedian <- aaply(volaMatrix, .margins=1, .fun=median) ## apply linear rescaling according to regression coefficients fit <- lm(marketVola ~ PC1) fit2 <- lm(marketMedian ~ PC1) ## get r squared for both mean and median rSquared1 <- summary(fit)$r.squared rSquared2 <- summary(fit2)$r.squared (sprintf("R^2 for mean=%s, R^2 for median=%s", round(rSquared1, 3), round(rSquared2, 3))) pC1MarketVolaDf <- data.frame(time=dates, rescaled_PC1=fit$fitted.values, marketVola=marketVola) pC1MarketVolaMt <- melt(pC1MarketVolaDf, id.vars="time") p1 <- ggplot(pC1MarketVolaMt, aes(x=time, y=value, colour=variable)) + geom_line()

[1] "R^2 for mean=0.983, R^2 for median=0.966"

Comparing the values of both regressions, we choose the mean volatility as proxy for the market volatility. The value of 0.983 already shows the strong relation between principal component and market volatility. Something that is also confirmed visually:

As the first principal component nearly exactly matches the market volatility, we can relate its properties to the market volatility. Hence, we now can conclude that roughly 50 percent of the variation of individual asset volatilities is explained through a general market volatility level. The rest of the variation thus arises from fluctuations around the mean, and we will try to find some explanation for this, too. First, however, we want to find out how the influence of the market volatility exactly affects the individual assets. Therefore, we need to take a more detailed look at the factor loadings, which will be distinguished according to sectoral affiliation again. As with the principal component itself, we also change the sign of its loadings, so that the graphics will be consistent, and the component becomes directly interpretable as market volatility.

While the first plot shows a histogram of the loadings in absolute terms, the second plot shows relative proportions for the individual sectors.

## get loadings for first component loadings1 <- PCAresults$loadings[, 1] loadDf1 <- data.frame(symbol=names(loadings1), loadings=loadings1) ## load sectors sectors <- read.table(file="data/sectorAffiliation.csv", header=T) loadSectorDf <- merge(loadDf1, sectors, by="symbol") ## adjust bin size to 20 bins binw <- diff(range(loadSectorDf$loadings))/10 ## colorize according to sector p1 <- ggplot(loadSectorDf, aes(x=-loadings, fill=sector)) + geom_histogram(binwidth=binw) + xlab("loadings of 1st principal component") p2 <- ggplot(loadSectorDf, aes(x=-loadings, fill=sector)) + geom_bar(position="fill", binwidth=binw) + xlab("loadings of 1st principal component") multiplot(p1, p2, cols=1)

The most important point here certainly is the fact that all individual assets have a positive loading for the first component. Hence, each single asset follows the variations of the market volatility. Thereby, companies from the financial sector have loadings that are larger than average, while sectors consumer staples, health care and industrials are below average. Still, this does not directly translate into the strength of the market volatility’s influence on individual sectors. To measure this strength, we additionally regress each asset’s volatility path on the path of the first principal component, and extract the values of .

## preallocation rSquared1 <- vector(mode="logical", length=nAss) ## regress each volatility on the first principal component for(ii in 1:nAss){ dataDf <- data.frame(y=volaDf[, ii+1], x=PC1) lmFit <- lm(y~x, data=dataDf) temp <- summary(lmFit) rSquared1[ii] <- temp$r.squared } ## plot R^2 rSquaredDf1 <- data.frame(symbol=row.names(PCAresults$loadings), rSquared=rSquared1) rSquaredSectors <- merge(rSquaredDf1, sectors, sort=F) ## adjust bin size to 10 bins binw <- diff(range(rSquared1))/10 ## colorize according to sector p1 <- ggplot(rSquaredSectors, aes(x=rSquared, fill=sector)) + geom_histogram(binwidth=binw) + xlab("R squared for individual assets and PC1") p2 <- ggplot(rSquaredSectors, aes(x=rSquared, fill=sector)) + geom_bar(position="fill", binwidth=binw) + xlab("R squared for individual assets and PC1") multiplot(p1, p2, cols=1)

As we can see, the values of nearly spread over the complete range from zero to one, with a majority of assets between values of 0.25 and 0.75. The effect of the market volatility differs for sectors, and is more pronounced for financial and industrial companies. On the other hand, companies from the consumer staples sector are influenced by the market volatility only slightly. This can be seen even better with boxplots:

## preallocation rSquared1 <- vector(mode="logical", length=nAss) ## regress each volatility on the first principal component for(ii in 1:nAss){ dataDf <- data.frame(y=volaDf[, ii+1], x=PC1) lmFit <- lm(y~x, data=dataDf) temp <- summary(lmFit) rSquared1[ii] <- temp$r.squared } ## plot R^2 rSquaredDf1 <- data.frame(symbol=row.names(PCAresults$loadings), rSquared=rSquared1) rSquaredSectors <- merge(rSquaredDf1, sectors, sort=F) ## boxplot p <- ggplot(rSquaredSectors) + geom_boxplot(aes(x=sector, y=rSquared)) + xlab("sectors") + theme(axis.text.x=element_text(angle=90)) + ylab(expression(R^{2})) p

As next step, we now want to find some interpretation for the second component as well. However, the path of the second component did not immediately suggest any easy interpretation, so that we now will need to take a look at the loadings first.

## get loadings for first component loadings2 <- PCAresults$loadings[, 2] loadDf2 <- data.frame(symbol=names(loadings2), loadings=loadings2) ## load sectors sectors <- read.table(file="data/sectorAffiliation.csv", header=T) loadSectorDf <- merge(loadDf2, sectors, by="symbol") ## plot histogram of negative loadings p <- ggplot(loadSectorDf) + geom_histogram(aes(x=-loadings, fill=sector)) ## adjust bin size to 20 bins binw <- diff(range(loadSectorDf$loadings))/10 ## colorize according to sector p1 <- ggplot(loadSectorDf, aes(x=loadings, fill=sector)) + geom_histogram(binwidth=binw) + xlab("loadings of 2nd principal component") p2 <- ggplot(loadSectorDf, aes(x=loadings, fill=sector)) + geom_bar(position="fill", binwidth=binw) + xlab("loadings of 2nd principal component") multiplot(p1, p2, cols=1)

As it turns out, the loadings of the assets are almost fairly cut in halves with respect to their sign: approximately one half of the loadings has a positive sign, while the other half has a negative sign. However, this fair split does not hold for individual sectors. While financial companies nearly show a positive loading exclusively, companies from health care and information technology sectors have negative signs only. Hence, the second component picks up differences in volatility between these sectors. Let’s see, what these different signs of the loadings will actually mean for the volatilities of the sectors. Therefore, we need to look at loadings and the path of the second component simultaneously.

## evolution 2nd factor PC2df <- data.frame(time=dates, PC2=PC2) p2 <- ggplot(PC2df, aes(x=time, y=PC2)) + geom_line()

As we can see from the figure, the second principal component takes on negative values constantly for roughly the first half of the sample period, before it shifts into positive numbers until the end of the sample. Due to the loadings, a negative second component coincides with decreased volatility levels for the financial sector, and increased levels for health care and information technology sectors. This effect becomes even more pronounced during financial crises, where the component takes on its most extreme values.

In other words, the first component corresponds to the level of the market volatility, and it shifts the overall distribution of asset volatilities up and down. The second component, however, only captures changes within a given distribution. For the first half of the sample, it adds some extra volatility on the information technology sector, while it decreases the volatility of the financial sector. In the second half, it is the other way round.

To further clarify the role of the second component, we create another simulated example as visualization of this effect.

## simulate points from normal distribution nSim <- 20 mu <- 0 xx <- rnorm(nSim, mu, 1) ## points below the mean value are listed as group1 indSign <- factor(xx<mu, levels=c("TRUE", "FALSE"), labels=c("group1", "group2")) ## append points with changed sign - groups affiliation unchanged xxDf <- data.frame(vals=c(xx,-xx), time=factor(c(rep(1, nSim), rep(2, nSim)), levels=c(1, 2), labels=c("positive factor", "negative factor")), group=rep(indSign, 2), y=rep(0,(2*nSim))) ## show points for different realizations of factor p <- ggplot(xxDf, aes(x=vals, y=y, color=group)) + geom_point() + facet_grid(time ~ .) + stat_function(fun=dnorm, color="black") + scale_x_continuous(limits = c(-5, 5)) + labs(title="variation explained by PC2") p

The figure shall convey the following intuition: the realization of the second component does not change the overall distribution, but it only explains variation within the distribution. Hence, for positive values, the first group of points will be located to the left, while they are located to the right for negative realizations.

As with the first component, we now want to see how the loadings translate into real influence. Therefore, we regress each asset’s volatility path on the path of the second component, and extract the values of the individual regressions.

## preallocation rSquared2 <- vector(mode="logical", length=nAss) ## regress each volatility on the first principal component for(ii in 1:nAss){ dataDf <- data.frame(y=volaDf[, ii+1], x=PC2) lmFit <- lm(y~x, data=dataDf) temp <- summary(lmFit) rSquared2[ii] <- temp$r.squared } ## plot R^2 rSquaredDf2 <- data.frame(symbol=row.names(PCAresults$loadings), rSquared=rSquared2) rSquaredSectors <- merge(rSquaredDf2, sectors, sort=F) ## adjust bin size to 10 bins binw <- diff(range(rSquared2))/10 ## colorize according to sector p1 <- ggplot(rSquaredSectors, aes(x=rSquared, fill=sector)) + geom_histogram(binwidth=binw) + xlab("R squared for individual assets and PC2") p2 <- ggplot(rSquaredSectors, aes(x=rSquared, fill=sector)) + geom_bar(position="fill", binwidth=binw) + xlab("R squared for individual assets and PC2") multiplot(p1, p2, cols=1)

## preallocation rSquared2 <- vector(mode="logical", length=nAss) ## regress each volatility on the first principal component for(ii in 1:nAss){ dataDf <- data.frame(y=volaDf[, ii+1], x=PC2) lmFit <- lm(y~x, data=dataDf) temp <- summary(lmFit) rSquared2[ii] <- temp$r.squared } ## plot R^2 rSquaredDf2 <- data.frame(symbol=row.names(PCAresults$loadings), rSquared=rSquared2) rSquaredSectors <- merge(rSquaredDf2, sectors, sort=F) ## boxplot p <- ggplot(rSquaredSectors) + geom_boxplot(aes(x=sector, y=rSquared)) + xlab("sectors") + theme(axis.text.x=element_text(angle=90)) + ylab(expression(R^{2})) p

As we can see from the pictures, the influence of the second component measured by the values is substantially lower than it was for the first component. However, as already indicated by the loadings, it does have a great influence on the IT sector nevertheless.

Now let’s try to embed these quantitative findings into a more qualitative perception that we have of the sample period, and think about their relevance for practical applications. First, the quantitative findings do match very well with the respective origins of the crises periods. During the dot-com crash, which was labeled after its origin in the IT sector, our quantitative findings also do show some enlarged volatility levels of the same sector. And, for the financial crisis of 2008, the second component assigns a higher volatility level to the financial sector, while the IT sector simply follows some more general market conditions.

## 6 Conclusion

How much should we rely on these findings when trying to assess future risks? Well, in my opinion, we probably should focus rather on the larger patterns pointed out by the analysis, instead of sticking too much with the details. For example, although the second component captures differences between both crises very appropriately, I would not bet too much on its predictive power in future. The point is, it overfits the exact origin of the crises, such that it mainly distinguishes between IT and financial sector. However, I’d rather focus on the bigger story that this component tells: crises may originate in one sector, which then will exhibit volatility levels above the general market trend. Nevertheless, the exact sector, where the next crisis will originate, could be any. Our data simply did contain only crises that were caused by IT and financial sector.

Another pattern that seems rather robust to me is the variation captured by the first principal component. Hence, the general market volatility is not constant over time, but it increases during crises periods. And these shifts in market volatility seem to affect many, if not all, individual assets. Besides, however, at each point in time, some individual assets will always deviate from the larger trend of volatility. Hence, there must be some idiosyncratic component to volatility, too.

## References

[Wickham(2011)] | Hadley Wickham. The split-apply-combine strategy for data analysis. Journal of Statistical Software, 400 (1):0 1-29, 4 2011. ISSN 1548-7660. URL http://www.jstatsoft.org/v40/i01. |

## Downloading SP500 stock price data with R

## 1 Summary

In this blog post I want to show how to download data of all constituents of the **S&P500** index, which is one of the leading stock market indices for US equity, from **Yahoo!Finance**. According to Standard and Poor’s, “the index includes 500 leading companies and captures approximately 80% coverage of available market capitalization.” The target will be to **get a clean and broadly applicable dataset**, which then can be used for many types of financial applications, such as risk management and asset management.

~~Instead of running the R code snippets on your own, you can also~~ **directly download** the most recent version of the resulting dataset~~which I keep for my own research: ~~ *historic prices* and *clean return series*. These datasets, however, could slightly deviate from the~~results obtained in this blog post, since they probably build on a more recent version of the input files.~~

**Update:** Due to the Yahoo!Finance data disclaimer, I will not provide the application-ready and cleaned-up final dataset. I am sorry, but in order to get the data, you need to run the steps on your own.

**2nd Update:** Meanwhile I did replicate the whole exercise in my new statistical software language of choice: julia. Not only does this dramatically speed up the procedure, but I also did process the data with regards to missing values in a slightly improved manner. You can check out this new version here, and in addition you can find a comparison of different approaches to deal with missing values at this blog post.

## 2 Data considerations

The best way to test the adequacy of a model is testing its performance on out-of-sample data. With this application in mind, the dataset should be **large enough** so that it can be divided into **a part** that is used **to fit the model** and **another part** that can be **used for out-of-sample testing**. The problem with financial data is that asset return distributions vary over time, thereby posing additional challenges to risk and asset management. This most easily can be seen in times of crisis, where markets clearly exhibit higher levels of volatility. Any econometric **model** now should be able to replicate reality at any given point in time, and therefore needs to be able to **capture both calm and turbulent market periods**. Therefore, your model ideally should also be tested on both market states, so that your out-of-sample period should entail a crisis period as well. Similarly, of course, your model ideally should be fitted to both types of market states as well. For example, fitting a model to calm market times only will most likely lead to improper results during financial crisis, since crisis would be one region of the unconditional distribution that the model has never seen before. For a very extreme example, think about fitting a model to positive returns only. You would clearly fail to capture the overall unconditional distribution, and not to speak of the conditional distribution of negative returns.

According to this argumentation, we will pick **January the 1st, 1995**, as the beginning of the data sample. The data sample hence entails two financial crisis: the bursting of the **dot-com bubble in 2001**, and the **2008 financial crisis** following the bursting of the US housing bubble. The first crisis now could be used to fit some econometric model, while we can test its performance in both risk or asset management applications out-of-sample during the second crisis.

## 3 Data download

The first step – of course – is downloading the data from Yahoo!Finance. This means that we need to get all the individual assets that comprise the index. A **list of all current constituents** can be accessed on the S&P homepage, and can be directly downloaded as xls-file here, or – if the direct link is broken – by manually clicking on the full constituents list on the S&P 500 homepage. However, to make things easier, you can download this already processed csv-file, which then can be **directly read into R**. Following my convention, you should copy the file in the subfolder *raw_data* of the project’s directory, to indicate that this list comprises externally created data.

In the past, this list was where I did encounter the first problem with the data: the list of 500 components was already missing the symbol of one stock. A problem that seems to be fixed by now.

After reading in the list of constituents, the function **get.hist.quote** from the tseries package is used to download the associated data in a for-loop. Note, that I did not conduct any linearization or preallocation with this code, so that the code’s processing time could be further decreased in general. Anyways, since individual historic price series need to be merged to common dates, acceleration most likely would result in significantly less readable code.

## downloads historic prices for all constituents of SP500 library(zoo) library(tseries) ## read in list of constituents, with company name in first column and ## ticker symbol in second column spComp <- read.csv("raw_data/sp500_constituents.csv" ) ## specify time period dateStart <- "1995-01-01" dateEnd <- "2013-05-01" ## extract symbols and number of iterations symbols <- spComp[, 2] nAss <- length(symbols) ## download data on first stock as zoo object z <- get.hist.quote(instrument = symbols[1], start = dateStart, end = dateEnd, quote = "AdjClose", retclass = "zoo", quiet = T) ## use ticker symbol as column name dimnames(z)[[2]] <- as.character(symbols[1]) ## download remaining assets in for loop for (i in 2:nAss) { ## display progress by showing the current iteration step cat("Downloading ", i, " out of ", nAss , "\n") result <- try(x <- get.hist.quote(instrument = symbols[i], start = dateStart, end = dateEnd, quote = "AdjClose", retclass = "zoo", quiet = T)) if(class(result) == "try-error") { next } else { dimnames(x)[[2]] <- as.character(symbols[i]) ## merge with already downloaded data to get assets on same dates z <- merge(z, x) } } ## save data write.zoo(z, file = "data/all_sp500_price_data.csv", index.name = "time")

Although this code did work for me, please note that it is not implemented in a completely robust way, as the download of the first asset is not executed in a try block. Hence, if problems occur with this asset, you might encounter an error.

## 4 Visualizing missing values

The best way to identify any data errors with such high-dimensional data always is through visualization. As first step, we will identify assets and dates with missing values. One easy way of looking at this is to order assets with respect to the number of occurring NAs, and **highlighting entries with missing values**. Usually, I tend to create my visualizations with **ggplot2** whenever possible. However, given the size of the dataset, this will be quite time-consuming. Hence, I will also implement a comparable version of the visualization using only low-level graphical tools of R with faster execution. Missing values are shown in light blue and pink respectively.

## load libraries library(zoo) library(ggplot2) library(lattice) library(reshape) ## read in sp500 prices z <- read.zoo(file = "data/all_sp500_price_data.csv", header = T) nAss <- ncol(z) ## decode missing values with numeric ones miss <- is.na(coredata(z))*1 # multiplication converts # logical to numeric matrix ## summing up missing values per column and ordering with respect to ## number of missing values ranks <- rank(colSums(miss), ties.method = "random") miss[, ranks] <- miss ################################ ## visualization with ggplot2 ## ################################ ## missing values as dataframe missDf <- data.frame(miss) ## append time index as first column missDf$time <- factor(index(z)) missDf <- missDf[c(nAss+1, 1:nAss)] ## processing dataframe for ggplot2 missMt <- melt(missDf, id.vars = c("time")) ## plotting values p <- (ggplot(missMt, aes(x = time, y = variable, colour = value)) + geom_tile(aes(fill = value))) ## getting rid of date factor labels p2 <- p + theme(axis.line = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), axis.ticks = element_blank(), legend.position = "none", panel.background = element_blank(), panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), plot.background = element_blank()) + xlab("time") + ylab("stocks") ################################## ## visualization with levelplot ## ################################## ## getting rid of column names colnames(miss) <- NULL levelplot(miss, aspect = 1, col.regions = cm.colors, xlab = "time", ylab = "Stocks", main = "Missing values", colorkey = F)

As can be seen, there are a significant number of missing values for about one fifth of the 500 S&P components. In addition, there are also a few dates where data is missing for a substantial share of stocks. We now want to simply **eliminate stocks with multiple missing values**, since we assume that these data problems are only arising due to deficiencies of the data provider. In other words, it generally should be possible to get clean data on the chosen sample period through usage of a proprietary and more comprehensive database. We hence do not tackle stocks with large ratios of missing values in any ad-hoc way, since we do not want to include any spurious effects into the subsequent analysis.

As threshold value we choose to **exclude all assets with more than nine missing values**, and afterwards any **dates with more than two** missing values. Eliminated dates are shown together with the number of missing observations in a **Google Charts** html table. After completely removing stocks and dates with multiple NAs, all occasionally occurring remaining NAs will simply be set to 0 in the associated return series. For example, for a case of one missing observation for one asset, like

incompleteData <- cbind(c(4, NA, 6), c(6, 9, NA))

4 | 6 |

nil | 9 |

6 | nil |

this procedure will lead to a return of zero for the first pair of days, and a return corresponding to the full price change for the second pair of days.

missingValues <- which(is.na(incompleteData)) filledValues <- incompleteData filledValues[missingValues] <- incompleteData[missingValues-1] (filledValues) (diff(filledValues))

[,1] [,2] [1,] 4 6 [2,] 4 9 [3,] 6 9 [,1] [,2] [1,] 0 3 [2,] 2 0

At the end, we will have a matrix of approximately **4600 observations for** each of the approximately **350 remaining stocks**.

## load packages library(zoo) library(googleVis) library(fGarch) rm(list=ls()) ## read sp500 prices z <- read.zoo(file = "data/all_sp500_price_data.csv", header = T) ###################################################### ## eliminate stocks with more than 2 missing values ## ###################################################### ## get indices of stocks with max 9 missing values miss <- is.na(coredata(z))*1 indices <- colSums(miss) < 10 suitableStocks <- z[, indices] ## show number of remaining stocks print(length(suitableStocks[1, ])) ## create table with missing values to show dates with many NAs miss <- is.na(coredata(suitableStocks))*1 datesWithNAs <- rowSums(miss) missingValues <- data.frame(dates = index(z)[datesWithNAs > 3], number = datesWithNAs[datesWithNAs > 3]) ## show missing values in html file p <- gvisTable(missingValues) write(p$html$chart, file = "pics/table_of_missingValues.html") #################################################### ## remove dates with more than two missing values ## #################################################### datesToKeep <- rowSums(miss) < 3 pricesNaDatesAndStocksRemoved <- suitableStocks[datesToKeep, ] ############################################################# ## calculate percentage log returns and replace NAs with 0 ## ############################################################# ## show number of remaining NAs print(sum(is.na(pricesNaDatesAndStocksRemoved))) ## avoid logical indexing with zoo object allPrices <- coredata(pricesNaDatesAndStocksRemoved) missingValues <- which(is.na(allPrices)) ## fill missing values with value of previous day filledValues <- allPrices filledValues[missingValues] <- allPrices[missingValues - 1] logRetNumeric <- 100*diff(log(filledValues)) ## create zoo object again returnDates <- index(pricesNaDatesAndStocksRemoved)[-1] logRet <- zoo(x = logRetNumeric, order.by = returnDates) ## check that no NAs exist anymore if(sum(is.na(logRet)) != 0){ stop("still NAs in return series!!") ## could be due to NAs on consecutive days } ## round to five decimal places logRet <- round(logRet, 5) ## store data write.zoo(logRet, file = "data/all_sp500_clean_logRet.csv")

Again, at this point the code is not implemented in a completely robust manner. If you have an NA at the first day of the sample, you would try to access an element at index 0. This, of course, would lead to an error. Furthermore, I do not deal explicitly with the case of consecutive NAs. If this case occurred, you would be warned, as the last if-statement would throw an exception. For both cases one should rather easily find some go-around with for loops. However, this most likely would kill your performance. And any more sophisticated solution I did skip in order to keep the code simple.

## 5 Zoo object pitfalls

When removing all of the remaining NAs, bear in mind that **zoo objects can not be accessed by logical indexing**! Hence, the following assignment does not work:

```
logRet[is.na(logRet)] <- 0
```

Instead, we need:

```
coredata(logRet)[is.na(logRet)] <- 0
```