Downloading SP500 stock price data with R

1 Summary

In this blog post I want to show how to download data for all constituents of the S&P 500 index, one of the leading stock market indices for US equity, from Yahoo!Finance. According to Standard & Poor’s, “the index includes 500 leading companies and captures approximately 80% coverage of available market capitalization.” The goal is to obtain a clean and broadly applicable dataset that can then be used for many kinds of financial applications, such as risk management and asset management.

Instead of running the R code snippets on your own, you can also directly download the most recent version of the resulting dataset which I keep for my own research: historic prices and clean return series. These datasets, however, could slightly deviate from the results obtained in this blog post, since they probably build on a more recent version of the input files.

Update: Due to the Yahoo!Finance data disclaimer, I will not provide the application-ready and cleaned-up final dataset. I am sorry, but in order to get the data, you need to run the steps on your own.

2nd Update: In the meantime I have replicated the whole exercise in my new statistical software language of choice: Julia. Not only does this dramatically speed up the procedure, but the data is also processed in a slightly improved manner with regard to missing values. You can check out the new version here, and in addition you can find a comparison of different approaches for dealing with missing values in this blog post.

2 Data considerations

The best way to test the adequacy of a model is to test its performance on out-of-sample data. With this application in mind, the dataset should be large enough that it can be divided into one part used to fit the model and another part used for out-of-sample testing. The problem with financial data is that asset return distributions vary over time, which poses additional challenges to risk and asset management. This can be seen most easily in times of crisis, when markets clearly exhibit higher levels of volatility. Any econometric model should be able to replicate reality at any given point in time, and therefore needs to capture both calm and turbulent market periods. Ideally, your model should hence be tested on both market states, so your out-of-sample period should contain a crisis period as well. Similarly, of course, your model should ideally be fitted to both types of market states. For example, fitting a model to calm market times only will most likely lead to improper results during a financial crisis, since a crisis would be a region of the unconditional distribution that the model has never seen before. For a very extreme example, think about fitting a model to positive returns only: you would clearly fail to capture the overall unconditional distribution, let alone the conditional distribution of negative returns.

Following this reasoning, we will pick January 1st, 1995, as the beginning of the data sample. The sample hence contains two financial crises: the bursting of the dot-com bubble in 2001, and the 2008 financial crisis following the bursting of the US housing bubble. The first crisis can be used to fit some econometric model, while its performance in risk or asset management applications can be tested out-of-sample during the second crisis.
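To make this concrete, once the price data has been downloaded, the sample could be split into an estimation window containing the dot-com crash and an out-of-sample window containing the 2008 crisis. The following is only a minimal sketch: the split dates are illustrative assumptions, and prices stands for the merged zoo object of daily prices (called z in the download script below).

library(zoo)

## illustrative split dates (assumptions, not prescribed above):
## estimation window covers the dot-com crash, evaluation window
## covers the 2008 financial crisis
inSample  <- window(prices, start = as.Date("1995-01-01"),
                    end = as.Date("2004-12-31"))
outSample <- window(prices, start = as.Date("2005-01-01"),
                    end = as.Date("2013-05-01"))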

3 Data download

The first step – of course – is downloading the data from Yahoo!Finance. This means that we need to get all the individual assets that comprise the index. A list of all current constituents can be accessed on the S&P homepage, and can be directly downloaded as an xls file here, or – if the direct link is broken – by manually clicking on the full constituents list on the S&P 500 homepage. However, to make things easier, you can download this already processed csv file, which can be read directly into R. Following my convention, you should copy the file into the subfolder raw_data of the project’s directory, to indicate that this list comprises externally created data.

In the past, this list was where I encountered the first problem with the data: the list of 500 components was missing the ticker symbol of one stock, a problem that seems to have been fixed by now.
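If you want to guard against this kind of problem, a quick sanity check on the constituents file can be run before starting the download. This is just a sketch, assuming the csv layout described above (company name in the first column, ticker symbol in the second):

## check the constituents list for empty or missing ticker symbols
spComp <- read.csv("raw_data/sp500_constituents.csv")
badRows <- which(is.na(spComp[, 2]) | spComp[, 2] == "")
if (length(badRows) > 0) {
   print(spComp[badRows, ])   # inspect problematic entries manually
}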

After reading in the list of constituents, the function get.hist.quote from the tseries package is used to download the associated data in a for loop. Note that I did not apply any vectorization or preallocation in this code, so its processing time could in principle be reduced further. However, since the individual historic price series need to be merged to common dates, such acceleration would most likely result in significantly less readable code.

## downloads historic prices for all constituents of SP500
library(zoo)
library(tseries)                        

## read in list of constituents, with company name in first column and
## ticker symbol in second column
spComp <- read.csv("raw_data/sp500_constituents.csv" ) 

## specify time period
dateStart <- "1995-01-01"               
dateEnd <- "2013-05-01"

## extract symbols and number of iterations
symbols <- spComp[, 2]
nAss <- length(symbols)

## download data on first stock as zoo object
z <- get.hist.quote(instrument = symbols[1], start = dateStart,
                    end = dateEnd, quote = "AdjClose",
                    retclass = "zoo", quiet = T)

## use ticker symbol as column name 
dimnames(z)[[2]] <- as.character(symbols[1])

## download remaining assets in for loop
for (i in 2:nAss) {
   ## display progress by showing the current iteration step
   cat("Downloading ", i, " out of ", nAss , "\n")

   result <- try(x <- get.hist.quote(instrument = symbols[i],
                                     start = dateStart,
                                     end = dateEnd, quote = "AdjClose",
                                     retclass = "zoo", quiet = T))
   if (inherits(result, "try-error")) {
      next
   }
   else {
      dimnames(x)[[2]] <- as.character(symbols[i])

      ## merge with already downloaded data to get assets on same dates
      z <- merge(z, x)
   }
}

## save data
write.zoo(z, file = "data/all_sp500_price_data.csv", index.name = "time")

Although this code worked for me, please note that it is not implemented in a completely robust way, since the download of the first asset is not wrapped in a try block. Hence, if problems occur with this asset, you will encounter an error.
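One way around this limitation is to wrap every download, including the first one, in a try block and to merge all successful series at the end. The following is only a sketch of such an alternative layout, not the code used for the results in this post; it assumes symbols, dateStart and dateEnd as defined above.

## sketch: download each asset inside try() and merge successes afterwards
downloadOne <- function(sym) {
   res <- try(get.hist.quote(instrument = sym, start = dateStart,
                             end = dateEnd, quote = "AdjClose",
                             retclass = "zoo", quiet = TRUE),
              silent = TRUE)
   if (inherits(res, "try-error")) return(NULL)   # skip failed downloads
   dimnames(res)[[2]] <- sym
   res
}

allSeries <- lapply(as.character(symbols), downloadOne)
allSeries <- allSeries[!sapply(allSeries, is.null)]   # drop failures
z <- Reduce(merge, allSeries)                         # merge on common dates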

4 Visualizing missing values

The best way to identify data errors in such high-dimensional data is through visualization. As a first step, we will identify assets and dates with missing values. One easy way of looking at this is to order assets with respect to the number of occurring NAs and to highlight entries with missing values. Usually, I tend to create my visualizations with ggplot2 whenever possible. However, given the size of the dataset, this would be quite time-consuming. Hence, I will also implement a comparable version of the visualization using only low-level graphical tools of R, with faster execution. Missing values are shown in light blue and pink, respectively.

## load libraries
library(zoo)
library(ggplot2)
library(lattice)
library(reshape)

## read in sp500 prices
z <- read.zoo(file = "data/all_sp500_price_data.csv", header = T)
nAss <- ncol(z)

## decode missing values with numeric ones
miss <- is.na(coredata(z))*1            # multiplication converts
                                        # logical to numeric matrix

## sum up missing values per column and reorder columns with respect to
## the number of missing values (the column with rank r moves to position r)
ranks <- rank(colSums(miss), ties.method = "random")
miss[, ranks] <- miss

################################
## visualization with ggplot2 ##
################################

## missing values as dataframe 
missDf <- data.frame(miss)

## append time index as first column
missDf$time <- factor(index(z))
missDf <- missDf[c(nAss+1, 1:nAss)]

## processing dataframe for ggplot2
missMt <- melt(missDf, id.vars = c("time"))

## plotting values
p <- (ggplot(missMt, aes(x = time, y = variable, colour = value)) +
      geom_tile(aes(fill = value)))

## getting rid of date factor labels
p2 <- p + theme(axis.line = element_blank(), 
                axis.text.x = element_blank(),
                axis.text.y = element_blank(), 
                axis.ticks = element_blank(),
                legend.position = "none",
                panel.background = element_blank(),
                panel.border = element_blank(), 
                panel.grid.major = element_blank(),
                panel.grid.minor = element_blank(),
                plot.background = element_blank()) +
  xlab("time") + ylab("stocks")
print(p2)

##################################
## visualization with levelplot ##
##################################

## getting rid of column names
colnames(miss) <- NULL
levelplot(miss, aspect = 1, col.regions = cm.colors,
          xlab = "time", ylab = "Stocks", main = "Missing values",
          colorkey = F)

[Figures: missing values per stock (ordered by number of NAs) and date; ggplot2 version and levelplot version]

As can be seen, there is a significant number of missing values for about one fifth of the 500 S&P components. In addition, there are a few dates where data is missing for a substantial share of stocks. We will now simply eliminate stocks with many missing values, since we assume that these data problems arise only from deficiencies of the data provider. In other words, it should generally be possible to get clean data for the chosen sample period through a proprietary and more comprehensive database. We hence do not treat stocks with large ratios of missing values in any ad hoc way, since we do not want to introduce spurious effects into the subsequent analysis.

As threshold values, we choose to exclude all assets with more than nine missing values, and afterwards all dates with more than two missing values. The eliminated dates are shown together with their number of missing observations in a Google Charts HTML table. After completely removing stocks and dates with multiple NAs, all remaining occasional NAs are simply set to 0 in the associated return series. For example, consider the case of one missing observation per asset, like

incompleteData <- cbind(c(4, NA, 6), c(6, 9, NA))
(incompleteData)
     [,1] [,2]
[1,]    4    6
[2,]   NA    9
[3,]    6   NA

this procedure leads, for the first asset, to a return of zero for the first pair of days and a return corresponding to the full price change for the second pair of days (and the reverse pattern for the second asset):

missingValues <- which(is.na(incompleteData))   # linear (column-major) indices of NAs

## fill each missing value with the preceding entry, i.e. the previous
## day's price of the same asset
filledValues <- incompleteData
filledValues[missingValues] <- incompleteData[missingValues - 1]
(filledValues)
     [,1] [,2]
[1,]    4    6
[2,]    4    9
[3,]    6    9
(diff(filledValues))
     [,1] [,2]
[1,]    0    3
[2,]    2    0

At the end, we will have a matrix of approximately 4600 observations for each of the approximately 350 remaining stocks.

## load packages
library(zoo)
library(googleVis)
library(fGarch)

rm(list=ls())

## read sp500 prices
z <- read.zoo(file = "data/all_sp500_price_data.csv", header = T)

######################################################
## eliminate stocks with more than 9 missing values ##
######################################################

## get indices of stocks with max 9 missing values
miss <- is.na(coredata(z))*1
indices <- colSums(miss) < 10           
suitableStocks <- z[, indices]

## show number of remaining stocks
print(length(suitableStocks[1, ]))

## create table of eliminated dates: dates with more than two NAs
miss <- is.na(coredata(suitableStocks))*1
datesWithNAs <- rowSums(miss)
missingValues <- data.frame(dates = index(suitableStocks)[datesWithNAs > 2],
                            number = datesWithNAs[datesWithNAs > 2])

## show missing values in html file
p <- gvisTable(missingValues)
write(p$html$chart, file = "pics/table_of_missingValues.html")

####################################################
## remove dates with more than two missing values ##
####################################################

datesToKeep <- rowSums(miss) < 3
pricesNaDatesAndStocksRemoved <- suitableStocks[datesToKeep, ]


#############################################################
## calculate percentage log returns and replace NAs with 0 ##
#############################################################

## show number of remaining NAs
print(sum(is.na(pricesNaDatesAndStocksRemoved)))

## avoid logical indexing with zoo object
allPrices <- coredata(pricesNaDatesAndStocksRemoved)
missingValues <- which(is.na(allPrices))

## fill missing values with value of previous day
filledValues <- allPrices
filledValues[missingValues] <- allPrices[missingValues - 1]
logRetNumeric <- 100*diff(log(filledValues))

## create zoo object again
returnDates <- index(pricesNaDatesAndStocksRemoved)[-1]
logRet <- zoo(x = logRetNumeric, order.by = returnDates)

## check that no NAs exist anymore
if(sum(is.na(logRet)) != 0){
   stop("still NAs in return series!!")
   ## could be due to NAs on consecutive days
}

## round to five decimal places
logRet <- round(logRet, 5)

## store data
write.zoo(logRet, file = "data/all_sp500_clean_logRet.csv")

Again, at this point the code is not implemented in a completely robust manner. If there were an NA on the first day of the sample, the code would try to access an element at index 0, which would produce incorrect results or an error. Furthermore, I do not explicitly deal with the case of consecutive NAs; if this case occurred, you would be warned, since the last if-statement would throw an error. For both cases one could rather easily find a workaround with for loops, but this would most likely kill performance, so I skipped any more sophisticated solution in order to keep the code simple.
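If you need to cover these corner cases as well, one possibility is to rely on zoo's na.locf instead of the manual index shift: it carries the last observation forward through arbitrarily long runs of NAs, and a second backward pass handles NAs on the first day. This is only a sketch of an alternative, not the procedure used above; note that the backward pass implies zero returns over any leading gap.

## alternative fill: last observation carried forward, then backwards
## for leading NAs (covers consecutive NAs and NAs on the first day)
filled <- na.locf(pricesNaDatesAndStocksRemoved, na.rm = FALSE)
filled <- na.locf(filled, fromLast = TRUE)
logRetAlt <- 100*diff(log(filled))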

5 Zoo object pitfalls

When removing the remaining NAs, bear in mind that zoo objects cannot be accessed by logical indexing! Hence, the following assignment does not work:

logRet[is.na(logRet)] <- 0

Instead, we need:

coredata(logRet)[is.na(logRet)] <- 0
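
To see the difference with a minimal, self-contained example (the exact behaviour of the first variant may depend on your zoo version):

library(zoo)
x <- zoo(matrix(c(1, NA, 3, 4), ncol = 2),
         order.by = as.Date("2013-01-01") + 0:1)

## x[is.na(x)] <- 0              # logical indexing of the zoo object fails
coredata(x)[is.na(x)] <- 0       # assignment via coredata() works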