Kalman Smoothing for Time Series Missing Value Imputation

Using the imputeTS package

I was recently given a task to impute some time series missing values for a prediction problem. Python has the TSFRESH package which is pretty well documented but I wanted to apply something using R. I opted for a model from statistics and control theory, called Kalman Smoothing which is available in the imputeTS package in R.

I went with smoothing over filtering since the Kalman filter takes a forward pass through the data and uses all the data upto the current time point and can be done in real time. Kalman smoothing adds a backward pass through the data and thus uses all the data. I guess it can be considered an extention to filtering.

Note: I use stock prices here only for easy time series data collection and to just apply Kalman Smoothing to a time series problem, you cannot build a trading strategy using smoothing for the reason given. You can check out a Kalman Filtering Pairs Trading Strategy here.

Download some data:

I just collect some simple data for Google and Walmart. I end up with 3 parts, the actual observations, the predicted observation and the randomly generated missing time series.

library(knitr)
library(kableExtra)
library(quantmod)
library(imputeTS)
library(Metrics)

symbols <- c("GOOG", "WMT")

myenv <- new.env()
symnames <- getSymbols(symbols, env = myenv)

li <- eapply(myenv, function(x) x) 
ts <- do.call(merge, eapply(myenv, Ad)[symnames])

actual <-as.data.frame(ts)
names(actual) <- paste(symbols, "actual", sep = "_")

Actual observations:

Table 1: Actual Observations
GOOG_actual WMT_actual
2007-01-03 232.9220 35.17886
2007-01-04 240.7277 35.34902
2007-01-05 242.6853 35.06050
2007-01-08 240.8871 34.77196
2007-01-09 241.8435 35.06050
2007-01-10 243.8161 34.97911

Here I create the 100 random missing values that we want to predcit /impute.

Create some NA values:

ts[sample(nrow(ts), 100), ] <- NA
names(ts) <- paste(symbols, "missing", sep = "_")
Table 2: Data with random missing observations
GOOG_missing WMT_missing
232.9220 35.17886
240.7277 35.34902
242.6853 35.06050
240.8871 34.77196
NA NA
243.8161 34.97911

I wrap the na_kalman into a function to apply lapply which is useful should I want to apply the model to multivariate time series data. That is you might have a data set with many explanatory variables for each time stamp and one or more variables will have missing values. The function should impute across the columns of your data frame.

I use a state space representation of an ARIMA model here but I suggest taking a look at the auto.arima model from the forecast package for the best model.There is also a model available for a strucutred model fitted by maximum likelihood. I just use a simple ARIMA model for this example.

Run the model

# Model 1:
# Kalman Smoothing missing value imputation in a state space representation of an ARIMA model
KalmanSmoothing <- function(data){
  df <- data
  arima_model <- arima(df[, ], order = c(1, 0, 1))$model
  KalmanImputation <- na_kalman(df[, ], model = arima_model)
  return(KalmanImputation)
}

results <- lapply(ts, KalmanSmoothing)
results <- as.data.frame(results)
names(results) <- paste(symbols, "predictions", sep = "_")

ts <- as.data.frame(ts)

Compute the RMSE for both time series:

I subset the first 10 predictions for each observation but we will have 100 predictions (one for each NA value we randomly generated earlier).

final <- cbind(ts, results, actual)

GOOG_error <- subset(final, is.na(GOOG_missing), select = c("GOOG_predictions", "GOOG_actual"))
Table 3: Predictions vs Observed
GOOG_predictions GOOG_actual
2007-01-09 242.2819 241.8435
2007-01-29 246.6176 245.3155
2007-03-09 226.5736 225.6343
2007-04-23 239.1593 238.6455
2007-05-03 233.3537 235.7314
2007-09-17 264.9737 261.6692
2007-10-05 296.0007 295.9158
2007-10-22 328.9113 324.1600
2007-11-01 353.2276 350.2920
2008-06-05 283.8127 292.0553

Apply the same to the Walmart predictions:

WMT_error <- subset(final, is.na(WMT_missing), select = c("WMT_predictions", "WMT_actual"))
Table 4: Predictions vs Observed
WMT_predictions WMT_actual
2007-01-09 34.88928 35.06050
2007-01-29 35.14012 35.23806
2007-03-09 35.17471 35.08268
2007-04-23 36.56807 36.37310
2007-05-03 35.89254 35.94939
2007-09-17 32.93606 32.51551
2007-10-05 33.87928 34.05422
2007-10-22 33.38332 33.96415
2007-11-01 33.55119 33.04843
2008-06-05 44.16592 45.49237
Avatar
Matthew Smith
Researcher in Dept Finance

I am a researcher with a focus on Machine Learning methods applied to economics and finance.

Related

comments powered by Disqus