# Kalman Smoothing for Time Series Missing Value Imputation

Using the imputeTS package

I was recently given the task of imputing missing values in a time series for a prediction problem. Python has the `TSFRESH` package, which is pretty well documented, but I wanted to apply something in R. I opted for a model from statistics and control theory called Kalman smoothing, which is available in the `imputeTS` package in R.

I went with smoothing over filtering: the Kalman filter takes a forward pass through the data, uses only the data up to the current time point, and can be run in real time. Kalman smoothing **adds** a backward pass and thus uses all the data; it can be considered an extension of filtering.
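The distinction can be sketched in base R, which exposes both passes for an ARIMA state space model: `stats::KalmanRun` runs the forward filter and `stats::KalmanSmooth` the smoother. The simulated series and AR(1) order below are arbitrary illustrations, not part of the example that follows:

```
set.seed(1)
y <- as.ts(arima.sim(model = list(ar = 0.7), n = 100))

# State space form of a fitted AR(1), as used by the Kalman routines
mod <- arima(y, order = c(1, 0, 0))$model

filtered <- KalmanRun(y, mod)$states     # forward pass only
smoothed <- KalmanSmooth(y, mod)$smooth  # forward + backward pass
```

Both objects hold one state estimate per time point; the smoothed estimates additionally condition on the data after each time point.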

**Note:** I use stock prices here only because they are easy time series data to collect; the goal is just to apply Kalman smoothing to a time series problem. You cannot build a trading strategy on smoothed values, since smoothing looks at future data. For a real-time approach, check out a Kalman filtering pairs trading strategy instead.

#### Download some data:

I just collect some simple daily price data for `Google` and `Walmart`. I end up with three parts: the actual observations, the predicted observations, and the randomly generated *missing* time series.

```
library(knitr)
library(kableExtra)
library(quantmod)
library(imputeTS)
library(Metrics)

# Download daily price data for Google and Walmart
symbols <- c("GOOG", "WMT")
myenv <- new.env()
symnames <- getSymbols(symbols, env = myenv)

# Merge the adjusted-close columns into one multivariate series
ts <- do.call(merge, eapply(myenv, Ad)[symnames])
actual <- as.data.frame(ts)
names(actual) <- paste(symbols, "actual", sep = "_")
```

#### Actual observations:

Date | GOOG_actual | WMT_actual
---|---|---
2007-01-03 | 232.9220 | 35.17886
2007-01-04 | 240.7277 | 35.34902
2007-01-05 | 242.6853 | 35.06050
2007-01-08 | 240.8871 | 34.77196
2007-01-09 | 241.8435 | 35.06050
2007-01-10 | 243.8161 | 34.97911

Here I create the 100 random missing values in each series that we want to predict/impute.

#### Create some NA values:

```
# Blank out 100 randomly chosen rows (both series go missing on those dates)
ts[sample(nrow(ts), 100), ] <- NA
names(ts) <- paste(symbols, "missing", sep = "_")
```

GOOG_missing | WMT_missing
---|---
232.9220 | 35.17886
240.7277 | 35.34902
242.6853 | 35.06050
240.8871 | 34.77196
NA | NA
243.8161 | 34.97911
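One thing to note: `sample(nrow(ts), 100)` blanks whole rows, so both series lose values on the same dates (as the all-NA row above shows). A minimal variation, shown on a stand-in matrix rather than the price data, draws the NA positions independently per column:

```
set.seed(42)
m <- matrix(rnorm(200), ncol = 2)  # stand-in for the two-column price matrix
for (j in seq_len(ncol(m))) {
  m[sample(nrow(m), 10), j] <- NA  # 10 NAs per column, at different rows
}
```

Since `sample` draws without replacement, each column ends up with exactly 10 missing values.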

I wrap `na_kalman` in a function so I can apply it with `lapply`, which is useful should I want to apply the model to multivariate time series data. That is, you might have a data set with many explanatory variables per time stamp, where one or more variables have missing values; the function then imputes across the columns of your data frame.

I use a state space representation of an ARIMA model here, but I suggest taking a look at the `auto.arima` function from the `forecast` package to find the best model. There is also an option for a structural model fitted by maximum likelihood. I just use a simple ARIMA model for this example.
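If you don't want to fit the ARIMA yourself, `na_kalman` also accepts a model name directly. A quick sketch on `tsAirgap`, the example series with missing values that ships with `imputeTS`:

```
library(imputeTS)

# "auto.arima" has imputeTS pick the ARIMA order via the forecast package;
# "StructTS" (the default) uses a structural model fitted by maximum likelihood
imp_arima  <- na_kalman(tsAirgap, model = "auto.arima")
imp_struct <- na_kalman(tsAirgap, model = "StructTS")
```

Both calls return the series with every NA replaced by its smoothed estimate.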

#### Run the model

```
# Model 1: Kalman smoothing missing value imputation in a
# state space representation of an ARIMA model
KalmanSmoothing <- function(data) {
  # Fit an ARIMA(1,0,1) and pass its state space form to na_kalman
  arima_model <- arima(data, order = c(1, 0, 1))$model
  na_kalman(data, model = arima_model)
}

# Impute each column, then collect the results into a data frame
results <- lapply(ts, KalmanSmoothing)
results <- as.data.frame(results)
names(results) <- paste(symbols, "predictions", sep = "_")
ts <- as.data.frame(ts)
```

#### Compute the RMSE for both time series:

I show the first 10 predictions for each series below, but there are `100` predictions in total (one for each NA value we randomly generated earlier).

```
final <- cbind(ts, results, actual)
GOOG_error <- subset(final, is.na(GOOG_missing), select = c("GOOG_predictions", "GOOG_actual"))
```

Date | GOOG_predictions | GOOG_actual
---|---|---
2007-01-09 | 242.2819 | 241.8435
2007-01-29 | 246.6176 | 245.3155
2007-03-09 | 226.5736 | 225.6343
2007-04-23 | 239.1593 | 238.6455
2007-05-03 | 233.3537 | 235.7314
2007-09-17 | 264.9737 | 261.6692
2007-10-05 | 296.0007 | 295.9158
2007-10-22 | 328.9113 | 324.1600
2007-11-01 | 353.2276 | 350.2920
2008-06-05 | 283.8127 | 292.0553

#### Print the RMSE for Google

`rmse(GOOG_error$GOOG_actual, GOOG_error$GOOG_predictions)`

`## [1] 7.889578`
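For reference, the RMSE reported by `Metrics::rmse` is just the square root of the mean squared difference between actual and predicted values; a hand-rolled equivalent:

```
# RMSE from first principles: sqrt of the mean squared error
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}
rmse_manual(c(1, 2, 3), c(1, 2, 5))  # sqrt(4/3), about 1.1547
```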

#### Apply the same to the Walmart predictions:

`WMT_error <- subset(final, is.na(WMT_missing), select = c("WMT_predictions", "WMT_actual"))`

Date | WMT_predictions | WMT_actual
---|---|---
2007-01-09 | 34.88928 | 35.06050
2007-01-29 | 35.14012 | 35.23806
2007-03-09 | 35.17471 | 35.08268
2007-04-23 | 36.56807 | 36.37310
2007-05-03 | 35.89254 | 35.94939
2007-09-17 | 32.93606 | 32.51551
2007-10-05 | 33.87928 | 34.05422
2007-10-22 | 33.38332 | 33.96415
2007-11-01 | 33.55119 | 33.04843
2008-06-05 | 44.16592 | 45.49237

#### Print the RMSE for Walmart

`rmse(WMT_error$WMT_actual, WMT_error$WMT_predictions)`

`## [1] 0.6258724`