Time Series Classification: Synthetic vs Real Financial Time Series
Distinguishing between real financial time series and synthetic time series using XGBoost
::Note:: This is a long post, but in it I walk through the procedure I followed for a specific time series classification task.
I was given a “Data Science” challenge as part of an interview in which I had to distinguish between real financial time series and synthetic time series. I document the results here; the data was anonymised, so I have no idea which assets were which or where the time series came from.
To cut to the chase: I obtained an in-sample test accuracy of 67% and an out-of-sample test accuracy of 65% (based on what the interviewers told me).
All I knew was that I had 12,000 real time series and 12,000 synthetically created time series, 24,000 observations in total. (Apologies for not sharing the data: it was the company's data, not mine. I have, however, uploaded the train and test data sets discussed later here, so you should be able to run the final XGBoost model.) I show the code here for methodological purposes and for anyone interested in visualising time series in R and ggplot2. The time series features used here are taken from the following papers:
- Large Scale Unusual Time Series Detection by R. Hyndman, E. Wang and N. Laptev
- Visualising forecasting algorithm performance using time series instance spaces by Y. Kang, R. Hyndman and K. Smith-Miles
You can check out my Jupyter Notebook version here.
I added a lot of notes to the code throughout the document which might be of additional interest.
A brief overview of the notebook:
Part 1 of the notebook:
- Cleans the data and puts it into a better format for analysis. The data I received had all dates, asset names etc. removed for anonymity.
- Simple plot of some returns for the Synthetic and Real financial time series.
- Box-plots of average returns and standard deviations.
- Computes the Durbin-Watson test statistic for both the Synthetic and Real time series and box-plots the results.
- Plots the 10-day rolling mean and standard deviation for a random Synthetic and a random Real time series.
- Runs the Dickey-Fuller test on both the Synthetic and Real time series.
- Runs the Jarque-Bera test for normality on both the Synthetic and Real time series.
- Draws ACF plots for both the Synthetic and Real time series.
Part 2 of the notebook:
- Creates the time series features.
- Splits the train.csv into “train” and “validation” data sets.
- Puts the data into the correct format for XGBoost.
- Sets up and searches over a parameter space to find the optimal parameters for this data set (on the train data).
- Outputs these parameters into a data frame.
- Trains the model using the optimal parameters found from the grid-search.
- Plots the feature importance scores - i.e. the most “important” variables the model found when making its predictions.
- Assigns a cut-off on the probability scores (> 0.5 assigns a 1, a real time series; <= 0.5 assigns a 0, a synthetic one).
- Computes the confusion matrix and analyses the ‘in-sample’ validation results.
Part 3 of the notebook:
- Creates the “test.csv” features just as before and saves them as “TSfeatures_test.csv”.
- Loads in the “TSfeatures_train_val.csv” and “TSfeatures_test.csv”, which were created from “train.csv” and “test.csv”.
- Sets up and runs the XGBoost model using the optimal parameters found from the cross-validation grid search in “Part 2”.
- Plots the predicted probability density plot as before in “Part 2”.
- Sets the cut-off threshold as the mean prediction score (0.465), which is close to the (0.500) score from “Part 2”.
- Saves the results as “submission.csv”.
Let's get started…
I often remove all other data in my environment beforehand and turn scientific notation off, which is what the first two lines do. The shhh command is useful in Jupyter Notebooks, which otherwise print all the package start-up warnings; wrapping library() calls in shhh suppresses these warning messages when loading the packages. (In R Markdown I can set warning = FALSE, but there is no such option in Notebooks - that I know of.)
rm(list = ls())
options(scipen=999)
setwd('C:/Users/Matt/Desktop/Data Science Challenge')
shhh <- suppressPackageStartupMessages
shhh(library(dplyr))
library(readr)
library(TSrepr)
library(ggplot2)
library(data.table)
library(cluster)
library(clusterCrit)
library(fractalrock)
library(cowplot)
library(tidyr)
library(tidyquant)
library(lmtest)
library(aTSA)
library(tsoutliers)
library(tsfeatures)
library(xgboost)
library(caret)
library(purrr)
library(knitr)      # provides kable(), used below
library(kableExtra) # provides kable_styling(), used below
library(magrittr)   # provides the %T>% tee pipe used in the features section
train_val <- read_csv("train.csv")
test <- read_csv("test.csv")
NOTE:
I have 2 data sets: train_val.csv, the training and validation data, and test.csv, the testing data. I do not touch the test.csv data until the very end in Part 3; all the analysis and optimisation is performed only on the train_val.csv data. The train_val.csv file contains 12,000 observations and the test.csv file contains another 12,000 observations.
Part 1
The data was given to me in this format:
head(train_val[, 1:5], 1)
## # A tibble: 1 x 5
## feature1 feature2 feature3 feature4 feature5
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.00629 0.00441 -0.0381 0.0253 -0.00658
The names of the columns are as follows:
colnames(train_val) %>%
data.frame() %>%
setNames(c("features")) %>%
split(as.integer(gl(nrow(.), 20, nrow(.)))) %>%
kable(caption = "Time series variables") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 12)
[Table: “Time series variables” - the 261 column names (feature1 through feature260, plus class), displayed in chunks of 20.]
There are 260 “features” in the train data along with a class variable, which is excluded from the testing data. With ~253 trading days in a year, feature1, feature2, …, featureN looked like daily time series, and from my initial observations (and plots) I believed the values to be daily returns. I first clean the data a little, since time series analysis does not do so well with feature1, feature2, …, featureN as column names. I chose a year at random and renamed the columns with the function getTradingDates (there is always an R package for everything…).
######################################################################
################# Clean the data #####################################
# Since the "features" are daily time series, I just choose a random year and rename the features to more meaningful names,
# such as "2010-01-01", "2010-01-02", "2010-01-03" instead of "feature1", "feature2", "feature3" etc.
# There's a "trading dates" package in R to get only the dates which are trading dates.
colnames(train_val) <- getTradingDates('2010-01-01', obs = 260)
colnames(train_val)[ncol(train_val)] <- "class"
colnames(test) <- getTradingDates('2010-01-01', obs = 260)
test$dataset <- "test"
train_val$dataset <- "train"
Here (if I were to do things differently) I would keep to tidy data principles and use test %>% add_column(dataset = "test") and train_val %>% add_column(dataset = "train") instead of test$dataset <- "test" and train_val$dataset <- "train". But that doesn't matter much.
How the training data looks after cleaning:
2009-01-05 | 2009-01-06 | 2009-01-07 | 2009-01-08 | 2009-01-09 |
---|---|---|---|---|
0.0062865 | 0.0044074 | -0.0380887 | 0.0252850 | -0.0065788 |
0.0008491 | 0.0025729 | 0.0013584 | -0.0054742 | -0.0098234 |
0.0142292 | -0.0252533 | -0.0100752 | -0.0319871 | -0.0065087 |
-0.0215930 | -0.0102866 | -0.0210674 | -0.0086876 | 0.0371876 |
0.0092523 | -0.0235778 | 0.0170582 | 0.0037303 | 0.0171185 |
0.0143528 | 0.0094828 | 0.0042109 | -0.0038064 | 0.0084914 |
How the testing data looks after cleaning:
2009-01-05 | 2009-01-06 | 2009-01-07 | 2009-01-08 | 2009-01-09 |
---|---|---|---|---|
0.0331039 | 0.0086225 | 0.0040622 | 0.0082554 | 0.0558741 |
0.0020681 | -0.0034293 | 0.0134305 | -0.0109182 | -0.0184851 |
0.0147834 | -0.0113800 | -0.0046055 | -0.0008757 | -0.0011536 |
-0.0094855 | 0.0113410 | -0.0213286 | 0.0033220 | -0.0111519 |
0.0381690 | -0.0037092 | -0.0010865 | -0.0062307 | 0.0232117 |
0.0004257 | -0.0042553 | 0.0029915 | 0.0017043 | 0.0012760 |
The goal was to classify which financial time series were real and which were synthetically created (by some algorithm - I have no knowledge of how the synthetic series were generated).
I re-arranged the data using the melt function in R; however, I suggest anybody reading this use the pivot_longer function from the tidyverse packages instead. The pivot_longer function was released a few weeks after I wrote the code for this problem.
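For reference, a pivot_longer equivalent of the melt() call below might look like this (a sketch, assuming the 260 renamed return columns come first, as above):
# Sketch: the tidyr::pivot_longer() equivalent of the melt() call below
df_long <- train_val %>%
  mutate(row_id = row_number()) %>%
  tidyr::pivot_longer(cols = 1:260, names_to = "variable", values_to = "value") %>%
  arrange(row_id)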
######################################################################
################# Rearrange the data #################################
# I melt the data for easier analysis, now the data is in a long format.
# "Class" corresponds to whether the asset is Synthetic or Real
# "Dataset" tells me where the data came from
# "row_id" - corresponds to a unique ID assigned to each asset both "(Synthetic & Real)"
# "Variable" is the column names of the original dataset (feature1, feature2, ... , featureN) converted to some date
# "Value" is the daily returns
df <- train_val %>%
mutate(row_id = row_number()) %>%
melt(., measure.vars = 1:260) %>%
arrange(row_id)
head(df)
## class dataset row_id variable value
## 1 0 train 1 2009-01-05 0.006286455
## 2 0 train 1 2009-01-06 0.004407363
## 3 0 train 1 2009-01-07 -0.038088652
## 4 0 train 1 2009-01-08 0.025285012
## 5 0 train 1 2009-01-09 -0.006578773
## 6 0 train 1 2009-01-12 0.005713677
dim(df)
## [1] 3120000 5
Note: I call the training data df, which in hindsight is probably bad practice; it should have been called something related to the train_val data set. Just keep in mind that df refers to the train_val data (and does not include the test.csv data).
As we can see, the data has 3,120,000 rows, which is 12,000 assets * 260 trading days. Next I plot the returns series using ggplot.
# Plot some returns - I only plot a random sample of 20 assets for each Synthetic vs Real.
ret_plot0 <- df %>%
filter(class == 0) %>%
group_by(row_id) %>%
nest() %>%
ungroup() %>%
sample_n(20) %>%
unnest() %>%
ggplot(aes(x = variable, y = value)) +
geom_line(aes(group = factor(row_id), color = factor(row_id))) +
ggtitle("Synthetic Financial Time Series") +
theme_classic() +
theme(axis.text.x = element_blank(), legend.position = "bottom", legend.title = element_blank())
ret_plot1 <- df %>%
filter(class == 1) %>%
group_by(row_id) %>%
nest() %>%
ungroup() %>%
sample_n(20) %>%
unnest() %>%
ggplot(aes(x = variable, y = value)) +
geom_line(aes(group = factor(row_id), color = factor(row_id))) +
ggtitle("Real Financial Time Series") +
theme_classic() +
theme(axis.text.x = element_blank(), legend.position = "bottom", legend.title = element_blank())
plot_grid(ret_plot0, ret_plot1)
Next I plot box-plots of the average returns and of the standard deviations.
ave_box <- df %>%
group_by(class, row_id) %>%
summarise(mean = mean(value)) %>%
ggplot(aes(x = factor(class), y = mean, color = factor(class))) +
geom_boxplot(show.legend = FALSE) +
ggtitle("Syn vs Real Average Returns") +
xlab("Class") +
ylab("Average Returns") +
theme_tq()
sd_box <- df %>%
group_by(class, row_id) %>%
summarise(sd = sd(value)) %>%
ggplot(aes(x = factor(class), y = sd, color = factor(class))) +
geom_boxplot(show.legend = FALSE) +
ggtitle("Syn vs Real Standard Deviations") +
xlab("Class") +
ylab("Standard Deviation") +
theme_tq()
plot_grid(ave_box, sd_box)
I next calculate the Durbin-Watson statistic. I mostly code using R's tidy data principles and therefore use the tidy function from the broom package to tidy up the output of the DW test a little. I do this for both the synthetic and the real time series.
# I calculate the Durbin-Watson statistic and use the "tidy()" function to summarise the key information from the calculation.
dw_test_class_zero <- df %>%
dplyr::filter(class == 0) %>%
nest(-row_id) %>%
mutate(dw_res = map(data, ~ broom::tidy(lmtest::dwtest(value ~ 1, data = .x)))) %>%
unnest(dw_res) %>%
mutate(class = "0")
dw_test_class_zero %>%
head()
## # A tibble: 6 x 7
## row_id data statistic p.value method alternative class
## <int> <list<df[,4> <dbl> <dbl> <chr> <chr> <chr>
## 1 1 [260 x 4] 1.98 0.426 Durbin-Wat~ true autocorrelation ~ 0
## 2 2 [260 x 4] 2.01 0.521 Durbin-Wat~ true autocorrelation ~ 0
## 3 4 [260 x 4] 2.08 0.747 Durbin-Wat~ true autocorrelation ~ 0
## 4 5 [260 x 4] 2.49 1.000 Durbin-Wat~ true autocorrelation ~ 0
## 5 6 [260 x 4] 1.90 0.214 Durbin-Wat~ true autocorrelation ~ 0
## 6 9 [260 x 4] 1.87 0.138 Durbin-Wat~ true autocorrelation ~ 0
# Here I do the exact same thing as above but this time for the class == 1 data.
dw_test_class_one <- df %>%
filter(class == 1) %>%
nest(-row_id) %>%
mutate(dw_res = map(data, ~ broom::tidy(lmtest::dwtest(value ~ 1, data = .x)))) %>%
unnest(dw_res) %>%
mutate(class = "1")
dw_test_class_one %>%
head()
## # A tibble: 6 x 7
## row_id data statistic p.value method alternative class
## <int> <list<df[,4> <dbl> <dbl> <chr> <chr> <chr>
## 1 3 [260 x 4] 2.08 0.728 Durbin-Wat~ true autocorrelation ~ 1
## 2 7 [260 x 4] 1.81 0.0654 Durbin-Wat~ true autocorrelation ~ 1
## 3 8 [260 x 4] 1.93 0.296 Durbin-Wat~ true autocorrelation ~ 1
## 4 13 [260 x 4] 2.05 0.644 Durbin-Wat~ true autocorrelation ~ 1
## 5 15 [260 x 4] 2.07 0.715 Durbin-Wat~ true autocorrelation ~ 1
## 6 16 [260 x 4] 2.07 0.709 Durbin-Wat~ true autocorrelation ~ 1
Next I draw a box-plot of the Durbin-Watson statistics for each class.
# I bind the rows together and plot a box-plot.
bind_rows(dw_test_class_zero, dw_test_class_one) %>%
group_by(class) %>%
ggplot(aes(x = factor(class), y = statistic, color = factor(class))) +
geom_boxplot(show.legend = FALSE) +
ggtitle("Durbin Watson Box Plot Statistics") +
xlab("Class") +
ylab("Durbin Watson") +
theme_tq()
I compute the 10-day rolling mean and standard deviation using the tq_mutate function from the tidyquant package. value corresponds to the returns of the financial time series and is plotted in blue, with the 10-day rolling mean and standard deviation plotted over the returns. (I use melt again here, but look into the pivot_longer function for a more intuitive alternative.)
# Rolling mean and standard deviations
# I only plot a random sample of 1 asset from each class, to save on memory and to make the plot more readable.
# The rolling window is 10 days.
# I use the tq_mutate functionality from the "tidyquant" package to keep things in a "tidy" format as per the "tidyverse" 'rules'.
# In the plot "value" is the returns, "mean_10" is the 10 day rolling mean and "sd_10" is the 10 day rolling standard deviation.
plot0 <- df %>%
filter(class == 0) %>%
as_tibble() %>%
group_by(row_id) %>%
nest() %>%
ungroup() %>%
sample_n(1) %>%
unnest() %>%
mutate(variable = as.Date(variable)) %>%
tq_mutate(
select = value,
mutate_fun = rollapply,
width = 10,
align = "right",
FUN = mean,
na.rm = TRUE,
col_rename = "mean_10"
) %>%
tq_mutate(
select = value,
mutate_fun = rollapply,
width = 10,
align = "right",
FUN = sd,
na.rm = TRUE,
col_rename = "sd_10"
) %>%
melt(measure.vars = 5:7) %>%
setNames(c("row_id", "class", "data set", "date", "variable", "value")) %>%
ggplot(aes(x = date)) +
geom_line(aes(y = value, colour = variable)) +
ggtitle("Synthetic Financial Time Series Rolling Mean and Standard Deviation") +
theme_classic() +
scale_colour_manual(values = c("#1f77b4", "red", "black")) +
theme(axis.text.x = element_blank(), legend.position = "bottom", legend.title = element_blank())
plot1 <- df %>%
filter(class == 1) %>%
as_tibble() %>%
group_by(row_id) %>%
nest() %>%
ungroup() %>%
sample_n(1) %>%
unnest() %>%
mutate(variable = as.Date(variable)) %>%
tq_mutate(
select = value,
mutate_fun = rollapply,
width = 10,
align = "right",
FUN = mean,
na.rm = TRUE,
col_rename = "mean_10"
) %>%
tq_mutate(
select = value,
mutate_fun = rollapply,
width = 10,
align = "right",
FUN = sd,
na.rm = TRUE,
col_rename = "sd_10"
) %>%
melt(measure.vars = 5:7) %>%
setNames(c("row_id", "class", "data set", "date", "variable", "value")) %>%
ggplot(aes(x = date)) +
geom_line(aes(y = value, colour = variable)) +
ggtitle("Real Financial Time Series Rolling Mean and Standard Deviation") +
theme_classic() +
scale_colour_manual(values = c("#1f77b4", "red", "black")) +
theme(axis.text.x = element_blank(), legend.position = "bottom", legend.title = element_blank())
plot_grid(plot0, plot1)
An important note about the code here is that I randomly sample by group; that is, I do not take a random sample of rows across all groups. Instead I group_by each time series (each of the 6,000 observations after filtering by class == 0, and likewise when I filter by class == 1), then nest() the data to collapse each asset's daily time series into a list. From here I have 6,000 rows, each with its full time series nested inside a list-column. Thus I can sample 1 of the 6,000 rows, unnest(), and obtain the complete time series of one randomly selected asset - instead of sampling randomly over all assets' daily rows (which would be completely wrong).
For example, the following commented-out code group_by()s the ID variable, nest()s the data, takes a random sample_n() of the grouped data and then unnest()s the data back to its original form, this time containing only a random sample of the IDs.
# group_by(row_id) %>%
# nest() %>%
# ungroup() %>%
# sample_n(1) %>%
# unnest() %>%
Next I compute the Dickey-Fuller test on both series for a single random observation, hence the sample_n(1) call (it's too computationally expensive to run it on all 12,000 observations).
For the synthetically created series.
# Dickey Fuller test on the 0 class
# I only randomly sample 1 of the assets for the 0 class to save on output space
df %>%
filter(class == 0) %>%
group_by(row_id) %>%
nest() %>%
ungroup() %>%
sample_n(1) %>%
unnest() %>%
nest(-row_id) %>%
mutate(adf_res = map(data, ~ adf.test(.x$value))) %>%
unnest(adf_res)
## Augmented Dickey-Fuller Test
## alternative: stationary
##
## Type 1: no drift no trend
## lag ADF p.value
## [1,] 0 -17.94 0.01
## [2,] 1 -11.75 0.01
## [3,] 2 -8.66 0.01
## [4,] 3 -7.62 0.01
## [5,] 4 -7.13 0.01
## Type 2: with drift no trend
## lag ADF p.value
## [1,] 0 -17.94 0.01
## [2,] 1 -11.76 0.01
## [3,] 2 -8.67 0.01
## [4,] 3 -7.64 0.01
## [5,] 4 -7.15 0.01
## Type 3: with drift and trend
## lag ADF p.value
## [1,] 0 -18.00 0.01
## [2,] 1 -11.83 0.01
## [3,] 2 -8.77 0.01
## [4,] 3 -7.74 0.01
## [5,] 4 -7.26 0.01
## ----
## Note: in fact, p.value = 0.01 means p.value <= 0.01
## # A tibble: 3 x 3
## row_id data adf_res
## <int> <list<df[,4]>> <named list>
## 1 7807 [260 x 4] <dbl[,3] [5 x 3]>
## 2 7807 [260 x 4] <dbl[,3] [5 x 3]>
## 3 7807 [260 x 4] <dbl[,3] [5 x 3]>
The same but on the real financial series.
# Dickey Fuller test on the 1 class
# I only randomly sample 1 of the assets for the 1 class to save on output space
df %>%
filter(class == 1) %>%
group_by(row_id) %>%
nest() %>%
ungroup() %>%
sample_n(1) %>%
unnest() %>%
nest(-row_id) %>%
mutate(adf_res = map(data, ~ adf.test(.x$value))) %>%
unnest(adf_res)
## Augmented Dickey-Fuller Test
## alternative: stationary
##
## Type 1: no drift no trend
## lag ADF p.value
## [1,] 0 -15.99 0.01
## [2,] 1 -10.71 0.01
## [3,] 2 -9.12 0.01
## [4,] 3 -8.74 0.01
## [5,] 4 -7.58 0.01
## Type 2: with drift no trend
## lag ADF p.value
## [1,] 0 -16.10 0.01
## [2,] 1 -10.84 0.01
## [3,] 2 -9.27 0.01
## [4,] 3 -8.93 0.01
## [5,] 4 -7.81 0.01
## Type 3: with drift and trend
## lag ADF p.value
## [1,] 0 -16.27 0.01
## [2,] 1 -10.99 0.01
## [3,] 2 -9.46 0.01
## [4,] 3 -9.18 0.01
## [5,] 4 -8.06 0.01
## ----
## Note: in fact, p.value = 0.01 means p.value <= 0.01
## # A tibble: 3 x 3
## row_id data adf_res
## <int> <list<df[,4]>> <named list>
## 1 10833 [260 x 4] <dbl[,3] [5 x 3]>
## 2 10833 [260 x 4] <dbl[,3] [5 x 3]>
## 3 10833 [260 x 4] <dbl[,3] [5 x 3]>
Next the Jarque-Bera tests for normality. Firstly on the synthetically created series.
# For both classes I take a random sample of 1 observation from each class (Synthetic and Real financial series)
jb_zero <- df %>%
filter(class == 0) %>%
group_by(row_id) %>%
nest() %>%
ungroup() %>%
sample_n(1) %>%
unnest() %>%
nest(-row_id) %>%
mutate(JarqueBeraTest = map(data, ~ JarqueBera.test(.x$value)))
print("Jarque-Bera Test on the 0 - Synthetic class")
## [1] "Jarque-Bera Test on the 0 - Synthetic class"
jb_zero$JarqueBeraTest
## [[1]]
##
## Jarque Bera Test
##
## data: .x$value
## X-squared = 0.3088, df = 2, p-value = 0.8569
##
##
## Skewness
##
## data: .x$value
## statistic = 0.045794, p-value = 0.7631
##
##
## Kurtosis
##
## data: .x$value
## statistic = 2.8582, p-value = 0.6406
Also on the real financial series.
jb_one <- df %>%
filter(class == 1) %>%
group_by(row_id) %>%
nest() %>%
ungroup() %>%
sample_n(1) %>%
unnest() %>%
nest(-row_id) %>%
mutate(JarqueBeraTest = map(data, ~ JarqueBera.test(.x$value)))
print("Jarque-Bera Test on the 1 - Real class")
## [1] "Jarque-Bera Test on the 1 - Real class"
jb_one$JarqueBeraTest
## [[1]]
##
## Jarque Bera Test
##
## data: .x$value
## X-squared = 25.14, df = 2, p-value = 0.000003474
##
##
## Skewness
##
## data: .x$value
## statistic = 0.084191, p-value = 0.5794
##
##
## Kurtosis
##
## data: .x$value
## statistic = 4.514, p-value = 0.0000006251
Autocorrelation plots:
I plot the autocorrelation function for a “random” sample of time series: I selected 4 row IDs and filtered the data down to them.
######################################################################
################# ACF plots ##########################################
# I only use 4 observations for these plots, 2 from the "synthetic" class and 2 from the "real" class.
df %>%
filter(row_id == 6422 | row_id == 8967 | row_id == 6080 | row_id == 5734) %>%
mutate(date = as.Date(variable)) %>%
ggplot(aes(x = date)) +
geom_line(aes(y = value), color = "red", alpha = 0.4) +
geom_hline(yintercept = 0) +
facet_wrap(~ row_id + class) +
theme_tq()
acf_data <- df %>%
dplyr::filter(row_id == 6422 | row_id == 8967 | row_id == 6080 | row_id == 5734) %>%
mutate(date = as.Date(variable))
df_acf <- acf_data %>%
group_by(row_id) %>%
summarise(list_acf = list(acf(value, plot=FALSE))) %>%
mutate(acf_vals = purrr::map(list_acf, ~as.numeric(.x$acf))) %>%
select(-list_acf) %>%
unnest() %>%
group_by(row_id) %>%
mutate(lag = row_number() - 1)
df_ci <- acf_data %>%
group_by(row_id) %>%
summarise(ci = qnorm((1 + 0.95)/2)/sqrt(n()))
ggplot(df_acf, aes(x = lag, y = acf_vals)) +
geom_bar(stat="identity", width=.05) +
geom_hline(yintercept = 0) +
geom_hline(data = df_ci, aes(yintercept = -ci), color="blue", linetype="dotted") +
geom_hline(data = df_ci, aes(yintercept = ci), color="blue", linetype="dotted") +
labs(x="Lag", y="ACF") +
facet_wrap(~ row_id) +
theme_tq()
That's enough data analysis. I could probably also fit PACF plots along with a bit more exploratory analysis (a quick sketch of what that would look like follows), but then I move on to generating the financial time series features using the tsfeatures package.
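Since the post skips the PACF, here is a minimal sketch of the analogous computation, mirroring the df_acf code above and reusing acf_data (note that pacf() lags start at 1, not 0):
# Sketch: partial autocorrelations per asset, mirroring the ACF code above
df_pacf <- acf_data %>%
  group_by(row_id) %>%
  summarise(list_pacf = list(pacf(value, plot = FALSE))) %>%
  mutate(pacf_vals = purrr::map(list_pacf, ~ as.numeric(.x$acf))) %>%
  select(-list_pacf) %>%
  unnest() %>%
  group_by(row_id) %>%
  mutate(lag = row_number())  # pacf() starts at lag 1
The same ggplot code as for the ACF (bars plus the blue confidence bounds) would then apply, with df_pacf in place of df_acf.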
What I do in the code below is take a random sample of 5 groups (using the whole data set takes too long when computing the time series features) and then apply all the functions in the tsfeatures package to each asset's data, mapping over each asset's series and computing the features.
This section takes some time to run (especially on the whole sample), so I have already saved the results as a csv, which I simply load back in as the pre-computed time series features.
################# Generate Time Series Features ######################
# I create time series features using the "tsfeatures" package. There are 40+ functions in the "tsfeatures" package
# which can generate approximately 106 time series features.
# Due to memory issues I cannot compute every feature on the full data in one go; the commented-out line below shows how to
# randomly sample a subset of the functions instead. We could also add technical indicators from the "PerformanceAnalytics" or
# "TTR" packages (I omit these here; however, creating 'functions2 <- ls("package:TTR")' and adding it to the 'summarise' command will work.)
functions <- ls("package:tsfeatures")[1:42]
# functions <- sample(functions, 20)
Stats <- df %>%
group_by(row_id, class) %>%
nest() %>%
ungroup() %>%
sample_n(5) %>%
unnest() %>%
nest(-row_id, -class) %>%
group_by(row_id, class) %T>%
{options(warn = -1)} %>%
summarise(Statistics = map(data, ~ data.frame(
bind_cols(
tsfeatures(.x$value, functions))))) %>%
unnest(Statistics)
# I saved the whole dataset as "Stats"; next I split it between training and validation.
Stats <- read.csv("C:/Users/Matt/Desktop/Data Science Challenge/TSfeatures_train_val.csv")
Note: again, bad practice by me - I just called the data frame of time series features Stats. This still refers only to the train_val.csv data and not the test.csv data.
After computing the time series features, each asset has been collapsed from ~260 daily observations down to a single row of features.
Recall that the goal here was to classify synthetic vs real time series, not to predict the next day's price. For each asset I now have a single observation, and based on this I can train a classifier to distinguish between real and synthetic time series.
How the training data looks:
[Table: the first 6 rows of the time series feature data - an X index column, row_id (1-6), class (0, 0, 1, 0, 0, 0) and the 106 tsfeatures columns, e.g. ac_9_ac_9, acf_features_x_acf1, ARCH.LM, entropy, hurst, lumpiness, spike, trend, unitroot_kpss, walker_propcross.]
## [1] 12000 109
The dimensions of the data are still 12,000 rows, now with 109 columns (the features created from the tsfeatures package). That is, we have 6,000 synthetic and 6,000 real financial time series (12,000 * ~260 = 3,120,000 daily rows, but tsfeatures collapsed each asset's ~260 days down to a single row).
I have collapsed this problem from a time series prediction problem down to a pure classification problem. Next I split the df/Stats data into a training set (75% of the observations) and an in-sample test/validation set (25% of the observations), and further into x_train, y_train, etc.
######################################################################
################# Train and XGBoost model on the TS Features #########
#Stats <- Stats %>%
# select_if(~sum(!is.na(.)) > 0)
# Split the training set up between train and a small validation set
smp_size <- floor(0.75 * nrow(Stats))
#set.seed(123)
train_ind <- sample(seq_len(nrow(Stats)), size = smp_size)
train <- Stats[train_ind, ]
val <- Stats[-train_ind, ]
# We have 106 time series features for the model to learn from.
x_train <- train %>%
ungroup() %>%
select(-class, -row_id, -X) %>%
as.matrix()
x_val <- val %>%
ungroup() %>%
select(-class, -row_id, -X) %>%
as.matrix()
y_train <- train %>%
ungroup() %>%
pull(class)
y_val <- val %>%
ungroup() %>%
pull(class)
How the training X (input variables) data looks:
[Table: the first 6 rows of x_train - the same 106 tsfeatures columns as above, for the randomly sampled training rows 6801, 4209, 11168, 5794, 8693 and 1073.]
How the training Y (response variable) data looks:
. |
---|
1 |
0 |
1 |
0 |
0 |
1 |
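As a quick sanity check on the split, the class balance can be inspected directly (a sketch; the exact counts vary with the random split, but both should be close to 50/50, consistent with the No Information Rate of ~0.502 reported later):
# Check the class balance of the train and validation splits:
table(y_train)
prop.table(table(y_val))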
I set the data up for an XGBoost model:
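One object the code below assumes but the post never constructs: dtrain and dval, the xgboost DMatrix versions of the matrices created above, which the watchlist, xgb.cv and xgb.train calls all reference. A minimal sketch of how they would be built:
# Build the DMatrix objects referenced by the later xgboost calls (a sketch):
dtrain <- xgb.DMatrix(data = x_train, label = y_train)
dval <- xgb.DMatrix(data = x_val, label = y_val)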
I create a grid search in order to search over a parameter space and locate the optimal parameters for this data set. It needs a little more work but it's a pretty good starting point; I can simply add code to the expand.grid function. That is, if I want to try more tree depths I can extend max_depth = c(5, 8, 14) with more values, such as max_depth = c(5, 8, 14, 1, 2, 3, 4, 6, 7). Note: adding values to the grid search increases computational time multiplicatively, because the grid is the Cartesian product of every parameter's list of values and the model has to evaluate all possible combinations. That is, eta = c(0.1) and max_depth = c(5) would give one iteration through the training model: eta = 0.1 mapped onto max_depth = 5. Setting eta = c(0.1, 0.3) with max_depth = c(5) maps eta = 0.1 onto max_depth = 5 and eta = 0.3 onto max_depth = 5. If I add another value, eta = c(0.1, 0.3, 0.4), then all 3 of these values are mapped onto max_depth = c(5), and adding values to max_depth multiplies the number of combinations again. Given the many parameters an XGBoost model exposes, this can drastically increase computational cost. Understanding the statistics behind Machine Learning models is therefore important, as is being aware that any greedy algorithm using gradient-descent-style optimisation can get stuck in a local minimum.
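To make the combinatorics concrete, the grid defined below tries 3 values of max_depth against 3 values of eta, i.e. 9 cross-validated runs; extending max_depth to 9 values would already mean 27 runs:
# The grid is the Cartesian product of the value vectors:
nrow(expand.grid(max_depth = c(5, 8, 14), eta = c(0.1, 0.05, 0.3))) # 9 combinations
nrow(expand.grid(max_depth = c(5, 8, 14, 1, 2, 3, 4, 6, 7),
                 eta = c(0.1, 0.05, 0.3)))                          # 27 combinations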
######################################################################
################# XGBoost Grid Search to locate Optimal Parameters ###
##############################################################################################################################
# NOTE: This section was taken from the first chapter of my PhD, where I needed to search over a parameter space to locate
# the optimal parameters - I have just adapted it for this problem of Time Series Classification.
# It's simple enough to add parameters and different values - I just optimise a few important parameters chosen from domain
# knowledge of the XGBoost model for this task, i.e. depth and eta are quite important in gradient boosting.
# 1) I create a "grid" with different parameter values or combinations of parameter values.
# 2) I apply cross-validation over the parameter space to find the optimal values for the XGBoost model.
# 3) I print the model parameters which give the best train / (in-sample test) results in a data table.
##############################################################################################################################
# Grid Search Parameters:
# 1)
searchGridSubCol <- expand.grid(subsample = c(1), #Range (0,1], default = 1, set to 0.5 will prevent overfitting
colsample_bytree = c(1), #Range (0,1], default = 1
max_depth = c(5, 8, 14), #Range (0, inf], default = 6
min_child = c(1), #Range (0, inf], default = 1
eta = c(0.1, 0.05, 0.3), #Range (0,1], default = 0.3
gamma = c(0), #Range (0, inf], default = 0
lambda = c(1), #Default = 1, L2 regularisation on weights, higher the more conservative the model
alpha = c(0), #Default = 0, L1 regularisation on weights, higher the more conservative the model
max_delta_step = c(0), #Range (0, inf], default = 0 (Helpful for logistic regression when classes are extremely imbalanced; a value of 1-10 may help control the update)
colsample_bylevel = c(1) #Range (0,1], default = 1
)
ntrees = 200
nfold <- 10 # I use nfold = 10 which is probably too many folds, 5 should be sufficient.
watchlist <- list(train = dtrain, test = dval)
# 2)
system.time(
AUCHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){
#Extract Parameters to test
currentSubsampleRate <- parameterList[["subsample"]]
currentColsampleRate <- parameterList[["colsample_bytree"]]
currentDepth <- parameterList[["max_depth"]]
currentEta <- parameterList[["eta"]]
currentMinChild <- parameterList[["min_child"]]
gamma <- parameterList[["gamma"]]
lambda <- parameterList[["lambda"]]
alpha <- parameterList[["alpha"]]
max_delta_step <- parameterList[["max_delta_step"]]
colsample_bylevel <- parameterList[["colsample_bylevel"]]
xgboostModelCV <- xgb.cv(data = dtrain,
nrounds = ntrees,
nfold = nfold,
showsd = TRUE,
metrics = c("auc", "logloss", "error"),
verbose = TRUE,
"eval_metric" = c("auc", "logloss", "error"),
"objective" = "binary:logistic", #Outputs a probability "binary:logitraw" - outputs score before logistic transformation
"max.depth" = currentDepth,
"eta" = currentEta,
"gamma" = gamma,
"lambda" = lambda,
"alpha" = alpha,
"subsample" = currentSubsampleRate,
"colsample_bytree" = currentColsampleRate,
print_every_n = 50, # print every 50 rounds to reduce the output printed.
"min_child_weight" = currentMinChild,
booster = "gbtree", #booster = "dart" #using dart can help improve accuracy.
early_stopping_rounds = 10,
watchlist = watchlist,
seed = 1234)
xvalidationScores <<- as.data.frame(xgboostModelCV$evaluation_log)
train_auc_mean <- tail(xvalidationScores$train_auc_mean, 1)
test_auc_mean <- tail(xvalidationScores$test_auc_mean, 1)
train_logloss_mean <- tail(xvalidationScores$train_logloss_mean, 1)
test_logloss_mean <- tail(xvalidationScores$test_logloss_mean, 1)
train_error_mean <- tail(xvalidationScores$train_error_mean, 1)
test_error_mean <- tail(xvalidationScores$test_error_mean, 1)
return(c(train_auc_mean, test_auc_mean, train_logloss_mean, test_logloss_mean, train_error_mean, test_error_mean, xvalidationScores, currentSubsampleRate, currentColsampleRate, currentDepth, currentEta, gamma, lambda, alpha, max_delta_step, colsample_bylevel, currentMinChild))
}))
The output of the grid search can be arranged into a nice data frame using the following code. However, I did not save this output to file and therefore cannot read it in here; you can view it in cell In [49] of the original Jupyter Notebook here.
# 3)
output <- as.data.frame(t(sapply(AUCHyperparameters, '[', c(1:6, 20:29))))
varnames <- c("TrainAUC", "TestAUC", "TrainLogloss", "TestLogloss", "TrainError", "TestError", "SubSampRate", "ColSampRate", "Depth", "eta", "gamma", "lambda", "alpha", "max_delta_step", "col_sample_bylevel", "currentMinChild")
colnames(output) <- varnames
data.table(output)
According to the results at the time, the optimal parameters were:
- ntrees = 95
- eta = 0.1
- max_depth = 5
with the other parameters left at their default settings for simplicity.
Plug the optimal parameters into the model.
#################################################################################
################# XGBoost Optimal Parameters from Cross Validation ##############
# This is the final training model, where I plug in the optimal parameters found over the grid space.
watchlist <- list("train" = dtrain)
params <- list("eta" = 0.1, "max_depth" = 5, "colsample_bytree" = 1, "min_child_weight" = 1, "subsample"= 1,
"objective"="binary:logistic", "gamma" = 1, "lambda" = 1, "alpha" = 0, "max_delta_step" = 0,
"colsample_bylevel" = 1, "eval_metric"= "auc",
"set.seed" = 176) # NB: "set.seed" is not an xgboost parameter; for reproducibility, call set.seed() before training
nround <- 95
Now that I have the optimal parameters from the cross-validation grid search, I can train the final XGBoost model on the training split of the train_val.csv data. (Whereas before, the parameters were evaluated across different folds of the data - more info on k-fold cross validation here.)
# Train the XGBoost model
xgb.model <- xgb.train(params, dtrain, nround, watchlist)
## [1] train-auc:0.700790
## [2] train-auc:0.720114
## [3] train-auc:0.735281
## [4] train-auc:0.741159
## [5] train-auc:0.748016
## [6] train-auc:0.752070
## [7] train-auc:0.754637
## [8] train-auc:0.759151
## [9] train-auc:0.762538
## [10] train-auc:0.769652
## [11] train-auc:0.776582
## [12] train-auc:0.780015
## [13] train-auc:0.782065
## [14] train-auc:0.782815
## [15] train-auc:0.788966
## [16] train-auc:0.791026
## [17] train-auc:0.793545
## [18] train-auc:0.797363
## [19] train-auc:0.799069
## [20] train-auc:0.802015
## [21] train-auc:0.802583
## [22] train-auc:0.806938
## [23] train-auc:0.808239
## [24] train-auc:0.811255
## [25] train-auc:0.813142
## [26] train-auc:0.816767
## [27] train-auc:0.817697
## [28] train-auc:0.820239
## [29] train-auc:0.821589
## [30] train-auc:0.823343
## [31] train-auc:0.823939
## [32] train-auc:0.825701
## [33] train-auc:0.827316
## [34] train-auc:0.829365
## [35] train-auc:0.832646
## [36] train-auc:0.833297
## [37] train-auc:0.837006
## [38] train-auc:0.838857
## [39] train-auc:0.839923
## [40] train-auc:0.842968
## [41] train-auc:0.844877
## [42] train-auc:0.845940
## [43] train-auc:0.846583
## [44] train-auc:0.847330
## [45] train-auc:0.848292
## [46] train-auc:0.850215
## [47] train-auc:0.851641
## [48] train-auc:0.852670
## [49] train-auc:0.854706
## [50] train-auc:0.855752
## [51] train-auc:0.856772
## [52] train-auc:0.857806
## [53] train-auc:0.860245
## [54] train-auc:0.861337
## [55] train-auc:0.864178
## [56] train-auc:0.865290
## [57] train-auc:0.865808
## [58] train-auc:0.866386
## [59] train-auc:0.867751
## [60] train-auc:0.870032
## [61] train-auc:0.870500
## [62] train-auc:0.872442
## [63] train-auc:0.873391
## [64] train-auc:0.875188
## [65] train-auc:0.877767
## [66] train-auc:0.879196
## [67] train-auc:0.880079
## [68] train-auc:0.879969
## [69] train-auc:0.880638
## [70] train-auc:0.881389
## [71] train-auc:0.882066
## [72] train-auc:0.882515
## [73] train-auc:0.883854
## [74] train-auc:0.884654
## [75] train-auc:0.885104
## [76] train-auc:0.885922
## [77] train-auc:0.887100
## [78] train-auc:0.888646
## [79] train-auc:0.889833
## [80] train-auc:0.890387
## [81] train-auc:0.891815
## [82] train-auc:0.892281
## [83] train-auc:0.894417
## [84] train-auc:0.895006
## [85] train-auc:0.897079
## [86] train-auc:0.899254
## [87] train-auc:0.901114
## [88] train-auc:0.902460
## [89] train-auc:0.902939
## [90] train-auc:0.903763
## [91] train-auc:0.903792
## [92] train-auc:0.904433
## [93] train-auc:0.904986
## [94] train-auc:0.907339
## [95] train-auc:0.907761
# Note: plot the AUC for the in-sample train / validation scores - this was a note to myself at the time of writing this R file; I never did get around to it (a sketch follows below).
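Since that plot never made it into the post, here is a minimal sketch of what it could look like, using the evaluation log that xgb.train() records for the watchlist (the column names iter and train_auc follow from the "train" watchlist name and the auc eval metric):
# Sketch: training AUC per boosting round, from the model's evaluation log
xgb.model$evaluation_log %>%
  ggplot(aes(x = iter, y = train_auc)) +
  geom_line(color = "darkblue") +
  labs(x = "Boosting iteration", y = "Train AUC", title = "Training AUC per round") +
  theme_tq()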
What is nice about tree based models is that we can obtain importance scores from the model and find which variables contributed most to the gain in the model. The original paper explains more about the gain in Algorithm 1 and Algorithm 3 here.
# We can obtain "feature" importance results from the model.
xgb.imp <- xgb.importance(model = xgb.model)
xgb.plot.importance(xgb.imp, top_n = 10)
That is, the XGBoost model found that spike was the most important variable. The spike feature comes from the stl_features function of the tsfeatures package in R: it computes various measures of trend and seasonality based on Seasonal and Trend decomposition using Loess (STL), and measures the spikiness of a time series via the variance of the leave-one-out variances of the remainder component e_t.
The second variable is also interesting and comes from the compengine feature set of the CompEngine database, which groups features into autocorrelation, prediction, stationarity, distribution and scaling.
The ARCH.LM feature comes from the arch_stat function of the tsfeatures package and is based on the Lagrange Multiplier test for Autoregressive Conditional Heteroscedasticity (ARCH) of Engle 1982.
These are just a few of the variables the XGBoost model found to be the most important. A full overview and more information about the variables used in the model can be found here.
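To see what these features look like on their own, here is a small illustration on a simulated return series (a sketch; spike is one element of the named vector that stl_features() returns):
# Sketch: computing two of the top features directly on a simulated series
set.seed(42)
x <- ts(rnorm(260, sd = 0.01))
tsfeatures::stl_features(x)["spike"] # the top variable in the importance plot
tsfeatures::arch_stat(x)             # the ARCH.LM feature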
Predictions using the in-sample test set
Now that I have trained the model using the optimal parameters, I want to see whether it scores in line with (or better than) the cross-validation phase when applied to the validation data. I use dval, the validation data set from the training split, to test the model.
# I next make the predictions on the 'in-sample' held-out test set. Originally I took the 12,000 training samples
# and split them 75% training / 25% 'in-sample' testing (9,000 training vs 3,000 in-sample testing).
# I plot the probabilities from the model - the "dashed" line is the average predicted probability.
xgb.pred <- predict(xgb.model, dval, type = 'prob')
results <- cbind(y_val, xgb.pred)
results %>%
as_tibble() %>%
ggplot(aes(x = xgb.pred)) +
geom_density(color = "darkblue", fill = "lightblue") +
geom_vline(aes(xintercept = mean(xgb.pred)),
color = "blue", linetype = "dashed", size = 1) +
geom_histogram(aes(y = ..density..), colour = "black", fill = "white", alpha = 0.1, position = "identity") +
ggtitle("Predicted probability density plot") +
theme_tq()
# The average predicted probability sits around 0.48 / 0.49, so for simplicity I will just select 0.50 as the cut-off threshold.
# That is, all observations <= 0.50 are assigned a "0" class or "synthetic" data and all observations > 0.50 are assigned a "1" or
# "real" data.
# Finally I output the confusion matrix on the 'in-sample' testing data.
results <- results %>%
as_tibble() %>%
mutate(pred = case_when(
xgb.pred > 0.5 ~ 1,
xgb.pred <= 0.5 ~ 0
))
confusionMatrix(factor(results$pred), factor(results$y_val))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1041 537
## 1 465 957
##
## Accuracy : 0.666
## 95% CI : (0.6488, 0.6829)
## No Information Rate : 0.502
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.3319
##
## Mcnemar's Test P-Value : 0.0249
##
## Sensitivity : 0.6912
## Specificity : 0.6406
## Pos Pred Value : 0.6597
## Neg Pred Value : 0.6730
## Prevalence : 0.5020
## Detection Rate : 0.3470
## Detection Prevalence : 0.5260
## Balanced Accuracy : 0.6659
##
## 'Positive' Class : 0
##
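I fixed the cut-off at 0.50 for simplicity, but a small sweep over candidate thresholds on the validation predictions (a sketch I did not run at the time) would show whether a different cut-off buys any balanced accuracy:
# Sweep cut-off thresholds on the validation predictions and track balanced accuracy.
thresholds <- seq(0.30, 0.70, by = 0.01)
bal_acc <- sapply(thresholds, function(t) {
  pred <- ifelse(xgb.pred > t, 1, 0)
  sens <- mean(pred[y_val == 1] == 1) # true positive rate for the "real" class
  spec <- mean(pred[y_val == 0] == 0) # true negative rate for the "synthetic" class
  (sens + spec) / 2
})
thresholds[which.max(bal_acc)] # best cut-off on this validation set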
A balanced accuracy score of 67% isn't so bad considering I threw the kitchen sink at the problem and that this is a time series (stock market) classification task. By kitchen sink I mean every time series function found in the tsfeatures package.
From here I end the training and validation phase. I have obtained the optimal parameters based on the training and validation data sets, and now I want to test the model on the unseen test.csv data.
I read in the test data and compute the time series features from the tsfeatures package just as I did with the training data.
test_final <- read.csv("C:/Users/Matt/Desktop/Data Science Challenge/test.csv") %>%
mutate(row_id = row_number()) %>%
melt(., measure.vars = 1:260) %>%
arrange(row_id)
This is how the test features look (the same format as the train data set):
row_id | variable | value |
---|---|---|
1 | feature1 | 0.0331039 |
1 | feature2 | 0.0086225 |
1 | feature3 | 0.0040622 |
1 | feature4 | 0.0082554 |
1 | feature5 | 0.0558741 |
1 | feature6 | -0.0061266 |
I call this test_final rather than test for no reason whatsoever - it's the same test.csv from the beginning.
Next I create the same time series features on the test data set as I did on the training data set, and save the result as TSfeatures_test.csv.
# Apply the full set of tsfeatures functions to each test series, exactly as for the training data.
test_final <- test_final %>%
  group_by(row_id) %>%
  nest(-row_id) %>%
  group_by(row_id) %T>%
  {options(warn = -1)} %>%
  summarise(Statistics = map(data, ~ data.frame(
    bind_cols(
      tsfeatures(.x$value, functions))))) %>%
  unnest(Statistics)
write.csv(test_final, "TSfeatures_test.csv")
I have now computed all the tsfeatures for both the train data set and the test data set, saved as TSfeatures_train_val.csv and TSfeatures_test.csv respectively.
Load in the train and test features data sets
I uploaded these files here.
# I have already created the features for the training data set so I can just load them right back in:
train_final <- read.csv("C:/Users/Matt/Desktop/Data Science Challenge/TSfeatures_train_val.csv")
test_final <- read.csv("C:/Users/Matt/Desktop/Data Science Challenge/TSfeatures_test.csv")
The final data for the training and test looks like:
train_final %>%
head() %>%
kable(caption = "Final training data set") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 12)
X | row_id | class | ac_9_ac_9 | acf_features_x_acf1 | acf_features_x_acf10 | acf_features_diff1_acf1 | acf_features_diff1_acf10 | acf_features_diff2_acf1 | acf_features_diff2_acf10 | ARCH.LM | autocorr_features_embed2_incircle_1 | autocorr_features_embed2_incircle_2 | autocorr_features_ac_9 | autocorr_features_firstmin_ac | autocorr_features_trev_num | autocorr_features_motiftwo_entro3 | autocorr_features_walker_propcross | binarize_mean_binarize_mean | binarize_mean_NA | compengine_embed2_incircle_1 | compengine_embed2_incircle_2 | compengine_ac_9 | compengine_firstmin_ac | compengine_trev_num | compengine_motiftwo_entro3 | compengine_walker_propcross | compengine_localsimple_mean1 | compengine_localsimple_lfitac | compengine_sampen_first | compengine_std1st_der | compengine_spreadrandomlocal_meantaul_50 | compengine_spreadrandomlocal_meantaul_ac2 | compengine_histogram_mode_10 | compengine_outlierinclude_mdrmd | compengine_fluctanal_prop_r1 | crossing_points | dist_features_histogram_mode_10 | dist_features_outlierinclude_mdrmd | embed2_incircle | entropy | firstmin_ac | firstzero_ac | flat_spots | fluctanal_prop_r1_fluctanal_prop_r1 | arch_acf | garch_acf | arch_r2 | garch_r2 | histogram_mode | alpha | beta | hurst | hw_parameters_hw_parameters | hw_parameters_NA | localsimple_taures | lumpiness | max_kl_shift | time_kl_shift | max_level_shift | time_level_shift | max_var_shift | time_var_shift | motiftwo_entro3 | nonlinearity | outlierinclude_mdrmd | x_pacf5 | diff1x_pacf5 | diff2x_pacf5 | pred_features_localsimple_mean1 | pred_features_localsimple_lfitac | pred_features_sampen_first | sampen_first_sampen_first | sampenc | scal_features_fluctanal_prop_r1 | spreadrandomlocal_meantaul | stability | station_features_std1st_der | station_features_spreadrandomlocal_meantaul_50 | station_features_spreadrandomlocal_meantaul_ac2 | std1st_der_std1st_der | nperiods | seasonal_period | trend | spike | linearity | curvature | e_acf1 | e_acf10 | trev_num | tsfeatures_frequency | tsfeatures_nperiods | tsfeatures_seasonal_period | tsfeatures_trend | tsfeatures_spike | tsfeatures_linearity | tsfeatures_curvature | tsfeatures_e_acf1 | tsfeatures_e_acf10 | tsfeatures_entropy | tsfeatures_x_acf1 | tsfeatures_x_acf10 | tsfeatures_diff1_acf1 | tsfeatures_diff1_acf10 | tsfeatures_diff2_acf1 | tsfeatures_diff2_acf10 | unitroot_kpss | unitroot_pp | walker_propcross |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 0 | -0.0675275 | 0.0097094 | 0.0526897 | -0.5005299 | 0.3297018 | -0.6772403 | 0.6124739 | 0.0627825 | 0.3929961 | 0.6147860 | -0.0675275 | 1 | 0.1208750 | 2.071663 | 0.5405405 | 1 | 1 | 0.3929961 | 0.6147860 | -0.0675275 | 1 | 0.1208750 | 2.071663 | 0.5405405 | 1 | 1 | 1.788841 | 1.408737 | 1.68 | 1.43 | -0.25 | -0.2865385 | 0.1627907 | 132 | -0.25 | -0.2865385 | 0.3929961 | 0.9840151 | 1 | 3 | 4 | 0.1627907 | 0.0652585 | 0.0154406 | 0.0627825 | 0.0253367 | -0.25 | 0.0013330 | 0.0013330 | 0.5000458 | NA | NA | 1 | 0.3556536 | 1.783636 | 103 | 1.297736 | 97 | 2.819828 | 46 | 2.071663 | 0.0752319 | -0.2865385 | 0.0108653 | 0.4457792 | 1.0525222 | 1 | 1 | 1.788841 | 1.788841 | 1.788841 | 0.1627907 | 1.76 | 0.0562693 | 1.408737 | 1.74 | 1.36 | 1.408737 | 0 | 1 | 0.0043052 | 0.0000261 | 0.8421403 | -0.7069160 | 0.0052389 | 0.0588324 | 0.1208750 | 1 | 0 | 1 | 0.0043052 | 0.0000261 | 0.8421403 | -0.7069160 | 0.0052389 | 0.0588324 | 0.9840151 | 0.0097094 | 0.0526897 | -0.5005299 | 0.3297018 | -0.6772403 | 0.6124739 | 0.0993829 | -249.7732 | 0.5405405 |
2 | 2 | 0 | -0.0421577 | -0.0075902 | 0.0387481 | -0.5171529 | 0.3129147 | -0.6727897 | 0.5379301 | 0.0558032 | 0.4285714 | 0.6563707 | -0.0421577 | 1 | -0.4765229 | 2.077581 | 0.5019305 | 1 | 1 | 0.4285714 | 0.6563707 | -0.0421577 | 1 | -0.4765229 | 2.077581 | 0.5019305 | 1 | 1 | 1.780390 | 1.419266 | 1.95 | 1.00 | 0.50 | 0.2615385 | 0.1627907 | 123 | 0.50 | 0.2615385 | 0.4285714 | 0.9864332 | 1 | 1 | 4 | 0.1627907 | 0.0664358 | 0.0657859 | 0.0558032 | 0.0554355 | 0.50 | 0.0001000 | 0.0001000 | 0.5000458 | NA | NA | 1 | 0.4636768 | 1.733008 | 247 | 1.311861 | 141 | 2.625772 | 221 | 2.077581 | 0.0273335 | 0.2615385 | 0.0256032 | 0.4606850 | 1.0171377 | 1 | 1 | 1.780390 | 1.780390 | 1.780390 | 0.1627907 | 2.05 | 0.0892206 | 1.419266 | 2.12 | 1.00 | 1.419266 | 0 | 1 | 0.0177460 | 0.0000399 | 0.9249561 | 0.7665407 | -0.0218053 | 0.0411861 | -0.4765229 | 1 | 0 | 1 | 0.0177460 | 0.0000399 | 0.9249561 | 0.7665407 | -0.0218053 | 0.0411861 | 0.9864332 | -0.0075902 | 0.0387481 | -0.5171529 | 0.3129147 | -0.6727897 | 0.5379301 | 0.0414599 | -256.0485 | 0.5019305 |
3 | 3 | 1 | 0.0099598 | -0.0405929 | 0.0449036 | -0.5026683 | 0.3471209 | -0.6718885 | 0.6109006 | 0.0325470 | 0.4671815 | 0.7065637 | 0.0099598 | 1 | -0.8755173 | 2.069233 | 0.5328185 | 1 | 0 | 0.4671815 | 0.7065637 | 0.0099598 | 1 | -0.8755173 | 2.069233 | 0.5328185 | 1 | 1 | 1.706841 | 1.443315 | 1.38 | 1.00 | -0.50 | -0.2538462 | 0.1395349 | 132 | -0.50 | -0.2538462 | 0.4671815 | 0.9868568 | 1 | 1 | 6 | 0.1395349 | 0.0388513 | 0.0039162 | 0.0325470 | 0.0041902 | -0.50 | 0.0014557 | 0.0014557 | 0.5000458 | NA | NA | 1 | 1.2670493 | 7.746711 | 95 | 1.403784 | 87 | 5.235499 | 84 | 2.069233 | 0.2436499 | -0.2538462 | 0.0223069 | 0.5356408 | 0.9954919 | 1 | 1 | 1.706841 | 1.706841 | 1.706841 | 0.1395349 | 1.42 | 0.0716499 | 1.443315 | 1.42 | 1.00 | 1.443315 | 0 | 1 | 0.0141368 | 0.0000929 | 0.8414359 | -0.0259311 | -0.0547484 | 0.0492987 | -0.8755173 | 1 | 0 | 1 | 0.0141368 | 0.0000929 | 0.8414359 | -0.0259311 | -0.0547484 | 0.0492987 | 0.9868568 | -0.0405929 | 0.0449036 | -0.5026683 | 0.3471209 | -0.6718885 | 0.6109006 | 0.0775698 | -258.1295 | 0.5328185 |
4 | 4 | 0 | -0.0428748 | -0.0443619 | 0.0615867 | -0.4571442 | 0.3184053 | -0.5906478 | 0.4361178 | 0.1275576 | 0.4555985 | 0.7027027 | -0.0428748 | 2 | -0.9943808 | 2.068744 | 0.4903475 | 0 | 0 | 0.4555985 | 0.7027027 | -0.0428748 | 2 | -0.9943808 | 2.068744 | 0.4903475 | 1 | 1 | 1.660825 | 1.445807 | 1.24 | 1.00 | 0.25 | 0.0153846 | 0.1395349 | 127 | 0.25 | 0.0153846 | 0.4555985 | 0.9790521 | 2 | 1 | 7 | 0.1395349 | 0.0694296 | 0.0112709 | 0.0579144 | 0.0123884 | 0.25 | 0.0480021 | 0.0001000 | 0.5000458 | NA | NA | 1 | 1.0068624 | 4.994753 | 132 | 1.258758 | 173 | 5.886911 | 156 | 2.068744 | 0.3840091 | 0.0153846 | 0.0503205 | 0.5402603 | 1.1070217 | 1 | 1 | 1.660825 | 1.660825 | 1.660825 | 0.1395349 | 1.10 | 0.1065111 | 1.445807 | 1.14 | 1.00 | 1.445807 | 0 | 1 | 0.0283540 | 0.0000482 | -1.2297854 | 0.2921899 | -0.0728152 | 0.0752389 | -0.9943808 | 1 | 0 | 1 | 0.0283540 | 0.0000482 | -1.2297854 | 0.2921899 | -0.0728152 | 0.0752389 | 0.9790521 | -0.0443619 | 0.0615867 | -0.4571442 | 0.3184053 | -0.5906478 | 0.4361178 | 0.2129633 | -262.0781 | 0.4903475 |
5 | 5 | 0 | 0.0259312 | -0.2447835 | 0.1469130 | -0.5810073 | 0.4796508 | -0.6799229 | 0.6232529 | 0.2014861 | 0.6563707 | 0.7992278 | 0.0259312 | 1 | -0.7167079 | 2.059764 | 0.5289575 | 1 | 0 | 0.6563707 | 0.7992278 | 0.0259312 | 1 | -0.7167079 | 2.059764 | 0.5289575 | 1 | 1 | 1.347789 | 1.580825 | 1.08 | 0.98 | -0.50 | 0.7961538 | 0.1627907 | 133 | -0.50 | 0.7961538 | 0.6563707 | 0.9723766 | 1 | 1 | 9 | 0.1627907 | 0.2718058 | 0.2229375 | 0.1765130 | 0.1330761 | -0.50 | 0.0001000 | 0.0001000 | 0.5000458 | NA | NA | 1 | 2.8846415 | 11.474426 | 80 | 1.772392 | 229 | 8.468236 | 236 | 2.059764 | 0.2143595 | 0.7961538 | 0.1008392 | 0.7538746 | 1.2926800 | 1 | 1 | 1.347789 | 1.347789 | 1.347789 | 0.1627907 | 1.08 | 0.0797924 | 1.580825 | 1.06 | 0.98 | 1.580825 | 0 | 1 | 0.0121072 | 0.0001568 | -0.5488436 | 0.2255538 | -0.2599764 | 0.1558209 | -0.7167079 | 1 | 0 | 1 | 0.0121072 | 0.0001568 | -0.5488436 | 0.2255538 | -0.2599764 | 0.1558209 | 0.9723766 | -0.2447835 | 0.1469130 | -0.5810073 | 0.4796508 | -0.6799229 | 0.6232529 | 0.1506344 | -323.5672 | 0.5289575 |
6 | 6 | 0 | -0.0761166 | 0.0468556 | 0.0858348 | -0.5253131 | 0.3438031 | -0.6901570 | 0.6130725 | 0.0432628 | 0.4352941 | 0.6627451 | -0.0761166 | 1 | 0.0898648 | 2.068914 | 0.5250965 | 1 | 1 | 0.4352941 | 0.6627451 | -0.0761166 | 1 | 0.0898648 | 2.068914 | 0.5250965 | 1 | 1 | 1.751575 | 1.381854 | 2.69 | 1.71 | -0.25 | -0.0846154 | 0.3488372 | 134 | -0.25 | -0.0846154 | 0.4352941 | 0.9806218 | 1 | 5 | 5 | 0.3488372 | 0.0500806 | 0.0502154 | 0.0627968 | 0.0620877 | -0.25 | 0.0286244 | 0.0001000 | 0.5188805 | NA | NA | 1 | 0.2189481 | 3.145763 | 141 | 1.447883 | 80 | 2.077936 | 84 | 2.068914 | 0.0137733 | -0.0846154 | 0.0172321 | 0.4345976 | 1.0881798 | 1 | 1 | 1.751575 | 1.751575 | 1.751575 | 0.3488372 | 2.61 | 0.1479673 | 1.381854 | 2.63 | 1.81 | 1.381854 | 0 | 1 | 0.0077481 | 0.0000329 | -0.5473782 | 0.4505809 | 0.0410068 | 0.0873468 | 0.0898648 | 1 | 0 | 1 | 0.0077481 | 0.0000329 | -0.5473782 | 0.4505809 | 0.0410068 | 0.0873468 | 0.9806218 | 0.0468556 | 0.0858348 | -0.5253131 | 0.3438031 | -0.6901570 | 0.6130725 | 0.0259414 | -262.3484 | 0.5250965 |
test_final %>%
head() %>%
kable(caption = "Final testing data set") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 12)
X | row_id | ac_9_ac_9 | acf_features_x_acf1 | acf_features_x_acf10 | acf_features_diff1_acf1 | acf_features_diff1_acf10 | acf_features_diff2_acf1 | acf_features_diff2_acf10 | ARCH.LM | autocorr_features_embed2_incircle_1 | autocorr_features_embed2_incircle_2 | autocorr_features_ac_9 | autocorr_features_firstmin_ac | autocorr_features_trev_num | autocorr_features_motiftwo_entro3 | autocorr_features_walker_propcross | binarize_mean_binarize_mean | binarize_mean_NA | compengine_embed2_incircle_1 | compengine_embed2_incircle_2 | compengine_ac_9 | compengine_firstmin_ac | compengine_trev_num | compengine_motiftwo_entro3 | compengine_walker_propcross | compengine_localsimple_mean1 | compengine_localsimple_lfitac | compengine_sampen_first | compengine_std1st_der | compengine_spreadrandomlocal_meantaul_50 | compengine_spreadrandomlocal_meantaul_ac2 | compengine_histogram_mode_10 | compengine_outlierinclude_mdrmd | compengine_fluctanal_prop_r1 | crossing_points | dist_features_histogram_mode_10 | dist_features_outlierinclude_mdrmd | embed2_incircle | entropy | firstmin_ac | firstzero_ac | flat_spots | fluctanal_prop_r1_fluctanal_prop_r1 | arch_acf | garch_acf | arch_r2 | garch_r2 | histogram_mode | alpha | beta | hurst | hw_parameters_hw_parameters | hw_parameters_NA | localsimple_taures | lumpiness | max_kl_shift | time_kl_shift | max_level_shift | time_level_shift | max_var_shift | time_var_shift | motiftwo_entro3 | nonlinearity | outlierinclude_mdrmd | x_pacf5 | diff1x_pacf5 | diff2x_pacf5 | pred_features_localsimple_mean1 | pred_features_localsimple_lfitac | pred_features_sampen_first | sampen_first_sampen_first | sampenc | scal_features_fluctanal_prop_r1 | spreadrandomlocal_meantaul | stability | station_features_std1st_der | station_features_spreadrandomlocal_meantaul_50 | station_features_spreadrandomlocal_meantaul_ac2 | std1st_der_std1st_der | nperiods | seasonal_period | trend | spike | linearity | curvature | e_acf1 | e_acf10 | trev_num | tsfeatures_frequency | tsfeatures_nperiods | tsfeatures_seasonal_period | tsfeatures_trend | tsfeatures_spike | tsfeatures_linearity | tsfeatures_curvature | tsfeatures_e_acf1 | tsfeatures_e_acf10 | tsfeatures_entropy | tsfeatures_x_acf1 | tsfeatures_x_acf10 | tsfeatures_diff1_acf1 | tsfeatures_diff1_acf10 | tsfeatures_diff2_acf1 | tsfeatures_diff2_acf10 | unitroot_kpss | unitroot_pp | walker_propcross |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | -0.0262073 | -0.0396281 | 0.0429784 | -0.4964245 | 0.3379915 | -0.6704837 | 0.6178088 | 0.1425744 | 0.5482625 | 0.7528958 | -0.0262073 | 1 | -0.5824739 | 2.063564 | 0.4826255 | 1 | 1 | 0.5482625 | 0.7528958 | -0.0262073 | 1 | -0.5824739 | 2.063564 | 0.4826255 | 1 | 1 | 1.383933 | 1.437946 | 1.91 | 1.00 | 0.50 | 0.4307692 | 0.1395349 | 117 | 0.50 | 0.4307692 | 0.5482625 | 0.9817288 | 1 | 1 | 7 | 0.1395349 | 0.1906443 | 0.0422059 | 0.1425744 | 0.0417531 | 0.50 | 0.0440489 | 0.0001000 | 0.5000458 | NA | NA | 1 | 1.1617874 | 4.857530 | 130 | 1.031623 | 230 | 3.967385 | 214 | 2.063564 | 0.0716802 | 0.4307692 | 0.0271516 | 0.5270423 | 0.9564642 | 1 | 1 | 1.383933 | 1.383933 | 1.383933 | 0.1395349 | 1.80 | 0.0804590 | 1.437946 | 1.89 | 1.00 | 1.437946 | 0 | 1 | 0.0355541 | 0.0000573 | -2.6210355 | -0.0981868 | -0.0740868 | 0.0651438 | -0.5824739 | 1 | 0 | 1 | 0.0355541 | 0.0000573 | -2.6210355 | -0.0981868 | -0.0740868 | 0.0651438 | 0.9817288 | -0.0396281 | 0.0429784 | -0.4964245 | 0.3379915 | -0.6704837 | 0.6178088 | 0.8820380 | -252.2509 | 0.4826255 |
2 | 2 | -0.0047799 | 0.0544155 | 0.0423445 | -0.4931653 | 0.3114689 | -0.6980787 | 0.6597427 | 0.1111625 | 0.4513619 | 0.6964981 | -0.0047799 | 3 | 0.2147570 | 2.068849 | 0.5250965 | 1 | 0 | 0.4513619 | 0.6964981 | -0.0047799 | 3 | 0.2147570 | 2.068849 | 0.5250965 | 1 | 1 | 1.611106 | 1.375120 | 2.15 | 1.40 | 0.25 | 0.1211538 | 0.1627907 | 142 | 0.25 | 0.1211538 | 0.4513619 | 0.9856808 | 3 | 3 | 6 | 0.1627907 | 0.1313081 | 0.0468159 | 0.0939769 | 0.0402163 | 0.25 | 0.0063703 | 0.0001000 | 0.5012778 | NA | NA | 1 | 0.5347516 | 6.848494 | 91 | 1.360520 | 80 | 3.586240 | 75 | 2.068849 | 0.0618461 | 0.1211538 | 0.0344415 | 0.4336405 | 0.9510320 | 1 | 1 | 1.611106 | 1.611106 | 1.611106 | 0.1627907 | 2.14 | 0.0796936 | 1.375120 | 1.82 | 1.34 | 1.375120 | 0 | 1 | 0.0216068 | 0.0000391 | 0.1351482 | -0.3430376 | 0.0339344 | 0.0578569 | 0.2147570 | 1 | 0 | 1 | 0.0216068 | 0.0000391 | 0.1351482 | -0.3430376 | 0.0339344 | 0.0578569 | 0.9856808 | 0.0544155 | 0.0423445 | -0.4931653 | 0.3114689 | -0.6980787 | 0.6597427 | 0.0722224 | -226.9463 | 0.5250965 |
3 | 3 | 0.0370364 | -0.0041963 | 0.1781209 | -0.3838557 | 0.3158431 | -0.5535087 | 0.3948373 | 0.3450202 | 0.6138996 | 0.7915058 | 0.0370364 | 2 | 2.9002534 | 2.067845 | 0.5598456 | 1 | 0 | 0.6138996 | 0.7915058 | 0.0370364 | 2 | 2.9002534 | 2.067845 | 0.5598456 | 1 | 1 | 1.436472 | 1.414575 | 1.24 | 1.00 | 0.50 | 0.7230769 | 0.1627907 | 139 | 0.50 | 0.7230769 | 0.6138996 | 0.9627133 | 2 | 1 | 6 | 0.1627907 | 0.4731295 | 0.0342727 | 0.2247245 | 0.0323111 | 0.50 | 0.0001000 | 0.0001000 | 0.5000458 | NA | NA | 1 | 3.9022555 | 33.656077 | 240 | 1.695947 | 222 | 9.122984 | 232 | 2.067845 | 0.7040489 | 0.7230769 | 0.0685939 | 0.5171369 | 1.0433489 | 1 | 1 | 1.436472 | 1.436472 | 1.436472 | 0.1627907 | 1.39 | 0.1088905 | 1.414575 | 1.43 | 1.00 | 1.414575 | 0 | 1 | 0.0058644 | 0.0001243 | -1.1897947 | -0.4762066 | -0.0084531 | 0.1814633 | 2.9002534 | 1 | 0 | 1 | 0.0058644 | 0.0001243 | -1.1897947 | -0.4762066 | -0.0084531 | 0.1814633 | 0.9627133 | -0.0041963 | 0.1781209 | -0.3838557 | 0.3158431 | -0.5535087 | 0.3948373 | 0.1757311 | -235.0780 | 0.5598456 |
4 | 4 | -0.0576029 | -0.0338906 | 0.0251717 | -0.4963752 | 0.2570591 | -0.6694337 | 0.4910006 | 0.0471296 | 0.3899614 | 0.6332046 | -0.0576029 | 3 | -0.1053821 | 2.075447 | 0.5366795 | 0 | 1 | 0.3899614 | 0.6332046 | -0.0576029 | 3 | -0.1053821 | 2.075447 | 0.5366795 | 1 | 1 | 1.785628 | 1.436827 | 1.52 | 1.00 | -0.25 | 0.0769231 | 0.1860465 | 137 | -0.25 | 0.0769231 | 0.3899614 | 0.9886539 | 3 | 1 | 3 | 0.1860465 | 0.0511246 | 0.0516446 | 0.0471296 | 0.0470911 | -0.25 | 0.0025845 | 0.0025845 | 0.5000458 | NA | NA | 1 | 0.2161135 | 2.534373 | 34 | 1.404765 | 154 | 2.213233 | 205 | 2.075447 | 0.0681473 | 0.0769231 | 0.0179401 | 0.4720756 | 0.9626432 | 1 | 1 | 1.785628 | 1.785628 | 1.785628 | 0.1860465 | 1.44 | 0.0499953 | 1.436827 | 1.42 | 1.00 | 1.436827 | 0 | 1 | 0.0042080 | 0.0000286 | 0.9969942 | 0.1863847 | -0.0370368 | 0.0269840 | -0.1053821 | 1 | 0 | 1 | 0.0042080 | 0.0000286 | 0.9969942 | 0.1863847 | -0.0370368 | 0.0269840 | 0.9886539 | -0.0338906 | 0.0251717 | -0.4963752 | 0.2570591 | -0.6694337 | 0.4910006 | 0.0860264 | -241.6752 | 0.5366795 |
5 | 5 | -0.1236994 | 0.0086381 | 0.0308039 | -0.5025363 | 0.3330186 | -0.6693011 | 0.5835466 | 0.1157603 | 0.4202335 | 0.7003891 | -0.1236994 | 1 | -0.0489352 | 2.058889 | 0.4864865 | 1 | 0 | 0.4202335 | 0.7003891 | -0.1236994 | 1 | -0.0489352 | 2.058889 | 0.4864865 | 1 | 1 | 1.722492 | 1.396172 | 1.69 | 1.32 | -0.50 | -0.0076923 | 0.8139535 | 120 | -0.50 | -0.0076923 | 0.4202335 | 0.9908616 | 1 | 3 | 6 | 0.8139535 | 0.0537820 | 0.0583484 | 0.1157603 | 0.1120523 | -0.50 | 0.0001609 | 0.0001609 | 0.5090878 | NA | NA | 1 | 0.6488028 | 3.045684 | 97 | 1.287940 | 14 | 4.338131 | 240 | 2.058889 | 0.0094165 | -0.0076923 | 0.0059114 | 0.4457371 | 0.9190563 | 1 | 1 | 1.722492 | 1.722492 | 1.722492 | 0.8139535 | 1.63 | 0.1107442 | 1.396172 | 1.75 | 1.35 | 1.396172 | 0 | 1 | 0.0229286 | 0.0000550 | -0.6149100 | 0.2128084 | -0.0125452 | 0.0317617 | -0.0489352 | 1 | 0 | 1 | 0.0229286 | 0.0000550 | -0.6149100 | 0.2128084 | -0.0125452 | 0.0317617 | 0.9908616 | 0.0086381 | 0.0308039 | -0.5025363 | 0.3330186 | -0.6693011 | 0.5835466 | 0.1169027 | -266.1451 | 0.4864865 |
6 | 6 | 0.0137566 | -0.0889224 | 0.0668615 | -0.5649436 | 0.4404459 | -0.7097820 | 0.7128451 | 0.0752299 | 0.5366795 | 0.6447876 | 0.0137566 | 1 | 0.3033072 | 2.064104 | 0.5328185 | 1 | 0 | 0.5366795 | 0.6447876 | 0.0137566 | 1 | 0.3033072 | 2.064104 | 0.5328185 | 1 | 1 | 1.464977 | 1.477767 | 1.53 | 1.00 | 0.25 | 0.3269231 | 0.1627907 | 136 | 0.25 | 0.3269231 | 0.5366795 | 0.9835850 | 1 | 1 | 6 | 0.1627907 | 0.1033936 | 0.0236197 | 0.0740159 | 0.0248339 | 0.25 | 0.0001000 | 0.0001000 | 0.5000458 | NA | NA | 1 | 0.7510236 | 12.688453 | 197 | 1.217490 | 189 | 2.987989 | 194 | 2.064104 | 0.0649001 | 0.3269231 | 0.0200688 | 0.5201834 | 1.0761503 | 1 | 1 | 1.464977 | 1.464977 | 1.464977 | 0.1627907 | 1.35 | 0.0814814 | 1.477767 | 1.36 | 1.00 | 1.477767 | 0 | 1 | 0.0081147 | 0.0000469 | 0.6555116 | -0.0489727 | -0.0976177 | 0.0700199 | 0.3033072 | 1 | 0 | 1 | 0.0081147 | 0.0000469 | 0.6555116 | -0.0489727 | -0.0976177 | 0.0700199 | 0.9835850 | -0.0889224 | 0.0668615 | -0.5649436 | 0.4404459 | -0.7097820 | 0.7128451 | 0.0869913 | -279.8920 | 0.5328185 |
Finally, we can run the final model on the held-out test set and obtain our predictions based on the training data and the optimal parameters.
# Use the optimal parameters found previously and run the final training model (to make predictions on the out-of-sample test data).
x_train_final <- train_final %>%
ungroup() %>%
select(-class, -row_id, -X) %>%
as.matrix()
x_test_final <- test_final %>%
ungroup() %>%
select(-row_id, -X) %>%
as.matrix()
y_train_final <- train_final %>%
ungroup() %>%
pull(class)
dtrain_final <- xgb.DMatrix(data = as.matrix(x_train_final), label = y_train_final, missing = "NaN")
dtest_final <- xgb.DMatrix(data = as.matrix(x_test_final), missing = "NaN")
watchlist <- list("train" = dtrain_final)
set.seed(176) # seed set directly - "set.seed" is not an xgboost parameter, so passing it inside params has no effect
params <- list("eta" = 0.1, "max_depth" = 5, "colsample_bytree" = 1, "min_child_weight" = 1, "subsample" = 1,
               "objective" = "binary:logistic", "gamma" = 1, "lambda" = 1, "alpha" = 0, "max_delta_step" = 0,
               "colsample_bylevel" = 1, "eval_metric" = "auc")
nround <- 95
xgb.model_final <- xgb.train(params, dtrain_final, nround, watchlist)
## [1] train-auc:0.708604
## [2] train-auc:0.721700
## [3] train-auc:0.723230
## [4] train-auc:0.729888
## [5] train-auc:0.735542
## [6] train-auc:0.738081
## [7] train-auc:0.740926
## [8] train-auc:0.744105
## [9] train-auc:0.746320
## [10] train-auc:0.748644
## [11] train-auc:0.754211
## [12] train-auc:0.756892
## [13] train-auc:0.761524
## [14] train-auc:0.763882
## [15] train-auc:0.767216
## [16] train-auc:0.772009
## [17] train-auc:0.772943
## [18] train-auc:0.774261
## [19] train-auc:0.775471
## [20] train-auc:0.777801
## [21] train-auc:0.780629
## [22] train-auc:0.784384
## [23] train-auc:0.787112
## [24] train-auc:0.788946
## [25] train-auc:0.791835
## [26] train-auc:0.793142
## [27] train-auc:0.795289
## [28] train-auc:0.798502
## [29] train-auc:0.799893
## [30] train-auc:0.802186
## [31] train-auc:0.804981
## [32] train-auc:0.805649
## [33] train-auc:0.807120
## [34] train-auc:0.809020
## [35] train-auc:0.810318
## [36] train-auc:0.812637
## [37] train-auc:0.814760
## [38] train-auc:0.816024
## [39] train-auc:0.817956
## [40] train-auc:0.819350
## [41] train-auc:0.821653
## [42] train-auc:0.822729
## [43] train-auc:0.824029
## [44] train-auc:0.824765
## [45] train-auc:0.826924
## [46] train-auc:0.827804
## [47] train-auc:0.828475
## [48] train-auc:0.831018
## [49] train-auc:0.832247
## [50] train-auc:0.833265
## [51] train-auc:0.834168
## [52] train-auc:0.835535
## [53] train-auc:0.836093
## [54] train-auc:0.837008
## [55] train-auc:0.837715
## [56] train-auc:0.839537
## [57] train-auc:0.840310
## [58] train-auc:0.841701
## [59] train-auc:0.842480
## [60] train-auc:0.843106
## [61] train-auc:0.844495
## [62] train-auc:0.845348
## [63] train-auc:0.845932
## [64] train-auc:0.847843
## [65] train-auc:0.849445
## [66] train-auc:0.850345
## [67] train-auc:0.851337
## [68] train-auc:0.852121
## [69] train-auc:0.852663
## [70] train-auc:0.854132
## [71] train-auc:0.855949
## [72] train-auc:0.856758
## [73] train-auc:0.857115
## [74] train-auc:0.857954
## [75] train-auc:0.858849
## [76] train-auc:0.859527
## [77] train-auc:0.859917
## [78] train-auc:0.860590
## [79] train-auc:0.861264
## [80] train-auc:0.862359
## [81] train-auc:0.863101
## [82] train-auc:0.863794
## [83] train-auc:0.864911
## [84] train-auc:0.866293
## [85] train-auc:0.866976
## [86] train-auc:0.867436
## [87] train-auc:0.869036
## [88] train-auc:0.869469
## [89] train-auc:0.869931
## [90] train-auc:0.870681
## [91] train-auc:0.872326
## [92] train-auc:0.873706
## [93] train-auc:0.875704
## [94] train-auc:0.876178
## [95] train-auc:0.876789
I make the final predictions based on the test.csv data. The predict function in R is great: it dispatches on whatever model it is given, so we just need to provide the testing data along with the model. I "ask" for probability scores from the predictions and also plot the density of the predicted probabilities.
# Make the final predictions on the 'test.csv' data and plot the probability density function.
xgb.pred_final <- predict(xgb.model_final, dtest_final, type = 'prob')
xgb.pred_final %>%
as_tibble() %>%
setNames(c("Prediction")) %>%
ggplot(aes(x = Prediction)) +
geom_density(color = "darkblue", fill = "lightblue") +
geom_vline(aes(xintercept = mean(Prediction)),
color = "blue", linetype = "dashed", size = 1) +
geom_histogram(aes(y = ..density..), colour = "black", fill = "white", alpha = 0.1, position = "identity") +
ggtitle("(Out of sample) - Predicted probability density plot") +
theme_tq()
Finally! I make the submission file based on the predicted probabilities.
# Convert the probabilities into a binary class of 0 or 1 by a decision threshold of 0.465.
# Write the predictions to "submission.csv"
xgb.pred_final %>%
as_tibble() %>%
setNames(c("Prediction")) %>%
summarise(mean = mean(Prediction))
## # A tibble: 1 x 1
## mean
## <dbl>
## 1 0.465
xgb.pred_final %>%
as_tibble() %>%
setNames(c("Prediction")) %>%
mutate(pred = case_when(
Prediction > 0.465 ~ 1,
Prediction <= 0.465 ~ 0
)) %>%
write.csv("submission.csv")
I made the following final remark in the Jupyter Notebook I sent as part of the interview process:
Quote:: Final footnote: Hopefully the out-of-sample predictions will obtain a 67% accuracy (the predictions in the “submission.csv” file).
I was told, after I sent my scores as part of the interview process, how the scores were evaluated (translated here from the original Spanish):
*So that you know how the scoring works:
A result between 0.4 and 0.6 is considered random.
From 0.6 upwards the algorithm classifies correctly, and above 0.7 the algorithm is great.
Below 0.4 it is able to tell synthetic series from real ones, but the labels are swapped.*
I was informed that, based on the held-out test set, I obtained a result of 0.649636, i.e. roughly 65% accuracy (a little lower than the 67% on my in-sample test set!) but still consistent with the methodology being sound (i.e. no test data leaking into the training data), especially given that I simply threw the time series features book/kitchen sink at the problem. Further reading into time series features would strengthen this classification work and could well improve the prediction accuracy. Recall that my feature selection consisted of applying every feature in the tsfeatures package, using functions <- ls("package:tsfeatures")[1:42] and then mapping over the data with summarise(Statistics = map(data, ~ data.frame(bind_cols(tsfeatures(.x$value, functions))))). So there is plenty of room for improvement in the current model.
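One cheap improvement along those lines (a sketch, not something I ran for the challenge) would be to retrain on only the strongest features by gain, assuming the feature names in the importance table line up with the columns of x_train_final:
# Keep the top 30 features by gain from the importance table computed earlier
# and retrain the final model on that reduced matrix.
top_feats <- xgb.imp$Feature[1:30]
dtrain_top <- xgb.DMatrix(data = x_train_final[, top_feats],
                          label = y_train_final, missing = "NaN")
xgb.model_top <- xgb.train(params, dtrain_top, nround, list("train" = dtrain_top))
# Predictions would then use dtest_final restricted to the same columns.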
Conclusion: a combination of time series feature selection and classification models can do pretty well on time series classification problems such as the one I faced here.
Any errors are my own!