# Time Series Classification: Synthetic vs Real Financial Time Series

Distinguishing between real financial time series and synthetic time series using XGBoost

Note: This is a long post in which I walk through the procedure I followed for a specific time series classification task.

I was given a “Data Science” challenge as part of an interview in which I had to distinguish between real financial time series and synthetic time series. I document the results here. The data was anonymous, so I have no idea which assets were which or which time series the assets came from.

To give away the conclusion: I obtained an in-sample test accuracy of 67% and an out-of-sample test accuracy of 65% (based on what the interviewers told me).

All I knew was that I had 12,000 real time series and 12,000 synthetically created time series, 24,000 observations in total. (Apologies for not sharing the raw data, but it was the company's data and not mine. I have uploaded the train and test data sets discussed later here, where you should be able to run the final XGBoost model.) I show the code here for methodological purposes and in case you are interested in visualising time series in R with ggplot2. The time series features used here are taken from the following papers:

• Large Scale Unusual Time Series Detection by R. Hyndman, E. Wang and N. Laptev
• Visualising forecasting algorithm performance using time series instance spaces by Y. Kang, R. Hyndman and K. Smith-Miles

You can check out my Jupyter Notebook version here.

I added a lot of notes to the code throughout the document which might be of additional interest.

### Part 1 of the notebook:

• Cleans the data and puts it into a better format for analysis. The data I received had all dates, asset names etc. removed for anonymity.
• Simple plot of some returns for the Synthetic and Real financial time series.
• Box-plots of average returns and standard deviations.
• Computes the Durbin-Watson test statistics for both Synthetic and Real time series and box-plots.
• Plot the 10 day rolling mean and standard deviations for a random time series for Synthetic and real data.
• Dickey Fuller test on both the Synthetic and real time series.
• Jarque-Bera Test For Normality on the Synthetic and real time series.
• ACF Plots for both the Synthetic and real time series.

### Part 2 of the notebook:

• Creates the time series features.
• Splits the train.csv into “train” and “validation” data sets.
• Puts the data into the correct format for XGBoost.
• Sets up and searches over a parameter space to find the best parameters for this data set (on the train data).
• Outputs these parameters into a data frame.
• Train the model using the optimal parameters found from the grid-search.
• Plot the feature importance scores - i.e. the most “important” variables that the model found when making its predictions.
• Assign a cut-off on the probability scores (> 0.5 then assign a 1 - real time series, <= 0.5 then assign a 0 for Synthetic).
• Compute the Confusion Matrix and analyse the ‘in-sample’ validation results.
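The last two steps of this list can be sketched in a few lines of base R. This is only an illustration under made-up numbers: `pred_prob`, `actual` and `cutoff` are names introduced here, not objects from the notebook, and the probabilities do not come from the XGBoost model.

```r
# Toy predicted probabilities and true labels (made up for illustration).
pred_prob <- c(0.81, 0.35, 0.62, 0.48, 0.15, 0.55, 0.50, 0.72)
actual    <- c(1,    0,    1,    1,    0,    0,    0,    1)

# Cut-off rule: > 0.5 -> 1 (real), <= 0.5 -> 0 (synthetic).
cutoff     <- 0.5
pred_class <- as.integer(pred_prob > cutoff)

# Confusion matrix: rows are predictions, columns are the true classes.
conf_mat <- table(
  predicted = factor(pred_class, levels = c(0, 1)),
  actual    = factor(actual,     levels = c(0, 1))
)
print(conf_mat)

# Accuracy is the share of observations on the diagonal.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```

Fixing the factor levels keeps the table 2 x 2 even if one class is never predicted.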

### Part 3 of the notebook:

• Create the “test.csv” features just as before and save as “TSfeatures_test.csv”.
• Load in the “TSfeatures_train_val.csv” and “TSfeatures_test.csv” which were created from “train.csv” and “test.csv”.
• Set up and run the XGBoost model using the optimal parameters found from the cross-validation grid search in “Part 2”.
• Plot the predicted probability density plot as before in “Part 2”.
• Set the cut-off threshold as the mean prediction score (0.465) which is close to the (0.500) score from “Part 2”.
• Save the results as “submission.csv”.
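As a rough sketch of the last two steps, assuming made-up probabilities: the cut-off is the mean of the predicted scores and the labels are written to a CSV file. The column names `row_id` and `class` and the temporary output path are assumptions for illustration; the layout the interviewers expected is not shown in this post.

```r
# Made-up test-set probabilities; in the post these come from the XGBoost model.
test_prob <- c(0.21, 0.64, 0.47, 0.52, 0.38, 0.71, 0.33, 0.42)

# Use the mean predicted score as the cut-off instead of a fixed 0.5.
cutoff     <- mean(test_prob)
pred_class <- as.integer(test_prob > cutoff)

# Hypothetical submission layout: one row per test series.
submission <- data.frame(
  row_id = seq_along(test_prob),
  class  = pred_class
)

# Written to a temporary directory so the sketch has no side effects.
out_file <- file.path(tempdir(), "submission.csv")
write.csv(submission, out_file, row.names = FALSE)
print(submission)
```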

Let's get started…

I often remove all other data from my environment beforehand and turn scientific notation off, which is what the first two lines do. The shhh command is useful in Jupyter Notebooks, which output all the warning messages: wrapping a library call in shhh suppresses these messages when loading the packages. (In R Markdown I can set warning = FALSE, but there is no such option in Notebooks that I know of.)

rm(list = ls())
options(scipen=999)
setwd('C:/Users/Matt/Desktop/Data Science Challenge')
shhh <- suppressPackageStartupMessages

shhh(library(dplyr))
library(TSrepr)
library(ggplot2)
library(data.table)
library(cluster)
library(clusterCrit)
library(fractalrock)
library(cowplot)
library(tidyr)
library(tidyquant)
library(lmtest)
library(aTSA)
library(tsoutliers)
library(tsfeatures)
library(xgboost)
library(caret)
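library(purrr)
library(readr)      # read_csv()
library(magrittr)   # %T>% tee pipe used further down
library(knitr)      # kable()
library(kableExtra) # kable_styling()

# File name for the training/validation data assumed from the note below.
train_val <- read_csv("train_val.csv")
test <- read_csv("test.csv")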

### NOTE:

I have 2 data sets: train_val.csv, used for training and validation, and test.csv. I do not touch test.csv until the very end, in Part 3. All the analysis and optimisation is performed only on train_val.csv. Both train_val.csv and test.csv contain 12,000 observations.

### Part 1

The data was given to me in this format:

head(train_val[, 1:5], 1)
## # A tibble: 1 x 5
##   feature1 feature2 feature3 feature4 feature5
##      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1  0.00629  0.00441  -0.0381   0.0253 -0.00658

The names of the columns are as follows:

colnames(train_val) %>%
data.frame() %>%
setNames(c("features")) %>%
split(as.integer(gl(nrow(.), 20, nrow(.)))) %>%
kable(caption = "Time series variables") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 12)
Table 1: Time series variables
features
feature1, feature2, feature3, …, feature259, feature260, class

There are 260 “features” in the train data along with a class variable, which is excluded from the testing data. With ~253 trading days in a year, I took feature1, feature2, … featureN to be a daily time series. From my initial observation (and plots) I believed this data to be “returns” data. I first clean the data a little, since time series tools do not work so well with feature1, feature2, … featureN as column names. I chose a year at random and renamed the columns using the function getTradingDates (there is always an R package for everything…).

######################################################################
################# Clean the data #####################################

# Since the "features" are daily time series, I just choose a random year and rename the features into more meaningful names
# Such as "2010-01-01", "2010-01-02", "2010-01-03" instead of "feature1", "feature2", "feature3" etc.
# There's a "trading dates" package in R to get only the dates which are trading dates.
colnames(train_val) <- getTradingDates('2010-01-01', obs = 260)
colnames(train_val)[ncol(train_val)] <- "class"
colnames(test) <- getTradingDates('2010-01-01', obs = 260)
test$dataset <- "test"
train_val$dataset <- "train"
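For readers without the package that provides getTradingDates, the renaming step can be approximated in base R. The sketch below is a stand-in, not that function: `last_weekdays` is a hypothetical helper that simply takes the last `n_obs` weekdays up to an end date and ignores public holidays, and `toy` is a made-up data frame.

```r
# Stand-in for getTradingDates(): the last `n_obs` weekdays up to `end_date`.
# Assumption: weekends are skipped but public holidays are not.
last_weekdays <- function(end_date, n_obs) {
  end_date <- as.Date(end_date)
  # Two calendar days per required weekday is always enough to look back.
  candidates <- seq(end_date - 2 * n_obs, end_date, by = "day")
  # POSIXlt$wday: 0 = Sunday, ..., 6 = Saturday (independent of locale).
  wday <- as.POSIXlt(candidates)$wday
  weekdays_only <- candidates[wday %in% 1:5]
  tail(weekdays_only, n_obs)
}

dates <- last_weekdays("2010-01-01", 260)
print(range(dates))

# Rename the "feature" columns of a toy data frame, leaving "class" alone.
toy <- data.frame(feature1 = 0.01, feature2 = -0.02, feature3 = 0.005, class = 1)
colnames(toy)[1:3] <- as.character(tail(dates, 3))
print(toy)
```

With these arguments the weekday-only sequence starts on 2009-01-05, which is also the first column name shown in the cleaned tables.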

Here (if I were to do things differently) I would keep to tidy data principles and use test %>% add_column(dataset = "test") and train_val %>% add_column(dataset = "train") instead of test$dataset <- "test" and train_val$dataset <- "train". But that doesn’t matter much.

### How the training data looks after cleaning:

Table 2: How the training set looks now (cleaned)
2009-01-05 2009-01-06 2009-01-07 2009-01-08 2009-01-09
0.0062865 0.0044074 -0.0380887 0.0252850 -0.0065788
0.0008491 0.0025729 0.0013584 -0.0054742 -0.0098234
0.0142292 -0.0252533 -0.0100752 -0.0319871 -0.0065087
-0.0215930 -0.0102866 -0.0210674 -0.0086876 0.0371876
0.0092523 -0.0235778 0.0170582 0.0037303 0.0171185
0.0143528 0.0094828 0.0042109 -0.0038064 0.0084914

### How the testing data looks after cleaning:

Table 3: How the testing data looks (cleaned)
2009-01-05 2009-01-06 2009-01-07 2009-01-08 2009-01-09
0.0331039 0.0086225 0.0040622 0.0082554 0.0558741
0.0020681 -0.0034293 0.0134305 -0.0109182 -0.0184851
0.0147834 -0.0113800 -0.0046055 -0.0008757 -0.0011536
-0.0094855 0.0113410 -0.0213286 0.0033220 -0.0111519
0.0381690 -0.0037092 -0.0010865 -0.0062307 0.0232117
0.0004257 -0.0042553 0.0029915 0.0017043 0.0012760

The goal was to classify which financial time series were real and which were synthetically created (by some algorithm; I have no knowledge of how it generated the synthetic time series).

I re-arranged the data using the melt function in R; however, I suggest anybody reading this use the pivot_longer function from the tidyr package instead. The pivot_longer function was released a few weeks after I wrote the code for this problem.

######################################################################
################# Rearrange the data #################################

# I melt the data for easier analysis, now the data is in a long format.

# "Class" corresponds to whether the asset is Synthetic or Real
# "Dataset" tells me where the data came from
# "row_id" - corresponds to a unique ID assigned to each asset both "(Synthetic & Real)"
# "Variable" is the column names of the original dataset (feature1, feature2, ... , featureN) converted to some date
# "Value" is the daily returns

df <- train_val %>%
mutate(row_id = row_number()) %>%
melt(., measure.vars = 1:260) %>%
arrange(row_id)

head(df)
##   class dataset row_id   variable        value
## 1     0   train      1 2009-01-05  0.006286455
## 2     0   train      1 2009-01-06  0.004407363
## 3     0   train      1 2009-01-07 -0.038088652
## 4     0   train      1 2009-01-08  0.025285012
## 5     0   train      1 2009-01-09 -0.006578773
## 6     0   train      1 2009-01-12  0.005713677
dim(df)
## [1] 3120000       5
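For comparison, the same wide-to-long reshaping might look as follows with pivot_longer. This is a sketch on a made-up two-asset data frame (`toy_wide` is not part of the original data) and assumes tidyr 1.0.0 or later is installed.

```r
library(dplyr)
library(tidyr)

# Toy wide data: two assets, three dated return columns plus metadata.
toy_wide <- data.frame(
  `2009-01-05` = c(0.006, 0.001),
  `2009-01-06` = c(0.004, 0.003),
  `2009-01-07` = c(-0.038, 0.001),
  class   = c(0, 1),
  dataset = "train",
  check.names = FALSE
)

# Every column except the metadata is stacked into variable/value pairs.
toy_long <- toy_wide %>%
  mutate(row_id = row_number()) %>%
  pivot_longer(
    cols      = -c(class, dataset, row_id),
    names_to  = "variable",
    values_to = "value"
  ) %>%
  arrange(row_id)

print(toy_long)
```

One difference to keep in mind: pivot_longer returns the former column names as a character column, whereas melt returns a factor.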

Note: I call the training data df, which in hindsight is probably bad practice; it should be called something related to the train_val data set. Just keep in mind that df refers to the train_val data set (and does not include the test.csv data).

As we can see the data has 3,120,000 rows which is 12,000 assets * 260 trading days. Next I plot the returns series using ggplot.

# Plot some returns - I only plot a random sample of 20 assets for each Synthetic vs Real.

ret_plot0 <- df %>%
filter(class == 0) %>%
group_by(row_id) %>%
nest() %>%
ungroup() %>%
sample_n(20) %>%
unnest() %>%
ggplot(aes(x = variable, y = value)) +
geom_line(aes(group = factor(row_id), color = factor(row_id))) +
ggtitle("Synthetic Financial Time Series") +
theme_classic() +
theme(axis.text.x = element_blank(), legend.position = "bottom", legend.title = element_blank())

ret_plot1 <- df %>%
filter(class == 1) %>%
group_by(row_id) %>%
nest() %>%
ungroup() %>%
sample_n(20) %>%
unnest() %>%
ggplot(aes(x = variable, y = value)) +
geom_line(aes(group = factor(row_id), color = factor(row_id))) +
ggtitle("Real Financial Time Series") +
theme_classic() +
theme(axis.text.x = element_blank(), legend.position = "bottom", legend.title = element_blank())

plot_grid(ret_plot0, ret_plot1)

Next I plot box-plots of the average returns and of the standard deviations.

ave_box <- df %>%
group_by(class, row_id) %>%
summarise(mean = mean(value)) %>%
ggplot(aes(x = factor(class), y = mean, color = factor(class))) +
geom_boxplot(show.legend = FALSE) +
ggtitle("Syn vs Real Average Returns") +
xlab("Class") +
ylab("Average Returns") +
theme_tq()

sd_box <- df %>%
group_by(class, row_id) %>%
summarise(sd = sd(value)) %>%
ggplot(aes(x = factor(class), y = sd, color = factor(class))) +
geom_boxplot(show.legend = FALSE) +
ggtitle("Syn vs Real Standard Deviations") +
xlab("Class") +
ylab("Standard Deviation") +
theme_tq()

plot_grid(ave_box, sd_box)

I next calculate the Durbin-Watson statistic. I mostly code using R’s tidy data principles and therefore use the tidy function from the broom package to tidy the output of the DW statistic up a little. I do this for both the synthetic time series and real time series.

# I calculate the Durbin-Watson statistic and use the "tidy()" function to summarise the key information from the calculation.

dw_test_class_zero <- df %>%
dplyr::filter(class == 0) %>%
nest(-row_id) %>%
mutate(dw_res = map(data, ~ broom::tidy(lmtest::dwtest(value ~ 1, data = .x)))) %>%
unnest(dw_res) %>%
mutate(class = "0")

dw_test_class_zero %>%
head()
## # A tibble: 6 x 7
##   row_id         data statistic p.value method      alternative            class
##    <int> <list<df[,4>     <dbl>   <dbl> <chr>       <chr>                  <chr>
## 1      1    [260 x 4]      1.98   0.426 Durbin-Wat~ true autocorrelation ~ 0
## 2      2    [260 x 4]      2.01   0.521 Durbin-Wat~ true autocorrelation ~ 0
## 3      4    [260 x 4]      2.08   0.747 Durbin-Wat~ true autocorrelation ~ 0
## 4      5    [260 x 4]      2.49   1.000 Durbin-Wat~ true autocorrelation ~ 0
## 5      6    [260 x 4]      1.90   0.214 Durbin-Wat~ true autocorrelation ~ 0
## 6      9    [260 x 4]      1.87   0.138 Durbin-Wat~ true autocorrelation ~ 0
# Here I do the exact same thing as above but this time for the class == 1 data.

dw_test_class_one <- df %>%
filter(class == 1) %>%
nest(-row_id) %>%
mutate(dw_res = map(data, ~ broom::tidy(lmtest::dwtest(value ~ 1, data = .x)))) %>%
unnest(dw_res) %>%
mutate(class = "1")

dw_test_class_one %>%
head()
## # A tibble: 6 x 7
##   row_id         data statistic p.value method      alternative            class
##    <int> <list<df[,4>     <dbl>   <dbl> <chr>       <chr>                  <chr>
## 1      3    [260 x 4]      2.08  0.728  Durbin-Wat~ true autocorrelation ~ 1
## 2      7    [260 x 4]      1.81  0.0654 Durbin-Wat~ true autocorrelation ~ 1
## 3      8    [260 x 4]      1.93  0.296  Durbin-Wat~ true autocorrelation ~ 1
## 4     13    [260 x 4]      2.05  0.644  Durbin-Wat~ true autocorrelation ~ 1
## 5     15    [260 x 4]      2.07  0.715  Durbin-Wat~ true autocorrelation ~ 1
## 6     16    [260 x 4]      2.07  0.709  Durbin-Wat~ true autocorrelation ~ 1
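To make the statistic in these tables less of a black box, here is a minimal base R sketch of the Durbin-Watson calculation for the intercept-only model `value ~ 1`, run on simulated white noise rather than the challenge data. The variable names are introduced here for illustration.

```r
set.seed(42)
returns <- rnorm(260, mean = 0, sd = 0.02)  # simulated white-noise "returns"

# For the model `value ~ 1` the residuals are just the demeaned series.
e <- returns - mean(returns)

# Durbin-Watson: sum of squared successive differences over sum of squares.
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 sample autocorrelation, for the approximation DW ~ 2 * (1 - r1).
r1 <- sum(e[-1] * e[-length(e)]) / sum(e^2)

print(c(dw = dw, approx = 2 * (1 - r1)))
```

Values near 2 point to little first-order autocorrelation, values well below 2 to positive autocorrelation and values well above 2 to negative autocorrelation.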

Next I plot the boxplot statistics for each of the Durbin Watson tests.

# I bind the rows together and plot a box-plot.

bind_rows(dw_test_class_zero, dw_test_class_one) %>%
group_by(class) %>%
ggplot(aes(x = factor(class), y = statistic, color = factor(class))) +
geom_boxplot(show.legend = FALSE) +
ggtitle("Durbin Watson Box Plot Statistics") +
xlab("Class") +
ylab("Durbin Watson") +
theme_tq()

I compute the 10 day rolling mean and standard deviation using the tq_mutate function from the tidyquant package. value corresponds to the returns of the financial time series and is plotted in blue with the 10 day rolling average and standard deviation plotted over the returns. (I use melt again here but look into the pivot_longer function for a more intuitive application)

# Rolling mean and standard deviations
# I only use a random sample of 1 of each class of the grouped observations to save on memory and to make the plot more readable.
# The rolling window is 10 days
# I use the tq_mutate functionality from the "tidyquant" package to keep things in a "tidy" format as per the "tidyverse" 'rules'.
# In the plot "value" is the returns, "mean_10" is the 10 day rolling mean and "sd_10" is the 10 day rolling standard deviation.

plot0 <- df %>%
filter(class == 0) %>%
as_tibble() %>%
group_by(row_id) %>%
nest() %>%
ungroup() %>%
sample_n(1) %>%
unnest() %>%
mutate(variable = as.Date(variable)) %>%
tq_mutate(
select     = value,
mutate_fun = rollapply,
width      = 10,
align      = "right",
FUN        = mean,
na.rm      = TRUE,
col_rename = "mean_10"
) %>%
tq_mutate(
select     = value,
mutate_fun = rollapply,
width      = 10,
align      = "right",
FUN        = sd,
na.rm      = TRUE,
col_rename = "sd_10"
) %>%
melt(measure.vars = 5:7) %>%
setNames(c("row_id", "class", "data set", "date", "variable", "value")) %>%
ggplot(aes(x = date)) +
geom_line(aes(y = value, colour = variable)) +
ggtitle("Synthetic Financial Time Series Rolling Mean and Standard Deviation") +
theme_classic() +
scale_colour_manual(values = c("#1f77b4", "red", "black")) +
theme(axis.text.x = element_blank(), legend.position = "bottom", legend.title = element_blank())

plot1 <- df %>%
filter(class == 1) %>%
as_tibble() %>%
group_by(row_id) %>%
nest() %>%
ungroup() %>%
sample_n(1) %>%
unnest() %>%
mutate(variable = as.Date(variable)) %>%
tq_mutate(
select     = value,
mutate_fun = rollapply,
width      = 10,
align      = "right",
FUN        = mean,
na.rm      = TRUE,
col_rename = "mean_10"
) %>%
tq_mutate(
select     = value,
mutate_fun = rollapply,
width      = 10,
align      = "right",
FUN        = sd,
na.rm      = TRUE,
col_rename = "sd_10"
) %>%
melt(measure.vars = 5:7) %>%
setNames(c("row_id", "class", "data set", "date", "variable", "value")) %>%
ggplot(aes(x = date)) +
geom_line(aes(y = value, colour = variable)) +
ggtitle("Real Financial Time Series Rolling Mean and Standard Deviation") +
theme_classic() +
scale_colour_manual(values = c("#1f77b4", "red", "black")) +
theme(axis.text.x = element_blank(), legend.position = "bottom", legend.title = element_blank())

plot_grid(plot0, plot1)
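What the two tq_mutate calls compute can be sketched without tidyquant. The helper below, `roll_right`, is a hypothetical base R stand-in for a right-aligned rolling window and is applied to simulated returns, not to the challenge data.

```r
roll_right <- function(x, width, fun) {
  # embed() returns one row per complete window (values in reverse order,
  # which does not matter for mean or sd).
  windows <- embed(x, width)
  stat <- apply(windows, 1, fun)
  # Pad the first width - 1 positions, where no full window is available.
  c(rep(NA_real_, width - 1), stat)
}

set.seed(1)
returns <- rnorm(30, sd = 0.02)

mean_10 <- roll_right(returns, 10, mean)  # 10 day rolling mean
sd_10   <- roll_right(returns, 10, sd)    # 10 day rolling standard deviation

print(head(data.frame(returns, mean_10, sd_10), 12))
```

Right alignment means the value at day t only uses days t - 9 to t, so no future information leaks into the rolling statistics.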

An important note on the code here is that I randomly sample by group; that is, I do not take a random sample from all observations across all groups. Instead I group_by each time series (each of the 6,000 series left after I filter by class == 0, and likewise for class == 1) and then nest() the data to collapse the daily time series for each asset into a list. This leaves 6,000 observations, each with its time series nested inside a list. I can then sample 1 of the 6,000 observations and unnest() it to obtain the full time series of one randomly selected asset, instead of sampling randomly over all assets' time series data (which would be completely wrong).

For example, the following commented-out code groups by the ID variable with group_by(), collapses each series with nest(), takes a random sample of the grouped data with sample_n() and then restores the original form with unnest(), this time containing only the sampled IDs.

#  group_by(row_id) %>%
#  nest() %>%
#  ungroup() %>%
#  sample_n(1) %>%
#  unnest() %>%

Next I compute the Dickey Fuller test on both series for a single random observation, hence the sample_n(1) argument (it’s too computationally expensive to compute it on all 12,000 observations).

For the synthetically created series.

# Dickey Fuller test on the 0 class
# I only randomly sample 1 of the assets for the 0 class to save on output space

df %>%
filter(class == 0) %>%
group_by(row_id) %>%
nest() %>%
ungroup() %>%
sample_n(1) %>%
unnest() %>%
nest(-row_id) %>%
mutate(adf_res = map(data, ~ adf.test(.x$value))) %>%
unnest(adf_res)
## Augmented Dickey-Fuller Test
## alternative: stationary
##
## Type 1: no drift no trend
##      lag    ADF p.value
## [1,]   0 -17.94    0.01
## [2,]   1 -11.75    0.01
## [3,]   2  -8.66    0.01
## [4,]   3  -7.62    0.01
## [5,]   4  -7.13    0.01
## Type 2: with drift no trend
##      lag    ADF p.value
## [1,]   0 -17.94    0.01
## [2,]   1 -11.76    0.01
## [3,]   2  -8.67    0.01
## [4,]   3  -7.64    0.01
## [5,]   4  -7.15    0.01
## Type 3: with drift and trend
##      lag    ADF p.value
## [1,]   0 -18.00    0.01
## [2,]   1 -11.83    0.01
## [3,]   2  -8.77    0.01
## [4,]   3  -7.74    0.01
## [5,]   4  -7.26    0.01
## ----
## Note: in fact, p.value = 0.01 means p.value <= 0.01
## # A tibble: 3 x 3
##   row_id           data adf_res
##    <int> <list<df[,4]>> <named list>
## 1   7807      [260 x 4] <dbl[,3] [5 x 3]>
## 2   7807      [260 x 4] <dbl[,3] [5 x 3]>
## 3   7807      [260 x 4] <dbl[,3] [5 x 3]>

The same but on the real financial series.

# Dickey Fuller test on the 1 class
# I only randomly sample 1 of the assets for the 1 class to save on output space

df %>%
filter(class == 1) %>%
group_by(row_id) %>%
nest() %>%
ungroup() %>%
sample_n(1) %>%
unnest() %>%
nest(-row_id) %>%
mutate(adf_res = map(data, ~ adf.test(.x$value))) %>%
unnest(adf_res)
## Augmented Dickey-Fuller Test
## alternative: stationary
##
## Type 1: no drift no trend
##      lag    ADF p.value
## [1,]   0 -15.99    0.01
## [2,]   1 -10.71    0.01
## [3,]   2  -9.12    0.01
## [4,]   3  -8.74    0.01
## [5,]   4  -7.58    0.01
## Type 2: with drift no trend
##      lag    ADF p.value
## [1,]   0 -16.10    0.01
## [2,]   1 -10.84    0.01
## [3,]   2  -9.27    0.01
## [4,]   3  -8.93    0.01
## [5,]   4  -7.81    0.01
## Type 3: with drift and trend
##      lag    ADF p.value
## [1,]   0 -16.27    0.01
## [2,]   1 -10.99    0.01
## [3,]   2  -9.46    0.01
## [4,]   3  -9.18    0.01
## [5,]   4  -8.06    0.01
## ----
## Note: in fact, p.value = 0.01 means p.value <= 0.01
## # A tibble: 3 x 3
##   row_id           data adf_res
##    <int> <list<df[,4]>> <named list>
## 1  10833      [260 x 4] <dbl[,3] [5 x 3]>
## 2  10833      [260 x 4] <dbl[,3] [5 x 3]>
## 3  10833      [260 x 4] <dbl[,3] [5 x 3]>
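To see why returns-like series produce such strongly negative statistics, here is a hand-rolled sketch of the simplest Dickey-Fuller regression (no drift, no trend, no lagged differences) on simulated data. It is not the aTSA implementation and computes no p-values; `df_stat` is a hypothetical helper.

```r
# Dickey-Fuller t-statistic, "no drift, no trend", no lagged differences:
# regress diff(y) on the lagged level y[t-1] without an intercept.
df_stat <- function(y) {
  dy    <- diff(y)
  y_lag <- y[-length(y)]
  fit   <- lm(dy ~ 0 + y_lag)
  coef(summary(fit))["y_lag", "t value"]
}

set.seed(7)
noise <- rnorm(260, sd = 0.02)  # stationary, like a returns series
walk  <- cumsum(noise)          # random walk, like a price series

print(c(noise = df_stat(noise), walk = df_stat(walk)))
```

The statistic has a non-standard distribution under the unit-root null, so it should be compared with Dickey-Fuller critical values rather than the usual t tables.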

Next the Jarque-Bera tests for normality. Firstly on the synthetically created series.

# For both classes I take a random sample of 1 observation from each class (Synthetic and Real financial series)

jb_zero <- df %>%
filter(class == 0) %>%
group_by(row_id) %>%
nest() %>%
ungroup() %>%
sample_n(1) %>%
unnest() %>%
nest(-row_id) %>%
mutate(JarqueBeraTest = map(data, ~ JarqueBera.test(.x$value)))

print("Jarque-Bera Test on the 0 - Synthetic class")
## [1] "Jarque-Bera Test on the 0 - Synthetic class"
jb_zero$JarqueBeraTest
## [[1]]
##
##  Jarque Bera Test
##
## data:  .x$value
## X-squared = 0.3088, df = 2, p-value = 0.8569
##
##
##  Skewness
##
## data:  .x$value
## statistic = 0.045794, p-value = 0.7631
##
##
##  Kurtosis
##
## data:  .x$value
## statistic = 2.8582, p-value = 0.6406

Also on the real financial series.

jb_one <- df %>%
filter(class == 1) %>%
group_by(row_id) %>%
nest() %>%
ungroup() %>%
sample_n(1) %>%
unnest() %>%
nest(-row_id) %>%
mutate(JarqueBeraTest = map(data, ~ JarqueBera.test(.x$value)))

print("Jarque-Bera Test on the 1 - Real class")
## [1] "Jarque-Bera Test on the 1 - Real class"
jb_one$JarqueBeraTest
## [[1]]
##
##  Jarque Bera Test
##
## data:  .x$value
## X-squared = 25.14, df = 2, p-value = 0.000003474
##
##
##  Skewness
##
## data:  .x$value
## statistic = 0.084191, p-value = 0.5794
##
##
##  Kurtosis
##
## data:  .x$value
## statistic = 4.514, p-value = 0.0000006251
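The X-squared values above follow from the skewness and kurtosis through the usual Jarque-Bera formula, JB = n/6 * (S^2 + (K - 3)^2 / 4), compared against a chi-squared distribution with 2 degrees of freedom. A small sketch, where `jb_from_moments` and `sample_moments` are helper names introduced here and n = 260 is the series length used in this post:

```r
# Jarque-Bera statistic and p-value from sample skewness and kurtosis.
jb_from_moments <- function(n, skewness, kurtosis) {
  stat <- n / 6 * (skewness^2 + (kurtosis - 3)^2 / 4)
  p    <- pchisq(stat, df = 2, lower.tail = FALSE)
  c(statistic = stat, p.value = p)
}

# Plugging in the skewness and kurtosis printed above (n = 260 days).
jb_synthetic <- jb_from_moments(260, 0.045794, 2.8582)
jb_real      <- jb_from_moments(260, 0.084191, 4.514)
print(rbind(jb_synthetic, jb_real))

# Moment-based skewness and kurtosis of a series, for use with the above.
sample_moments <- function(x) {
  d  <- x - mean(x)
  m2 <- mean(d^2)
  c(skewness = mean(d^3) / m2^1.5, kurtosis = mean(d^4) / m2^2)
}
print(sample_moments(c(1, 2, 3, 4, 5)))
```

The statistics come out at roughly 0.309 and 25.14, in line with the printed output; almost all of the difference between the two series comes from the kurtosis term.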

### Autocorrelation plots:

I plot the autocorrelation function (ACF) for a “random” sample of time series. I selected 4 observations and filtered the data by them.
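Before the plotting code, a short base R sketch of what an ACF plot shows: the sample autocorrelation at each lag together with the approximate 95% bounds under a white-noise assumption. It uses simulated returns, and `manual_acf` is a hypothetical helper written for illustration.

```r
set.seed(2024)
returns <- rnorm(260, sd = 0.02)
n <- length(returns)

# Sample autocorrelation at lag k: lagged cross-products of the demeaned
# series divided by its sum of squares.
manual_acf <- function(x, max_lag) {
  d <- x - mean(x)
  sapply(0:max_lag, function(k) {
    sum(d[(k + 1):length(d)] * d[1:(length(d) - k)]) / sum(d^2)
  })
}

acf_vals <- manual_acf(returns, 10)

# Approximate 95% bounds under the white-noise null: +/- z / sqrt(n).
ci <- qnorm((1 + 0.95) / 2) / sqrt(n)

print(round(data.frame(lag = 0:10, acf = acf_vals), 3))
print(ci)
```

For 260 observations the bounds sit at about +/- 0.12, so bars inside that band are compatible with no autocorrelation at that lag.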

######################################################################
################# ACF plots ##########################################

# I only use 4 observations for these plots, 2 from the "synthetic" class and 2 from the "real" class.

df %>%
filter(row_id == 6422 | row_id == 8967 | row_id == 6080 | row_id ==   5734) %>%
mutate(date = as.Date(variable)) %>%
ggplot(aes(x = date)) +
geom_line(aes(y = value), color = "red", alpha = 0.4) +
geom_hline(yintercept = 0) +
facet_wrap(~ row_id + class) +
theme_tq()

acf_data <- df %>%
dplyr::filter(row_id == 6422 | row_id == 8967 | row_id == 6080 | row_id ==    5734) %>%
mutate(date = as.Date(variable))

df_acf <- acf_data %>%
group_by(row_id) %>%
summarise(list_acf = list(acf(value, plot=FALSE))) %>%
mutate(acf_vals = purrr::map(list_acf, ~ as.numeric(.x$acf))) %>%
select(-list_acf) %>%
unnest() %>%
group_by(row_id) %>%
mutate(lag = row_number() - 1)

df_ci <- acf_data %>%
group_by(row_id) %>%
summarise(ci = qnorm((1 + 0.95)/2)/sqrt(n()))

ggplot(df_acf, aes(x = lag, y = acf_vals)) +
geom_bar(stat="identity", width=.05) +
geom_hline(yintercept = 0) +
geom_hline(data = df_ci, aes(yintercept = -ci), color="blue", linetype="dotted") +
geom_hline(data = df_ci, aes(yintercept = ci), color="blue", linetype="dotted") +
labs(x="Lag", y="ACF") +
facet_wrap(~ row_id) +
theme_tq()

That's enough data analysis. I could probably add the PACF plots along with some more exploratory data analysis, but I move on to generating the financial time series features using the tsfeatures package.

In the code below I take a random sample of 5 groups (using the whole data set takes too long to calculate the time series features) and then apply all the functions in the tsfeatures package to each asset's time series, which it does by mapping over each asset's data and computing the time series features. This section takes some time to process and compute (especially on the whole sample), so I already saved the results as a csv and will just load in the pre-computed time series features.

################# Generate Time Series Features ######################

# I create some time series features from the package "tsfeatures". There are 40+ functions in the "tsfeatures" package
# which can generate approximately 106 time series features.
# Due to memory issues I am only able to create a few of the features, therefore I randomly sample 10 features from the
# "tsfeatures" package. We could also add in technical indicators from the "PerformanceAnalytics" or "TTR" packages (I omit these
# here, however creating 'functions2 <- ls("package:TTR")' and adding it to the 'summarise' command will work.)

functions <- ls("package:tsfeatures")[1:42]
# functions <- sample(functions, 20)

Stats <- df %>%
group_by(row_id, class) %>%
nest() %>%
ungroup() %>%
sample_n(5) %>%
unnest() %>%
nest(-row_id, -class) %>%
group_by(row_id, class) %T>%
{options(warn = -1)} %>%
summarise(Statistics = map(data, ~ data.frame(
bind_cols(
tsfeatures(.x$value, functions))))) %>%
unnest(Statistics)

# I saved the whole dataset as "Stats"; next I split it between training and test.
Stats <- read.csv("C:/Users/Matt/Desktop/Data Science Challenge/TSfeatures_train_val.csv")

Note: Again, bad practice by me. I called the data derived from df Stats, and it consists of only the time series features. This still only refers to the train_val.csv data and not the test.csv data.

The training data looks like this after computing the time series features. Each asset has now been collapsed from ~260 daily observations down to a single row of time series features.

Recall the goal here was to classify synthetic time series vs real time series, not to predict the next day's price. For each asset I have a single observation, and based on this I can train a classification algorithm to distinguish between real and synthetic time series.
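The collapsing step can be illustrated without the tsfeatures package. The sketch below computes a handful of hand-picked summary features per asset on simulated data; `simple_features` is a hypothetical helper and its four features are only loosely modelled on the kind of quantities tsfeatures returns.

```r
# Toy long data: 3 assets, 260 daily returns each.
set.seed(99)
toy <- data.frame(
  row_id = rep(1:3, each = 260),
  class  = rep(c(0, 1, 0), each = 260),
  value  = rnorm(3 * 260, sd = 0.02)
)

# Hand-picked summary features; tsfeatures computes far more than these.
simple_features <- function(x) {
  c(
    mean = mean(x),
    sd   = sd(x),
    acf1 = acf(x, lag.max = 1, plot = FALSE)$acf[2],
    crossing_points = sum(diff(x > median(x)) != 0)
  )
}

# One row per asset: the 260 observations collapse into a feature vector.
per_asset <- split(toy, toy$row_id)
features <- do.call(rbind, lapply(per_asset, function(d) {
  data.frame(row_id = d$row_id[1], class = d$class[1], t(simple_features(d$value)))
}))

print(features)
```

The resulting table has one row per asset with the class label alongside the features, which is the shape a classifier such as XGBoost expects.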

### How the training data looks:

Table 4: tsfeatures package features
X row_id class ac_9_ac_9 acf_features_x_acf1 acf_features_x_acf10 acf_features_diff1_acf1 acf_features_diff1_acf10 acf_features_diff2_acf1 acf_features_diff2_acf10 ARCH.LM autocorr_features_embed2_incircle_1 autocorr_features_embed2_incircle_2 autocorr_features_ac_9 autocorr_features_firstmin_ac autocorr_features_trev_num autocorr_features_motiftwo_entro3 autocorr_features_walker_propcross binarize_mean_binarize_mean binarize_mean_NA compengine_embed2_incircle_1 compengine_embed2_incircle_2 compengine_ac_9 compengine_firstmin_ac compengine_trev_num compengine_motiftwo_entro3 compengine_walker_propcross compengine_localsimple_mean1 compengine_localsimple_lfitac compengine_sampen_first compengine_std1st_der compengine_spreadrandomlocal_meantaul_50 compengine_spreadrandomlocal_meantaul_ac2 compengine_histogram_mode_10 compengine_outlierinclude_mdrmd compengine_fluctanal_prop_r1 crossing_points dist_features_histogram_mode_10 dist_features_outlierinclude_mdrmd embed2_incircle entropy firstmin_ac firstzero_ac flat_spots fluctanal_prop_r1_fluctanal_prop_r1 arch_acf garch_acf arch_r2 garch_r2 histogram_mode alpha beta hurst hw_parameters_hw_parameters hw_parameters_NA localsimple_taures lumpiness max_kl_shift time_kl_shift max_level_shift time_level_shift max_var_shift time_var_shift motiftwo_entro3 nonlinearity outlierinclude_mdrmd x_pacf5 diff1x_pacf5 diff2x_pacf5 pred_features_localsimple_mean1 pred_features_localsimple_lfitac pred_features_sampen_first sampen_first_sampen_first sampenc scal_features_fluctanal_prop_r1 spreadrandomlocal_meantaul stability station_features_std1st_der station_features_spreadrandomlocal_meantaul_50 station_features_spreadrandomlocal_meantaul_ac2 std1st_der_std1st_der nperiods seasonal_period trend spike linearity curvature e_acf1 e_acf10 trev_num tsfeatures_frequency tsfeatures_nperiods tsfeatures_seasonal_period tsfeatures_trend tsfeatures_spike tsfeatures_linearity tsfeatures_curvature tsfeatures_e_acf1 tsfeatures_e_acf10 tsfeatures_entropy tsfeatures_x_acf1 tsfeatures_x_acf10 tsfeatures_diff1_acf1 tsfeatures_diff1_acf10 tsfeatures_diff2_acf1 tsfeatures_diff2_acf10 unitroot_kpss unitroot_pp walker_propcross
1 1 0 -0.0675275 0.0097094 0.0526897 -0.5005299 0.3297018 -0.6772403 0.6124739 0.0627825 0.3929961 0.6147860 -0.0675275 1 0.1208750 2.071663 0.5405405 1 1 0.3929961 0.6147860 -0.0675275 1 0.1208750 2.071663 0.5405405 1 1 1.788841 1.408737 1.68 1.43 -0.25 -0.2865385 0.1627907 132 -0.25 -0.2865385 0.3929961 0.9840151 1 3 4 0.1627907 0.0652585 0.0154406 0.0627825 0.0253367 -0.25 0.0013330 0.0013330 0.5000458 NA NA 1 0.3556536 1.783636 103 1.297736 97 2.819828 46 2.071663 0.0752319 -0.2865385 0.0108653 0.4457792 1.0525222 1 1 1.788841 1.788841 1.788841 0.1627907 1.76 0.0562693 1.408737 1.74 1.36 1.408737 0 1 0.0043052 0.0000261 0.8421403 -0.7069160 0.0052389 0.0588324 0.1208750 1 0 1 0.0043052 0.0000261 0.8421403 -0.7069160 0.0052389 0.0588324 0.9840151 0.0097094 0.0526897 -0.5005299 0.3297018 -0.6772403 0.6124739 0.0993829 -249.7732 0.5405405
2 2 0 -0.0421577 -0.0075902 0.0387481 -0.5171529 0.3129147 -0.6727897 0.5379301 0.0558032 0.4285714 0.6563707 -0.0421577 1 -0.4765229 2.077581 0.5019305 1 1 0.4285714 0.6563707 -0.0421577 1 -0.4765229 2.077581 0.5019305 1 1 1.780390 1.419266 1.95 1.00 0.50 0.2615385 0.1627907 123 0.50 0.2615385 0.4285714 0.9864332 1 1 4 0.1627907 0.0664358 0.0657859 0.0558032 0.0554355 0.50 0.0001000 0.0001000 0.5000458 NA NA 1 0.4636768 1.733008 247 1.311861 141 2.625772 221 2.077581 0.0273335 0.2615385 0.0256032 0.4606850 1.0171377 1 1 1.780390 1.780390 1.780390 0.1627907 2.05 0.0892206 1.419266 2.12 1.00 1.419266 0 1 0.0177460 0.0000399 0.9249561 0.7665407 -0.0218053 0.0411861 -0.4765229 1 0 1 0.0177460 0.0000399 0.9249561 0.7665407 -0.0218053 0.0411861 0.9864332 -0.0075902 0.0387481 -0.5171529 0.3129147 -0.6727897 0.5379301 0.0414599 -256.0485 0.5019305
3 3 1 0.0099598 -0.0405929 0.0449036 -0.5026683 0.3471209 -0.6718885 0.6109006 0.0325470 0.4671815 0.7065637 0.0099598 1 -0.8755173 2.069233 0.5328185 1 0 0.4671815 0.7065637 0.0099598 1 -0.8755173 2.069233 0.5328185 1 1 1.706841 1.443315 1.38 1.00 -0.50 -0.2538462 0.1395349 132 -0.50 -0.2538462 0.4671815 0.9868568 1 1 6 0.1395349 0.0388513 0.0039162 0.0325470 0.0041902 -0.50 0.0014557 0.0014557 0.5000458 NA NA 1 1.2670493 7.746711 95 1.403784 87 5.235499 84 2.069233 0.2436499 -0.2538462 0.0223069 0.5356408 0.9954919 1 1 1.706841 1.706841 1.706841 0.1395349 1.42 0.0716499 1.443315 1.42 1.00 1.443315 0 1 0.0141368 0.0000929 0.8414359 -0.0259311 -0.0547484 0.0492987 -0.8755173 1 0 1 0.0141368 0.0000929 0.8414359 -0.0259311 -0.0547484 0.0492987 0.9868568 -0.0405929 0.0449036 -0.5026683 0.3471209 -0.6718885 0.6109006 0.0775698 -258.1295 0.5328185
4 4 0 -0.0428748 -0.0443619 0.0615867 -0.4571442 0.3184053 -0.5906478 0.4361178 0.1275576 0.4555985 0.7027027 -0.0428748 2 -0.9943808 2.068744 0.4903475 0 0 0.4555985 0.7027027 -0.0428748 2 -0.9943808 2.068744 0.4903475 1 1 1.660825 1.445807 1.24 1.00 0.25 0.0153846 0.1395349 127 0.25 0.0153846 0.4555985 0.9790521 2 1 7 0.1395349 0.0694296 0.0112709 0.0579144 0.0123884 0.25 0.0480021 0.0001000 0.5000458 NA NA 1 1.0068624 4.994753 132 1.258758 173 5.886911 156 2.068744 0.3840091 0.0153846 0.0503205 0.5402603 1.1070217 1 1 1.660825 1.660825 1.660825 0.1395349 1.10 0.1065111 1.445807 1.14 1.00 1.445807 0 1 0.0283540 0.0000482 -1.2297854 0.2921899 -0.0728152 0.0752389 -0.9943808 1 0 1 0.0283540 0.0000482 -1.2297854 0.2921899 -0.0728152 0.0752389 0.9790521 -0.0443619 0.0615867 -0.4571442 0.3184053 -0.5906478 0.4361178 0.2129633 -262.0781 0.4903475
5 5 0 0.0259312 -0.2447835 0.1469130 -0.5810073 0.4796508 -0.6799229 0.6232529 0.2014861 0.6563707 0.7992278 0.0259312 1 -0.7167079 2.059764 0.5289575 1 0 0.6563707 0.7992278 0.0259312 1 -0.7167079 2.059764 0.5289575 1 1 1.347789 1.580825 1.08 0.98 -0.50 0.7961538 0.1627907 133 -0.50 0.7961538 0.6563707 0.9723766 1 1 9 0.1627907 0.2718058 0.2229375 0.1765130 0.1330761 -0.50 0.0001000 0.0001000 0.5000458 NA NA 1 2.8846415 11.474426 80 1.772392 229 8.468236 236 2.059764 0.2143595 0.7961538 0.1008392 0.7538746 1.2926800 1 1 1.347789 1.347789 1.347789 0.1627907 1.08 0.0797924 1.580825 1.06 0.98 1.580825 0 1 0.0121072 0.0001568 -0.5488436 0.2255538 -0.2599764 0.1558209 -0.7167079 1 0 1 0.0121072 0.0001568 -0.5488436 0.2255538 -0.2599764 0.1558209 0.9723766 -0.2447835 0.1469130 -0.5810073 0.4796508 -0.6799229 0.6232529 0.1506344 -323.5672 0.5289575
6 6 0 -0.0761166 0.0468556 0.0858348 -0.5253131 0.3438031 -0.6901570 0.6130725 0.0432628 0.4352941 0.6627451 -0.0761166 1 0.0898648 2.068914 0.5250965 1 1 0.4352941 0.6627451 -0.0761166 1 0.0898648 2.068914 0.5250965 1 1 1.751575 1.381854 2.69 1.71 -0.25 -0.0846154 0.3488372 134 -0.25 -0.0846154 0.4352941 0.9806218 1 5 5 0.3488372 0.0500806 0.0502154 0.0627968 0.0620877 -0.25 0.0286244 0.0001000 0.5188805 NA NA 1 0.2189481 3.145763 141 1.447883 80 2.077936 84 2.068914 0.0137733 -0.0846154 0.0172321 0.4345976 1.0881798 1 1 1.751575 1.751575 1.751575 0.3488372 2.61 0.1479673 1.381854 2.63 1.81 1.381854 0 1 0.0077481 0.0000329 -0.5473782 0.4505809 0.0410068 0.0873468 0.0898648 1 0 1 0.0077481 0.0000329 -0.5473782 0.4505809 0.0410068 0.0873468 0.9806218 0.0468556 0.0858348 -0.5253131 0.3438031 -0.6901570 0.6130725 0.0259414 -262.3484 0.5250965
## [1] 12000   109

The data still has 12,000 rows, now with 109 columns: X, row_id, class and 106 features created by the tsfeatures package. That is, we have 6,000 synthetic and 6,000 real financial time series. (The raw data had 12,000 * ~260 = 3,120,000 values, but I applied tsfeatures to collapse the ~260 observations of each asset down to 1 single row of features.)

I collapsed this problem down from a time series prediction problem to a pure classification problem. I split the data between training and validation set next… I also split the data into X_train, Y_train… etc.

I split the df/Stats data set into a train set of 75% of the observations and an in-sample test data set of 25% of the observations.

######################################################################
################# Train an XGBoost model on the TS Features ##########

#Stats <- Stats %>%
#  select_if(~sum(!is.na(.)) > 0)

# Split the training set up between train and a small validation set
smp_size <- floor(0.75 * nrow(Stats))
#set.seed(123)
train_ind <- sample(seq_len(nrow(Stats)), size = smp_size)

train <- Stats[train_ind, ]
val <- Stats[-train_ind, ]

# We have 106 time series features for the model to learn from.

x_train <- train %>%
ungroup() %>%
select(-class, -row_id, -X) %>%
as.matrix()

x_val <- val %>%
ungroup() %>%
select(-class, -row_id, -X) %>%
as.matrix()

y_train <- train %>%
ungroup() %>%
pull(class)

y_val <- val %>%
ungroup() %>%
pull(class)
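Because the seed is commented out above, each run draws a different split. A minimal base R sketch of a seeded split that also keeps the two classes balanced in both parts is shown here. The `toy` data frame and the `toy_train` / `toy_val` names are hypothetical stand-ins, not objects from the post, and stratifying by class is a variation on the simple random split used above rather than what was actually run.

```r
# Toy stand-in for the Stats data frame: 12 rows, balanced classes (hypothetical data).
toy <- data.frame(row_id = 1:12, class = rep(c(0, 1), each = 6))

set.seed(123)  # fix the RNG so the same rows are drawn on every run

# Draw 75% of the row indices within each class, so both splits stay balanced.
train_ind <- unlist(lapply(split(seq_len(nrow(toy)), toy$class), function(idx) {
  idx[sample.int(length(idx), size = floor(0.75 * length(idx)))]
}))

toy_train <- toy[train_ind, ]
toy_val   <- toy[-train_ind, ]

table(toy_train$class)  # 4 rows of each class
table(toy_val$class)    # 2 rows of each class
```

With 12,000 rows the same 75% rule gives 9,000 training rows and 3,000 validation rows.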

### How the training X (input variables) data looks:

Table 5: How the X_train data look
ac_9_ac_9 acf_features_x_acf1 acf_features_x_acf10 acf_features_diff1_acf1 acf_features_diff1_acf10 acf_features_diff2_acf1 acf_features_diff2_acf10 ARCH.LM autocorr_features_embed2_incircle_1 autocorr_features_embed2_incircle_2 autocorr_features_ac_9 autocorr_features_firstmin_ac autocorr_features_trev_num autocorr_features_motiftwo_entro3 autocorr_features_walker_propcross binarize_mean_binarize_mean binarize_mean_NA compengine_embed2_incircle_1 compengine_embed2_incircle_2 compengine_ac_9 compengine_firstmin_ac compengine_trev_num compengine_motiftwo_entro3 compengine_walker_propcross compengine_localsimple_mean1 compengine_localsimple_lfitac compengine_sampen_first compengine_std1st_der compengine_spreadrandomlocal_meantaul_50 compengine_spreadrandomlocal_meantaul_ac2 compengine_histogram_mode_10 compengine_outlierinclude_mdrmd compengine_fluctanal_prop_r1 crossing_points dist_features_histogram_mode_10 dist_features_outlierinclude_mdrmd embed2_incircle entropy firstmin_ac firstzero_ac flat_spots fluctanal_prop_r1_fluctanal_prop_r1 arch_acf garch_acf arch_r2 garch_r2 histogram_mode alpha beta hurst hw_parameters_hw_parameters hw_parameters_NA localsimple_taures lumpiness max_kl_shift time_kl_shift max_level_shift time_level_shift max_var_shift time_var_shift motiftwo_entro3 nonlinearity outlierinclude_mdrmd x_pacf5 diff1x_pacf5 diff2x_pacf5 pred_features_localsimple_mean1 pred_features_localsimple_lfitac pred_features_sampen_first sampen_first_sampen_first sampenc scal_features_fluctanal_prop_r1 spreadrandomlocal_meantaul stability station_features_std1st_der station_features_spreadrandomlocal_meantaul_50 station_features_spreadrandomlocal_meantaul_ac2 std1st_der_std1st_der nperiods seasonal_period trend spike linearity curvature e_acf1 e_acf10 trev_num tsfeatures_frequency tsfeatures_nperiods tsfeatures_seasonal_period tsfeatures_trend tsfeatures_spike tsfeatures_linearity tsfeatures_curvature tsfeatures_e_acf1 tsfeatures_e_acf10 tsfeatures_entropy 
tsfeatures_x_acf1 tsfeatures_x_acf10 tsfeatures_diff1_acf1 tsfeatures_diff1_acf10 tsfeatures_diff2_acf1 tsfeatures_diff2_acf10 unitroot_kpss unitroot_pp walker_propcross
6801 0.0498492 -0.0642025 0.0542648 -0.4423482 0.2575236 -0.5981303 0.4149592 0.0271444 0.4710425 0.7181467 0.0498492 2 0.8754566 2.057333 0.5598456 0 1 0.4710425 0.7181467 0.0498492 2 0.8754566 2.057333 0.5598456 1 1 1.704503 1.460466 1.33 1.00 -0.50 0.1115385 0.8604651 139 -0.50 0.1115385 0.4710425 0.9888208 2 1 3 0.8604651 0.0332257 0.0244434 0.0370423 0.0287773 -0.50 0.0001000 0.0001000 0.5000458 NA NA 1 0.7769640 3.827223 209 1.027671 131 3.254518 195 2.057333 0.0695918 0.1115385 0.0474059 0.5669070 1.0663179 1 1 1.704503 1.704503 1.704503 0.8604651 1.41 0.0639649 1.460466 1.42 1.00 1.460466 0 1 0.0069481 0.0000643 -0.8628963 0.2636951 -0.0719026 0.0587799 0.8754566 1 0 1 0.0069481 0.0000643 -0.8628963 0.2636951 -0.0719026 0.0587799 0.9888208 -0.0642025 0.0542648 -0.4423482 0.2575236 -0.5981303 0.4149592 0.1777957 -246.9618 0.5598456
4209 -0.0037257 -0.0166400 0.0302609 -0.5444182 0.3391695 -0.7025401 0.5898760 0.0369855 0.3976834 0.6409266 -0.0037257 1 0.0772589 2.065480 0.5598456 1 1 0.3976834 0.6409266 -0.0037257 1 0.0772589 2.065480 0.5598456 1 1 1.752028 1.427591 1.39 1.00 -0.25 -0.1000000 0.4651163 137 -0.25 -0.1000000 0.3976834 0.9866480 1 1 4 0.4651163 0.0328564 0.0286941 0.0369855 0.0347972 -0.25 0.0008843 0.0008843 0.5000458 NA NA 1 0.2267605 3.549229 215 1.390319 3 2.017745 143 2.065480 0.0236440 -0.1000000 0.0060988 0.4859730 1.0685267 1 1 1.752028 1.752028 1.752028 0.4651163 1.49 0.0831999 1.427591 1.53 1.00 1.427591 0 1 0.0431696 0.0000288 -0.6356332 1.0362897 -0.0608160 0.0358936 0.0772589 1 0 1 0.0431696 0.0000288 -0.6356332 1.0362897 -0.0608160 0.0358936 0.9866480 -0.0166400 0.0302609 -0.5444182 0.3391695 -0.7025401 0.5898760 0.0372919 -268.4757 0.5598456
11168 0.0236704 -0.0269749 0.0299079 -0.4943006 0.2640054 -0.6626027 0.4906038 0.1265569 0.4401544 0.6640927 0.0236704 2 -0.4569401 2.075666 0.4633205 1 1 0.4401544 0.6640927 0.0236704 2 -0.4569401 2.075666 0.4633205 1 1 1.709466 1.431144 1.52 1.00 0.25 -0.0961538 0.1627907 122 0.25 -0.0961538 0.4401544 0.9882937 2 1 4 0.1627907 0.1453674 0.1490540 0.1265569 0.1247021 0.25 0.0411075 0.0001000 0.5000458 NA NA 1 0.3863291 2.834691 227 1.096209 123 2.760158 197 2.075666 0.1218026 -0.0961538 0.0088598 0.4643608 1.0505751 1 1 1.709466 1.709466 1.709466 0.1627907 1.61 0.0691848 1.431144 1.50 1.00 1.431144 0 1 0.0134781 0.0000342 -0.6468298 -1.1770328 -0.0419291 0.0376999 -0.4569401 1 0 1 0.0134781 0.0000342 -0.6468298 -1.1770328 -0.0419291 0.0376999 0.9882937 -0.0269749 0.0299079 -0.4943006 0.2640054 -0.6626027 0.4906038 0.1743418 -260.0758 0.4633205
5794 -0.0007087 0.1194830 0.0616705 -0.4062897 0.2206195 -0.6016700 0.4137913 0.1556551 0.4806202 0.6782946 -0.0007087 2 -0.5797405 2.066637 0.4787645 1 0 0.4806202 0.6782946 -0.0007087 2 -0.5797405 2.066637 0.4787645 1 1 1.558307 1.328565 2.03 1.18 -0.25 -0.3000000 0.2325581 120 -0.25 -0.3000000 0.4806202 0.9815963 2 2 5 0.2325581 0.2198692 0.0941053 0.1406280 0.0756639 -0.25 0.0125856 0.0001000 0.5477543 NA NA 1 0.7772726 8.411092 48 1.573682 146 3.802986 149 2.066637 0.1381103 -0.3000000 0.0193037 0.3959500 0.9255264 1 1 1.558307 1.558307 1.558307 0.2325581 1.98 0.1331827 1.328565 2.01 1.27 1.328565 0 1 0.0139233 0.0000358 -0.8988748 0.9389128 0.1079346 0.0661260 -0.5797405 1 0 1 0.0139233 0.0000358 -0.8988748 0.9389128 0.1079346 0.0661260 0.9815963 0.1194830 0.0616705 -0.4062897 0.2206195 -0.6016700 0.4137913 0.1182423 -224.0670 0.4787645
8693 -0.0814496 -0.0984498 0.1142883 -0.4688008 0.3181153 -0.6166136 0.4555893 0.1508792 0.4054054 0.6602317 -0.0814496 2 0.3988370 2.060571 0.5250965 0 1 0.4054054 0.6602317 -0.0814496 2 0.3988370 2.060571 0.5250965 1 1 1.651243 1.484233 1.19 1.00 -0.50 -0.0576923 0.3488372 136 -0.50 -0.0576923 0.4054054 0.9745764 2 1 6 0.3488372 0.0946062 0.0937635 0.1057152 0.1052409 -0.50 0.0269522 0.0001000 0.5000458 NA NA 1 0.5495742 7.853783 195 1.039641 191 4.458772 187 2.060571 0.1164590 -0.0576923 0.0467339 0.5896074 1.1095330 1 1 1.651243 1.651243 1.651243 0.3488372 1.24 0.0998210 1.484233 1.35 1.00 1.484233 0 1 0.0033231 0.0000574 0.1887497 0.4564879 -0.1022983 0.1171558 0.3988370 1 0 1 0.0033231 0.0000574 0.1887497 0.4564879 -0.1022983 0.1171558 0.9745764 -0.0984498 0.1142883 -0.4688008 0.3181153 -0.6166136 0.4555893 0.0391658 -262.9010 0.5250965
1073 -0.1253873 0.1511912 0.0608605 -0.3832523 0.2048003 -0.5832067 0.3861283 0.0876692 0.4031008 0.6356589 -0.1253873 2 0.2463431 2.061698 0.4594595 1 1 0.4031008 0.6356589 -0.1253873 2 0.2463431 2.061698 0.4594595 1 1 1.763381 1.304792 2.44 1.13 -0.25 0.1230769 0.1395349 121 -0.25 0.1230769 0.4031008 0.9867903 2 2 4 0.1395349 0.0779468 0.0618625 0.0695878 0.0601294 -0.25 0.0778294 0.0001000 0.5663347 NA NA 1 0.3151884 7.528904 185 2.069230 177 2.340804 169 2.061698 0.0279574 0.1230769 0.0310540 0.3527793 0.8978003 1 1 1.763381 1.763381 1.763381 0.1395349 2.45 0.0816322 1.304792 2.35 1.23 1.304792 0 1 0.0213244 0.0000306 -0.5577693 0.6111726 0.1329904 0.0758345 0.2463431 1 0 1 0.0213244 0.0000306 -0.5577693 0.6111726 0.1329904 0.0758345 0.9867903 0.1511912 0.0608605 -0.3832523 0.2048003 -0.5832067 0.3861283 0.0849681 -208.4546 0.4594595

### How the training Y (target variable) data looks:

Table 6: Y_train
.
1
0
1
0
0
1

I set the data up for an XGBoost model:
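The code further down refers to `dtrain` and `dval`, which are XGBoost's own matrix objects. A minimal sketch of how such objects can be built from a feature matrix and a label vector with `xgb.DMatrix` follows. It runs on toy data: the `_toy` names and values are made up for illustration, and this is not necessarily the exact code used for the post.

```r
library(xgboost)

# Toy stand-ins for x_train / y_train / x_val / y_val (hypothetical data).
set.seed(1)
x_train_toy <- matrix(rnorm(9 * 4), nrow = 9, dimnames = list(NULL, paste0("f", 1:4)))
y_train_toy <- c(0, 1, 0, 1, 0, 1, 0, 1, 0)
x_val_toy   <- matrix(rnorm(3 * 4), nrow = 3, dimnames = list(NULL, paste0("f", 1:4)))
y_val_toy   <- c(1, 0, 1)

# Bundle features and labels together in the format xgb.train / xgb.cv expect.
dtrain_toy <- xgb.DMatrix(data = x_train_toy, label = y_train_toy)
dval_toy   <- xgb.DMatrix(data = x_val_toy,   label = y_val_toy)

dim(dtrain_toy)  # rows and columns held by the DMatrix
```

With the objects from this post the analogous calls would take `x_train` with `y_train`, and `x_val` with `y_val`.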

I create a grid search in order to search over a parameter space and locate the best parameters for the data set. It needs a little more work but it’s a pretty good starting point. To widen the search I can just add values inside the expand.grid function. That is, say I want to try more tree depths: I can extend max_depth = c(5, 8, 14) to max_depth = c(5, 8, 14, 1, 2, 3, 4, 6, 7).

Note: adding values to the grid search increases computational time multiplicatively. The model has to be fitted for every combination of values in the grid, so the number of fits is the product of the number of values given for each parameter. With eta = c(0.1) and max_depth = c(5) there is a single combination to evaluate: eta = 0.1 paired with max_depth = 5. With eta = c(0.1, 0.3) and max_depth = c(5) there are two: eta = 0.1 with max_depth = 5, and eta = 0.3 with max_depth = 5. With eta = c(0.1, 0.3, 0.4) there are three, all paired with max_depth = 5. Adding a second value to max_depth would then double that to six.

Since there are many parameters to optimise in an XGBoost model, the grid can grow very quickly. Thus, understanding the statistics behind the models in Machine Learning is important when choosing a sensible search space and when trying to avoid getting stuck in a local minimum (which any greedy algorithm using gradient descent optimisation can do).
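To make the growth of the grid concrete, here is a small base R sketch that only counts combinations. The object names `grid_single`, `grid_post` and `grid_wide` are made up for illustration. The values are the ones mentioned in the text, and `nfold = 10` is taken from the code below.

```r
# One value per parameter: a single combination.
grid_single <- expand.grid(eta = c(0.1), max_depth = c(5))

# The two parameters varied in this post: 3 depths x 3 learning rates.
grid_post <- expand.grid(max_depth = c(5, 8, 14), eta = c(0.1, 0.05, 0.3))

# The wider depth search mentioned in the text: 9 depths x 3 learning rates.
grid_wide <- expand.grid(max_depth = c(5, 8, 14, 1, 2, 3, 4, 6, 7),
                         eta = c(0.1, 0.05, 0.3))

nrow(grid_single)  # 1
nrow(grid_post)    # 9
nrow(grid_wide)    # 27

# Each combination is cross validated, so every row costs nfold model fits.
nfold <- 10
nrow(grid_post) * nfold  # 90
nrow(grid_wide) * nfold  # 270
```

Every one of those fits can run for up to ntrees boosting rounds, which is why even a modest grid takes a while.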

######################################################################
################# XGBoost Grid Search to locate Optimal Parameters ###

##############################################################################################################################
# NOTE: This section was taken from the first chapter of my PhD where I needed to search over a parameter space to locate the
# most optimal parameters - I have just adapted it for this problem of Time Series Classification.
# It's simple enough to add parameters and different values - I just optimise a few important parameters from domain knowledge
# of the XGBoost model for this task, i.e depth and eta are quite important in gradient boosting.

# 1) I create a "grid" with different parameter values or combinations of parameter values
# 2) I apply cross validation over the parameter space to find the most optimal values for the XGBoost model.
# 3) I print the model parameters which give the best train / (in-sample test) results in a data table.
##############################################################################################################################

# Grid Search Parameters:
# 1)
searchGridSubCol <- expand.grid(subsample = c(1), #Range (0,1], default = 1, setting to 0.5 can help prevent overfitting
colsample_bytree = c(1), #Range (0,1], default = 1
max_depth = c(5, 8, 14), #Range [0, inf), default = 6
min_child = c(1), #Range [0, inf), default = 1
eta = c(0.1, 0.05, 0.3), #Range [0,1], default = 0.3
gamma = c(0), #Range [0, inf), default = 0
lambda = c(1), #Default = 1, L2 regularisation on weights, higher the more conservative the model
alpha = c(0), #Default = 0, L1 regularisation on weights, higher the more conservative the model
max_delta_step = c(0), #Range [0, inf), default = 0 (Helpful for logistic regression when class is extremely imbalanced, set to value 1-10 may help control the update)
colsample_bylevel = c(1) #Range (0,1], default = 1
)

ntrees = 200
nfold <- 10                             # I use nfold = 10 which is probably too many folds, 5 should be sufficient.
watchlist <- list(train = dtrain, test = dval)

# 2)
system.time(
AUCHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){
#Extract Parameters to test
currentSubsampleRate <- parameterList[["subsample"]] # the name must match the "subsample" column of the grid
currentColsampleRate <- parameterList[["colsample_bytree"]]
currentDepth <- parameterList[["max_depth"]]
currentEta <- parameterList[["eta"]]
currentMinChild <- parameterList[["min_child"]]
gamma <- parameterList[["gamma"]]
lambda <- parameterList[["lambda"]]
alpha <- parameterList[["alpha"]]
max_delta_step <- parameterList[["max_delta_step"]]
colsample_bylevel <- parameterList[["colsample_bylevel"]]
xgboostModelCV <- xgb.cv(data =  dtrain,
nrounds = ntrees,
nfold = nfold,
showsd = TRUE,
metrics = c("auc", "logloss", "error"),
verbose = TRUE,
"eval_metric" = c("auc", "logloss", "error"),
"objective" = "binary:logistic", #Outputs a probability "binary:logitraw" - outputs score before logistic transformation
"max.depth" = currentDepth,
"eta" = currentEta,
"gamma" = gamma,
"lambda" = lambda,
"alpha" = alpha,
"subsample" = currentSubsampleRate,
"colsample_bytree" = currentColsampleRate,
print_every_n = 50, # print every 50 trees to reduce the outputs printed.
"min_child_weight" = currentMinChild,
booster = "gbtree", #booster = "dart"  #using dart can help improve accuracy.
early_stopping_rounds = 10,
watchlist = watchlist,
seed = 1234)
xvalidationScores <<- as.data.frame(xgboostModelCV$evaluation_log)
train_auc_mean <- tail(xvalidationScores$train_auc_mean, 1)
test_auc_mean <- tail(xvalidationScores$test_auc_mean, 1)
train_logloss_mean <- tail(xvalidationScores$train_logloss_mean, 1)
test_logloss_mean <- tail(xvalidationScores$test_logloss_mean, 1)
train_error_mean <- tail(xvalidationScores$train_error_mean, 1)
test_error_mean <- tail(xvalidationScores$test_error_mean, 1)
output <- return(c(train_auc_mean, test_auc_mean, train_logloss_mean, test_logloss_mean, train_error_mean, test_error_mean, xvalidationScores, currentSubsampleRate, currentColsampleRate, currentDepth, currentEta, gamma, lambda, alpha, max_delta_step, colsample_bylevel, currentMinChild))
# NOTE: the two lines below are never reached, because return() above already exits the function.
hypemeans <- which.max(AUCHyperparameters[[1]]$test_auc_mean)
output2 <- return(hypemeans)
}))

The output of the grid search can be put into a tidy data frame using the following code. However, I did not save this output to file and therefore cannot read it in. You can view the output in the original Jupyter Notebook (In [49]) here

# 3)
output <- as.data.frame(t(sapply(AUCHyperparameters, '[', c(1:6, 20:29))))
varnames <- c("TrainAUC", "TestAUC", "TrainLogloss", "TestLogloss", "TrainError", "TestError", "SubSampRate", "ColSampRate", "Depth", "eta", "gamma", "lambda", "alpha", "max_delta_step", "col_sample_bylevel", "currentMinChild")
colnames(output) <- varnames
data.table(output)

According to the results at the time the optimal parameters were:

• ntrees = 95,
• eta = 0.1,
• max_depth = 5,

With the other parameters left to default settings for simplicity.

# Plug the optimal parameters into the model.

#################################################################################
################# XGBoost Optimal Parameters from Cross Validation ##############

# This is the final training model where I use the most optimal parameters found over the grid space and plug them in here.

watchlist <- list("train" = dtrain)

params <- list("eta" = 0.1, "max_depth" = 5, "colsample_bytree" = 1, "min_child_weight" = 1, "subsample"= 1,
"objective"="binary:logistic", "gamma" = 1, "lambda" = 1, "alpha" = 0, "max_delta_step" = 0,
"colsample_bylevel" = 1, "eval_metric"= "auc")

# "set.seed" is not an XGBoost parameter, so the seed is set through R's own RNG instead.
set.seed(176)

nround <- 95

Now that I have the optimal parameters from the cross validation grid search I can train the final XGBoost model on the whole train_val.csv data set. (Before, each candidate set of parameters was scored on the different folds of the cross validation. More info on k-fold cross validation here)

# Train the XGBoost model

xgb.model <- xgb.train(params, dtrain, nround, watchlist)
## [1]  train-auc:0.700790
## [2]  train-auc:0.720114
## [3]  train-auc:0.735281
## [4]  train-auc:0.741159
## [5]  train-auc:0.748016
## [6]  train-auc:0.752070
## [7]  train-auc:0.754637
## [8]  train-auc:0.759151
## [9]  train-auc:0.762538
## [10] train-auc:0.769652
## [11] train-auc:0.776582
## [12] train-auc:0.780015
## [13] train-auc:0.782065
## [14] train-auc:0.782815
## [15] train-auc:0.788966
## [16] train-auc:0.791026
## [17] train-auc:0.793545
## [18] train-auc:0.797363
## [19] train-auc:0.799069
## [20] train-auc:0.802015
## [21] train-auc:0.802583
## [22] train-auc:0.806938
## [23] train-auc:0.808239
## [24] train-auc:0.811255
## [25] train-auc:0.813142
## [26] train-auc:0.816767
## [27] train-auc:0.817697
## [28] train-auc:0.820239
## [29] train-auc:0.821589
## [30] train-auc:0.823343
## [31] train-auc:0.823939
## [32] train-auc:0.825701
## [33] train-auc:0.827316
## [34] train-auc:0.829365
## [35] train-auc:0.832646
## [36] train-auc:0.833297
## [37] train-auc:0.837006
## [38] train-auc:0.838857
## [39] train-auc:0.839923
## [40] train-auc:0.842968
## [41] train-auc:0.844877
## [42] train-auc:0.845940
## [43] train-auc:0.846583
## [44] train-auc:0.847330
## [45] train-auc:0.848292
## [46] train-auc:0.850215
## [47] train-auc:0.851641
## [48] train-auc:0.852670
## [49] train-auc:0.854706
## [50] train-auc:0.855752
## [51] train-auc:0.856772
## [52] train-auc:0.857806
## [53] train-auc:0.860245
## [54] train-auc:0.861337
## [55] train-auc:0.864178
## [56] train-auc:0.865290
## [57] train-auc:0.865808
## [58] train-auc:0.866386
## [59] train-auc:0.867751
## [60] train-auc:0.870032
## [61] train-auc:0.870500
## [62] train-auc:0.872442
## [63] train-auc:0.873391
## [64] train-auc:0.875188
## [65] train-auc:0.877767
## [66] train-auc:0.879196
## [67] train-auc:0.880079
## [68] train-auc:0.879969
## [69] train-auc:0.880638
## [70] train-auc:0.881389
## [71] train-auc:0.882066
## [72] train-auc:0.882515
## [73] train-auc:0.883854
## [74] train-auc:0.884654
## [75] train-auc:0.885104
## [76] train-auc:0.885922
## [77] train-auc:0.887100
## [78] train-auc:0.888646
## [79] train-auc:0.889833
## [80] train-auc:0.890387
## [81] train-auc:0.891815
## [82] train-auc:0.892281
## [83] train-auc:0.894417
## [84] train-auc:0.895006
## [85] train-auc:0.897079
## [86] train-auc:0.899254
## [87] train-auc:0.901114
## [88] train-auc:0.902460
## [89] train-auc:0.902939
## [90] train-auc:0.903763
## [91] train-auc:0.903792
## [92] train-auc:0.904433
## [93] train-auc:0.904986
## [94] train-auc:0.907339
## [95] train-auc:0.907761
# Note: Plot AUC for the in-sample train / validation scores - this was a note for me at the time of writing this R file - I never did get around to plotting the AUC for the in-sample train / validation scores...
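The log above reports AUC on the training data only. As a hedged illustration of what that number measures, and of how the same quantity could be computed for held-out predictions, here is a small base R sketch using the rank (Mann-Whitney) form of AUC. The function name `auc_rank` and the toy labels and scores are assumptions made for illustration. They are not part of the original analysis.

```r
# AUC as the probability that a randomly chosen positive case receives a
# higher score than a randomly chosen negative case (ties count as one half).
auc_rank <- function(labels, scores) {
  r     <- rank(scores)          # average ranks, so ties are handled
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example (hypothetical values): 3 of the 4 positive/negative pairs are
# ordered correctly, so the AUC is 0.75.
toy_labels <- c(0, 0, 1, 1)
toy_scores <- c(0.10, 0.40, 0.35, 0.80)
auc_rank(toy_labels, toy_scores)  # 0.75
```

Once predictions for the validation set exist, a call such as `auc_rank(y_val, xgb.pred)` would give the number to set against the final train-auc in the log.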

What is nice about tree based models is that we can obtain importance scores from the model and find which variables contributed most to the gain in the model. The original paper explains more about the gain in Algorithm 1 and Algorithm 3 here.

# We can obtain "feature" importance results from the model.
xgb.imp <- xgb.importance(model = xgb.model)
xgb.plot.importance(xgb.imp, top_n = 10)

That is, the XGBoost model found that spike was the most important variable. The spike feature comes from the stl_features function of the tsfeatures package in R. That function computes various measures of trend and seasonality based on a Seasonal and Trend decomposition using Loess (STL), and measures the spikiness of a time series as the variance of the leave-one-out variances of the remainder component e_t.
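To illustrate the "variance of the leave-one-out variances" idea, here is a minimal base R sketch. It is not the tsfeatures source code: `spike_sketch` is a hypothetical helper, it uses a plain leave-one-out loop, and the `remainder` argument stands in for the STL remainder e_t, which is assumed to have been extracted already.

```r
# Variance of the leave-one-out variances of a remainder series.
spike_sketch <- function(remainder) {
  # Variance of the series with observation i removed, for every i.
  loo_var <- vapply(seq_along(remainder),
                    function(i) var(remainder[-i]),
                    numeric(1))
  var(loo_var)
}

# Two toy remainders (hypothetical data): a regular one, and the same series
# with a single large spike.
smooth_e <- rep(c(-1, 1), 10)
spiky_e  <- smooth_e
spiky_e[10] <- 10

spike_sketch(smooth_e)  # essentially zero: removing any point changes little
spike_sketch(spiky_e)   # much larger: removing the spike changes the variance a lot
```

A single extreme value dominates the statistic, which is why it works as a measure of spikiness.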

The second variable is interesting also and comes from the compengine feature set from the CompEngine database. It groups variables as autocorrelation, prediction, stationarity, distribution and scaling.

The ARCH.LM comes from the arch_stat function of the tsfeatures package and is based on the Lagrange Multiplier test for Autoregressive Conditional Heteroscedasticity (ARCH) of Engle (1982).
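The idea behind this statistic can be sketched in a few lines of base R: regress the squared (demeaned) series on its own lags and look at the R² of that regression. High values suggest that large moves tend to follow large moves. The helper `arch_lm_r2` below is hypothetical, and the choice of 12 lags is only illustrative. It is a sketch of the idea as I understand it from the tsfeatures documentation, not the package's implementation.

```r
# R^2 of an autoregression of order `lags` fitted to the squared, demeaned series.
arch_lm_r2 <- function(x, lags = 12) {
  x  <- x - mean(x)
  sq <- embed(x^2, lags + 1)       # column 1 = current value, columns 2.. = its lags
  fit <- lm(sq[, 1] ~ sq[, -1])    # regress x_t^2 on x_{t-1}^2, ..., x_{t-lags}^2
  summary(fit)$r.squared
}

# Toy series of about the same length as the ones in this post (hypothetical data).
set.seed(42)
toy_returns <- rnorm(260, sd = 0.01)
arch_lm_r2(toy_returns)
```

For independent noise like `toy_returns` the value should be small. Series with volatility clustering would be expected to score higher.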

These are just a few of the variables the XGBoost model found to be the most important. A full overview and more information of the variables used in the model can be found here.

### Predictions using the in-sample test set

Now that I have trained the model using the optimal parameters I want to see if it scores the same or better based on the cross validation phase using the validation data. I use the dval which is the validation data set from the training split to test the model.

# I next make the predictions on the 'in-sample' held out test set, that is, originally I took the 12,000 training samples
# and split them between 75% training and 25% 'in-sample' testing (9000 training vs 3000 in-sample testing)

# I plot the probabilities from the model - the "dashed" line is the average predicted probability.
xgb.pred <- predict(xgb.model, dval, type = 'prob')

results <- cbind(y_val, xgb.pred)

results %>%
as_tibble() %>%
ggplot(aes(x = xgb.pred)) +
geom_density(color = "darkblue", fill = "lightblue") +
geom_vline(aes(xintercept = mean(xgb.pred)),
color = "blue", linetype = "dashed", size = 1) +
geom_histogram(aes(y = ..density..), colour = "black", fill = "white", alpha = 0.1, position = "identity") +
ggtitle("Predicted probability density plot") +
theme_tq()

# The average predicted probability sits around 0.48 / 0.49, for simplicity I will just select 0.50 as the cut off threshold.
# That is, all observations <= 0.50 are assigned a "0" class or "synthetic" data and all observations > 0.50 are assigned a "1" or
# "real" data.
# Finally I output the confusion matrix on the 'in-sample' testing data.

results <- results %>%
as_tibble() %>%
mutate(pred = case_when(
xgb.pred > 0.5 ~ 1,
xgb.pred <= 0.5 ~ 0
))

confusionMatrix(factor(results$pred), factor(results$y_val))
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    0    1
##          0 1041  537
##          1  465  957
##
##                Accuracy : 0.666
##                  95% CI : (0.6488, 0.6829)
##     No Information Rate : 0.502
##     P-Value [Acc > NIR] : <0.0000000000000002
##
##                   Kappa : 0.3319
##
##  Mcnemar's Test P-Value : 0.0249
##
##             Sensitivity : 0.6912
##             Specificity : 0.6406
##          Pos Pred Value : 0.6597
##          Neg Pred Value : 0.6730
##              Prevalence : 0.5020
##          Detection Rate : 0.3470
##    Detection Prevalence : 0.5260
##       Balanced Accuracy : 0.6659
##
##        'Positive' Class : 0
## 

A balanced accuracy score of 67% isn’t so bad considering I threw the kitchen sink at the classification problem and that this is a time series (stock market) classification problem. By kitchen sink I refer to all the time series functions found in the tsfeatures package.
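The headline numbers in the caret output can be reproduced by hand from the four counts in the confusion matrix, which is a useful sanity check on what each statistic means. This is a small base R sketch. The counts are copied from the output above and the object name `cm` is my own. Note that caret treats class 0 as the "positive" class here.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference).
cm <- matrix(c(1041, 465, 537, 957), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)         # (1041 + 957) / 3000
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # 1041 / 1506, true 0s predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])   # 957 / 1494, true 1s predicted as 1
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced_accuracy = balanced), 4)
# accuracy 0.6660, sensitivity 0.6912, specificity 0.6406, balanced_accuracy 0.6659
```

Because the two classes are almost evenly represented in the validation set (prevalence 0.502), accuracy and balanced accuracy are nearly identical.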

That ends the training and validation of the model. I have obtained the optimal values based on the training and validation data sets and now I want to test the model on the unknown data, the test.csv data.

I read in the test data and compute the time series features from the tsfeatures package just as I did with the training data.

 test_final <- read.csv("C:/Users/Matt/Desktop/Data Science Challenge/test.csv") %>%
mutate(row_id = row_number()) %>%
melt(., measure.vars = 1:260) %>%
arrange(row_id)

### How the test features look - (they look similar to the train data set):

Table 7: Test feature data set
row_id variable value
1 feature1 0.0331039
1 feature2 0.0086225
1 feature3 0.0040622
1 feature4 0.0082554
1 feature5 0.0558741
1 feature6 -0.0061266

I call this test_final and not test for no reason whatsoever - it's the same test.csv from the beginning.

Next I create the same time series features on the test data set as I do on the training data set. I save this as TSfeatures_test.csv.

functions <- sample(functions, 20)

test_final <- test_final %>%
group_by(row_id) %>%
#  nest() %>%
#  sample_n(5) %>%
#  ungroup() %>%
#  unnest() %>%
nest(-row_id) %>%
group_by(row_id) %T>%
{options(warn = -1)} %>%
summarise(Statistics = map(data, ~ data.frame(
bind_cols(
tsfeatures(.x$value, functions))))) %>%
unnest(Statistics)

#print("Generated 106 Time Series features")
#write.csv(test_final, "TSfeatures_test.csv")

I have computed all the tsfeatures for the train data set and also for the test data set. I saved these two as TSfeatures_train_val.csv and TSfeatures_test.csv.

#### Load in the train and test features data sets

I uploaded these files here

# I have already created the features for the training dataset so I can just load them right back in as
train_final <- read.csv("C:/Users/Matt/Desktop/Data Science Challenge/TSfeatures_train_val.csv")
test_final <- read.csv("C:/Users/Matt/Desktop/Data Science Challenge/TSfeatures_test.csv")

The final data for the training and test looks like:

train_final %>%
head() %>%
kable(caption = "Final training data set") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 12)

Table 8: Final training data set
X row_id class ac_9_ac_9 acf_features_x_acf1 acf_features_x_acf10 acf_features_diff1_acf1 acf_features_diff1_acf10 acf_features_diff2_acf1 acf_features_diff2_acf10 ARCH.LM autocorr_features_embed2_incircle_1 autocorr_features_embed2_incircle_2 autocorr_features_ac_9 autocorr_features_firstmin_ac autocorr_features_trev_num autocorr_features_motiftwo_entro3 autocorr_features_walker_propcross binarize_mean_binarize_mean binarize_mean_NA compengine_embed2_incircle_1 compengine_embed2_incircle_2 compengine_ac_9 compengine_firstmin_ac compengine_trev_num compengine_motiftwo_entro3 compengine_walker_propcross compengine_localsimple_mean1 compengine_localsimple_lfitac compengine_sampen_first compengine_std1st_der compengine_spreadrandomlocal_meantaul_50 compengine_spreadrandomlocal_meantaul_ac2 compengine_histogram_mode_10 compengine_outlierinclude_mdrmd compengine_fluctanal_prop_r1 crossing_points dist_features_histogram_mode_10 dist_features_outlierinclude_mdrmd embed2_incircle entropy firstmin_ac firstzero_ac flat_spots
fluctanal_prop_r1_fluctanal_prop_r1 arch_acf garch_acf arch_r2 garch_r2 histogram_mode alpha beta hurst hw_parameters_hw_parameters hw_parameters_NA localsimple_taures lumpiness max_kl_shift time_kl_shift max_level_shift time_level_shift max_var_shift time_var_shift motiftwo_entro3 nonlinearity outlierinclude_mdrmd x_pacf5 diff1x_pacf5 diff2x_pacf5 pred_features_localsimple_mean1 pred_features_localsimple_lfitac pred_features_sampen_first sampen_first_sampen_first sampenc scal_features_fluctanal_prop_r1 spreadrandomlocal_meantaul stability station_features_std1st_der station_features_spreadrandomlocal_meantaul_50 station_features_spreadrandomlocal_meantaul_ac2 std1st_der_std1st_der nperiods seasonal_period trend spike linearity curvature e_acf1 e_acf10 trev_num tsfeatures_frequency tsfeatures_nperiods tsfeatures_seasonal_period tsfeatures_trend tsfeatures_spike tsfeatures_linearity tsfeatures_curvature tsfeatures_e_acf1 tsfeatures_e_acf10 tsfeatures_entropy tsfeatures_x_acf1 tsfeatures_x_acf10 tsfeatures_diff1_acf1 tsfeatures_diff1_acf10 tsfeatures_diff2_acf1 tsfeatures_diff2_acf10 unitroot_kpss unitroot_pp walker_propcross 1 1 0 -0.0675275 0.0097094 0.0526897 -0.5005299 0.3297018 -0.6772403 0.6124739 0.0627825 0.3929961 0.6147860 -0.0675275 1 0.1208750 2.071663 0.5405405 1 1 0.3929961 0.6147860 -0.0675275 1 0.1208750 2.071663 0.5405405 1 1 1.788841 1.408737 1.68 1.43 -0.25 -0.2865385 0.1627907 132 -0.25 -0.2865385 0.3929961 0.9840151 1 3 4 0.1627907 0.0652585 0.0154406 0.0627825 0.0253367 -0.25 0.0013330 0.0013330 0.5000458 NA NA 1 0.3556536 1.783636 103 1.297736 97 2.819828 46 2.071663 0.0752319 -0.2865385 0.0108653 0.4457792 1.0525222 1 1 1.788841 1.788841 1.788841 0.1627907 1.76 0.0562693 1.408737 1.74 1.36 1.408737 0 1 0.0043052 0.0000261 0.8421403 -0.7069160 0.0052389 0.0588324 0.1208750 1 0 1 0.0043052 0.0000261 0.8421403 -0.7069160 0.0052389 0.0588324 0.9840151 0.0097094 0.0526897 -0.5005299 0.3297018 -0.6772403 0.6124739 0.0993829 -249.7732 0.5405405 2 2 0 
-0.0421577 -0.0075902 0.0387481 -0.5171529 0.3129147 -0.6727897 0.5379301 0.0558032 0.4285714 0.6563707 -0.0421577 1 -0.4765229 2.077581 0.5019305 1 1 0.4285714 0.6563707 -0.0421577 1 -0.4765229 2.077581 0.5019305 1 1 1.780390 1.419266 1.95 1.00 0.50 0.2615385 0.1627907 123 0.50 0.2615385 0.4285714 0.9864332 1 1 4 0.1627907 0.0664358 0.0657859 0.0558032 0.0554355 0.50 0.0001000 0.0001000 0.5000458 NA NA 1 0.4636768 1.733008 247 1.311861 141 2.625772 221 2.077581 0.0273335 0.2615385 0.0256032 0.4606850 1.0171377 1 1 1.780390 1.780390 1.780390 0.1627907 2.05 0.0892206 1.419266 2.12 1.00 1.419266 0 1 0.0177460 0.0000399 0.9249561 0.7665407 -0.0218053 0.0411861 -0.4765229 1 0 1 0.0177460 0.0000399 0.9249561 0.7665407 -0.0218053 0.0411861 0.9864332 -0.0075902 0.0387481 -0.5171529 0.3129147 -0.6727897 0.5379301 0.0414599 -256.0485 0.5019305 3 3 1 0.0099598 -0.0405929 0.0449036 -0.5026683 0.3471209 -0.6718885 0.6109006 0.0325470 0.4671815 0.7065637 0.0099598 1 -0.8755173 2.069233 0.5328185 1 0 0.4671815 0.7065637 0.0099598 1 -0.8755173 2.069233 0.5328185 1 1 1.706841 1.443315 1.38 1.00 -0.50 -0.2538462 0.1395349 132 -0.50 -0.2538462 0.4671815 0.9868568 1 1 6 0.1395349 0.0388513 0.0039162 0.0325470 0.0041902 -0.50 0.0014557 0.0014557 0.5000458 NA NA 1 1.2670493 7.746711 95 1.403784 87 5.235499 84 2.069233 0.2436499 -0.2538462 0.0223069 0.5356408 0.9954919 1 1 1.706841 1.706841 1.706841 0.1395349 1.42 0.0716499 1.443315 1.42 1.00 1.443315 0 1 0.0141368 0.0000929 0.8414359 -0.0259311 -0.0547484 0.0492987 -0.8755173 1 0 1 0.0141368 0.0000929 0.8414359 -0.0259311 -0.0547484 0.0492987 0.9868568 -0.0405929 0.0449036 -0.5026683 0.3471209 -0.6718885 0.6109006 0.0775698 -258.1295 0.5328185 4 4 0 -0.0428748 -0.0443619 0.0615867 -0.4571442 0.3184053 -0.5906478 0.4361178 0.1275576 0.4555985 0.7027027 -0.0428748 2 -0.9943808 2.068744 0.4903475 0 0 0.4555985 0.7027027 -0.0428748 2 -0.9943808 2.068744 0.4903475 1 1 1.660825 1.445807 1.24 1.00 0.25 0.0153846 0.1395349 127 0.25 0.0153846 
0.4555985 0.9790521 2 1 7 0.1395349 0.0694296 0.0112709 0.0579144 0.0123884 0.25 0.0480021 0.0001000 0.5000458 NA NA 1 1.0068624 4.994753 132 1.258758 173 5.886911 156 2.068744 0.3840091 0.0153846 0.0503205 0.5402603 1.1070217 1 1 1.660825 1.660825 1.660825 0.1395349 1.10 0.1065111 1.445807 1.14 1.00 1.445807 0 1 0.0283540 0.0000482 -1.2297854 0.2921899 -0.0728152 0.0752389 -0.9943808 1 0 1 0.0283540 0.0000482 -1.2297854 0.2921899 -0.0728152 0.0752389 0.9790521 -0.0443619 0.0615867 -0.4571442 0.3184053 -0.5906478 0.4361178 0.2129633 -262.0781 0.4903475 5 5 0 0.0259312 -0.2447835 0.1469130 -0.5810073 0.4796508 -0.6799229 0.6232529 0.2014861 0.6563707 0.7992278 0.0259312 1 -0.7167079 2.059764 0.5289575 1 0 0.6563707 0.7992278 0.0259312 1 -0.7167079 2.059764 0.5289575 1 1 1.347789 1.580825 1.08 0.98 -0.50 0.7961538 0.1627907 133 -0.50 0.7961538 0.6563707 0.9723766 1 1 9 0.1627907 0.2718058 0.2229375 0.1765130 0.1330761 -0.50 0.0001000 0.0001000 0.5000458 NA NA 1 2.8846415 11.474426 80 1.772392 229 8.468236 236 2.059764 0.2143595 0.7961538 0.1008392 0.7538746 1.2926800 1 1 1.347789 1.347789 1.347789 0.1627907 1.08 0.0797924 1.580825 1.06 0.98 1.580825 0 1 0.0121072 0.0001568 -0.5488436 0.2255538 -0.2599764 0.1558209 -0.7167079 1 0 1 0.0121072 0.0001568 -0.5488436 0.2255538 -0.2599764 0.1558209 0.9723766 -0.2447835 0.1469130 -0.5810073 0.4796508 -0.6799229 0.6232529 0.1506344 -323.5672 0.5289575 6 6 0 -0.0761166 0.0468556 0.0858348 -0.5253131 0.3438031 -0.6901570 0.6130725 0.0432628 0.4352941 0.6627451 -0.0761166 1 0.0898648 2.068914 0.5250965 1 1 0.4352941 0.6627451 -0.0761166 1 0.0898648 2.068914 0.5250965 1 1 1.751575 1.381854 2.69 1.71 -0.25 -0.0846154 0.3488372 134 -0.25 -0.0846154 0.4352941 0.9806218 1 5 5 0.3488372 0.0500806 0.0502154 0.0627968 0.0620877 -0.25 0.0286244 0.0001000 0.5188805 NA NA 1 0.2189481 3.145763 141 1.447883 80 2.077936 84 2.068914 0.0137733 -0.0846154 0.0172321 0.4345976 1.0881798 1 1 1.751575 1.751575 1.751575 0.3488372 2.61 0.1479673 
1.381854 2.63 1.81 1.381854 0 1 0.0077481 0.0000329 -0.5473782 0.4505809 0.0410068 0.0873468 0.0898648 1 0 1 0.0077481 0.0000329 -0.5473782 0.4505809 0.0410068 0.0873468 0.9806218 0.0468556 0.0858348 -0.5253131 0.3438031 -0.6901570 0.6130725 0.0259414 -262.3484 0.5250965 test_final %>% head() %>% kable(caption = "Final testing data set") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 12) Table 9: Final testing data set X row_id ac_9_ac_9 acf_features_x_acf1 acf_features_x_acf10 acf_features_diff1_acf1 acf_features_diff1_acf10 acf_features_diff2_acf1 acf_features_diff2_acf10 ARCH.LM autocorr_features_embed2_incircle_1 autocorr_features_embed2_incircle_2 autocorr_features_ac_9 autocorr_features_firstmin_ac autocorr_features_trev_num autocorr_features_motiftwo_entro3 autocorr_features_walker_propcross binarize_mean_binarize_mean binarize_mean_NA compengine_embed2_incircle_1 compengine_embed2_incircle_2 compengine_ac_9 compengine_firstmin_ac compengine_trev_num compengine_motiftwo_entro3 compengine_walker_propcross compengine_localsimple_mean1 compengine_localsimple_lfitac compengine_sampen_first compengine_std1st_der compengine_spreadrandomlocal_meantaul_50 compengine_spreadrandomlocal_meantaul_ac2 compengine_histogram_mode_10 compengine_outlierinclude_mdrmd compengine_fluctanal_prop_r1 crossing_points dist_features_histogram_mode_10 dist_features_outlierinclude_mdrmd embed2_incircle entropy firstmin_ac firstzero_ac flat_spots fluctanal_prop_r1_fluctanal_prop_r1 arch_acf garch_acf arch_r2 garch_r2 histogram_mode alpha beta hurst hw_parameters_hw_parameters hw_parameters_NA localsimple_taures lumpiness max_kl_shift time_kl_shift max_level_shift time_level_shift max_var_shift time_var_shift motiftwo_entro3 nonlinearity outlierinclude_mdrmd x_pacf5 diff1x_pacf5 diff2x_pacf5 pred_features_localsimple_mean1 pred_features_localsimple_lfitac pred_features_sampen_first sampen_first_sampen_first sampenc 
scal_features_fluctanal_prop_r1 spreadrandomlocal_meantaul stability station_features_std1st_der station_features_spreadrandomlocal_meantaul_50 station_features_spreadrandomlocal_meantaul_ac2 std1st_der_std1st_der nperiods seasonal_period trend spike linearity curvature e_acf1 e_acf10 trev_num tsfeatures_frequency tsfeatures_nperiods tsfeatures_seasonal_period tsfeatures_trend tsfeatures_spike tsfeatures_linearity tsfeatures_curvature tsfeatures_e_acf1 tsfeatures_e_acf10 tsfeatures_entropy tsfeatures_x_acf1 tsfeatures_x_acf10 tsfeatures_diff1_acf1 tsfeatures_diff1_acf10 tsfeatures_diff2_acf1 tsfeatures_diff2_acf10 unitroot_kpss unitroot_pp walker_propcross 1 1 -0.0262073 -0.0396281 0.0429784 -0.4964245 0.3379915 -0.6704837 0.6178088 0.1425744 0.5482625 0.7528958 -0.0262073 1 -0.5824739 2.063564 0.4826255 1 1 0.5482625 0.7528958 -0.0262073 1 -0.5824739 2.063564 0.4826255 1 1 1.383933 1.437946 1.91 1.00 0.50 0.4307692 0.1395349 117 0.50 0.4307692 0.5482625 0.9817288 1 1 7 0.1395349 0.1906443 0.0422059 0.1425744 0.0417531 0.50 0.0440489 0.0001000 0.5000458 NA NA 1 1.1617874 4.857530 130 1.031623 230 3.967385 214 2.063564 0.0716802 0.4307692 0.0271516 0.5270423 0.9564642 1 1 1.383933 1.383933 1.383933 0.1395349 1.80 0.0804590 1.437946 1.89 1.00 1.437946 0 1 0.0355541 0.0000573 -2.6210355 -0.0981868 -0.0740868 0.0651438 -0.5824739 1 0 1 0.0355541 0.0000573 -2.6210355 -0.0981868 -0.0740868 0.0651438 0.9817288 -0.0396281 0.0429784 -0.4964245 0.3379915 -0.6704837 0.6178088 0.8820380 -252.2509 0.4826255 2 2 -0.0047799 0.0544155 0.0423445 -0.4931653 0.3114689 -0.6980787 0.6597427 0.1111625 0.4513619 0.6964981 -0.0047799 3 0.2147570 2.068849 0.5250965 1 0 0.4513619 0.6964981 -0.0047799 3 0.2147570 2.068849 0.5250965 1 1 1.611106 1.375120 2.15 1.40 0.25 0.1211538 0.1627907 142 0.25 0.1211538 0.4513619 0.9856808 3 3 6 0.1627907 0.1313081 0.0468159 0.0939769 0.0402163 0.25 0.0063703 0.0001000 0.5012778 NA NA 1 0.5347516 6.848494 91 1.360520 80 3.586240 75 2.068849 0.0618461 
0.1211538 0.0344415 0.4336405 0.9510320 1 1 1.611106 1.611106 1.611106 0.1627907 2.14 0.0796936 1.375120 1.82 1.34 1.375120 0 1 0.0216068 0.0000391 0.1351482 -0.3430376 0.0339344 0.0578569 0.2147570 1 0 1 0.0216068 0.0000391 0.1351482 -0.3430376 0.0339344 0.0578569 0.9856808 0.0544155 0.0423445 -0.4931653 0.3114689 -0.6980787 0.6597427 0.0722224 -226.9463 0.5250965 3 3 0.0370364 -0.0041963 0.1781209 -0.3838557 0.3158431 -0.5535087 0.3948373 0.3450202 0.6138996 0.7915058 0.0370364 2 2.9002534 2.067845 0.5598456 1 0 0.6138996 0.7915058 0.0370364 2 2.9002534 2.067845 0.5598456 1 1 1.436472 1.414575 1.24 1.00 0.50 0.7230769 0.1627907 139 0.50 0.7230769 0.6138996 0.9627133 2 1 6 0.1627907 0.4731295 0.0342727 0.2247245 0.0323111 0.50 0.0001000 0.0001000 0.5000458 NA NA 1 3.9022555 33.656077 240 1.695947 222 9.122984 232 2.067845 0.7040489 0.7230769 0.0685939 0.5171369 1.0433489 1 1 1.436472 1.436472 1.436472 0.1627907 1.39 0.1088905 1.414575 1.43 1.00 1.414575 0 1 0.0058644 0.0001243 -1.1897947 -0.4762066 -0.0084531 0.1814633 2.9002534 1 0 1 0.0058644 0.0001243 -1.1897947 -0.4762066 -0.0084531 0.1814633 0.9627133 -0.0041963 0.1781209 -0.3838557 0.3158431 -0.5535087 0.3948373 0.1757311 -235.0780 0.5598456 4 4 -0.0576029 -0.0338906 0.0251717 -0.4963752 0.2570591 -0.6694337 0.4910006 0.0471296 0.3899614 0.6332046 -0.0576029 3 -0.1053821 2.075447 0.5366795 0 1 0.3899614 0.6332046 -0.0576029 3 -0.1053821 2.075447 0.5366795 1 1 1.785628 1.436827 1.52 1.00 -0.25 0.0769231 0.1860465 137 -0.25 0.0769231 0.3899614 0.9886539 3 1 3 0.1860465 0.0511246 0.0516446 0.0471296 0.0470911 -0.25 0.0025845 0.0025845 0.5000458 NA NA 1 0.2161135 2.534373 34 1.404765 154 2.213233 205 2.075447 0.0681473 0.0769231 0.0179401 0.4720756 0.9626432 1 1 1.785628 1.785628 1.785628 0.1860465 1.44 0.0499953 1.436827 1.42 1.00 1.436827 0 1 0.0042080 0.0000286 0.9969942 0.1863847 -0.0370368 0.0269840 -0.1053821 1 0 1 0.0042080 0.0000286 0.9969942 0.1863847 -0.0370368 0.0269840 0.9886539 -0.0338906 0.0251717 
-0.4963752 0.2570591 -0.6694337 0.4910006 0.0860264 -241.6752 0.5366795 5 5 -0.1236994 0.0086381 0.0308039 -0.5025363 0.3330186 -0.6693011 0.5835466 0.1157603 0.4202335 0.7003891 -0.1236994 1 -0.0489352 2.058889 0.4864865 1 0 0.4202335 0.7003891 -0.1236994 1 -0.0489352 2.058889 0.4864865 1 1 1.722492 1.396172 1.69 1.32 -0.50 -0.0076923 0.8139535 120 -0.50 -0.0076923 0.4202335 0.9908616 1 3 6 0.8139535 0.0537820 0.0583484 0.1157603 0.1120523 -0.50 0.0001609 0.0001609 0.5090878 NA NA 1 0.6488028 3.045684 97 1.287940 14 4.338131 240 2.058889 0.0094165 -0.0076923 0.0059114 0.4457371 0.9190563 1 1 1.722492 1.722492 1.722492 0.8139535 1.63 0.1107442 1.396172 1.75 1.35 1.396172 0 1 0.0229286 0.0000550 -0.6149100 0.2128084 -0.0125452 0.0317617 -0.0489352 1 0 1 0.0229286 0.0000550 -0.6149100 0.2128084 -0.0125452 0.0317617 0.9908616 0.0086381 0.0308039 -0.5025363 0.3330186 -0.6693011 0.5835466 0.1169027 -266.1451 0.4864865 6 6 0.0137566 -0.0889224 0.0668615 -0.5649436 0.4404459 -0.7097820 0.7128451 0.0752299 0.5366795 0.6447876 0.0137566 1 0.3033072 2.064104 0.5328185 1 0 0.5366795 0.6447876 0.0137566 1 0.3033072 2.064104 0.5328185 1 1 1.464977 1.477767 1.53 1.00 0.25 0.3269231 0.1627907 136 0.25 0.3269231 0.5366795 0.9835850 1 1 6 0.1627907 0.1033936 0.0236197 0.0740159 0.0248339 0.25 0.0001000 0.0001000 0.5000458 NA NA 1 0.7510236 12.688453 197 1.217490 189 2.987989 194 2.064104 0.0649001 0.3269231 0.0200688 0.5201834 1.0761503 1 1 1.464977 1.464977 1.464977 0.1627907 1.35 0.0814814 1.477767 1.36 1.00 1.477767 0 1 0.0081147 0.0000469 0.6555116 -0.0489727 -0.0976177 0.0700199 0.3033072 1 0 1 0.0081147 0.0000469 0.6555116 -0.0489727 -0.0976177 0.0700199 0.9835850 -0.0889224 0.0668615 -0.5649436 0.4404459 -0.7097820 0.7128451 0.0869913 -279.8920 0.5328185 Finally we can run the final model on the held-out-test-set and obtain our predictions based on the training data and the optimal parameters. 

    # Use the optimal parameters found previously and run the final training
    # model (to make predictions on the out-of-sample test data)
    x_train_final <- train_final %>%
      ungroup() %>%
      select(-class, -row_id, -X) %>%
      as.matrix()

    x_test_final <- test_final %>%
      ungroup() %>%
      select(-row_id, -X) %>%
      as.matrix()

    y_train_final <- train_final %>%
      ungroup() %>%
      pull(class)

    dtrain_final <- xgb.DMatrix(data = as.matrix(x_train_final), label = y_train_final, missing = "NaN")
    dtest_final <- xgb.DMatrix(data = as.matrix(x_test_final), missing = "NaN")

    watchlist <- list("train" = dtrain_final)

    # Note: "set.seed" is not an xgboost parameter; for reproducibility,
    # call set.seed(176) before training instead.
    params <- list("eta" = 0.1,
                   "max_depth" = 5,
                   "colsample_bytree" = 1,
                   "min_child_weight" = 1,
                   "subsample" = 1,
                   "objective" = "binary:logistic",
                   "gamma" = 1,
                   "lambda" = 1,
                   "alpha" = 0,
                   "max_delta_step" = 0,
                   "colsample_bylevel" = 1,
                   "eval_metric" = "auc",
                   "set.seed" = 176)

    nround <- 95

    xgb.model_final <- xgb.train(params, dtrain_final, nround, watchlist)

    ## [1] train-auc:0.708604
    ## [2] train-auc:0.721700
    ## [3] train-auc:0.723230
    ## [4] train-auc:0.729888
    ## [5] train-auc:0.735542
    ## [6] train-auc:0.738081
    ## [7] train-auc:0.740926
    ## [8] train-auc:0.744105
    ## [9] train-auc:0.746320
    ## [10] train-auc:0.748644
    ## [11] train-auc:0.754211
    ## [12] train-auc:0.756892
    ## [13] train-auc:0.761524
    ## [14] train-auc:0.763882
    ## [15] train-auc:0.767216
    ## [16] train-auc:0.772009
    ## [17] train-auc:0.772943
    ## [18] train-auc:0.774261
    ## [19] train-auc:0.775471
    ## [20] train-auc:0.777801
    ## [21] train-auc:0.780629
    ## [22] train-auc:0.784384
    ## [23] train-auc:0.787112
    ## [24] train-auc:0.788946
    ## [25] train-auc:0.791835
    ## [26] train-auc:0.793142
    ## [27] train-auc:0.795289
    ## [28] train-auc:0.798502
    ## [29] train-auc:0.799893
    ## [30] train-auc:0.802186
    ## [31] train-auc:0.804981
    ## [32] train-auc:0.805649
    ## [33] train-auc:0.807120
    ## [34] train-auc:0.809020
    ## [35] train-auc:0.810318
    ## [36] train-auc:0.812637
    ## [37] train-auc:0.814760
    ## [38] train-auc:0.816024
    ## [39] train-auc:0.817956
    ## [40] train-auc:0.819350
    ## [41] train-auc:0.821653
    ## [42] train-auc:0.822729
    ## [43] train-auc:0.824029
    ## [44] train-auc:0.824765
    ## [45] train-auc:0.826924
    ## [46] train-auc:0.827804
    ## [47] train-auc:0.828475
    ## [48] train-auc:0.831018
    ## [49] train-auc:0.832247
    ## [50] train-auc:0.833265
    ## [51] train-auc:0.834168
    ## [52] train-auc:0.835535
    ## [53] train-auc:0.836093
    ## [54] train-auc:0.837008
    ## [55] train-auc:0.837715
    ## [56] train-auc:0.839537
    ## [57] train-auc:0.840310
    ## [58] train-auc:0.841701
    ## [59] train-auc:0.842480
    ## [60] train-auc:0.843106
    ## [61] train-auc:0.844495
    ## [62] train-auc:0.845348
    ## [63] train-auc:0.845932
    ## [64] train-auc:0.847843
    ## [65] train-auc:0.849445
    ## [66] train-auc:0.850345
    ## [67] train-auc:0.851337
    ## [68] train-auc:0.852121
    ## [69] train-auc:0.852663
    ## [70] train-auc:0.854132
    ## [71] train-auc:0.855949
    ## [72] train-auc:0.856758
    ## [73] train-auc:0.857115
    ## [74] train-auc:0.857954
    ## [75] train-auc:0.858849
    ## [76] train-auc:0.859527
    ## [77] train-auc:0.859917
    ## [78] train-auc:0.860590
    ## [79] train-auc:0.861264
    ## [80] train-auc:0.862359
    ## [81] train-auc:0.863101
    ## [82] train-auc:0.863794
    ## [83] train-auc:0.864911
    ## [84] train-auc:0.866293
    ## [85] train-auc:0.866976
    ## [86] train-auc:0.867436
    ## [87] train-auc:0.869036
    ## [88] train-auc:0.869469
    ## [89] train-auc:0.869931
    ## [90] train-auc:0.870681
    ## [91] train-auc:0.872326
    ## [92] train-auc:0.873706
    ## [93] train-auc:0.875704
    ## [94] train-auc:0.876178
    ## [95] train-auc:0.876789

I make the final predictions on the test.csv data. The predict function in R is convenient: given a fitted model and the testing data, it returns the predictions. I “ask” for probability scores from the predictions, and I also plot the density of the predicted probabilities.

    # Make the final predictions on the 'test.csv' data and plot the probability density function.
    xgb.pred_final <- predict(xgb.model_final, dtest_final, type = 'prob')

    xgb.pred_final %>%
      as_tibble() %>%
      setNames(c("Prediction")) %>%
      ggplot(aes(x = Prediction)) +
      geom_density(color = "darkblue", fill = "lightblue") +
      geom_vline(aes(xintercept = mean(Prediction)), color = "blue", linetype = "dashed", size = 1) +
      geom_histogram(aes(y = ..density..), colour = "black", fill = "white", alpha = 0.1, position = "identity") +
      ggtitle("(Out of sample) - Predicted probability density plot") +
      theme_tq()

### Finally!

I make the submission file based on the predicted probabilities.

    # Convert the probabilities into a binary class of 0 or 1 by a decision threshold of 0.465
    # (the mean predicted probability, computed first).
    # Write the predictions to "submission.csv"
    xgb.pred_final %>%
      as_tibble() %>%
      setNames(c("Prediction")) %>%
      summarise(mean = mean(Prediction))

    ## # A tibble: 1 x 1
    ##    mean
    ##   <dbl>
    ## 1 0.465

    xgb.pred_final %>%
      as_tibble() %>%
      setNames(c("Prediction")) %>%
      mutate(pred = case_when(
        Prediction > 0.465 ~ 1,
        Prediction <= 0.465 ~ 0
      )) %>%
      write.csv("submission.csv")

I made this final remark in the Jupyter Notebook I sent as part of the interview process:

Quote:: Final footnote: Hopefully the out-of-sample predictions will obtain a 67% accuracy (the predictions in the “submission.csv” file).

After I sent my predictions, I was told how the scores were evaluated (translated from Spanish):

*So that you know how the scoring works: a score between 0.4 and 0.6 is considered a random result. From 0.6 upward the algorithm classifies correctly, and above 0.7 the algorithm is great. Below 0.4, the algorithm is able to tell synthetic series from real ones, but the labels are swapped.*

I was informed that on the held-out test set I obtained a score of 0.649636, roughly 65%. That is a little lower than my 67% in-sample result, but it is consistent with the methodology I was using (i.e. no leaking of test data into the training data), especially given that I was just throwing the time series features kitchen sink at the problem. Further reading into time series features would strengthen this approach and should improve the prediction accuracy. Recall that my feature selection consisted of applying every feature in the tsfeatures package, using `functions <- ls("package:tsfeatures")[1:42]` and then mapping over the data using `summarise(Statistics = map(data, ~ data.frame(bind_cols(tsfeatures(.x$value, functions))))) %>%`. So there is plenty of room to improve the current model.
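
For readers without access to the data, here is a minimal, self-contained sketch of the same workflow in base R. It is not the code I ran for the challenge, and everything in it is an assumption made for illustration: the series are simulated (i.i.d. noise versus an ARCH(1)-style process), the three features are hand-written stand-ins for the tsfeatures functions, `glm()` replaces XGBoost, and the names `simulate_series`, `feature_functions` and `extract_features` are hypothetical. It only shows the shape of the steps: map a set of feature functions over every series, split before fitting, then threshold the held-out probabilities at their mean.

```r
# Illustrative sketch only: simulated data, base R, glm() as a stand-in for XGBoost.
set.seed(176)

# Hypothetical data generator: class 1 series have ARCH(1)-style volatility
# clustering, class 0 series are i.i.d. Gaussian noise. The length n is arbitrary.
simulate_series <- function(class, n = 260) {
  if (class == 0) {
    return(rnorm(n))
  }
  x <- numeric(n)
  x[1] <- rnorm(1)
  for (t in 2:n) {
    sigma2 <- 0.2 + 0.7 * x[t - 1]^2
    x[t] <- rnorm(1, sd = sqrt(sigma2))
  }
  x
}

# A named list of feature functions that is mapped over every series
# (a small analogue of passing a vector of function names to tsfeatures()).
feature_functions <- list(
  x_acf1  = function(x) acf(x, lag.max = 1, plot = FALSE)$acf[2],
  sq_acf1 = function(x) acf(x^2, lag.max = 1, plot = FALSE)$acf[2],
  std_dev = function(x) sd(x)
)
extract_features <- function(x) sapply(feature_functions, function(f) f(x))

n_series <- 400
classes <- rep(c(0, 1), each = n_series / 2)
series_list <- lapply(classes, simulate_series)

# One row per series, one column per feature, plus the class label.
features <- as.data.frame(t(sapply(series_list, extract_features)))
features$class <- classes

# Split BEFORE fitting so that no test information leaks into training.
train_idx <- sample(seq_len(n_series), size = 300)
train <- features[train_idx, ]
test <- features[-train_idx, ]

model <- glm(class ~ ., data = train, family = binomial())
prob_test <- predict(model, newdata = test, type = "response")

# Threshold at the mean predicted probability, as in the submission step.
threshold <- mean(prob_test)
pred_class <- ifelse(prob_test > threshold, 1, 0)
accuracy <- mean(pred_class == test$class)
cat("threshold:", round(threshold, 3), " held-out accuracy:", round(accuracy, 3), "\n")
```

The simulated classes here are far easier to separate than the real challenge data, so the accuracy this prints says nothing about what to expect on real financial series. Thresholding at the mean predicted probability is also a simple choice; tuning the threshold on a validation split would be an alternative.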

Conclusion: A combination of time series feature selection and classification models can do pretty well on time series classification tasks such as the one I faced.

Any errors are my own!

##### Matthew Smith
###### Researcher in Dept Finance

I am a researcher with a focus on Machine Learning methods applied to economics and finance.