# Series of Hackerrank Competitions

A few simple Hackerrank competitions consisting of time series and cross-sectional problems.

library(knitr)
library(kableExtra)
library(ggplot2)
library(tidyquant)
library(dplyr)

#### Problem 1: Battery Life Prediction Problem

The first problem was to predict how long a laptop battery would last given the number of hours it had been charged. The problem statement can be found on Hackerrank.

We can read in the data directly from the hackerrank website using the following:

BatteryData <- read.table(file = url("https://s3.amazonaws.com/hr-testcases/399/assets/trainingdata.txt"), strip.white = TRUE, sep = ',', header = FALSE, skip = 0)
Table 1: Battery Life Data

| V1   | V2   |
|-----:|-----:|
| 2.81 | 5.62 |
| 7.14 | 8.00 |
| 2.72 | 5.44 |
| 3.87 | 7.74 |
| 1.90 | 3.80 |
| 7.82 | 8.00 |

At first I applied a simple time series model to the data and obtained poor predictions. Taking a closer look at the actual data by plotting it told a different story.

BatteryData %>%
ggplot(aes(x = V1, y = V2)) +
geom_point() +
stat_smooth(method = "lm", col = "red") +
geom_vline(xintercept = 4) +
theme_tq()

I drew a vertical line at 4 hours of charging and plotted a linear regression line. Quite clearly a single regression line performs badly, and I needed either a different model or, better yet, a threshold: predict every observation whose battery had been charging for 4 or more hours to be fully charged.
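The pattern the plot reveals is worth stating explicitly: battery life appears to be exactly twice the charging time, capped at 8 hours once the laptop has charged for 4 or more hours. A quick base-R check of that rule (my own inference from the plot and Table 1, not something the problem statement gives you):

```r
# Hypothesised rule read off the scatter plot: battery life doubles
# the charging time and saturates at 8 hours after 4+ hours of charging
battery_rule <- function(hours_charged) pmin(2 * hours_charged, 8)

battery_rule(c(2.81, 3.87, 7.14))  # 5.62 7.74 8.00 -- matches V2 in Table 1
```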

The model:

LinearPredictionModel <- function(TestData){
  TrainData <- read.table(file = url("https://s3.amazonaws.com/hr-testcases/399/assets/trainingdata.txt"), strip.white = TRUE, sep = ',', header = FALSE, skip = 0)

  # Keep only the observations below the 4-hour charging threshold
  LMData <- TrainData %>%
    filter(V1 < 4)

  # Battery life (V2) as a linear function of hours charged (V1)
  LinearModel <- lm(V2 ~ V1, data = LMData)

  NewData <- data.frame(V1 = TestData)

  # 4 or more hours of charging means a full battery: predict 8
  STDOUT <- ifelse(TestData < 4, predict(LinearModel, newdata = NewData), 8)

  return(cat(STDOUT))
}

The model fits a simple linear regression to the observations charged for less than 4 hours; anything charged for 4 hours or more is treated as fully charged and simply predicted to last 8 hours.

We can input some new data as follows:

TestData <- c(5, 8, 1, 0.2, 3.5, 5.6, 4.32)
LinearPredictionModel(TestData)
## 8 8 2 0.4 7 8 8

#### Problem 2: House Price Prediction Problem using XGBoost

In this problem I want to train an XGBoost (Extreme Gradient Boosting) model to predict what a house will sell for based on some of its characteristics. There is a house price prediction problem on Hackerrank, but here I use the HousePrices dataset that ships with the AER package.

library(AER)
library(car)
library(dplyr)
library(xgboost)
library(tidyquant)
library(Metrics)

Load in the data and set factors to numeric for easier pre-processing.

data("HousePrices")

data <- HousePrices %>%
mutate_if(is.factor, as.numeric)
Table 2: House Price Data

| price | lotsize | bedrooms | bathrooms | stories | driveway | recreation | fullbase | gasheat | aircon | garage | prefer |
|------:|--------:|---------:|----------:|--------:|---------:|-----------:|---------:|--------:|-------:|-------:|-------:|
| 42000 | 5850 | 3 | 1 | 2 | 2 | 1 | 2 | 1 | 1 | 1 | 1 |
| 38500 | 4000 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 0 | 1 |
| 49500 | 3060 | 3 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 0 | 1 |
| 60500 | 6650 | 3 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 0 | 1 |
| 61000 | 6360 | 2 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 0 | 1 |
| 66000 | 4160 | 3 | 1 | 1 | 2 | 2 | 2 | 1 | 2 | 0 | 1 |

Split the data between 80% train and 20% testing.

####################################################################################
# For this example split between train and test
# (an explicit set.seed() call here would make the split reproducible)
smp_size <- floor(nrow(data) * 0.80)

train_ind <- sample(seq_len(nrow(data)), size = smp_size)

train <- data[train_ind, ]
test <- data[-train_ind, ]

paste("Train Dimensions"); dim(train)
## [1] "Train Dimensions"
## [1] 436  12
paste("Test Dimensions"); dim(test)
## [1] "Test Dimensions"
## [1] 110  12

I create a function that takes the training and testing data, trains a model and outputs the predictions. I omit the cross-validation and hyper-parameter tuning here, but I will soon write a post detailing how I use a grid search to find the optimal parameters of an XGBoost model (it takes some explaining of the functions I create).

####################################################################################
predictHouseSales <- function(train_data, test_data){
  x_train <- subset(train_data, select = c(-price))
  y_train <- subset(train_data, select = c(price)) %>% pull(price)

  x_test <- subset(test_data, select = c(-price))
  y_test <- subset(test_data, select = c(price)) %>% pull(price)

  dtrain <- xgb.DMatrix(data = as.matrix(x_train), label = y_train, missing = NA)
  dtest <- xgb.DMatrix(data = as.matrix(x_test), missing = NA)

  params <- list(
    "eta" = 0.1,
    "max_depth" = 5,
    "objective" = "reg:squarederror",  # "reg:linear" is deprecated in recent xgboost releases
    "eval_metric" = "rmse"
  )

  xgb.model <- xgb.train(params, dtrain, nrounds = 100)

  return(predict(xgb.model, newdata = dtest))
}

myPredictions <- predictHouseSales(train, test)

We can inspect some of the predictions and compare them to the actual prices below:

####################################################################################
myPredictions %>%
data.frame() %>%
setNames(c("myPreds")) %>%
mutate(myPreds = round(myPreds, 0)) %>%
bind_cols(test) %>%
select(myPreds, price) %>%
sample_n(5) %>%
kable(caption = "Compare predictions to the observed") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 12)
Table 3: Compare predictions to the observed

| myPreds | price  |
|--------:|-------:|
| 49942   | 39000  |
| 117177  | 115442 |
| 65904   | 34000  |
| 65651   | 42000  |
| 48866   | 65000  |

Compute the Root Mean Square Error:

####################################################################################
myPredictions %>%
data.frame() %>%
setNames(c("myPreds")) %>%
mutate(myPreds = round(myPreds, 0)) %>%
bind_cols(test) %>%
select(myPreds, price) %>%
summarise(rmse(myPreds, price)) %>%
kable(caption = "Compute the RMSE") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 12)
Table 4: Compute the RMSE

| rmse(myPreds, price) |
|---------------------:|
| 16266.12 |
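For reference, `rmse()` above comes from the Metrics package and is nothing more than the square root of the mean squared difference between predictions and observations; a minimal base-R equivalent:

```r
# Root mean square error, equivalent to Metrics::rmse(actual, predicted)
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# e.g. on two of the (prediction, price) pairs from Table 3
rmse_manual(actual = c(39000, 115442), predicted = c(49942, 117177))
```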

We can also plot the predictions against the observed using ggplot2:

####################################################################################
myPredictions %>%
data.frame() %>%
setNames(c("myPreds")) %>%
mutate(myPreds = round(myPreds, 0)) %>%
bind_cols(test) %>%
select(myPreds, price) %>%
ggplot(aes(x = price, y = myPreds)) +
geom_point() +
geom_smooth(method = 'lm') +
ggtitle("My predictions vs Actual Price") +
theme_tq()

The model does okay without any feature engineering or parameter optimisation using a grid search method.

#### Problem 3: Imputing missing values - cross section

For simplicity I load in the iris dataset and I create some random missing values since the iris data doesn’t have any.

# Load in the data
library(missForest)
data(iris)
data <- iris
Table 5: Iris dataset

| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|-------------:|------------:|-------------:|------------:|---------|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa |

I generate some random missing data, i.e. 20% of the values are Missing at Random (MAR) or Missing Completely at Random (MCAR). This is an important issue for economists, since much of the data we collect contains missing values, whether because surveys are not filled in correctly or because the data is simply not collected very well. The model below lets us impute the missing values using machine learning, instead of simple averaging or removing the incomplete observations altogether.
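The chunk that creates the missing values was not echoed in the post; a base-R sketch of punching 20% MCAR holes is below (the function name `make_mcar` is mine, and `missForest::prodNA(iris, noNA = 0.2)` does the same job in one call):

```r
# Set a given proportion of cells to NA completely at random (MCAR)
make_mcar <- function(df, prop = 0.2) {
  n <- nrow(df)
  holes <- sample(n * ncol(df), size = floor(prop * n * ncol(df)))
  for (h in holes) {
    # convert the linear (column-major) cell index back to (row, column)
    df[((h - 1) %% n) + 1, ((h - 1) %/% n) + 1] <- NA
  }
  df
}

set.seed(42)  # seed chosen here for reproducibility; not in the original
iris.mis <- make_mcar(iris, prop = 0.2)
mean(is.na(iris.mis))  # 0.2
```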

## [1] "Create some random missing data"
Table 6: Iris dataset with random missing values

| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|-------------:|------------:|-------------:|------------:|---------|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | NA  | setosa |
| NA  | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | NA  | 0.2 | setosa |
| 5.0 | 3.6 | NA  | 0.2 | setosa |
| 5.4 | 3.9 | NA  | 0.4 | setosa |

Now we can train a series of simple missing-value imputation models on the data.

##### Model 1: randomForest Imputation

I won’t discuss the models in depth here, but a random forest uses a collection of decision trees and averages the results across all trees, which results in some powerful predictions.
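The imputation chunk itself was also hidden; assuming missForest with its default settings, it presumably looked something like this (`iris.mis` is the dataset with holes, and `iris.imp` is my own name for the completed copy):

```r
library(missForest)

# Fit a random forest to the observed cells of each variable and predict
# the missing ones, iterating until the imputations stabilise; each pass
# prints one of the "missForest iteration ... done!" messages below
imputed <- missForest(iris.mis)

iris.imp <- imputed$ximp   # the completed dataset
imputed$OOBerror           # out-of-bag estimate of the imputation error
```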

##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!
##   missForest iteration 6 in progress...done!
##   missForest iteration 7 in progress...done!
Table 7: Iris dataset with imputed values

| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|-------------:|------------:|-------------:|------------:|---------|
| 5.100000 | 3.5 | 1.400000 | 0.2000000 | setosa |
| 4.900000 | 3.0 | 1.400000 | 0.1415167 | setosa |
| 4.722635 | 3.2 | 1.300000 | 0.2000000 | setosa |
| 4.600000 | 3.1 | 1.197250 | 0.2000000 | setosa |
| 5.000000 | 3.6 | 1.409930 | 0.2000000 | setosa |
| 5.400000 | 3.9 | 1.419667 | 0.4000000 | setosa |

Let's combine all the data together and filter it down to the actual values, the missing values and the predicted values.

### For the Sepal.Length data

Table 8: Compare Sepal Observations

| Predicted_Sepal | Actual_Sepal |
|----------------:|-------------:|
| 6.3 | 6.9 |
| 5.1 | 5.0 |
| 7.0 | 6.3 |
| 5.7 | 5.6 |
| 5.1 | 5.1 |
| 6.2 | 6.5 |
| 4.7 | 4.8 |
| 6.0 | 5.8 |
| 6.0 | 6.9 |
| 5.6 | 5.8 |

Which isn’t bad, given that the iris data contains only 150 observations and 20% of the values were removed. Using random forest models to impute missing data beats simple averaging on many datasets; of course we still have errors, but the actual observations are reasonably close to the predicted values. I only analyse the Sepal.Length column here, but the same approach recovers the missing Sepal.Width, Petal.Length, Petal.Width and Species values, since missForest imputes categorical variables as well.

##### Matthew Smith
###### Researcher in the Department of Finance

I am a researcher with a focus on Machine Learning methods applied to economics and finance.