Series of Hackerrank Competitions

A few simple Hackerrank competitions covering time series and cross-sectional problems

library(knitr)
library(kableExtra)
library(ggplot2)
library(tidyquant)
library(dplyr)

Problem 1: Battery Life Prediction Problem

The first problem asks us to predict how long a laptop battery will last given the number of hours it was charged. The problem is here

We can read in the data directly from the hackerrank website using the following:

BatteryData <- read.table(file = url("https://s3.amazonaws.com/hr-testcases/399/assets/trainingdata.txt"), strip.white = TRUE, sep = ',', header = FALSE, skip = 0)
Table 1: Battery Life Data (V1 = hours charged, V2 = hours of battery life)
V1 V2
2.81 5.62
7.14 8.00
2.72 5.44
3.87 7.74
1.90 3.80
7.82 8.00

At first I applied a simple time series model to the data and obtained poor predictions. I then took a closer look by plotting the data, and it told a different story.

BatteryData %>%
  ggplot(aes(x = V1, y = V2)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red") +
  geom_vline(xintercept = 4) +
  theme_tq()

I drew a vertical line at 4 hours of charging and overlaid a linear regression line. Quite clearly a single linear fit performs badly, so I needed either a different model or, better yet, a threshold: predict that any laptop charged for 4 or more hours is fully charged.

The model:

LinearPredictionModel <- function(TestData){
  # V1 = hours charged, V2 = hours of battery life
  TrainData <- read.table(file = url("https://s3.amazonaws.com/hr-testcases/399/assets/trainingdata.txt"), strip.white = TRUE, sep = ',', header = FALSE, skip = 0)
  
  # Flag the un-capped region: laptops charged for less than 4 hours
  TrainData$LessThanFour <- ifelse(TrainData[, 1] < 4, 1, 0)
  
  LMData <- TrainData %>%
    filter(LessThanFour == 1)
  
  # Regress battery life on charging time using only the linear region
  LinearModel <- lm(V2 ~ V1, data = LMData) 
  
  TestData <- data.frame(V1 = TestData)
  
  # Under 4 hours of charging use the linear fit; otherwise the battery is full
  STDOUT <- ifelse(TestData$V1 < 4, predict(LinearModel, newdata = TestData), 8)
  
  cat(STDOUT)
}

The model fits a simple linear regression to the observations where the charging time is less than 4 hours; for anything charged 4 or more hours it simply predicts a full battery of 8 hours. In fact the training data in Table 1 suggests battery life is exactly twice the charging time, capped at 8 hours, so the fitted line recovers a slope of 2.

We can input some new data as follows:

TestData <- c(5, 8, 1, 0.2, 3.5, 5.6, 4.32)
LinearPredictionModel(TestData)
## 8 8 2 0.4 7 8 8

Problem 2: House Price Prediction Problem using XGBoost

In this problem I want to train an XGBoost (Extreme Gradient Boosting) model to predict what a house will sell for based on some of its characteristics. There is a house price prediction problem on Hackerrank here, but I used the HousePrices data that ships with the AER package instead.

library(AER)
library(car)
library(dplyr)
library(xgboost)
library(tidyquant)
library(Metrics)

Load in the data and convert factors to numeric for easier pre-processing.

data("HousePrices")

data <- HousePrices %>%
  mutate_if(is.factor, as.numeric)
Table 2: House Price Data
price lotsize bedrooms bathrooms stories driveway recreation fullbase gasheat aircon garage prefer
42000 5850 3 1 2 2 1 2 1 1 1 1
38500 4000 2 1 1 2 1 1 1 1 0 1
49500 3060 3 1 1 2 1 1 1 1 0 1
60500 6650 3 1 2 2 2 1 1 1 0 1
61000 6360 2 1 1 2 1 1 1 1 0 1
66000 4160 3 1 1 2 2 2 1 2 0 1

Split the data into 80% training and 20% testing.

####################################################################################
# For this example split between train and test
smp_size <- floor(nrow(data) * 0.80)

train_ind <- sample(seq_len(nrow(data)), size = smp_size)

train <- data[train_ind, ]
test <- data[-train_ind, ]

paste("Train Dimensions"); dim(train)
## [1] "Train Dimensions"
## [1] 436  12
paste("Test Dimensions"); dim(test)
## [1] "Test Dimensions"
## [1] 110  12

I create a function that takes the training and testing data, trains a model and outputs the predictions. I omit the cross-validation and hyper-parameter tuning here, but I will soon write a post detailing how I use a grid search to find the optimal parameters of an XGBoost model (it takes some explaining of the functions I create); a rough sketch of the idea appears at the end of this problem.

####################################################################################
predictHouseSales <- function(train_data, test_data){
  # Separate the predictors from the price target
  x_train <- subset(train_data, select = c(-price))
  y_train <- subset(train_data, select = c(price)) %>% pull(price)
  
  x_test <- subset(test_data, select = c(-price))
  y_test <- subset(test_data, select = c(price)) %>% pull(price)
  
  # xgboost works on its own DMatrix format
  dtrain <- xgb.DMatrix(data = as.matrix(x_train), label = y_train, missing = NaN)
  dtest <- xgb.DMatrix(data = as.matrix(x_test), missing = NaN)
  
  params <- list(
    "eta" = 0.1, 
    "max_depth" = 5, 
    "objective" = "reg:linear",   # called "reg:squarederror" in newer xgboost versions
    "eval_metric" = "rmse"
    )
  
  # Train for 100 boosting rounds and return predictions on the test set
  xgb.model <- xgb.train(params, dtrain, nrounds = 100)
  
  return(predict(xgb.model, newdata = dtest))
}

myPredictions <- predictHouseSales(train, test)

We can inspect some of the predictions and compare them to the observed prices below:

####################################################################################
myPredictions %>%
  data.frame() %>%
  setNames(c("myPreds")) %>%
  mutate(myPreds = round(myPreds, 0)) %>%
  bind_cols(test) %>%
  select(myPreds, price) %>%
  sample_n(5) %>%
  kable(caption = "Compare predictions to the observed") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 12)
Table 3: Compare predictions to the observed
myPreds price
49942 39000
117177 115442
65904 34000
65651 42000
48866 65000

Compute the Root Mean Square Error:

####################################################################################
myPredictions %>%
  data.frame() %>%
  setNames(c("myPreds")) %>%
  mutate(myPreds = round(myPreds, 0)) %>%
  bind_cols(test) %>%
  select(myPreds, price) %>%
  summarise(rmse(myPreds, price)) %>%
  kable(caption = "Compute the RMSE") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 12)
Table 4: Compute the RMSE
rmse(myPreds, price)
16266.12

We can also plot the predictions against the observed using ggplot2:

####################################################################################
myPredictions %>%
  data.frame() %>%
  setNames(c("myPreds")) %>%
  mutate(myPreds = round(myPreds, 0)) %>%
  bind_cols(test) %>%
  select(myPreds, price) %>%
  ggplot(aes(x = price, y = myPreds)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  ggtitle("My predictions vs Actual Price") +
  theme_tq()

The model does okay without any feature engineering or grid-search parameter optimisation.
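
The grid search deserves its own post, but as a teaser here is a minimal sketch of the idea, assuming the train data frame from the split above; the candidate values for eta and max_depth are illustrative choices, not tuned ones:

####################################################################################
# A few (eta, max_depth) candidates scored by 5-fold cross-validated RMSE
grid <- expand.grid(eta = c(0.05, 0.1, 0.3), max_depth = c(3, 5, 7))

dtrain <- xgb.DMatrix(data = as.matrix(subset(train, select = c(-price))),
                      label = train$price, missing = NaN)

cv_rmse <- apply(grid, 1, function(g){
  cv <- xgb.cv(params = list(eta = g[["eta"]], max_depth = g[["max_depth"]],
                             objective = "reg:linear", eval_metric = "rmse"),
               data = dtrain, nrounds = 100, nfold = 5, verbose = 0)
  # Best cross-validated RMSE reached over the boosting rounds
  min(cv$evaluation_log$test_rmse_mean)
})

# The parameter pair with the lowest cross-validated RMSE
grid[which.min(cv_rmse), ]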

Problem 3: Imputing missing values - cross section

For simplicity I load in the iris dataset and create some random missing values, since the iris data doesn't have any.

# Load in the data
library(missForest)
data(iris)
data <- iris
Table 5: Iris dataset
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa

I generate some random missing data: 20% of the values are Missing at Random (MAR) or Missing Completely at Random (MCAR). This is an important issue for economists, since much of the data we collect contains missing values, whether because surveys are not filled in correctly or because the data is simply not collected very well. The model below lets us impute the missing values using machine learning, avoiding simple averaging or dropping the observations altogether.
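
The chunk that actually generated the missing values is not shown below; here is a minimal sketch using prodNA() from missForest, assuming 20% of all values are knocked out completely at random (the seed and the iris.mis name are illustrative):

print("Create some random missing data")

set.seed(81)                          # illustrative seed, not necessarily the original
iris.mis <- prodNA(iris, noNA = 0.2)  # remove 20% of all values at random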

## [1] "Create some random missing data"
Table 6: Iris dataset with random missing values
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 NA setosa
NA 3.2 1.3 0.2 setosa
4.6 3.1 NA 0.2 setosa
5.0 3.6 NA 0.2 setosa
5.4 3.9 NA 0.4 setosa

Okay, now we can train some simple missing value imputation models on the data.

Model 1: randomForest Imputation:

I won't discuss the models in depth here, but a random forest uses a collection of decision trees and averages the results across all trees, which produces some powerful predictions.
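
The imputation call itself is also not shown; a minimal sketch, assuming the data with missing values sits in iris.mis from the step above:

# Fit a random forest per variable and fill in the NAs; the completed
# dataset is returned in iris.imp$ximp
iris.imp <- missForest(iris.mis)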

##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!
##   missForest iteration 6 in progress...done!
##   missForest iteration 7 in progress...done!
Table 7: Iris dataset
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.100000 3.5 1.400000 0.2000000 setosa
4.900000 3.0 1.400000 0.1415167 setosa
4.722635 3.2 1.300000 0.2000000 setosa
4.600000 3.1 1.197250 0.2000000 setosa
5.000000 3.6 1.409930 0.2000000 setosa
5.400000 3.9 1.419667 0.4000000 setosa

Let's combine all the data, then filter it down to the actual values, the missing values and the imputed predictions.

For the Sepal.Length data
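
The code behind Table 8 is not shown either; one way to build the comparison, assuming iris.mis and iris.imp as above:

# Rows where Sepal.Length was removed, with the imputed value next to the true one
missing_rows <- which(is.na(iris.mis$Sepal.Length))

data.frame(Predicted_Sepal = round(iris.imp$ximp$Sepal.Length[missing_rows], 1),
           Actual_Sepal    = iris$Sepal.Length[missing_rows])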

Table 8: Compare Sepal Observations
Predicted_Sepal Actual_Sepal
6.3 6.9
5.1 5.0
7.0 6.3
5.7 5.6
5.1 5.1
6.2 6.5
4.7 4.8
6.0 5.8
6.0 6.9
5.6 5.8

Which isn't bad, given that the iris data contains only 150 observations and 20% of the values were removed. Using random forest models to impute the data beats simple averaging on some datasets; of course we still have errors, but the actual observations are reasonably close to the predicted values. I only analyse the Sepal.Length column here, but the same comparison applies to the Sepal.Width, Petal.Length, Petal.Width and Species columns, since missForest imputes categorical variables as well.
