
Using the data generated in Lab 1, we can compare the prediction performance of two traditional machine learning methods, LASSO and random forest.


Load required packages and helpful functions

Required packages:

tidyverse: data manipulation

glmnet: LASSO regression

ranger: random forest

pROC: calculating AUC

Helper script:

cv.R: cross-validation split of sample indices

library(tidyverse)
library(glmnet)
library(ranger)
library(pROC)

source("cv.R")

Read data

We read the outcome and predictor data generated in Lab 1 into R, and then

(1) replace the special characters in the column names of predictor with underscores;

(2) sort the outcomes by prevalence, so that the first column of outcome is the most prevalent one.

The final identical() call checks that the rows of outcome and predictor are aligned.

outcome = readRDS("Data/outcome.rds")
predictor = readRDS("Data/predictor.rds")

colnames(predictor) = str_replace_all(colnames(predictor), "-", "_")
colnames(predictor) = str_replace_all(colnames(predictor), "/", "_")
colnames(predictor) = str_replace_all(colnames(predictor), "%", "_")

prev = apply(outcome, 2, sum)
prev = sort(prev, decreasing = T)
outcome = outcome[,match(names(prev), colnames(outcome))]
identical(rownames(outcome), rownames(predictor))

Select an outcome of interest

We can use the diagnosis dictionary D_ICD_DIAGNOSES.csv to map ICD-9 codes to diagnosis names, and then select any outcome of interest from the data frame dict_outcome. The ith row of dict_outcome corresponds to the ith column of outcome.

dict_ICD = read_csv("D_ICD_DIAGNOSES.csv")
dict_outcome = tibble(ICD9_CODE = str_remove_all(colnames(outcome), "ICD_")) %>% 
  left_join(dict_ICD) %>% 
  select(-ROW_ID)
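
For example, since hypertension (ICD-9 code 4019) is the most prevalent outcome, its diagnosis name should appear in the first row of dict_outcome:

# first row = most prevalent outcome; look up its diagnosis name
dict_outcome[1, ]
dict_outcome %>% filter(ICD9_CODE == "4019")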

Prediction of hypertension

First we find the cases and controls from the outcome matrix, given a column number i. Hypertension (ICD-9 code 4019) is the most prevalent outcome, so we set i = 1. You may change the value of i to study a different outcome.

We then split the dataset 50/50, stratified by cases and controls, into training data (used for cross-validation and hyperparameter tuning) and new data (used to test model performance). The predictor matrices train_X (training data) and new_X (new data) are stored as sparse matrices.

i = 1
colnames(outcome)[i] == "ICD_4019"
y = outcome[,i]
case_idx = which(y==1)
ctrl_idx = which(y==0)

# split training data (for cv) and new data (for test)
set.seed(2021)
train_case_idx = sample(case_idx, length(case_idx)/2)
train_ctrl_idx = sample(ctrl_idx, length(ctrl_idx)/2)
new_case_idx = setdiff(case_idx, train_case_idx)
new_ctrl_idx = setdiff(ctrl_idx, train_ctrl_idx)

train_idx = c(train_case_idx, train_ctrl_idx)
new_idx = c(new_case_idx, new_ctrl_idx)

train_X = Matrix(predictor[train_idx,], sparse = TRUE)
new_X = Matrix(predictor[new_idx,], sparse = TRUE)
train_y = y[train_idx]
new_y = y[new_idx]
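
As a quick sanity check, the stratified split should give the training and new data nearly identical case proportions:

# sizes and case proportions of the two splits
c(train = length(train_y), new = length(new_y))
c(train = mean(train_y), new = mean(new_y))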

LASSO

We use LASSO as the regression-based method because it performs feature selection, keeping only the predictors that are most related to the response. LASSO has one hyperparameter, lambda, that needs to be tuned. We use the function cv.glmnet in the package glmnet to run cross-validation and tune lambda. The argument alpha = 1 specifies the LASSO (L1) penalty; you can try other values of alpha to fit an elastic net instead.

After selecting the best lambda on the training data, we test the final model on the new data from the previous step.

lasso_cv = cv.glmnet(x = train_X, y = train_y, family = "binomial", alpha = 1, nfolds = 5)
lasso_coef = as.matrix(coef(lasso_cv, s = lasso_cv$lambda.min))
lasso_pred = predict(lasso_cv, newx = new_X, s = lasso_cv$lambda.min, type = "response")
lasso_AUC = auc(new_y~lasso_pred, quiet=T)
lasso_AUC
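
Since LASSO performs feature selection, we can also look at which predictors receive non-zero coefficients at lambda.min, for example the ten largest in absolute value:

# non-zero coefficients at lambda.min (dropping the intercept)
selected = lasso_coef[lasso_coef[, 1] != 0, , drop = FALSE]
selected = selected[rownames(selected) != "(Intercept)", , drop = FALSE]

# ten largest coefficients in absolute value
head(selected[order(abs(selected[, 1]), decreasing = TRUE), , drop = FALSE], 10)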

Random forest

As with LASSO, we use the training data to cross-validate the random forest, but this time we implement the cross-validation ourselves. The hyperparameter to tune is mtry, the number of variables considered as split candidates at each node (the default is the square root of the number of variables). We try several values of mtry and select the best one based on the AUC on the validation folds. The table res collects the cross-validation results.

p = ncol(predictor)
# candidate mtry values around the default sqrt(p), rounded to integers
m_list = round(c(sqrt(p) - 20, sqrt(p) - 10, sqrt(p), sqrt(p) + 10, sqrt(p) + 20))
nfold = 5
train_case_idx = which(train_y==1)
train_ctrl_idx = which(train_y==0)
case_cv = cross_validation(train_case_idx, nfold)
ctrl_cv = cross_validation(train_ctrl_idx, nfold)

res = NULL
for (j in 1:nfold) {
  cv_valid_idx = c(case_cv[[j]], ctrl_cv[[j]])
  cv_train_idx = setdiff(1:length(train_y), cv_valid_idx)
  
  cv_train_X = Matrix(train_X[cv_train_idx,], sparse = TRUE)
  cv_valid_X = Matrix(train_X[cv_valid_idx,], sparse = TRUE)
  cv_train_y = train_y[cv_train_idx]
  cv_valid_y = train_y[cv_valid_idx]
  
  for (m in m_list) {
    RF_model = ranger(x = cv_train_X, y = cv_train_y, num.trees = 100, mtry = m, probability = TRUE)
    cv_pred = predict(RF_model, data = cv_valid_X)$predictions[,2]
    cv_AUC = auc(cv_valid_y~cv_pred, quiet=T)
    tmp = tibble(cv=j, m=m, AUC=cv_AUC)
    res = rbind(res, tmp)
  }
}

After selecting the best mtry on the training data, we refit the model on the full training data and test it on the new data.

cv_summary = res %>% 
  group_by(m) %>% 
  summarise(AUC = mean(AUC))

m_best = cv_summary$m[which.max(cv_summary$AUC)]
RF_model = ranger(x = train_X, y = train_y, num.trees = 100, mtry = m_best, probability = TRUE)
RF_pred = predict(RF_model, data = new_X)$predictions[,2]
RF_AUC = auc(new_y~RF_pred, quiet=T)
RF_AUC

Finally, we can compare LASSO and random forest using lasso_AUC and RF_AUC.
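
For example, we can put the two test-set AUCs side by side and, optionally, overlay the ROC curves with pROC:

# side-by-side comparison of the test-set AUCs
tibble(method = c("LASSO", "random forest"),
       AUC = c(as.numeric(lasso_AUC), as.numeric(RF_AUC)))

# overlay the two ROC curves
plot(roc(new_y, as.vector(lasso_pred), quiet = TRUE), col = "blue")
plot(roc(new_y, RF_pred, quiet = TRUE), col = "red", add = TRUE)
legend("bottomright", legend = c("LASSO", "random forest"),
       col = c("blue", "red"), lty = 1)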
