Predictive analytics of the MIMIC database - Lab 2: prediction of diagnoses
Using the data generated in Lab 1, we can compare the prediction performance of two traditional machine learning methods, LASSO and random forest.
Load required packages and helpful functions
Required packages:
- tidyverse: data manipulation
- glmnet: LASSO regression
- ranger: random forest
- pROC: calculating AUC
- cv.R: cross-validation split (a helper script sourced below, not a CRAN package)
library(tidyverse)
library(glmnet)
library(ranger)
library(pROC)
source("cv.R")   # provides the cross_validation() helper used below
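The sourced file cv.R is provided with the lab materials; later steps assume it defines the cross_validation() function, which splits a vector of row indices into nfold roughly equal folds. A minimal sketch of such a helper (an assumption about its contents, not the actual file) might look like this:
# hypothetical sketch of a cross_validation() helper like the one assumed in cv.R:
# randomly partition a vector of indices into `nfold` roughly equal folds,
# returned as a list so that fold j can serve as the validation set
cross_validation = function(idx, nfold) {
  idx = sample(idx)                                              # shuffle the indices
  fold_id = cut(seq_along(idx), breaks = nfold, labels = FALSE)  # assign folds 1..nfold
  split(idx, fold_id)                                            # list of length nfold
}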
Read data
We read the outcome and predictor data from Lab 1 into R, and then (1) remove some special characters from the column names of predictor; (2) rank the outcomes by their prevalence, so that the first column of outcome is the most prevalent one.
outcome = readRDS("Data/outcome.rds")
predictor = readRDS("Data/predictor.rds")
# replace characters that are not valid in column names
colnames(predictor) = str_replace_all(colnames(predictor), "-", "_")
colnames(predictor) = str_replace_all(colnames(predictor), "/", "_")
colnames(predictor) = str_replace_all(colnames(predictor), "%", "_")
# order the outcome columns by prevalence (most prevalent first)
prev = apply(outcome, 2, sum)
prev = sort(prev, decreasing = T)
outcome = outcome[,match(names(prev), colnames(outcome))]
# sanity check: rows of outcome and predictor refer to the same admissions
identical(rownames(outcome), rownames(predictor))
Select an outcome of interest
We can use the dictionary of diagnoses D_ICD_DIAGNOSES.csv to map ICD codes to diagnosis names, and then select any outcome of interest from the dataframe dict_outcome. The ith row of dict_outcome corresponds to the ith column of outcome.
dict_ICD = read_csv("D_ICD_DIAGNOSES.csv")
# strip the "ICD_" prefix so the codes match the dictionary, then join by ICD9_CODE
dict_outcome = tibble(ICD9_CODE = str_remove_all(colnames(outcome), "ICD_")) %>%
  left_join(dict_ICD) %>%
  select(-ROW_ID)
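For a quick look at the most prevalent diagnoses (an optional check, not required for the rest of the lab), we can print the first few rows of dict_outcome:
# the first rows correspond to the most prevalent outcome columns
head(dict_outcome)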
Prediction of hypertension
First we are going to find the cases and controls using the outcome matrix, given the column number i. Hypertension, with ICD-9 code 4019, is the most prevalent outcome, so i = 1. You may change the value of i to select a different outcome of interest.
We then split the cases and controls 50/50 into training data, used for cross validation and hyperparameter tuning, and new data, used for testing the model performance. The predictors in the training data train_X and the new data new_X are stored as sparse matrices.
i = 1
colnames(outcome)[i] == "ICD_4019"   # confirm that the most prevalent outcome is hypertension
y = outcome[,i]
case_idx = which(y==1)
ctrl_idx = which(y==0)
# split training data (for cv) and new data (for test)
set.seed(2021)
train_case_idx = sample(case_idx, length(case_idx)/2)
train_ctrl_idx = sample(ctrl_idx, length(ctrl_idx)/2)
new_case_idx = setdiff(case_idx, train_case_idx)
new_ctrl_idx = setdiff(ctrl_idx, train_ctrl_idx)
train_idx = c(train_case_idx, train_ctrl_idx)
new_idx = c(new_case_idx, new_ctrl_idx)
# store the predictors as sparse matrices
train_X = Matrix(predictor[train_idx,], sparse = TRUE)
new_X = Matrix(predictor[new_idx,], sparse = TRUE)
train_y = y[train_idx]
new_y = y[new_idx]
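As an optional sanity check, we can confirm that cases and controls are divided roughly evenly between the training data and the new data:
# counts of controls (0) and cases (1) in each split
table(train_y)
table(new_y)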
LASSO
We use LASSO as the regression-based method because it performs feature selection and will keep only the important predictors that are related to the response. LASSO has a hyperparameter called lambda that needs to be tuned. We will use the function cv.glmnet in the package glmnet to conduct cross validation and tune lambda. The argument alpha = 1 means that we use LASSO (the L1 penalty); you can also try different values of alpha to fit an elastic net.
After finding the best lambda using the training data, we test the final model on the new data created in the previous step.
# 5-fold cross validation on the training data to choose lambda
lasso_cv = cv.glmnet(x = train_X, y = train_y, family = "binomial", alpha = 1, nfolds = 5)
# coefficients and test-set predictions at the lambda with the lowest CV error
lasso_coef = as.matrix(coef(lasso_cv, s = lasso_cv$lambda.min))
lasso_pred = predict(lasso_cv, newx = new_X, s = lasso_cv$lambda.min, type = "response")
lasso_AUC = auc(new_y~lasso_pred, quiet=T)
lasso_AUC
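Since the motivation for LASSO here is feature selection, it can be instructive to look at which predictors received nonzero coefficients at lambda.min. This inspection step is an addition to the lab, assuming lasso_coef was computed as above:
# predictors with nonzero coefficients at lambda.min (the features selected by LASSO),
# ordered by the magnitude of their coefficients
selected = lasso_coef[lasso_coef[, 1] != 0, , drop = FALSE]
head(selected[order(abs(selected[, 1]), decreasing = TRUE), , drop = FALSE], 10)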
Random forest
Similar to the LASSO, we use the training data to do cross validation for the random forest; here we implement the cross validation ourselves. The hyperparameter to be tuned is mtry, the number of variables to possibly split at in each node (the default is the square root of the number of variables). We will try several values of mtry and select the best one based on the AUC on the validation data. The table res will contain the cross validation results.
p = ncol(predictor)
# candidate mtry values around the default sqrt(p)
m_list = round(sqrt(p)) + c(-20, -10, 0, 10, 20)
nfold = 5
# split cases and controls into folds separately so every fold keeps the class balance
train_case_idx = which(train_y==1)
train_ctrl_idx = which(train_y==0)
case_cv = cross_validation(train_case_idx, nfold)
ctrl_cv = cross_validation(train_ctrl_idx, nfold)
res = NULL
for (j in 1:nfold) {
  # fold j is the validation set; the remaining folds are used for training
  cv_valid_idx = c(case_cv[[j]], ctrl_cv[[j]])
  cv_train_idx = setdiff(1:length(train_y), cv_valid_idx)
  cv_train_X = Matrix(train_X[cv_train_idx,], sparse = TRUE)
  cv_valid_X = Matrix(train_X[cv_valid_idx,], sparse = TRUE)
  cv_train_y = train_y[cv_train_idx]
  cv_valid_y = train_y[cv_valid_idx]
  for (m in m_list) {
    # use a factor response for the probability forest; column 2 of the predictions is P(y = 1)
    RF_model = ranger(x = cv_train_X, y = factor(cv_train_y), num.trees = 100, mtry = m, probability = TRUE)
    cv_pred = predict(RF_model, data = cv_valid_X)$predictions[,2]
    cv_AUC = auc(cv_valid_y~cv_pred, quiet=T)
    tmp = tibble(cv=j, m=m, AUC=as.numeric(cv_AUC))
    res = rbind(res, tmp)
  }
}
After getting the best mtry using the training data, we test the final model using the new data.
# average the validation AUC over folds for each candidate mtry
m_best = res %>%
  group_by(m) %>%
  summarize(AUC = mean(AUC))
m_best = m_best$m[which.max(m_best$AUC)]
# refit on the full training data with the selected mtry, then evaluate on the new data
RF_model = ranger(x = train_X, y = factor(train_y), num.trees = 100, mtry = m_best, probability = TRUE)
RF_pred = predict(RF_model, data = new_X)$predictions[,2]
RF_AUC = auc(new_y~RF_pred, quiet=T)
RF_AUC
Finally, we can compare LASSO and random forest using lasso_AUC and RF_AUC.
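For example, we can tabulate the two test-set AUCs side by side and, optionally, test the difference between the two correlated ROC curves with DeLong's test via pROC's roc.test(); this comparison snippet is a sketch added here, not part of the original lab code.
# side-by-side comparison of the test-set AUCs
tibble(method = c("LASSO", "random forest"),
       AUC = c(as.numeric(lasso_AUC), as.numeric(RF_AUC)))
# DeLong's test for the difference between the two correlated ROC curves
roc_lasso = roc(new_y, as.numeric(lasso_pred), quiet = TRUE)
roc_rf = roc(new_y, RF_pred, quiet = TRUE)
roc.test(roc_lasso, roc_rf)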