Predictive analytics of MIMIC database - Lab 3: patient similarity and PsDF

August 16, 2021 3 minute read

We will conduct a simple simulation study to visualize patient similarity, and then use PsDF algorithm and random forest to predict diagnoses.

Load required packages and helpful functions

Required packages:

tidyverse: data manipulation

ranger: random forest

pROC: calculating AUC

gplots: heatmap plot

SNFtool: similarity calculation

cv.RF.R: a helper function for random forest cross validation

PsDF.R: PsDF prediction

library(tidyverse)
library(ranger)
library(pROC)
library(gplots)
library(SNFtool)

source("cv.RF.R")
source("PsDF.R")

A simulation study for patient similarity

We generate 50 binary features for 100 cases and 100 controls, where the features in cases and controls follow different distributions. The matrix data_X combines cases and controls. After standardization, we compute the euclidean distance matrix, for any pair of patients. Then, we compute the similarity matrix (affinity matrix) from the distance. A heatmap for the similarity matrix is showed, where the darker color (blue) represents a higher similarity, while the white color represents a lower similarity. We will see that cases are grouped together and controls are also grouped together.

set.seed(2021)
n = 200
p = 50
case_X = matrix(rbinom(n/2*p, size = 1, prob = 0.6), nrow = n/2)
ctrl_X = matrix(rbinom(n/2*p, size = 1, prob = 0.2), nrow = n/2)
data_X = rbind(case_X, ctrl_X)

std_X = as.matrix(standardNormalization(data_X))
dist_X = as.matrix(dist(std_X, method = "euclidean"))
smlt_X = affinityMatrix(dist_X)
diag(smlt_X) = NA

hm_col = colorRampPalette(c("white", "dodgerblue3", "dodgerblue4"))
sample_col = c(rep("orange",n/2), rep("purple",n/2))

heatmap.2(smlt_X, srtCol=15, cexCol=1, Colv=F, Rowv=F,
          labRow=c(rep(NA, 200)), labCol=c(rep(NA, 200)),
          RowSideColors=sample_col, col=hm_col, dendrogram = "none", trace = "none",
          main="Similarity between 100 cases and 100 controls")
par(lend = 1)
legend("top", legend=c("cases", "controls"), col=c("orange", "purple"), lty= 1, lwd=10)

Read data and select an outcome of interest

We read the previous data outcome, predictor and D_ICD_DIAGNOSES.csv into R, and do the same things as Lab 2 to rank the outcomes by their prevalence.

outcome = readRDS("Data/outcome.rds")
predictor = readRDS("Data/predictor.rds")

colnames(predictor) = str_replace_all(colnames(predictor), "-", "_")
colnames(predictor) = str_replace_all(colnames(predictor), "/", "_")
colnames(predictor) = str_replace_all(colnames(predictor), "%", "_")

prev = apply(outcome, 2, sum)
prev = sort(prev, decreasing = T)
outcome = outcome[,match(names(prev), colnames(outcome))]
identical(rownames(outcome), rownames(predictor))

dict_ICD = read_csv("D_ICD_DIAGNOSES.csv")
dict_outcome = tibble(ICD9_CODE = str_remove_all(colnames(outcome), "ICD_")) %>% 
  left_join(dict_ICD) %>% 
  select(-ROW_ID)

Prediction of hypertensive encephalopathy

We do similar things as Lab 2 to find the cases and controls of hypertensive encephalopathy with i=986. The only differece is that we are going to randomly select 200 controls, instead of using all controls. Because PsDF requires a large RAM to compute similarity, and RAM in RStudio Cloud is limited. We then 50/50 split the the dataset on cases and controls separately, into training data and new data.

set.seed(2021)
i = 986
colnames(outcome)[i] == "ICD_4372"
y = outcome[,i]
case_idx = which(y==1)
ctrl_idx = which(y==0)
ctrl_idx = sample(ctrl_idx, 200)

# split training data (for cv) and new data (for test)
train_case_idx = sample(case_idx, length(case_idx)/2)
train_ctrl_idx = sample(ctrl_idx, length(ctrl_idx)/2)
new_case_idx = setdiff(case_idx, train_case_idx)
new_ctrl_idx = setdiff(ctrl_idx, train_ctrl_idx)

train_idx = c(train_case_idx, train_ctrl_idx)
new_idx = c(new_case_idx, new_ctrl_idx)

train_X = predictor[train_idx,]
new_X = predictor[new_idx,]
train_y = y[train_idx]
new_y = y[new_idx]

Random forest

We first try random forest. But this time we will use a helper function cv.RF to perform cross validation. The function cv.RF is based on codes from Lab 2.

p = ncol(predictor)
m_list = c(sqrt(p)-20, sqrt(p)-10, sqrt(p), sqrt(p)+10, sqrt(p)+20)
nfold = 5

RF_cv = cv.RF(train_X=train_X, train_y=train_y, m_list=m_list, nfold=nfold)
RF_cv$m_best
RF_pred = predict(RF_cv$model, data = new_X)$predictions[,2]
RF_AUC = auc(new_y~RF_pred, quiet=T)
RF_AUC

PsDF prediction

We next use PsDF to conduct prediction. The function PsDF needs the entire data matrix and the index of test samples as inputs, so first we need to combine train_X and new_X, as well as train_y and new_y. If the sample size is large, it may take time for PsDF to finish the prediction.

data_X = rbind(train_X, new_X)
data_y = c(train_y, new_y)
test_idx = (length(train_y)+1):length(data_y)

PsDF_pred = PsDF(predictor = data_X, y = data_y, test_idx = test_idx)
PsDF_AUC = auc(new_y~PsDF_pred, quiet=T)
PsDF_AUC

Finally, we can compare random forest and PsDF using RF_AUC and PsDF_AUC.

Reference

Guo, J., Yuan, C., Shang, N., Zheng, T., Bello, N. A., Kiryluk, K., Weng, C., & Wang, S. (2021). Similarity-based health risk prediction using domain fusion and electronic health records data. Journal of biomedical informatics, 116, 103711.

Jia Guo