Predictive analytics of MIMIC database - Lab 3: patient similarity and PsDF
We will conduct a simple simulation study to visualize patient similarity, and then use PsDF algorithm and random forest to predict diagnoses.
Load required packages and helpful functions
Required packages:
tidyverse
: data manipulation
ranger
: random forest
pROC
: calculating AUC
gplots
: heatmap plot
SNFtool
: similarity calculation
cv.RF.R
: a helper function for random forest cross validation
PsDF.R
: PsDF prediction
library(tidyverse)
library(ranger)
library(pROC)
library(gplots)
library(SNFtool)
source("cv.RF.R")
source("PsDF.R")
A simulation study for patient similarity
We generate 50 binary features for 100 cases and 100 controls, where the features in cases and controls follow different distributions. The matrix data_X
combines cases and controls. After standardization, we compute the euclidean distance matrix, for any pair of patients. Then, we compute the similarity matrix (affinity matrix) from the distance. A heatmap for the similarity matrix is showed, where the darker color (blue) represents a higher similarity, while the white color represents a lower similarity. We will see that cases are grouped together and controls are also grouped together.
set.seed(2021)
n = 200
p = 50
case_X = matrix(rbinom(n/2*p, size = 1, prob = 0.6), nrow = n/2)
ctrl_X = matrix(rbinom(n/2*p, size = 1, prob = 0.2), nrow = n/2)
data_X = rbind(case_X, ctrl_X)
std_X = as.matrix(standardNormalization(data_X))
dist_X = as.matrix(dist(std_X, method = "euclidean"))
smlt_X = affinityMatrix(dist_X)
diag(smlt_X) = NA
hm_col = colorRampPalette(c("white", "dodgerblue3", "dodgerblue4"))
sample_col = c(rep("orange",n/2), rep("purple",n/2))
heatmap.2(smlt_X, srtCol=15, cexCol=1, Colv=F, Rowv=F,
labRow=c(rep(NA, 200)), labCol=c(rep(NA, 200)),
RowSideColors=sample_col, col=hm_col, dendrogram = "none", trace = "none",
main="Similarity between 100 cases and 100 controls")
par(lend = 1)
legend("top", legend=c("cases", "controls"), col=c("orange", "purple"), lty= 1, lwd=10)
Read data and select an outcome of interest
We read the previous data outcome
, predictor
and D_ICD_DIAGNOSES.csv
into R, and do the same things as Lab 2 to rank the outcomes by their prevalence.
outcome = readRDS("Data/outcome.rds")
predictor = readRDS("Data/predictor.rds")
colnames(predictor) = str_replace_all(colnames(predictor), "-", "_")
colnames(predictor) = str_replace_all(colnames(predictor), "/", "_")
colnames(predictor) = str_replace_all(colnames(predictor), "%", "_")
prev = apply(outcome, 2, sum)
prev = sort(prev, decreasing = T)
outcome = outcome[,match(names(prev), colnames(outcome))]
identical(rownames(outcome), rownames(predictor))
dict_ICD = read_csv("D_ICD_DIAGNOSES.csv")
dict_outcome = tibble(ICD9_CODE = str_remove_all(colnames(outcome), "ICD_")) %>%
left_join(dict_ICD) %>%
select(-ROW_ID)
Prediction of hypertensive encephalopathy
We do similar things as Lab 2 to find the cases and controls of hypertensive encephalopathy with i=986
. The only differece is that we are going to randomly select 200 controls, instead of using all controls. Because PsDF requires a large RAM to compute similarity, and RAM in RStudio Cloud is limited. We then 50/50 split the the dataset on cases and controls separately, into training data and new data.
set.seed(2021)
i = 986
colnames(outcome)[i] == "ICD_4372"
y = outcome[,i]
case_idx = which(y==1)
ctrl_idx = which(y==0)
ctrl_idx = sample(ctrl_idx, 200)
# split training data (for cv) and new data (for test)
train_case_idx = sample(case_idx, length(case_idx)/2)
train_ctrl_idx = sample(ctrl_idx, length(ctrl_idx)/2)
new_case_idx = setdiff(case_idx, train_case_idx)
new_ctrl_idx = setdiff(ctrl_idx, train_ctrl_idx)
train_idx = c(train_case_idx, train_ctrl_idx)
new_idx = c(new_case_idx, new_ctrl_idx)
train_X = predictor[train_idx,]
new_X = predictor[new_idx,]
train_y = y[train_idx]
new_y = y[new_idx]
Random forest
We first try random forest. But this time we will use a helper function cv.RF
to perform cross validation. The function cv.RF
is based on codes from Lab 2.
p = ncol(predictor)
m_list = c(sqrt(p)-20, sqrt(p)-10, sqrt(p), sqrt(p)+10, sqrt(p)+20)
nfold = 5
RF_cv = cv.RF(train_X=train_X, train_y=train_y, m_list=m_list, nfold=nfold)
RF_cv$m_best
RF_pred = predict(RF_cv$model, data = new_X)$predictions[,2]
RF_AUC = auc(new_y~RF_pred, quiet=T)
RF_AUC
PsDF prediction
We next use PsDF to conduct prediction. The function PsDF
needs the entire data matrix and the index of test samples as inputs, so first we need to combine train_X
and new_X
, as well as train_y
and new_y
. If the sample size is large, it may take time for PsDF to finish the prediction.
data_X = rbind(train_X, new_X)
data_y = c(train_y, new_y)
test_idx = (length(train_y)+1):length(data_y)
PsDF_pred = PsDF(predictor = data_X, y = data_y, test_idx = test_idx)
PsDF_AUC = auc(new_y~PsDF_pred, quiet=T)
PsDF_AUC
Finally, we can compare random forest and PsDF using RF_AUC
and PsDF_AUC
.
Reference
- Guo, J., Yuan, C., Shang, N., Zheng, T., Bello, N. A., Kiryluk, K., Weng, C., & Wang, S. (2021). Similarity-based health risk prediction using domain fusion and electronic health records data. Journal of biomedical informatics, 116, 103711.