Predictive analytics of MIMIC database - Lab 1: data preprocessing

August 14, 2021 4 minute read

I was working as a teaching assistance to provide a guide of MIMIC database and lead a hands-on computer lab session for Electronic Medical Records Boot Camp: Biostatistical methods for analyzing EMR data at Columbia University Mailman School of Public Health. These are parts of the materials:

Download data

We will use MIMIC-III Clinical Database to conduct predictive analytics. After getting credentialed access, the data is avaiable at here.

Tidyverse tutorial

We are going to perform data manipulation using tidyverse, which a collection of multiple powerful packages such as dplyr, readr, stringr, ggplot2, etc.. Basically, we will use simple “verbs” functions, and the detailed information can be found in this vignettes. Below are some functions that we use everyday:

select() changes whether or not a column is included.

rename() changes the name of columns.

mutate() changes the values of columns and creates new columns.

filter() chooses rows based on column values.

arrange() changes the order of the rows.

distinct() selects only unique/distinct rows.

na.omit() removes any rows with missing values.

group_by() converts an existing dataframe into a grouped one where operations are performed “by group”.

ungroup() removes grouping.

left_join() join two data tables based on specific columns, and all rows from the first table are kept.

The pipe operator %>% is probably the most commonly used function when we do data manipulation. It allows us to “pipe” results from the last step to the next step. In specific, for any function f() and g(), x %>% f(y) is equivalent to f(x, y), while x %>% f(y) %>% g(z) is equivalent to g(f(x, y), z). We will illustrate why this is useful in the below tutorial.

Load required packages and helpful functions

tidyverse: data manipulation

filling.R: utils functions for extracting 0-1 matrix

library(tidyverse)
source("filling.R")

Read data

We read four data tables into R: ADMISSIONS, DIAGNOSES_ICD, PRESCRIPTIONS, and LABEVENTS.

adms = read_csv("Data/ADMISSIONS.csv.gz")
dgns = read_csv("Data/DIAGNOSES_ICD.csv.gz")
prsc = read_csv("Data/PRESCRIPTIONS.csv.gz")
labt = read_csv("Data/LABEVENTS.csv.gz")

Data cleaning

In the MIMIC-III database, there are in total 58,976 admissions of 46,520 unique patients who have at least one medical record in diagnoses history, prescriptions, or lab tests. In this step, we

(1) select columns SUBJECT_ID, HADM_ID, and medical events for each table: ICD9_CODE in DIAGNOSES_ICD, FORMULARY_DRUG_CD in PRESCRIPTIONS, and ITEMID in LABEVENTS, and then only keep unique medical events within each admission;

(2) eliminate medical concepts that are rare with fewer than 50 counts across all 58,976 admissions;

(3) combine three tables together (bind the rows), and save the dataframe work_data.

dgns_new = dgns %>% 
  mutate(event = paste0("ICD_", ICD9_CODE)) %>% 
  select(SUBJECT_ID, HADM_ID, event) %>% 
  na.omit() %>% 
  distinct() %>% 
  group_by(event) %>% 
  mutate(n = n()) %>% 
  ungroup() %>% 
  filter(n>=50) %>% 
  select(-n)

prsc_new = prsc %>% 
  mutate(event = paste0("drug_", FORMULARY_DRUG_CD)) %>% 
  select(SUBJECT_ID, HADM_ID, event) %>% 
  na.omit() %>% 
  distinct() %>% 
  group_by(event) %>% 
  mutate(n = n()) %>% 
  ungroup() %>% 
  filter(n>=50) %>% 
  select(-n)

labt_new = labt %>% 
  mutate(event = paste0("lab_", ITEMID)) %>% 
  select(SUBJECT_ID, HADM_ID, event) %>% 
  na.omit() %>% 
  distinct() %>% 
  group_by(event) %>% 
  mutate(n = n()) %>% 
  ungroup() %>% 
  filter(n>=50) %>% 
  select(-n)

work_data = rbind(dgns_new, prsc_new, labt_new)
dim(work_data) # 6560717, 3

saveRDS(work_data, "Data/MIMIC_pre_data.rds")

Combine with ADMISSIONS table

In order to perform prediction, we need to identify patients with at least two admissions. In this step, we

(1) calculate how many admissions each patient has, and only keep those having at least two admissions;

(2) create a variable time_rank to indicate the admission order for each patient;

(3) combine with the previous work_data.

work_data = readRDS("Data/MIMIC_pre_data.rds")
adms = read_csv("Data/ADMISSIONS.csv.gz")

adms_new = adms %>% 
  select(SUBJECT_ID, HADM_ID, ADMITTIME) %>% 
  filter(HADM_ID %in% work_data$HADM_ID) %>% 
  na.omit() %>% 
  distinct()

adms_new = adms_new %>% 
  group_by(SUBJECT_ID) %>% 
  arrange(ADMITTIME) %>% 
  mutate(n_adms = n(),
         time_rank = row_number()) %>% 
  ungroup() %>% 
  filter(n_adms>=2)

work_data = work_data %>% 
  left_join(adms_new) %>% 
  na.omit()

Extract outcomes and predictors

We split the dataset into two parts:

(1) the outcome dataset contains all diagnoses in the last admission for each patient;

(2) the predictor dataset contains all medical events including diagnoses, prescriptions and labs in all previous admissions for each patient.

work_data_outcome = work_data %>% 
  group_by(SUBJECT_ID) %>% 
  filter(time_rank==max(time_rank)) %>% 
  ungroup()

outcome = work_data_outcome %>% 
  filter(str_detect(event, "ICD")) %>% 
  select(SUBJECT_ID, event)

work_data_predictor = work_data %>% 
  group_by(SUBJECT_ID) %>% 
  filter(time_rank!=max(time_rank)) %>% 
  ungroup()

predictor = work_data_predictor %>% 
  select(SUBJECT_ID, event)

length(unique(work_data_predictor$SUBJECT_ID)) #7519
length(unique(work_data_predictor$HADM_ID)) #12431
length(unique(work_data_predictor$event)) #3129

Convert to 0-1 matrix

In this step, we convert the outcome dataset into a matrix, where rows represent patients, columns represent features, and entries are all 0 or 1 indicating if patients have the specific condition in the last admission.

Similarly, we convert the predictor dataset into a matrix, where rows represent patients, columns represent features, and entries are the counts of how many times the patient have the specific condition in all previous admissions.

all_patient = unique(work_data_outcome$SUBJECT_ID)

outcome = filling_01_bycol(person_id = outcome$SUBJECT_ID,
                           concept_id = outcome$event,
                           all_person = all_patient)

predictor = filling_count(person_id = predictor$SUBJECT_ID,
                          concept_id = predictor$event,
                          all_person = all_patient)

identical(rownames(outcome), rownames(predictor))

Save data

We save the data for the next step of outcome prediction.

saveRDS(outcome, "Data/outcome.rds")
saveRDS(predictor, "Data/predictor.rds")

Jia Guo