OBJECTIVE : Clinically derived variables from different time frames exhibit varying levels of importance for predicting self-harm. Variable importance analyses can inform whether and how to implement risk prediction models into clinical practice given real-world data limitations. These analyses be applied more broadly in biomedical informatics research to provide insight into general clinical risk prediction tasks. : Self-harm risk prediction models developed using health system data (electronic health records and insurance claims information) often use patient information from up to several years prior to the index visit when the prediction is made. Measurements from some time periods may not be available for all patients. Using the framework of algorithm-agnostic variable importance, we study the predictive potential of variables corresponding to different time horizons prior to the index visit and demonstrate the application of variable importance techniques in the biomedical informatics setting.
MATERIALS AND METHODS : We use variable importance to quantify the potential of recent (up to three months before the index visit) and distant (more than one year before the index visit) patient mental health information for predicting self-harm risk using data from seven health systems. We quantify importance as the decrease in predictiveness when the variable set of interest is excluded from the prediction task. We define predictiveness using discriminative metrics: area under the receiver operating characteristic curve (AUC), sensitivity, and positive predictive value.
RESULTS : Mental health predictors corresponding to the three months prior to the index visit show strong signal of importance; in one setting, excluding these variables decreased AUC from 0.85 to 0.77. Predictors corresponding to more distant information were less important.
DISCUSSION : Predictors from the months immediately preceding the index visit are highly important. Implementation of self-harm prediction models may be challenging in settings where recent data are not completely available (e.g., due to lags in insurance claims processing) at the time a prediction is made.
CONCLUSION