Natural language processing and machine learning of electronic health records for prediction of first-time suicide attempts

Fuchiang R Tsui; Lingyun Shi; Victor Ruiz; Neal D Ryan; Candice Biernesser; Satish Iyengar; Colin G Walsh; David A Brent

doi:10.1093/jamiaopen/ooab011

JAMIA Open. 2021 Mar 17;4(1):ooab011. doi: 10.1093/jamiaopen/ooab011. eCollection 2021 Jan.

Authors

Fuchiang R Tsui¹ ² ³ ⁴, Lingyun Shi¹ ³, Victor Ruiz¹ ³, Neal D Ryan⁵, Candice Biernesser⁵, Satish Iyengar⁶, Colin G Walsh⁷, David A Brent⁵

Affiliations

¹ Tsui Laboratory, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, USA.
² Department of Anesthesiology and Critical Care Medicine, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, USA.
³ Department of Biomedical and Health Informatics, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, USA.
⁴ Department of Anesthesiology and Critical Care, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.
⁵ Department of Psychiatry, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania, USA.
⁶ Department of Statistics, School of Arts and Sciences, University of Pittsburgh, Pittsburgh, Pennsylvania, USA.
⁷ Department of Biomedical Informatics, School of Medicine, Vanderbilt University, Nashville, Tennessee, USA.

Abstract

Objective: Limited research exists on predicting first-time suicide attempts, which account for two-thirds of suicide decedents. We aimed to predict first-time suicide attempts using a large data-driven approach that applies natural language processing (NLP) and machine learning (ML) to unstructured (narrative) clinical notes and structured electronic health record (EHR) data.

Methods: This case-control study included patients aged 10-75 years who were seen in emergency departments and inpatient units between 2007 and 2016. Cases were first-time suicide attempts identified from coded diagnoses; controls were patients without suicide attempts, randomly selected regardless of demographics at a ratio of nine controls per case. Four data-driven ML models were evaluated using 2-year historical EHR data prior to suicide attempt or control index visits, with prediction windows from 7 to 730 days. Patients without any historical notes were excluded. Model accuracy and robustness were evaluated on a blind dataset (30% of the cohort).
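As an illustration of the evaluation design described above, the following minimal sketch holds out 30% of a toy cohort as a blind test set while preserving the nine-controls-per-case ratio. It is not the authors' code: the function name `stratified_holdout`, the use of stratification, the random seed, and the toy cohort size are all assumptions made for this example, and the abstract does not state how the 30% split was drawn.

```python
import random


def stratified_holdout(labels, test_fraction=0.3, seed=0):
    """Split indices into train/test sets, keeping the class ratio in each.

    Hypothetical helper for illustration; stratifying by label is an
    assumption, not a detail reported in the abstract.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy cohort: 100 cases (1) and 900 controls (0), i.e. nine controls per case.
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_holdout(labels)
test_cases = sum(labels[i] for i in test_idx)
test_controls = len(test_idx) - test_cases
```

In this toy setup the held-out set keeps the same case-to-control ratio as the full cohort, so performance measured on it is not distorted by a different class balance.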

Results: The study cohort included 45 238 patients (5099 cases, 40 139 controls), with 54 651 variables derived from 5.7 million structured records and 798 665 notes. Using both unstructured and structured data resulted in significantly greater accuracy compared to structured data alone (area-under-the-curve [AUC]: 0.932 vs. 0.901, P < .001). The best-predicting model utilized 1726 variables with AUC = 0.932 (95% CI, 0.922-0.941). The model was robust across multiple prediction windows and across subgroups defined by demographics, point of most recent historical clinical contact, and depression diagnosis history.
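For readers unfamiliar with the AUC metric reported above, the sketch below computes it from its standard pairwise definition: the probability that a randomly chosen case receives a higher risk score than a randomly chosen control, with ties counted as one half. The function name `auc` and the six toy scores are assumptions for illustration only; the abstract does not describe how the authors computed the AUC or its confidence interval.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise case-vs-control comparisons.

    Illustrative O(n_cases * n_controls) version; ties count as 0.5.
    """
    case_scores = [s for s, y in zip(scores, labels) if y == 1]
    control_scores = [s for s, y in zip(scores, labels) if y == 0]
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Toy risk scores: three cases (label 1) followed by three controls (label 0).
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = auc(toy_scores, toy_labels)  # 8 of 9 case-control pairs ranked correctly
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of cases from controls.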

Conclusions: Our large data-driven approach using both structured and unstructured EHR data demonstrated accurate and robust first-time suicide attempt prediction, and has the potential to be deployed across various populations and clinical settings.

Keywords: electronic health records; machine learning; natural language processing; suicide attempt.

Grants and funding

P50 MH115838/MH/NIMH NIH HHS/United States