Preprocessing of natural language process variables using a data-driven method improves the association with suicide risk in a large veterans affairs population

IF 7 | Zone 2 (Medicine) | JCR Q1 (Biology) | Computers in Biology and Medicine | Pub Date: 2025-03-05 | DOI: 10.1016/j.compbiomed.2025.109939

Siting Li, Maxwell Levis, Monica DiMambro, Weiyi Wu, Joshua Levy, Brian Shiner, Jiang Gui

Computers in Biology and Medicine, Volume 189, Article 109939. https://www.sciencedirect.com/science/article/pii/S0010482525002902

Citations: 0

Abstract

Objective

Suicide risk assessment has historically relied heavily on clinical evaluations and patient self-reports. Natural language processing (NLP) of electronic health records (EHRs) provides an alternative approach for extracting risk predictors from clinical notes. Modeling NLP variables, however, is challenging because of zero inflation and skewed distributions. Therefore, we evaluated whether an adaptive-mixture-categorization (AMC) method could improve the capacity of NLP data extracted from Veterans Affairs (VA) EHR notes to predict suicide risk.
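To make the zero-inflation problem concrete, the following minimal sketch shows how quartile cut points can collapse when most values of a variable are zero. The data are synthetic and the helper names (`make_zero_inflated`, `assign_category`) are hypothetical; nothing here comes from the study's data or code.

```python
import bisect
import statistics


def make_zero_inflated(n_zero, n_positive):
    """Synthetic stand-in for an NLP score: many zeros plus a right-skewed tail."""
    return [0.0] * n_zero + [(i / 10) ** 2 for i in range(1, n_positive + 1)]


def assign_category(value, cut_points):
    """Category index = number of cut points strictly below the value."""
    return bisect.bisect_left(cut_points, value)


# Assumed proportions for illustration only: 80% zeros, 20% positive values.
scores = make_zero_inflated(n_zero=800, n_positive=200)

# Quartile boundaries (25th, 50th and 75th percentiles).
cuts = statistics.quantiles(scores, n=4, method="inclusive")

# Map every score to a quartile-based category.
categories = [assign_category(s, cuts) for s in scores]

print("quartile cut points:", cuts)
print("distinct categories actually used:", sorted(set(categories)))
```

Because all three quartile boundaries fall on zero in this synthetic sample, the nominal four categories reduce to two, and the whole skewed positive tail ends up in a single group.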

Methods

NLP variables for 25,342 patients were analyzed using the SÉANCE Python package. The AMC method was employed to categorize NLP measures into distinct groups so as to maximize the between-category variance. Associations between suicide outcomes and AMC-categorized NLP variables were compared with those obtained using the original (uncategorized) and quantile-categorized NLP variables.
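The abstract does not describe the AMC algorithm itself, so the sketch below is only a simplified stand-in for the stated objective: it searches for the single cut point that maximizes between-category variance on a small synthetic sample. The function names and the data are hypothetical, and a two-group exhaustive search is an assumption made for brevity, not the authors' procedure.

```python
def between_category_variance(ordered, split):
    """Between-group variance for the groups ordered[:split] and ordered[split:]."""
    n = len(ordered)
    grand_mean = sum(ordered) / n
    left, right = ordered[:split], ordered[split:]
    mean_left = sum(left) / len(left)
    mean_right = sum(right) / len(right)
    return (
        len(left) * (mean_left - grand_mean) ** 2
        + len(right) * (mean_right - grand_mean) ** 2
    ) / n


def best_binary_cut(values):
    """Exhaustively find the cut point that maximizes between-category variance."""
    ordered = sorted(values)
    best_split, best_score = None, -1.0
    for split in range(1, len(ordered)):
        if ordered[split] == ordered[split - 1]:
            continue  # a cut cannot separate tied values
        score = between_category_variance(ordered, split)
        if score > best_score:
            best_split, best_score = split, score
    if best_split is None:
        raise ValueError("all values are identical; no cut point exists")
    # Place the cut halfway between the two neighbouring observations.
    cut_point = (ordered[best_split - 1] + ordered[best_split]) / 2
    return cut_point, best_score


# Synthetic zero-inflated, right-skewed sample (illustrative values only).
scores = [0.0] * 6 + [0.5, 0.8, 6.0, 7.5]
cut, score = best_binary_cut(scores)
print("variance-maximizing cut point:", cut)
print("between-category variance:", round(score, 3))
```

On this sample the median is zero, so a median split can only separate zeros from non-zeros, whereas the variance-based search places the cut inside the positive tail. Extending the idea to more than two categories would require a search over several cut points, which is omitted here.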

Results

In both the full-cohort analysis and the sensitivity analyses based on bootstrap subsampling, AMC-categorized variables showed stronger associations with suicide risk than the original or quantile-categorized variables. Additionally, over 90% of the NLP variables were significantly associated with suicide risk in univariate analyses, indicating the relevance of clinical notes to suicide prevention.

Conclusion

AMC-based categorization substantially enhanced the capacity of NLP variables extracted from clinical text to predict suicide risk. Transforming skewed NLP data with the AMC method holds promise for improving risk prediction models.


Source journal

Computers in Biology and Medicine (Engineering & Technology: Biomedical Engineering)

CiteScore: 11.70

Self-citation rate: 10.40%

Articles published: 1086

Review time: 74 days

About the journal: Computers in Biology and Medicine is an international forum for sharing groundbreaking advancements in the use of computers in bioscience and medicine. This journal serves as a medium for communicating essential research, instruction, ideas, and information regarding the rapidly evolving field of computer applications in these domains. By encouraging the exchange of knowledge, we aim to facilitate progress and innovation in the utilization of computers in biology and medicine.