Developing a natural language processing system using transformer-based models for adverse drug event detection in electronic health records

Jingyuan Wu, Xiaodi Ruan, Elizabeth McNeer, Katelyn M. Rossow, Leena Choi
medRxiv - Health Informatics. Published 2024-07-10. DOI: 10.1101/2024.07.09.24310100 (https://doi.org/10.1101/2024.07.09.24310100)

Abstract

Objective: To develop a transformer-based natural language processing (NLP) system for detecting adverse drug events (ADEs) from clinical notes in electronic health records (EHRs).

Materials and Methods: We fine-tuned BERT Short-Formers and Clinical-Longformer using the processed dataset of the 2018 National NLP Clinical Challenges (n2c2) shared task Track 2. We investigated two data processing methods, a window-based approach and a split-based approach, to find the optimal one. We evaluated the generalization capabilities of the models on a dataset extracted from Vanderbilt University Medical Center (VUMC) EHRs.

Results: On the n2c2 dataset, the best average macro F-scores of 0.832 and 0.868 were achieved using a 15-word window with PubMedBERT and a 10-chunk split with Clinical-Longformer, respectively. On the VUMC dataset, the best average macro F-scores of 0.720 and 0.786 were achieved using a 4-chunk split with PubMedBERT and Clinical-Longformer, respectively.

Discussion: Our study provided a comparative analysis of data processing methods. The fine-tuned transformer models performed well on ADE-related tasks. In particular, the Clinical-Longformer model with the split-based approach showed great potential for practical implementation of ADE detection. While the token limit was crucial, the chunk size also significantly influenced model performance, even when the text length was within the token limit.

Conclusion: We provided guidance on model development, including data processing methods, for ADE detection from clinical notes using transformer-based models. Our results on two datasets indicated that data processing methods and models should be carefully selected based on the type of clinical notes and the trade-offs in allocating human and computational resources to annotation and model fine-tuning.
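The abstract names the window-based and split-based approaches without defining them. The sketch below shows one plausible reading, offered as an assumption rather than the authors' implementation: a window keeps a fixed number of words on each side of a target mention, and a split divides a note into a fixed number of contiguous, roughly equal chunks. The function names, the whitespace tokenization, and the example note are all hypothetical.

```python
from typing import List


def window_context(words: List[str], start: int, end: int, size: int) -> List[str]:
    """Return the mention words[start:end] plus up to `size` words on each side.

    Assumed reading of a "15-word window": size=15 words of context per side.
    """
    left = max(0, start - size)
    right = min(len(words), end + size)
    return words[left:right]


def split_into_chunks(words: List[str], n_chunks: int) -> List[List[str]]:
    """Split words into n_chunks contiguous pieces whose lengths differ by at most one.

    Assumed reading of a "4-chunk split": the note is cut into 4 pieces.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    base, extra = divmod(len(words), n_chunks)
    chunks, pos = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks absorb the remainder, one word each.
        length = base + (1 if i < extra else 0)
        chunks.append(words[pos:pos + length])
        pos += length
    return chunks


# Made-up note, tokenized on whitespace for illustration only.
note = "patient developed a rash after starting amoxicillin for sinusitis last week".split()

# Hypothetical mention "rash" at word index 3, with a 2-word window.
print(window_context(note, 3, 4, size=2))
print(split_into_chunks(note, n_chunks=4))
```

Under this reading, either function yields shorter text units that can be tokenized and passed to a model separately. That matches the abstract's observation that chunk size can matter even when the full text already fits within the token limit.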