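Development of a Natural Language Processing (NLP) model to automatically extract clinical data from electronic health records: results from an Italian comprehensive stroke center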

IF 3.7 | Zone 2 (Medicine) | JCR Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS | International Journal of Medical Informatics | Pub Date: 2024-09-19 | DOI: 10.1016/j.ijmedinf.2024.105626
Davide Badalotti, Akanksha Agrawal, Umberto Pensato, Giovanni Angelotti, Simona Marcheselli
Volume 192, Article 105626
Citations: 0

Abstract

Development of a Natural Language Processing (NLP) model to automatically extract clinical data from electronic health records: results from an Italian comprehensive stroke center

Introduction

Data collection often relies on time-consuming manual inputs, with a vast amount of information embedded in unstructured texts such as patients’ medical records and clinical notes. Our study aims to develop a pipeline that combines active learning (AL) and NLP techniques to enhance data extraction in an acute ischemic stroke cohort.

Materials and methods

Consecutive acute ischemic stroke patients who received reperfusion therapies at IRCCS Humanitas Research Hospital were included. An Italian Bidirectional Encoder Representations from Transformers (BERT) model was trained with AL to automatically extract clinical variables from electronic health text. Simulated active learning performance was evaluated on a set of labels representing patients’ comorbidities, comparing Bayesian Active Learning by Disagreement (BALD) with random text selection. Prognostic models predicting patients’ functional outcomes using Gradient Boosting were trained on manually labelled and on semi-automatically extracted data, and their performance was compared.
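To make the BALD selection step concrete, here is a minimal sketch of how a BALD score could be computed for one binary comorbidity label. It assumes, beyond what the abstract states, that the model is run several times with stochastic forward passes (for example, dropout left on) and that each pass yields a probability for the label; the names `binary_entropy`, `bald_score`, and `select_batch` are hypothetical and do not come from the paper.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy (in nats) of a Bernoulli(p) label; defined as 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def bald_score(mc_probs):
    """BALD score for one text: entropy of the mean prediction minus the
    mean entropy of the individual stochastic predictions.

    mc_probs: probabilities for the same label from several stochastic
    forward passes (assumed setup, not confirmed by the source).
    """
    mean_p = sum(mc_probs) / len(mc_probs)
    expected_entropy = sum(binary_entropy(p) for p in mc_probs) / len(mc_probs)
    return binary_entropy(mean_p) - expected_entropy


def select_batch(pool, k):
    """Return the k text ids with the highest BALD score.

    pool: dict mapping a text id to its list of stochastic probabilities.
    """
    ranked = sorted(pool, key=lambda text_id: bald_score(pool[text_id]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    pool = {
        "note_a": [0.5, 0.5],    # passes agree on "unsure": low BALD score
        "note_b": [0.05, 0.95],  # passes disagree strongly: high BALD score
        "note_c": [0.9, 0.9],    # passes agree on "present": low BALD score
    }
    print(select_batch(pool, 1))
```

The score is high only when the individual passes are each confident but disagree with one another, which is what distinguishes BALD from plain uncertainty sampling; random selection, the baseline named above, would simply draw `k` ids from the pool.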

Results

The active learning process initially showed null performance until around 20% of texts were labelled, possibly due to root layers freezing in the BERT model; overall, however, active learning improved model learning efficiency across most comorbidities. Prognostic modelling showed no significant difference in performance between models trained on manually labelled versus semi-automatically extracted data, indicating effective prediction capabilities in both settings.
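The abstract does not say which metric or statistical test was used to compare the two prognostic models. As one illustration only, the sketch below uses a paired bootstrap on accuracy: both models are scored on the same resampled patients, and a 95% interval for the difference that contains zero is read as "no significant difference". The function names, the choice of accuracy, and the number of resamples are all assumptions.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of patients whose predicted outcome matches the true outcome."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def paired_bootstrap_diff(y_true, pred_a, pred_b, n_boot=2000, seed=0):
    """Approximate 95% interval for accuracy(pred_a) - accuracy(pred_b).

    Patients are resampled with replacement, and both models are scored on
    the same resample so that the comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_true = [y_true[i] for i in idx]
        resampled_a = [pred_a[i] for i in idx]
        resampled_b = [pred_b[i] for i in idx]
        diffs.append(accuracy(resampled_true, resampled_a) - accuracy(resampled_true, resampled_b))
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    # Toy labels: 1 = good functional outcome, 0 = poor outcome (made-up values).
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    manual_model = [1, 0, 1, 0, 0, 0, 1, 1]    # model trained on manual labels
    semi_auto_model = [1, 0, 0, 1, 0, 0, 1, 1]  # model trained on extracted labels
    print(paired_bootstrap_diff(y_true, manual_model, semi_auto_model))
```

Resampling patients rather than predictions keeps each patient's pair of predictions together, which matters because both models are evaluated on the same cohort.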

Conclusions

We developed an efficient language model to automate the extraction of clinical data from Italian unstructured health texts in a cohort of ischemic stroke patients. In a preliminary analysis, we demonstrated its potential applicability for enhancing prediction model accuracy.
Source journal: International Journal of Medical Informatics (Medicine / Computer Science: Information Systems)
CiteScore: 8.90
Self-citation rate: 4.10%
Articles published: 217
Review time: 42 days
Journal description: International Journal of Medical Informatics provides an international medium for dissemination of original results and interpretative reviews concerning the field of medical informatics. The Journal emphasizes the evaluation of systems in healthcare settings. The scope of the journal covers: Information systems, including national or international registration systems, hospital information systems, departmental and/or physician's office systems, document handling systems, electronic medical record systems, standardization, systems integration, etc.; Computer-aided medical decision support systems using heuristic, algorithmic and/or statistical methods as exemplified in decision theory, protocol development, artificial intelligence, etc.; Educational computer-based programs pertaining to medical informatics or medicine in general; Organizational, economic, social, clinical impact, ethical and cost-benefit aspects of IT applications in health care.