Development of a Natural Language Processing (NLP) model to automatically extract clinical data from electronic health records: results from an Italian comprehensive stroke center
Authors: Davide Badalotti, Akanksha Agrawal, Umberto Pensato, Giovanni Angelotti, Simona Marcheselli
International Journal of Medical Informatics, volume 192, Article 105626. Published 2024-09-19. DOI: 10.1016/j.ijmedinf.2024.105626
Impact factor: 3.7. JCR: Q2 (Computer Science, Information Systems).
Introduction
Data collection often relies on time-consuming manual inputs, with a vast amount of information embedded in unstructured texts such as patients’ medical records and clinical notes. Our study aims to develop a pipeline that combines active learning (AL) and NLP techniques to enhance data extraction in an acute ischemic stroke cohort.
Materials and methods
Consecutive acute ischemic stroke patients who received reperfusion therapies at IRCCS Humanitas Research Hospital were included. The Italian NLP Bidirectional Encoder Representations from Transformers (BERT) model was trained with AL to automatically extract clinical variables from electronic health text. Simulated active learning performance was evaluated on a set of labels representing patients’ comorbidities, comparing Bayesian Uncertainty Sampling by Disagreement (BALD) with random text selection. Prognostic models predicting patients’ functional outcomes using Gradient Boosting were trained on manually labelled and on semi-automatically extracted data, and their performance was compared.
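To make the comparison between BALD and random selection concrete, the following is a minimal sketch of how a BALD-style acquisition score can be computed for one binary comorbidity label. It is not the authors' implementation. It assumes the model yields several stochastic predictions per text (for example, forward passes with dropout left on), and the function names, the `pool` dictionary and the note identifiers are all hypothetical. BALD is usually expanded as Bayesian Active Learning by Disagreement; the score is the entropy of the averaged prediction minus the average entropy of the individual predictions.

```python
import math
import random


def binary_entropy(p: float) -> float:
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def bald_score(mc_probs: list[float]) -> float:
    """Entropy of the mean prediction minus the mean per-pass entropy.

    mc_probs holds the predicted probability of one binary label from
    several stochastic forward passes (an assumption of this sketch).
    """
    mean_p = sum(mc_probs) / len(mc_probs)
    expected_entropy = sum(binary_entropy(p) for p in mc_probs) / len(mc_probs)
    return binary_entropy(mean_p) - expected_entropy


def select_bald(pool: dict[str, list[float]], k: int) -> list[str]:
    """Return the k text ids whose predictions disagree most across passes."""
    ranked = sorted(pool, key=lambda text_id: bald_score(pool[text_id]), reverse=True)
    return ranked[:k]


def select_random(pool: dict[str, list[float]], k: int, seed: int = 0) -> list[str]:
    """Baseline: pick k text ids uniformly at random."""
    return random.Random(seed).sample(sorted(pool), k)


# Hypothetical pool: three unlabelled notes, five stochastic passes each.
pool = {
    "note_a": [0.95, 0.96, 0.94, 0.95, 0.97],  # confident, passes agree
    "note_b": [0.50, 0.52, 0.49, 0.51, 0.50],  # uncertain, but passes agree
    "note_c": [0.05, 0.95, 0.10, 0.90, 0.50],  # passes disagree strongly
}
print(select_bald(pool, k=1))  # ['note_c']
```

Note that `note_b` is as uncertain as `note_c` on average, but its passes agree with one another, so its BALD score is close to zero. That is the difference between BALD and plain uncertainty sampling: only disagreement between passes raises the score.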
Results
The active learning process showed null performance until around 20% of texts were labelled, possibly because the root layers of the BERT model were frozen; overall, however, active learning improved learning efficiency across most comorbidities. Prognostic modelling showed no significant difference in performance between models trained on manually labelled versus semi-automatically extracted data, indicating effective prediction capabilities in both settings.
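The abstract does not state which metric or statistical test was used to compare the two prognostic models. As one possible illustration only, the sketch below compares two sets of predicted risks on the same patients using the area under the ROC curve (AUC) and a paired bootstrap interval for the difference. The choice of AUC, the bootstrap procedure, and all names and numbers are assumptions made for this example, not details from the study.

```python
import random


def auc(labels: list[int], scores: list[float]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_diff(labels, scores_a, scores_b, n_boot=2000, seed=0):
    """Approximate 95% percentile interval for AUC(a) - AUC(b).

    Patients are resampled with replacement, and both models are scored
    on the same resample so that the comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:  # AUC is undefined without both classes
            continue
        a = auc(y, [scores_a[i] for i in idx])
        b = auc(y, [scores_b[i] for i in idx])
        diffs.append(a - b)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]


# Hypothetical outcomes and predicted risks for ten patients.
labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
scores_manual = [0.90, 0.20, 0.70, 0.60, 0.40, 0.10, 0.80, 0.50, 0.30, 0.35]
scores_auto = [0.85, 0.25, 0.65, 0.70, 0.45, 0.20, 0.75, 0.40, 0.35, 0.30]

low, high = bootstrap_auc_diff(labels, scores_manual, scores_auto)
print(f"AUC difference, 95% interval: [{low:.3f}, {high:.3f}]")
```

An interval that contains zero would be consistent with "no significant difference" under these assumptions. With a sample this small the interval is wide, so the toy numbers say nothing about the study itself.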
Conclusions
We developed an efficient language model to automate the extraction of clinical data from Italian unstructured health texts in a cohort of ischemic stroke patients. In a preliminary analysis, we demonstrated its potential applicability for enhancing prediction model accuracy.
About the journal:
International Journal of Medical Informatics provides an international medium for dissemination of original results and interpretative reviews concerning the field of medical informatics. The Journal emphasizes the evaluation of systems in healthcare settings.
The scope of the journal covers:
Information systems, including national or international registration systems, hospital information systems, departmental and/or physician's office systems, document handling systems, electronic medical record systems, standardization, systems integration, etc.;
Computer-aided medical decision support systems using heuristic, algorithmic and/or statistical methods as exemplified in decision theory, protocol development, artificial intelligence, etc.;
Educational computer-based programs pertaining to medical informatics or medicine in general;
Organizational, economic, social, clinical impact, ethical and cost-benefit aspects of IT applications in health care.