medRxiv - Health Informatics: Latest Papers

Fine-tuning large language models for effective nutrition support in residential aged care: a domain expertise approach
Pub Date : 2024-07-21 DOI: 10.1101/2024.07.21.24310775
Mohammad Alkhalaf, Chao Deng, Jun Shen, Hui-Chen (Rita) Chang, Ping Yu
Purpose: Malnutrition is a serious health concern, particularly among older people living in residential aged care (RAC) facilities. An automated and efficient method is required to identify individuals with malnutrition in this setting. Recent advances in transformer-based large language models (LLMs) with context-aware embeddings, such as RoBERTa, have significantly improved machine learning performance, particularly in predictive modelling. Adapting the embeddings of these models to domain-specific corpora, such as clinical notes, is essential for improving their performance in clinical tasks. Our study therefore introduces a novel approach that trains a foundational RoBERTa model on nursing progress notes to develop a RAC domain-specific LLM. The model is further fine-tuned on nursing progress notes to improve malnutrition identification and prediction in the residential aged care setting.
Methods: We developed our domain-specific model by training the RoBERTa LLM on 500,000 nursing progress notes from residential aged care electronic health records (EHRs). The model embeddings were used for two downstream tasks: malnutrition note identification and malnutrition prediction. Its performance was compared against baseline RoBERTa and BioClinicalBERT. Furthermore, we truncated long text sequences to fit RoBERTa's 512-token sequence length limit, enabling our model to handle sequences of up to 1536 tokens.
Results: Using 5-fold cross-validation for both tasks, our RAC domain-specific LLM performed significantly better than the other models. In malnutrition note identification, it achieved a slightly higher F1-score (0.966) than the other LLMs. In prediction, it achieved a significantly higher F1-score (0.655). We enhanced the model's predictive capability by integrating the risk factors extracted from each client's notes, creating a combined data layer of structured risk factors and free-text notes. This integration improved prediction performance, raising the F1-score to 0.687.
Conclusion: Our findings suggest that further fine-tuning a large language model on a domain-specific clinical corpus can improve the foundational model's performance in clinical tasks. This specialized adaptation significantly improves our domain-specific model's performance in tasks such as malnutrition risk identification and malnutrition prediction, making it useful for identifying and predicting malnutrition among older people living in residential aged care or long-term care facilities.
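The abstract states that long notes are fitted to RoBERTa's 512-token limit while sequences of up to 1536 tokens are still handled, but it does not say how. One plausible reading, sketched here as an assumption rather than the authors' method, is to split a note into at most three consecutive 512-token windows and drop anything beyond that. The function name `chunk_token_ids` and both constants are hypothetical, and special tokens are ignored for simplicity.

```python
from typing import List

MAX_LEN = 512    # per-sequence token limit quoted in the abstract
MAX_CHUNKS = 3   # assumption: 3 x 512 = 1536, the upper bound quoted in the abstract


def chunk_token_ids(token_ids: List[int],
                    max_len: int = MAX_LEN,
                    max_chunks: int = MAX_CHUNKS) -> List[List[int]]:
    """Split token IDs into consecutive windows of at most max_len tokens.

    Tokens beyond max_len * max_chunks are truncated.
    """
    capped = token_ids[: max_len * max_chunks]
    return [capped[i:i + max_len] for i in range(0, len(capped), max_len)]


# Hypothetical 2000-token note: the last 464 tokens are dropped.
chunks = chunk_token_ids(list(range(2000)))
print([len(c) for c in chunks])  # [512, 512, 512]
```

Each window could then be encoded separately and the resulting embeddings pooled before classification; the pooling strategy is likewise not specified in the abstract.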
Citations: 0
Exposomics and Cardiovascular Diseases: A Scoping Review of Machine Learning Approaches
Pub Date : 2024-07-19 DOI: 10.1101/2024.07.19.24310695
Katerina D. Argyri, Ioannis K. Gallos, Angelos Amditis, Dimitra D. Dionysiou
Cardiovascular disease has been established as the world's number one killer, causing over 20 million deaths per year. This fact, along with the growing awareness of the impact of exposomic risk factors on cardiovascular diseases, has led the scientific community to leverage machine learning strategies as a complementary approach to traditional statistical epidemiological studies, which are challenged by the highly heterogeneous and dynamic nature of exposomics data. The principal objective of this work is to identify key pertinent literature and provide an overview of the breadth of research on machine learning applications to exposomics data, with a focus on cardiovascular diseases. Secondarily, we aimed to identify common limitations and meaningful directions to be addressed in the future. Overall, this work shows that, although machine learning on exposomics data is under-researched compared to its application to other members of the -omics family, it is increasingly adopted to investigate different aspects of cardiovascular diseases.
Citations: 0
Radiotherapy continuity for cancer treatment: lessons learned from natural disasters
Pub Date : 2024-07-19 DOI: 10.1101/2024.07.18.24310636
Ralf Müller-Polyzou, Melanie Reuter-Oppermann
Background: The contemporary world is challenged by natural disasters accelerated by climate change, affecting a growing world population. Simultaneously, cancer remains a persistent threat as a leading cause of death, killing 10 million people annually. The efficacy of radiotherapy, a cornerstone of cancer treatment worldwide, depends on an uninterrupted course of therapy. However, natural disasters cause significant disruptions to the continuity of radiotherapy services, posing a critical challenge to cancer treatment. This paper explores how natural disasters impact radiotherapy practice, compares them to man-made disasters, and outlines strategies to mitigate the adverse effects of natural disasters. Through this analysis, the study seeks to contribute to developing resilient healthcare frameworks capable of sustaining essential cancer treatment amidst the challenges posed by natural disasters.
Method: We conducted a Structured Literature Review to investigate this matter comprehensively, gathering and evaluating relevant academic publications. We explored how natural disasters affected radiotherapy practice and examined the experience of radiotherapy centres worldwide in resuming operations after such events. Subsequently, we validated and extended our research findings through a global online survey involving radiotherapy professionals.
Results: The Structured Literature Review identified twelve academic publications describing hurricanes, floods, and earthquakes as the primary disruptors of radiotherapy practice. The analysis confirms and complements risk mitigation themes identified in our previous research, which focused on the continuity of radiotherapy practice during the COVID-19 pandemic. Our work describes nine overarching themes, forming the basis for a taxonomy of 36 distinct groups. The subsequent confirmatory online survey supported and solidified our findings and served as a basis for developing a conceptual framework for natural disaster-resilient radiotherapy.
Discussion: The growing threat posed by natural disasters underscores the need to develop business continuity programs and define risk mitigation measures to ensure the uninterrupted provision of radiotherapy services. By drawing lessons from past disasters, we can better prepare for future hazards, supporting disaster management and planning efforts and, in particular, enhancing the resilience of radiotherapy practice. Additionally, our study can serve as a resource for shaping policy initiatives aimed at mitigating the impact of natural hazards.
Citations: 0
Impact of Ambient Artificial Intelligence Notes on Provider Burnout
Pub Date : 2024-07-19 DOI: 10.1101/2024.07.18.24310656
Jason Misurac, Lindsey A Knake, James M Blum
Background: Healthcare provider burnout is a critical issue with significant implications for individual well-being, patient care, and healthcare system efficiency. Addressing burnout is essential for improving both provider well-being and the quality of patient care. Ambient artificial intelligence (AI) offers a novel approach to mitigating burnout by reducing the documentation burden through advanced speech recognition and natural language processing technologies that summarize the patient encounter into a clinical note to be reviewed by clinicians.
Objective: To assess provider burnout and professional fulfillment associated with ambient AI technology during a pilot study, using the Stanford Professional Fulfillment Index (PFI).
Methods: A pre-post observational study was conducted at University of Iowa Health Care with 38 volunteer physicians and advanced practice providers. Participants used a commercial ambient AI tool over a 5-week trial in ambulatory environments. The AI tool transcribed patient-clinician conversations and generated preliminary clinical notes for review and entry into the electronic medical record. Burnout and professional fulfillment were assessed using the Stanford PFI at baseline and post-intervention.
Results: Pre-test and post-test surveys were completed by 35 of 38 participants (92% survey completion rate). Burnout scores fell significantly, with the median burnout score improving from 4.16 to 3.16 (p=0.005); the validated Stanford PFI cutoff for overall burnout is 3.33. Burnout rates decreased from 69% to 43%. There was a notable improvement in interpersonal disengagement scores (3.6 vs. 2.5, p<0.001), although work exhaustion scores did not significantly change. Professional fulfillment showed a modest, non-significant increase (6.1 vs. 6.5, p=0.10).
Conclusions: Ambient AI significantly reduces healthcare provider burnout and modestly enhances professional fulfillment. By alleviating documentation burdens, ambient AI improves operational efficiency and provider well-being. These findings suggest that broader implementation of ambient AI could be a strategic intervention to combat burnout in healthcare settings.
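The results combine two summaries of the same scores: a median and a burnout rate derived from the 3.33 cutoff. The sketch below shows how such a rate could be computed. It assumes that a respondent counts as burned out when the score is at or above the cutoff, which the abstract does not spell out, and the score lists are invented for illustration, not study data.

```python
from statistics import median
from typing import Sequence

BURNOUT_CUTOFF = 3.33  # Stanford PFI overall-burnout cutoff quoted in the abstract


def burnout_rate(scores: Sequence[float], cutoff: float = BURNOUT_CUTOFF) -> float:
    """Fraction of respondents at or above the cutoff (assumed classification rule)."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(s >= cutoff for s in scores) / len(scores)


# Hypothetical scores for five respondents, NOT taken from the study.
pre = [4.5, 4.0, 3.5, 2.0, 5.0]
post = [3.0, 3.5, 2.5, 1.5, 4.0]

print(median(pre), median(post))              # 4.0 3.0
print(burnout_rate(pre), burnout_rate(post))  # 0.8 0.4
```

If both reported rates are computed over the 35 survey completers, 69% and 43% would correspond to roughly 24 and 15 respondents; the abstract itself does not give these counts.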
Citations: 0
Protocol for: A Simple, Accessible, Literature-based Drug Repurposing Pipeline
Pub Date : 2024-07-19 DOI: 10.1101/2024.07.18.24310641
Maximin Lange, Eoin Gogarty, Meredith Martyn, Philip Braude, Feras Fayez, Ben Carter
We will develop a novel approach to drug repurposing, utilising Natural Language Processing (NLP) and Literature Based Discovery (LBD) techniques. This will present a simplified, accessible drug repurposing pipeline using Word2Vec embeddings trained on PubMed abstracts to identify potential new medications to be repurposed. We present this approach in the context of antipsychotics, but it could be repeated for any available medication. The research is structured in three stages:
1. Identification of candidate medications using Word2Vec algorithm trained on scientific literature.
2. Empirical testing of identified candidates using a large hospital dataset to explore protective effects against disease onset.
3. Validation of findings using a second, independent dataset to assess generalizability.
This method addresses limitations in current machine learning-based drug repurposing approaches, including lack of external validation and limited accessibility. By leveraging Word2Vec's ability to capture semantic relationships between words, the study aims to uncover hidden connections in medical literature that may lead to novel therapeutic discoveries. The protocol emphasizes transparency and reproducibility, utilizing publicly available electronic health record (EHR) databases for validation. This approach allows for tangible results even for researchers with limited machine learning expertise, bridging the gap between biomedical and information systems communities.
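The first stage of the protocol relies on word embeddings placing semantically related terms close together. A minimal sketch of that idea is to rank candidate terms by cosine similarity to a query term. The three-dimensional vectors and the names `drug_a`, `drug_b`, `drug_c` are invented for illustration; a real pipeline would load embeddings trained on PubMed abstracts, and the protocol's actual ranking procedure may differ.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(target: str,
                    embeddings: Dict[str, List[float]],
                    top_n: int = 3) -> List[Tuple[str, float]]:
    """Return the top_n terms most similar to `target`, excluding the target itself."""
    query = embeddings[target]
    scored = [(term, cosine(query, vec))
              for term, vec in embeddings.items() if term != target]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy embeddings, purely hypothetical.
toy_embeddings = {
    "antipsychotic": [0.9, 0.1, 0.0],
    "drug_a": [0.8, 0.2, 0.1],
    "drug_b": [0.1, 0.9, 0.2],
    "drug_c": [0.0, 0.1, 0.9],
}

ranking = rank_candidates("antipsychotic", toy_embeddings)
print([term for term, _ in ranking])  # ['drug_a', 'drug_b', 'drug_c']
```

High similarity only signals co-occurrence patterns in the literature, which is why the protocol follows this step with empirical testing and independent validation.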
Citations: 0
Interpretable Machine Learning for Predicting Multiple Sclerosis Conversion from Clinically Isolated Syndrome
Pub Date : 2024-07-19 DOI: 10.1101/2024.07.18.24310578
Eden Caroline Daniel, Santosh Tirunagari, Karan Batth, David Windridge, Yashaswini Balla
Background: Machine learning (ML) prediction of clinically isolated syndrome (CIS) conversion to multiple sclerosis (MS) could be used as a remote, preliminary tool by clinicians to identify high-risk patients that would benefit from early treatment. Objective: This study evaluates ML models to predict CIS to MS conversion and identifies key predictors. Methods: Five supervised learning techniques (Naive Bayes, Logistic Regression, Decision Trees, Random Forests and Support Vector Machines) were applied to clinical data from 138 Lithuanian and 273 Mexican CIS patients. Seven different feature combinations were evaluated to determine the most effective models and predictors. Results: Key predictors common to both datasets included sex, presence of oligoclonal bands in CSF, MRI spinal lesions, abnormal visual evoked potentials and brainstem auditory evoked potentials. The Lithuanian dataset confirmed predictors identified by previous clinical research, while the Mexican dataset partially validated them. The highest F1 score of 1.0 was achieved using Random Forests on all features for the Mexican dataset and Logistic Regression with SMOTE Upsampling on all features for the Lithuanian dataset. Conclusion: Applying the identified high-performing ML models to the CIS patient datasets shows potential in assisting clinicians to identify high-risk patients.
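The Lithuanian result pairs Logistic Regression with SMOTE upsampling, which balances classes by synthesizing minority-class samples instead of duplicating them. The sketch below is a simplified, standard-library illustration of that interpolation step, not the authors' implementation: each synthetic point lies on the segment between a minority sample and one of its nearest minority neighbours. The function name, parameters and data points are hypothetical.

```python
import random
from typing import List

Point = List[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like_oversample(minority: List[Point], n_new: int,
                          k: int = 2, seed: int = 0) -> List[Point]:
    """Create n_new synthetic minority samples by neighbour interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical two-feature minority class, NOT study data.
minority_points = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 3.0]]
new_points = smote_like_oversample(minority_points, n_new=5)
print(len(new_points))  # 5
```

Oversampling should be applied only to the training folds; synthesizing samples before splitting can leak information into the test data and inflate scores such as F1.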
Citations: 0
His-MMDM: Multi-domain and Multi-omics Translation of Histopathology Images with Diffusion Models
Pub Date : 2024-07-12 DOI: 10.1101/2024.07.11.24310294
Zhongxiao Li, Tianqi Su, Bin Zhang, Wenkai Han, Sibin Zhang, Guiyin Sun, Yuwei Cong, Xin Chen, Jiping Qi, Yujie Wang, Shiguang Zhao, Hongxue Meng, Peng Liang, Xin Gao
Generative AI (GenAI) has advanced computational pathology through various image translation models. These models synthesize histopathological images from existing ones, facilitating tasks such as color normalization and virtual staining. Current models, while effective, are mostly dedicated to specific source-target domain pairs and lack scalability for multi-domain translations. Here we introduce His-MMDM, a diffusion model-based framework enabling multi-domain and multi-omics histopathological image translation. His-MMDM can translate images across an unlimited number of categorical domains, enabling new applications such as the translation of tumor images across various tumor types, while performing comparably to dedicated models on previous tasks such as transforming cryosectioned images to formalin-fixed paraffin-embedded (FFPE) ones. Additionally, it can perform genomics- and/or transcriptomics-guided editing of histopathological images, illustrating the impact of driver mutations and oncogenic pathway alterations on tissue histopathology. These capabilities position His-MMDM as a versatile tool in the GenAI toolkit for future pathologists.
Citations: 0
Pretrained Language Models for Semantics-Aware Data Harmonisation of Observational Clinical Studies in the Era of Big Data
Pub Date : 2024-07-12 DOI: 10.1101/2024.07.12.24310136
Jakub Jan Dylag, Zlatko Zlatev, Michael Boniface
In clinical research, there is a strong drive to leverage big data from population cohort studies and routine electronic healthcare records to design new interventions, improve health outcomes and increase the efficiency of healthcare delivery. Yet realising this potential requires substantial effort in harmonising source datasets and curating study data, which currently relies on costly, time-consuming and labour-intensive manual methods. We evaluate the applicability of AI methods for natural language processing (NLP) and unsupervised machine learning (ML) to the challenges of big data semantic harmonisation and curation. Our aim is to establish an efficient and robust technological foundation for the development of automated tools supporting data curation of large clinical datasets. We assess NLP and unsupervised ML algorithms and propose two pipelines for automated semantic harmonisation: a pipeline for semantics-aware search for domain-relevant variables and a pipeline for clustering of semantically similar variables. We evaluate pipeline performance using 94,037 textual variable descriptions from the English Longitudinal Study of Ageing (ELSA) database. We observe high accuracy of our Semantic Search pipeline, with an AUC of 0.899 (SD=0.056). Our Semantic Clustering pipeline achieves a V-measure of 0.237 (SD=0.157), which is on par with leading implementations in other relevant domains. Automation can significantly accelerate dataset harmonisation: manual labelling was performed at a speed of 2.1 descriptions per minute, whereas our automated labelling reached 245 descriptions per minute. Our study findings underscore the potential of AI technologies, such as NLP and unsupervised ML, in automating the harmonisation and curation of big data for clinical research. By establishing a robust technological foundation, we pave the way for the development of automated tools that streamline the process, enabling health data scientists to leverage big data more efficiently and effectively in their studies, accelerating insights from data for clinical benefit.
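The clustering result above is reported as a V-measure, which may be unfamiliar. The sketch below shows how that metric is defined: it is the harmonic mean of homogeneity (each cluster contains only one class) and completeness (each class falls in one cluster), both derived from conditional entropies. This is an illustrative pure-Python version under the standard definition, not the authors' evaluation code; the function names and toy labels are assumptions, and `sklearn.metrics.v_measure_score` offers a library implementation.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def conditional_entropy(target, given):
    """H(target | given), computed from two paired label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())

def v_measure(classes, clusters):
    """Harmonic mean of homogeneity and completeness."""
    h_c, h_k = entropy(classes), entropy(clusters)
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

# Toy labels, invented for illustration: cluster ids differ from class ids,
# but the grouping is identical, so the score is perfect.
perfect = v_measure([0, 0, 1, 1], [1, 1, 0, 0])
# Clusters that carry no information about the classes.
uninformative = v_measure([0, 0, 1, 1], [0, 1, 0, 1])
print(perfect, uninformative)
```

The score is invariant to how cluster ids are numbered, which is why it suits unsupervised pipelines where cluster labels are arbitrary.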
Citations: 0
Analysis of Race, Sex, and Language Proficiency Disparities in Documented Medical Decisions
Pub Date : 2024-07-12 DOI: 10.1101/2024.07.11.24310289
Hadi Amiri, Nidhi Vakil, Mohamed Elgaar, Jiali Cheng, Mitra Mohtarami, Adrian Wong, Mehrnaz Sadrolashrafi, Leo Anthony G. Celi
Importance: Detecting potential disparities in documented medical decisions is a crucial step toward achieving more equitable practices and care, informing healthcare policy making, and preventing computational models from learning and perpetuating such biases. Objective: To identify disparities associated with race, sex and language proficiency of patients in the documentation of medical decisions. Design: This cross-sectional study included 451 discharge summaries from MIMIC-III, with all medical decisions annotated by domain experts according to the 10 medical decision categories defined in the Decision Identification and Classification Taxonomy for Use in Medicine. Annotated discharge summaries were stratified by race, sex, language proficiency, diagnosis codes, type of ICU, patient status code, and patient comorbidities (quantified by the Elixhauser Comorbidity Index) to account for potential confounding factors. Welch's t-test with Bonferroni correction was used to identify significant disparities in the frequency of medical decisions. Setting: The study used the MIMIC-III data set, which contains de-identified health data for patients admitted to the critical care units at the Beth Israel Deaconess Medical Center. Participants: The population reflects the race, sex, and clinical conditions of patients in a data set developed by previous work for patient phenotyping. Main Outcomes and Measures: The primary outcomes were different types of disparities associated with language proficiency of patients in documented medical decisions within discharge summaries, and the secondary outcome was the prevalence of medical decisions documented in discharge summaries. The data set will be made available at https://physionet.org/. Results: This study analyzed 56,759 medical decision text segments documented in 451 discharge summaries. Analysis across demographic groups revealed a higher documentation frequency for English proficient patients compared to non-English proficient patients in several categories, suggesting potential disparities in documentation or care. Specifically, English proficient patients consistently had more documented decisions in critical decision categories such as "Defining Problem" in conditions related to circulatory system and endocrine, nutritional and metabolic diseases. However, this study found no significant disparities in medical decision documentation based on sex or race. Conclusions and Relevance: This study illustrates disparities in the documentation of medical decisions, with English proficient patients receiving more comprehensive documentation compared to non-English proficient patients. Conversely, no significant disparity was identified in terms of sex or race. These findings suggest a potential need for targeted interventions to improve the equity of medical documentation practices, so that all patients receive the same level of detailed care documentation, and to prevent computational models from learning and perpetuating such biases.
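The Design section relies on Welch's t-test with Bonferroni correction. A minimal sketch of the two calculations, using only the Python standard library, is given below. It computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom for two groups with unequal variances, plus the Bonferroni-adjusted significance threshold. The group values are invented for illustration, and the choice of 10 tests (one per decision category) is an assumption, since the abstract does not state how many comparisons were corrected for. Converting t and df into a p-value needs the t distribution, which a library call such as `scipy.stats.ttest_ind(a, b, equal_var=False)` handles.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

# Hypothetical counts of documented decisions per discharge summary.
group_a = [12, 15, 14, 10, 13]
group_b = [9, 8, 11, 7, 10]

t_stat, dof = welch_t(group_a, group_b)
threshold = bonferroni_alpha(0.05, n_tests=10)  # 10 tests is an assumption
print(t_stat, dof, threshold)
```

Welch's variant is the natural choice when group sizes and variances differ, as they typically do across demographic strata.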
Citations: 0
A Machine Learning-Based Prediction of Hospital Mortality in Mechanically Ventilated ICU Patients
Pub Date : 2024-07-12 DOI: 10.1101/2024.07.12.24310325
Hexin Li, Negin Ashrafi, Chris Kang, Guanlan Zhao, Yubing Chen, Maryam Pishgar
Background: Mechanical ventilation (MV) is vital for critically ill ICU patients but carries significant mortality risks. This study aims to develop a predictive model to estimate hospital mortality among MV patients, utilizing comprehensive health data to assist ICU physicians with early-stage alerts. Methods: We developed a Machine Learning (ML) framework to predict hospital mortality in ICU patients receiving MV. Using the MIMIC-III database, we identified 25,202 eligible patients through ICD-9 codes. We employed backward elimination and the Lasso method, selecting 32 features based on clinical insights and literature. Data preprocessing included eliminating columns with over 90% missing data and using mean imputation for the remaining missing values. To address class imbalance, we used the Synthetic Minority Over-sampling Technique (SMOTE). We evaluated several ML models, including CatBoost, XGBoost, Decision Tree, Random Forest, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Logistic Regression, using a 70/30 train-test split. The CatBoost model was chosen for its superior performance in terms of accuracy, precision, recall, F1-score, AUROC, and calibration plots. Results: The study involved a cohort of 25,202 patients on MV. The CatBoost model attained an AUROC of 0.862, an improvement over 0.821, the best AUROC previously reported in the literature. It also demonstrated an accuracy of 0.789, an F1-score of 0.747, and better calibration, outperforming other models. These improvements are due to systematic feature selection and the robust gradient boosting architecture of CatBoost. Conclusion: The preprocessing methodology significantly reduced the number of relevant features, simplifying computational processes, and identified critical features previously overlooked. Integrating these features and tuning the parameters, our model demonstrated strong generalization to unseen data. This highlights the potential of ML as a crucial tool in ICUs, enhancing resource allocation and providing more personalized interventions for MV patients.
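The preprocessing described in the Methods (dropping columns with over 90% missing data, then mean-imputing what remains) can be sketched as follows. This is an illustrative pure-Python version under assumed conventions, not the authors' pipeline: the table is a dict of column lists, `None` marks a missing value, and the column names and values are invented.

```python
def preprocess(columns, max_missing=0.9):
    """Drop columns whose missing fraction exceeds max_missing,
    then replace remaining missing values with the column mean."""
    cleaned = {}
    for name, values in columns.items():
        missing = sum(v is None for v in values)
        if missing / len(values) > max_missing:
            continue  # column is too sparse to keep
        observed = [v for v in values if v is not None]
        col_mean = sum(observed) / len(observed)
        cleaned[name] = [col_mean if v is None else v for v in values]
    return cleaned

# Hypothetical ICU features, invented for illustration.
table = {
    "heart_rate": [80, None, 100, 90, None, 70, 85, 95, 75, 105],
    "rare_lab": [None] * 10,  # 100% missing, so it is dropped
}
result = preprocess(table)
print(sorted(result))
```

One design point worth noting: when a train-test split is used, the imputation means are usually computed on the training portion only and then applied to the test portion, so that test data does not leak into the fitted values. The abstract does not say which order was followed.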
Citations: 0