Ayana Sarrieddine, Claire Lai, Oliver Bear Don't Walk, Nick F H Reid, Gregory Sawicki, Ariel Berlinski, Margaret Rosenfeld, Andrea L Hartzler
As the integration of informatics into clinical research reshapes the landscape of decentralized studies, optimizing participant experience remains a key challenge. Although prior research has established foundations for decentralized study design, a more comprehensive understanding of participant perspectives is essential to ensure remote methods for data collection meet participant needs. This study contributes to a growing literature in participant-centered decentralized studies through an analysis of OUTREACH, a 3-month home spirometry study among individuals with cystic fibrosis. Through a qualitative analysis of 46 participant exit interviews, we identified three overarching categories that influenced participant experience: motivators, technological infrastructure, and human coordination. Our findings emphasize the value of reliable technology and comprehensive interpersonal support systems. These findings shed light upon the importance of sociotechnical elements for optimizing participant experience, which may enhance the quality of clinical study data through meaningful participant engagement.
{"title":"Technology and Human Support Systems in Decentralized Studies: A Participant-Centered Case Study in Cystic Fibrosis.","authors":"Ayana Sarrieddine, Claire Lai, Oliver Bear Don't Walk, Nick F H Reid, Gregory Sawicki, Ariel Berlinski, Margaret Rosenfeld, Andrea L Hartzler","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>As the integration of informatics into clinical research reshapes the landscape of decentralized studies, optimizing participant experience remains a key challenge. Although prior research has established foundations for decentralized study design, a more comprehensive understanding of participant perspectives is essential to ensure remote methods for data collection meet participant needs. This study contributes to a growing literature in participant-centered decentralized studies through an analysis of OUTREACH, a 3-month home spirometry study among individuals with cystic fibrosis. Through a qualitative analysis of 46 participant exit interviews, we identified three overarching categories that influenced participant experience: motivators, technological infrastructure, and human coordination. Our findings emphasize the value of reliable technology and comprehensive interpersonal support systems. These findings shed light upon the importance of sociotechnical elements for optimizing participant experience, which may enhance the quality of clinical study data through meaningful participant engagement.</p>","PeriodicalId":72180,"journal":{"name":"AMIA ... Annual Symposium proceedings. 
AMIA Symposium","volume":"2024 ","pages":"1130-1139"},"PeriodicalIF":0.0,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12919454/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147273135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xubing Hao, Rashmie Abeysinghe, Jay Shi, Guo-Qiang Zhang, Licong Cui
Ensuring the completeness of IS-A relations in SNOMED CT is crucial for maintaining its accuracy in clinical applications. In this study, we propose a hybrid approach leveraging non-lattice subgraphs and pre-trained language models (PLMs) to identify missing IS-A relations in SNOMED CT. We fine-tuned four BERT-based models: BERT, DistilBERT, DeBERTa, and BioClinicalBERT, and four generative large language models (LLMs): BioMistral, Llama3, Gemma2, and Phi-4. Missing IS-A relations were identified through consensus predictions by all eight models. DeBERTa achieved the best performance (precision: 0.96, recall: 0.97, F1-score: 0.965) for IS-A relation prediction. Our approach identified 678 potential missing IS-A relations in SNOMED CT (March 2023 US Edition), of which 100 randomly selected cases were manually reviewed by a domain expert, confirming 93 as valid (93% precision). These results demonstrate the effectiveness of fine-tuned PLMs in detecting missing IS-A relations within non-lattice subgraphs, offering a promising avenue for improving SNOMED CT's quality.
{"title":"Identifying Missing IS-A Relations in SNOMED CT with Fine-Tuned Pre-trained Language Models and Non-lattice Subgraphs.","authors":"Xubing Hao, Rashmie Abeysinghe, Jay Shi, Guo-Qiang Zhang, Licong Cui","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Ensuring the completeness of IS-A relations in SNOMED CT is crucial for maintaining its accuracy in clinical applications. In this study, we propose a hybrid approach leveraging non-lattice subgraphs and pre-trained language models (PLMs) to identify missing IS-A relations in SNOMED CT. We fine-tuned four BERT-based models: BERT, DistillBERT, DeBERTa, and BioClinicalBERT, and four generative large language models (LLMs): BioMistral, Llama3, Gemma2, and Phi-4. Missing IS-A relations were identified through consensus predictions by all eight models. De-BERTa achieved the best performance (precision: 0.96, recall: 0.97, F1-score: 0.965) for IS-A relation prediction. Our approach identified 678 potential missing IS-A relations in SNOMED CT (March 2023 US Edition), of which 100 randomly selected cases were manually reviewed by a domain expert, confirming 93 as valid (93% precision). These results demonstrate the effectiveness of fine-tuned PLMs in detecting missing IS-A relations within non-lattice subgraphs, offering a promising avenue for improving SNOMED CT's quality.</p>","PeriodicalId":72180,"journal":{"name":"AMIA ... Annual Symposium proceedings. 
AMIA Symposium","volume":"2024 ","pages":"433-442"},"PeriodicalIF":0.0,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12919620/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147273142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
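The consensus step described above — flagging a candidate IS-A relation only when all eight models agree — can be sketched in a few lines. The scoring callables below are hypothetical stand-ins; the paper used four fine-tuned BERT-family classifiers and four generative LLMs.

```python
# Consensus voting over candidate IS-A pairs: a pair is flagged as a
# missing relation only if every model's score clears the threshold.
def consensus_missing_isa(candidate_pairs, models, threshold=0.5):
    """Return (child, parent) pairs that all models agree are IS-A."""
    flagged = []
    for child, parent in candidate_pairs:
        scores = [m(child, parent) for m in models]
        if all(s >= threshold for s in scores):
            flagged.append((child, parent))
    return flagged

# Toy usage with stand-in scorers (real models would be fine-tuned PLMs):
always_yes = lambda c, p: 0.9
cautious = lambda c, p: 0.9 if "Bacterial" in c else 0.1
pairs = [("Bacterial pneumonia", "Pneumonia"), ("Fracture", "Pneumonia")]
print(consensus_missing_isa(pairs, [always_yes, cautious]))
# → [('Bacterial pneumonia', 'Pneumonia')]
```

Requiring unanimity trades recall for precision, which matches the paper's emphasis on high-precision suggestions for expert review.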
Katherine A Zellner, Sifan Yuan, Emily R Ernst, Dylan W Arkowitz, Aaron H Mun, Mary S Kim, Ivan Marsic, Randall S Burd, Aleksandra Sarcevic
Delays and process inefficiencies during trauma resuscitation can contribute to adverse patient outcomes. While tracking elapsed time may improve the trauma team's temporal awareness and reduce delays, reliance on manual activation of stop clocks can introduce variability. To address this limitation, we implemented a computer vision-powered automatic stop clock designed to activate upon patient arrival without requiring manual input. We conducted a retrospective video review of 50 trauma resuscitations to assess how the clock was used in practice, followed by semi-structured interviews with nine trauma team members to elicit their feedback and perceptions. This study contributes to the broader discussion on AI-assisted clinical tools, highlighting the role of automation in supporting trauma teams, reducing variability in time tracking, and improving process efficiency.
{"title":"STop Clock for Automated Tracking (STAT) during Time-Critical Medical Work: Evaluating the Accuracy and Usability of an AI-Driven Automated Stop Clock.","authors":"Katherine A Zellner, Sifan Yuan, Emily R Ernst, Dylan W Arkowitz, Aaron H Mun, Mary S Kim, Ivan Marsic, Randall S Burd, Aleksandra Sarcevic","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Delays and process inefficiencies during trauma resuscitation can contribute to adverse patient outcomes. While tracking elapsed time may improve the trauma team's temporal awareness and reduce delays, reliance on manual activation of stop clocks can introduce variability. To address this limitation, we implemented a computer vision-powered automatic stop clock designed to activate upon patient arrival without requiring manual input. We conducted a retrospective video review of 50 trauma resuscitations to assess how the clock was used in practice, followed by semi-structured interviews with nine trauma team members to elicit their feedback and perceptions. This study contributes to the broader discussion on AI-assisted clinical tools, highlighting the role of automation in supporting trauma teams, reducing variability in time tracking, and improving process efficiency.</p>","PeriodicalId":72180,"journal":{"name":"AMIA ... Annual Symposium proceedings. AMIA Symposium","volume":"2024 ","pages":"1502-1510"},"PeriodicalIF":0.0,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12919504/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147273145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aditya Nagori, Ayush Gautam, Matthew O Wiens, Vuong Nguyen, Nathan Kenya Mugisha, Jerome Kabakyenga, Niranjan Kissoon, John Mark Ansermino, Rishikesan Kamaleswaran
The clustering of patient subgroups is essential for personalized care and efficient use of resources. Traditional clustering methods struggle with high-dimensional heterogeneous healthcare data and lack contextual understanding. This study evaluates Large Language Model (LLM)-based clustering against classical methods using a pediatric sepsis dataset from a low-income country (LIC), containing 2,686 records with 28 numerical variables and 119 categorical variables. Patient records were serialized into text with and without a clustering objective. Embeddings were generated using quantized LLAMA 3.1 8B, DeepSeek-R1-Distill-Llama-8B with low-rank adaptation (LoRA), and Stella-En-400M-V5 models. K-means clustering was applied to these embeddings. Classical comparisons included K-Medoids clustering on UMAP and FAMD-reduced mixed data. Silhouette scores and statistical tests evaluated the quality and distinctiveness of the clusters. Stella-En-400M-V5 achieved the highest Silhouette Score (0.86). LLAMA 3.1 8B with the clustering objective performed better with a higher number of clusters, identifying subgroups with distinct nutritional, clinical, and socioeconomic profiles. LLM-based methods outperformed classical techniques by capturing richer context and prioritizing key features. These results highlight the potential of LLMs for contextual phenotyping and informed decision making in resource-limited settings.
{"title":"Contextual Phenotyping of Pediatric Sepsis Cohort Using Large Language Models.","authors":"Aditya Nagori, Ayush Gautam, Matthew O Wiens, Vuong Nguyen, Nathan Kenya Mugisha, Jerome Kabakyenga, Niranjan Kissoon, John Mark Ansermino, Rishikesan Kamaleswaran","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>The clustering of patient subgroups is essential for personalized care and efficient use of resources. Traditional clustering methods struggle with high-dimensional heterogeneous healthcare data and lack contextual understanding. This study evaluates clustering based on the Large Language Model (LLM) against classical methods using a pediatric sepsis dataset from a low-income country (LIC), containing 2,686 records with 28 numerical variables and 119 categorical variables. Patient records were serialized into text with and without a clustering objective. Embeddings were generated using quantized LLAMA 3.1 8B, DeepSeek-R1-Distill-Llama-8B with low-rank adaptation(LoRA), and Stella-En-400M-V5 models. K-means clustering was applied to these embeddings. Classical comparisons included K-Medoids clustering on UMAP and FAMD-reduced mixed data. Silhouette scores and statistical tests evaluated the quality and distinctiveness of the cluster. Stella-En-400M-V5 achieved the highest Silhouette Score (0.86). LLAMA 3.1 8B with the clustering objective performed better with a higher number of clusters, identifying subgroups with distinct nutritional, clinical, and socioeconomic profiles. LLM-based methods outperformed classical techniques by capturing richer context and prioritizing key features. These results highlight the potential of LLMs for contextual phenotyping and informed decision making in resource-limited settings.</p>","PeriodicalId":72180,"journal":{"name":"AMIA ... Annual Symposium proceedings. 
AMIA Symposium","volume":"2024 ","pages":"929-938"},"PeriodicalIF":0.0,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12919534/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147273189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
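The pipeline the abstract describes — serialize each mixed-type record into text, embed it, run K-means, and score with the Silhouette Score — can be sketched as below. The hash-based embedding is a deterministic stand-in for the paper's LLM embedders (LLAMA 3.1 8B, Stella-En-400M-V5), and the record fields are hypothetical.

```python
# Sketch: serialize -> embed -> K-means -> silhouette, per the paper's
# pipeline. toy_embed is NOT the paper's method; it only stands in for
# an LLM embedding model so the pipeline runs end to end.
import hashlib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def serialize(record):
    """Flatten a mixed-type record into text, e.g. 'age_months: 2; fever: yes'."""
    return "; ".join(f"{k}: {v}" for k, v in record.items())

def toy_embed(text, dim=32):
    # Deterministic pseudo-random vector keyed on the serialized text.
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    return np.random.default_rng(seed).normal(size=dim)

records = [{"age_months": a, "fever": f}
           for a, f in [(2, "yes"), (3, "yes"), (60, "no"), (58, "no")]]
X = np.stack([toy_embed(serialize(r)) for r in records])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)
print(score)
```

Serializing with or without a clustering objective, as the paper does, would amount to prepending an instruction string (e.g. "Cluster patients by clinical similarity:") before embedding.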
Keerthika Sunchu, Megha M Moncy, Saptarshi Purkayastha, Cathy R Fulton
This study examines the integration of OpenEMR, a Meaningful Use-certified open-source electronic health record (EHR) system, into a Health Informatics curriculum. The primary objective was to address the disparity between theoretical knowledge and practical application in health informatics education. The implementation process revealed several significant challenges, including unintended system modifications that compromised functionality, data entry errors that impacted usability, and technical issues that impeded accessibility. To mitigate these challenges, a series of interventions were implemented. These included backend modifications to enhance data entry accuracy, usability improvements such as limiting open tabs to facilitate navigation, and the implementation of proactive measures to expedite the resolution of technical issues. The experiences gained from this integration process highlight three critical aspects of health informatics education: the significance of practical proficiency in EHR systems, the necessity for user-centric interface design, and the importance of adaptability and problem-solving skills. The study proposes several future directions for research and practice. These include fostering global collaboration, developing standardized curricula for EHR education, and establishing robust mechanisms for continuous assessment and improvement. The findings underscore the pivotal role of integrating hands-on EHR experience into health informatics education, emphasizing its potential to equip students with the essential competencies required to navigate the complex and dynamic healthcare landscape.
{"title":"Lessons Learned from OpenEMR Implementation in Graduate Health Informatics Curriculum.","authors":"Keerthika Sunchu, Megha M Moncy, Saptarshi Purkayastha, Cathy R Fulton","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>This study examines the integration of OpenEMR, a Meaningful Use-certified open-source electronic health record (EHR) system, into a Health Informatics curriculum. The primary objective was to address the disparity between theoretical knowledge and practical application in health informatics education. The implementation process revealed several significant challenges, including unintended system modifications that compromised functionality, data entry errors that impacted usability, and technical issues that impeded accessibility. To mitigate these challenges, a series of interventions were implemented. These included backend modifications to enhance data entry accuracy, usability improvements such as limiting open tabs to facilitate navigation, and the implementation ofproactive measures to expedite the resolution of technical issues. The experiences gained from this integration process highlight three critical aspects of health informatics education: the significance of practical proficiency in EHR systems, the necessity for user-centric interface design, and the importance of adaptability and problem-solving skills. The study proposes several future directions for research and practice. These include fostering global collaboration, developing standardized curricula for EHR education, and establishing robust mechanisms for continuous assessment and improvement. The findings underscore the pivotal role of integrating hands-on EHR experience into health informatics education, emphasizing its potential to equip students with the essential competencies required to navigate the complex and dynamic healthcare landscape.</p>","PeriodicalId":72180,"journal":{"name":"AMIA ... Annual Symposium proceedings. 
AMIA Symposium","volume":"2024 ","pages":"1079-1088"},"PeriodicalIF":0.0,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12099383/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144144577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Clinical question answering systems have the potential to provide clinicians with relevant and timely answers to their questions. Nonetheless, despite the advances that have been made, adoption of these systems in clinical settings has been slow. One issue is a lack of question-answering datasets which reflect the real-world needs of health professionals. In this work, we present RealMedQA, a dataset of realistic clinical questions generated by humans and an LLM. We describe the process for generating and verifying the QA pairs and assess several QA models on BioASQ and RealMedQA to assess the relative difficulty of matching answers to questions. We show that the LLM is more cost-efficient for generating "ideal" QA pairs. Additionally, we achieve a lower lexical similarity between questions and answers than BioASQ which provides an additional challenge to the top two QA models, as per the results. We release our code and our dataset publicly to encourage further research.
{"title":"RealMedQA: A pilot biomedical question answering dataset containing realistic clinical questions.","authors":"Gregory Kell, Angus Roberts, Serge Umansky, Yuti Khare, Najma Ahmed, Nikhil Patel, Chloe Simela, Jack Coumbe, Julian Rozario, Ryan-Rhys Griffiths, Iain J Marshall","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Clinical question answering systems have the potential to provide clinicians with relevant and timely answers to their questions. Nonetheless, despite the advances that have been made, adoption of these systems in clinical settings has been slow. One issue is a lack of question-answering datasets which reflect the real-world needs of health professionals. In this work, we present RealMedQA, a dataset of realistic clinical questions generated by humans and an LLM. We describe the process for generating and verifying the QA pairs and assess several QA models on BioASQ and RealMedQA to assess the relative difficulty of matching answers to questions. We show that the LLM is more cost-efficient for generating \"ideal\" QA pairs. Additionally, we achieve a lower lexical similarity between questions and answers than BioASQ which provides an additional challenge to the top two QA models, as per the results. We release our code and our dataset publicly to encourage further research.</p>","PeriodicalId":72180,"journal":{"name":"AMIA ... Annual Symposium proceedings. 
AMIA Symposium","volume":"2024 ","pages":"590-599"},"PeriodicalIF":0.0,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12099375/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144144715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
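The RealMedQA-vs-BioASQ comparison rests on lexical similarity between each question and its answer. One common way to measure that is token-level Jaccard overlap; the paper's exact metric may differ, so this is an illustrative sketch with made-up text.

```python
# Token-level Jaccard similarity: |tokens in common| / |tokens in either|.
# Lower values mean the answer shares fewer words with the question,
# making retrieval/matching harder for QA models.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

q = "what is the first-line treatment for community acquired pneumonia"
ans = "amoxicillin is recommended as first-line treatment"
print(round(jaccard(q, ans), 3))
# → 0.25
```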
This study explores the potential of utilizing administrative claims data, combined with advanced machine learning and deep learning techniques, to predict the progression of Chronic Kidney Disease (CKD) to End-Stage Renal Disease (ESRD). We analyze a comprehensive, 10-year dataset provided by a major health insurance organization to develop prediction models for multiple observation windows using traditional machine learning methods such as Random Forest and XGBoost as well as deep learning approaches such as Long Short-Term Memory (LSTM) networks. Our findings demonstrate that the LSTM model, particularly with a 24-month observation window, exhibits superior performance in predicting ESRD progression, outperforming existing models in the literature. We further apply SHapley Additive exPlanations (SHAP) analysis to enhance interpretability, providing insights into the impact of individual features on predictions at the individual patient level. This study underscores the value of leveraging administrative claims data for CKD management and predicting ESRD progression.
{"title":"Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques.","authors":"Yubo Li, Saba Al-Sayouri, Rema Padman","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>This study explores the potential of utilizing administrative claims data, combined with advanced machine learning and deep learning techniques, to predict the progression of Chronic Kidney Disease (CKD) to End-Stage Renal Disease (ESRD). We analyze a comprehensive, 10-year dataset provided by a major health insurance organization to develop prediction models for multiple observation windows using traditional machine learning methods such as Random Forest and XGBoost as well as deep learning approaches such as Long Short-Term Memory (LSTM) networks. Our findings demonstrate that the LSTM model, particularly with a 24-month observation window, exhibits superior performance in predicting ESRD progression, outperforming existing models in the literature. We further apply SHap-ley Additive exPlanations (SHAP) analysis to enhance interpretability, providing insights into the impact of individual features on predictions at the individual patient level. This study underscores the value of leveraging administrative claims data for CKD management and predicting ESRD progression.</p>","PeriodicalId":72180,"journal":{"name":"AMIA ... Annual Symposium proceedings. 
AMIA Symposium","volume":"2024 ","pages":"664-673"},"PeriodicalIF":0.0,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12099416/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144144822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
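Feeding claims history to an LSTM with a fixed observation window, as the abstract describes, requires windowing each patient's monthly sequence to a uniform length. A minimal sketch follows; the padding convention and single-feature sequence are assumptions, since the paper's feature construction is not detailed here.

```python
# Trim or left-pad a patient's monthly claims sequence to a fixed
# observation window (the paper's best model used a 24-month window).
def build_window(monthly_claims, window=24, pad_value=0.0):
    """Keep the most recent `window` months, left-padding short histories."""
    seq = monthly_claims[-window:]
    return [pad_value] * (window - len(seq)) + seq

print(len(build_window([1.0] * 30)))        # → 24  (long history trimmed)
print(build_window([5.0, 7.0], window=4))   # → [0.0, 0.0, 5.0, 7.0]
```

Left-padding keeps the most recent months aligned at the end of the sequence, which is the part an LSTM's final hidden state weights most directly.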
Emma Croxford, Yanjun Gao, Brian Patterson, Daniel To, Samuel Tesch, Dmitriy Dligach, Anoop Mayampurath, Matthew M Churpek, Majid Afshar
In the evolving landscape of clinical Natural Language Generation (NLG), assessing abstractive text quality remains challenging, as existing methods often overlook generative task complexities. This work aimed to examine the current state of automated evaluation metrics in NLG in healthcare. To have a robust and well-validated baseline with which to examine the alignment of these metrics, we created a comprehensive human evaluation framework. Employing ChatGPT-3.5-turbo generative output, we correlated human judgments with each metric. None of the metrics demonstrated high alignment; however, the SapBERT score, a metric grounded in the Unified Medical Language System (UMLS), showed the best results. This underscores the importance of incorporating domain-specific knowledge into evaluation efforts. Our work reveals the deficiency in quality evaluations for generated text and introduces our comprehensive human evaluation framework as a baseline. Future efforts should prioritize integrating medical knowledge databases to enhance the alignment of automated metrics, particularly focusing on refining the SapBERT score for improved assessments.
{"title":"Development of a Human Evaluation Framework and Correlation with Automated Metrics for Natural Language Generation of Medical Diagnoses.","authors":"Emma Croxford, Yanjun Gao, Brian Patterson, Daniel To, Samuel Tesch, Dmitriy Dligach, Anoop Mayampurath, Matthew M Churpek, Majid Afshar","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>In the evolving landscape of clinical Natural Language Generation (NLG), assessing abstractive text quality remains challenging, as existing methods often overlook generative task complexities. This work aimed to examine the current state of automated evaluation metrics in NLG in healthcare. To have a robust and well-validated baseline with which to examine the alignment of these metrics, we created a comprehensive human evaluation framework. Employing ChatGPT-3.5-turbo generative output, we correlated human judgments with each metric. None of the metrics demonstrated high alignment; however, the SapBERT score-a Unified Medical Language System (UMLS)- showed the best results. This underscores the importance of incorporating domain-specific knowledge into evaluation efforts. Our work reveals the deficiency in quality evaluations for generated text and introduces our comprehensive human evaluation framework as a baseline. Future efforts should prioritize integrating medical knowledge databases to enhance the alignment of automated metrics, particularly focusing on refining the SapBERT score for improved assessments.</p>","PeriodicalId":72180,"journal":{"name":"AMIA ... Annual Symposium proceedings. 
AMIA Symposium","volume":"2024 ","pages":"309-318"},"PeriodicalIF":0.0,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12099413/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144144496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
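The alignment check above correlates each automated metric with human ratings. A Spearman rank correlation is a standard choice for ordinal quality scores; the scores below are illustrative, not the paper's data, and this minimal implementation omits tie handling.

```python
# Spearman rank correlation between human ratings and an automated
# metric: rank both, then take Pearson correlation of the ranks.
# (No tie correction; assumes distinct values.)
import numpy as np

def spearman(x, y):
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

human = [1, 2, 3, 4, 5]             # hypothetical human quality ratings
metric = [0.1, 0.3, 0.2, 0.8, 0.9]  # hypothetical automated metric scores
print(round(spearman(human, metric), 2))
# → 0.9
```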
Qiuhao Lu, Rui Li, Andrew Wen, Jinlian Wang, Liwei Wang, Hongfang Liu
Large Language Models (LLMs) have revolutionized various sectors, including healthcare where they are employed in diverse applications. Their utility is particularly significant in the context of rare diseases, where data scarcity, complexity, and specificity pose considerable challenges. In the clinical domain, Named Entity Recognition (NER) stands out as an essential task and it plays a crucial role in extracting relevant information from clinical texts. Despite the promise of LLMs, current research mostly concentrates on document-level NER, identifying entities in a more general context across entire documents, without extracting their precise location. Additionally, efforts have been directed towards adapting ChatGPT for token-level NER. However, there is a significant research gap when it comes to employing token-level NER for clinical texts, especially with the use of local open-source LLMs. This study aims to bridge this gap by investigating the effectiveness of both proprietary and local LLMs in token-level clinical NER. Essentially, we delve into the capabilities of these models through a series of experiments involving zero-shot prompting, few-shot prompting, retrieval-augmented generation (RAG), and instruction fine-tuning. Our exploration reveals the inherent challenges LLMs face in token-level NER, particularly in the context of rare diseases, and suggests possible improvements for their application in healthcare. This research contributes to narrowing a significant gap in healthcare informatics and offers insights that could lead to a more refined application of LLMs in the healthcare sector.
{"title":"Large Language Models Struggle in Token-Level Clinical Named Entity Recognition.","authors":"Qiuhao Lu, Rui Li, Andrew Wen, Jinlian Wang, Liwei Wang, Hongfang Liu","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Large Language Models (LLMs) have revolutionized various sectors, including healthcare where they are employed in diverse applications. Their utility is particularly significant in the context of rare diseases, where data scarcity, complexity, and specificity pose considerable challenges. In the clinical domain, Named Entity Recognition (NER) stands out as an essential task and it plays a crucial role in extracting relevant information from clinical texts. Despite the promise of LLMs, current research mostly concentrates on document-level NER, identifying entities in a more general context across entire documents, without extracting their precise location. Additionally, efforts have been directed towards adapting ChatGPTfor token-level NER. However, there is a significant research gap when it comes to employing token-level NER for clinical texts, especially with the use of local open-source LLMs. This study aims to bridge this gap by investigating the effectiveness of both proprietary and local LLMs in token-level clinical NER. Essentially, we delve into the capabilities of these models through a series of experiments involving zero-shot prompting, few-shot prompting, retrieval-augmented generation (RAG), and instruction-fine-tuning. Our exploration reveals the inherent challenges LLMs face in token-level NER, particularly in the context of rare diseases, and suggests possible improvements for their application in healthcare. This research contributes to narrowing a significant gap in healthcare informatics and offers insights that could lead to a more refined application of LLMs in the healthcare sector.</p>","PeriodicalId":72180,"journal":{"name":"AMIA ... Annual Symposium proceedings. 
AMIA Symposium","volume":"2024 ","pages":"748-757"},"PeriodicalIF":0.0,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12099373/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144144361","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
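The token-level/document-level distinction the abstract draws is concrete in how scoring works: token-level NER credits a model only for tokens labeled correctly in place. A minimal scorer over BIO tags illustrates the stricter criterion; the tag set and examples are hypothetical.

```python
# Micro F1 over non-O tokens for aligned BIO tag sequences: each token's
# predicted tag must match the gold tag at the same position, so a
# partially recovered entity earns only partial credit.
def token_f1(gold, pred):
    tp = sum(g == p != "O" for g, p in zip(gold, pred))
    fp = sum(p != "O" and g != p for g, p in zip(gold, pred))
    fn = sum(g != "O" and g != p for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = ["B-DISEASE", "I-DISEASE", "O", "O"]
pred = ["B-DISEASE", "O", "O", "O"]  # found the entity but missed a token
print(round(token_f1(gold, pred), 2))
# → 0.67
```

A document-level evaluation would count this prediction as fully correct (the disease mention was detected somewhere), which is why LLMs that look strong at document level can struggle under token-level scoring.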
The use of artificial intelligence (AI) in medicine has shown promise to improve the quality of healthcare decisions. However, AI can be biased in a manner that produces unfair predictions for certain demographic subgroups. In MIMIC-CXR, a publicly available dataset of over 300,000 chest X-ray images, diagnostic AI has been shown to have a higher false negative rate for racial minorities. We evaluated the capacity of synthetic data augmentation, oversampling, and demographic-based corrections to enhance the fairness of AI predictions. We show that adjusting unfair predictions for demographic attributes, such as race, is ineffective at improving fairness or predictive performance. However, using oversampling and synthetic data augmentation to modify disease prevalence reduced such disparities by 74.7% and 10.6%, respectively. Moreover, such fairness gains were accomplished without reduction in performance (95% CI AUC: [0.816, 0.820] versus [0.810, 0.819] versus [0.817, 0.821] for baseline, oversampling, and augmentation, respectively).
{"title":"Enhancement of Fairness in AI for Chest X-ray Classification.","authors":"Nicholas J Jackson, Chao Yan, Bradley A Malin","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>The use of artificial intelligence (AI) in medicine has shown promise to improve the quality of healthcare decisions. However, AI can be biased in a manner that produces unfair predictions for certain demographic subgroups. In MIMIC-CXR, a publicly available dataset of over 300,000 chest X-ray images, diagnostic AI has been shown to have a higher false negative rate for racial minorities. We evaluated the capacity of synthetic data augmentation, oversampling, and demographic-based corrections to enhance the fairness of AI predictions. We show that adjusting unfair predictions for demographic attributes, such as race, is ineffective at improving fairness or predictive performance. However, using oversampling and synthetic data augmentation to modify disease prevalence reduced such disparities by 74.7% and 10.6%, respectively. Moreover, such fairness gains were accomplished without reduction in performance (95% CI AUC: [0.816, 0.820] versus [0.810, 0.819] versus [0.817, 0.821] for baseline, oversampling, and augmentation, respectively).</p>","PeriodicalId":72180,"journal":{"name":"AMIA ... Annual Symposium proceedings. AMIA Symposium","volume":"2024 ","pages":"551-560"},"PeriodicalIF":0.0,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12099404/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144144579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}