To develop and externally validate novel, data-driven algorithms that are based on appropriate variable selection methods for identifying patients with Behçet’s disease in Japan.
Methods
This retrospective cross-sectional study included 13,538 patients from six tertiary hospitals (November–December 2023). One year of claims data was linked to chart-confirmed Behçet’s disease diagnoses. Patients were randomly divided into training (n = 8,811) and test (n = 3,775) sets, with external validation (n = 952) from another hospital. Feature selection among Behçet’s disease-coded patients used the Least Absolute Shrinkage and Selection Operator, Boruta, and Recursive Feature Elimination. The diagnostic performance of the rule-based algorithms, which were derived from the decision tree models, was evaluated using accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value, and F1 score.
Results
Diagnosis codes alone achieved high sensitivity (1.000) and specificity (0.992) but modest PPV (0.767, test set; 0.850, external validation). Incorporating sulphamethoxazole–trimethoprim and colchicine prescriptions improved the positive predictive value, which was 0.793 in the test set and 0.865 in external validation.
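The contrast between near-perfect specificity and a modest PPV follows from the metric definitions: when true cases are rare, even a small false-positive rate produces many false positives relative to true positives. A minimal Python sketch of the metrics named in the Methods is given below. The counts are hypothetical, chosen only to give figures of similar magnitude; they are not the study's data, and the function name is illustrative.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return standard metrics computed from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)           # true positive rate
    specificity = tn / (tn + fp)           # true negative rate
    ppv = tp / (tp + fp)                   # positive predictive value
    npv = tn / (tn + fn)                   # negative predictive value
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts for illustration only (NOT taken from the study).
metrics = diagnostic_metrics(tp=100, fp=30, fn=0, tn=3870)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts, specificity stays above 0.99 while PPV is about 0.77, because the 30 false positives are set against only 100 true positives.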
Conclusion
Incorporating prescriptions alongside diagnosis codes improved PPV while maintaining high sensitivity and specificity. Building upon a data-driven framework that integrates variable selection methods and decision tree analysis, this study provides a validated and scalable approach for reliable claims-based research on Behçet’s disease.
Title: Development and validation of data-driven, decision tree–based algorithms for identifying Behçet’s disease in claims data. Authors: Ken-ei Sada, Yoshia Miyawaki, Ryo Yanai, Takashi Kida, Akira Onishi, Ryusuke Yoshimi, Kunihiro Ichinose, Yasuhiro Shimojima. Journal: International Journal of Medical Informatics, volume 209, Article 106266. Published 2026-01-05. DOI: 10.1016/j.ijmedinf.2026.106266
Pub Date : 2026-01-04 DOI: 10.1016/j.ijmedinf.2026.106265
Maryam Y. Garza , Zhan Wang , Bhargav Adagarla , Michael W. Rutherford , Umit Topaloglu , Daniel K. Benjamin , Kanecia O. Zimmerman , Eric L. Eisenstein , Karan R. Kumar , on behalf of the Best Pharmaceuticals for Children Act – Pediatric Trials Network Steering Committee
Background
eSource technologies that exchange patient data from electronic health records (EHR) to clinical study electronic data capture (EDC) systems can reduce data quality errors and decrease data collection time. However, the availability of site-specific EHR data to support pediatric studies has not been evaluated.
Methods
We used a previously developed data element mapping procedure to evaluate the HL7® FHIR® standard’s coverage in multi-center pediatric clinical studies. Four study sites independently mapped three pediatric studies’ case report forms (CRFs) to their site’s EHR and FHIR® server data elements.
Results
Site investigators mapped 4152 total and 2070 distinct data elements. Only 33.8 % of total CRF data elements (n = 1402) and 27.4 % of distinct data elements (n = 568) could be mapped in FHIR® at the four sites. However, the percentage of total data elements mapped varied by pediatric study (55.3 %, 30.8 %, and 26.2 %) and by study site (46.4 %, 32.3 %, 27.8 %, and 26.6 %). The percentage of total CRF data elements mapped was higher in domains containing standard-of-care data (e.g., Concomitant Medications, Demographics, Diagnosis/Procedures, Medical History, and Vital Signs) and lower in domains containing protocol-specific data (e.g., Adverse Events, Concomitant Medications, Enrollment/Eligibility/Consent, and study treatment-related Labs and Vital Signs).
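The headline coverage figures can be reproduced from the reported counts. The short Python sketch below uses only numbers quoted in the Results; the helper name is illustrative.

```python
def coverage_percent(mapped, total):
    """Percentage of data elements that could be mapped, to one decimal place."""
    return round(100 * mapped / total, 1)


# Counts as reported in the abstract.
total_coverage = coverage_percent(1402, 4152)     # total CRF data elements
distinct_coverage = coverage_percent(568, 2070)   # distinct data elements

print(total_coverage, distinct_coverage)
```

Both values agree with the reported 33.8 % and 27.4 %.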
Conclusions
There is substantial between-study and between-site variability in the percentage of pediatric study data elements available in FHIR® at study sites. These results suggest that mapping solutions for pediatric studies utilizing eSource technologies will need to be site-specific.
Title: Evaluation of electronic health record to HL7® FHIR® mappings in pediatric research studies. Journal: International Journal of Medical Informatics, volume 209, Article 106265. Published 2026-01-04.
Pub Date : 2026-01-03 DOI: 10.1016/j.ijmedinf.2026.106263
Ying Wang , Yanan Zhou , Yu Gong , Zhenbin Ding , Liuxiao Yang , Ting Wang
Background
Liver transplantation (LT) is a life-saving procedure for patients with end-stage liver disease, yet postoperative complications, particularly the need for respiratory support, remain a significant challenge. We aimed to develop and validate a machine learning (ML)-based predictive tool for postoperative respiratory support requirement in liver transplant recipients.
Methods
This single-center retrospective study was conducted at Zhongshan Hospital, Fudan University (Shanghai, China) from January 2018 to October 2023. Following data preprocessing, key variables were selected through univariate analysis, recursive feature elimination (RFE), the chi-square test, and correlation analysis. Nine ML models were initially constructed and optimized via grid search with 5-fold cross-validation. The final model was selected based on area under the curve (AUC), accuracy, sensitivity, specificity, and F1-score, followed by comparative analysis with conventional scoring systems. Model interpretability was provided by Shapley additive explanations (SHAP) analysis, which gave both global and local explanations. For clinical implementation, we developed an online application platform for real-time prediction.
Results
The study included 1121 liver transplant recipients, divided into a discovery cohort (n = 749) and a validation cohort (n = 372). Significant differences (P < 0.05) were observed between patients who required respiratory support and those who did not across multiple preoperative, intraoperative, and postoperative parameters. After hyperparameter optimization, the random forest (RF), stochastic gradient boosting (SGB), and logistic regression (LR) models were applied to the validation cohort, and RF was selected as the final predictive tool. RF achieved an AUC of 0.790 (95 % CI: 0.723–0.857) in the test set and 0.713 (95 % CI: 0.658–0.767) in the validation cohort, significantly outperforming both the Model for End-Stage Liver Disease (MELD) and the Acute Physiology and Chronic Health Evaluation II (APACHE II) scores. SHAP analysis revealed complex bidirectional relationships between predictors and outcomes, with certain variables showing both protective and risk-enhancing effects depending on clinical context.
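As a reminder of what the reported AUC values measure, the AUC equals the probability that a randomly chosen patient who needed respiratory support receives a higher predicted risk than a randomly chosen patient who did not. The Python sketch below computes it by direct pairwise comparison. The scores are toy values, not study data, and this is not the authors' implementation.

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy predicted risks for illustration only.
example_auc = auc(pos_scores=[0.9, 0.8, 0.4], neg_scores=[0.7, 0.3, 0.2])
print(f"{example_auc:.3f}")
```

Here 8 of the 9 pairs are ordered correctly. The pairwise form is quadratic in cohort size, so it suits small illustrations rather than large datasets.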
Conclusions
Based on large-scale clinical data, we developed a robust predictive model that can effectively assess the need for postoperative respiratory support in liver transplant recipients, thereby facilitating clinical decision-making and potentially improving patient outcomes. However, future multi-center validation is warranted to confirm generalizability.
Title: A machine learning-driven app for predicting the need for post-operative respiratory support in liver transplant recipients. Journal: International Journal of Medical Informatics, volume 209, Article 106263. Published 2026-01-03.
Pub Date : 2026-01-02 DOI: 10.1016/j.ijmedinf.2026.106261
Egidio de Mattia , Filippo Paoletti , Daniela Pedicino , Giovanna Liuzzo , Carmen Angioletti , Alessia d’Aiello , Alessio Perilli , Andrea Adduci , Giovanni Arcuri , Emilio Meneschincheri , Barbara Ruffo , Melissa D’Agostino , Rita De Donno , Antonio Giulio de Belvis
Background
Timely primary percutaneous coronary intervention (pPCI) is the most important treatment for improving outcomes in ST-segment elevation myocardial infarction (STEMI), and treatment delays are strongly associated with morbidity and mortality. The present study aims to define the main steps for setting up a real-time digital monitoring dashboard to improve the clinical performance of STEMI management and to evaluate the impact of its implementation on the proportion of patients receiving pPCI within 90 min.
Methods
Setting up the digital monitoring system required the definition of detailed algorithms for the diagnosis, treatment, and rehabilitation/follow-up phases. For each patient with a diagnosis of STEMI included in the clinical pathway (CP), a multidisciplinary working group identified (i) rules for flagging patients along the CP, based on specific risk scores, and (ii) the critical points of the CP to be monitored, such as door-to-balloon time, intensive care unit length of stay, and total hospital length of stay. An interrupted time series analysis and multivariable logistic regression models were used to assess changes in the outcome (pPCI within 90 min) after the platform implementation, adjusting for temporal and individual confounders.
Results
After the introduction of the dashboard, the proportion of timely pPCI improved from 40 % pre-implementation to 65 % post-implementation. Adjusted models indicated a twofold increase in the odds of meeting the 90-minute benchmark (OR = 2.00; 95 % CI: 0.99–4.12).
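For orientation, the crude (unadjusted) odds ratio implied by the two proportions can be computed directly. It need not match the reported estimate, which comes from models adjusted for temporal and individual confounders. The Python sketch below uses only figures quoted above; the helper names are illustrative.

```python
def odds(p):
    """Convert a proportion into odds."""
    return p / (1 - p)


def interval_contains(lower, upper, value=1.0):
    """Check whether a confidence interval spans a given value (default: OR = 1)."""
    return lower <= value <= upper


# Crude odds ratio from the reported proportions (65 % post vs 40 % pre).
crude_or = odds(0.65) / odds(0.40)

# Reported 95 % CI for the adjusted OR: 0.99 to 4.12.
includes_null = interval_contains(0.99, 4.12)

print(f"crude OR = {crude_or:.2f}, CI includes 1: {includes_null}")
```

The crude ratio is about 2.79, and the reported interval for the adjusted estimate has a lower bound just below 1.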
Conclusion
The real-time monitoring system showed a positive impact on the timely management of STEMI, highlighting the potential for improving healthcare efficiency and patient outcomes.
Title: Advantages and challenges of tracking ST-segment elevation myocardial infarction patients with a real-time dashboard: A single-centre experience. Journal: International Journal of Medical Informatics, volume 209, Article 106261. Published 2026-01-02.
Pub Date : 2025-12-31 DOI: 10.1016/j.ijmedinf.2025.106248
Yuanyuan Liu , Yu Zhang, Haoran Mao
Background
The rapid growth in research on ChatGPT’s healthcare applications has led to diverse evaluation methods and substantially heterogeneous findings, undermining evidence reliability and hindering clinical translation.
Objectives
This review aims to examine how different evaluation methods shape our understanding of ChatGPT’s effectiveness in healthcare.
Methods
Following the PRISMA guidelines, this review included peer-reviewed studies published between 2023 and 2024 that assessed the use of ChatGPT in medical or healthcare-related contexts, spanning clinical, educational, and diagnostic domains. In total, 131 studies were analyzed.
Results
The results indicate that predominant evaluation approaches—controlled trial studies, expert assessment studies, measurement-based evaluation studies, and prompt generation analysis studies—systematically influence conclusions about ChatGPT’s performance due to their inherent methodological characteristics, such as subjectivity, objectivity, and differences in ecological validity. Further analysis reveals that ChatGPT’s performance is highly context-dependent, shaped by specific application scenarios, model versions, and prompting strategies.
Conclusions
To address methodological heterogeneity and the lack of standardization, this study recommends multi-method cross-validation strategies and a risk-stratified, standardized evaluation framework. These steps are essential to enhance the scientific rigor and reliability of ChatGPT’s assessment in healthcare and to provide a solid foundation for its clinical integration.
Title: A scoping review: how evaluation methods shape our understanding of ChatGPT’s effectiveness in healthcare. Journal: International Journal of Medical Informatics, volume 209, Article 106248. Published 2025-12-31.
Pub Date : 2025-12-31 DOI: 10.1016/j.ijmedinf.2025.106250
Maud M.G. Jacobs , Jacobien H.F. Oosterhoff , Rintje Agricola , Walter van der Weegen
Objective
The rapid expansion of digital healthcare has increased the volume of patient communication, adding to the workload of healthcare professionals. Large Language Models (LLMs) hold promise for offering automated responses to patient questions relayed through eHealth platforms, yet concerns persist regarding their effectiveness, accuracy, and limitations in healthcare settings. This study aims to evaluate the current evidence on the performance and perceived suitability of LLMs in healthcare, focusing on their role in supporting clinical decision-making and patient communication.
Materials and methods
A systematic search in PubMed and Embase up to June 11, 2025 identified 330 studies, of which 20 met the inclusion criteria for comparing the accuracy and adequacy of medical information provided by LLMs versus healthcare professionals and guidelines. The search strategy combined terms related to LLMs, healthcare professionals, and patient questions. The ROBINS-I tool assessed the risk of bias.
Results
Nineteen studies focused on medical specialties and one on the primary care setting. Twelve studies favored the responses generated by LLMs, six reported mixed results, and two favored the healthcare professionals’ responses. Bias components generally scored moderate to low, indicating a low risk of bias.
Discussion and conclusions
The review summarizes current evidence on the accuracy and adequacy of medical information provided by LLMs in response to patient questions, compared to healthcare professionals and clinical guidelines. While LLMs show potential as supportive tools in healthcare, their integration should be approached cautiously due to inconsistent performance and possible risks. Further research is essential before widespread adoption.
Title: Large language models versus healthcare professionals in providing medical information to patient questions: A systematic review. Journal: International Journal of Medical Informatics, volume 209, Article 106250. Published 2025-12-31.
Pub Date : 2025-12-30 DOI: 10.1016/j.ijmedinf.2025.106247
Sudarshan Srinivasan , Caitlin Rizy , Maria Mahbub , David Bolme , Alina Peluso , Jodie Trafton , Ioana Danciu
Objective
To develop and evaluate an automated system for identifying healthcare barriers focusing on transportation issues in veterans’ clinical notes using large language models (LLMs) and to assess the impact of different prompting strategies on classification performance and explanation consistency.
Methods
We developed a hybrid system combining pattern matching for templated notes with LLM analysis for free-text notes. Using 2000 manually annotated clinical notes, we compared four prompting strategies (dual-role short, dual-role long, analysis-first, analysis-only) across Mistral-7B and Llama-3.1 models. We evaluated classification performance using standard metrics and assessed explanation consistency through embedding similarity analysis.
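The abstract does not specify the embedding model or how similarities were aggregated. As one plausible reading, explanation consistency can be summarized as the mean cosine similarity over all pairs of explanation embeddings. The Python sketch below illustrates that calculation under this assumption, with toy two-dimensional vectors standing in for real embeddings; the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy "embeddings" for illustration only; real ones have hundreds of dimensions.
toy_embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
consistency = mean_pairwise_similarity(toy_embeddings)
print(f"{consistency:.3f}")
```

Within-model consistency would apply this to explanations from repeated runs of one model, and cross-model similarity to pairs drawn from different models.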
Results
The analysis-first strategy achieved superior performance, with Mistral-7B reaching an F1 score of 0.914, outperforming traditional machine learning approaches (GBM: 0.786, BERT: 0.811). LLMs demonstrated higher explanation consistency within models (mean cosine similarity 0.887–0.908) compared to cross-model similarities (0.767–0.872). Pattern matching successfully handled 6.7% of templated notes deterministically. Mistral-7B showed greater internal consistency but higher abstention rates compared to Llama-3.1.
Conclusion
Requiring LLMs to analyze evidence before classification improves both accuracy and explanation consistency for identifying transportation barriers in clinical notes. This approach enables automated barrier detection at scale while providing clinically relevant explanations, supporting both population-level healthcare planning and individual patient care decisions.
{"title":"Leveraging large language models to automate the identification of healthcare access barriers for veterans","authors":"Sudarshan Srinivasan , Caitlin Rizy , Maria Mahbub , David Bolme , Alina Peluso , Jodie Trafton , Ioana Danciu","doi":"10.1016/j.ijmedinf.2025.106247","DOIUrl":"10.1016/j.ijmedinf.2025.106247","url":null,"abstract":"<div><h3>Objective</h3><div>To develop and evaluate an automated system for identifying healthcare barriers focusing on transportation issues in veterans’ clinical notes using large language models (LLMs) and to assess the impact of different prompting strategies on classification performance and explanation consistency.</div></div><div><h3>Methods</h3><div>We developed a hybrid system combining pattern matching for templated notes with LLM analysis for free-text notes. Using 2000 manually annotated clinical notes, we compared four prompting strategies (dual-role short, dual-role long, analysis-first, analysis-only) across Mistral-7B and Llama-3.1 models. We evaluated classification performance using standard metrics and assessed explanation consistency through embedding similarity analysis.</div></div><div><h3>Results</h3><div>The analysis-first strategy achieved superior performance, with Mistral-7B reaching an F1 score of 0.914, outperforming traditional machine learning approaches (GBM: 0.786, BERT: 0.811). LLMs demonstrated higher explanation consistency within models (mean cosine similarity 0.887–0.908) compared to cross-model similarities (0.767–0.872). Pattern matching successfully handled 6.7% of templated notes deterministically. Mistral-7B showed greater internal consistency but higher abstention rates compared to Llama-3.1.</div></div><div><h3>Conclusion</h3><div>Requiring LLMs to analyze evidence before classification improves both accuracy and explanation consistency for identifying transportation barriers in clinical notes. 
This approach enables automated barrier detection at scale while providing clinically relevant explanations, supporting both population-level healthcare planning and individual patient care decisions.</div></div>","PeriodicalId":54950,"journal":{"name":"International Journal of Medical Informatics","volume":"209 ","pages":"Article 106247"},"PeriodicalIF":4.1,"publicationDate":"2025-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145936217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-30 DOI: 10.1016/j.ijmedinf.2025.106237
Martina Cavallucci , Alice Andalò , Valentina Danesi , Nicola Gentili , Ilaria Massa , Emanuela Scarpi , Maria Chiara Restuccia , Roberto Vespignani , Alice Conficconi , Michela Palleschi , Ugo De Giorgi , Antonino Musolino , Filippo Merloni
Background
Artificial Intelligence (AI) is increasingly integrated into oncology, offering opportunities to improve diagnostics, treatment planning, and operational efficiency. However, patient perspectives on AI, especially regarding data protection and ethical implications, remain underexplored.
Objective
To investigate cancer patients’ attitudes toward the use of AI in healthcare, focusing on their awareness of data protection, perceived risks and benefits, and the conditions under which AI is considered acceptable, and to examine how demographic and educational factors influence patients’ views within the context of an Italian comprehensive cancer center.
Methods
A cross-sectional survey was conducted with 117 cancer patients who completed a 28-item online questionnaire. The survey evaluated levels of AI knowledge, perceptions of data privacy, concerns about AI in medical contexts, and willingness to share health data for research.
Results
Most participants demonstrated moderate awareness of AI (70.1%) and its medical applications (85.5%), with higher familiarity observed among younger and more educated individuals. While data protection understanding varied, 76.9% were willing to share personal health data for research aimed at improving cancer care. Concerns included reduced physician autonomy (52.1%) and diminished physician-patient interaction (63.3%). However, 82.9% of respondents found AI acceptable when clinical decisions remained under physician control. AI was most favorably viewed for administrative support and care process optimization.
Conclusion
Cancer patients generally view AI in healthcare positively, especially when it maintains physician oversight and safeguards data privacy. To ensure equitable and informed adoption, targeted educational initiatives and transparent communication strategies should address generational, educational, and digital literacy differences.
{"title":"Survey on cancer patients’ attitudes towards AI and data protection: A cross-sectional study from an Italian cancer center","authors":"Martina Cavallucci , Alice Andalò , Valentina Danesi , Nicola Gentili , Ilaria Massa , Emanuela Scarpi , Maria Chiara Restuccia , Roberto Vespignani , Alice Conficconi , Michela Palleschi , Ugo De Giorgi , Antonino Musolino , Filippo Merloni","doi":"10.1016/j.ijmedinf.2025.106237","DOIUrl":"10.1016/j.ijmedinf.2025.106237","url":null,"abstract":"<div><div><strong>Background</strong></div><div>Artificial Intelligence (AI) is increasingly integrated into oncology, offering opportunities to improve diagnostics, treatment planning, and operational efficiency. However, patient perspectives on AI, especially regarding data protection and ethical implications, remain underexplored.</div><div><strong>Objective</strong></div><div>The objective of this study is to investigate cancer patients’ attitudes toward the use of Artificial Intelligence (AI) in healthcare, focusing on their awareness of data protection, perceived risks and benefits, and the conditions under which AI is considered acceptable. Additionally, the study aims to examine how demographic and educational factors influence patients’ views within the context of an Italian comprehensive cancer center.</div><div><strong>Methods</strong></div><div>A cross-sectional survey was conducted with 117 cancer patients who completed a 28-item online questionnaire. The survey evaluated levels of AI knowledge, perceptions of data privacy, concerns about AI in medical contexts, and willingness to share health data for research.</div><div><strong>Results</strong></div><div>Most participants demonstrated moderate awareness of AI (70.1%) and its medical applications (85.5%), with higher familiarity observed among younger and more educated individuals. While data protection understanding varied, 76.9% were willing to share personal health data for research aimed at improving cancer care. 
Concerns included reduced physician autonomy (52.1%) and diminished physician-patient interaction (63.3%). However, 82.9% of respondents found AI acceptable when clinical decisions remained under physician control. AI was most favorably viewed for administrative support and care process optimization.</div><div><strong>Conclusion</strong></div><div>Cancer patients generally view AI in healthcare positively, especially when it maintains physician oversight and safeguards data privacy. To ensure equitable and informed adoption, targeted educational initiatives and transparent communication strategies should address generational, educational, and digital literacy differences.</div></div>","PeriodicalId":54950,"journal":{"name":"International Journal of Medical Informatics","volume":"209 ","pages":"Article 106237"},"PeriodicalIF":4.1,"publicationDate":"2025-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145940468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-30 DOI: 10.1016/j.ijmedinf.2025.106260
Xiaoyu Bai , Shuaijing Huang , Shu Huang , Bin Wang , Aijing Zhu , Suxia Qi , Yujia Gao , Hao Zhu , Tingwang Jiang , Bin Zhang , Yadong Feng
Objective
This study aims to develop and validate interpretable machine learning (ML) models to dynamically predict mortality risk among intensive care unit (ICU) patients diagnosed with acute pancreatitis complicated by acute kidney injury (AP-AKI).
Methods
The clinical data in the training set, including demographic characteristics, laboratory indicators, scoring systems, treatment modalities, and clinical management strategies, were obtained from three large-scale medical databases: the Medical Information Mart for Intensive Care, and the eICU Collaborative Research Database. The external validation set consisted of patients recruited from two independent hospitals. Predictive feature selection was conducted using univariate logistic regression, LASSO regularization, and multivariate logistic regression. Eleven machine learning (ML) algorithms—eXtreme Gradient Boosting (XGBoost), Logistic Regression (LR), Adaptive Boosting (AdaBoost), Decision Tree, Gaussian Naive Bayes (GNB), Multi-Layer Perceptron (MLP), Support Vector Machine (SVM), Bernoulli Naive Bayes (BernoulliNB), Linear Discriminant Analysis, LinearSVC, and Stochastic Gradient Descent (SGD)—were employed to develop predictive models. Finally, the SHapley Additive exPlanations (SHAP) method was applied to interpret the importance and directional effects of individual features.
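To make the idea behind SHAP-style attributions concrete, the sketch below computes exact Shapley values by brute-force enumeration of feature orderings. It is not the SHAP library and not the study's model: the `toy_risk` score, its weights, and the three feature names are hypothetical, chosen only so the result can be checked by hand.

```python
from itertools import permutations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict(x) relative to predict(baseline).

    Features are switched from their baseline value to their observed value
    one at a time, and each feature's marginal effect is averaged over all
    orderings. This is feasible only for a handful of features.
    """
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]
            updated = predict(current)
            phi[i] += updated - previous
            previous = updated
    total = factorial(n)
    return [p / total for p in phi]


def toy_risk(features):
    """Hypothetical linear risk score; weights are illustrative, not from the study."""
    creatinine, age, albumin = features
    return 0.5 * creatinine + 0.02 * age - 0.3 * albumin


if __name__ == "__main__":
    patient = [3.0, 70.0, 2.5]
    reference = [1.0, 60.0, 4.0]
    print(shapley_values(toy_risk, patient, reference))
```

For a linear score each attribution reduces to the weight times the difference from the baseline, and the attributions sum to the difference between the two predictions. The sign of each value gives the direction of the feature's effect for that patient.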
Results
Dynamic in-hospital mortality prediction was performed at 24 h, 48 h, and 7 days after ICU admission, with nine to twelve predictor variables selected depending on the time point. The XGBoost model outperformed the 10 other machine learning models, achieving training-set AUROCs of 0.961 (95% CI 0.95–0.97), 0.947 (95% CI 0.94–0.96), and 0.968 (95% CI 0.96–0.98) at these time points. The corresponding external validation AUROCs were 0.871 (95% CI 0.79–0.95), 0.799 (95% CI 0.66–0.94), and 0.667 (95% CI 0.47–0.87). For 90-day post-discharge mortality prediction, six variables were selected. The XGBoost model again demonstrated the best performance, with a training-set AUROC of 0.966 (95% CI 0.96–0.97) and an external validation AUROC of 0.745 (95% CI 0.61–0.88).
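For readers unfamiliar with the metric, AUROC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition; the function name `auroc` and the toy labels and scores are illustrative and unrelated to the study data.

```python
def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly.

    labels: iterable of 0/1 outcomes; scores: predicted risks of the same length.
    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    correct = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy example: three of the four positive/negative pairs are ordered correctly.
    print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration. Library implementations typically use a rank-based formulation instead.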
Conclusion
Web-based prognostic tools were developed to support clinical decision-making and optimize ICU bed resource management.
{"title":"Development and validation of interpretable machine learning models for dynamic prediction of prognosis in acute pancreatitis complicated by acute kidney injury: A multicenter study","authors":"Xiaoyu Bai , Shuaijing Huang , Shu Huang , Bin Wang , Aijing Zhu , Suxia Qi , Yujia Gao , Hao Zhu , Tingwang Jiang , Bin Zhang , Yadong Feng","doi":"10.1016/j.ijmedinf.2025.106260","DOIUrl":"10.1016/j.ijmedinf.2025.106260","url":null,"abstract":"<div><h3>Objective</h3><div>This study aims to develop and validate interpretable machine learning (ML) models to dynamically predict mortality risk among intensive care unit (ICU) patients diagnosed with acute pancreatitis complicated by acute kidney injury (AP-AKI).</div></div><div><h3>Methods</h3><div>The clinical data in the training set, including demographic characteristics, laboratory indicators, scoring systems, treatment modalities, and clinical management strategies, were obtained from three large-scale medical databases: the Medical Information Mart for Intensive Care, and the eICU Collaborative Research Database. The external validation set consisted of patients recruited from two independent hospitals. Predictive feature selection was conducted using univariate logistic regression, LASSO regularization, and multivariate logistic regression. Eleven machine learning (ML) algorithms—eXtreme Gradient Boosting (XGBoost), Logistic Regression (LR), Adaptive Boosting (AdaBoost), Decision Tree, Gaussian Naive Bayes (GNB), Multi-Layer Perceptron (MLP), Support Vector Machine (SVM), Bernoulli Naive Bayes (BernoulliNB), Linear Discriminant Analysis, LinearSVC, and Stochastic Gradient Descent (SGD)—were employed to develop predictive models. 
Finally, the SHapley Additive exPlanations (SHAP) method was applied to interpret the importance and directional effects of individual features.</div></div><div><h3>Results</h3><div>Dynamic in-hospital mortality prediction was performed at 24 h, 48 h, and 7 days post-ICU admission, identifying nine to twelve variables respectively. The XGBoost model outperformed 10 other machine learning models, achieving training set AUROCs of 0.961 (95 % CI 0.95–0.97), 0.947 (95 % CI 0.94–0.96), and 0.968 (95 % CI 0.96–0.98) at these time points. The corresponding external validation results were 0.871 (95 % CI 0.79–0.95), 0.799 (95 % CI 0.66–0.94), and 0.667 (95 % CI 0.47–0.87). Regarding 90-day post-discharge mortality prediction, six variables were selected. The XGBoost model demonstrated superior performance, with a training set AUROC of 0.966 (95 % CI 0.96–0.97) and an external validation AUROC of 0.745 (95 % CI 0.61–0.88).</div></div><div><h3>Conclusion</h3><div>Web-based prognostic tools were developed to support clinical decision-making and optimize ICU bed resource management.</div></div>","PeriodicalId":54950,"journal":{"name":"International Journal of Medical Informatics","volume":"209 ","pages":"Article 106260"},"PeriodicalIF":4.1,"publicationDate":"2025-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145901712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-28 DOI: 10.1016/j.ijmedinf.2025.106249
Wen-Jiang Yang
{"title":"Commentary on “Towards practical federated learning and evaluation for medical prediction models”","authors":"Wen-Jiang Yang","doi":"10.1016/j.ijmedinf.2025.106249","DOIUrl":"10.1016/j.ijmedinf.2025.106249","url":null,"abstract":"","PeriodicalId":54950,"journal":{"name":"International Journal of Medical Informatics","volume":"208 ","pages":"Article 106249"},"PeriodicalIF":4.1,"publicationDate":"2025-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145879347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}