Unlabelled: Retrieval-augmented generation (RAG) systems have emerged as a powerful technique for enhancing the capabilities of large language models by enabling them to access external, up-to-date knowledge in real time, and they are increasingly being adopted in medical research. In this viewpoint article, we explore the ethical imperatives for implementing RAG systems in clinical nursing environments, with particular attention to how these technologies affect patient care quality and safety. The purpose of this paper is to examine the ethical risks introduced by RAG-enhanced large language models in clinical nursing and to propose strategic guidelines for their responsible implementation. Key considerations, discussed through a structured analysis, include ensuring accuracy, fairness, transparency, and accountability, as well as maintaining essential human oversight. We argue that robust data governance, explainable artificial intelligence (AI) techniques, and continuous monitoring are critical components of a responsible RAG implementation strategy. Ultimately, realizing the benefits of RAG while mitigating ethical concerns requires sustained collaboration among health care professionals, AI developers, and policymakers, fostering a future where AI supports patient safety, reduces disparities, and improves the quality of nursing care.
Ethical Imperatives for Retrieval-Augmented Generation in Clinical Nursing: Viewpoint on Responsible AI Use. Xinyi Tu, Chenghao Shi, Peilin Qian, Lizhu Wang. JMIR Medical Informatics. 2026;14:e79922. doi: 10.2196/79922. Published January 9, 2026.
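The retrieve-then-generate loop behind the RAG systems discussed above can be sketched minimally. This is an illustrative toy, not any system reviewed in the article: the corpus, the token-overlap scorer (a stand-in for dense-vector search), and the prompt template are all assumptions; grounding answers in retrieved, citable context is one concrete way such systems can support the transparency the viewpoint calls for.

```python
def tokenize(text):
    """Lowercased whitespace tokens, as a crude set representation."""
    return set(text.lower().split())

def retrieve(query, documents, k=2):
    """Rank documents by token overlap with the query (a stand-in
    for the embedding-based retrieval used in production RAG)."""
    q = tokenize(query)
    scored = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return scored[:k]

def build_prompt(query, documents):
    """Ground the model's answer in retrieved context so the sources
    can be surfaced for auditability and human oversight."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\nQuestion: {query}")

# Hypothetical nursing-knowledge snippets
corpus = [
    "Pressure injury risk should be reassessed every shift.",
    "Hand hygiene reduces hospital-acquired infections.",
    "Falls risk increases with sedating medication.",
]
print(build_prompt("How often should pressure injury risk be reassessed?", corpus))
```

In a real deployment, the prompt would be passed to a language model and the retrieved sources logged alongside the answer for accountability.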
Bikram Bains, Sampath Rapuri, Edgar Robitaille, Jonathan Wang, Arnav Khera, Catalina Gomez, Eduardo Reyes, Cole Perry, Jason Wilson, Elizabeth Tracey
<p><strong>Background: </strong>This Is My Story (TIMS) was started by Chaplain Elizabeth Tracey to promote a humanistic approach to medicine. Patients in the TIMS program are the subject of a guided conversation in which a chaplain interviews either the patient or their loved one. The interviewee is asked four questions to elicit clinically actionable information; sharing this information has been shown to improve communication between patients and medical providers and to strengthen providers' empathy. The original recorded conversation is edited into a condensed audio file approximately 1 minute and 15 seconds in length and placed in the electronic health record, where it is easily accessible to all providers caring for the patient.</p><p><strong>Objective: </strong>TIMS is active at the Johns Hopkins Hospital and has shown value in assisting with provider empathy and communication. It is unique in using audio recordings to accomplish this purpose. As the program expands, a barrier to adoption is the limited time and resources needed to manually edit audio conversations. To address this, we propose an automated solution using a large language model to create meaningful and concise audio summaries.</p><p><strong>Methods: </strong>We analyzed 24 TIMS audio interviews and created three edited versions of each: (1) expert-edited, (2) artificial intelligence (AI)-edited using a fully automated large language model pipeline, and (3) novice-edited by two medical students trained by the expert. A second expert, blinded to the editor, rated the audio interviews in a randomized order. This expert scored both the audio quality and content quality of each interview on 5-point Likert scales.
We quantified transcript similarity to the expert-edited reference using lexical and semantic similarity metrics and identified omitted content relative to that same expert interview.</p><p><strong>Results: </strong>Audio quality (flow, pacing, clarity) and content quality (coherence, relevance, nuance) were each rated on 5-point Likert scales. Expert-edited interviews received the highest mean ratings for both audio quality (4.84) and content quality (4.83). Novice-edited interviews scored moderately (3.84 audio, 3.63 content), while AI-edited interviews scored slightly lower (3.49 audio, 3.20 content). Novice and AI edits were rated significantly lower than the expert edits (P<.001) but not significantly different from each other. AI- and novice-edited interview transcripts had comparable overlap with the expert reference transcript, while qualitative review found frequent omissions of patient identity, actionable insights, and overall context in both the AI- and novice-edited interviews. AI editing was fully automated and significantly reduced the editing time compared to both human editors.</p><p><strong>Conclusions: </strong>An AI-based editing pipeline can generate TIMS audio summaries with content and audio quality comparable to those of novice human editors given one hour of training. AI editing also significantly reduced editing time compared with both human editors.</p>
Large Language Model-Enabled Editing of Patient Audio Interviews From "This Is My Story" Conversations: Comparative Study. Bikram Bains, Sampath Rapuri, Edgar Robitaille, Jonathan Wang, Arnav Khera, Catalina Gomez, Eduardo Reyes, Cole Perry, Jason Wilson, Elizabeth Tracey. JMIR Medical Informatics. 2026;14:e80205. doi: 10.2196/80205. Published January 9, 2026.
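The transcript-comparison step in the study's Methods can be approximated as follows. The study does not name its metrics, so this sketch uses token-level F1 as the lexical measure and a bag-of-words cosine as a stand-in for an embedding-based semantic measure; the transcript snippets are invented.

```python
import math
from collections import Counter

def token_f1(reference, candidate):
    """Lexical overlap: F1 over token multisets (ROUGE-1-style)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def cosine_similarity(reference, candidate):
    """Bag-of-words cosine; in practice a sentence-embedding model
    would supply the vectors for the semantic comparison."""
    a = Counter(reference.lower().split())
    b = Counter(candidate.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical expert reference vs AI-edited transcript
expert = "the patient loves gardening and wants to walk her daughter down the aisle"
ai_edit = "the patient loves gardening and hopes to attend her daughter's wedding"
print(round(token_f1(expert, ai_edit), 2), round(cosine_similarity(expert, ai_edit), 2))
```

Both scores range from 0 to 1; content a candidate omits relative to the reference lowers recall, which is how omissions like those the study flags would surface quantitatively.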
Background: Considering sex and gender improves research quality, innovation, and social equity, while ignoring them leads to inaccuracies and inefficiency in study results. Despite increasing attention on sex- and gender-sensitive medicine, challenges remain with accurately representing gender due to its dynamic and context-specific nature.
Objective: This work aims to contribute to the implementation of a standard for collecting and assessing gender-specific data in German university hospitals and associated research facilities.
Methods: We carried out a review to identify and categorize state-of-the-art gender scores. We systematically assessed 22 publications regarding the applicability and practicability of their proposed gender scores. Specifically, we evaluated the use of these gender scores on German research data from routine clinical practice, using the Medical Informatics Initiative core dataset (MII CDS).
Results: Different methods for assessing gender have been proposed, but no standardized and validated gender score is available for health research. Most gender scores target epidemiological or public health research, where questions about social aspects and life habits are already part of the questionnaires. However, it is challenging to apply concepts for gender scoring to clinical data. The MII CDS, for example, currently includes none of the variables recorded in gender scores. Although some of the required variables are indeed present in routine clinical data, they need to become part of the MII CDS.
Conclusions: To enable gender-specific retrospective analysis of routine clinical data, we recommend updating and expanding the MII CDS by including more gender-relevant information. For this purpose, we provide concrete action steps on how gender-related variables can be captured in routine clinical practice and represented in a machine-readable way.
Applicability of Existing Gender Scores for German Clinical Research Data: Scoping Review and Data Mapping. Lea Schindler, Hilke Beelich, Elpiniki Katsari, Daniele Liprandi, Sylvia Stracke, Dagmar Waltemath. JMIR Medical Informatics. 2026;14:e74162. doi: 10.2196/74162. Published January 8, 2026.
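The data-mapping exercise described above (checking which variables a published gender score requires against what a core dataset actually records) can be sketched generically. The variable names below are hypothetical illustrations, not actual MII CDS items or items from any specific gender score.

```python
def map_score_to_dataset(score_variables, dataset_variables):
    """Report which gender-score variables a dataset covers and which
    would need to be added to enable retrospective gender analysis."""
    available = sorted(set(score_variables) & set(dataset_variables))
    missing = sorted(set(score_variables) - set(dataset_variables))
    return {
        "available": available,
        "missing": missing,
        "coverage": len(available) / len(score_variables),
    }

# Hypothetical gender-score items vs hypothetical core-dataset items
gender_score_vars = ["primary_earner", "caregiving_hours", "occupation",
                     "household_responsibility", "perceived_stress"]
core_dataset_vars = ["administrative_gender", "occupation", "diagnosis",
                     "lab_values"]

report = map_score_to_dataset(gender_score_vars, core_dataset_vars)
print(report)  # only 'occupation' overlaps in this toy mapping
```

Running such a mapping per score is one way to produce the concrete "which variables to add" recommendations the review argues for.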
Pengfei Li, Yinfei Xu, Xiang Liu, Zhean Shen, Yi Wang, Xinyi Lv, Ziyi Lu, Hui Wu, Jiaqi Zhuang, Yan Chen
<p><strong>Background: </strong>Large language models (LLMs) have emerged as promising tools for enhancing public access to medical information, particularly for chronic diseases such as atherosclerotic cardiovascular disease (ASCVD). However, their effectiveness in patient-centered health communication remains underexplored, especially in multilingual contexts.</p><p><strong>Objective: </strong>Our study aimed to conduct a comparative evaluation of 3 advanced LLMs (DeepSeek R1, ChatGPT-4o, and Gemini) in generating responses to ASCVD-related patient queries in both English and Chinese, assessing their performance across the domains of accuracy, completeness, and comprehensibility.</p><p><strong>Methods: </strong>We conducted a cross-sectional evaluation based on 25 clinically validated ASCVD questions spanning 5 domains (definitions, diagnosis, treatment, prevention, and lifestyle). Each question was submitted 5 times to each of the 3 LLMs in both English and Chinese, yielding 750 responses in total, all generated under default settings to approximate real-world conditions. Three board-certified cardiologists blinded to model identity independently scored the responses using standardized Likert scales with predefined anchors. The assessment followed a rigorous multistage process that incorporated randomization, washout periods, and final consensus scoring.</p><p><strong>Results: </strong>DeepSeek R1 achieved the highest "good response" rates (24/25, 96% in both English and Chinese), substantially outperforming ChatGPT-4o (21/25, 84%) and Gemini (12/25, 48% in English and 17/25, 68% in Chinese). DeepSeek R1 demonstrated superior median accuracy scores (6, IQR 6-6 in both languages) and completeness scores (3, IQR 2-3 in both languages) compared to the other models (P<.001).
All models had a median comprehensibility score of 3; however, in English, DeepSeek R1 and ChatGPT-4o were rated significantly clearer than Gemini (P=.006 and P=.03, respectively), whereas no significant between-model differences were observed in Chinese (P=.08). Interrater reliability was moderate (Kendall W: accuracy=0.578; completeness=0.565; comprehensibility=0.486). Performance was consistently stronger for definitional and diagnostic questions than for treatment and prevention topics across all models. Notably, none of the models consistently provided responses aligned with the latest clinical guidelines for the key guideline-facing question "What is the standard treatment regimen for ASCVD?"</p><p><strong>Conclusions: </strong>DeepSeek R1 exhibited promising and consistent performance in generating high-quality, patient-facing ASCVD information in both English and Chinese, highlighting the potential of open-source LLMs in promoting digital health literacy and equitable access to chronic disease information. However, a clinically critical weakness was observed for guideline-sensitive treatment content: the models did not reliably provide guideline-concordant standard treatment regimens.</p>
Large Language Models in Patient Health Communication for Atherosclerotic Cardiovascular Disease: Pilot Cross-Sectional Comparative Analysis. Pengfei Li, Yinfei Xu, Xiang Liu, Zhean Shen, Yi Wang, Xinyi Lv, Ziyi Lu, Hui Wu, Jiaqi Zhuang, Yan Chen. JMIR Medical Informatics. 2026;14:e81422. doi: 10.2196/81422. Published January 7, 2026.
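The interrater-agreement statistic this study reports, Kendall's coefficient of concordance W, has a simple closed form for m raters ranking n items: W = 12S / (m²(n³ − n)), where S is the sum of squared deviations of the items' rank sums from their mean. The sketch below ignores tied ranks (practical implementations add a tie correction), and the rater data are toy values, not the study's ratings.

```python
def kendalls_w(ratings):
    """Kendall's W for untied ranks.

    ratings: one list of ranks per rater, all over the same n items.
    Returns a value in [0, 1]: 0 = no agreement, 1 = perfect agreement.
    """
    m, n = len(ratings), len(ratings[0])
    rank_sums = [sum(r[i] for r in ratings) for i in range(n)]
    mean = sum(rank_sums) / n
    s = sum((x - mean) ** 2 for x in rank_sums)  # squared deviations
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Three hypothetical raters ranking five responses (1 = best)
raters = [[1, 2, 3, 4, 5],
          [2, 1, 3, 5, 4],
          [1, 3, 2, 4, 5]]
print(round(kendalls_w(raters), 3))  # 0.844: strong agreement
```

Values near 0.5, like those in the study (0.486-0.578), indicate moderate agreement among the three cardiologist raters.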
<p><strong>Background: </strong>The pathological and physiological state of patients with intracerebral hemorrhage (ICH) evolves dynamically after minimally invasive surgery (MIS), and traditional models cannot dynamically predict prognosis. Clinical data at multiple time points often differ in category and number and contain missing values, and existing models lack methods to deal with such imbalanced data.</p><p><strong>Objective: </strong>This study aims to develop and validate a dynamic prognostic model using multi-time point data from patients with ICH undergoing MIS to predict survival and functional outcomes.</p><p><strong>Methods: </strong>In this study, data from 287 patients who underwent MIS for ICH were retrospectively collected at six time points: the day of surgery; days 1, 3, 7, and 14 after surgery; and the day of drainage tube removal. General information, vital signs, laboratory test findings, neurological function scores, head hematoma volume, and MIS-related indicators were collected. In addition, this study proposes a multistep attention model, the MultiStep Transformer, which simultaneously outputs 3 prediction probabilities: 30-day survival, 180-day survival, and 180-day favorable functional outcome (modified Rankin Scale [mRS] 0-3). Five-fold cross-validation was used to evaluate the performance of the model and compare it with mainstream models and traditional scores. The main evaluation indexes included accuracy, precision, recall, and F<sub>1</sub>-score. The predictive performance of the model was evaluated using receiver operating characteristic (ROC) curves; its calibration was assessed via calibration curves; and its clinical utility was examined using decision curve analysis (DCA).
Attributable value analysis was conducted to assess the key predictive features.</p><p><strong>Results: </strong>The 30-day survival rate, 180-day survival rate, and 180-day favorable functional outcome rate among the 287 patients were 92.3%, 88.8%, and 52.3%, respectively. In predicting survival and functional outcomes, the MultiStep Transformer model showed remarkable superiority over traditional scoring systems and other deep learning models. For these three outcomes, the model achieved areas under the receiver operating characteristic curves (AUROCs) of 0.87 (95% CI 0.82-0.92), 0.85 (95% CI 0.77-0.93), and 0.75 (95% CI 0.72-0.78), with corresponding Brier scores of 0.1041, 0.1115, and 0.231. DCA confirmed that the model provided a definite clinical net benefit when threshold probabilities ranged within 0.06-0.26, 0.04-0.5, and 0.21-0.71.</p><p><strong>Conclusions: </strong>The MultiStep Transformer model proposed in this study can effectively use imbalanced data for model construction. It possesses good dynamic prediction ability for short-term and long-term survival and functional outcomes of patients with ICH undergoing MIS, providing a novel tool for individualized prognostic assessment of these patients.</p>
Deep Learning for Dynamic Prognostic Prediction in Minimally Invasive Surgery for Intracerebral Hemorrhage: Model Development and Validation Study. Jingxuan Wang, Jian Shi, Qing Ye, Danyang Chen, Yuhao Sun, Chao Pan, Yingxin Tang, Ping Zhang, Zhouping Tang. JMIR Medical Informatics. 2026;14:e86327. doi: 10.2196/86327. Published January 7, 2026.
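Two of the evaluation measures this study reports, the Brier score and the decision-curve net benefit, have simple closed forms: the Brier score is the mean squared error between 0/1 outcomes and predicted risks, and at threshold probability pt the net benefit is TP/n − FP/n × pt/(1 − pt). The outcome and risk vectors below are toy values, not study data.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between outcomes (0/1) and predicted risks;
    lower is better, and 0.25 is the score of a constant 0.5 guess."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def net_benefit(y_true, y_prob, threshold):
    """Decision-curve net benefit at one threshold probability:
    NB = TP/n - FP/n * (pt / (1 - pt)), treating risk >= pt as 'treat'."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * (threshold / (1 - threshold))

# Toy outcomes and predicted risks for six hypothetical patients
y = [1, 0, 1, 1, 0, 0]
prob = [0.9, 0.2, 0.7, 0.4, 0.3, 0.1]
print(round(brier_score(y, prob), 3))          # 0.1
for pt in (0.1, 0.25, 0.5):
    print(pt, round(net_benefit(y, prob, pt), 3))
```

Sweeping pt over a range and comparing against the treat-all and treat-none strategies yields the decision curves from which the study's beneficial threshold ranges (e.g., 0.06-0.26) are read off.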
Background: Minimally invasive posterior lumbar interbody fusion (MIS-PLIF) is commonly performed to treat degenerative lumbar spinal conditions. Patients of advanced age often present with multiple comorbidities and reduced physiological reserves, influencing surgical risks and recovery. The growing aging population has led to a rising demand for care for older adults, posing significant challenges for health care systems worldwide.
Objective: This study aimed to identify the associations between different age groups and MIS-PLIF outcomes.
Methods: This study retrospectively analyzed data from the United States Nationwide Inpatient Sample collected between 2016 and 2020. Patients aged ≥60 years who underwent MIS-PLIF were eligible for inclusion in this study. Patients were categorized into age groups (60-69, 70-79, and ≥80 y). Logistic and linear regressions were used to determine the associations between the study variables and outcomes, including in-hospital mortality, complications, nonroutine discharge, and length of stay.
Results: A total of 785 patients aged ≥60 (mean age 69.4, SD 0.2) years who underwent MIS-PLIF were included in the analysis, and 18.7% (147/785) experienced at least one complication. After adjustment, compared with patients aged 60 to 69 years, the risk of nonroutine discharge was significantly increased in patients aged 70 to 79 years (adjusted odds ratio 2.33, 95% CI 1.57-3.46; P<.001) and ≥80 years (adjusted odds ratio 4.79, 95% CI 2.64-8.67; P<.001). No significant differences in the risk of complications or length of hospital stay were observed across the age groups.
Conclusions: In older patients undergoing MIS-PLIF, advanced age is an independent predictor of nonroutine discharge. Furthermore, our findings suggest that age alone is not an independent risk factor for complications or extended hospital stays among older patients. These findings underscore that MIS-PLIF is a viable option for older patients, for whom extra attention may still be needed for postoperative care. Implementing age-stratified management for older patients undergoing MIS-PLIF may have important clinical policy implications.
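The study's adjusted odds ratios come from multivariable logistic regression; the crude 2×2 version shown here is only the unadjusted building block of that analysis. A minimal sketch with made-up counts (not the Nationwide Inpatient Sample data), using the standard log-odds-ratio standard error for the 95% CI:

```python
import math

def crude_odds_ratio(a, b, c, d):
    """Crude odds ratio for a 2x2 table:
                 outcome+  outcome-
    exposed         a         b
    unexposed       c         d
    Returns (OR, lower 95% CI bound, upper 95% CI bound)."""
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - 1.96 * se_log)
    hi = math.exp(math.log(or_) + 1.96 * se_log)
    return or_, lo, hi

# Hypothetical counts for nonroutine discharge by age group (illustrative only)
or_, lo, hi = crude_odds_ratio(40, 60, 20, 70)
print(f"OR={or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")  # OR=2.33 (95% CI 1.23-4.42)
```

An adjusted odds ratio additionally conditions on covariates (comorbidities, sex, etc.) by exponentiating the age-group coefficient of a fitted multivariable logistic model rather than using raw counts.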
"Impact of Age on Hospital Outcomes Following Minimally Invasive Posterior Lumbar Interbody Fusion: Retrospective Analysis of the Nationwide Inpatient Sample Database from 2016 to 2020." Yu-Jun Lin, Fu-Yuan Shih, Jin-Fu Huang, Chun-Wei Ting, Yu-Chin Tsai, Lin Chang, Yu-Hua Huang, Ming-Jung Chuang. JMIR Medical Informatics. 2026;14:e76424. doi:10.2196/76424
<p><strong>Background: </strong>Accurate classification and grading of lymphoma subtypes are essential for treatment planning. Traditional diagnostic methods face challenges of subjectivity and inefficiency, highlighting the need for automated solutions based on deep learning techniques.</p><p><strong>Objective: </strong>This study aimed to investigate the application of deep learning technology, specifically the U-Net model, in classifying and grading lymphoma subtypes to enhance diagnostic precision and efficiency.</p><p><strong>Methods: </strong>In this study, the U-Net model was used as the primary tool for image segmentation integrated with attention mechanisms and residual networks for feature extraction and classification. A total of 620 high-quality histopathological images representing 3 major lymphoma subtypes were collected from The Cancer Genome Atlas and the Cancer Imaging Archive. All images underwent standardized preprocessing, including Gaussian filtering for noise reduction, histogram equalization, and normalization. Data augmentation techniques such as rotation, flipping, and scaling were applied to improve the model's generalization capability. The dataset was divided into training (70%), validation (15%), and test (15%) subsets. Five-fold cross-validation was used to assess model robustness. Performance was benchmarked against mainstream convolutional neural network architectures, including fully convolutional network, SegNet, and DeepLabv3+.</p><p><strong>Results: </strong>The U-Net model achieved high segmentation accuracy, effectively delineating lesion regions and improving the quality of input for classification and grading. The incorporation of attention mechanisms further improved the model's ability to extract key features, whereas the residual structure of the residual network enhanced classification accuracy for complex images. 
In the test set (N=1250), the proposed fusion model achieved an accuracy of 92% (1150/1250), a sensitivity of 91.04% (1138/1250), a specificity of 89.04% (1113/1250), and an F1-score of 90% (1125/1250) for the classification of the 3 lymphoma subtypes, with an area under the receiver operating characteristic curve of 0.95 (95% CI 0.93-0.97). The high sensitivity and specificity of the model indicate strong clinical applicability, particularly as an assistive diagnostic tool.</p><p><strong>Conclusions: </strong>Deep learning techniques based on the U-Net architecture offer considerable advantages in the automated classification and grading of lymphoma subtypes. The proposed model significantly improved diagnostic accuracy and accelerated pathological evaluation, providing efficient and precise support for clinical decision-making. Future work may focus on enhancing model robustness through integration with advanced algorithms and validating performance across multicenter clinical datasets. The model also holds promise for deployment in digital pathology platforms and artificial intelligence-assisted diagnostic workflows, improving screening efficiency and supporting consistency in pathological classification.</p>
"Automated Classification of Lymphoma Subtypes From Histopathological Images Using a U-Net Deep Learning Model: Comparative Evaluation Study." Jin Zhao, Xiaolian Wen, Li Ma, Liping Su. JMIR Medical Informatics. 2026;14:e72679. doi:10.2196/72679
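The accuracy, sensitivity, specificity, and F1-score reported for the lymphoma classifier above all derive from confusion-matrix counts. A minimal sketch of those definitions for the binary case, using toy labels rather than the study's data:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and F1 from 0/1 label arrays."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    acc = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all cases
    sens = tp / (tp + fn)                   # recall on true positives
    spec = tn / (tn + fp)                   # recall on true negatives
    f1 = 2 * tp / (2 * tp + fp + fn)        # harmonic mean of precision/recall
    return acc, sens, spec, f1

# Toy labels (illustrative only)
acc, sens, spec, f1 = binary_metrics([1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
                                     [1, 1, 1, 0, 0, 0, 0, 0, 1, 1])
```

For the 3-class problem in the study, each metric would be computed per class (one-vs-rest) and then averaged; the binary form above is the underlying building block.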
Background: Large language models (LLMs), such as GPT-3.5 and GPT-4 (OpenAI), have been transforming virtual patient systems in medical education by providing scalable and cost-effective alternatives to standardized patients. However, systematic evaluations of their performance, particularly for multimorbidity scenarios involving multiple coexisting diseases, are still limited.
Objective: This systematic review aimed to evaluate LLM-based virtual patient systems for medical history-taking, addressing four research questions: (1) simulated patient types and disease scope, (2) performance-enhancing techniques, (3) experimental designs and evaluation metrics, and (4) dataset characteristics and availability.
Methods: Following PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) 2020, 9 databases were searched (January 1, 2020, to August 18, 2025). Nontransformer LLMs and non-history-taking tasks were excluded. Multidimensional quality and bias assessments were conducted.
Results: A total of 39 studies were included, screened by one computer science researcher under supervision. LLM-based virtual patient systems mainly simulated internal medicine and mental health disorders, with many addressing distinct single disease types but few covering multimorbidity or rare conditions. Techniques like role-based prompts, few-shot learning, multiagent frameworks, knowledge graph (KG) integration (top-k accuracy 16.02%), and fine-tuning enhanced dialogue and diagnostic accuracy. Multimodal inputs (eg, speech and imaging) improved immersion and realism. Evaluations, typically involving 10-50 students and 3-10 experts, demonstrated strong performance (top-k accuracy: 0.45-0.98, hallucination rate: 0.31%-5%, System Usability Scale [SUS] ≥80). However, small samples, inconsistent metrics, and limited controls restricted generalizability. Common datasets such as MIMIC-III (Medical Information Mart for Intensive Care-III) exhibited intensive care unit (ICU) bias and lacked diversity, affecting reproducibility and external validity.
Conclusions: Included studies showed moderate risk of bias, inconsistent metrics, small cohorts, and limited dataset transparency. LLM-based virtual patient systems excel in simulating multiple disease types but lack multimorbidity patient representation. KGs improve top-k accuracy and support structured disease representation and reasoning. Future research should prioritize hybrid KG-chain-of-thought architectures integrated with open-source KGs (eg, UMLS [Unified Medical Language System] and SNOMED-CT [Systematized Nomenclature of Medicine - Clinical Terms]), parameter-efficient fine-tuning, dialogue compression, multimodal LLMs, standardized metrics, larger cohorts, and open-access multimodal datasets to further enhance realism, diagnostic accuracy, fairness, and educational utility.
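Top-k accuracy, one of the headline metrics in the review above, measures how often the true diagnosis appears among a model's k highest-ranked candidates. A minimal sketch with toy candidate-diagnosis scores (not data from any reviewed study):

```python
import numpy as np

def top_k_accuracy(scores, labels, k=3):
    """Fraction of cases whose true label is among the k highest-scoring
    candidates. `scores` is (n_cases, n_classes); `labels` is (n_cases,)."""
    scores = np.asarray(scores)
    topk = np.argsort(scores, axis=1)[:, -k:]  # indices of the k best classes
    hits = [label in row for label, row in zip(labels, topk)]
    return float(np.mean(hits))

# Toy scores over 4 candidate diagnoses (illustrative only)
scores = [[0.1, 0.5, 0.3, 0.1],
          [0.6, 0.1, 0.2, 0.1],
          [0.2, 0.2, 0.5, 0.1]]
labels = [2, 0, 3]
acc = top_k_accuracy(scores, labels, k=2)
```

Here the true label falls in the top 2 candidates for the first two cases but not the third, giving a top-2 accuracy of 2/3.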
"Large Language Model-Based Virtual Patient Systems for History-Taking in Medical Education: Comprehensive Systematic Review." Dongliang Li, Syaheerah Lebai Lutfi. JMIR Medical Informatics. 2026;14:e79039. doi:10.2196/79039
Background: Accurately predicting left ventricular ejection fraction (LVEF) recovery after percutaneous coronary intervention (PCI) in patients with chronic coronary syndrome (CCS) is crucial for clinical decision-making.
Objective: This study aimed to develop and compare multiple machine learning (ML) models to predict LVEF recovery and identify key contributing features.
Methods: We retrospectively analyzed 520 patients with CCS from the Clinical Deep Data Accumulation System database. Patients were categorized into 4 binary classification tasks based on baseline LVEF (≥50% or <50%) and degree of recovery: (1) good recovery, defined as an LVEF increase of >10% compared with ≤0%; and (2) normal recovery, defined as an LVEF increase of 0% to 10% compared with ≤0%. For each task, 3 feature selection strategies (all features, least absolute shrinkage and selection operator [LASSO] regression, and recursive feature elimination [RFE]) were combined with 4 ML algorithms (extreme gradient boosting [XGBoost], categorical boosting, light gradient boosting machine, and random forest), resulting in 48 models. Models were evaluated using 10-fold cross-validation and assessed by the area under the curve (AUC), decision curve analysis, and calibration plots.
Results: The highest AUCs were achieved by RFE combined with XGBoost (AUC=0.93) for preserved LVEF with good recovery, LASSO combined with XGBoost (AUC=0.79) for preserved LVEF with normal recovery, LASSO combined with XGBoost (AUC=0.88) for reduced LVEF with good recovery, and RFE combined with XGBoost (AUC=0.84) for reduced LVEF with normal recovery. Shapley Additive Explanation analysis identified uric acid, platelets, hematocrit, brain natriuretic peptide, glycated hemoglobin, glucose, creatinine, baseline LVEF, left ventricular end-diastolic internal diameter, heart rate, R wave amplitude in V5, and R wave amplitude in V6 as important predictive factors of LVEF recovery.
Conclusions: ML models incorporating feature selection strategies demonstrated strong predictive performance for LVEF recovery after PCI. These interpretable models may support clinical decision-making and can improve the management of patients with CCS after PCI.
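The AUCs reported above can be computed without plotting an ROC curve, via the rank-sum (Mann-Whitney) identity: AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, counting ties as 1/2. A minimal numpy sketch with toy scores, not the study's data:

```python
import numpy as np

def auc_rank(y_true, y_score):
    """Area under the ROC curve via the Mann-Whitney identity:
    AUC = P(score of a random positive > score of a random negative),
    with ties counted as 1/2."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()   # positive outranks negative
    ties = (pos[:, None] == neg[None, :]).sum()  # equal scores count half
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Toy predicted probabilities of LVEF recovery (illustrative only)
auc = auc_rank([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.7, 0.3, 0.2])
```

With these toy scores, 8 of the 9 positive-negative pairs are correctly ordered, so the AUC is 8/9; library implementations such as scikit-learn's `roc_auc_score` compute the same quantity.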
"Predicting Left Ventricular Ejection Fraction Recovery After Percutaneous Coronary Intervention in Patients With Chronic Coronary Syndrome by Using Interpretable Machine Learning Models: Retrospective Study." Jiayi Ding, Guanqi Lyu, Masaharu Nakayama, Kotaro Nochioka, Jun Takahashi, Satoshi Yasuda, Tetsuya Matoba, Takahide Kohro, Naoyuki Akashi, Hideo Fujita, Yusuke Oba, Tomoyuki Kabutoya, Kazuomi Kario, Yasushi Imai, Arihiro Kiyosue, Yoshiko Mizuno, Takamasa Iwai, Yoshihiro Miyamoto, Masanobu Ishii, Kenichi Tsujita, Taishi Nakamura, Hisahiko Sato, Ryozo Nagai. JMIR Medical Informatics. 2025;13:e77839. doi:10.2196/77839
Xingwei Dai, Man Ke, Dixing Xie, Mengting Mei, Si Wei, Yi Dai, Ronghua Yan
<p><strong>Background: </strong>Mammography is a key imaging modality for breast cancer screening and diagnosis, with the Breast Imaging Reporting and Data System (BI-RADS) providing standardized risk stratification. However, BI-RADS category 4 lesions pose a diagnostic challenge due to their wide malignancy probability range and substantial overlap between benign and malignant findings. Moreover, current interpretations rely heavily on radiologists' expertise, leading to variability and potential diagnostic errors. Recent advances in large language models (LLMs), such as ChatGPT-4o, Claude 3 Opus, and DeepSeek-R1, offer new possibilities for automated medical report interpretation.</p><p><strong>Objective: </strong>This study aims to explore the feasibility of LLMs in evaluating the benign or malignant subcategories of BI-RADS category 4 lesions based on free-text mammography reports.</p><p><strong>Methods: </strong>This retrospective, single-center study included 307 patients (mean age 47.25, SD 11.39 years) with BI-RADS category 4 mammography reports between May 2021 and March 2024. Three LLMs (ChatGPT-4o, Claude 3 Opus, and DeepSeek-R1) classified BI-RADS 4 subcategories from the reports' text only, whereas radiologists based their classifications on image review. Pathology served as the reference standard, and the reproducibility of LLMs' predictions was assessed. The diagnostic performance of radiologists and LLMs was compared, and the internal reasoning behind LLMs' misclassifications was analyzed.</p><p><strong>Results: </strong>ChatGPT-4o demonstrated higher reproducibility than DeepSeek-R1 and Claude 3 Opus (Fleiss κ 0.850 vs 0.824 and 0.732, respectively). 
Although the overall accuracy of LLMs was lower than that of radiologists (senior: 74.5%; junior: 72.0%; DeepSeek-R1: 63.5%; ChatGPT-4o: 62.4%; Claude 3 Opus: 60.8%), their sensitivity was higher (senior: 80.7%; junior: 68.0%; DeepSeek-R1: 84.0%; ChatGPT-4o: 84.7%; Claude 3 Opus: 92.7%), while specificity remained lower (senior: 68.3%; junior: 76.1%; DeepSeek-R1: 43.0%; ChatGPT-4o: 40.1%; Claude 3 Opus: 28.9%). DeepSeek-R1 achieved the best prediction accuracy among LLMs with an area under the receiver operating characteristic curve of 0.64 (95% CI 0.57-0.70), followed by ChatGPT-4o (0.62, 95% CI 0.56-0.69) and Claude 3 Opus (0.61, 95% CI 0.54-0.67). By comparison, junior and senior radiologists achieved higher area under the receiver operating characteristic curves of 0.72 (95% CI 0.66-0.78) and 0.75 (95% CI 0.69-0.80), respectively. DeLong testing confirmed that all three LLMs performed significantly worse than both junior and senior radiologists (all P<.05), and no significant difference was observed between the two radiologist groups (P=.55). At the subcategory level, ChatGPT-4o yielded an overall F<sub>1</sub>-score of 47.6%, DeepSeek-R1 achieved 45.6%, and Claude 3 Opus achieved 36.2%.</p><p><strong>Conclusions: </strong>LLMs are feasible for distinguishing between benign and malignant BI-RADS category 4 lesions, with good stability and high sensitivity but relatively limited specificity. They show potential in screening and may help radiologists reduce missed diagnoses.</p>
"Performance of ChatGPT-4o, Claude 3 Opus, and DeepSeek-R1 in BI-RADS Category 4 Classification and Malignancy Prediction From Mammography Reports: Retrospective Diagnostic Study." JMIR Medical Informatics. 2025;13:e80182. doi:10.2196/80182
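Fleiss κ, used in the study above to quantify the reproducibility of repeated LLM classifications, generalizes Cohen's κ to more than two ratings per case: it compares observed per-case agreement against the agreement expected by chance from the marginal category frequencies. A minimal sketch on a toy counts matrix (not the study's data):

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for agreement across repeated ratings.
    `counts` is (n_subjects, n_categories): counts[i, j] = number of
    ratings placing subject i in category j; each row sums to n raters."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                    # ratings per subject
    p_j = counts.sum(axis=0) / counts.sum()      # overall category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))  # per-subject agreement
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()              # observed vs chance
    return (P_bar - P_e) / (1 - P_e)

# Toy example: 4 reports, each classified 3 times into 2 subcategories
counts = [[3, 0],
          [3, 0],
          [1, 2],
          [0, 3]]
kappa = fleiss_kappa(counts)
```

With this toy matrix the three runs agree perfectly on three of the four reports, giving κ = 23/35 ≈ 0.66; values near the study's 0.85 indicate substantially more consistent repeated runs.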