JMIR Medical Informatics: Latest Articles

Ethical Imperatives for Retrieval-Augmented Generation in Clinical Nursing: Viewpoint on Responsible AI Use.
IF 3.8; Zone 3 (Medicine); Q2 Medical Informatics; Pub Date: 2026-01-09; DOI: 10.2196/79922
Xinyi Tu, Chenghao Shi, Peilin Qian, Lizhu Wang

Retrieval-augmented generation (RAG) systems have emerged as a powerful technique to enhance the capabilities of large language models by enabling them to access external, up-to-date knowledge in real time, and RAG systems are being increasingly adopted by researchers in the medical field. In this viewpoint article, we explore the ethical imperatives for implementing RAG systems in clinical nursing environments, with particular attention to how these technologies affect patient care quality and safety. The purpose of this paper is to examine the ethical risks introduced by RAG-enhanced large language models in clinical nursing and to propose strategic guidelines for their responsible implementation. Key considerations include ensuring accuracy, fairness, transparency, and accountability, as well as maintaining essential human oversight, as discussed through a structured analysis. We argue that robust data governance, explainable artificial intelligence (AI) techniques, and continuous monitoring are critical components of a responsible RAG implementation strategy. Ultimately, realizing the benefits of RAG while mitigating ethical concerns requires sustained collaboration among health care professionals, AI developers, and policymakers, fostering a future where AI supports patient safety, reduces disparities, and improves the quality of nursing care.

JMIR Medical Informatics, vol. 14, e79922. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12788701/pdf/
Citations: 0
Large Language Model-Enabled Editing of Patient Audio Interviews From "This Is My Story" Conversations: Comparative Study.
IF 3.8; Zone 3 (Medicine); Q2 Medical Informatics; Pub Date: 2026-01-09; DOI: 10.2196/80205
Bikram Bains, Sampath Rapuri, Edgar Robitaille, Jonathan Wang, Arnav Khera, Catalina Gomez, Eduardo Reyes, Cole Perry, Jason Wilson, Elizabeth Tracey

Background: This Is My Story (TIMS) was started by Chaplain Elizabeth Tracey to promote a humanistic approach to medicine. Patients in the TIMS program are the subject of a guided conversation in which a chaplain interviews either the patient or their loved one. They are asked four questions to elicit clinically actionable information that has been shown to improve communication between patients and medical providers, strengthening medical providers' empathy. The original recorded conversation is edited into a condensed audio file approximately 1 minute and 15 seconds in length and placed in the electronic health record where it is easily accessible by all providers caring for the patient.

Objective: TIMS is active at the Johns Hopkins Hospital and has shown value in assisting with provider empathy and communication. It is unique in using audio recordings to accomplish this purpose. As the program expands, there exists a barrier to adoption due to limited time and resources needed to manually edit audio conversations. To address this, we propose an automated solution using a large language model to create meaningful and concise audio summaries.

Methods: We analyzed 24 TIMS audio interviews and created three edited versions of each: (1) expert-edited, (2) artificial intelligence (AI)-edited using a fully automated large language model pipeline, and (3) novice-edited by two medical students trained by the expert. A second expert, blinded to the editor, rated the audio interviews in a randomized order. This expert scored both the audio quality and content quality of each interview on 5-point Likert scales. We quantified transcript similarity to the expert-edited reference using lexical and semantic similarity metrics and identified omitted content relative to that same expert interview.

Results: Audio quality (flow, pacing, clarity) and content quality (coherence, relevance, nuance) were each rated on 5-point Likert scales. Expert-edited interviews received the highest mean ratings for both audio quality (4.84) and content quality (4.83). Novice-edited scored moderately (3.84 audio, 3.63 content), while AI-edited scored slightly lower (3.49 audio, 3.20 content). Novice and AI edits were rated significantly lower than the expert edits (P<.001), but not significantly different from each other. AI and novice-edited interview transcripts had comparable overlap with the expert reference transcript, while qualitative review found frequent omissions of patient identity, actionable insights, and overall context in both the AI and novice-edited interviews. AI editing was fully automated and significantly reduced the editing time compared to both human editors.

Conclusions: An AI-based editing pipeline can generate TIMS audio summaries with comparable content and audio quality to novice human editors with one hour of training. AI significantly reduced editing time and removed the need for human training; with further validation, it could offer a solution for scaling TIMS to a wider range of health care settings.
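The abstract above mentions lexical similarity to the expert-edited reference and a check for omitted content, but it does not name the metrics used. As a rough illustration only, the sketch below assumes Jaccard token overlap as the lexical measure; the function names and the two transcripts are hypothetical and are not taken from the study.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of distinct word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard_similarity(candidate: str, reference: str) -> float:
    """Shared distinct tokens divided by all distinct tokens (0.0 to 1.0)."""
    cand, ref = tokenize(candidate), tokenize(reference)
    if not cand and not ref:
        return 1.0  # two empty transcripts are treated as identical
    return len(cand & ref) / len(cand | ref)


def omitted_tokens(candidate: str, reference: str) -> set[str]:
    """Tokens that appear in the reference but are missing from the candidate."""
    return tokenize(reference) - tokenize(candidate)


# Hypothetical transcripts, invented for illustration only.
expert = "She loves gardening and wants to be called Maggie"
ai_edit = "She loves gardening"

similarity = jaccard_similarity(ai_edit, expert)
missing = omitted_tokens(ai_edit, expert)
```

A token-level measure like this cannot tell whether a missing word mattered clinically, which may be why the study pairs similarity metrics with a qualitative review of omissions.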
JMIR Medical Informatics, vol. 14, e80205. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12788710/pdf/
Citations: 0
Applicability of Existing Gender Scores for German Clinical Research Data: Scoping Review and Data Mapping.
IF 3.8; Zone 3 (Medicine); Q2 Medical Informatics; Pub Date: 2026-01-08; DOI: 10.2196/74162
Lea Schindler, Hilke Beelich, Elpiniki Katsari, Daniele Liprandi, Sylvia Stracke, Dagmar Waltemath

Background: Considering sex and gender improves research quality, innovation, and social equity, while ignoring them leads to inaccuracies and inefficiency in study results. Despite increasing attention on sex- and gender-sensitive medicine, challenges remain with accurately representing gender due to its dynamic and context-specific nature.

Objective: This work aims to contribute to the implementation of a standard for collecting and assessing gender-specific data in German university hospitals and associated research facilities.

Methods: We carried out a review to identify and categorize state-of-the-art gender scores. We systematically assessed 22 publications regarding the applicability and practicability of their proposed gender scores. Specifically, we evaluated the use of these gender scores on German research data from routine clinical practice, using the Medical Informatics Initiative core dataset (MII CDS).

Results: Different methods for assessing gender have been proposed, but no standardized and validated gender score is available for health research. Most gender scores target epidemiological or public health research where questions about social aspects and life habits are already part of the questionnaires. However, it is challenging to apply concepts for gender scoring on clinical data. The MII CDS, for example, lacks all variables currently being recorded in gender scores. Although some of the required variables are indeed present in routine clinical data, they need to become part of the MII CDS.

Conclusions: To enable gender-specific retrospective analysis of routine clinical data, we recommend updating and expanding the MII CDS by including more gender-relevant information. For this purpose, we provide concrete action steps on how gender-related variables can be captured in routine clinical practice and represented in a machine-readable way.

JMIR Medical Informatics, vol. 14, e74162. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12782135/pdf/
Citations: 0
Large Language Models in Patient Health Communication for Atherosclerotic Cardiovascular Disease: Pilot Cross-Sectional Comparative Analysis.
IF 3.8; Zone 3 (Medicine); Q2 Medical Informatics; Pub Date: 2026-01-07; DOI: 10.2196/81422
Pengfei Li, Yinfei Xu, Xiang Liu, Zhean Shen, Yi Wang, Xinyi Lv, Ziyi Lu, Hui Wu, Jiaqi Zhuang, Yan Chen

Background: Large language models (LLMs) have emerged as promising tools for enhancing public access to medical information, particularly for chronic diseases such as atherosclerotic cardiovascular disease (ASCVD). However, their effectiveness in patient-centered health communication remains underexplored, especially in multilingual contexts.

Objective: Our study aimed to conduct a comparative evaluation of 3 advanced LLMs (DeepSeek R1, ChatGPT-4o, and Gemini) in generating responses to ASCVD-related patient queries in both English and Chinese, assessing their performance across the domains of accuracy, completeness, and comprehensibility.

Methods: We conducted a cross-sectional evaluation based on 25 clinically validated ASCVD questions spanning 5 domains: definitions, diagnosis, treatment, prevention, and lifestyle. Each question was submitted 5 times to each of the 3 LLMs in both English and Chinese, yielding 750 responses in total, all generated under default settings to approximate real-world conditions. Three board-certified cardiologists blinded to model identity independently scored the responses using standardized Likert scales with predefined anchors. The assessment followed a rigorous multistage process that incorporated randomization, washout periods, and final consensus scoring.

Results: DeepSeek R1 achieved the highest "good response" rates (24/25, 96% in both English and Chinese), substantially outperforming ChatGPT-4o (21/25, 84%) and Gemini (12/25, 48% in English and 17/25, 68% in Chinese). DeepSeek R1 demonstrated superior median accuracy scores (6, IQR 6-6 in both languages) and completeness scores (3, IQR 2-3 in both languages) compared to the other models (P<.001). All models had a median comprehensibility score of 3; however, in English, DeepSeek R1 and ChatGPT-4o were rated significantly clearer than Gemini (P=.006 and P=.03, respectively), whereas no significant between-model differences were observed in Chinese (P=.08). Interrater reliability was moderate (Kendall W: accuracy=0.578; completeness=0.565; comprehensibility=0.486). Performance was consistently stronger for definitional and diagnostic questions than for treatment and prevention topics across all models. Specifically, none of the models consistently provided responses aligned with the latest clinical guidelines for the following key guideline-facing question "What is the standard treatment regimen for ASCVD?"

Conclusions: DeepSeek R1 exhibited promising and consistent performance in generating high-quality, patient-facing ASCVD information across both English and Chinese, highlighting the potential of open-source LLMs in promoting digital health literacy and equitable access to chronic disease information. However, a clinically critical weakness was observed in guideline-sensitive treatment: the models did not reliably provide guideline-concordant standard treatment regimens, suggesting that LLM use should be limited to lower-risk informational subqueries (eg, definitions, diagnosis, and lifestyle education) unless accompanied by expert oversight and safety controls.
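Two quantities in the abstract above can be reproduced with a few lines of arithmetic: the response count (25 questions, 5 repetitions, 3 models, 2 languages) and Kendall's W, the interrater agreement statistic it reports. The sketch below is a minimal illustration of the textbook formula with the usual tie correction; the study's own software and settings are not described, so the function names and any ratings passed in are hypothetical.

```python
from collections import Counter

# 25 questions x 5 repetitions x 3 models x 2 languages
total_responses = 25 * 5 * 3 * 2


def average_ranks(scores):
    """Rank scores ascending; tied values share the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kendalls_w(ratings):
    """Kendall's W for ratings[rater][item], with tie correction.

    Undefined (division by zero) if every rater gives all items the same score.
    """
    m, n = len(ratings), len(ratings[0])
    ranked = [average_ranks(row) for row in ratings]
    rank_sums = [sum(r[i] for r in ranked) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    # Tie term: sum of (t^3 - t) over each group of t tied scores, per rater.
    ties = sum(t ** 3 - t for row in ratings for t in Counter(row).values())
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * ties)
```

W ranges from 0 (no agreement in how raters order the items) to 1 (identical orderings). Likert scores produce many ties, which is why the tie-corrected denominator is used here.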
JMIR Medical Informatics, vol. 14, e81422. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12824577/pdf/
Citations: 0
Deep Learning for Dynamic Prognostic Prediction in Minimally Invasive Surgery for Intracerebral Hemorrhage: Model Development and Validation Study.
IF 3.8; Zone 3 (Medicine); Q2 Medical Informatics; Pub Date: 2026-01-07; DOI: 10.2196/86327
Jingxuan Wang, Jian Shi, Qing Ye, Danyang Chen, Yuhao Sun, Chao Pan, Yingxin Tang, Ping Zhang, Zhouping Tang

Background: The pathological and physiological state of patients with intracerebral hemorrhage (ICH) after minimally invasive surgery (MIS) is a dynamic evolution, and the traditional models cannot dynamically predict prognosis. Clinical data at multiple time points often show the characteristics of different categories, different numbers, and missing data. The existing models lack methods to deal with imbalanced data.

Objective: This study aims to develop and validate a dynamic prognostic model using multi-time point data from patients with ICH undergoing MIS to predict survival and functional outcomes.

Methods: In this study, 287 patients who underwent MIS for ICH were retrospectively collected on the day of surgery, days 1, 3, 7, and 14 after surgery, and the day of drainage tube removal. Their general information, vital signs, laboratory test findings, neurological function scores, head hematoma volume, and MIS-related indicators were collected. In addition, this study proposes a multistep attention model, namely the MultiStep Transformer. The model can simultaneously output 3 types of prediction probabilities for 30-day survival probability, 180-day survival probability, and 180-day favorable functional outcome (modified Rankin Scale [mRS] 0-3) probability. Five-fold cross-validation was used to evaluate the performance of the model and compare it with mainstream models and traditional scores. The main evaluation indexes included accuracy, precision, recall, and F1-score. The predictive performance of the model was evaluated using receiver operating characteristic (ROC) curves; its calibration was assessed via calibration curves; and its clinical utility was examined using decision curve analysis (DCA). Attributable value analysis was conducted to assess the key predictive features.

Results: The 30-day survival rate, 180-day survival rate, and 180-day favorable functional outcome rate among 287 patients were 92.3%, 88.8%, and 52.3%, respectively. In terms of predictive efficacy for survival and functional outcomes, the MultiStep Transformer model showed a remarkable superiority over traditional scoring systems and other deep learning models. For these three outcomes, the model achieved areas under the receiver operating characteristic curves (AUROCs) of 0.87 (95% CI 0.82-0.92), 0.85 (95% CI 0.77-0.93), and 0.75 (95% CI 0.72-0.78), with corresponding Brier scores of 0.1041, 0.1115, and 0.231. DCA confirmed that the model provided a definite clinical net benefit when threshold probabilities ranged within 0.06-0.26, 0.04-0.5, and 0.21-0.71.

Conclusions: The MultiStep Transformer model proposed in this study can effectively use imbalanced data to construct a model. It possesses good dynamic prediction ability for short-term and long-term survival and functional outcome of patients with ICH undergoing MIS, providing a novel tool for individualized prognostic assessment of patients with ICH undergoing MIS.
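The abstract above reports Brier scores and a decision curve analysis without spelling out either calculation. The sketch below shows the standard definitions in plain Python: the Brier score as the mean squared gap between predicted probability and observed outcome, and net benefit at a threshold probability as true positives minus threshold-weighted false positives, per patient. The function names and the four-patient example are hypothetical and are not the study's data or code.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on every patient whose predicted risk >= threshold.

    Each false positive is weighted by the odds of the threshold probability.
    """
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


# Hypothetical outcomes and predicted probabilities, for illustration only.
outcomes = [1, 0, 1, 0]
predicted = [0.9, 0.2, 0.6, 0.7]

score = brier_score(outcomes, predicted)
benefit_at_half = net_benefit(outcomes, predicted, threshold=0.5)
```

For reference, a constant prediction of 0.5 gives a Brier score of 0.25 on any binary outcome, so lower values indicate predictions that are closer to the observed outcomes. A decision curve plots net benefit across a range of thresholds and compares it with the treat-all and treat-none strategies.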
Citations: 0
Impact of Age on Hospital Outcomes Following Minimally Invasive Posterior Lumbar Interbody Fusion: Retrospective Analysis of the Nationwide Inpatient Sample Database from 2016 to 2020.
IF 3.8 Zone 3 Medicine Q2 MEDICAL INFORMATICS Pub Date: 2026-01-06 DOI: 10.2196/76424
Yu-Jun Lin, Fu-Yuan Shih, Jin-Fu Huang, Chun-Wei Ting, Yu-Chin Tsai, Lin Chang, Yu-Hua Huang, Ming-Jung Chuang

Background: Minimally invasive posterior lumbar interbody fusion (MIS-PLIF) is commonly performed to treat degenerative lumbar spinal conditions. Patients of advanced age often present with multiple comorbidities and reduced physiological reserves, influencing surgical risks and recovery. The growing aging population has led to a rising demand for care for older adults, posing significant challenges for health care systems worldwide.

Objective: This study aimed to identify the associations between different age groups and MIS-PLIF outcomes.

Methods: This study retrospectively analyzed data from the United States Nationwide Inpatient Sample collected between 2016 and 2020. Patients aged ≥60 years who underwent MIS-PLIF were eligible for inclusion in this study. Patients were categorized into age groups (60-69, 70-79, and ≥80 y). Logistic and linear regressions were used to determine the associations between the study variables and outcomes, including in-hospital mortality, complications, nonroutine discharge, and length of stay.

Results: A total of 785 patients aged ≥60 (mean age 69.4, SD 0.2) years who underwent MIS-PLIF were included in the analysis, and 18.7% (147/785) experienced at least one complication. After adjustment, compared with patients aged 60 to 69 years, the risk of nonroutine discharge was significantly increased in patients aged 70 to 79 years (adjusted odds ratio 2.33, 95% CI 1.57-3.46; P<.001) and ≥80 years (adjusted odds ratio 4.79, 95% CI 2.64-8.67; P<.001). No significant differences in the risk of complications or length of hospital stay were observed across the age groups.

Conclusions: In older patients undergoing MIS-PLIF, advanced age is an independent predictor of nonroutine discharge. Furthermore, our findings suggest that age alone is not an independent risk factor for complications or extended hospital stays among older patients. These findings underscore that MIS-PLIF is a viable option for older patients, for whom extra attention may still be needed for postoperative care. Implementing age-stratified management for older patients undergoing MIS-PLIF may have important clinical policy implications.
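The Results above report adjusted odds ratios with 95% CIs from logistic regression. As a rough sketch of how such figures relate to a regression coefficient, the code below assumes a Wald interval on the log-odds scale, which the abstract does not state (survey-weighted estimation may differ). The helper names are hypothetical; only the reported values 2.33 (1.57-3.46) come from the text.

```python
import math


def odds_ratio_with_ci(beta: float, se: float, z: float = 1.96):
    """Convert a logistic regression coefficient and its standard error
    into an odds ratio with a Wald confidence interval."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def recover_beta_se(odds_ratio: float, lower: float, upper: float, z: float = 1.96):
    """Back out the coefficient and standard error implied by a reported
    odds ratio and CI, assuming the interval is symmetric on the log scale."""
    beta = math.log(odds_ratio)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se


# Reported for ages 70-79 vs 60-69: adjusted OR 2.33 (95% CI 1.57-3.46).
beta, se = recover_beta_se(2.33, 1.57, 3.46)
or_value, ci_low, ci_high = odds_ratio_with_ci(beta, se)
```

Round-tripping the reported interval this way is a quick consistency check: if the recomputed bounds match the published ones to two decimals, the interval is symmetric on the log scale, as a Wald interval would be.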

Citations: 0
Automated Classification of Lymphoma Subtypes From Histopathological Images Using a U-Net Deep Learning Model: Comparative Evaluation Study.
IF 3.8 Zone 3 Medicine Q2 MEDICAL INFORMATICS Pub Date: 2026-01-06 DOI: 10.2196/72679
Jin Zhao, Xiaolian Wen, Li Ma, Liping Su
Background: Accurate classification and grading of lymphoma subtypes are essential for treatment planning. Traditional diagnostic methods face challenges of subjectivity and inefficiency, highlighting the need for automated solutions based on deep learning techniques.

Objective: This study aimed to investigate the application of deep learning technology, specifically the U-Net model, in classifying and grading lymphoma subtypes to enhance diagnostic precision and efficiency.

Methods: In this study, the U-Net model was used as the primary tool for image segmentation, integrated with attention mechanisms and residual networks for feature extraction and classification. A total of 620 high-quality histopathological images representing 3 major lymphoma subtypes were collected from The Cancer Genome Atlas and the Cancer Imaging Archive. All images underwent standardized preprocessing, including Gaussian filtering for noise reduction, histogram equalization, and normalization. Data augmentation techniques such as rotation, flipping, and scaling were applied to improve the model's generalization capability. The dataset was divided into training (70%), validation (15%), and test (15%) subsets. Five-fold cross-validation was used to assess model robustness. Performance was benchmarked against mainstream convolutional neural network architectures, including fully convolutional network, SegNet, and DeepLabv3+.

Results: The U-Net model achieved high segmentation accuracy, effectively delineating lesion regions and improving the quality of input for classification and grading. The incorporation of attention mechanisms further improved the model's ability to extract key features, whereas the residual structure of the residual network enhanced classification accuracy for complex images. In the test set (N=1250), the proposed fusion model achieved an accuracy of 92% (1150/1250), a sensitivity of 91.04% (1138/1250), a specificity of 89.04% (1113/1250), and an F1-score of 90% (1125/1250) for the classification of the 3 lymphoma subtypes, with an area under the receiver operating characteristic curve of 0.95 (95% CI 0.93-0.97). The high sensitivity and specificity of the model indicate strong clinical applicability, particularly as an assistive diagnostic tool.

Conclusions: Deep learning techniques based on the U-Net architecture offer considerable advantages in the automated classification and grading of lymphoma subtypes. The proposed model significantly improved diagnostic accuracy and accelerated pathological evaluation, providing efficient and precise support for clinical decision-making. Future work may focus on enhancing model robustness through integration with advanced algorithms and validating performance across multicenter clinical datasets. The model also holds promise for deployment in digital pathology platforms and artificial intelligence-assisted diagnostic workflows, improving screening efficiency and promoting consistency in pathological classification.
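The abstract above reports accuracy, sensitivity, specificity, and F1-score. For reference, here is a minimal sketch of the standard definitions of these metrics computed from confusion-matrix counts for one class versus the rest. The `ConfusionCounts` type and the example counts are hypothetical and are not derived from the study's data.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """One-vs-rest confusion-matrix counts for a single class."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def classification_metrics(c: ConfusionCounts) -> dict:
    """Compute accuracy, sensitivity, specificity, precision, and F1."""
    total = c.tp + c.fp + c.tn + c.fn
    sensitivity = c.tp / (c.tp + c.fn)   # recall over actual positives
    specificity = c.tn / (c.tn + c.fp)   # recall over actual negatives
    precision = c.tp / (c.tp + c.fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": (c.tp + c.tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(ConfusionCounts(tp=90, fp=20, tn=80, fn=10))
```

For a 3-class problem such as the one described, these values would typically be computed per class and then averaged (macro or weighted); which averaging the authors used is not stated.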
Citations: 0
Large Language Model-Based Virtual Patient Systems for History-Taking in Medical Education: Comprehensive Systematic Review.
IF 3.8 Zone 3 Medicine Q2 MEDICAL INFORMATICS Pub Date: 2026-01-02 DOI: 10.2196/79039
Dongliang Li, Syaheerah Lebai Lutfi

Background: Large language models (LLMs), such as GPT-3.5 and GPT-4 (OpenAI), have been transforming virtual patient systems in medical education by providing scalable and cost-effective alternatives to standardized patients. However, systematic evaluations of their performance, particularly for multimorbidity scenarios involving multiple coexisting diseases, are still limited.

Objective: This systematic review aimed to evaluate LLM-based virtual patient systems for medical history-taking, addressing four research questions: (1) simulated patient types and disease scope, (2) performance-enhancing techniques, (3) experimental designs and evaluation metrics, and (4) dataset characteristics and availability.

Methods: Following PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) 2020, 9 databases were searched (January 1, 2020, to August 18, 2025). Nontransformer LLMs and non-history-taking tasks were excluded. Multidimensional quality and bias assessments were conducted.

Results: A total of 39 studies were included, screened by one computer science researcher under supervision. LLM-based virtual patient systems mainly simulated internal medicine and mental health disorders, with many addressing distinct single disease types but few covering multimorbidity or rare conditions. Techniques like role-based prompts, few-shot learning, multiagent frameworks, knowledge graph (KG) integration (top-k accuracy 16.02%), and fine-tuning enhanced dialogue and diagnostic accuracy. Multimodal inputs (eg, speech and imaging) improved immersion and realism. Evaluations, typically involving 10-50 students and 3-10 experts, demonstrated strong performance (top-k accuracy: 0.45-0.98, hallucination rate: 0.31%-5%, System Usability Scale [SUS] ≥80). However, small samples, inconsistent metrics, and limited controls restricted generalizability. Common datasets such as MIMIC-III (Medical Information Mart for Intensive Care-III) exhibited intensive care unit (ICU) bias and lacked diversity, affecting reproducibility and external validity.

Conclusions: Included studies showed moderate risk of bias, inconsistent metrics, small cohorts, and limited dataset transparency. LLM-based virtual patient systems excel in simulating multiple disease types but lack multimorbidity patient representation. KGs improve top-k accuracy and support structured disease representation and reasoning. Future research should prioritize hybrid KG-chain-of-thought architectures integrated with open-source KGs (eg, UMLS [Unified Medical Language System] and SNOMED-CT [Systematized Nomenclature of Medicine - Clinical Terms]), parameter-efficient fine-tuning, dialogue compression, multimodal LLMs, standardized metrics, larger cohorts, and open-access multimodal datasets to further enhance realism, diagnostic accuracy, fairness, and educational utility.
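Top-k accuracy is the headline metric in the Results above. A minimal sketch of its usual definition follows: a case counts as correct when the true diagnosis appears among the first k ranked candidates. The function name and the example diagnoses are hypothetical; the reviewed studies may define the metric differently.

```python
from typing import Sequence


def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   true_labels: Sequence[str],
                   k: int) -> float:
    """Fraction of cases whose true label is among the top-k ranked predictions."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Hypothetical ranked differential diagnoses for 4 simulated patients.
ranked = [
    ["influenza", "common cold", "covid-19"],
    ["asthma", "copd", "pneumonia"],
    ["migraine", "tension headache", "cluster headache"],
    ["gerd", "peptic ulcer", "gastritis"],
]
truths = ["common cold", "pneumonia", "migraine", "appendicitis"]

top1 = top_k_accuracy(ranked, truths, k=1)
top3 = top_k_accuracy(ranked, truths, k=3)
```

Because the value depends strongly on k and on the size of the candidate list, figures from different studies are only comparable when both are reported.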

Citations: 0
Predicting Left Ventricular Ejection Fraction Recovery After Percutaneous Coronary Intervention in Patients With Chronic Coronary Syndrome by Using Interpretable Machine Learning Models: Retrospective Study.
IF 3.8 Zone 3 Medicine Q2 MEDICAL INFORMATICS Pub Date: 2025-12-29 DOI: 10.2196/77839
Jiayi Ding, Guanqi Lyu, Masaharu Nakayama, Kotaro Nochioka, Jun Takahashi, Satoshi Yasuda, Tetsuya Matoba, Takahide Kohro, Naoyuki Akashi, Hideo Fujita, Yusuke Oba, Tomoyuki Kabutoya, Kazuomi Kario, Yasushi Imai, Arihiro Kiyosue, Yoshiko Mizuno, Takamasa Iwai, Yoshihiro Miyamoto, Masanobu Ishii, Kenichi Tsujita, Taishi Nakamura, Hisahiko Sato, Ryozo Nagai

Background: Accurately predicting left ventricular ejection fraction (LVEF) recovery after percutaneous coronary intervention (PCI) in patients with chronic coronary syndrome (CCS) is crucial for clinical decision-making.

Objective: This study aimed to develop and compare multiple machine learning (ML) models to predict LVEF recovery and identify key contributing features.

Methods: We retrospectively analyzed 520 patients with CCS from the Clinical Deep Data Accumulation System database. Patients were categorized into 4 binary classification tasks based on baseline LVEF (≥50% or <50%) and degree of recovery: (1) good recovery, defined as an LVEF increase of >10% compared with ≤0%; and (2) normal recovery, defined as an LVEF increase of 0% to 10% compared with ≤0%. For each task, 3 feature selection strategies (all features, least absolute shrinkage and selection operator [LASSO] regression, and recursive feature elimination [RFE]) were combined with 4 ML algorithms (extreme gradient boosting [XGBoost], categorical boosting, light gradient boosting machine, and random forest), resulting in 48 models. Models were evaluated using 10-fold cross-validation and assessed by the area under the curve (AUC), decision curve analysis, and calibration plots.

Results: The highest AUCs were achieved by RFE combined with XGBoost (AUC=0.93) for preserved LVEF with good recovery, LASSO combined with XGBoost (AUC=0.79) for preserved LVEF with normal recovery, LASSO combined with XGBoost (AUC=0.88) for reduced LVEF with good recovery, and RFE combined with XGBoost (AUC=0.84) for reduced LVEF with normal recovery. Shapley Additive Explanation analysis identified uric acid, platelets, hematocrit, brain natriuretic peptide, glycated hemoglobin, glucose, creatinine, baseline LVEF, left ventricular end-diastolic internal diameter, heart rate, R wave amplitude in V5, and R wave amplitude in V6 as important predictive factors of LVEF recovery.

Conclusions: ML models incorporating feature selection strategies demonstrated strong predictive performance for LVEF recovery after PCI. These interpretable models may support clinical decision-making and can improve the management of patients with CCS after PCI.
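The Methods above arrive at 48 models by crossing 4 classification tasks with 3 feature selection strategies and 4 algorithms. The sketch below simply enumerates that grid and records the best combination per task as reported in the Results. The variable names and label strings are illustrative, and no model is trained here.

```python
from itertools import product

# 2 baseline groups x 2 recovery definitions = 4 binary classification tasks.
baseline_groups = ["preserved LVEF (>=50%)", "reduced LVEF (<50%)"]
recovery_targets = ["good recovery (>10% vs <=0%)", "normal recovery (0-10% vs <=0%)"]
tasks = list(product(baseline_groups, recovery_targets))

feature_selectors = ["all features", "LASSO", "RFE"]
algorithms = ["XGBoost", "categorical boosting",
              "light gradient boosting machine", "random forest"]

# Every task is paired with every selector and every algorithm.
model_grid = [
    (task, selector, algorithm)
    for task, selector, algorithm in product(tasks, feature_selectors, algorithms)
]

# Best combination per task and its AUC, as reported in the Results.
reported_best = {
    ("preserved LVEF (>=50%)", "good recovery (>10% vs <=0%)"): ("RFE", "XGBoost", 0.93),
    ("preserved LVEF (>=50%)", "normal recovery (0-10% vs <=0%)"): ("LASSO", "XGBoost", 0.79),
    ("reduced LVEF (<50%)", "good recovery (>10% vs <=0%)"): ("LASSO", "XGBoost", 0.88),
    ("reduced LVEF (<50%)", "normal recovery (0-10% vs <=0%)"): ("RFE", "XGBoost", 0.84),
}

# Task with the highest reported AUC.
best_task = max(reported_best, key=lambda task: reported_best[task][2])
```

Each of the 48 entries would then be scored with 10-fold cross-validation, as the abstract describes.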

Citations: 0
Performance of ChatGPT-4o, Claude 3 Opus, and DeepSeek-R1 in BI-RADS Category 4 Classification and Malignancy Prediction From Mammography Reports: Retrospective Diagnostic Study.
IF 3.8 Zone 3, Medicine Q2 MEDICAL INFORMATICS Pub Date: 2025-12-25 DOI: 10.2196/80182
Xingwei Dai, Man Ke, Dixing Xie, Mengting Mei, Si Wei, Yi Dai, Ronghua Yan
Background: Mammography is a key imaging modality for breast cancer screening and diagnosis, with the Breast Imaging Reporting and Data System (BI-RADS) providing standardized risk stratification. However, BI-RADS category 4 lesions pose a diagnostic challenge due to their wide malignancy probability range and substantial overlap between benign and malignant findings. Moreover, current interpretations rely heavily on radiologists' expertise, leading to variability and potential diagnostic errors. Recent advances in large language models (LLMs), such as ChatGPT-4o, Claude 3 Opus, and DeepSeek-R1, offer new possibilities for automated medical report interpretation.

Objective: This study aims to explore the feasibility of LLMs in evaluating the benign or malignant subcategories of BI-RADS category 4 lesions based on free-text mammography reports.

Methods: This retrospective, single-center study included 307 patients (mean age 47.25, SD 11.39 years) with BI-RADS category 4 mammography reports between May 2021 and March 2024. Three LLMs (ChatGPT-4o, Claude 3 Opus, and DeepSeek-R1) classified BI-RADS 4 subcategories from the reports' text only, whereas radiologists based their classifications on image review. Pathology served as the reference standard, and the reproducibility of LLMs' predictions was assessed. The diagnostic performance of radiologists and LLMs was compared, and the internal reasoning behind LLMs' misclassifications was analyzed.

Results: ChatGPT-4o demonstrated higher reproducibility than DeepSeek-R1 and Claude 3 Opus (Fleiss κ 0.850 vs 0.824 and 0.732, respectively). Although the overall accuracy of LLMs was lower than that of radiologists (senior: 74.5%; junior: 72.0%; DeepSeek-R1: 63.5%; ChatGPT-4o: 62.4%; Claude 3 Opus: 60.8%), their sensitivity was higher (senior: 80.7%; junior: 68.0%; DeepSeek-R1: 84.0%; ChatGPT-4o: 84.7%; Claude 3 Opus: 92.7%), while specificity remained lower (senior: 68.3%; junior: 76.1%; DeepSeek-R1: 43.0%; ChatGPT-4o: 40.1%; Claude 3 Opus: 28.9%). DeepSeek-R1 achieved the best prediction accuracy among LLMs with an area under the receiver operating characteristic curve of 0.64 (95% CI 0.57-0.70), followed by ChatGPT-4o (0.62, 95% CI 0.56-0.69) and Claude 3 Opus (0.61, 95% CI 0.54-0.67). By comparison, junior and senior radiologists achieved higher areas under the receiver operating characteristic curve of 0.72 (95% CI 0.66-0.78) and 0.75 (95% CI 0.69-0.80), respectively. DeLong testing confirmed that all three LLMs performed significantly worse than both junior and senior radiologists (all P<.05), and no significant difference was observed between the two radiologist groups (P=.55). At the subcategory level, ChatGPT-4o yielded an overall F1-score of 47.6%, DeepSeek-R1 achieved 45.6%, and Claude 3 Opus achieved 36.2%.

Conclusions: LLMs are feasible for distinguishing between benign and malignant lesions in BI-RADS category 4, showing good stability and high sensitivity but relatively insufficient specificity. They show potential in screening and may help radiologists reduce missed diagnoses.
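The abstract reports LLMs with higher sensitivity but lower specificity and lower overall accuracy than radiologists. A minimal Python sketch of how these three metrics relate to confusion counts may help show why that pattern is possible. The counts and the names `ConfusionCounts`, `over_caller`, and `balanced_reader` are hypothetical illustrations; they are not data or code from the study.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Binary confusion counts against a reference standard (e.g. pathology)."""
    tp: int  # reference malignant, predicted malignant
    fn: int  # reference malignant, predicted benign
    tn: int  # reference benign, predicted benign
    fp: int  # reference benign, predicted malignant


def sensitivity(c: ConfusionCounts) -> float:
    # Share of malignant cases that were correctly flagged.
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    # Share of benign cases that were correctly cleared.
    return c.tn / (c.tn + c.fp)


def accuracy(c: ConfusionCounts) -> float:
    # Share of all cases classified correctly.
    return (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)


# Hypothetical counts (NOT from the study): 100 malignant and 100 benign cases.
over_caller = ConfusionCounts(tp=90, fn=10, tn=30, fp=70)      # flags most cases as malignant
balanced_reader = ConfusionCounts(tp=80, fn=20, tn=70, fp=30)  # more even trade-off

for name, counts in [("over_caller", over_caller), ("balanced_reader", balanced_reader)]:
    print(
        f"{name}: sensitivity={sensitivity(counts):.2f}, "
        f"specificity={specificity(counts):.2f}, accuracy={accuracy(counts):.2f}"
    )
```

In this toy setup the reader that over-calls malignancy misses fewer cancers yet is less accurate overall, because its false positives on benign cases outweigh the extra true positives.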
Citations: 0