
JMIR Medical Informatics: Latest Articles

Prompting and Fine-Tuning Large Language Models for Parkinson Disease Diagnosis: Comparative Evaluation Study Using the PPMI Structured Dataset.
IF 3.8 | Zone 3 (Medicine) | Q2 Medical Informatics | Pub Date: 2026-01-15 | DOI: 10.2196/77561
Hyun-Ji Shin, Young Jin Jeong, Sungmin Jun, Do-Young Kang
<p><strong>Background: </strong>Parkinson disease (PD) presents diagnostic challenges due to its heterogeneous motor and nonmotor manifestations. Traditional machine learning (ML) approaches have been evaluated on structured clinical variables. However, the diagnostic utility of large language models (LLMs) using natural language representations of structured clinical data remains underexplored.</p><p><strong>Objective: </strong>This study aimed to evaluate the diagnostic classification performance of multiple LLMs using natural language prompts derived from structured clinical data and to compare their performance with traditional ML baselines.</p><p><strong>Methods: </strong>We reformatted structured clinical variables from the Parkinson's Progression Markers Initiative (PPMI) dataset into natural language prompts and used them as inputs for several LLMs. Variables with high multicollinearity were removed, and the top 10 features were selected using Shapley additive explanations (SHAP)-based feature ranking. LLM performance was examined across few-shot prompting, dual-output prompting that additionally generated post hoc explanatory text as an exploratory component, and supervised fine-tuning. Logistic regression (LR) and support vector machine (SVM) classifiers served as ML baselines. Model performance was evaluated using F<sub>1</sub>-scores on both the test set and a temporally independent validation set (temporal validation set) of limited size, and repeated output generation was carried out to assess stability.</p><p><strong>Results: </strong>On the test set of 122 participants, LR and SVM trained on the 10 SHAP-selected clinical variables each achieved a macro-averaged F<sub>1</sub>-score of 0.960 (accuracy 0.975). LLMs receiving natural language prompts derived from the same variables reached comparable performance, with the best few-shot configurations achieving macro-averaged F<sub>1</sub>-scores of 0.987 (accuracy 0.992). 
In the temporal validation set of 31 participants, LR maintained a macro-averaged F<sub>1</sub>-score of 0.903, whereas SVM showed substantial performance degradation. In contrast, multiple LLMs sustained high diagnostic performance, reaching macro-averaged F<sub>1</sub>-scores up to 0.968 and high recall for PD. Repeated output generation across LLM conditions produced generally stable predictions, with rare variability observed across runs. Under dual-output prompting, diagnostic performance decreased relative to few-shot prompting while remaining generally stable. Supervised fine-tuning of lightweight models improved stability and enabled GPT-4o-mini to achieve a macro-averaged F<sub>1</sub>-score of 0.987 on the test set, with uniformly correct predictions observed in the small temporal validation set, which should be interpreted cautiously given the limited sample size and exploratory nature of the evaluation.</p><p><strong>Conclusions: </strong>This study provides an exploratory benchmark of how modern LLMs handle structured clinical variables presented in natural language form. Although several models achieved diagnostic performance comparable to LR on both the test and temporal validation datasets, their outputs were sensitive to prompt format, model choice, and class distribution. Occasional variability across repeated output generations reflects the stochastic nature of LLMs, and lightweight models required supervised fine-tuning to generalize stably. These findings highlight both the capabilities and the limitations of current LLMs in handling tabular clinical information and underscore the need for cautious application and further research.</p>
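The Methods above describe reformatting structured clinical variables into natural language prompts for few-shot classification. A minimal sketch of that idea follows. The variable names (`age`, `motor_score`, `smell_test_score`), the `Control` label, the instruction wording, and both helper functions are hypothetical illustrations; the abstract does not give the authors' actual prompt template or feature names.

```python
from typing import Mapping, Sequence, Tuple

Record = Mapping[str, object]


def record_to_text(record: Record) -> str:
    """Render one structured record as a short 'name: value' description."""
    return "; ".join(f"{name}: {value}" for name, value in record.items())


def build_few_shot_prompt(
    examples: Sequence[Tuple[Record, str]], query: Record
) -> str:
    """Assemble labeled examples followed by one unlabeled query record."""
    # Hypothetical instruction text, not the wording used in the study.
    parts = ["Classify each participant as PD or Control."]
    for record, label in examples:
        parts.append(f"Participant: {record_to_text(record)}\nDiagnosis: {label}")
    # The query ends with an open 'Diagnosis:' slot for the model to complete.
    parts.append(f"Participant: {record_to_text(query)}\nDiagnosis:")
    return "\n\n".join(parts)


# Hypothetical variables and values, not taken from PPMI.
examples = [
    ({"age": 67, "motor_score": 24, "smell_test_score": 15}, "PD"),
    ({"age": 61, "motor_score": 2, "smell_test_score": 34}, "Control"),
]
query = {"age": 70, "motor_score": 19, "smell_test_score": 18}
prompt = build_few_shot_prompt(examples, query)
print(prompt)
```

Keeping the record rendering separate from the prompt assembly makes it easy to vary the prompt format, which the Conclusions identify as something model outputs are sensitive to.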
JMIR Medical Informatics, vol. 14, e77561. Journal article. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12856398/pdf/
Citations: 0
Developing a Suicide Risk Prediction Algorithm Using Electronic Health Record Data in Mental Health Care: Real-World Case Study.
IF 3.8 | Zone 3 (Medicine) | Q2 Medical Informatics | Pub Date: 2026-01-14 | DOI: 10.2196/74240
Linda Hummel, Karin C A G Lorenz-Artz, Joyce J P A Bierbooms, Inge M B Bongers
<p><strong>Background: </strong>Artificial intelligence (AI) offers potential solutions to address the challenges faced by a strained mental health care system, such as increasing demand for care, staff shortages, and pressured accessibility. While developing AI-based tools for clinical practice is technically feasible and has the potential to produce real-world impact, only a few are actually implemented into clinical practice. Implementation starts at the algorithm development phase, as this phase bridges theoretical innovation and practical application. The design and the way the AI tool is developed may either facilitate or hinder later implementation and use.</p><p><strong>Objective: </strong>This study aims to examine the development process of a suicide risk prediction algorithm using real-world electronic health record (EHR) data through a qualitative case study approach for clinical use in mental health care. It explores which challenges the development team encountered in creating the algorithm and how they addressed these challenges. This study identifies key considerations for the integration of technical and clinical perspectives in algorithms, facilitating the evolution of mental health organizations toward data-driven practice. The studied algorithm remains exploratory and has not yet been implemented in clinical practice.</p><p><strong>Methods: </strong>An exploratory, multimethod qualitative case study was conducted, using a hybrid approach with both inductive and deductive analysis. Data were collected through desk research, reflective team meetings, and iterative feedback sessions with the development team. Thematic analysis was used to identify development challenges and the team's responses. 
Based on these findings, key considerations for future algorithm development were derived.</p><p><strong>Results: </strong>Key challenges included defining, operationalizing, and measuring suicide incidents within EHRs due to issues such as missing data, underreporting, and differences between data sources. Predictive factors were identified by consulting clinical experts; however, psychosocial variables had to be constructed because they could not be extracted directly from EHR data. Risk of bias occurred when traditional suicide prevention questionnaires, unequally distributed across patients, were used as input. Analyzing unstructured data by natural language processing was challenging due to data noise, but ultimately enabled successful sentiment analysis, which provided dynamic, clinically relevant information for the algorithm. A complex model enhanced predictive accuracy but posed challenges regarding understandability, which was highly valued by clinicians.</p><p><strong>Conclusions: </strong>To advance mental health care as a data-driven field, several critical considerations must be addressed: ensuring robust data governance and quality, fostering cultural shifts in data documentation practices, establishing mechanisms for continuous monitoring of AI tool use, mitigating the risk of bias, balancing predictive performance with explainability, and maintaining a clinician-in-the-loop approach. Future research should prioritize the sociotechnical aspects of developing, implementing, and routinely using AI in mental health care practice.</p>
JMIR Medical Informatics, vol. 14, e74240. Journal article. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12803502/pdf/
Citations: 0
Nomograms Based on X-Ray Radiomics for Predicting Pain Progression in Knee Osteoarthritis Using Data From the Foundation for the National Institutes of Health: Development and Validation Study.
IF 3.8 | Zone 3 (Medicine) | Q2 Medical Informatics | Pub Date: 2026-01-14 | DOI: 10.2196/78338
Yingwei Sun, Jing Liu, Chunbo Deng, Chengbao Peng, Shinong Pan, Xueyong Liu

Background: Knee osteoarthritis (KOA) is one of the most prevalent chronic musculoskeletal disorders among the older adult population. Screening populations at risk of rapid progression of osteoarthritis and implementing appropriate early intervention strategies is advantageous for the treatment and prognosis of affected patients.

Objective: This study aimed to construct and validate a nomogram model based on x-ray radiomics to effectively identify individuals experiencing progression of KOA pain.

Methods: The Foundation for the National Institutes of Health Biomarkers Consortium included a total of 600 participants who were classified as pain progressors (n=297, 49.5%) and non-pain progressors (n=303, 50.5%) according to an increase in the Western Ontario and McMaster Universities Osteoarthritis Index pain score of ≥9 points (on a scale from 0 to 100) during the follow-up period of 24 to 48 months. X-rays that lacked defined spacing in the DICOM image were excluded. Subchondral bone regions on the inner and outer edges of the tibia and femur were selected fully automatically as regions of interest, and radiomics features were extracted for different combinations of these regions. Least absolute shrinkage and selection operator regression was used to select features and generate a radiomics score, with Shapley additive explanations used for interpretability. The radiomics score, along with clinical indicators, was incorporated into nomograms using a multivariable logistic regression model. The subgroup analysis focused solely on the progression of pain and cases with no progression at all. The receiver operating characteristic curve, along with calibration and decision curves, was used to assess the discriminative performance.

Results: A total of 450 participants were included in the study. Shapley additive explanations analysis identified Wavelet-HH_gldm_HighGrayLevelEmphasis as the primary radiomics feature. Nomogram 1 and nomogram 2 for predicting KOA pain progression achieved area under the curve values of 0.766 and 0.753, respectively, with mean absolute errors of 0.012 and 0.008, respectively, in the calibration curves. Decision curve analysis showed a positive net benefit across a range of threshold probabilities. In subgroup analyses, nomogram 3 and nomogram 4 yielded areas under the curve of 0.795 and 0.740, respectively.

Conclusions: The nomograms based on x-ray radiomics demonstrated excellent predictive capability and accuracy in forecasting the progression of KOA pain.
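The Results mention decision curve analysis and a positive net benefit across threshold probabilities. As a minimal sketch of the standard net benefit formula, NB = TP/n - (FP/n) × pt/(1 - pt), the following computes net benefit at one threshold. The function name and the toy labels and probabilities are hypothetical and are not data from the study.

```python
from typing import Sequence


def net_benefit(
    y_true: Sequence[int], y_prob: Sequence[float], threshold: float
) -> float:
    """Net benefit of treating everyone whose predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    # True positives: flagged as high risk and actually progressed.
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    # False positives: flagged as high risk but did not progress.
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


# Hypothetical toy data: 1 = pain progressor, 0 = non-progressor.
y_true = [1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.7, 0.6, 0.2, 0.4, 0.1]

nb_model = net_benefit(y_true, y_prob, 0.5)
# "Treat all" baseline: every participant is flagged regardless of risk.
nb_treat_all = net_benefit(y_true, [1.0] * len(y_true), 0.5)
print(round(nb_model, 4), round(nb_treat_all, 4))
```

A decision curve repeats this calculation over a range of thresholds and compares the model with the "treat all" and "treat none" (net benefit 0) strategies.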

JMIR Medical Informatics, vol. 14, e78338. Journal article. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12853086/pdf/
Citations: 0
Large Language Models for Psychiatric Diagnosis Based on Multicenter Real-World Clinical Records: Comparative Study.
IF 3.8 | Zone 3 (Medicine) | Q2 Medical Informatics | Pub Date: 2026-01-13 | DOI: 10.2196/77699
Maoqian Sun, Jia Yu, Zhuhong Long, Yun Yang, Tao Xiao, Jiaquan Liang, Jun Feng, Huaili Deng, Guoping Huang

Background: Psychiatric disorders are diagnostically challenging and often rely on subjective clinical judgment, particularly in resource-limited settings. Large language models (LLMs) have demonstrated potential in supporting psychiatric diagnosis; however, robust evidence from large-scale, real-world clinical data remains limited.

Objective: This study aimed to evaluate and compare the diagnostic performance of multiple LLMs for psychiatric disorders using multicenter real-world electronic health records (EHRs).

Methods: We retrospectively analyzed 9923 inpatient EHRs collected from 6 psychiatric centers across China, encompassing all ICD-10 (International Statistical Classification of Diseases, Tenth Revision) psychiatric categories. In total, 3 LLMs, namely GPT-4.0 (OpenAI), GPT-3.5 (OpenAI), and GLM-4-Plus (Zhipu AI), were evaluated against physician-confirmed discharge diagnoses. Diagnostic performance was assessed using strict accuracy criteria and lenient classification metrics, with subgroup analyses conducted across diagnostic categories and age groups.

Results: GPT-4.0 achieved the highest overall strict diagnostic accuracy (71.7%) and the highest weighted F1-score under lenient evaluation (0.881), particularly for high-prevalence disorders, such as mood disorders and schizophrenia spectrum disorders. Diagnostic performance varied across age groups, with the highest accuracy observed in older adult patients (up to 79.5%) and lower accuracy in adolescents. Across centers, model performance remained stable, with no significant intercenter differences.

Conclusions: LLMs, especially GPT-4.0, demonstrate promising capability in supporting psychiatric diagnosis using real-world EHRs. However, diagnostic performance varies by age group and disorder category. LLMs should be regarded as assistive tools rather than replacements for clinical judgment, and further validation is needed before routine clinical implementation.
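The Results report a weighted F1-score, which averages per-class F1 values in proportion to each class's share of the true labels, so high-prevalence categories dominate the figure. The sketch below contrasts it with the unweighted (macro) average. The function names, the two category labels, and the toy predictions are hypothetical; the abstract does not specify how the study's strict and lenient criteria were implemented.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """F1 for each label, computed as 2TP / (2TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1: every class counts equally."""
    f1 = per_class_f1(y_true, y_pred)
    return sum(f1.values()) / len(f1)


def weighted_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Mean of per-class F1 weighted by the number of true cases per class."""
    support = Counter(y_true)
    f1 = per_class_f1(y_true, y_pred)
    return sum(f1[label] * count for label, count in support.items()) / len(y_true)


# Hypothetical toy labels: a common category and a rarer one.
y_true = ["mood", "mood", "mood", "mood", "scz", "scz"]
y_pred = ["mood", "mood", "mood", "scz", "scz", "mood"]

print(per_class_f1(y_true, y_pred))
print(macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

In this toy case the common class scores higher than the rare one, so the weighted average exceeds the macro average. The same effect can make weighted metrics look favorable when performance is strongest on high-prevalence disorders.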

JMIR Medical Informatics, e77699. Journal article. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12848494/pdf/
Citations: 0
Neutrophil Percentage-to-Albumin Ratio as a Novel Prognostic Biomarker in Adult Diffuse Gliomas: Retrospective Study Integrating 3 Machine Learning Models and Cox Regression.
IF 3.8 | Zone 3 (Medicine) | Q2 Medical Informatics | Pub Date: 2026-01-13 | DOI: 10.2196/79945
Congcong Zhu, Jiyang An, Lili Zhou

Background: Adult-type diffuse glioma (ADG) is the most common primary malignant tumor of the central nervous system. Its highly invasive nature, marked heterogeneity, and resistance to therapy contribute to a high risk of recurrence and poor prognosis. At present, the lack of reliable prognostic tools poses a significant barrier to the development of individualized treatment strategies.

Objective: This study aimed to develop an effective prognostic model for ADG by integrating multiple machine learning algorithms, in order to enhance the precision of individualized clinical decision-making.

Methods: In this retrospective study, 160 newly diagnosed patients with ADG who underwent surgical resection and histopathological confirmation at our institution between June 2019 and September 2021 were included. A total of 32 variables, including clinical characteristics, molecular biomarkers, and preoperative hematological indicators, were collected. Overall survival (OS) and progression-free survival (PFS) were defined as the study endpoints. Feature selection was performed using least absolute shrinkage and selection operator regression, extreme gradient boosting, and random forest algorithms. Kaplan-Meier survival curves and log-rank tests were used for survival analysis. Multivariate Cox proportional hazards models were constructed to identify independent prognostic factors, and nomograms were developed accordingly. The model's discriminative ability, calibration, and clinical utility were evaluated using the concordance index, area under the receiver operating characteristic curve (area under the curve), calibration plots, and Kaplan-Meier analysis.

Results: Age, neutrophil percentage-to-albumin ratio (NPAR), and platelet-to-mean platelet volume ratio were identified as independent prognostic factors for OS, while age and NPAR were independent predictors for PFS (all P<.001). The prognostic models based on these variables demonstrated good predictive performance, with concordance index values of 0.731 and 0.763 for the training and validation cohorts in the OS model, respectively. The PFS model also showed robust performance. Area under the curve values and calibration curves further supported the models' accuracy and stability. Risk stratification analysis revealed clear survival differences between risk groups (all P<.05), indicating strong clinical applicability.

Conclusions: This study is the first to identify preoperative NPAR as a significant prognostic biomarker for ADG using machine learning approaches. The prognostic model incorporating NPAR, platelet-to-mean platelet volume ratio, and age demonstrated favorable predictive performance, offering a novel perspective for accurate risk stratification and personalized treatment in patients with ADG.
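
The concordance index reported for the OS model can be illustrated with a minimal sketch of Harrell's C for right-censored survival data. This is not the authors' code: the function name, the toy follow-up times, event flags, and risk scores are all hypothetical, and the abstract does not state which software was used.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable patient pairs that the risk score orders correctly.

    times        -- follow-up time per patient
    events       -- 1 if the event (e.g., death) was observed, 0 if censored
    risk_scores  -- model output; a higher score should mean earlier failure
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort (illustrative values only).
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.4, 0.6, 0.1]
toy_c = concordance_index(toy_times, toy_events, toy_risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the reported values of 0.731 and 0.763 should be read.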

Citations: 0
Exploring Factors Associated With the Stalled Implementation of a Ground-Up Electronic Health Record System in South Africa: Qualitative Insights From the E-Tick Case Study Using the Consolidated Framework for Implementation Research (CFIR).
IF 3.8 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS Pub Date: 2026-01-12 DOI: 10.2196/73831
Campion Zharima, Frances Griffiths, Jane Goudge

Background: Electronic health records (EHRs) have the potential to improve service delivery through record keeping and monitoring health outcomes. As countries move toward universal health coverage, digital health tools such as EHRs are essential for achieving this goal. However, EHR implementation in middle-income countries like South Africa faces obstacles.

Objective: This study explores the reasons behind a stalled implementation of the electronic tick register (E-tick) system (an electronic version of a paper primary health care register to record services provided), using the Consolidated Framework for Implementation Research.

Methods: Using a qualitative design, in-depth interviews were conducted with 38 participants to explore their perceptions and experiences and the factors surrounding the success and stalling of the E-tick system. Participants included managers, stakeholders, implementers, and end users from the 3 implementation clinics. Data were collected using semistructured interview guides. Thematic analysis and framework analysis based on the Consolidated Framework for Implementation Research domains (innovation, inner setting, individual characteristics, implementation process, and outer setting) were applied.

Results: The E-tick system was designed to improve data quality in paper health registers, addressing inaccuracies in reporting to district and provincial health departments (innovation domain). Implementers developed the system iteratively, using input from managers and clinicians and engaging stakeholders, including software developers, funders, health managers, and decision-makers from the provincial health department (individual characteristics). Although the system was initially well adopted by end users, it stalled primarily due to outer setting factors, which included a change of developers, funding cuts, and limited support at the provincial health department level due to capacity gaps, political appointments, and mistrust stemming from corruption and abuse of the tender system. Moreover, resistance to leveraging lessons from locally developed small-scale systems further constrained institutional support for the E-tick.

Conclusions: Although successful implementation of EHRs can be facilitated by strong user engagement and co-design, outer setting factors such as governance, funding, and policy alignment can pose significant threats to sustainability. This underscores the importance of effective synergy between top-down and bottom-up processes for successful implementation.

Citations: 0
Ethical Imperatives for Retrieval-Augmented Generation in Clinical Nursing: Viewpoint on Responsible AI Use.
IF 3.8 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS Pub Date: 2026-01-09 DOI: 10.2196/79922
Xinyi Tu, Chenghao Shi, Peilin Qian, Lizhu Wang

Retrieval-augmented generation (RAG) systems have emerged as a powerful technique for enhancing large language models by giving them real-time access to external, up-to-date knowledge, and they are being increasingly adopted by researchers in the medical field. In this viewpoint article, we explore the ethical imperatives for implementing RAG systems in clinical nursing environments, with particular attention to how these technologies affect patient care quality and safety. The purpose of this paper is to examine the ethical risks introduced by RAG-enhanced large language models in clinical nursing and to propose strategic guidelines for their responsible implementation. Through a structured analysis, we discuss key considerations, including ensuring accuracy, fairness, transparency, and accountability and maintaining essential human oversight. We argue that robust data governance, explainable artificial intelligence (AI) techniques, and continuous monitoring are critical components of a responsible RAG implementation strategy. Ultimately, realizing the benefits of RAG while mitigating ethical concerns requires sustained collaboration among health care professionals, AI developers, and policymakers, fostering a future where AI supports patient safety, reduces disparities, and improves the quality of nursing care.

Citations: 0
Large Language Model-Enabled Editing of Patient Audio Interviews From "This Is My Story" Conversations: Comparative Study.
IF 3.8 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS Pub Date: 2026-01-09 DOI: 10.2196/80205
Bikram Bains, Sampath Rapuri, Edgar Robitaille, Jonathan Wang, Arnav Khera, Catalina Gomez, Eduardo Reyes, Cole Perry, Jason Wilson, Elizabeth Tracey

Background: This Is My Story (TIMS) was started by Chaplain Elizabeth Tracey to promote a humanistic approach to medicine. Patients in the TIMS program are the subject of a guided conversation in which a chaplain interviews either the patient or their loved one. They are asked four questions to elicit clinically actionable information that has been shown to improve communication between patients and medical providers, strengthening medical providers' empathy. The original recorded conversation is edited into a condensed audio file approximately 1 minute and 15 seconds in length and placed in the electronic health record where it is easily accessible by all providers caring for the patient.

Objective: TIMS is active at the Johns Hopkins Hospital and has shown value in assisting with provider empathy and communication. It is unique in using audio recordings to accomplish this purpose. As the program expands, the time and resources needed to manually edit audio conversations are a barrier to adoption. To address this, we propose an automated solution using a large language model to create meaningful and concise audio summaries.

Methods: We analyzed 24 TIMS audio interviews and created three edited versions of each: (1) expert-edited, (2) artificial intelligence (AI)-edited using a fully automated large language model pipeline, and (3) novice-edited by two medical students trained by the expert. A second expert, blinded to the editor, rated the audio interviews in a randomized order. This expert scored both the audio quality and content quality of each interview on 5-point Likert scales. We quantified transcript similarity to the expert-edited reference using lexical and semantic similarity metrics and identified omitted content relative to that same expert interview.

Results: Audio quality (flow, pacing, clarity) and content quality (coherence, relevance, nuance) were each rated on 5-point Likert scales. Expert-edited interviews received the highest mean ratings for both audio quality (4.84) and content quality (4.83). Novice-edited interviews scored moderately (3.84 audio, 3.63 content), while AI-edited interviews scored slightly lower (3.49 audio, 3.20 content). Novice and AI edits were rated significantly lower than the expert edits (P<.001), but not significantly different from each other. AI and novice-edited interview transcripts had comparable overlap with the expert reference transcript, while qualitative review found frequent omissions of patient identity, actionable insights, and overall context in both the AI and novice-edited interviews. AI editing was fully automated and significantly reduced the editing time compared to both human editors.

Conclusions: An AI-based editing pipeline can generate TIMS audio summaries with comparable content and audio quality to novice human editors with one hour of training. AI significantly reduces editing time and removes the need for manual training; with further validation, it could offer a solution for scaling TIMS to a wider range of health care settings.

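
The abstract mentions lexical similarity between each edited transcript and the expert-edited reference but does not name the metric. As a rough illustration only, the sketch below computes a unigram-overlap F1 score, which is one common lexical measure; the choice of metric, the tokenizer, the function names, and the example sentences are all assumptions rather than the study's method.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def unigram_f1(candidate, reference):
    """Harmonic mean of unigram precision and recall against a reference transcript."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))
    # Multiset intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical transcripts (not taken from the study).
expert_text = "the patient loves gardening and jazz"
ai_text = "the patient loves gardening"
score = unigram_f1(ai_text, expert_text)
```

A measure like this rewards shared wording but cannot tell whether the omitted words carried the clinically important content, which is consistent with the abstract pairing similarity metrics with a qualitative review of omissions.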
Citations: 0
Applicability of Existing Gender Scores for German Clinical Research Data: Scoping Review and Data Mapping.
IF 3.8 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS Pub Date: 2026-01-08 DOI: 10.2196/74162
Lea Schindler, Hilke Beelich, Elpiniki Katsari, Daniele Liprandi, Sylvia Stracke, Dagmar Waltemath

Background: Considering sex and gender improves research quality, innovation, and social equity, while ignoring them leads to inaccuracies and inefficiency in study results. Despite increasing attention on sex- and gender-sensitive medicine, challenges remain with accurately representing gender due to its dynamic and context-specific nature.

Objective: This work aims to contribute to the implementation of a standard for collecting and assessing gender-specific data in German university hospitals and associated research facilities.

Methods: We carried out a review to identify and categorize state-of-the-art gender scores. We systematically assessed 22 publications regarding the applicability and practicability of their proposed gender scores. Specifically, we evaluated the use of these gender scores on German research data from routine clinical practice, using the Medical Informatics Initiative core dataset (MII CDS).

Results: Different methods for assessing gender have been proposed, but no standardized and validated gender score is available for health research. Most gender scores target epidemiological or public health research where questions about social aspects and life habits are already part of the questionnaires. However, it is challenging to apply concepts for gender scoring on clinical data. The MII CDS, for example, lacks all variables currently being recorded in gender scores. Although some of the required variables are indeed present in routine clinical data, they need to become part of the MII CDS.

Conclusions: To enable gender-specific retrospective analysis of routine clinical data, we recommend updating and expanding the MII CDS by including more gender-relevant information. For this purpose, we provide concrete action steps on how gender-related variables can be captured in routine clinical practice and represented in a machine-readable way.

Citations: 0
Large Language Models in Patient Health Communication for Atherosclerotic Cardiovascular Disease: Pilot Cross-Sectional Comparative Analysis.
IF 3.8 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS Pub Date: 2026-01-07 DOI: 10.2196/81422
Pengfei Li, Yinfei Xu, Xiang Liu, Zhean Shen, Yi Wang, Xinyi Lv, Ziyi Lu, Hui Wu, Jiaqi Zhuang, Yan Chen
<p><strong>Background: </strong>Large language models (LLMs) have emerged as promising tools for enhancing public access to medical information, particularly for chronic diseases such as atherosclerotic cardiovascular disease (ASCVD). However, their effectiveness in patient-centered health communication remains underexplored, especially in multilingual contexts.</p><p><strong>Objective: </strong>Our study aimed to compare 3 advanced LLMs (DeepSeek R1, ChatGPT-4o, and Gemini) in generating responses to ASCVD-related patient queries in both English and Chinese, assessing their performance across the domains of accuracy, completeness, and comprehensibility.</p><p><strong>Methods: </strong>We conducted a cross-sectional evaluation based on 25 clinically validated ASCVD questions spanning 5 domains: definitions, diagnosis, treatment, prevention, and lifestyle. Each question was submitted 5 times to each of the 3 LLMs in both English and Chinese, yielding 750 responses in total, all generated under default settings to approximate real-world conditions. Three board-certified cardiologists blinded to model identity independently scored the responses using standardized Likert scales with predefined anchors. The assessment followed a rigorous multistage process that incorporated randomization, washout periods, and final consensus scoring.</p><p><strong>Results: </strong>DeepSeek R1 achieved the highest "good response" rates (24/25, 96% in both English and Chinese), substantially outperforming ChatGPT-4o (21/25, 84%) and Gemini (12/25, 48% in English and 17/25, 68% in Chinese). DeepSeek R1 demonstrated superior median accuracy scores (6, IQR 6-6 in both languages) and completeness scores (3, IQR 2-3 in both languages) compared to the other models (P&lt;.001).
All models had a median comprehensibility score of 3; however, in English, DeepSeek R1 and ChatGPT-4o were rated significantly clearer than Gemini (P=.006 and P=.03, respectively), whereas no significant between-model differences were observed in Chinese (P=.08). Interrater reliability was moderate (Kendall W: accuracy=0.578; completeness=0.565; comprehensibility=0.486). Performance was consistently stronger for definitional and diagnostic questions than for treatment and prevention topics across all models. Specifically, none of the models consistently provided responses aligned with the latest clinical guidelines for the following key guideline-facing question: "What is the standard treatment regimen for ASCVD?"</p><p><strong>Conclusions: </strong>DeepSeek R1 exhibited promising and consistent performance in generating high-quality, patient-facing ASCVD information across both English and Chinese, highlighting the potential of open-source LLMs in promoting digital health literacy and equitable access to chronic disease information. However, a clinically critical weakness was observed in guideline-sensitive treatment: the models did not reliably provide guideline-concordant standard treatment regimens, suggesting that LLM use should be limited to lower-risk informational subqueries (eg, definitions, diagnosis, and lifestyle education) unless supported by expert oversight and safety controls.</p>
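The abstract reports interrater reliability as Kendall W and a total of 750 responses. As a rough illustration only, the sketch below shows how a tie-corrected Kendall W could be computed from rater scores, and how the response count follows from the stated design (25 questions, 5 repetitions, 3 models, 2 languages). The function names, the toy scores, and the choice of tie correction are assumptions made for this example; the abstract does not say how the authors computed the statistic.

```python
from collections import Counter

# Design arithmetic from the abstract: questions x repetitions x models x languages.
EXPECTED_RESPONSES = 25 * 5 * 3 * 2


def average_ranks(scores):
    """Rank scores in ascending order; tied values share the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # 1-based average rank of the tied run
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def kendall_w(ratings):
    """Tie-corrected Kendall W, where ratings[j][i] is rater j's score for item i."""
    m = len(ratings)      # number of raters
    n = len(ratings[0])   # number of rated items
    ranked = [average_ranks(r) for r in ratings]
    totals = [sum(ranked[j][i] for j in range(m)) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    # Tie term: sum of (t^3 - t) over each group of tied scores, per rater.
    tie_term = sum(sum(c ** 3 - c for c in Counter(r).values()) for r in ratings)
    denom = m ** 2 * (n ** 3 - n) - m * tie_term
    if denom == 0:
        raise ValueError("W is undefined when every rater gives identical scores to all items")
    return 12 * s / denom


if __name__ == "__main__":
    # Hypothetical Likert scores from 3 raters on 5 responses (not study data).
    toy_scores = [
        [6, 5, 6, 3, 4],
        [6, 4, 5, 3, 4],
        [5, 5, 6, 2, 3],
    ]
    print("responses:", EXPECTED_RESPONSES)
    print("Kendall W:", round(kendall_w(toy_scores), 3))
```

W ranges from 0 (no agreement in rankings) to 1 (identical rankings across raters). Likert scores usually contain many ties, which is why this sketch applies the tie correction rather than the uncorrected formula.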
Journal Article. JMIR Medical Informatics, vol. 14, e81422. Published 2026-01-07. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12824577/pdf/
Citations: 0