
Latest Publications from JMIR AI

Treatment Recommendations for Clinical Deterioration on the Wards: Development and Validation of Machine Learning Models.
IF 2 | Pub Date: 2026-01-16 | DOI: 10.2196/81642
Eric Pulick, Kyle A Carey, Tonela Qyli, Madeline K Oguss, Jamila K Picart, Leena Penumalee, Lily K Nezirova, Sean T Tully, Emily R Gilbert, Nirav S Shah, Urmila Ravichandran, Majid Afshar, Dana P Edelson, Yonatan Mintz, Matthew M Churpek

Background: Clinical deterioration in general ward patients is associated with increased morbidity and mortality. Early and appropriate treatments can improve outcomes for such patients. While machine learning (ML) tools have proven successful in the early identification of clinical deterioration risk, little work has explored their effectiveness in providing data-driven treatment recommendations to clinicians for high-risk patients.

Objective: This study established ML performance benchmarks for predicting the need for 10 common clinical deterioration interventions. This study also compared the performance of various ML models to inform which types of approaches are well-suited to these prediction tasks.

Methods: We relied on a chart-reviewed, multicenter dataset of general ward patients experiencing clinical deterioration (n=2480 encounters) who were identified as high risk using a Food and Drug Administration-cleared early warning score (the electronic Cardiac Arrest Risk Triage score). Manual chart review assigned each encounter gold-standard lifesaving treatment labels. We trained elastic net logistic regression, gradient boosted machine, long short-term memory, and stacking ensemble models to predict the need for 10 common deterioration interventions at the time of the elevated risk score. Models were trained on encounters from 3 health systems and externally validated on encounters from a fourth health system. Discriminative performance, assessed by the area under the receiver operating characteristic curve (AUROC), was the primary evaluation metric.
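
For readers unfamiliar with the stacking approach named above, the sketch below shows the general pattern on synthetic data with scikit-learn: base learners (an elastic net logistic regression and a gradient boosted machine; the paper's long short-term memory model is omitted for brevity) feed a logistic meta-learner, and discrimination is scored with AUROC. This is an illustrative sketch, not the authors' implementation.

```python
# Minimal stacking-ensemble sketch for one binary intervention label,
# scored with AUROC. The data and dimensions are synthetic stand-ins.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2480, 20))        # stand-in for per-encounter features
y = rng.integers(0, 2, size=2480)      # stand-in for one treatment label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

base_learners = [
    ("enet", LogisticRegression(penalty="elasticnet", solver="saga",
                                l1_ratio=0.5, max_iter=5000)),
    ("gbm", GradientBoostingClassifier(random_state=0)),
]

# The meta-learner combines the base models' predicted probabilities.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000),
                           stack_method="predict_proba", cv=5)
stack.fit(X_train, y_train)

print(f"AUROC: {roc_auc_score(y_test, stack.predict_proba(X_test)[:, 1]):.3f}")
```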

Results: Discriminative performance varied widely by model and prediction task, with AUROCs typically ranging from 0.7 to 0.9. Across all models, antiarrhythmics were the easiest treatment to predict (mean AUROC 0.866, SD 0.012), while anticoagulants were the hardest (mean AUROC 0.660, SD 0.065). While no individual modeling approach outperformed the others across all tasks, the gradient boosted machines tended to show the best individual performance. Additionally, the stacking ensemble, which combined predictions from all models, typically matched or outperformed the best-performing individual model for each task. We also demonstrated that a sizable fraction of patients in our evaluation cohort were untreated at the time of the elevated risk score, highlighting an opportunity to leverage ML tools to decrease treatment latency.

Conclusions: We found variability in the discrimination of ML models across tasks and model approaches for predicting lifesaving treatments in patients with clinical deterioration. Overall performance was high, and these models could be paired with early warning scores to provide clinicians with timely and actionable treatment recommendations to improve patient care.

Citations: 0
The Role of AI in Improving Digital Wellness Among Older Adults: Comparative Bibliometric Analysis.
IF 2 | Pub Date: 2026-01-14 | DOI: 10.2196/71248
Naveh Eskinazi, Moti Zwilling, Adilson Marques, Riki Tesler

Background: Advances in artificial intelligence (AI) have revolutionized digital wellness by providing innovative solutions for health, social connectivity, and overall well-being. Despite these advancements, the older population often struggles with barriers such as accessibility, digital literacy, and infrastructure limitations, leaving them at risk of digital exclusion. These challenges underscore the critical need for tailored AI-driven interventions to bridge the digital divide and enhance the inclusion of older adults in the digital ecosystem.

Objective: This study presents a comparative bibliometric analysis of research on the role of AI in promoting digital wellness, with a particular emphasis on the older population in comparison to the general population. The analysis addressed five key research topics: (1) the evolution of AI's impact on digital wellness over time for both the older and general population, (2) patterns of collaboration globally, (3) leading institutions' contribution to AI-focused research, (4) prominent journals in the field, and (5) emerging themes and trends in AI-related research.

Methods: Data were collected from the Web of Science between 2016 and 2025, totaling 3429 documents (344 related to older people), analyzed using bibliometric tools.
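
As an illustration of the kind of computation such bibliometric tools perform, the sketch below derives annual publication counts and a co-authorship network from a Web of Science export. The file path is hypothetical, and the "PY" (publication year) and "AU" (authors) column names follow the WoS tab-delimited convention but should be verified against the actual export.

```python
# Sketch of two basic bibliometric computations on a Web of Science export:
# annual output and a co-authorship network. Path and columns are assumptions.
from itertools import combinations

import networkx as nx
import pandas as pd

records = pd.read_csv("wos_export.tsv", sep="\t")  # hypothetical export file

# Publications per year across the 2016-2025 window.
print(records["PY"].value_counts().sort_index())

# Co-authorship network: one node per author, one edge per co-authored paper.
G = nx.Graph()
for authors in records["AU"].dropna():
    names = [a.strip() for a in authors.split(";")]
    G.add_edges_from(combinations(names, 2))

# Most collaborative authors by number of distinct coauthors.
print(sorted(G.degree, key=lambda kv: kv[1], reverse=True)[:10])
```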

Results: AI-related digital wellness research for the general population has experienced exponential growth since 2016, with significant contributions from the United States, the United Kingdom, and China. In contrast, research on older people has seen slower growth, with more localized collaboration networks and a steady increase in citations. Key research topics for the general population include digital health, machine learning, and telemedicine, whereas studies on older people focus on dementia, mobile health, and risk management.

Conclusions: The results of our analysis highlight a growing body of research focused on AI-driven solutions intended to improve digital wellness among older people and identify future research directions that address the specific needs of this population segment.

Citations: 0
Explainable Multitask Burnout Prediction Using Adaptive Deep Learning (EMBRACE) for Resident Physicians: Algorithm Development and Validation Study.
IF 2 | Pub Date: 2026-01-08 | DOI: 10.2196/57025
Saima Alam, Mohammad Arif Ul Alam

Background: Medical residency is characterized by high stress, long working hours, and demanding schedules, leading to widespread burnout among resident physicians. Although wearable sensors and machine learning (ML) models hold promise for predicting burnout, their lack of clinical explainability often limits their utility in health care settings.

Objective: This paper presents EMBRACE (Explainable Multitask Burnout Prediction Using Adaptive Deep Learning), a novel framework designed to predict and explain future burnout in resident physicians through an adaptive multitask deep learning approach. The framework aims to provide clinically actionable and trustworthy burnout predictions by integrating explainable ML techniques.

Methods: EMBRACE applies deep multitask learning (3 tasks) using wearable sensor data for context-aware burnout prediction and explanation. The adaptive multitask learning framework predicts workplace activities and future burnout levels, and automatically completes a clinically validated burnout survey. Additionally, an explainability study was conducted using SHAP (Shapley Additive Explanations) to provide feature importance scores and visualizations for clinicians, enhancing the transparency and interpretability of the predictions. We evaluated the model on three datasets: (1) a collected dataset of 28 resident physicians (mean age 27.5, SD 3.5 years), over 2-7 days (average 3.6 days), with research protocols approved by the institutional review board (#2021-017) of Berkshire Medical Center, University of Massachusetts Chan Medical School; (2) the publicly available WESAD (Wearable Stress and Affect Detection) dataset from 15 participants; and (3) the SWELL-KW (SWELL Knowledge Work) dataset containing workplace stress and activity data from 25 participants (8 females and 17 males).

Results: On our collected dataset, EMBRACE achieved 93% recall, 91% precision, and 0.91 R² error in predicting 5-class activities, 4-class future burnout levels, and 1 clinically explainable survey (Mini-Z with 10 questions). On the WESAD dataset, the model achieved 94.1% recall and 94.6% precision for 3-class stress level prediction. On the SWELL-KW dataset, EMBRACE obtained 89% recall, 86% precision, and 0.88 R² error in predicting 5-class activities, 3 burnout measures (joyful, satisfaction, and stress) with 2 classes on each measure, and 4 survey assessments (a total of 20 questions). The explainability study, using SHAP values, highlighted key contributing factors such as heart rate variability, sedentary activity duration, and interruptions, improving clinical trust and interpretation of burnout predictions. Of 23 participants, 21 (91%) reported satisfaction with the explainability of feature importance summaries.

Conclusions: EMBRACE provides a clinically explainable and actionable solution for early burnout detection in resident physicians, leveraging advanced ML techniques and SHAP-based explanations. Validation on proprietary and publicly available datasets demonstrates its robustness and generalizability. Future research may explore extending the model to diverse clinical settings and evaluating its long-term impact on health care outcomes and physician well-being.
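
The explainability layer rests on SHAP attributions; the sketch below shows how a mean-|SHAP| feature ranking of the kind reported (heart rate variability, sedentary time, interruptions) can be computed with the shap library. The model, features, and data are synthetic stand-ins, not the EMBRACE pipeline.

```python
# Sketch of SHAP-based global feature importance on wearable-style features.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
features = ["hrv_rmssd", "sedentary_minutes", "interruptions",
            "step_count", "sleep_hours"]
X = pd.DataFrame(rng.normal(size=(300, len(features))), columns=features)
# Synthetic label loosely tied to two features, standing in for burnout risk.
y = ((X["sedentary_minutes"] - X["hrv_rmssd"]
      + rng.normal(scale=0.5, size=300)) > 0).astype(int)

model = GradientBoostingClassifier(random_state=1).fit(X, y)

# For a binary gradient-boosted model, TreeExplainer returns one array of
# per-sample, per-feature margin attributions.
shap_values = shap.TreeExplainer(model).shap_values(X)

# Global importance: mean absolute SHAP value per feature.
importance = pd.Series(np.abs(shap_values).mean(axis=0), index=features)
print(importance.sort_values(ascending=False))
```
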
Citations: 0
Assessing the Quality of AI Responses to Patient Concerns About Axial Spondyloarthritis: Delphi-Based Evaluation.
IF 2 | Pub Date: 2026-01-07 | DOI: 10.2196/79153
Jiaxin Bai, Xiaojian Ji, Jiali Yu, Yiwen Wang, Yufei Guo, Chao Xue, Wenrui Zhang, Jian Zhu

Background: Axial spondyloarthritis (axSpA) is a chronic autoinflammatory disease with heterogeneous clinical features, presenting considerable complexity for sustained patient self-management. Although the use of large language models (LLMs) in health care is rapidly expanding, there has been no rigorous assessment of their capacity to provide axSpA-specific health guidance.

Objective: This study aimed to develop a patient-centered needs assessment tool and conduct a systematic evaluation of the quality of LLM-generated health advice for patients with axSpA.

Methods: A 2-round Delphi consensus process guided the design of the questionnaire, which was subsequently administered to 84 patients with axSpA and 26 rheumatologists. Patient-identified key concerns were formulated and input into 5 LLM platforms (GPT-4.0, DeepSeek R1, Hunyuan T1, Kimi k1.5, and Wenxin X1), with all prompts and model outputs in Chinese. Responses were evaluated using 2 techniques: an accuracy assessment based on guideline concordance, with independent double blinding by 2 raters (interrater reliability analyzed via Cohen κ), and the AlphaReadabilityChinese analytic tool to assess readability.
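
The interrater-reliability step can be reproduced in miniature: Cohen κ between the 2 raters' categorical accuracy ratings, computed with scikit-learn. The ratings below are made-up examples, not study data.

```python
# Cohen's kappa between two raters' guideline-concordance ratings.
from sklearn.metrics import cohen_kappa_score

rater_a = [2, 2, 1, 0, 2, 1, 2, 0, 1, 2]  # hypothetical ordinal ratings
rater_b = [2, 1, 1, 0, 2, 1, 2, 1, 1, 2]

print(f"Cohen kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")
```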

Results: Analysis of the validated questionnaire revealed age-related differences. Patients younger than 40 years prioritized symptom management and medication side effects more than those older than 40 years. Distinct priorities between clinicians and patients were identified for diagnostic mimics and drug mechanisms. LLM accuracy was highest in the diagnosis and examination category (mean score 20.4, SD 0.9) but lower in treatment and medication domains (mean score 19.3, SD 1.7). GPT-4.0 and Kimi k1.5 demonstrated superior overall readability; safety remained generally high (disclaimer rates: GPT-4.0 and DeepSeek-R1 100%; Kimi k1.5 88%).

Conclusions: Needs assessment across age groups and observed divergences between clinicians and patients underline the necessity for customized patient education. LLMs performed robustly on most evaluation metrics, and GPT-4.0 achieved 94% overall agreement with clinical guidelines. These tools hold promise as scalable adjuncts for ongoing axSpA support, provided complex clinical decision-making remains under human oversight. Nevertheless, the prevalence of artificial intelligence hallucinations remains a critical barrier. Only through comprehensive mitigation of such risks can LLM-based medical support be safely accelerated.

Citations: 0
Explainable AI-Driven Comparative Analysis of Machine Learning Models for Predicting HIV Viral Nonsuppression in Ugandan Patients: Retrospective Cross-Sectional Study.
IF 2 | Pub Date: 2026-01-06 | DOI: 10.2196/68196
Francis Ngema, Albert Whata, Micheal O Olusanya, Siyabonga Mhlongo

Background: HIV viral suppression is essential for improving health outcomes and reducing transmission rates among people living with HIV. In Uganda, where HIV/AIDS is a major public health concern, machine learning (ML) models can predict viral suppression effectively. However, the limited use of explainable artificial intelligence (XAI) methods affects model transparency and clinical utility.

Objective: This study aimed to develop and compare ML models for predicting viral nonsuppression in Ugandan people living with HIV on antiretroviral therapy (ART), and then systematically apply comprehensive XAI techniques to the best-performing model to identify key predictors and demonstrate interpretability at both population and individual patient levels.

Methods: We retrospectively analyzed clinical and demographic data from 1101 Ugandan people living with HIV on ART at the HIV clinic in Muyembe Health Centre IV between June 2016 and April 2018, focusing on predicting viral nonsuppression (viral load >1000 copies per milliliter). The dataset was divided into model-building (training: 80%) and validation (test: 20%) sets. To address class imbalance, the synthetic minority over-sampling technique was applied. For global explanation, 8 ML algorithms (logistic regression, stacked ensemble, random forest, support vector machines, extreme gradient boosting [XGBoost], k-nearest neighbors, naïve Bayes, and artificial neural networks) were compared. Model performance was evaluated using metrics such as accuracy, precision, recall, F1-score, Cohen κ, and area under the curve (AUC). For local explanation, individual conditional expectation plots, Shapley Additive Explanations (SHAP), breakdown plots, and SHAP force plots were used to provide insights into predictions for individual patients.

Results: The XGBoost ensemble model demonstrated superior performance with an accuracy of 0.89, precision of 0.59, recall of 0.65, and AUC of 0.80. The model achieved high specificity (0.93) and moderate sensitivity, yielding a Cohen κ of 0.55 and F1-score of 0.62, indicating good discriminative ability for viral nonsuppression prediction. SHAP feature importance analysis identified adherence assessment over the preceding 3 months as the most influential predictor of viral nonsuppression, followed by age group, urban residence, and duration on ART. Local SHAP consistently demonstrated that poor adherence was the primary driver of both correctly identified nonsuppressed cases and false positive predictions, reinforcing adherence as the critical determinant of treatment outcomes.

Conclusions: The XGBoost model demonstrated optimal performance for predicting viral nonsuppression among Ugandan people living with HIV on ART, achieving an AUC of 0.80. Comprehensive XAI analysis identified adherence assessment as the primary predictor, followed by age group, residence type, and duration on ART. The XAI methods provided transparent explanations of model predictions at both population and individual patient levels, enabling the identification of key risk factors for targeted clinical interventions in resource-limited settings.
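
As a concrete illustration of the pipeline the Methods describe (synthetic minority over-sampling on the training split, an XGBoost classifier, AUC evaluation, and SHAP importances), here is a minimal sketch on synthetic data; it is not the authors' code, and every name in it is a placeholder.

```python
# SMOTE + XGBoost + AUC + SHAP sketch on synthetic, imbalanced data.
import numpy as np
import shap
from imblearn.over_sampling import SMOTE
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(1101, 12))            # stand-in clinical/demographic features
y = (rng.random(1101) < 0.15).astype(int)  # imbalanced: ~15% nonsuppressed

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Oversample the minority class on the training split only, to avoid leakage.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_tr, y_tr)

model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05,
                      eval_metric="logloss", random_state=42)
model.fit(X_res, y_res)

print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

# Global importance: mean absolute SHAP value per feature on the test split.
shap_values = shap.TreeExplainer(model).shap_values(X_te)
print(np.abs(shap_values).mean(axis=0))
```
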
Citations: 0
Performance of a Small Language Model Versus a Large Language Model in Answering Glaucoma Frequently Asked Patient Questions: Development and Usability Study.
IF 2 | Pub Date: 2026-01-06 | DOI: 10.2196/72101
Adriano Cypriano Faneli, Rafael Scherer, Rohit Muralidhar, Marcus Guerreiro-Filho, Luiz Beniz, Verônica Vilasboas-Campos, Douglas Costa, Alessandro A Jammal, Felipe A Medeiros

Background: Large language models (LLMs) have been shown to answer patient questions in ophthalmology at a level similar to that of human experts. However, concerns remain regarding their use, particularly related to patient privacy and potential inaccuracies that could compromise patient safety.

Objective: This study aimed to compare the performance of an LLM in answering frequently asked patient questions about glaucoma with that of a small language model (SLM) trained locally on ophthalmology-specific literature.

Methods: We compiled 35 frequently asked questions on glaucoma, categorized into 6 domains: pathogenesis, risk factors, clinical manifestations, diagnosis, treatment and prevention, and prognosis. Each question was posed both to an SLM using a retrieval-augmented generation framework, trained on ophthalmology-specific literature, and to an LLM (ChatGPT 4.0, OpenAI). Three glaucoma specialists from a single institution independently assessed the answers using a 3-tier accuracy rating scale: poor (score=1), borderline (score=2), and good (score=3). Each answer received a quality score ranging from 3 to 9 points based on the sum of ratings from the 3 graders. Readability grade level was assessed using 4 formulas: the Flesch-Kincaid Grade Level, the Gunning Fog Index, the Coleman-Liau Index, and the Simple Measure of Gobbledygook Index.
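
Of the 4 readability indices, the Flesch-Kincaid grade level is the simplest to illustrate; the sketch below implements its published formula with a naive vowel-group syllable counter. Dedicated readability tools use more careful tokenization and syllable counting, so treat this as an approximation.

```python
# Flesch-Kincaid grade level with a crude syllable heuristic.
import re

def count_syllables(word: str) -> int:
    # Count runs of consecutive vowels as syllables; crude but common.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text) or ["placeholder"]
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / sentences)
            + 11.8 * (syllables / len(words)) - 15.59)

print(flesch_kincaid_grade(
    "Glaucoma damages the optic nerve. Early treatment protects vision."))
```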

Results: The answers from the SLM demonstrated comparable quality with ChatGPT 4.0, scoring mean 7.9 (SD 1.2) and mean 7.4 (SD 1.5), respectively, out of a total of 9 points (P=.13). The accuracy rating was consistent overall and across all 6 glaucoma care domains. Both models provided answers considered unsuitable for health care-related information, as they were difficult for the average layperson to read.

Conclusions: Both models generated accurate content, but the answers were considered challenging for the average layperson to understand, making them unsuitable for health care-related information. Given the specialized SLM's comparable performance to the LLM, its high customization potential, lower cost, and ability to operate locally, it presents a viable option for deploying natural language processing in real-world ophthalmology clinical settings.

Citations: 0
Predicting the Emergency Department Patient Journey Using a Machine Learning Approach.
IF 2 | Pub Date: 2025-12-19 | DOI: 10.2196/67321
Joshua George Kovoor, Gavin John Carmichael, Brandon Stretton, Aashray K Gupta, Oliver S Kleinig, Mana Ittimani, Jack Fabian, Sheryn Tan, Jeng Swen Ng, Shrirajh Sateakeerthy, Andrew Booth, Alexander Beath, John Kefalianos, Mathew Ollapallil Jacob, Sadeya Ahmed, WengOnn Chan, Pramesh Kovoor, Samuel Gluck, Toby Gilbert, James Malycha, Benjamin A Reddi, Robert T Padbury, Markus I Trochsler, Guy J Maddern, Derek P Chew, Andrew C Zannettino, Danny Liew, John F Beltrame, Patrick G O'Callaghan, Cynthia Papendick, Stephen Bacchi
{"title":"Predicting the Emergency Department Patient Journey Using a Machine Learning Approach.","authors":"Joshua George Kovoor, Gavin John Carmichael, Brandon Stretton, Aashray K Gupta, Oliver S Kleinig, Mana Ittimani, Jack Fabian, Sheryn Tan, Jeng Swen Ng, Shrirajh Sateakeerthy, Andrew Booth, Alexander Beath, John Kefalianos, Mathew Ollapallil Jacob, Sadeya Ahmed, WengOnn Chan, Pramesh Kovoor, Samuel Gluck, Toby Gilbert, James Malycha, Benjamin A Reddi, Robert T Padbury, Markus I Trochsler, Guy J Maddern, Derek P Chew, Andrew C Zannettino, Danny Liew, John F Beltrame, Patrick G O'Callaghan, Cynthia Papendick, Stephen Bacchi","doi":"10.2196/67321","DOIUrl":"10.2196/67321","url":null,"abstract":"","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e67321"},"PeriodicalIF":2.0,"publicationDate":"2025-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12716828/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145795494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Authors' Reply: Predicting the Emergency Department Patient Journey Using a Machine Learning Approach.
IF 2 | Pub Date: 2025-12-19 | DOI: 10.2196/73342
Dhavalkumar Patel, Eyal Klang, Prem Timsina
{"title":"Authors' Reply: Predicting the Emergency Department Patient Journey Using a Machine Learning Approach.","authors":"Dhavalkumar Patel, Eyal Klang, Prem Timsina","doi":"10.2196/73342","DOIUrl":"10.2196/73342","url":null,"abstract":"","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e73342"},"PeriodicalIF":2.0,"publicationDate":"2025-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12716826/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145795486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Physical Examination Identification in Medical Education Videos: Zero-Shot Multimodal AI With Temporal Sequence Optimization Study.
IF 2 | Pub Date: 2025-12-18 | DOI: 10.2196/76586
Shinyoung Kang, Michael Holcomb, David Hein, Ameer Hamza Shakur, Thomas Dalton, Andrew Jamieson

Background: Objective structured clinical examinations (OSCEs) are widely used for assessing medical student competency, but their evaluation is resource-intensive, requiring trained evaluators to review 15-minute videos. The physical examination (PE) component typically constitutes only a small portion of these recordings; yet, current automated approaches struggle to process long medical videos due to computational constraints and difficulties maintaining temporal context.

Objective: This study aims to determine whether multimodal large language models (MM-LLMs) can effectively segment PE periods within OSCE videos without previous training, potentially reducing the evaluation burden on both human graders and automated assessment systems.

Methods: We analyzed 500 videos from 5 OSCE stations at the University of Texas Southwestern Simulation Center, each 15 minutes long, using hand-labeled PE periods as ground truth. Frames were sampled every 1, 2, or 3 seconds. A pose detection preprocessing step filtered out frames without people. Six MM-LLMs performed frame-level classification into encounter states using a standardized prompt. To enforce temporal consistency, we used a hidden Markov model with Viterbi decoding, merging states into 3 primary activities (consulting/notes, physical examination, and no doctor) and adding a brief edge buffer to avoid truncating true PE segments. Performance was computed per video and averaged across the dataset using recall, precision, intersection over union (IOU), and predicted PE length with 95% CIs.

Results: At 1-second sampling, GPT-4o achieved recall of 0.998 (95% CI 0.994-1.000), IOU of 0.784 (95% CI 0.765-0.803), and precision of 0.792 (95% CI 0.774-0.811), identifying a mean of 175 (SD 83) seconds of content per video as PE versus a mean labeled PE of 126 (SD 61) seconds, yielding an 81% reduction in video needing review (from 900 to 175 seconds). Across stations, recall remained high, with expected IOU variability linked to examination format and camera geometry. Increasing the sampling interval modestly decreased recall while slightly improving IOU and precision. Comparative baselines (eg, Gemini 2.0 Flash, Gemma 3, and Qwen2.5-VL variants) demonstrated trade-offs between recall and overselection; GPT-4o offered the best balance among high-recall models. Error analysis highlighted false negatives during occluded or verbally guided maneuvers and false positives during preparatory actions, suggesting opportunities for camera placement optimization and multimodal fusion (eg, audio cues).

Conclusions: Integrating zero-shot MM-LLMs with minimal-supervision temporal modeling effectively segments PE periods in OSCE videos without requiring extensive training data. This approach significantly reduces review time while maintaining clinical assessment integrity, demonstrating that artificial intelligence approaches combining zero-shot capability with light supervision can be optimized for the specific requirements of medical education. This technique lays the foundation for more efficient and scalable clinical skills assessment across diverse medical education settings.
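
The temporal-consistency step pairs the frame-level probabilities with hidden Markov model smoothing; below is a minimal Viterbi decoder over the 3 merged activity states using a "sticky" transition matrix that penalizes rapid switching. The states, example probabilities, and stay parameter are illustrative, not the paper's tuned values.

```python
# Viterbi decoding of frame-level state probabilities with sticky transitions.
import numpy as np

STATES = ["consulting/notes", "physical_exam", "no_doctor"]

def viterbi(emission_probs: np.ndarray, stay: float = 0.95) -> list:
    n_frames, n_states = emission_probs.shape
    # Transition matrix: high probability of remaining in the current state.
    trans = np.full((n_states, n_states), (1 - stay) / (n_states - 1))
    np.fill_diagonal(trans, stay)

    log_e = np.log(emission_probs + 1e-12)
    log_t = np.log(trans)

    score = np.zeros((n_frames, n_states))
    back = np.zeros((n_frames, n_states), dtype=int)
    score[0] = np.log(1.0 / n_states) + log_e[0]
    for t in range(1, n_frames):
        cand = score[t - 1][:, None] + log_t   # rows: from-state, cols: to-state
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_e[t]

    # Backtrace the highest-scoring path.
    path = [int(score[-1].argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [STATES[s] for s in reversed(path)]

# Raw per-frame argmax would flicker between states; Viterbi keeps one run.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.6, 0.1],
                  [0.4, 0.5, 0.1],
                  [0.8, 0.1, 0.1]])
print(viterbi(probs))
```
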
Citations: 0
A Multiagent Summarization and Auto-Evaluation Framework for Medical Text: Development and Evaluation Study.
IF 2 | Pub Date: 2025-12-16 | DOI: 10.2196/75932
Yuhao Chen, Bo Wen, Farhana Zulkernine

Background: Although large language models (LLMs) show great promise in processing medical text, they are prone to generating incorrect information, commonly referred to as hallucinations. These inaccuracies present a significant risk for clinical applications where precision is critical. Additionally, relying on human experts to review LLM-generated content to ensure accuracy is costly and time-consuming, which sets a barrier against large-scale deployment of LLMs in health care settings.

Objective: The primary objective of this study was to develop an automatic artificial intelligence (AI) system capable of extracting structured information from unstructured medical data and using advanced reasoning techniques to support reliable clinical decision making. A key aspect of this objective is ensuring that the system incorporates self-verification mechanisms, enabling it to assess the accuracy and reliability of its own outputs. By integrating such mechanisms, we aim to enhance the system's robustness, reduce reliance on human intervention, and improve the overall trustworthiness of AI-driven medical summarization and evaluation.

Methods: The proposed framework comprises 2 layers: a summarization layer and an evaluation layer. The summarization layer uses Llama2-70B (Meta AI) and Mistral-7B (Mistral AI) models to generate concise summaries from unstructured medical data, focusing on tasks such as consumer health question summarization, biomedical answer summarization, and dialog summarization. The evaluation layer uses GPT-4-turbo (OpenAI) as a judge, leveraging pairwise comparison strategies and different prompt strategies to evaluate summaries across 4 dimensions: coherence, consistency, fluency, and relevance. To validate the framework, we compare the judgments generated by the LLM assistants in the evaluation layer with those provided by medical experts, offering valuable insights into the alignment and reliability of AI-driven evaluations within the medical domain. We also explore a way to handle disagreement among human experts and discuss our methodology in addressing diversity in human perspectives.

Results: The study found variability in expert consensus, with average agreement rates of 19.2% among all experts and 54% among groups of 3 experts. GPT-4 (OpenAI) demonstrated alignment with expert judgments, achieving an average agreement rate of 83.06% with at least 1 expert and comparable performance in cross-validation tests. Enhanced guidance in prompt design improved GPT-4's alignment with expert evaluations compared with a baseline prompt, highlighting the importance of effective prompt engineering in auto-evaluation of summarization tasks. We also evaluated open-source LLMs, including Llama-3.3 (Meta AI) and Mixtral-Large (Mistral AI), and a domain-specific LLM, OpenBioLLM (Aaditya Ura), for comparison as LLM judges.

Conclusions: This study highlights the potential of LLMs as reliable tools for summarizing and evaluating unstructured medical data while reducing reliance on human experts, and it also notes their limitations. The proposed multiagent summarization and auto-evaluation framework demonstrates scalability and adaptability for clinical applications while addressing key challenges such as hallucinations and position bias.
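
One standard way to operationalize the pairwise judging described for the evaluation layer, while guarding against the position bias the conclusions mention, is to query the judge twice with the summary order swapped and accept only consistent verdicts. The sketch below assumes the OpenAI chat completions client; the prompt wording is illustrative, not the paper's rubric.

```python
# Position-debiased pairwise judging: swap the order and require agreement.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You compare two summaries of the same medical text on coherence, "
    "consistency, fluency, and relevance. Reply with exactly 'A' or 'B'.\n\n"
    "Source:\n{source}\n\nSummary A:\n{a}\n\nSummary B:\n{b}"
)

def judge_once(source: str, a: str, b: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user",
                   "content": PROMPT.format(source=source, a=a, b=b)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def judge_pair(source: str, summary_1: str, summary_2: str) -> str:
    first = judge_once(source, summary_1, summary_2)   # summary_1 shown as A
    second = judge_once(source, summary_2, summary_1)  # order swapped
    if first == "A" and second == "B":
        return "summary_1"
    if first == "B" and second == "A":
        return "summary_2"
    return "tie"  # inconsistent across orders, a sign of position bias
```
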
背景:尽管大型语言模型(llm)在处理医学文本方面显示出巨大的前景,但它们容易产生不正确的信息,通常被称为幻觉。这些不准确性为临床应用带来了重大风险,其中精度是至关重要的。此外,依靠人类专家来审查法学硕士生成的内容以确保准确性既昂贵又耗时,这对在医疗保健环境中大规模部署法学硕士设置了障碍。目的:本研究的主要目的是开发一种自动人工智能(AI)系统,该系统能够从非结构化医疗数据中提取结构化信息,并使用先进的推理技术来支持可靠的临床决策。这一目标的一个关键方面是确保该系统纳入自我核查机制,使其能够评估其自身产出的准确性和可靠性。通过整合这些机制,我们的目标是增强系统的鲁棒性,减少对人为干预的依赖,并提高人工智能驱动的医疗总结和评估的整体可信度。方法:提出的框架包括总结层和评价层两层。摘要层使用Llama2-70B (Meta AI)和Mistral- 7b (Mistral AI)模型从非结构化医疗数据生成简明摘要,重点完成消费者健康问题摘要、生物医学答案摘要和对话摘要等任务。评估层使用GPT-4-turbo (OpenAI)作为裁判,利用两两比较策略和不同提示策略从四个维度评估摘要:连贯性、一致性、流畅性和相关性。为了验证该框架,我们将评估层中LLM助手生成的判断与医学专家提供的判断进行了比较,从而为医疗领域中人工智能驱动评估的一致性和可靠性提供了有价值的见解。我们还探讨了一种处理人类专家之间分歧的方法,并讨论了我们在解决人类观点多样性方面的方法。结果:研究发现了专家共识的可变性,所有专家的平均一致性率为19.2%,3名专家的平均一致性率为54%。GPT-4 (OpenAI)证明了与专家判断的一致性,在交叉验证测试中,至少有一位专家的平均一致性率达到83.06%,并且具有相当的性能。与基线提示相比,增强的提示设计指导提高了GPT-4与专家评估的一致性,突出了有效的提示工程在总结任务自动评估中的重要性。我们还评估了开源法学硕士,包括Llama-3.3 (Meta AI)和Mixtral-Large (Mistral AI),以及一个特定领域的法学硕士,OpenBioLLM (Aaditya Ura),作为法学硕士评委进行比较。结论:本研究强调了llm作为非结构化医疗数据总结和评估的可靠工具的潜力,以减少对人类专家的依赖,同时也指出了局限性。提出的框架,多智能体总结和自动评估,展示了临床应用的可扩展性和适应性,同时解决了幻觉和位置偏见等关键挑战。
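The two-layer architecture described in the Methods maps naturally onto a small amount of orchestration code. The sketch below is a minimal illustration of how such a pipeline could be wired, under stated assumptions: the prompt-in, text-out `LLMFn` interface, the prompt wording, and the swap-and-compare tie rule for countering position bias are illustrative stand-ins, not the authors' implementation.

```python
from typing import Callable

# A prompt-in, text-out interface; bind it to any chat-completion client.
LLMFn = Callable[[str], str]

# The four evaluation dimensions named in the Methods.
DIMENSIONS = ["coherence", "consistency", "fluency", "relevance"]

def summarize(source_text: str, summarizer: LLMFn) -> str:
    """Summarization layer: one agent condenses an unstructured medical text."""
    prompt = ("Summarize the following medical text concisely and factually.\n\n"
              + source_text)
    return summarizer(prompt)

def judge_pair(source: str, summary_a: str, summary_b: str,
               judge: LLMFn, dimension: str) -> str:
    """Evaluation layer: pairwise comparison of two summaries on one dimension.

    The pair is judged twice with the order swapped to counter position bias;
    the verdict stands only if both orderings agree, otherwise it is a tie.
    (This tie rule is an assumption, not the paper's stated procedure.)
    """
    def ask(first: str, second: str) -> str:
        prompt = (f"Which summary is better on {dimension}?\n"
                  f"Source:\n{source}\n\nSummary 1:\n{first}\n\n"
                  f"Summary 2:\n{second}\nAnswer with '1', '2', or 'tie'.")
        return judge(prompt).strip().lower()

    first_pass = ask(summary_a, summary_b)
    second_pass = ask(summary_b, summary_a)  # same pair, order swapped
    remap = {"1": "2", "2": "1", "tie": "tie"}  # translate swapped verdict back
    if first_pass == remap.get(second_pass, "invalid"):
        return first_pass
    return "tie"
```

With `summarize` bound to two different models (for example, a Llama endpoint and a Mistral endpoint) and `judge_pair` run once per entry in `DIMENSIONS`, the aggregated verdicts play the role of the framework's evaluation layer.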
{"title":"A Multiagent Summarization and Auto-Evaluation Framework for Medical Text: Development and Evaluation Study.","authors":"Yuhao Chen, Bo Wen, Farhana Zulkernine","doi":"10.2196/75932","DOIUrl":"10.2196/75932","url":null,"abstract":"&lt;p&gt;&lt;strong&gt;Background: &lt;/strong&gt;Although large language models (LLMs) show great promise in processing medical text, they are prone to generating incorrect information, commonly referred to as hallucinations. These inaccuracies present a significant risk for clinical applications where precision is critical. Additionally, relying on human experts to review LLM-generated content to ensure accuracy is costly and time-consuming, which sets a barrier against large-scale deployment of LLMs in health care settings.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Objective: &lt;/strong&gt;The primary objective of this study was to develop an automatic artificial intelligence (AI) system capable of extracting structured information from unstructured medical data and using advanced reasoning techniques to support reliable clinical decision making. A key aspect of this objective is ensuring that the system incorporates self-verification mechanisms, enabling it to assess the accuracy and reliability of its own outputs. By integrating such mechanisms, we aim to enhance the system's robustness, reduce reliance on human intervention, and improve the overall trustworthiness of AI-driven medical summarization and evaluation.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Methods: &lt;/strong&gt;The proposed framework comprises 2 layers: a summarization layer and an evaluation layer. The summarization layer uses Llama2-70B (Meta AI) and Mistral-7B (Mistral AI) models to generate concise summaries from unstructured medical data, focusing on tasks such as consumer health question summarization, biomedical answer summarization, and dialog summarization. The evaluation layer uses GPT-4-turbo (OpenAI) as a judge, leveraging pairwise comparison strategies and different prompt strategies to evaluate summaries across 4 dimensions: coherence, consistency, fluency, and relevance. To validate the framework, we compare the judgments generated by the LLM assistants in the evaluation layer with those provided by medical experts, offering valuable insights into the alignment and reliability of AI-driven evaluations within the medical domain. We also explore a way to handle disagreement among human experts and discuss our methodology in addressing diversity in human perspectives.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Results: &lt;/strong&gt;The study found variability in expert consensus, with average agreement rates of 19.2% among all experts and 54% among groups of 3 experts. GPT-4 (OpenAI) demonstrated alignment with expert judgments, achieving an average agreement rate of 83.06% with at least 1 expert and comparable performance in cross-validation tests. The enhanced guidance in prompt design (prompt-enhanced guidance) improved GPT-4's alignment with expert evaluations compared with a baseline prompt, highlighting the importance of effective prompt engineering in auto-evaluation of summarization tasks. 
We also evaluated open-source LLMs, including Llama-3.3 (Meta AI) and Mixtral-Large (Mistral AI), and a domain-specific LLM, OpenBioLLM (Aaditya Ura), for comparison as LLM judges.&lt;/","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e75932"},"PeriodicalIF":2.0,"publicationDate":"2025-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12707800/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145770171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
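The agreement figures in the Results (19.2% consensus among all experts; 83.06% judge agreement with at least 1 expert) admit a simple operationalization. The toy sketch below assumes the most direct reading of those metrics, per-item unanimity for expert consensus and a per-item match with any expert for judge agreement, which may differ from the study's exact definitions; the data are fabricated for illustration only.

```python
def consensus_rate(expert_votes: list[list[str]]) -> float:
    """Fraction of items on which every expert gave the same verdict."""
    unanimous = sum(1 for votes in expert_votes if len(set(votes)) == 1)
    return unanimous / len(expert_votes)

def agree_with_any_expert(judge_votes: list[str],
                          expert_votes: list[list[str]]) -> float:
    """Fraction of items where the judge matches at least one expert."""
    hits = sum(1 for j, votes in zip(judge_votes, expert_votes) if j in votes)
    return hits / len(judge_votes)

# Toy example: 4 items, 3 experts each, plus judge verdicts ('1', '2', 'tie').
experts = [["1", "1", "1"], ["1", "2", "tie"], ["2", "2", "2"], ["tie", "1", "2"]]
judge = ["1", "2", "2", "1"]
print(consensus_rate(experts))                # 0.5: items 1 and 3 are unanimous
print(agree_with_any_expert(judge, experts))  # 1.0: judge matches someone each time
```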
Citations: 0