
JMIR AI: Latest Publications

Artificial Intelligence in Point-of-Care Imaging for Clinical Decision Support: Systematic Review of Diagnostic Accuracy, Task-Shifting, and Explainability.
IF 2 Pub Date: 2026-02-07 DOI: 10.2196/80928
Peter Wadie, Bishoy Zakher, Khalid Elgazzar, Abdulhamid Alsbakhi, Abdul-Mohsen G Alhejaily
Background: Artificial intelligence (AI) integrated with point-of-care (POC) imaging has emerged as a promising approach to expand diagnostic access in settings with limited specialist availability. However, no systematic review has comprehensively evaluated AI-assisted clinical decision support across multiple POC imaging modalities, assessed explainability implementation, or quantified clinical impact evidence gaps.

Objective: To systematically evaluate and synthesize evidence on AI-based clinical decision support systems utilizing point-of-care imaging, with particular attention to task-shifting potential, explainability implementation, and clinical outcome evidence.

Methods: We searched PubMed, Scopus, IEEE Xplore, and Web of Science (January 2018 to November 2025). We included research studies evaluating AI/machine learning systems applied to POC-capable imaging modalities in POC clinical settings with clinical decision support outputs. Two reviewers independently screened studies, extracted data across 15 domains, and assessed methodological quality using QUADAS-2. Proposed frameworks were developed to evaluate explainability implementation and clinical impact evidence. Narrative synthesis was performed due to substantial data heterogeneity.

Results: Of 2,113 records identified, 20 studies met inclusion criteria, encompassing approximately 78,296 patients across 15 countries. Studies evaluated tuberculosis (n=5), breast cancer (n=3), deep vein thrombosis (n=2), and nine other conditions using ultrasound (35%, 7/20), chest X-ray (25%, 5/20), photography-based and colposcopic imaging (15%, 3/20), fundus photography (10%, 2/20), microscopy (10%, 2/20), and dermoscopy (5%, 1/20). Median sensitivity was 92% (IQR 85.7%-98.0%), and median specificity was 90.6% (IQR 70.0%-95.7%). Task-shifting was demonstrated in 65% (13/20) of studies, with nonspecialists achieving specialist-level performance after a median of 1 hour of training. The explainable AI (XAI) implementation cascade revealed critical gaps: 75% (15/20) of studies did not mention explainability, 10% (2/20) provided explanations to users, and none evaluated whether clinicians understood explanations or whether XAI influenced decisions. The clinical impact pyramid showed 15% (3/20) of studies reported technical accuracy only, 65% (13/20) reported process outcomes, 20% (4/20) documented clinical actions, and none measured patient outcomes. Methodological quality was concerning, as 70% (14/20) of studies were at high or very high risk of bias, with verification bias (70%, 14/20) and selection bias (50%, 10/20) being the most common. The overall certainty of evidence was very low (Grading of Recommendations, Assessment, Development, and Evaluation [GRADE] ⊕◯◯◯), primarily due to risk of bias, heterogeneity, and imprecision.

Conclusions: AI-assisted POC imaging demonstrates promising diagnostic accuracy and can enable meaningful task-shifting with minimal training requirements. However, critical evidence gaps remain, including the lack of patient outcome measurement, inadequate explainability evaluation, regulatory deviations, and the absence of cross-context validation despite claims of global applicability. Addressing these gaps requires implementation studies with patient outcome endpoints, rigorous XAI evaluation, and multicontext validation before widespread adoption. Limitations include the restriction to English-language publications, the exclusion of gray literature, and heterogeneity that precluded meta-analysis.

Trial registration: This review was not prospectively registered due to time constraints.
{"title":"Artificial Intelligence in Point-of-Care Imaging for Clinical Decision Support: Systematic Review of Diagnostic Accuracy, Task-Shifting, and Explainability.","authors":"Peter Wadie, Bishoy Zakher, Khalid Elgazzar, Abdulhamid Alsbakhi, Abdul-Mohsen G Alhejaily","doi":"10.2196/80928","DOIUrl":"https://doi.org/10.2196/80928","url":null,"abstract":"&lt;p&gt;&lt;strong&gt;Background: &lt;/strong&gt;Artificial intelligence (AI) integrated with point-of-care (POC) imaging has emerged as a promising approach to expand diagnostic access in settings with limited specialist availability. However, no systematic review has comprehensively evaluated AI-assisted clinical decision support across multiple POC imaging modalities, assessed explainability implementation, or quantified clinical impact evidence gaps.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Objective: &lt;/strong&gt;To systematically evaluate and synthesize evidence on AI-based clinical decision support systems utilizing point-of-care imaging, with particular attention to task-shifting potential, explainability implementation, and clinical outcome evidence.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Methods: &lt;/strong&gt;We searched PubMed, Scopus, IEEE Xplore, and Web of Science (January 2018 to November 2025). We included research studies evaluating AI/machine learning systems applied to POC-capable imaging modalities in POC clinical settings with clinical decision support outputs. Two reviewers independently screened studies, extracted data across 15 domains, and assessed methodological quality using QUADAS-2. Proposed frameworks were developed to evaluate explainability implementation and clinical impact evidence. Narrative synthesis was performed due to substantial data heterogeneity.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Results: &lt;/strong&gt;Of 2,113 records identified, 20 studies met inclusion criteria, encompassing approximately 78,296 patients across 15 countries. Studies evaluated tuberculosis (n=5), breast cancer (n=3), deep vein thrombosis (n=2), and nine other conditions using ultrasound (35%, 7/20), chest X-ray (25%, 5/20), photography-based and colposcopic imaging (15%, 3/20), fundus photography (10%, 2/20), microscopy (10%, 2/20), and dermoscopy (5%, 1/20). Median sensitivity was 92% (IQR 85.7%-98.0%), and median specificity was 90.6% (IQR 70.0%-95.7%). Task-shifting was demonstrated in 65% (13/20) of studies, with nonspecialists achieving specialist-level performance after a median of 1 hour of training. The explainable AI (XAI) implementation cascade revealed critical gaps: 75% (15/20) of studies did not mention explainability, 10% (2/20) provided explanations to users, and none evaluated whether clinicians understood explanations or whether XAI influenced decisions. The clinical impact pyramid showed 15% (3/20) of studies reported technical accuracy only, 65% (13/20) reported process outcomes, 20% (4/20) documented clinical actions, and none measured patient outcomes. Methodological quality was concerning, as 70% (14/20) of studies were at high or very high risk of bias, with verification bias (70%, 14/20) and selection bias (50%, 10/20) being the most common. 
The overall certainty of evidence was very low-Grading of Recommendations, Assessment, Development, and Evaluation (GRADE) ⊕◯◯◯, primarily due to risk of bias, heterogeneity, and imprecision.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Conclusions: &lt;/strong&gt;AI-assisted POC imaging demonstrates promising d","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":" ","pages":""},"PeriodicalIF":2.0,"publicationDate":"2026-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146151324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Ambient AI Documentation and Patient Satisfaction in Outpatient Care: Retrospective Pilot Study.
IF 2 Pub Date: 2026-02-06 DOI: 10.2196/78830
Eric Davis, Sarah Davis, Kristina Haralambides, Conrad Gleber, Gregg Nicandri

Background: Patient experience is a critical consideration for any health care institution. Leveraging artificial intelligence (AI) to improve health care delivery has rapidly become an institutional priority across the United States. Ambient AI documentation systems such as Dragon Ambient eXperience (DAX) may influence patient perception of health care provider communication and overall experience.

Objective: The objective of this study was to assess the impact of the implementation of an ambient AI documentation system (DAX) on Press Ganey (PG) patient experience scores.

Methods: A retrospective study was conducted to evaluate the relationship between provider use of DAX (N=49) and PG patient satisfaction scores from January 2023 to December 2024. Three domains were analyzed: (1) overall assessment of the experience, (2) concern the care provider showed for patients' questions or worries, and (3) likelihood of recommending the care provider to others. Mean pretest-posttest score differences and P values were calculated.
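
The abstract reports mean pretest-posttest differences and P values without naming the exact test; a paired t test on per-provider scores is one plausible reading, sketched below with illustrative numbers rather than study data.

```python
# Sketch of a pretest-posttest comparison: mean score difference plus a P
# value. The abstract does not name the test used; a paired t test on
# per-provider Press Ganey scores is one plausible approach. Data are
# illustrative, not the study's.
import numpy as np
from scipy.stats import ttest_rel

pre = np.array([88.1, 90.4, 85.7, 92.3, 89.0])   # per-provider scores, pre-DAX
post = np.array([90.2, 91.8, 88.9, 93.1, 90.6])  # same providers, post-DAX

diff = post.mean() - pre.mean()
t_stat, p_value = ttest_rel(post, pre)
print(f"mean change = {diff:.1f} points, P = {p_value:.3f}")
```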

Results: A total of 49 health care providers across 9 departments participated in the DAX pilot. Aggregate scores for individual items increased between 0.9 and 1.9 points. Care provider concern for a patient's questions or worries increased the most (1.9 points; P=.01), followed by overall assessment of the experience (1.3 points; P=.09) and likelihood of recommending the provider (0.9 points; P=.33). Subgroup analysis showed a larger increase in concern scores among providers using DAX <50% of the time (3.2-point increase; P=.03).

Conclusions: This pilot study aimed to investigate the relationship between provider use of DAX and PG patient experience scores in the outpatient setting at a large academic medical center. Increases in PG scores after implementing DAX were observed across all PG items assessed. As technology and AI continue to improve and become more widespread, these results are encouraging. Health care providers may consider leveraging AI note-taking software as a way to enhance their communication and interactions with patients.

{"title":"Ambient AI Documentation and Patient Satisfaction in Outpatient Care: Retrospective Pilot Study.","authors":"Eric Davis, Sarah Davis, Kristina Haralambides, Conrad Gleber, Gregg Nicandri","doi":"10.2196/78830","DOIUrl":"10.2196/78830","url":null,"abstract":"<p><strong>Background: </strong>Patient experience is a critical consideration for any health care institution. Leveraging artificial intelligence (AI) to improve health care delivery has rapidly become an institutional priority across the United States. Ambient AI documentation systems such as Dragon Ambient eXperience (DAX) may influence patient perception of health care provider communication and overall experience.</p><p><strong>Objective: </strong>The objective of this study was to assess the impact of the implementation of an ambient AI documentation system (DAX) on Press Ganey (PG) patient experience scores.</p><p><strong>Methods: </strong>A retrospective study was conducted to evaluate the relationship between provider use of DAX (N=49) and PG patient satisfaction scores from January 2023 to December 2024. Three domains were analyzed: (1) overall assessment of the experience, (2) concern the care provider showed for patients' questions or worries, and (3) likelihood of recommending the care provider to others. Mean pretest-posttest score differences and P values were calculated.</p><p><strong>Results: </strong>A total of 49 health care providers across 9 departments participated in the DAX pilot. Aggregate scores for individual items increased between 0.9 and 1.9 points. Care provider concern for a patient's questions or worries increased the most (1.9 points; P=.01), followed by overall assessment of the experience (1.3 points; P=.09) and likelihood of recommending the provider (0.9 points; P=.33). Subgroup analysis showed a larger increase in concern scores among providers using DAX <50% of the time (3.2-point increase; P=.03).</p><p><strong>Conclusions: </strong>This pilot study aimed to investigate the relationship between provider use of DAX and PG patient experience scores in the outpatient setting at a large academic medical center. Increases in PG scores after implementing DAX were observed across all PG items assessed. As technology and AI continue to improve and become more widespread, these results are encouraging. Health care providers may consider leveraging AI note-taking software as a way to enhance their communication and interactions with patients.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e78830"},"PeriodicalIF":2.0,"publicationDate":"2026-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12880801/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146133826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Ethical Risks and Structural Implications of AI-Mediated Medical Interpreting.
IF 2 Pub Date: 2026-02-05 DOI: 10.2196/88651
Alexandra Lopez Vera

Unlabelled: Artificial intelligence (AI) is increasingly used to support medical interpreting and public health communication, yet current systems introduce serious risks to accuracy, confidentiality, and equity, particularly for speakers of low-resource languages. Automatic translation models often struggle with regional varieties, figurative language, culturally embedded meanings, and emotionally sensitive conversations about reproductive health or chronic disease, which can lead to clinically significant misunderstandings. These limitations threaten patient safety, informed consent, and trust in health systems when clinicians rely on AI as if it were a professional interpreter. At the same time, the large data sets required to train and maintain these systems create new concerns about surveillance, secondary use of linguistic data, and gaps in existing privacy protections. This viewpoint examines the ethical and structural implications of AI-mediated interpreting in clinical and public health settings, arguing that its routine use as a replacement for qualified interpreters would normalize a lower standard of care for people with Non-English Language Preference and reinforce existing health disparities. Instead, AI tools should be treated as optional, carefully evaluated supplements that operate under the supervision of trained clinicians and professional interpreters, within clear regulatory guardrails for transparency, accountability, and community oversight. The paper concludes that language access must remain grounded in human expertise, language rights, and structural commitments to equity, rather than in cost-saving promises of automated systems.

{"title":"Ethical Risks and Structural Implications of AI-Mediated Medical Interpreting.","authors":"Alexandra Lopez Vera","doi":"10.2196/88651","DOIUrl":"10.2196/88651","url":null,"abstract":"<p><strong>Unlabelled: </strong>Artificial intelligence (AI) is increasingly used to support medical interpreting and public health communication, yet current systems introduce serious risks to accuracy, confidentiality, and equity, particularly for speakers of low-resource languages. Automatic translation models often struggle with regional varieties, figurative language, culturally embedded meanings, and emotionally sensitive conversations about reproductive health or chronic disease, which can lead to clinically significant misunderstandings. These limitations threaten patient safety, informed consent, and trust in health systems when clinicians rely on AI as if it were a professional interpreter. At the same time, the large data sets required to train and maintain these systems create new concerns about surveillance, secondary use of linguistic data, and gaps in existing privacy protections. This viewpoint examines the ethical and structural implications of AI-mediated interpreting in clinical and public health settings, arguing that its routine use as a replacement for qualified interpreters would normalize a lower standard of care for people with Non-English Language Preference and reinforce existing health disparities. Instead, AI tools should be treated as optional, carefully evaluated supplements that operate under the supervision of trained clinicians and professional interpreters, within clear regulatory guardrails for transparency, accountability, and community oversight. The paper concludes that language access must remain grounded in human expertise, language rights, and structural commitments to equity, rather than in cost-saving promises of automated systems.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e88651"},"PeriodicalIF":2.0,"publicationDate":"2026-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12875660/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146127770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Clinical Evidence Linkage From the American Society of Clinical Oncology 2024 Conference Poster Images Using Generative AI: Exploratory Observational Study.
IF 2 Pub Date: 2026-02-05 DOI: 10.2196/78148
Carlos Areia, Michael Taylor

Background: Early-stage clinical findings often appear only as conference posters circulated on social media. Because posters rarely carry structured metadata, their citations are invisible to bibliometric and alternative metric tools, limiting real-time research discovery.

Objective: This study aimed to determine whether a large language model can accurately extract citation data from clinical conference poster images shared on X (formerly known as Twitter) and link those data to the Dimensions and Altmetric databases.

Methods: Poster images associated with the 2024 American Society of Clinical Oncology conference were searched using the terms "#ASCO24," "#ASCO2024," and the conference name. Images ≥100 kB that contained the word "poster" in the post text were retained. A prompt-engineered Gemini 2.0 Flash model classified images, summarized posters, and extracted structured citation elements (eg, authors, titles, and digital object identifiers [DOIs]) in JSON format. A hierarchical linkage algorithm matched extracted elements against Dimensions records, prioritizing persistent identifiers and then title-journal-author composites. Manual validation was performed on a random 20% sample.
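
As a rough sketch of the hierarchical linkage step described above (persistent identifiers first, then a title-journal-author composite), the Python below uses illustrative record structures and exact-match indexes; the study's actual matching against Dimensions records may differ.

```python
# Minimal sketch of a hierarchical citation-linkage pass: try persistent
# identifiers (DOIs) first, then fall back to a title-journal-author
# composite key. Record fields, the normalize() helper, and the prebuilt
# indexes are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Citation:
    doi: Optional[str]
    title: str
    journal: str
    first_author: str

def normalize(text: str) -> str:
    """Lowercase and strip punctuation/whitespace for robust comparison."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

def link_citation(extracted: Citation, index_by_doi: dict, index_by_composite: dict):
    # Tier 1: persistent identifier match (most reliable).
    if extracted.doi:
        record = index_by_doi.get(extracted.doi.lower())
        if record is not None:
            return record, "doi"
    # Tier 2: title-journal-author composite match.
    key = (normalize(extracted.title), normalize(extracted.journal),
           normalize(extracted.first_author))
    record = index_by_composite.get(key)
    if record is not None:
        return record, "composite"
    return None, "unlinked"
```

In practice, the composite tier would need to tolerate near matches rather than exact dictionary lookups, since the validation found that most linkage errors occurred where only partial bibliographic details were available.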

Results: We searched within 115,714 posts and 16,574 images, of which 651 (3.9%) met the inclusion criteria, and we obtained 1117 potential citations. The algorithm linked 63.4% (708/1117) of the citations to 616 unique research outputs (n=580, 94.2% journal articles; n=36, 5.8% clinical trial registrations). Manual review of 135 randomly sampled citations confirmed correct linkage in 124 (91.9%) cases. DOI-based matching was mostly flawless; most errors occurred where only partial bibliographic details were available. The linked dataset enabled rapid profiling of topical foci (eg, lung and breast cancer) and identification of the most frequently referenced institutions and clinical trials in shared posters.

Conclusions: This study presents a novel artificial intelligence-driven methodology for enhancing research discovery and attention analysis from nontraditional clinical scholarly outputs. The American Society of Clinical Oncology was used as an example, but this methodology could be used for any conference and clinical poster.

{"title":"Clinical Evidence Linkage From the American Society of Clinical Oncology 2024 Conference Poster Images Using Generative AI: Exploratory Observational Study.","authors":"Carlos Areia, Michael Taylor","doi":"10.2196/78148","DOIUrl":"https://doi.org/10.2196/78148","url":null,"abstract":"<p><strong>Background: </strong>Early-stage clinical findings often appear only as conference posters circulated on social media. Because posters rarely carry structured metadata, their citations are invisible to bibliometric and alternative metric tools, limiting real-time research discovery.</p><p><strong>Objective: </strong>This study aimed to determine whether a large language model can accurately extract citation data from clinical conference poster images shared on X (formerly known as Twitter) and link those data to the Dimensions and Altmetric databases.</p><p><strong>Methods: </strong>Poster images associated with the 2024 American Society of Clinical Oncology conference were searched using the terms \"#ASCO24,\" \"#ASCO2024,\" and the conference name. Images ≥100 kB that contained the word \"poster\" in the post text were retained. A prompt-engineered Gemini 2.0 Flash model classified images, summarized posters, and extracted structured citation elements (eg, authors, titles, and digital object identifiers [DOIs]) in JSON format. A hierarchical linkage algorithm matched extracted elements against Dimensions records, prioritizing persistent identifiers and then title-journal-author composites. Manual validation was performed on a random 20% sample.</p><p><strong>Results: </strong>We searched within 115,714 posts and 16,574 images, of which 651 (3.9%) met the inclusion criteria, and we obtained 1117 potential citations. The algorithm linked 63.4% (708/1117) of the citations to 616 unique research outputs (n=580, 94.2% journal articles; n=36, 5.8% clinical trial registrations). Manual review of 135 randomly sampled citations confirmed correct linkage in 124 (91.9%) cases. DOI-based matching was mostly flawless; most errors occurred where only partial bibliographic details were available. The linked dataset enabled rapid profiling of topical foci (eg, lung and breast cancer) and identification of the most frequently referenced institutions and clinical trials in shared posters.</p><p><strong>Conclusions: </strong>This study presents a novel artificial intelligence-driven methodology for enhancing research discovery and attention analysis from nontraditional clinical scholarly outputs. The American Society of Clinical Oncology was used as an example, but this methodology could be used for any conference and clinical poster.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e78148"},"PeriodicalIF":2.0,"publicationDate":"2026-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146127812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Exploring Clinician Perspectives on Artificial Intelligence in Primary Care: Qualitative Systematic Review and Meta-Synthesis.
IF 2 Pub Date: 2026-02-05 DOI: 10.2196/72210
Robin Bogdanffy, Alisa Mundzic, Peter Nymberg, David Sundemo, Anna Moberg, Carl Wikberg, Ronny Kent Gunnarsson, Jonathan Widén, Pär-Daniel Sundvall, Artin Entezarjou
Background: Recent advances have highlighted the potential of artificial intelligence (AI) systems to assist clinicians with administrative and clinical tasks, but concerns regarding biases, lack of regulation, and potential technical issues pose significant challenges. The lack of a clear definition of AI, combined with limited focus on qualitative research exploring clinicians' perspectives, has limited the understanding of perspectives on AI in primary health care settings.

Objective: This review aims to synthesize current qualitative research on the perspectives of clinicians on AI in primary care settings.

Methods: A systematic search was conducted in MEDLINE (PubMed), Scopus, Web of Science, and CINAHL (EBSCOhost) databases for publications from inception to February 5, 2024. The search strategy was designed using the Sample, Phenomenon of Interest, Design, Evaluation, and Research type (SPIDER) framework. Studies were eligible if they were published in English, peer-reviewed, and provided qualitative analyses of clinician perspectives on AI in primary health care. Studies were excluded if they were gray literature, used questionnaires, surveys, or similar methods for data collection, or if the perspectives of clinicians were not distinguishable from those of nonclinicians. A qualitative systematic review and thematic synthesis were performed. The Grading of Recommendations Assessment, Development and Evaluation-Confidence in Evidence from Reviews of Qualitative Research (GRADE-CERQual) approach was used to assess confidence in the findings. The CASP (Critical Appraisal Skills Program) checklist for qualitative research was used for risk-of-bias and quality appraisal.

Results: A total of 1492 records were identified, of which 13 studies from 6 countries were included, representing qualitative data from 238 primary care physicians, nurses, physiotherapists, and other health care professionals providing direct patient care. Eight descriptive themes were identified and synthesized into 3 analytical themes using thematic synthesis: (1) the human-machine relationship, describing clinicians' thoughts on AI assistance in administration and clinical work, interactions between clinicians, patients, and AI, and resistance and skepticism toward AI; (2) the technologically enhanced clinic, highlighting the effects of AI on the workplace, fear of errors, and desired features; and (3) the societal impact of AI, reflecting concerns about data privacy, medicolegal liability, and bias. GRADE-CERQual assessment rated confidence as high in 15 findings, moderate in 5 findings, and low in 1 finding.

Conclusions: Clinicians view AI as a technology that can both enhance and complicate primary health care. While AI can provide substantial support, its integration into health care requires careful consideration of ethical implications, technical reliability, and the maintenance of human oversight. Interpretation is limited by the heterogeneity of qualitative methods and the diversity of AI technologies examined across the included studies. Deeper qualitative research into the effects of AI on clinicians' professional roles and autonomy may inform the future development of AI systems.
{"title":"Exploring Clinician Perspectives on Artificial Intelligence in Primary Care: Qualitative Systematic Review and Meta-Synthesis.","authors":"Robin Bogdanffy, Alisa Mundzic, Peter Nymberg, David Sundemo, Anna Moberg, Carl Wikberg, Ronny Kent Gunnarsson, Jonathan Widén, Pär-Daniel Sundvall, Artin Entezarjou","doi":"10.2196/72210","DOIUrl":"10.2196/72210","url":null,"abstract":"&lt;p&gt;&lt;strong&gt;Background: &lt;/strong&gt;Recent advances have highlighted the potential of artificial intelligence (AI) systems to assist clinicians with administrative and clinical tasks, but concerns regarding biases, lack of regulation, and potential technical issues pose significant challenges. The lack of a clear definition of AI, combined with limited focus on qualitative research exploring clinicians' perspectives, has limited the understanding of perspectives on AI in primary health care settings.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Objective: &lt;/strong&gt;This review aims to synthesize current qualitative research on the perspectives of clinicians on AI in primary care settings.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Methods: &lt;/strong&gt;A systematic search was conducted in MEDLINE (PubMed), Scopus, Web of Science, and CINAHL (EBSCOhost) databases for publications from inception to February 5, 2024. The search strategy was designed using the Sample, Phenomenon of Interest, Design, Evaluation, and Research type (SPIDER) framework. Studies were eligible if they were published in English, peer-reviewed, and provided qualitative analyses of clinician perspectives on AI in primary health care. Studies were excluded if they were gray literature, used questionnaires, surveys, or similar methods for data collection, or if the perspectives of clinicians were not distinguishable from those of nonclinicians. A qualitative systematic review and thematic synthesis were performed. The Grading of Recommendations Assessment, Development and Evaluation-Confidence in Evidence from Reviews of Qualitative Research (GRADE-CERQual) approach was used to assess confidence in the findings. The CASP (Critical Appraisal Skills Program) checklist for qualitative research was used for risk-of-bias and quality appraisal.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Results: &lt;/strong&gt;A total of 1492 records were identified, of which 13 studies from 6 countries were included, representing qualitative data from 238 primary care physicians, nurses, physiotherapists, and other health care professionals providing direct patient care. Eight descriptive themes were identified and synthesized into 3 analytical themes using thematic synthesis: (1) the human-machine relationship, describing clinicians' thoughts on AI assistance in administration and clinical work, interactions between clinicians, patients, and AI, and resistance and skepticism toward AI; (2) the technologically enhanced clinic, highlighting the effects of AI on the workplace, fear of errors, and desired features; and (3) the societal impact of AI, reflecting concerns about data privacy, medicolegal liability, and bias. GRADE-CERQual assessment rated confidence as high in 15 findings, moderate in 5 findings, and low in 1 finding.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Conclusions: &lt;/strong&gt;Clinicians view AI as a technology that can both enhance and complicate primary health care. 
While AI can provide substantial support, its integration into health care requires careful consideration of ethical implications, technical reliabili","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e72210"},"PeriodicalIF":2.0,"publicationDate":"2026-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12875425/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146127798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Human-Generative AI Interactions and Their Effects on Beliefs About Health Issues: Content Analysis and Experiment.
IF 2 Pub Date: 2026-02-04 DOI: 10.2196/80270
Linqi Lu, Yanshu Sybil Wang, Jiawei Liu, Douglas M McLeod
{"title":"Human-Generative AI Interactions and Their Effects on Beliefs About Health Issues: Content Analysis and Experiment.","authors":"Linqi Lu, Yanshu Sybil Wang, Jiawei Liu, Douglas M McLeod","doi":"10.2196/80270","DOIUrl":"https://doi.org/10.2196/80270","url":null,"abstract":"","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e80270"},"PeriodicalIF":2.0,"publicationDate":"2026-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146120647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Explainable AI Approaches in Federated Learning: Systematic Review.
IF 2 Pub Date: 2026-02-03 DOI: 10.2196/69985
Titus Tunduny, Bernard Shibwabo

Background: Artificial intelligence (AI) has recently experienced a rebirth with the growth of generative AI systems such as ChatGPT and Bard. These systems are trained with billions of parameters and have made AI widely accessible and understandable to different user groups. Widespread adoption of AI has created a need to understand how machine learning (ML) models operate in order to build trust in them. Understanding how these models generate their results remains a major challenge that explainable AI seeks to solve. Federated learning (FL) grew out of the need for privacy-preserving AI: models are trained in a decentralized manner and share only model parameters with a global model, as illustrated in the sketch below.
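
To make the decentralized training idea concrete, here is a minimal, generic federated-averaging (FedAvg) step; it is an illustration of the general technique, not code from any study in the review.

```python
# Minimal federated-averaging (FedAvg) step: each client trains locally and
# shares only model weights; the server aggregates them, weighted by client
# data size, so raw data never leaves the client. Purely illustrative.
import numpy as np

def fedavg(client_weights: list, client_sizes: list) -> np.ndarray:
    """Size-weighted average of client parameter vectors."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Three clients with locally trained parameter vectors of equal shape.
clients = [np.array([0.2, 1.0]), np.array([0.4, 0.8]), np.array([0.3, 1.2])]
sizes = [100, 300, 600]
global_weights = fedavg(clients, sizes)  # -> approximately [0.32, 1.06]
```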

Objective: This study sought to examine the extent of development of the explainable AI field within the FL environment in relation to the main contributions made, the types of FL, the sectors it is applied to, the models used, the methods applied by each study, and the databases from which sources are obtained.

Methods: A systematic search in 8 electronic databases, namely, Web of Science Core Collection, Scopus, PubMed, ACM Digital Library, IEEE Xplore, Mendeley, BASE, and Google Scholar, was undertaken.

Results: A review of 26 studies revealed that research on explainable FL is steadily growing despite being concentrated in Europe and Asia. The key determinants of FL use were data privacy and limited training data. Horizontal FL remains the preferred approach for federated ML, whereas post hoc explainability techniques were preferred.

Conclusions: There is potential for development of novel approaches and improvement of existing approaches in the explainable FL field, especially for critical areas.

Trial registration: OSF Registries 10.17605/OSF.IO/Y85WA; https://osf.io/y85wa.

{"title":"Explainable AI Approaches in Federated Learning: Systematic Review.","authors":"Titus Tunduny, Bernard Shibwabo","doi":"10.2196/69985","DOIUrl":"https://doi.org/10.2196/69985","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) has, in the recent past, experienced a rebirth with the growth of generative AI systems such as ChatGPT and Bard. These systems are trained with billions of parameters and have enabled widespread accessibility and understanding of AI among different user groups. Widespread adoption of AI has led to the need for understanding how machine learning (ML) models operate to build trust in them. An understanding of how these models generate their results remains a huge challenge that explainable AI seeks to solve. Federated learning (FL) grew out of the need to have privacy-preserving AI by having ML models that are decentralized but still share model parameters with a global model.</p><p><strong>Objective: </strong>This study sought to examine the extent of development of the explainable AI field within the FL environment in relation to the main contributions made, the types of FL, the sectors it is applied to, the models used, the methods applied by each study, and the databases from which sources are obtained.</p><p><strong>Methods: </strong>A systematic search in 8 electronic databases, namely, Web of Science Core Collection, Scopus, PubMed, ACM Digital Library, IEEE Xplore, Mendeley, BASE, and Google Scholar, was undertaken.</p><p><strong>Results: </strong>A review of 26 studies revealed that research on explainable FL is steadily growing despite being concentrated in Europe and Asia. The key determinants of FL use were data privacy and limited training data. Horizontal FL remains the preferred approach for federated ML, whereas post hoc explainability techniques were preferred.</p><p><strong>Conclusions: </strong>There is potential for development of novel approaches and improvement of existing approaches in the explainable FL field, especially for critical areas.</p><p><strong>Trial registration: </strong>OSF Registries 10.17605/OSF.IO/Y85WA; https://osf.io/y85wa.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e69985"},"PeriodicalIF":2.0,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146115068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Message Humanness as a Predictor of AI's Perception as Human: Secondary Data Analysis of the HeartBot Study.
IF 2 Pub Date: 2026-02-03 DOI: 10.2196/67717
Haruno Suzuki, Jingwen Zhang, Diane Dagyong Kim, Kenji Sagae, Holli A DeVon, Yoshimi Fukuoka

Background: Artificial intelligence (AI) chatbots have become prominent tools in health care to enhance health knowledge and promote healthy behaviors across diverse populations. However, factors influencing the perception of AI chatbots and human-AI interaction are largely unknown.

Objective: This study aimed to identify interaction characteristics associated with the perception of an AI chatbot identity as a human versus an artificial agent, adjusting for sociodemographic status and previous chatbot use in a diverse sample of women.

Methods: This study was a secondary analysis of data from the HeartBot trial in women aged 25 years or older who were recruited through social media from October 2023 to January 2024. The original goal of the HeartBot trial was to evaluate the change in awareness and knowledge of heart attack after interacting with a fully automated AI HeartBot chatbot. All participants interacted with HeartBot once. At the beginning of the conversation, the chatbot introduced itself as HeartBot. However, it did not explicitly indicate that participants would be interacting with an AI system. The perceived chatbot identity (human vs artificial agent), conversation length with HeartBot, message humanness, message effectiveness, and attitude toward AI were measured at the postchatbot survey. Multivariable logistic regression was conducted to explore factors predicting women's perception of a chatbot's identity as a human, adjusting for age, race or ethnicity, education, previous AI chatbot use, message humanness, message effectiveness, and attitude toward AI.
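
As a rough illustration of the modeling step, the sketch below fits a comparable multivariable logistic regression with statsmodels; the file name and column names are hypothetical stand-ins for the study's variables.

```python
# Sketch of a multivariable logistic regression of perceived chatbot identity
# (1 = perceived as human, 0 = artificial agent) on message humanness plus
# covariates. The file and column names are hypothetical, not the study's data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("heartbot_postsurvey.csv")  # hypothetical dataset
fit = smf.logit(
    "perceived_human ~ message_humanness + message_effectiveness + ai_attitude"
    " + age + C(race_ethnicity) + C(education) + prior_chatbot_use",
    data=df,
).fit()

print(np.exp(fit.params))      # adjusted odds ratios
print(np.exp(fit.conf_int()))  # 95% CIs on the odds ratio scale
```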

Results: Among 92 women (mean age 45.9, SD 11.9; range 26-70 y), the chatbot identity was correctly identified by two-thirds (n=61, 66%) of the sample, while one-third (n=31, 34%) misidentified the chatbot as a human. Over half (n=53, 58%) had previous AI chatbot experience. On average, participants interacted with the HeartBot for 13.0 (SD 7.8) minutes and entered 82.5 (SD 61.9) words. In multivariable analysis, only message humanness was significantly associated with the perception of chatbot identity as a human compared with an artificial agent (adjusted odds ratio 2.37, 95% CI 1.26-4.48; P=.007).

Conclusions: To the best of our knowledge, this is the first study to explicitly ask participants whether they perceive an interaction as human or from a chatbot (HeartBot) in the health care field. This study's findings (role and importance of message humanness) provide new insights into designing chatbots. However, the current evidence remains preliminary. Future research is warranted to understand the relationship between chatbot identity, message humanness, and health outcomes in a larger-scale study.

{"title":"Message Humanness as a Predictor of AI's Perception as Human: Secondary Data Analysis of the HeartBot Study.","authors":"Haruno Suzuki, Jingwen Zhang, Diane Dagyong Kim, Kenji Sagae, Holli A DeVon, Yoshimi Fukuoka","doi":"10.2196/67717","DOIUrl":"https://doi.org/10.2196/67717","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) chatbots have become prominent tools in health care to enhance health knowledge and promote healthy behaviors across diverse populations. However, factors influencing the perception of AI chatbots and human-AI interaction are largely unknown.</p><p><strong>Objective: </strong>This study aimed to identify interaction characteristics associated with the perception of an AI chatbot identity as a human versus an artificial agent, adjusting for sociodemographic status and previous chatbot use in a diverse sample of women.</p><p><strong>Methods: </strong>This study was a secondary analysis of data from the HeartBot trial in women aged 25 years or older who were recruited through social media from October 2023 to January 2024. The original goal of the HeartBot trial was to evaluate the change in awareness and knowledge of heart attack after interacting with a fully automated AI HeartBot chatbot. All participants interacted with HeartBot once. At the beginning of the conversation, the chatbot introduced itself as HeartBot. However, it did not explicitly indicate that participants would be interacting with an AI system. The perceived chatbot identity (human vs artificial agent), conversation length with HeartBot, message humanness, message effectiveness, and attitude toward AI were measured at the postchatbot survey. Multivariable logistic regression was conducted to explore factors predicting women's perception of a chatbot's identity as a human, adjusting for age, race or ethnicity, education, previous AI chatbot use, message humanness, message effectiveness, and attitude toward AI.</p><p><strong>Results: </strong>Among 92 women (mean age 45.9, SD 11.9; range 26-70 y), the chatbot identity was correctly identified by two-thirds (n=61, 66%) of the sample, while one-third (n=31, 34%) misidentified the chatbot as a human. Over half (n=53, 58%) had previous AI chatbot experience. On average, participants interacted with the HeartBot for 13.0 (SD 7.8) minutes and entered 82.5 (SD 61.9) words. In multivariable analysis, only message humanness was significantly associated with the perception of chatbot identity as a human compared with an artificial agent (adjusted odds ratio 2.37, 95% CI 1.26-4.48; P=.007).</p><p><strong>Conclusions: </strong>To the best of our knowledge, this is the first study to explicitly ask participants whether they perceive an interaction as human or from a chatbot (HeartBot) in the health care field. This study's findings (role and importance of message humanness) provide new insights into designing chatbots. However, the current evidence remains preliminary. 
Future research is warranted to understand the relationship between chatbot identity, message humanness, and health outcomes in a larger-scale study.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"5 ","pages":"e67717"},"PeriodicalIF":2.0,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146115073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Performance of Five AI Models on USMLE Step 1 Questions: A Comparative Observational Study.
IF 2 Pub Date: 2026-01-30 DOI: 10.2196/76928
Dania El Natour, Mohamad Abou Alfa, Ahmad Chaaban, Reda Assi, Toufic Dally, Bahaa Bou Dargham

Background: Artificial intelligence (AI) models are increasingly being used in medical education. Although models like ChatGPT have previously demonstrated strong performance on USMLE-style questions, newer AI tools with enhanced capabilities are now available, necessitating comparative evaluations of their accuracy and reliability across different medical domains and question formats.

Objective: To evaluate and compare the performance of five publicly available AI models (Grok, ChatGPT-4, Copilot, Gemini, and DeepSeek) on the USMLE Step 1 Free 120-question set, assessing their accuracy and consistency across question types and medical subjects.

Methods: This cross-sectional observational study was conducted between February 10 and March 5, 2025. Each of the 119 USMLE-style questions (excluding one audio-based item) was presented to each AI model using a standardized prompt cycle. Models answered each question three times to assess confidence and consistency. Questions were categorized as text-based or image-based, and as case-based or information-based. Statistical analysis was performed using chi-square and Fisher exact tests, with Bonferroni adjustment for pairwise comparisons.
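
A minimal sketch of the described analysis: pairwise accuracy comparisons with Fisher exact tests and a Bonferroni-adjusted significance threshold. The counts below are illustrative, derived only from the reported percentages.

```python
# Sketch of pairwise model-accuracy comparisons with Fisher exact tests and
# a Bonferroni correction across the pairs. Counts are illustrative
# reconstructions from reported percentages, not the study's raw data.
from itertools import combinations
from scipy.stats import fisher_exact

# (correct, incorrect) out of 119 questions per model.
results = {"Grok": (109, 10), "Copilot": (101, 18), "DeepSeek": (86, 33)}
pairs = list(combinations(results, 2))
alpha = 0.05 / len(pairs)  # Bonferroni-adjusted threshold

for a, b in pairs:
    table = [list(results[a]), list(results[b])]
    _, p = fisher_exact(table)
    print(f"{a} vs {b}: p={p:.4f}, significant={p < alpha}")
```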

Results: Grok achieved the highest score (91.6%), followed by Copilot (84.9%), Gemini (84.0%), ChatGPT-4 (79.8%), and DeepSeek (72.3%). DeepSeek's lower score was due to an inability to process visual media, resulting in 0% accuracy on image-based items. When limited to text-only questions (n=96), DeepSeek's accuracy increased to 89.6%, matching Copilot. Grok showed the highest accuracy on image-based (91.3%) and case-based questions (89.7%), with statistically significant differences observed between Grok and DeepSeek on case-based items (P=.011). The models performed best in Biostatistics & Epidemiology (96.7%) and worst in Musculoskeletal, Skin, & Connective Tissue (62.9%). Grok maintained 100% consistency in responses, while Copilot demonstrated the most self-correction (94.1% consistency), improving its accuracy to 89.9% on the third attempt.

Conclusions: AI models showed varying strengths across domains, with Grok demonstrating the highest accuracy and consistency in this dataset, particularly for image-based and reasoning-heavy questions. Although ChatGPT-4 remains widely used, newer models like Grok and Copilot also performed competitively. Continuous evaluation is essential as AI tools rapidly evolve.


{"title":"Performance of Five AI Models on USMLE Step 1 Questions: A Comparative Observational Study.","authors":"Dania El Natour, Mohamad Abou Alfa, Ahmad Chaaban, Reda Assi, Toufic Dally, Bahaa Bou Dargham","doi":"10.2196/76928","DOIUrl":"https://doi.org/10.2196/76928","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) models are increasingly being used in medical education. Although models like ChatGPT have previously demonstrated strong performance on USMLE-style questions, newer AI tools with enhanced capabilities are now available, necessitating comparative evaluations of their accuracy and reliability across different medical domains and question formats.</p><p><strong>Objective: </strong>To evaluate and compare the performance of five publicly available AI models: Grok, ChatGPT-4, Copilot, Gemini, and DeepSeek, on the USMLE Step 1 Free 120-question set, checking their accuracy and consistency across question types and medical subjects.</p><p><strong>Methods: </strong>This cross-sectional observational study was conducted between February 10 and March 5, 2025. Each of the 119 USMLE-style questions (excluding one audio-based item) was presented to each AI model using a standardized prompt cycle. Models answered each question three times to assess confidence and consistency. Questions were categorized as text-based or image-based, and as case-based or information-based. Statistical analysis was done using Chi-square and Fisher's exact tests, with Bonferroni adjustment for pairwise comparisons.</p><p><strong>Results: </strong>Grok got the highest score (91.6%), followed by Copilot (84.9%), Gemini (84.0%), ChatGPT-4 (79.8%), and DeepSeek (72.3%). DeepSeek's lower grade was due to an inability to process visual media, resulting in 0% accuracy on image-based items. When limited to text-only questions (n = 96), DeepSeek's accuracy increased to 89.6%, matching Copilot. Grok showed the highest accuracy on image-based (91.3%) and case-based questions (89.7%), with statistically significant differences observed between Grok and DeepSeek on case-based items (p = .011). The models performed best in Biostatistics & Epidemiology (96.7%) and worst in Musculoskeletal, Skin, & Connective Tissue (62.9%). Grok maintained 100% consistency in responses, while Copilot demonstrated the most self-correction (94.1% consistency), improving its accuracy to 89.9% on the third attempt.</p><p><strong>Conclusions: </strong>AI models showed varying strengths across domains, with Grok demonstrating the highest accuracy and consistency in this dataset, particularly for image-based and reasoning-heavy questions. Although ChatGPT-4 remains widely used, newer models like Grok and Copilot also performed competitively. Continuous evaluation is essential as AI tools rapidly evolve.</p><p><strong>Clinicaltrial: </strong></p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":" ","pages":""},"PeriodicalIF":2.0,"publicationDate":"2026-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146151374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Augmenting LLM with Prompt Engineering and Supervised Fine-Tuning in NSCLC TNM Staging: Framework Development and Validation.
IF 2 Pub Date: 2026-01-29 DOI: 10.2196/77988
Ruonan Jin, Chao Ling, Yixuan Hou, Yuhan Sun, Ning Li, Jiefei Han, Jin Sheng, Qizhao Wang, Yuepeng Liu, Shen Zheng, Xingyu Ren, Chiyu Chen, Jue Wang, Cheng Li
Background: Accurate TNM staging is fundamental for treatment planning and prognosis in non-small cell lung cancer (NSCLC). However, its complexity poses significant challenges, particularly in standardizing interpretations across diverse clinical settings. Traditional rule-based natural language processing methods are constrained by their reliance on manually crafted rules and are susceptible to inconsistencies in clinical reporting.

Objective: This study aimed to develop and validate a robust, accurate, and operationally efficient artificial intelligence framework for the TNM staging of NSCLC by strategically enhancing a large language model, GLM-4-Air, through advanced prompt engineering and supervised fine-tuning (SFT).

Methods: We constructed a curated dataset of 492 de-identified real-world medical imaging reports, with TNM staging annotations rigorously validated by senior physicians according to the AJCC (American Joint Committee on Cancer) 8th edition guidelines. The GLM-4-Air model was systematically optimized via a multi-phase process: iterative prompt engineering incorporating chain-of-thought reasoning and domain knowledge injection for all staging tasks, followed by parameter-efficient SFT using Low-Rank Adaptation (LoRA) for the reasoning-intensive T and N staging tasks. The final hybrid model was evaluated on a completely held-out internal test set (black box) and benchmarked against GPT-4o using standard metrics, statistical tests, and a clinical impact analysis of staging errors.

Results: The optimized hybrid GLM-4-Air model demonstrated reliable performance. It achieved higher staging accuracies on the held-out black-box test set: 92% (95% confidence interval [CI] 0.850-0.959) for T, 86% (95% CI 0.779-0.915) for N, 92% (95% CI 0.850-0.959) for M, and 90% for overall clinical staging; by comparison, GPT-4o attained 87% (95% CI 0.790-0.922), 70% (95% CI 0.604-0.781), 78% (95% CI 0.689-0.850), and 80%, respectively. The model's robustness was further evidenced by its macro-average F1-scores of 0.914 (T), 0.815 (N), and 0.831 (M), consistently surpassing those of GPT-4o (0.836, 0.620, and 0.698). Analysis of confusion matrices confirmed the model's proficiency in identifying critical staging features while effectively minimizing false negatives. Crucially, the clinical impact assessment showed a substantial reduction in severe Category I errors, defined as misclassifications that could significantly influence subsequent clinical decisions. Our model committed zero Category I errors in M staging across both test sets and fewer Category I errors in T and N staging. Furthermore, the framework demonstrated practical deployability, achieving efficient inference on consumer-grade hardware (eg, 4 RTX 4090 GPUs) with latencies acceptable for clinical workflows.

Conclusions: The proposed hybrid framework, integrating structured prompt engineering and applying SFT to the reasoning-heavy tasks (T and N staging), establishes the GLM-4-Air model as a highly accurate, clinically reliable, and cost-effective solution for automated NSCLC TNM staging. This work demonstrates the effectiveness and potential of domain-optimized smaller models relative to off-the-shelf generalist models, promising enhanced diagnostic standardization in resource-aware health care settings.
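
The parameter-efficient SFT step can be pictured with a generic Hugging Face peft LoRA setup; the checkpoint name, target modules, and hyperparameters below are illustrative assumptions, not the study's actual GLM-4-Air configuration.

```python
# Sketch of parameter-efficient supervised fine-tuning with LoRA adapters.
# The checkpoint, target modules, and hyperparameters are illustrative
# assumptions, not the study's actual GLM-4-Air configuration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b",        # publicly available stand-in checkpoint
    trust_remote_code=True,  # GLM checkpoints ship custom modeling code
)
lora_cfg = LoraConfig(
    r=16,                                # low-rank adapter dimension
    lora_alpha=32,                       # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # attention projections; model-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the small adapter matrices train
```

Freezing the base model and training only low-rank adapters is what makes fine-tuning for the T and N tasks feasible on the consumer-grade hardware mentioned in the results.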
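The headline numbers in the Results (per-task accuracy with a 95% CI, macro-averaged F1, and confusion matrices) correspond to standard multi-class classification metrics. A minimal sketch, assuming scikit-learn and statsmodels and using made-up T-stage labels rather than the study data:

import numpy as np
from sklearn.metrics import f1_score, confusion_matrix
from statsmodels.stats.proportion import proportion_confint

# Hypothetical gold labels and model predictions for the T-staging task.
y_true = np.array(["T1", "T2", "T2", "T3", "T4", "T1", "T2", "T3", "T4", "T1"])
y_pred = np.array(["T1", "T2", "T3", "T3", "T4", "T1", "T2", "T3", "T4", "T2"])

n_correct = int((y_true == y_pred).sum())
accuracy = n_correct / len(y_true)
# Wilson-score 95% confidence interval for a proportion (the abstract does
# not name its CI method; Wilson is one common choice).
ci_low, ci_high = proportion_confint(n_correct, len(y_true), alpha=0.05, method="wilson")
print(f"Accuracy: {accuracy:.3f} (95% CI {ci_low:.3f}-{ci_high:.3f})")

# Macro-average F1 weights each stage category equally, so rare stages count.
print("Macro-F1:", round(f1_score(y_true, y_pred, average="macro"), 3))

# Rows = true stage, columns = predicted stage; off-diagonal cells are the
# misclassifications a clinical impact analysis would grade for severity.
print(confusion_matrix(y_true, y_pred, labels=["T1", "T2", "T3", "T4"]))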
{"title":"Augmenting LLM with Prompt Engineering and Supervised Fine-Tuning in NSCLC TNM Staging: Framework Development and Validation.","authors":"Ruonan Jin, Chao Ling, Yixuan Hou, Yuhan Sun, Ning Li, Jiefei Han, Jin Sheng, Qizhao Wang, Yuepeng Liu, Shen Zheng, Xingyu Ren, Chiyu Chen, Jue Wang, Cheng Li","doi":"10.2196/77988","DOIUrl":"https://doi.org/10.2196/77988","url":null,"abstract":"&lt;p&gt;&lt;strong&gt;Background: &lt;/strong&gt;Accurate TNM staging is fundamental for treatment planning and prognosis in non-small cell lung cancer (NSCLC). However, its complexity poses significant challenges, particularly in standardizing interpretations across diverse clinical settings. Traditional rule-based natural language processing methods are constrained by their reliance on manually crafted rules and are susceptible to inconsistencies in clinical reporting.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Objective: &lt;/strong&gt;This study aimed to develop and validate a robust, accurate, and operationally efficient artificial intelligence framework for the TNM staging of NSCLC by strategically enhancing a large language model, GLM-4-Air, through advanced prompt engineering and supervised fine-tuning (SFT).&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Methods: &lt;/strong&gt;We constructed a curated dataset of 492 de-identified real-world medical imaging reports, with TNM staging annotations rigorously validated by senior physicians according to the AJCC (American Joint Committee on Cancer) 8th edition guidelines. The GLM-4-Air model was systematically optimized via a multi-phase process: iterative prompt engineering incorporating chain-of-thought reasoning and domain knowledge injection for all staging tasks, followed by parameter-efficient SFT using Low-Rank Adaptation (LoRA) for the reasoning-intensive T and N staging tasks,. The final hybrid model was evaluated on a completely held-out internal test set (black-box) and benchmarked against GPT-4o using standard metrics, statistical tests, and a clinical impact analysis of staging errors.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Results: &lt;/strong&gt;The optimized hybrid GLM-4-Air model demonstrated reliable performance. It achieved higher staging accuracies on the held-out black-box test set: 92% (95% Confidence Interval (CI): 0.850-0.959) for T, 86% (95% CI: 0.779-0.915) for N, 92% (95% CI: 0.850-0.959) for M, and 90% for overall clinical staging; by comparison, GPT-4o attained 87% (95% CI: 0.790-0.922), 70% (95% CI: 0.604-0.781), 78% (95% CI: 0.689-0.850), and 80%, respectively. The model's robustness was further evidenced by its macro-average F1-scores of 0.914 (T), 0.815 (N), and 0.831 (M), consistently surpassing those of GPT-4o (0.836, 0.620, and 0.698). Analysis of confusion matrices confirmed the model's proficiency in identifying critical staging features while effectively minimizing false negatives. Crucially, the clinical impact assessment showed a substantial reduction in severe Category I errors, which are defined as misclassifications that could significantly influence subsequent clinical decisions. Our model committed zero Category I errors in M staging across both test sets, and fewer Category I errors in T and N staging. 
Furthermore, the framework demonstrated practical deployability, achieving efficient inference on consumer-grade hardware (e.g., 4 RTX 4090 GPUs) with latencies suitable and acceptable for clinical workflows.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Conclusions: &lt;/strong&gt;The proposed hybrid fra","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":" ","pages":""},"PeriodicalIF":2.0,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146120701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0