
Latest Publications in JMIR AI

Authors' Reply: Predicting the Emergency Department Patient Journey Using a Machine Learning Approach.
IF 2 | Pub Date: 2025-12-19 | DOI: 10.2196/73342
Dhavalkumar Patel, Eyal Klang, Prem Timsina
Citations: 0
Physical Examination Identification in Medical Education Videos: Zero-Shot Multimodal AI With Temporal Sequence Optimization Study.
IF 2 | Pub Date: 2025-12-18 | DOI: 10.2196/76586
Shinyoung Kang, Michael Holcomb, David Hein, Ameer Hamza Shakur, Thomas Dalton, Andrew Jamieson
Background: Objective structured clinical examinations (OSCEs) are widely used for assessing medical student competency, but their evaluation is resource-intensive, requiring trained evaluators to review 15-minute videos. The physical examination (PE) component typically constitutes only a small portion of these recordings; yet, current automated approaches struggle with processing long medical videos due to computational constraints and difficulties maintaining temporal context.

Objective: This study aims to determine whether multimodal large language models (MM-LLMs) can effectively segment PE periods within OSCE videos without previous training, potentially reducing the evaluation burden on both human graders and automated assessment systems.

Methods: We analyzed 500 videos from 5 OSCE stations at the University of Texas Southwestern Simulation Center, each 15 minutes long, using hand-labeled PE periods as ground truth. Frames were sampled at 1-, 2-, or 3-second intervals. A pose detection preprocessing step filtered out frames without people. Six MM-LLMs performed frame-level classification into encounter states using a standardized prompt. To enforce temporal consistency, we used a hidden Markov model with Viterbi decoding, merging states into 3 primary activities (consulting/notes, physical examination, and no doctor) and adding a brief edge buffer to avoid truncating true PE segments. Performance was computed per video and averaged across the dataset using recall, precision, intersection over union (IOU), and predicted PE length with 95% CIs.

Results: At 1-second sampling, GPT-4o achieved a recall of 0.998 (95% CI 0.994-1.000), an IOU of 0.784 (95% CI 0.765-0.803), and a precision of 0.792 (95% CI 0.774-0.811), identifying a mean of 175 (SD 83) seconds of content per video as PE versus a mean labeled PE of 126 (SD 61) seconds, yielding an 81% reduction in video needing review (from 900 to 175 seconds). Across stations, recall remained high, with expected IOU variability linked to examination format and camera geometry. Increasing the sampling interval modestly decreased recall while slightly improving IOU and precision. Comparative baselines (eg, Gemini 2.0 Flash, Gemma 3, and Qwen2.5-VL variants) demonstrated trade-offs between recall and overselection; GPT-4o offered the best balance among high-recall models. Error analysis highlighted false negatives during occluded or verbally guided maneuvers and false positives during preparatory actions, suggesting opportunities for camera placement optimization and multimodal fusion (eg, audio cues).

Conclusions: Integrating zero-shot MM-LLMs with minimal-supervision temporal modeling effectively segments PE periods in OSCE videos without requiring extensive training data. This approach significantly reduces review time while maintaining clinical assessment integrity, demonstrating that artificial intelligence approaches combining zero-shot capability with light supervision can be tailored to the specific requirements of medical education. This technique lays the groundwork for more efficient and scalable clinical skills assessment across diverse medical education settings.
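The per-video metrics the abstract reports can be sketched over interval overlaps. This is a minimal illustration, assuming PE periods are represented as lists of non-overlapping (start, end) intervals in seconds; the function names and the example intervals are illustrative, not the authors' code.

```python
# Hedged sketch: per-video recall, precision, and IOU between labeled
# and predicted physical-examination (PE) intervals. Assumes intervals
# within each list do not overlap one another.

def overlap_seconds(intervals_a, intervals_b):
    """Total seconds of intervals_a that fall inside intervals_b."""
    total = 0
    for a_start, a_end in intervals_a:
        for b_start, b_end in intervals_b:
            total += max(0, min(a_end, b_end) - max(a_start, b_start))
    return total

def total_seconds(intervals):
    return sum(end - start for start, end in intervals)

def pe_metrics(predicted, labeled):
    inter = overlap_seconds(predicted, labeled)
    union = total_seconds(predicted) + total_seconds(labeled) - inter
    return {
        "recall": inter / total_seconds(labeled),
        "precision": inter / total_seconds(predicted),
        "iou": inter / union,
    }

# Illustrative example: labeled PE spans 120-246 s (126 s, the mean
# labeled length in the abstract); the model predicts 100-275 s (175 s).
m = pe_metrics(predicted=[(100, 275)], labeled=[(120, 246)])
```

With these numbers the prediction fully covers the label (recall 1.0) while flagging extra footage, so precision and IOU both equal 126/175 = 0.72, mirroring the high-recall, moderate-IOU pattern reported for GPT-4o.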
Citations: 0
A Multiagent Summarization and Auto-Evaluation Framework for Medical Text: Development and Evaluation Study.
IF 2 | Pub Date: 2025-12-16 | DOI: 10.2196/75932
Yuhao Chen, Bo Wen, Farhana Zulkernine
Background: Although large language models (LLMs) show great promise in processing medical text, they are prone to generating incorrect information, commonly referred to as hallucinations. These inaccuracies present a significant risk for clinical applications where precision is critical. Additionally, relying on human experts to review LLM-generated content to ensure accuracy is costly and time-consuming, which poses a barrier to large-scale deployment of LLMs in health care settings.

Objective: The primary objective of this study was to develop an automatic artificial intelligence (AI) system capable of extracting structured information from unstructured medical data and using advanced reasoning techniques to support reliable clinical decision making. A key aspect of this objective is ensuring that the system incorporates self-verification mechanisms, enabling it to assess the accuracy and reliability of its own outputs. By integrating such mechanisms, we aim to enhance the system's robustness, reduce reliance on human intervention, and improve the overall trustworthiness of AI-driven medical summarization and evaluation.

Methods: The proposed framework comprises 2 layers: a summarization layer and an evaluation layer. The summarization layer uses Llama2-70B (Meta AI) and Mistral-7B (Mistral AI) models to generate concise summaries from unstructured medical data, focusing on tasks such as consumer health question summarization, biomedical answer summarization, and dialog summarization. The evaluation layer uses GPT-4-turbo (OpenAI) as a judge, leveraging pairwise comparison strategies and different prompt strategies to evaluate summaries across 4 dimensions: coherence, consistency, fluency, and relevance. To validate the framework, we compare the judgments generated by the LLM assistants in the evaluation layer with those provided by medical experts, offering valuable insights into the alignment and reliability of AI-driven evaluations within the medical domain. We also explore a way to handle disagreement among human experts and discuss our methodology in addressing diversity in human perspectives.

Results: The study found variability in expert consensus, with average agreement rates of 19.2% among all experts and 54% among groups of 3 experts. GPT-4 (OpenAI) demonstrated alignment with expert judgments, achieving an average agreement rate of 83.06% with at least 1 expert and comparable performance in cross-validation tests. The enhanced guidance in prompt design (prompt-enhanced guidance) improved GPT-4's alignment with expert evaluations compared with a baseline prompt, highlighting the importance of effective prompt engineering in auto-evaluation of summarization tasks. We also evaluated open-source LLMs, including Llama-3.3 (Meta AI) and Mixtral-Large (Mistral AI), and a domain-specific LLM, OpenBioLLM (Aaditya Ura), for comparison as LLM judges.

Conclusions: This study highlights the potential of LLMs as reliable tools for the summarization and evaluation of unstructured medical data, reducing reliance on human experts, while also noting limitations. The proposed framework, multiagent summarization and auto-evaluation, demonstrates scalability and adaptability for clinical applications while addressing key challenges such as hallucination and position bias.
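The "agreement with at least 1 expert" statistic reported in the Results can be sketched as a simple match rate over pairwise-comparison verdicts. The data shapes and example verdicts below are illustrative assumptions, not the study's actual records.

```python
# Hedged sketch: fraction of items where the LLM judge's pairwise
# verdict ('A' or 'B' wins) matches at least one human expert's verdict.

def agreement_with_any_expert(llm_votes, expert_votes):
    """llm_votes: one verdict per item.
    expert_votes: per item, the list of all experts' verdicts."""
    hits = sum(
        1 for llm, experts in zip(llm_votes, expert_votes) if llm in experts
    )
    return hits / len(llm_votes)

# Illustrative verdicts for 4 summary pairs judged by 3 experts each.
llm = ["A", "B", "A", "A"]
experts = [
    ["A", "B", "A"],
    ["A", "A", "A"],  # the one item where the LLM disagrees with everyone
    ["B", "B", "A"],
    ["A", "A", "B"],
]
rate = agreement_with_any_expert(llm, experts)  # 3 of 4 items match
```

Because any single matching expert counts, this metric is deliberately lenient, which is one way to accommodate the low inter-expert consensus (19.2%) the study reports.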
Citations: 0
Effectiveness of ChatGPT, Google Gemini, and Microsoft Copilot in Answering Thai Drug Information Queries: Cross-Sectional Study.
IF 2 | Pub Date: 2025-12-15 | DOI: 10.2196/79751
Suphannika Pornwattanakavee, Nattawut Leelakanok, Teerarat Todsarot, Gabrielle Angele Tatta Guinto, Ratchanon Takun, Assadawut Sumativit, Marisa Senngam

Background: ChatGPT-4o, Google Gemini, and Microsoft Copilot have shown potential in generating health care-related information. However, their accuracy, completeness, and safety for providing drug-related information in Thai contexts remain underexplored.

Objective: This study aims to evaluate the performance of artificial intelligence (AI) systems in responding to drug-related questions in Thai.

Methods: An analytical cross-sectional study was conducted using 76 public drug-related questions compiled from medical databases and social media between November 1, 2019, and December 31, 2024. All questions were categorized into 19 distinct categories, each comprising 4 questions. ChatGPT-4o, Google Gemini, and Microsoft Copilot were queried in a single session on March 1, 2025, by using input in Thai. All responses were evaluated for correctness, completeness, risk, and reproducibility independently by clinical pharmacists using standardized evaluation criteria.

Results: All 3 AI models provided generally complete responses (P=.08). ChatGPT-4o yielded the highest proportion of fully correct responses (P=.08). The overall risk levels of high-risk answers were not significantly different (P=.12). Response correctness was influenced by the category of the drug-related questions (P=.002), whereas completeness was not (P=.23). The correctness of Google Gemini and Microsoft Copilot was higher than that of ChatGPT for pharmacology queries. The type of questions also statistically significantly affected the risk level of the answers (P=.04). In particular, the pregnancy and lactation category had the highest high-risk response rate (1/76, 1% per system). All 3 AI models demonstrated consistent response patterns when the same questions were re-queried after 1, 7, and 14 days.

Conclusions: The evaluated AI chatbots were able to answer the queries with generally complete content; however, we found limited accuracy and occasional high-risk errors in responding to drug-related questions in Thai. All models exhibited good reproducibility.
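The reproducibility check described in the Results (re-querying the same questions after 1, 7, and 14 days and comparing response patterns) can be sketched as a per-question consistency rate. The grade labels and example data are illustrative assumptions, not the study's records.

```python
# Hedged sketch: a question counts as reproducible if its graded
# response is identical across all re-query runs.

def reproducibility_rate(runs):
    """runs: dict mapping day offset -> list of graded responses,
    one grade per question, in the same question order."""
    per_question = zip(*runs.values())
    consistent = [len(set(grades)) == 1 for grades in per_question]
    return sum(consistent) / len(consistent)

# Illustrative grades for 4 questions re-queried on days 1, 7, and 14.
runs = {
    1:  ["correct", "partial", "correct", "incorrect"],
    7:  ["correct", "partial", "correct", "correct"],
    14: ["correct", "partial", "correct", "incorrect"],
}
rate = reproducibility_rate(runs)  # 3 of 4 questions graded identically
```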

Citations: 0
Observer-Independent Assessment of Content Overlap in Mental Health Questionnaires: Large Language Model-Based Study.
IF 2 | Pub Date: 2025-12-11 | DOI: 10.2196/79868
Annkathrin Böke, Hannah Hacker, Millennia Chakraborty, Luise Baumeister-Lingens, Jasper Vöckel, Julian Koenig, David Hv Vogel, Theresa Katharina Lichtenstein, Kai Vogeley, Lana Kambeitz-Ilankovic, Joseph Kambeitz

Background: Mental disorders are frequently evaluated using questionnaires, which have been developed over the past decades for the assessment of different conditions. Despite the rigorous validation of these tools, high levels of content divergence have been reported for questionnaires measuring the same construct of psychopathology. Previous studies that examined the content overlap required manual symptom labeling, which is observer-dependent and time-consuming.

Objective: In this study, we used large language models (LLMs) to analyze content overlap of mental health questionnaires in an observer-independent way and compare our results with clinical expertise.

Methods: We analyzed questionnaires from a range of mental health conditions, including adult depression (n=7), childhood depression (n=15), clinical high risk for psychosis (CHR-P; n=11), mania (n=7), obsessive-compulsive disorder (n=7), and sleep disorder (n=12). Two different LLM-based approaches were tested. First, we used sentence Bidirectional Encoder Representations from Transformers (sBERT) to derive numerical representations (embeddings) for each questionnaire item, which were then clustered using k-means to group semantically similar symptoms. Second, questionnaire items were prompted to a Generative Pretrained Transformer to identify underlying symptom clusters. Clustering results were compared to a manual categorization by experts using the adjusted Rand index. Further, we assessed the content overlap within each diagnostic domain based on LLM-derived clusters.

Results: We observed varying degrees of similarity between expert-based and LLM-based clustering across diagnostic domains. Overall, agreement between experts was higher than between experts and LLMs. Among the 2 LLM approaches, GPT showed greater alignment with expert ratings than sBERT, ranging from weak to strong similarity depending on the diagnostic domain. Using GPT-based clustering of questionnaire items to assess the content overlap within each diagnostic domain revealed a weak (CHR-P: 0.344) to moderate (adult depression: 0.574; childhood depression: 0.433; mania: 0.419; obsessive-compulsive disorder [OCD]: 0.450; sleep disorder: 0.445) content overlap of questionnaires. Compared with studies that manually investigated content overlap among these scales, the results of this study showed some variation, though the differences were not substantial.

Conclusions: These findings demonstrate the feasibility of using LLMs to objectively assess content overlap in diagnostic questionnaires. Notably, the GPT-based approach showed particular promise in aligning with expert-derived symptom structures.
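The adjusted Rand index used to compare LLM-derived clusters against the expert categorization has a closed form over pair counts, sketched below in plain Python. The example label vectors are illustrative; the study's actual item assignments are not reproduced here.

```python
from math import comb
from collections import Counter

# Hedged sketch of the adjusted Rand index (ARI): chance-corrected
# agreement between two clusterings of the same items. Cluster labels
# themselves are arbitrary; only the grouping matters.

def adjusted_rand_index(labels_a, labels_b):
    n = len(labels_a)
    pair_counts = Counter(zip(labels_a, labels_b))  # contingency cells
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)
    sum_ij = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in a_counts.values())
    sum_b = sum(comb(c, 2) for c in b_counts.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# Identical groupings under different label names still score 1.0.
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

ARI is 1.0 for identical partitions and near 0 for random ones, which is why it suits comparing expert and LLM clusterings whose label sets do not correspond.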

Citations: 0
Transparent Reporting of AI in Systematic Literature Reviews: Development of the PRISMA-trAIce Checklist.
IF 2 Pub Date : 2025-12-10 DOI: 10.2196/80247
Dirk Holst, Keno Moenck, Julian Koch, Ole Schmedemann, Thorsten Schüppstuhl

Background: Systematic literature reviews (SLRs) build the foundation for evidence synthesis, but they are exceptionally demanding in terms of time and resources. While recent advances in artificial intelligence (AI), particularly large language models, offer the potential to accelerate this process, their use introduces challenges to transparency and reproducibility. Reporting guidelines such as the PRISMA-AI (Preferred Reporting Items for Systematic Reviews and Meta-Analyses-Artificial Intelligence Extension) primarily focus on AI as a subject of research, not as a tool in the review process itself.

Objective: To address the gap in reporting standards, this study aimed to develop and propose a discipline-agnostic checklist extension to the PRISMA 2020 statement. The goal was to ensure transparent reporting when AI is used as a methodological tool in evidence synthesis, fostering trust in the next generation of SLRs.

Methods: The proposed checklist, named PRISMA-trAIce (PRISMA-Transparent Reporting of Artificial Intelligence in Comprehensive Evidence Synthesis), was developed through a systematic process. We conducted a literature search to identify established, consensus-based AI reporting guidelines (eg, CONSORT-AI [Consolidated Standards of Reporting Trials-Artificial Intelligence] and TRIPOD-AI [Transparent Reporting of a Multivariable Prediction Model of Individual Prognosis or Diagnosis-Artificial Intelligence]). Relevant items from these frameworks were extracted, analyzed, and thematically synthesized to form a modular checklist that integrated with the PRISMA 2020 structure.

Results: The primary result of this work is the PRISMA-trAIce checklist, a comprehensive set of reporting items designed to document the use of AI in SLRs. The checklist covers the entire structure of an SLR, from title and abstract to methods and discussion, and includes specific items for identifying AI tools, describing human-AI interaction, reporting performance evaluation, and discussing limitations.

Conclusions: PRISMA-trAIce establishes an important framework to improve the transparency and methodological integrity of AI-assisted systematic reviews, enhancing the trust required for the responsible application of AI-assisted systematic reviews in evidence synthesis. We present this work as a foundational proposal, explicitly inviting the scientific community to join an open science process of consensus building. Through this collaborative refinement, we aim to evolve PRISMA-trAIce into a formally endorsed guideline, thereby ensuring the collective validation and scientific rigor of future AI-driven research.

JMIR AI. 2025;4:e80247. PMCID: PMC12694947 (open access).
Citations: 0
Predicting Spinal Cord Injury Prognosis Using Machine Learning: Systematic Review and Meta-Analysis.
IF 2 Pub Date : 2025-12-05 DOI: 10.2196/66233
Linxing Zhong, Qiying Huang, Hao Zhang, Liang Xue, Yehuang Chen, Jianwu Wu, Liangfeng Wei

Background: Spinal cord injury (SCI) is a complicated and varied condition that receives considerable attention. The prognosis of patients with SCI is increasingly being predicted using machine learning (ML) techniques.

Objective: This study aims to evaluate the efficacy and caliber of ML models in forecasting the consequences of SCI.

Methods: Literature searches were conducted in PubMed, Web of Science, Embase, PROSPERO, Scopus, Cochrane Library, China National Knowledge Infrastructure, China Biomedical Literature Service System, and Wanfang databases. Meta-analysis of the area under the receiver operating characteristic curve of ML models was performed to comprehensively evaluate their performance.

Results: A total of 1254 articles were retrieved, and 13 eligible studies were included. Predictive outcomes included spinal cord function prognosis, postoperative complications, independent living ability, and walking ability. For spinal cord function prognosis, the area under the curve (AUC) of the random forest algorithm was 0.832, the AUC of the logistic regression algorithm was 0.813 (95% CI 0.805-0.883), the AUC of the decision tree algorithm was 0.747 (95% CI 0.677-0.802), and the AUC of the XGBoost (extreme gradient boosting) algorithm was 0.867. For postoperative complications, the AUC of the random forest algorithm was 0.627 (95% CI 0.441-0.812), the AUC of the logistic regression algorithm was 0.747 (95% CI 0.597-0.896), and the AUC of the decision tree algorithm was 0.688. For independent living ability, the AUC of the classification and regression tree model was 0.813. For walking ability, the model based on the support vector machine algorithm was the most effective, with an AUC of 0.780.

Conclusions: The ML models predict SCI outcomes with relative accuracy, particularly in spinal cord function prognosis. They are expected to become important tools for clinicians in assessing the prognosis of patients with SCI, with the XGBoost algorithm showing the best performance. Prediction models should continue to improve as larger datasets become available and ML algorithms advance.
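Several of the AUCs above are reported with 95% CIs. As a generic illustration only (this is not the meta-analytic method of the study above, and the class counts below are invented), a confidence interval for a single reported AUC can be approximated with the Hanley-McNeil standard error:

```python
import math

def auc_ci_hanley_mcneil(auc, n_pos, n_neg, z=1.96):
    """Approximate 95% CI for an AUC using the Hanley-McNeil (1982) SE formula."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc * auc / (1 + auc)
    se = math.sqrt(
        (auc * (1 - auc) + (n_pos - 1) * (q1 - auc**2) + (n_neg - 1) * (q2 - auc**2))
        / (n_pos * n_neg)
    )
    return auc - z * se, auc + z * se

# Hypothetical example: AUC 0.813 with 100 positive and 200 negative cases.
lo, hi = auc_ci_hanley_mcneil(0.813, n_pos=100, n_neg=200)
```

The interval width shrinks as the class counts grow, which is why pooled estimates from larger studies dominate a meta-analysis.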

JMIR AI. 2025;4:e66233. PMCID: PMC12680090 (open access).
Citations: 0
Application of Deep Learning-Based Multimodal Data Fusion for the Diagnosis of Skin Neglected Tropical Diseases: Systematic Review.
IF 2 Pub Date : 2025-12-04 DOI: 10.2196/67584
G Yohannes Minyilu, Mohammed Abebe Yimer, Million Meshesha

Background: Neglected tropical diseases (NTDs) are among the most prevalent diseases worldwide and comprise 21 different conditions. One-half of these conditions have skin manifestations, known as skin NTDs. The diagnosis of skin NTDs incorporates visual examination of patients, and deep learning (DL)-based diagnostic tools can be used to assist the diagnostic procedures. The use of advanced DL-based methods, including multimodal data fusion (MMDF) functionality, could be a potential approach to enhance the diagnostic procedures for these diseases. However, little has been done toward the application of such tools, as confirmed by the very few currently available studies that implement MMDF for skin NTDs.

Objective: This article presents a systematic review regarding the use of DL-based MMDF methods for the diagnosis of skin NTDs and related diseases (non-NTD skin diseases), including the ethical risks and potential risk of bias.

Methods: The review was conducted based on the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) method using 6 parameters (research approach followed, disease[s] diagnosed, dataset[s] used, algorithm[s] applied, performance achieved, and future direction[s]).

Results: Initially, 437 articles were collected from 5 major groups of identified sources; 14 articles were selected for the final analysis. Results revealed that, compared with traditional methods, the MMDF methods improved model performance for the diagnosis of skin NTDs and non-NTD skin diseases. Algorithmically, convolutional neural network (CNN)-based models were the predominantly used DL architectures (9/14 studies, 64%), providing feature extraction, feature fusion, and disease classification; these tasks were also performed with transformer-based methods (1/14, 7%). Furthermore, recurrent neural networks were used in combination with CNN-based feature extractors to fuse multimodal data (1/14, 7%) and with generative models (1/14, 7%). The remaining studies used study-specific algorithms based on transformers (1/14, 7%) and generative models (1/14, 7%).

Conclusions: Finally, this article suggests that further studies should be conducted about using DL-based MMDF methods for skin NTDs, considering model efficiency, data scarcity, algorithm selection and use, fusion strategies of multiple modalities, and the possible adoption of such tools for resource-constrained areas.
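The feature-fusion step that most of the reviewed CNN-based pipelines perform can be illustrated generically. The sketch below is not drawn from any specific included study: the CNN and clinical-metadata encoders are replaced by random stand-in feature matrices, and all names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in a real MMDF pipeline these would come from a
# CNN image encoder and a tabular/clinical-metadata encoder, respectively.
image_features = rng.normal(size=(4, 128))    # 4 lesions, 128-d image embeddings
clinical_features = rng.normal(size=(4, 10))  # 4 lesions, 10 clinical variables

# Late fusion by concatenation, followed by a linear classification head.
fused = np.concatenate([image_features, clinical_features], axis=1)  # shape (4, 138)
weights = rng.normal(size=(138,))
logits = fused @ weights
probs = 1 / (1 + np.exp(-logits))  # sigmoid: per-lesion disease probability
```

Concatenation-based late fusion is only one strategy; attention-based or intermediate fusion trades this simplicity for learned cross-modal interactions.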

JMIR AI. 2025;4:e67584. PMCID: PMC12715462 (open access).
Citations: 0
Rethinking AI Workflows: Guidelines for Scientific Evaluation in Digital Health Companies.
IF 2 Pub Date : 2025-12-04 DOI: 10.2196/71798
Kelsey Lynn McAlister, Lee Gonzales, Jennifer Huberty

Unlabelled: Artificial intelligence (AI) is revolutionizing digital health, driving innovation in care delivery and operational efficiency. Despite its potential, many AI systems fail to meet real-world expectations due to limited evaluation practices that focus narrowly on short-term metrics like efficiency and technical accuracy. Ignoring factors such as usability, trust, transparency, and adaptability hinders AI adoption, scalability, and long-term impact in health care. This paper emphasizes the importance of embedding scientific evaluation as a core operational layer throughout the AI life cycle. We outline practical guidelines for digital health companies to improve AI integration and evaluation, informed by over 35 years of experience in science, the digital health industry, and AI development. The paper describes a multistep approach, including stakeholder analysis, real-time monitoring, and iterative improvement, that digital health companies can adopt to ensure robust AI integration. Key recommendations include assessing stakeholder needs, designing AI systems that can check their own work, conducting testing to address usability and biases, and ensuring continuous improvement to keep systems user-centered and adaptable. By integrating these guidelines, digital health companies can improve AI reliability, scalability, and trustworthiness, driving better health care delivery and stakeholder alignment.

JMIR AI. 2025;4:e71798. PMCID: PMC12677877 (open access).
Citations: 0
The Perceived Roles of AI in Clinical Practice: National Survey of 941 Academic Physicians.
IF 2 Pub Date : 2025-12-04 DOI: 10.2196/72535
Anshul Ratnaparkhi, Simon Moore, Abhinav Suri, Bayard Wilson, Jacob Alderete, T J Florence, David Zarrin, David Berin, Rami Abuqubo, Kirstin Cook, Matiar Jafari, Joseph S Bell, Luke Macyszyn, Andrew C Vivas, Joel Beckett

Background: Artificial intelligence (AI) and machine learning models are frequently developed in medical research to optimize patient care, yet they remain rarely used in clinical practice.

Objective: This study aims to understand the disconnect between model development and implementation by surveying physicians of all specialties across the United States.

Methods: The present survey was distributed to residency coordinators at Accreditation Council for Graduate Medical Education-accredited residency programs to disseminate among attending physicians and resident physicians affiliated with their academic institution. Respondents were asked to identify and quantify the extent of their training and specialization, as well as the type and location of their practice. Physicians were then asked follow-up questions regarding AI in their practice, including whether its use is permitted, whether they would use it if made available, primary reasons for using or not using AI, elements that would encourage its use, and ethical concerns.

Results: Of the 941 physicians who responded to the survey, 384 (40.8%) were attending physicians and 557 (59.2%) were resident physicians. The majority of the physicians (651/795, 81.9%) indicated that they would adopt AI in clinical practice if given the opportunity. The most cited intended uses for AI were risk stratification, image analysis or segmentation, and disease prognosis. The most common reservations were concerns about clinical errors made by AI and the potential to replicate human biases.

Conclusions: To date, this study comprises the largest and most diverse dataset of physician perspectives on AI. Our results emphasize that most academic physicians in the United States are open to adopting AI in their clinical practice. However, for AI to become clinically relevant, developers and physicians must work synergistically to design models that are accurate, accessible, and intuitive while thoroughly addressing ethical concerns associated with the implementation of AI in medicine.

JMIR AI. 2025;4:e72535. PMCID: PMC12715463 (open access).
Citations: 0