JMIR Medical Informatics: Latest Articles

Large Language Model-Based Virtual Patient Systems for History-Taking in Medical Education: Comprehensive Systematic Review.
IF 3.8, Zone 3 (Medicine), Q2 Medical Informatics, Pub Date: 2026-01-02, DOI: 10.2196/79039
Dongliang Li, Syaheerah Lebai Lutfi

Background: Large language models (LLMs), such as GPT-3.5 and GPT-4 (OpenAI), have been transforming virtual patient systems in medical education by providing scalable and cost-effective alternatives to standardized patients. However, systematic evaluations of their performance, particularly for multimorbidity scenarios involving multiple coexisting diseases, are still limited.

Objective: This systematic review aimed to evaluate LLM-based virtual patient systems for medical history-taking, addressing four research questions: (1) simulated patient types and disease scope, (2) performance-enhancing techniques, (3) experimental designs and evaluation metrics, and (4) dataset characteristics and availability.

Methods: Following PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) 2020, 9 databases were searched (January 1, 2020, to August 18, 2025). Nontransformer LLMs and non-history-taking tasks were excluded. Multidimensional quality and bias assessments were conducted.

Results: A total of 39 studies were included, screened by one computer science researcher under supervision. LLM-based virtual patient systems mainly simulated internal medicine and mental health disorders, with many addressing distinct single disease types but few covering multimorbidity or rare conditions. Techniques like role-based prompts, few-shot learning, multiagent frameworks, knowledge graph (KG) integration (top-k accuracy 16.02%), and fine-tuning enhanced dialogue and diagnostic accuracy. Multimodal inputs (eg, speech and imaging) improved immersion and realism. Evaluations, typically involving 10-50 students and 3-10 experts, demonstrated strong performance (top-k accuracy: 0.45-0.98, hallucination rate: 0.31%-5%, System Usability Scale [SUS] ≥80). However, small samples, inconsistent metrics, and limited controls restricted generalizability. Common datasets such as MIMIC-III (Medical Information Mart for Intensive Care-III) exhibited intensive care unit (ICU) bias and lacked diversity, affecting reproducibility and external validity.
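Top-k accuracy, the headline metric quoted above, can be illustrated with a minimal Python sketch. The function name, the ranked diagnosis lists, and the case labels below are hypothetical and are not taken from any of the reviewed studies; the sketch only assumes the common definition in which a case counts as correct when the true diagnosis appears among the first k ranked candidates.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    true_labels: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose true label is among the first k ranked predictions."""
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    if not true_labels:
        raise ValueError("at least one case is required")
    hits = sum(
        truth in ranked[:k]
        for ranked, truth in zip(ranked_predictions, true_labels)
    )
    return hits / len(true_labels)


# Hypothetical ranked differential diagnoses for three simulated cases.
predictions = [
    ["pneumonia", "bronchitis", "asthma"],
    ["migraine", "tension headache", "sinusitis"],
    ["gastritis", "peptic ulcer", "pancreatitis"],
]
truths = ["bronchitis", "migraine", "appendicitis"]

top1 = top_k_accuracy(predictions, truths, k=1)
top2 = top_k_accuracy(predictions, truths, k=2)
```

Because the score depends on k, studies that report different k values (or do not state k) are not directly comparable, which is one facet of the inconsistent metrics noted in the review.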

Conclusions: Included studies showed moderate risk of bias, inconsistent metrics, small cohorts, and limited dataset transparency. LLM-based virtual patient systems excel in simulating multiple disease types but lack multimorbidity patient representation. KGs improve top-k accuracy and support structured disease representation and reasoning. Future research should prioritize hybrid KG-chain-of-thought architectures integrated with open-source KGs (eg, UMLS [Unified Medical Language System] and SNOMED-CT [Systematized Nomenclature of Medicine - Clinical Terms]), parameter-efficient fine-tuning, dialogue compression, multimodal LLMs, standardized metrics, larger cohorts, and open-access multimodal datasets to further enhance realism, diagnostic accuracy, fairness, and educational utility.

JMIR Medical Informatics, volume 14, e79039. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12811743/pdf/
Citations: 0
Predicting Left Ventricular Ejection Fraction Recovery After Percutaneous Coronary Intervention in Patients With Chronic Coronary Syndrome by Using Interpretable Machine Learning Models: Retrospective Study.
IF 3.8, Zone 3 (Medicine), Q2 Medical Informatics, Pub Date: 2025-12-29, DOI: 10.2196/77839
Jiayi Ding, Guanqi Lyu, Masaharu Nakayama, Kotaro Nochioka, Jun Takahashi, Satoshi Yasuda, Tetsuya Matoba, Takahide Kohro, Naoyuki Akashi, Hideo Fujita, Yusuke Oba, Tomoyuki Kabutoya, Kazuomi Kario, Yasushi Imai, Arihiro Kiyosue, Yoshiko Mizuno, Takamasa Iwai, Yoshihiro Miyamoto, Masanobu Ishii, Kenichi Tsujita, Taishi Nakamura, Hisahiko Sato, Ryozo Nagai

Background: Accurately predicting left ventricular ejection fraction (LVEF) recovery after percutaneous coronary intervention (PCI) in patients with chronic coronary syndrome (CCS) is crucial for clinical decision-making.

Objective: This study aimed to develop and compare multiple machine learning (ML) models to predict LVEF recovery and identify key contributing features.

Methods: We retrospectively analyzed 520 patients with CCS from the Clinical Deep Data Accumulation System database. Patients were categorized into 4 binary classification tasks based on baseline LVEF (≥50% or <50%) and degree of recovery: (1) good recovery, defined as an LVEF increase of >10% compared with ≤0%; and (2) normal recovery, defined as an LVEF increase of 0% to 10% compared with ≤0%. For each task, 3 feature selection strategies (all features, least absolute shrinkage and selection operator [LASSO] regression, and recursive feature elimination [RFE]) were combined with 4 ML algorithms (extreme gradient boosting [XGBoost], categorical boosting, light gradient boosting machine, and random forest), resulting in 48 models. Models were evaluated using 10-fold cross-validation and assessed by the area under the curve (AUC), decision curve analysis, and calibration plots.
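The count of 48 models follows from crossing 4 tasks with 3 feature selection strategies and 4 algorithms. A minimal Python sketch of that grid is shown here; the string labels are illustrative shorthand for the categories named in the abstract (CatBoost and LightGBM are the usual names for categorical boosting and light gradient boosting machine), and nothing below reproduces the authors' actual pipeline.

```python
from itertools import product

# Two baseline groups crossed with two recovery comparisons give the 4 tasks.
baseline_groups = ["preserved LVEF (>=50%)", "reduced LVEF (<50%)"]
recovery_comparisons = [
    "good recovery (increase >10%) vs increase <=0%",
    "normal recovery (increase 0-10%) vs increase <=0%",
]
feature_selectors = ["all features", "LASSO", "RFE"]
algorithms = ["XGBoost", "CatBoost", "LightGBM", "random forest"]

tasks = list(product(baseline_groups, recovery_comparisons))
model_grid = list(product(tasks, feature_selectors, algorithms))

# 4 tasks x 3 selectors x 4 algorithms = 48 model configurations.
print(len(tasks), len(model_grid))
```

Each entry of `model_grid` would then be trained and scored separately, for example with 10-fold cross-validation as the abstract describes.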

Results: The highest AUCs were achieved by RFE combined with XGBoost (AUC=0.93) for preserved LVEF with good recovery, LASSO combined with XGBoost (AUC=0.79) for preserved LVEF with normal recovery, LASSO combined with XGBoost (AUC=0.88) for reduced LVEF with good recovery, and RFE combined with XGBoost (AUC=0.84) for reduced LVEF with normal recovery. Shapley Additive Explanation analysis identified uric acid, platelets, hematocrit, brain natriuretic peptide, glycated hemoglobin, glucose, creatinine, baseline LVEF, left ventricular end-diastolic internal diameter, heart rate, R wave amplitude in V5, and R wave amplitude in V6 as important predictive factors of LVEF recovery.

Conclusions: ML models incorporating feature selection strategies demonstrated strong predictive performance for LVEF recovery after PCI. These interpretable models may support clinical decision-making and can improve the management of patients with CCS after PCI.

JMIR Medical Informatics, volume 13, e77839. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12796882/pdf/
Citations: 0
Performance of ChatGPT-4o, Claude 3 Opus, and DeepSeek-R1 in BI-RADS Category 4 Classification and Malignancy Prediction From Mammography Reports: Retrospective Diagnostic Study.
IF 3.8, Zone 3 (Medicine), Q2 Medical Informatics, Pub Date: 2025-12-25, DOI: 10.2196/80182
Xingwei Dai, Man Ke, Dixing Xie, Mengting Mei, Si Wei, Yi Dai, Ronghua Yan

Background: Mammography is a key imaging modality for breast cancer screening and diagnosis, with the Breast Imaging Reporting and Data System (BI-RADS) providing standardized risk stratification. However, BI-RADS category 4 lesions pose a diagnostic challenge due to their wide malignancy probability range and substantial overlap between benign and malignant findings. Moreover, current interpretations rely heavily on radiologists' expertise, leading to variability and potential diagnostic errors. Recent advances in large language models (LLMs), such as ChatGPT-4o, Claude 3 Opus, and DeepSeek-R1, offer new possibilities for automated medical report interpretation.

Objective: This study aims to explore the feasibility of LLMs in evaluating the benign or malignant subcategories of BI-RADS category 4 lesions based on free-text mammography reports.

Methods: This retrospective, single-center study included 307 patients (mean age 47.25, SD 11.39 years) with BI-RADS category 4 mammography reports between May 2021 and March 2024. Three LLMs (ChatGPT-4o, Claude 3 Opus, and DeepSeek-R1) classified BI-RADS 4 subcategories from the reports' text only, whereas radiologists based their classifications on image review. Pathology served as the reference standard, and the reproducibility of LLMs' predictions was assessed. The diagnostic performance of radiologists and LLMs was compared, and the internal reasoning behind LLMs' misclassifications was analyzed.

Results: ChatGPT-4o demonstrated higher reproducibility than DeepSeek-R1 and Claude 3 Opus (Fleiss κ 0.850 vs 0.824 and 0.732, respectively). Although the overall accuracy of LLMs was lower than that of radiologists (senior: 74.5%; junior: 72.0%; DeepSeek-R1: 63.5%; ChatGPT-4o: 62.4%; Claude 3 Opus: 60.8%), their sensitivity was higher (senior: 80.7%; junior: 68.0%; DeepSeek-R1: 84.0%; ChatGPT-4o: 84.7%; Claude 3 Opus: 92.7%), while specificity remained lower (senior: 68.3%; junior: 76.1%; DeepSeek-R1: 43.0%; ChatGPT-4o: 40.1%; Claude 3 Opus: 28.9%). DeepSeek-R1 achieved the best prediction accuracy among LLMs with an area under the receiver operating characteristic curve of 0.64 (95% CI 0.57-0.70), followed by ChatGPT-4o (0.62, 95% CI 0.56-0.69) and Claude 3 Opus (0.61, 95% CI 0.54-0.67). By comparison, junior and senior radiologists achieved higher areas under the receiver operating characteristic curve of 0.72 (95% CI 0.66-0.78) and 0.75 (95% CI 0.69-0.80), respectively. DeLong testing confirmed that all three LLMs performed significantly worse than both junior and senior radiologists (all P<.05), and no significant difference was observed between the two radiologist groups (P=.55). At the subcategory level, ChatGPT-4o yielded an overall F1-score of 47.6%, DeepSeek-R1 achieved 45.6%, and Claude 3 Opus achieved 36.2%.

Conclusions: LLMs are feasible for distinguishing between benign and malignant lesions in BI-RADS category 4, with good stability and high sensitivity but relatively insufficient specificity. They show potential in screening and may help radiologists reduce missed diagnoses.

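The pattern reported above (high sensitivity, low specificity, middling accuracy) is easier to read against the definitions of the three metrics. The Python sketch below uses invented confusion-matrix counts, chosen only to mimic that pattern; they are not the study's data, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts against a pathology reference standard (malignant = positive)."""
    true_positive: int
    false_negative: int
    true_negative: int
    false_positive: int


def sensitivity(c: ConfusionCounts) -> float:
    # Share of malignant lesions that were called malignant.
    return c.true_positive / (c.true_positive + c.false_negative)


def specificity(c: ConfusionCounts) -> float:
    # Share of benign lesions that were called benign.
    return c.true_negative / (c.true_negative + c.false_positive)


def accuracy(c: ConfusionCounts) -> float:
    total = (c.true_positive + c.false_negative
             + c.true_negative + c.false_positive)
    return (c.true_positive + c.true_negative) / total


# Hypothetical counts: a reader that over-calls malignancy.
example = ConfusionCounts(
    true_positive=90, false_negative=10,
    true_negative=30, false_positive=70,
)
print(sensitivity(example), specificity(example), accuracy(example))
```

With these illustrative counts, few cancers are missed but most benign lesions are flagged, which is the trade-off the abstract describes for the LLMs relative to radiologists.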
JMIR Medical Informatics, volume 13, e80182. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12784141/pdf/
Citations: 0
Acceptance of Electronic Medical Records and Associated Factors Among Health Care Workers in Northwest Ethiopia: Cross-Sectional Study.
IF 3.8, Zone 3 (Medicine), Q2 Medical Informatics, Pub Date: 2025-12-23, DOI: 10.2196/72030
Asmamaw Ketemaw Tsehay, Kholofelo Lorraine Matlhaba

Background: Although electronic medical records (EMRs) play a vital role in strengthening the health care system by improving efficiency, data management, and patient care, their development in Ethiopia is still in its early stages. Hence, most public health care facilities manage their patient information using paper-based recording, which results in errors, delays, and reduced service quality.

Objective: This study aims to determine the level of acceptance of the EMR system and describe contributing factors.

Methods: A cross-sectional study was conducted at health care facilities in Bahir City, Northwestern Ethiopia. A total of 322 health workers participated in the study, drawn from 5 health facilities that have implemented the EMR system. Descriptive statistics and bivariate and multivariate binary logistic regression were done to determine factors associated with EMR acceptance computed from mediating factors (perceived ease of use and perceived usefulness), and which is more appropriate in early-stage implementation.

Results: Out of the total 322 respondents, 256 (73%) respondents with 95% CI 67.4-78.2 had a good acceptance of using EMRs. In regression analysis, significant predictors including work experience over 10 years (odds ratio [OR] 14.32, 95% CI 4.60-44.58), income dissatisfaction (OR 0.28, 95% CI 0.10-0.82), owning a personal computer (OR 11.08, 95% CI 4.03-30.24), EMR-specific training (OR 4.71, 95% CI 1.52-14.54), basic electronic health management information system/district health information system 2 training (OR 3.06, 95% CI 1.02-9.17), and system usability (OR 38.24, 95% CI 12.26-119.27) were identified.
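The odds ratios and intervals above can be read on the log scale. The Python sketch below assumes the intervals are Wald-type intervals computed on the log-odds scale, which is the usual logistic regression output but is not stated in the abstract; under that assumption the geometric mean of the two bounds should sit close to the reported OR, and the interval width gives an approximate standard error of the log OR. The dictionary keys and function name are illustrative.

```python
import math

# (odds ratio, lower 95% bound, upper 95% bound) as reported in the abstract.
reported = {
    "work experience >10 years": (14.32, 4.60, 44.58),
    "income dissatisfaction": (0.28, 0.10, 0.82),
    "owns a personal computer": (11.08, 4.03, 30.24),
    "EMR-specific training": (4.71, 1.52, 14.54),
    "basic eHMIS/DHIS2 training": (3.06, 1.02, 9.17),
    "system usability": (38.24, 12.26, 119.27),
}

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def log_scale_summary(lower: float, upper: float) -> tuple[float, float]:
    """Return (geometric mean of the bounds, approximate SE of the log OR)."""
    center = math.sqrt(lower * upper)
    standard_error = (math.log(upper) - math.log(lower)) / (2 * Z_95)
    return center, standard_error


for name, (odds_ratio, lower, upper) in reported.items():
    center, se = log_scale_summary(lower, upper)
    print(f"{name}: OR={odds_ratio}, center={center:.2f}, SE(log OR)={se:.2f}")
```

The wide intervals (for example, 12.26 to 119.27 for system usability) correspond to large standard errors on the log scale, so the point estimates should be read with caution.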

Conclusions: The study demonstrated a moderate level of EMR acceptance among health care workers, with system usability identified as the strongest predictor. Significant factors influencing EMR acceptance included longer work experience, ownership of a personal computer, and prior EMR or electronic health management information system/district health information system 2 training. Context-specific strategies are needed to enhance system usability, provide targeted digital health training, and improve access to technological resources in order to support broader EMR adoption in health care settings.

JMIR Medical Informatics, volume 13, e72030. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12775752/pdf/
Citations: 0
Competition or Complementarity Among Telemedicine Tools in Ambulatory Care Practice: Cross-Sectional Analysis.
IF 3.8, Zone 3 (Medicine), Q2 Medical Informatics, Pub Date: 2025-12-23, DOI: 10.2196/75246
Xiang Oliver Liu, Avijit Sengupta
Background: Telemedicine use surged due to its capacity to deliver safe, remote care. As the public health crisis subsides, evaluating the interplay among various tools, such as video, audio, and text, becomes critical to sustained use. With health care shifting back to in-person models, understanding whether telemedicine tools complement or compete provides valuable insights for future technology design and usage strategies.

Objective: This study investigates whether different types of telemedicine technology tools complement or compete with one another when physicians deliver health care services through them. A clear understanding of the relationships between telemedicine technology tools, physicians' satisfaction, evaluation of care quality, and patient visit percentages is crucial for designing new telemedicine technology platforms and for ensuring the quality of care delivered through them.

Methods: We analyzed data from the 2021 National Electronic Health Records Survey. We used ordered logit and probit regression models to evaluate the effects of telemedicine technology tools on physicians' overall satisfaction, their evaluation of health care quality, and the percentage of patient visits conducted via telemedicine.

Results: A total of 1875 office-based physicians in the United States completed the survey. Three main outcomes were assessed: physician satisfaction (n=1614), evaluation of health care quality (n=1617), and the percentage of patient visits conducted via telemedicine (n=1558). Ordered logit and probit regression analyses revealed that the aggregated use of telemedicine tools was significantly associated with improvements in all 3 outcomes. A unit increase in telemedicine tools was associated with a 4.2 percentage point increase in the predicted probability of physicians being "very satisfied" (P<.001) and a 5.2 percentage point increase in evaluating telemedicine quality as "to a great extent" (P<.001). For patient visits, a unit increase in telemedicine tools was associated with a 1.8 percentage point increase in the likelihood of reporting "≥75% of visits via telemedicine" (P<.001). Disaggregated analysis indicated that all individual tools were positively associated with physician satisfaction and quality evaluation (P<.05). Bundle models revealed patterns consistent with complementarity (several bundles exceeded their constituent tools) and competition (some significant bundles were smaller than at least one constituent tool), aligning with the presence of both reinforcing and overlapping functionalities.

Conclusions: Our study demonstrates that telemedicine tools interact in ways that can be either complementary or competitive, depending on how their functionalities align within physicians' workflows. Videoconferencing tools, especially when integrated with electronic health record platforms, act as a core complementary component, improving physician satisfaction and evaluation of care quality. In contrast, combinations that lack video functionality or involve multiple nonintegrated platforms disrupt workflows and increase cognitive burden. These findings underscore the importance of designing telemedicine toolkits that align media capabilities with clinical communication needs, thereby improving satisfaction and supporting sustainable, high-quality telemedicine practice.
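The "percentage point increase in the predicted probability" reported for the ordered logit models can be illustrated with a minimal Python sketch. It assumes the standard cumulative-logit form, P(Y ≤ j) = logistic(cut_j − xβ). The coefficient `BETA_TOOLS` and the `CUTPOINTS` are hypothetical values chosen for illustration; they are not estimates from the study, and the helper names are likewise invented here.

```python
import math


def logistic(z: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def ordered_logit_probs(linear_index: float, cutpoints: list[float]) -> list[float]:
    """Category probabilities under a cumulative-logit model.

    P(Y <= j) = logistic(cut_j - linear_index); each category probability
    is the difference between consecutive cumulative probabilities.
    """
    cumulative = [logistic(c - linear_index) for c in cutpoints] + [1.0]
    probs = []
    previous = 0.0
    for cum in cumulative:
        probs.append(cum - previous)
        previous = cum
    return probs


# Hypothetical parameters for illustration only (NOT from the study).
BETA_TOOLS = 0.25              # assumed effect of one additional tool
CUTPOINTS = [-1.0, 0.5, 2.0]   # three cutpoints give four ordered categories


def top_category_change(n_tools: int) -> float:
    """Percentage point change in P(top category) when one tool is added."""
    before = ordered_logit_probs(BETA_TOOLS * n_tools, CUTPOINTS)[-1]
    after = ordered_logit_probs(BETA_TOOLS * (n_tools + 1), CUTPOINTS)[-1]
    return (after - before) * 100.0


if __name__ == "__main__":
    print(ordered_logit_probs(BETA_TOOLS * 2, CUTPOINTS))
    print(top_category_change(2))
```

Because the model is nonlinear, the change depends on the starting value of the covariates; published effects of this kind are usually averaged over the sample or evaluated at representative values.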
JMIR Medical Informatics, volume 13, e75246. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12775760/pdf/
Citations: 0
Evaluating Multiple Input Strategies of Large Language Models for Gallbladder Polyps on Ultrasound: Comparative Study.
IF 3.8 Zone 3 Medicine Q2 MEDICAL INFORMATICS Pub Date: 2025-12-23 DOI: 10.2196/71178
Lin Jiang, Jiaqian Yao, Zebang Yang, Fuqiu Tang, Xin Zheng, Xiaoer Zhang, Xiaoyan Xie, Ming Xu, Tongyi Huang
Background: Gallbladder polyps have a high prevalence and are predominantly benign lesions, often detected via ultrasound. They impose diagnostic burdens on radiologists while generating substantial patient demand for report interpretation. Benign polyps include nonneoplastic polyps without malignant potential and premalignant adenomas that require cholecystectomy. Current guidelines recommending surgery for polyps ≥1.0 cm may lead to unnecessary interventions. Advanced multimodal large language models (LLMs) such as ChatGPT-4o (OpenAI) and Claude 3.5 Sonnet (Anthropic PBC) demonstrate emerging capabilities in medical image analysis. Implementing LLMs in gallbladder polyp ultrasound evaluation can potentially alleviate radiologists' workload, provide patient-accessible consultation platforms, and even reduce overtreatment.

Objective: We aimed to assess the feasibility of, and conduct an early-stage evaluation of, using LLMs (ChatGPT-4o and Claude 3.5 Sonnet) to differentiate adenomatous from nonneoplastic gallbladder polyps (≥1.0 cm), compared with assessments by radiologists and the guideline.

Methods: Ultrasound images and reports of gallbladder polyps ≥1.0 cm with pathology were retrospectively collected from a hospital between January 2011 and January 2022. LLM performance was evaluated using three input strategies: (1) direct image analysis (LLMs-image), (2) feature-based text analysis (LLMs-text), and (3) scoring model-based text analysis (LLMs-model). Both intra- and interreader agreement and diagnostic performance of LLMs were evaluated for all three strategies. The diagnostic performance metrics of LLMs in the three strategies (sensitivity, specificity, accuracy, area under the receiver operating characteristic curve, and unnecessary resection rate of nonneoplastic polyps) were compared with the guideline. Additionally, the strategy LLMs-model was specifically compared with radiologists using the same scoring system (strategy readers-model).

Results: This study included 223 patients (aged 18-72 years; 132/223, 59.2% female) as the initial cohort, with 48 adenomatous polyps and 175 nonneoplastic polyps. The external test set comprised 100 patients. The intrareader agreement coefficients for strategy LLMs-model were significantly higher than those for strategies LLMs-image and LLMs-text (all P<.01). The interreader agreement of the three diagnostic strategies was ranked as LLMs-model>LLMs-text>LLMs-image. The sensitivity of strategies LLMs-image and LLMs-text was significantly lower than that of the guideline (all P<.001). When applying a scoring model (readers/LLMs-model strategy), both radiologists and the LLMs achieved significantly higher accuracy than the guideline (0.34, 0.35, and 0.34 vs 0.22, all P<.01), and the unnecessary resection rate of nonneoplastic polyps was significantly lower (82%, 83%, and 83% vs 100%). None of the diagnostic performance metrics of the GPT-model and Claude-model differed significantly from those of the radiologists.

Conclusions: The ability of LLMs to recognize and interpret medical images needs further improvement. A text-based strategy with a scoring system is currently the most suitable diagnostic strategy for LLMs.
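The guideline baseline quoted in the results can be checked with a short worked example in Python. It assumes that adenomatous polyps are the positive class and that the unnecessary resection rate is the share of nonneoplastic polyps sent to surgery; the abstract does not spell out either definition, and the function name is invented here. Under a rule that resects every polyp ≥1.0 cm, all 48 adenomas are true positives and all 175 nonneoplastic polyps are false positives.

```python
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Basic diagnostic metrics from confusion-matrix counts.

    Positive class (assumed): adenomatous polyp, that is, surgery indicated.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        # Assumed definition: nonneoplastic polyps classified for surgery,
        # divided by all nonneoplastic polyps (equals 1 - specificity).
        "unnecessary_resection_rate": fp / (fp + tn),
    }


# Guideline rule applied to the initial cohort (48 adenomatous, 175 nonneoplastic):
# every polyp >= 1.0 cm is resected, so there are no negatives at all.
guideline = diagnostic_metrics(tp=48, fp=175, tn=0, fn=0)

if __name__ == "__main__":
    for name, value in guideline.items():
        print(f"{name}: {value:.2f}")
```

With these counts the accuracy is 48/223, which rounds to the 0.22 reported for the guideline, and the unnecessary resection rate is 100%, matching the abstract.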
JMIR Medical Informatics, volume 13, e71178. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12777648/pdf/
Citations: 0
Evaluating the Body Roundness Index as a Novel Digital Biomarker for Psoriasis Risk Prediction: Cross-Sectional Study.
IF 3.8 Zone 3 Medicine Q2 MEDICAL INFORMATICS Pub Date: 2025-12-23 DOI: 10.2196/75727
Pengfei Wen, Xiaoyan Wang, Xiaoxue Zhuo, Siliang Xue

Background: Psoriasis is a chronic inflammatory skin disorder that has been increasingly linked to metabolic imbalances, particularly obesity. Conventional anthropometric indicators such as BMI and waist circumference (WC) may not sufficiently capture body fat distribution or reflect metabolic risk. The body roundness index (BRI), which integrates both height and waist measurements, has emerged as a potentially superior metric, though its relevance to psoriasis risk remains underexplored.

Objective: This study aimed to investigate the use of BRI as a digital biomarker for assessing psoriasis risk and to compare its predictive strength against BMI and WC across various demographic and metabolic subgroups using data from a nationally representative sample.

Methods: A cross-sectional analysis was conducted using data from 13,798 adults aged 20 to 59 years who participated in the National Health and Nutrition Examination Survey between 2003 and 2006 as well as between 2009 and 2014. Psoriasis status was self-reported. Anthropometric measures (BRI, BMI, and WC) were calculated from standardized physical assessments. Weighted multivariable logistic regression models and restricted cubic spline analyses were used to examine associations while adjusting for demographic, metabolic, and lifestyle variables. A nomogram was constructed to quantify the relative predictive contributions of each metric.

Results: BRI exhibited a strong linear association with psoriasis risk (odds ratio [OR] 1.11 per unit increase, 95% CI 1.05-1.17; P<.001), outperforming BMI (OR 1.03) and WC (OR 1.01). Tertile analysis revealed a 1.73-fold increased risk of psoriasis in the highest BRI group (P=.003). Subgroup analyses confirmed consistent associations across age, sex, race or ethnicity, and metabolic status (P for interaction >.05). The nomogram highlighted BRI as the most influential predictor, indicated by its broad scoring range.

Conclusions: BRI shows stronger and more consistent associations with psoriasis risk than BMI or WC, supporting its potential role as a digital biomarker for early risk stratification. Incorporating BRI into clinical decision-making tools may enhance personalized approaches to psoriasis prevention and management.
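For readers unfamiliar with the index, the sketch below computes BRI in Python. The abstract does not give the formula; the code assumes the commonly cited formulation in which the body is modeled as an ellipse with semi-major axis height/2 and semi-minor axis waist/(2π), and BRI = 364.2 − 365.5 × eccentricity. The function name and the input checks are illustrative choices, not part of the study.

```python
import math


def body_roundness_index(waist_cm: float, height_cm: float) -> float:
    """Body roundness index from waist circumference and height (same units).

    Assumed formulation: BRI = 364.2 - 365.5 * sqrt(1 - (b / a) ** 2),
    where a = height / 2 and b = waist / (2 * pi).
    """
    if waist_cm <= 0 or height_cm <= 0:
        raise ValueError("waist and height must be positive")
    semi_minor = waist_cm / (2.0 * math.pi)
    semi_major = height_cm / 2.0
    if semi_minor >= semi_major:
        raise ValueError("waist is too large relative to height for this model")
    eccentricity = math.sqrt(1.0 - (semi_minor / semi_major) ** 2)
    return 364.2 - 365.5 * eccentricity


if __name__ == "__main__":
    # Example inputs are arbitrary illustrative values.
    print(round(body_roundness_index(90.0, 170.0), 2))
    print(round(body_roundness_index(110.0, 170.0), 2))
```

At a fixed height, a larger waist gives a lower eccentricity and therefore a higher BRI, which is why the index tracks central adiposity more directly than BMI.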

JMIR Medical Informatics, volume 13, e75727. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12724066/pdf/
Citations: 0
Deep Learning Approaches for Classifying Children With and Without Autism Spectrum Disorder Using Inertial Measurement Unit Hand Tracking Data: Comparative Study.
IF 3.8 Zone 3 Medicine Q2 MEDICAL INFORMATICS Pub Date: 2025-12-22 DOI: 10.2196/73440
John Mutersbaugh, Wan-Chun Su, Anjana Bhat, Amir Gandjbakhche

Background: Autism spectrum disorder (ASD) is a prevalent neurodevelopmental condition that can be quite difficult to diagnose due to a lack of objective diagnostic methods in the currently used behavioral assessments. Recent work has shown that children with ASD have a higher incidence of motor control differences. A compilation of studies indicates that between 50% and 88% of the children with ASD have issues with movement control based on standardized motor assessments or parent-reported questionnaires.

Objective: In this study, we assess a variety of deep learning approaches for the classification of ASD, utilizing data collected via inertial measurement unit (IMU) hand tracking during goal-directed arm movements.

Methods: IMU hand tracking data were recorded from 41 school-aged children both with and without an ASD diagnosis to track their arm movements during a reach-to-clean up task. The IMU data were then preprocessed using a moving average and z score normalization to prepare the data for deep learning models. We evaluated the effectiveness of different deep learning models using the preprocessed data and a k-fold validation approach, as well as a patient-separated approach.

Results: The best result was achieved with a convolutional autoencoder combined with long short-term memory layers, reaching an accuracy of 90.21% and an F1-score of 90.02%. Once the convolutional autoencoder+long short-term memory was determined to be the most effective model for this datatype, it was retrained and evaluated with a patient-separated dataset to assess the generalization capability of the model, achieving an accuracy of 91.87% and an F1-score of 93.66%.

Conclusions: Our deep learning approach demonstrates that our models hold potential for facilitating ASD diagnosis in clinical settings. This work validates that there are significant differences between the physical movements of typically developing children and children with ASD, and these differences can be identified by analyzing hand-eye coordination skills. Additionally, we have validated that small-scale models can still achieve a high accuracy and good generalization when classifying medical data, opening the door for future research into diagnostic models that may not require massive amounts of data.
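The preprocessing and evaluation steps named in the methods (moving average, z score normalization, and a patient-separated split) can be sketched in plain Python as follows. The window length, the use of the population SD, and the round-robin assignment of participants to folds are assumptions made for illustration; the abstract does not specify these details, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def moving_average(signal: list[float], window: int) -> list[float]:
    """Smooth a signal with a sliding window (output is shorter by window - 1)."""
    if window < 1 or window > len(signal):
        raise ValueError("window must be between 1 and len(signal)")
    return [mean(signal[i:i + window]) for i in range(len(signal) - window + 1)]


def z_score(signal: list[float]) -> list[float]:
    """Rescale a signal to zero mean and unit (population) SD."""
    mu = mean(signal)
    sigma = pstdev(signal)
    if sigma == 0:
        raise ValueError("constant signal cannot be z score normalized")
    return [(x - mu) / sigma for x in signal]


def subject_separated_folds(subject_ids: list[str], k: int):
    """Build k train/test index splits in which no participant appears in both.

    Participants (not individual recordings) are assigned to folds, so all
    recordings from one child land on the same side of each split.
    """
    subjects = sorted(set(subject_ids))
    if k < 2 or k > len(subjects):
        raise ValueError("k must be between 2 and the number of subjects")
    folds = []
    for fold in range(k):
        held_out = set(subjects[fold::k])
        test_idx = [i for i, s in enumerate(subject_ids) if s in held_out]
        train_idx = [i for i, s in enumerate(subject_ids) if s not in held_out]
        folds.append((train_idx, test_idx))
    return folds


if __name__ == "__main__":
    smoothed = moving_average([1.0, 2.0, 3.0, 4.0, 5.0], window=3)
    print(z_score(smoothed))
    print(subject_separated_folds(["a", "a", "b", "c", "c", "d"], k=2))
```

Splitting by participant matters because a plain k-fold split over recordings can place segments from the same child in both training and test sets, which tends to overstate how well a model generalizes to unseen children.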

JMIR Medical Informatics, volume 13, e73440. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12721220/pdf/
Citations: 0
Scalable Big Data Platform With End-to-End Traceability for Health Data Monitoring in Older Adults: Development and Performance Evaluation.
IF 3.8 Zone 3 Medicine Q2 MEDICAL INFORMATICS Pub Date: 2025-12-22 DOI: 10.2196/81701
Ander Cejudo, Yone Tellechea, Amaia Calvo, Aitor Almeida, Cristina Martín, Andoni Beristain
Background: The increasing use of real-time health data from wearable devices and self-reported questionnaires offers significant opportunities for preventive care in aging populations. However, current health data platforms often lack built-in mechanisms for data and model traceability, version control, and coordinated management of heterogeneous data streams, which are essential for clinical accountability, regulatory compliance, and reproducibility. The absence of these features limits the reuse of health data and the reproducibility of analytical workflows across research and clinical environments.

Objective: This work presents DeltaTrace, a unified big data health platform designed with traceability as a key architectural feature. The platform integrates end-to-end tracking of data and model versions with real-time and batch processing capabilities. Built entirely on open source technologies, DeltaTrace combines components for data management, model management, orchestration, and visualization. The main objective is to demonstrate that embedding traceability within the architecture enables scalable, auditable, and version-controlled processing of health data, thereby facilitating reproducible analytics and long-term maintenance of health monitoring systems.

Methods: DeltaTrace adopts a medallion architecture implemented with Delta Lake to ensure atomic and version-controlled data transformations. Apache Spark is used for distributed computation, Apache Kafka for continuous data ingestion, and Apache Airflow for orchestration of batch and streaming workflows. MLflow manages the lifecycle and versioning of machine learning models, while Grafana provides visualization dashboards for real-time and aggregated data inspection. The platform is evaluated using continuous physiological signals from wearable devices and batch-ingested questionnaire data, combining synthetic and real data from the LifeSnaps dataset. Performance tests are conducted on central processing unit-only servers with 8-core and 24-core configurations to assess ingestion, aggregation, visualization, and anomaly detection latency.

Results: DeltaTrace supports continuous processing for approximately 1500 users with end-to-end delays below 10 minutes. Ingestion and visualization tasks take a mean of 4.9 (SD 0.12) to 7.5 (SD 0.28) minutes, while aggregation and anomaly detection require less than a mean of 5.6 (SD 0.04) and 10.5 (SD 1.70) minutes, respectively. Increasing from 8 to 24 cores improves ingestion and cleaning latency by up to 25% and anomaly detection performance by up to 50%. The system maintains consistent performance across different data types, processing modes, and loads.

Conclusions: DeltaTrace provides a scalable and modular architecture that incorporates traceability as a core component together with functions for model management, orchestration, and visualization. The platform supports full version control across data and models and maintains performance under limited hardware conditions. These features support reproducible and auditable health data processing, making DeltaTrace suitable for continuous monitoring and preventive care in aging populations.
Citations: 0
A Machine Learning Model Based on Clinical Factors to Predict the Efficacy of First-Line Immunochemotherapy for Patients With Advanced Gastric Cancer: Retrospective Study.
IF 3.8 Zone 3 Medicine Q2 MEDICAL INFORMATICS Pub Date: 2025-12-22 DOI: 10.2196/82533
Xu Cheng, Ping Li, Enqing Meng, Xinyi Wu, Hao Wu
Background: The development of immunotherapy has provided new hope for patients with advanced gastric cancer (AGC). However, due to the high heterogeneity of the disease, the efficacy of first-line immunochemotherapy varies among patients. There is still a lack of simple and effective models to predict the efficacy of immunochemotherapy in this setting.

Objective: This study aimed to identify critical factors and develop predictive models to evaluate the efficacy of first-line immunochemotherapy in patients with AGC using clinically available data. The goal was to offer evidence-based guidance for clinical practice and enable personalized treatment strategies.

Methods: To evaluate the effectiveness of first-line immunochemotherapy in AGC, we retrospectively collected clinical data from The First Affiliated Hospital of Nanjing Medical University between January 2018 and October 2023. The data collected were divided into a training set (168/240, 70%) and an internal validation set (72/240, 30%). Additionally, a temporal validation cohort of 76 patients recruited from November 2023 to September 2024 was assembled to further evaluate the predictive performance of the models. We used univariate and multivariate Cox regression analyses, along with the least absolute shrinkage and selection operator (LASSO) regression, and integrated clinical expertise to identify key predictors of treatment efficacy and to construct the LASSO-Cox model. We developed 4 models (LASSO-Cox, random survival forest [RSF], extreme gradient boosting, and survival support vector machine) and evaluated their performance using the C-index, area under the curve (AUC), calibration curves, and decision curve analysis. The optimal model was interpreted using Shapley additive explanations, and its risk scores were used to stratify patients for Kaplan-Meier survival analysis.

Results: Among the 4 prognostic models developed in this study, the RSF model demonstrated superior predictive accuracy and discrimination for progression-free survival, as evidenced by its higher AUC, concordance index, continuous AUC curves, and calibration curves compared with the other 3 models. Additionally, decision curve analysis showed that the RSF model offered greater net clinical benefit. The Shapley additive explanations analysis identified age, histological subtype, the proportions of CD19+ B cells and CD16+CD56+ natural killer cells, and the presence of liver metastasis as key prognostic factors influencing patient outcomes. Patients in the low-risk group, as determined by the RSF model's risk score, exhibited a significantly higher progression-free survival rate than those in the high-risk group, further validating the value of the RSF model for risk stratification.

Conclusions: This study is the first to use machine learning algorithms to develop a predictive model for the efficacy of first-line immunochemotherapy in AGC and to identify key predictors of treatment outcome. The results suggest that the RSF model can not only precisely stratify patients who are likely to benefit but, more importantly, provide quantifiable decision support for personalized clinical strategies, underscoring its potential value in clinical decision-making.
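The C-index used above to compare the survival models measures how often a model assigns the higher risk score to the patient who progresses earlier. The following is a minimal sketch of Harrell's concordance index in plain Python, not the authors' code: the function name `concordance_index`, the simplified handling of tied times (such pairs are skipped), and the four-patient example data are assumptions made for illustration.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by risk score."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or not events[a]:
            # Tied times, or the earlier subject was censored: not comparable.
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk progressed earlier
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical data: months to progression, event flag (0 = censored), model risk.
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.6, 0.1]
c_index = concordance_index(times, events, risk)  # 4 of 5 comparable pairs, 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why the abstract treats a higher concordance index as evidence of better discrimination.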
Citations: 0