
Latest Publications from JMIR AI

Comparative Diagnostic Performance of a Multimodal Large Language Model Versus a Dedicated Electrocardiogram AI in Detecting Myocardial Infarction From Electrocardiogram Images: Comparative Study.
IF 2 Pub Date: 2025-09-17 DOI: 10.2196/75910
Haemin Lee, Sooyoung Yoo, Joonghee Kim, Youngjin Cho, Dongbum Suh, Keehyuck Lee

Background: Accurate and timely electrocardiogram (ECG) interpretation is critical for diagnosing myocardial infarction (MI) in emergency settings. Recent advances in multimodal large language models (LLMs), such as ChatGPT (OpenAI) and Gemini (Google DeepMind), have shown promise in the clinical interpretation of medical imaging. However, whether these models analyze waveform patterns or simply rely on text cues remains unclear, underscoring the need for direct comparisons with dedicated ECG artificial intelligence (AI) tools.

Objective: This study aimed to evaluate the diagnostic performance of ChatGPT and Gemini, two general-purpose LLMs, in detecting MI from ECG images and to compare their performance with that of ECG Buddy (ARPI Inc), a dedicated AI-driven ECG analysis tool.

Methods: This retrospective study evaluated and compared AI models for classifying MI using a publicly available 12-lead ECG dataset from Pakistan, categorizing cases into MI-positive (239 images) and MI-negative (689 images). ChatGPT (GPT-4o, version November 20, 2024) and Gemini (Gemini 2.5 Pro) were queried with 5 MI confidence options, whereas ECG Buddy for Microsoft Windows analyzed the images based on ST-elevation MI, acute coronary syndrome, and myocardial injury biomarkers.

Results: Among 928 ECG recordings (239/928, 25.8% MI-positive), ChatGPT achieved an accuracy of 65.95% (95% CI 62.80-69.00), area under the curve (AUC) of 57.34% (95% CI 53.44-61.24), sensitivity of 36.40% (95% CI 30.30-42.85), and specificity of 76.2% (95% CI 72.84-79.33). With Gemini 2.5 Pro, accuracy dropped to 29.63% (95% CI 26.71-32.69) and AUC to 51.63% (95% CI 50.22-53.04); sensitivity rose to 97.07% (95% CI 94.06-98.81), but specificity fell sharply to 6.24% (95% CI 4.55-8.31). However, ECG Buddy reached an accuracy of 96.98% (95% CI 95.67-97.99), AUC of 98.8% (95% CI 98.3-99.43), sensitivity of 96.65% (95% CI 93.51-98.54), and specificity of 97.10% (95% CI 95.55-98.22). The DeLong test confirmed that ECG Buddy significantly outperformed ChatGPT (all P<.001). In a qualitative error analysis of the LLMs' diagnostic explanations, GPT-4o produced fully accurate explanations in only 5% of cases (2/40), was partially accurate in 38% (15/40), and completely inaccurate in 58% (23/40). By contrast, Gemini 2.5 Pro yielded fully accurate explanations in 32% of cases (12/37), was partially accurate in 14% (5/37), and completely inaccurate in 54% (20/37).
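As a concrete illustration of how such discrimination metrics are derived, the sketch below computes accuracy, sensitivity, specificity (each with 95% CIs), and AUC from binary labels and graded confidence scores. The abstract does not state which CI method was used, so Wilson intervals are an assumption here, and the data are synthetic stand-ins for the 928-record dataset; the DeLong comparison is not reproduced.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score
from statsmodels.stats.proportion import proportion_confint

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.258, 928)             # ~25.8% MI-positive, as in the study
y_score = rng.uniform(0, 1, 928) + 0.3 * y_true  # toy graded confidence scores
y_pred = (y_score > 0.8).astype(int)             # thresholded hard calls

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
for name, k, n in [("accuracy", tp + tn, y_true.size),
                   ("sensitivity", tp, tp + fn),
                   ("specificity", tn, tn + fp)]:
    lo, hi = proportion_confint(k, n, alpha=0.05, method="wilson")
    print(f"{name}: {k / n:.2%} (95% CI {lo:.2%}-{hi:.2%})")
print(f"AUC: {roc_auc_score(y_true, y_score):.4f}")
```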

Conclusions: LLMs, such as ChatGPT and Gemini, underperform relative to specialized tools such as ECG Buddy in ECG image-based MI diagnosis. Further training may improve LLMs; however, domain-specific AI remains essential for clinical accuracy. The high performance of ECG Buddy underscores the importance of specialized models for achieving reliable and robust diagnostic outcomes.

{"title":"Comparative Diagnostic Performance of a Multimodal Large Language Model Versus a Dedicated Electrocardiogram AI in Detecting Myocardial Infarction From Electrocardiogram Images: Comparative Study.","authors":"Haemin Lee, Sooyoung Yoo, Joonghee Kim, Youngjin Cho, Dongbum Suh, Keehyuck Lee","doi":"10.2196/75910","DOIUrl":"10.2196/75910","url":null,"abstract":"<p><strong>Background: </strong>Accurate and timely electrocardiogram (ECG) interpretation is critical for diagnosing myocardial infarction (MI) in emergency settings. Recent advances in multimodal large language models (LLMs), such as ChatGPT (OpenAI) and Gemini (Google DeepMind), have shown promise in clinical interpretation for medical imaging. However, whether these models analyze waveform patterns or simply rely on text cues remains unclear, underscoring the need for direct comparisons with dedicated ECG artificial intelligence (AI) tools.</p><p><strong>Objective: </strong>This study aimed to evaluate the diagnostic performance of ChatGPT and Gemini, a general-purpose LLM, in detecting MI from ECG images and to compare its performance with that of ECG Buddy (ARPI Inc), a dedicated AI-driven ECG analysis tool.</p><p><strong>Methods: </strong>This retrospective study evaluated and compared AI models for classifying MI using a publicly available 12-lead ECG dataset from Pakistan, categorizing cases into MI-positive (239 images) and MI-negative (689 images). ChatGPT (GPT-4o, version November 20, 2024) and Gemini (Gemini 2.5 pro) were queried with 5 MI confidence options, whereas ECG Buddy for Microsoft Windows analyzed the images based on ST-elevation MI, acute coronary syndrome, and myocardial injury biomarkers.</p><p><strong>Results: </strong>Among 928 ECG recordings (239/928, 25.8% MI-positive), ChatGPT achieved an accuracy of 65.95% (95% CI 62.80-69.00), area under the curve (AUC) of 57.34% (95% CI 53.44-61.24), sensitivity of 36.40% (95% CI 30.30-42.85), and specificity of 76.2% (95% CI 72.84-79.33). With Gemini 2.5 Pro, accuracy dropped to 29.63% (95% CI 26.71-32.69), AUC to 51.63% (95% CI 50.22-53.04), and sensitivity rose to 97.07% (95% CI 94.06-98.81), but specificity fell sharply to 6.24% (95% CI 4.55-8.31). However, ECG Buddy reached an accuracy of 96.98% (95% CI 95.67-97.99), AUC of 98.8% (95% CI 98.3-99.43), sensitivity of 96.65% (95% CI 93.51-98.54), and specificity of 97.10% (95% CI 95.55-98.22). DeLong test confirmed that ECG Buddy significantly outperformed ChatGPT (all P<.001). In a qualitative error analysis of LLMs' diagnostic explanations, GPT-4o produced fully accurate explanations in only 5% of cases (2/40), was partially accurate in 38% (15/40), and completely inaccurate in 58% (23/40). By contrast, Gemini 2.5 Pro yielded fully accurate explanations in 32% of cases (12/37), was partially accurate in 14% (5/37), and completely inaccurate in 54% (20/37).</p><p><strong>Conclusions: </strong>LLMs, such as ChatGPT and Gemini, underperform relative to specialized tools such as ECG Buddy in ECG image-based MI diagnosis. Further training may improve LLMs; however, domain-specific AI remains essential for clinical accuracy. 
The high performance of ECG Buddy underscores the importance of specialized models for achieving reliable and robust diagnostic outcomes.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e75910"},"PeriodicalIF":2.0,"publicationDate":"2025-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12443349/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145082735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Trade-Off Analysis of Classical Machine Learning and Deep Learning Models for Robust Brain Tumor Detection: Benchmark Study.
IF 2 Pub Date: 2025-09-15 DOI: 10.2196/76344
Yuting Tian

Background: Medical image analysis plays a critical role in brain tumor detection, but training deep learning models often requires large, labeled datasets, which can be time-consuming and costly to assemble. This study presents a comparative analysis of machine learning and deep learning models for brain tumor classification, focusing on whether deep learning models are necessary for small medical datasets and whether self-supervised learning can reduce annotation costs.

Objective: The primary goal is to evaluate trade-offs between traditional machine learning and deep learning, including self-supervised models, under small medical image datasets. The secondary goal is to assess model robustness, transferability, and generalization through evaluation on unseen data both within and across domains.

Methods: Four models were compared: (1) support vector machine (SVM) with histogram of oriented gradients (HOG) features, (2) a convolutional neural network based on ResNet18, (3) a transformer-based model using vision transformer (ViT-B/16), and (4) a self-supervised learning approach using Simple Contrastive Learning of Visual Representations (SimCLR). These models were selected to represent diverse paradigms. SVM+HOG represents traditional feature engineering with low computational cost, ResNet18 serves as a well-established convolutional neural network with strong baseline performance, ViT-B/16 leverages self-attention to capture long-range spatial features, and SimCLR enables learning from unlabeled data, potentially reducing annotation costs. The primary dataset consisted of 2870 brain magnetic resonance images across 4 classes: glioma, meningioma, pituitary, and nontumor. All models were trained under consistent settings, including data augmentation, early stopping, and 3 independent runs using different random seeds to account for performance variability. Performance metrics included accuracy, precision, recall, F1-score, and convergence. To assess robustness and generalization capability, evaluation was performed on unseen test data from both the primary and cross datasets. No retraining or test augmentations were applied to the external data, thereby reflecting realistic deployment conditions. The models demonstrated consistently strong performance in both within-domain and cross-domain evaluations.
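For readers unfamiliar with the lowest-cost baseline in this comparison, a minimal sketch of an SVM+HOG pipeline follows: one HOG descriptor is extracted per image and fed to a linear SVM. The image size, HOG parameters, and synthetic data are illustrative assumptions, not the study's actual configuration.

```python
import numpy as np
from skimage.feature import hog
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
images = rng.random((200, 64, 64))   # stand-ins for grayscale MRI slices
labels = rng.integers(0, 4, 200)     # glioma, meningioma, pituitary, nontumor

# One HOG descriptor per image: oriented-gradient histograms over local cells.
features = np.array([
    hog(im, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    for im in images
])

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, stratify=labels, random_state=0)
clf = SVC(kernel="linear").fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2%}")
```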

Results: The results revealed distinct trade-offs; ResNet18 achieved the highest validation accuracy (mean 99.77%, SD 0.00%) and the lowest validation loss, along with a weighted test accuracy of 99% within-domain and 95% cross-domain. SimCLR reached a mean validation accuracy of 97.29% (SD 0.86%) and achieved up to 97% weighted test accuracy within-domain and 91% cross-domain, despite requiring a 2-stage training procedure involving contrastive pretraining followed by linear evaluation. ViT-B/16 reached a mean validation accuracy of 97.36% (SD 0.11%), with a weighted test accuracy of 98% within-domain and 93% cross-domain. SVM+HOG achieved a competitive validation accuracy of 96.51%, with a within-domain test accuracy of 97% that dropped to 80% cross-domain.

Conclusions: The study reveals meaningful trade-offs between model complexity, annotation requirements, and deployment feasibility, which are key factors for model selection in real-world medical imaging applications.

{"title":"Trade-Off Analysis of Classical Machine Learning and Deep Learning Models for Robust Brain Tumor Detection: Benchmark Study.","authors":"Yuting Tian","doi":"10.2196/76344","DOIUrl":"10.2196/76344","url":null,"abstract":"<p><strong>Background: </strong>Medical image analysis plays a critical role in brain tumor detection, but training deep learning models often requires large, labeled datasets, which can be time-consuming and costly. This study explores a comparative analysis of machine learning and deep learning models for brain tumor classification, focusing on whether deep learning models are necessary for small medical datasets and whether self-supervised learning can reduce annotation costs.</p><p><strong>Objective: </strong>The primary goal is to evaluate trade-offs between traditional machine learning and deep learning, including self-supervised models under small medical image data. The secondary goal is to assess model robustness, transferability, and generalization through evaluation of unseen data within- and cross-domains.</p><p><strong>Methods: </strong>Four models were compared: (1) support vector machine (SVM) with histogram of oriented gradients (HOG) features, (2) a convolutional neural network based on ResNet18, (3) a transformer-based model using vision transformer (ViT-B/16), and (4) a self-supervised learning approach using Simple Contrastive Learning of Visual Representations (SimCLR). These models were selected to represent diverse paradigms. SVM+HOG represents traditional feature engineering with low computational cost, ResNet18 serves as a well-established convolutional neural network with strong baseline performance, ViT-B/16 leverages self-attention to capture long-range spatial features, and SimCLR enables learning from unlabeled data, potentially reducing annotation costs. The primary dataset consisted of 2870 brain magnetic resonance images across 4 classes: glioma, meningioma, pituitary, and nontumor. All models were trained under consistent settings, including data augmentation, early stopping, and 3 independent runs using the different random seeds to account for performance variability. Performance metrics included accuracy, precision, recall, F<sub>1</sub>-score, and convergence. To assess robustness and generalization capability, evaluation was performed on unseen test data from both the primary and cross datasets. No retraining or test augmentations were applied to the external data, thereby reflecting realistic deployment conditions. The models demonstrated consistently strong performance in both within-domain and cross-domain evaluations.</p><p><strong>Results: </strong>The results revealed distinct trade-offs; ResNet18 achieved the highest validation accuracy (mean 99.77%, SD 0.00%) and the lowest validation loss, along with a weighted test accuracy of 99% within-domain and 95% cross-domain. SimCLR reached a mean validation accuracy of 97.29% (SD 0.86%) and achieved up to 97% weighted test accuracy within-domain and 91% cross-domain, despite requiring 2-stage training phases involving contrastive pretraining followed by linear evaluation. 
ViT-B/16 reached a mean validation accuracy of 97.36% (SD 0.11%), with a weighted test","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e76344"},"PeriodicalIF":2.0,"publicationDate":"2025-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12456844/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145066511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Machine-Learning Predictive Tool for the Individualized Prediction of Outcomes of Hematopoietic Cell Transplantation for Sickle Cell Disease: Registry-Based Study.
IF 2 Pub Date: 2025-09-15 DOI: 10.2196/64519
Rajagopal Subramaniam Chandrasekar, Michael Kane, Lakshmanan Krishnamurti

Background: Disease-modifying therapies ameliorate the severity of sickle cell disease (SCD), but hematopoietic cell transplantation (HCT) and, more recently, autologous gene therapy are the only treatments with curative potential for SCD. While registry-based studies provide population-level estimates, they do not address the uncertainty regarding individual outcomes of HCT. Computational machine learning (ML) has the potential to identify generalizable predictive patterns and quantify uncertainty in estimates, thereby improving clinical decision-making. There is no existing ML model for SCD, and ML models for HCT for other diseases focus on single outcomes rather than all relevant outcomes.

Objective: This study aims to address the existing knowledge gap by developing and validating an individualized ML prediction model SPRIGHT (Sickle Cell Predicting Outcomes of Hematopoietic Cell Transplantation), incorporating multiple relevant pre-HCT features to make predictions of key post-HCT clinical outcomes.

Methods: We applied a supervised random forest ML model to clinical parameters in a deidentified Center for International Blood and Marrow Transplant Research (CIBMTR) dataset of 1641 patients who underwent HCT between 1991 and 2021 and were followed for a median of 42.5 (IQR 52.5; range 0.3-312.9) months. We applied forward and reverse feature selection methods to optimize a set of predictive variables. To counter the imbalance bias toward predicting positive outcomes due to the small number of negative outcomes, we constructed a training dataset, taking each outcome as the variable of interest, and performed 2-times-repeated 10-fold cross-validation. SPRIGHT is a web-based individualized prediction tool accessible by smartphone, tablet, or personal computer. It incorporates predictive variables of age, age group, Karnofsky or Lansky score, comorbidity index, recipient cytomegalovirus seropositivity, history of acute chest syndrome, need for exchange transfusion, occurrence and frequency of vaso-occlusive crisis (VOC) before HCT, and either a published or custom chemotherapy or radiation conditioning, serotherapy, and graft-versus-host disease prophylaxis. SPRIGHT makes individualized predictions of overall survival (OS), event-free survival, graft failure, acute graft-versus-host disease (AGVHD), chronic graft-versus-host disease (CGVHD), and occurrence of VOC or stroke post-HCT.
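A minimal sketch of the described training scheme follows: one random forest fitted per outcome and evaluated with 2-times-repeated 10-fold cross-validation. The predictor matrix, the outcome prevalence, and the use of scikit-learn's class_weight to handle the stated class imbalance are illustrative assumptions, not the study's exact setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((1641, 12))        # placeholder pre-HCT predictors (age, scores, ...)
y = rng.binomial(1, 0.85, 1641)   # one binary outcome of interest, e.g., OS

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = RandomForestClassifier(n_estimators=500, class_weight="balanced", random_state=0)
aucs = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"AUC: mean {aucs.mean():.3f} (SD {aucs.std():.3f}) across {aucs.size} folds")
```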

Results: The model's ability to distinguish between positive and negative classes, that is, discrimination, was evaluated using the area under the curve, accuracy, and balanced accuracy. Discrimination met or exceeded published predictive benchmarks, with areas under the curve for OS (0.7925), event-free survival (0.7900), graft failure (0.8024), AGVHD (0.6793), CGVHD (0.7320), and VOC post-HCT (0.8779). SPRIGHT revealed good calibration, with slopes of 0.87-0.96 and small negative intercepts (-0.01 to 0.03). OS, however, showed nonideal calibration, which may reflect the overall high OS across all subgroups.

Conclusions: This web-based ML prediction tool incorporates multiple clinically relevant variables and predicts key clinical outcomes with a high level of discrimination and calibration, with potential to support shared decision-making.

{"title":"Machine-Learning Predictive Tool for the Individualized Prediction of Outcomes of Hematopoietic Cell Transplantation for Sickle Cell Disease: Registry-Based Study.","authors":"Rajagopal Subramaniam Chandrasekar, Michael Kane, Lakshmanan Krishnamurti","doi":"10.2196/64519","DOIUrl":"10.2196/64519","url":null,"abstract":"<p><strong>Background: </strong>Disease-modifying therapies ameliorate disease severity of sickle cell disease (SCD), but hematopoietic cell transplantation (HCT), and more recently, autologous gene therapy are the only treatments that have curative potential for SCD. While registry-based studies provide population-level estimates, they do not address the uncertainty regarding individual outcomes of HCT. Computational machine learning (ML) has the potential to identify generalizable predictive patterns and quantify uncertainty in estimates, thereby improving clinical decision-making. There is no existing ML model for SCD, and ML models for HCT for other diseases focus on single outcomes rather than all relevant outcomes.</p><p><strong>Objective: </strong>This study aims to address the existing knowledge gap by developing and validating an individualized ML prediction model SPRIGHT (Sickle Cell Predicting Outcomes of Hematopoietic Cell Transplantation), incorporating multiple relevant pre-HCT features to make predictions of key post-HCT clinical outcomes.</p><p><strong>Methods: </strong>We applied a supervised random forest ML model to clinical parameters in a deidentified Center for International Blood and Marrow Transplant Research (CIBMTR) dataset of 1641 patients who underwent HCT between 1991 and 2021 and were followed for a median of 42.5 (IQR 52.5;range 0.3-312.9) months. We applied forward and reverse feature selection methods to optimize a set of predictive variables. To counter the imbalance bias toward predicting positive outcomes due to the small number of negative outcomes, we constructed a training dataset, taking each outcome as variable of interest, and performed 2-times repeated 10-fold cross-validation. SPRIGHT is a web-based individualized prediction tool accessible by smartphone, tablet, or personal computer. It incorporates predictive variables of age, age group, Karnofsky or Lansky score, comorbidity index, recipient cytomegalovirus seropositivity, history of acute chest syndrome, need for exchange transfusion, occurrence and frequency of vaso-occlusive crisis (VOC) before HCT, and either a published or custom chemotherapy or radiation conditioning, serotherapy, and graft-versus-host disease prophylaxis. SPRIGHT makes individualized predictions of overall survival (OS), event-free survival, graft failure, acute graft-versus-host disease (AGVHD), chronic graft-versus-host disease (CGVHD), and occurrence of VOC or stroke post-HCT.</p><p><strong>Results: </strong>The model's ability to distinguish between positive and negative classes, that is, discrimination, was evaluated using the area under the curve, accuracy, and balanced accuracy. Discrimination met or exceeded published predictive benchmarks with area under the curve for OS (0.7925), event-free survival (0.7900), graft failure (0.8024), acute graft-versus-host disease (0.6793), chronic graft-versus-host disease (0.7320), and VOC post-HCT (0.8779). 
SPRIGHT revealed good c","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e64519"},"PeriodicalIF":2.0,"publicationDate":"2025-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12435087/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145066559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Critical Assessment of Large Language Models' (ChatGPT) Performance in Data Extraction for Systematic Reviews: Exploratory Study.
IF 2 Pub Date: 2025-09-11 DOI: 10.2196/68097
Hesam Mahmoudi, Doris Chang, Hannah Lee, Navid Ghaffarzadegan, Mohammad S Jalali

Background: Systematic literature reviews (SLRs) are foundational for synthesizing evidence across diverse fields and are especially important in guiding research and practice in health and biomedical sciences. However, they are labor intensive due to manual data extraction from multiple studies. As large language models (LLMs) gain attention for their potential to automate research tasks and extract basic information, understanding their ability to accurately extract explicit data from academic papers is critical for advancing SLRs.

Objective: Our study aimed to explore the capability of LLMs to extract both explicitly outlined study characteristics and deeper, more contextual information requiring nuanced evaluations, using ChatGPT (GPT-4).

Methods: We screened the full text of a sample of COVID-19 modeling studies and analyzed three basic measures of study settings (ie, analysis location, modeling approach, and analyzed interventions) and three complex measures of behavioral components in models (ie, mobility, risk perception, and compliance). To extract data on these measures, two researchers independently extracted 60 data elements using manual coding and compared them with the responses from ChatGPT to 420 queries spanning 7 iterations.
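The comparison of model responses with manual coding reduces to tallying, per prompt iteration, how many data elements fall into each outcome category (the study's categories were correct, imprecise, missing, and incorrect). A toy sketch of that tally follows; the records below are made up for illustration, and the element names are hypothetical.

```python
from collections import Counter

# (iteration, data element, grade) -- each grade comes from comparing a
# ChatGPT response against the researchers' manual extraction.
graded = [
    (1, "analysis_location", "correct"),
    (1, "mobility", "missing"),
    (7, "analysis_location", "correct"),
    (7, "mobility", "imprecise"),
]  # ... one tuple per query; the study graded responses to 420 queries

for it in sorted({row[0] for row in graded}):
    counts = Counter(grade for i, _, grade in graded if i == it)
    n = sum(counts.values())
    line = ", ".join(f"{g}: {c}/{n} ({c / n:.0%})" for g, c in counts.most_common())
    print(f"iteration {it}: {line}")
```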

Results: ChatGPT's accuracy improved as prompts were refined, showing improvements of 33% and 23% between the initial and final iterations for extracting study settings and behavioral components, respectively. In the initial prompts, 26 (43.3%) of 60 ChatGPT responses were correct. However, in the final iteration, ChatGPT extracted 43 (71.7%) of the 60 data elements, showing better performance in extracting explicitly stated study settings (28/30, 93.3%) than in extracting subjective behavioral components (15/30, 50%). Nonetheless, the varying accuracy across measures highlighted its limitations.

Conclusions: Our findings underscore LLMs' utility in extracting basic as well as explicit data in SLRs by using effective prompts. However, the results reveal significant limitations in handling nuanced, subjective criteria, emphasizing the necessity for human oversight.

{"title":"Critical Assessment of Large Language Models' (ChatGPT) Performance in Data Extraction for Systematic Reviews: Exploratory Study.","authors":"Hesam Mahmoudi, Doris Chang, Hannah Lee, Navid Ghaffarzadegan, Mohammad S Jalali","doi":"10.2196/68097","DOIUrl":"10.2196/68097","url":null,"abstract":"<p><strong>Background: </strong>Systematic literature reviews (SLRs) are foundational for synthesizing evidence across diverse fields and are especially important in guiding research and practice in health and biomedical sciences. However, they are labor intensive due to manual data extraction from multiple studies. As large language models (LLMs) gain attention for their potential to automate research tasks and extract basic information, understanding their ability to accurately extract explicit data from academic papers is critical for advancing SLRs.</p><p><strong>Objective: </strong>Our study aimed to explore the capability of LLMs to extract both explicitly outlined study characteristics and deeper, more contextual information requiring nuanced evaluations, using ChatGPT (GPT-4).</p><p><strong>Methods: </strong>We screened the full text of a sample of COVID-19 modeling studies and analyzed three basic measures of study settings (ie, analysis location, modeling approach, and analyzed interventions) and three complex measures of behavioral components in models (ie, mobility, risk perception, and compliance). To extract data on these measures, two researchers independently extracted 60 data elements using manual coding and compared them with the responses from ChatGPT to 420 queries spanning 7 iterations.</p><p><strong>Results: </strong>ChatGPT's accuracy improved as prompts were refined, showing improvements of 33% and 23% between the initial and final iterations for extracting study settings and behavioral components, respectively. In the initial prompts, 26 (43.3%) of 60 ChatGPT responses were correct. However, in the final iteration, ChatGPT extracted 43 (71.7%) of the 60 data elements, showing better performance in extracting explicitly stated study settings (28/30, 93.3%) than in extracting subjective behavioral components (15/30, 50%). Nonetheless, the varying accuracy across measures highlighted its limitations.</p><p><strong>Conclusions: </strong>Our findings underscore LLMs' utility in extracting basic as well as explicit data in SLRs by using effective prompts. However, the results reveal significant limitations in handling nuanced, subjective criteria, emphasizing the necessity for human oversight.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e68097"},"PeriodicalIF":2.0,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12425462/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145042336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Diagnostic and Screening AI Tools in Brazil's Resource-Limited Settings: Systematic Review.
IF 2 Pub Date: 2025-09-10 DOI: 10.2196/69547
Leticia Medeiros Mancini, Luiz Eduardo Vanderlei Torres, Jorge Artur P de M Coelho, Nichollas Botelho da Fonseca, Pedro Fellipe Dantas Cordeiro, Samara Silva Noronha Cavalcante, Diego Dermeval

Background: Artificial intelligence (AI) has the potential to transform global health care, with extensive application in Brazil, particularly for diagnosis and screening.

Objective: This study aimed to conduct a systematic review to understand AI applications in Brazilian health care, with a focus on resource-constrained settings.

Methods: A systematic review was performed. The search strategy included the following databases: PubMed, Cochrane Library, Embase, Web of Science, LILACS, and SciELO. The search covered papers from 1993 to November 2023; 714 papers were initially identified, of which 25 were selected for the final sample. Meta-analysis data were evaluated based on three main metrics: area under the receiver operating characteristic curve, sensitivity, and specificity. A random effects model was applied for each metric to address between-study variability.
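For context, the sketch below implements one common random-effects estimator (DerSimonian-Laird on logit-transformed proportions) for pooling per-study sensitivities. The review does not specify its exact estimator or software, so this is an assumption, and the study counts are invented placeholders.

```python
import numpy as np

def pool_random_effects(events, totals):
    """Pool k study proportions with the DerSimonian-Laird estimator."""
    p = events / totals
    y = np.log(p / (1 - p))                  # logit-transformed proportions
    v = 1 / events + 1 / (totals - events)   # approximate within-study variances
    w = 1 / v
    y_fixed = (w * y).sum() / w.sum()        # fixed-effect pooled logit
    q = (w * (y - y_fixed) ** 2).sum()       # Cochran's Q heterogeneity statistic
    tau2 = max(0.0, (q - (len(y) - 1)) / (w.sum() - (w ** 2).sum() / w.sum()))
    w_star = 1 / (v + tau2)                  # random-effects weights
    y_pooled = (w_star * y).sum() / w_star.sum()
    return 1 / (1 + np.exp(-y_pooled))       # back-transform to a proportion

events = np.array([45, 80, 19, 60])          # true positives per study (hypothetical)
totals = np.array([50, 100, 25, 75])         # diseased cases per study (hypothetical)
print(f"pooled sensitivity: {pool_random_effects(events, totals):.4f}")
```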

Results: Key specialties for AI tools include ophthalmology and infectious disease, with a significant concentration of studies conducted in São Paulo state (13/25, 52%). All papers included testing to evaluate and validate the tools; however, only two conducted secondary testing with a different population. In terms of risk of bias, 10 of 25 (40%) papers had medium risk, 8 of 25 (32%) had low risk, and 7 of 25 (28%) had high risk. Most studies were public initiatives, totaling 17 of 25 (68%), while 5 of 25 (20%) were private. In limited-income countries like Brazil, minimum technological requirements for implementing AI in health care must be carefully considered due to financial limitations and often insufficient technological infrastructure. Of the papers reviewed, 19 of 25 (76%) used computers, and 18 of 25 (72%) required the Windows operating system. The most used AI algorithm was machine learning (11/25, 44%). The combined sensitivity was 0.8113, the combined specificity was 0.7417, and the combined area under the receiver operating characteristic curve was 0.8308, all with P<.001.

Conclusions: There is a relative balance in the use of both diagnostic and screening tools, with widespread application across Brazil in varied contexts. The need for secondary testing highlights opportunities for future research.

{"title":"Diagnostic and Screening AI Tools in Brazil's Resource-Limited Settings: Systematic Review.","authors":"Leticia Medeiros Mancini, Luiz Eduardo Vanderlei Torres, Jorge Artur P de M Coelho, Nichollas Botelho da Fonseca, Pedro Fellipe Dantas Cordeiro, Samara Silva Noronha Cavalcante, Diego Dermeval","doi":"10.2196/69547","DOIUrl":"10.2196/69547","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) has the potential to transform global health care, with extensive application in Brazil, particularly for diagnosis and screening.</p><p><strong>Objective: </strong>This study aimed to conduct a systematic review to understand AI applications in Brazilian health care, especially focusing on the resource-constrained environments.</p><p><strong>Methods: </strong>A systematic review was performed. The search strategy included the following databases: PubMed, Cochrane Library, Embase, Web of Science, LILACS, and SciELO. The search covered papers from 1993 to November 2023, with an initial overview of 714 papers found, of which 25 papers were selected for the final sample. Meta-analysis data were evaluated based on three main metrics: area under the receiver operating characteristic curve, sensitivity, and specificity. A random effects model was applied for each metric to address study variability.</p><p><strong>Results: </strong>Key specialties for AI tools include ophthalmology and infectious disease, with a significant concentration of studies conducted in São Paulo state (13/25, 52%). All papers included testing to evaluate and validate the tools; however, only two conducted secondary testing with a different population. In terms of risk of bias, 10 of 25 (40%) papers had medium risk, 8 of 25 (32%) had low risk, and 7 of 25 (28%) had high risk. Most studies were public initiatives, totaling 17 of 25 (68%), while 5 of 25 (20%) were private. In limited-income countries like Brazil, minimum technological requirements for implementing AI in health care must be carefully considered due to financial limitations and often insufficient technological infrastructure. Of the papers reviewed, 19 of 25 (76%) used computers, and 18 of 25 (72%) required the Windows operating system. The most used AI algorithm was machine learning (11/25, 44%). The combined sensitivity was 0.8113, the combined specificity was 0.7417, and the combined area under the receiver operating characteristic curve was 0.8308, all with P<.001.</p><p><strong>Conclusions: </strong>There is a relative balance in the use of both diagnostic and screening tools, with widespread application across Brazil in varied contexts. The need for secondary testing highlights opportunities for future research.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e69547"},"PeriodicalIF":2.0,"publicationDate":"2025-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12422524/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145034859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Exploring Named Entity Recognition Potential and the Value of Tailored Natural Language Processing Pipelines for Radiology, Pathology, and Progress Notes in Clinical Decision Support: Quantitative Study.
IF 2 Pub Date: 2025-09-05 DOI: 10.2196/59251
Veysel Kocaman, Fu-Yuan Cheng, Julio Bonis, Ganesh Raut, Prem Timsina, David Talby, Arash Kia

Background: Clinical notes house rich, yet unstructured, patient data, and the ambiguity caused by medical jargon, abbreviations, and synonyms makes their analysis challenging. This complicates real-time extraction for decision support tools.

Objective: This study aimed to examine the data curation, technology, and workflow of the named entity recognition (NER) pipeline, a component of a broader clinical decision support tool that identifies key entities using NER models and classifies these entities as present or absent in the patient through an NER assertion model.

Methods: We gathered progress care, radiology, and pathology notes from 5000 patients, dividing them into 5 batches of 1000 patients each. Metrics such as notes and reports per patient, sentence count, token size, runtime, and central processing unit and memory use were measured per note type. We also evaluated the precision of the NER outputs and then the precision and recall of the NER assertion models against manual annotations by a clinical expert.
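A skeletal version of such a Spark NLP pipeline is sketched below using public, general-domain pretrained models ("glove_100d" embeddings and the "ner_dl" tagger) as stand-ins; the clinical NER and assertion models used in the study ship with the licensed healthcare edition, so only the pipeline shape here is representative, not the model choice.

```python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (
    SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel, NerConverter,
)

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
# Public general-domain stand-ins; the study used clinical pretrained models.
embeddings = (WordEmbeddingsModel.pretrained("glove_100d")
              .setInputCols(["sentence", "token"]).setOutputCol("embeddings"))
ner = (NerDLModel.pretrained("ner_dl")
       .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner"))
chunks = NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document, sentence, token, embeddings, ner, chunks])
notes = spark.createDataFrame([("Chest x-ray shows no acute infiltrate.",)], ["text"])
pipeline.fit(notes).transform(notes) \
    .selectExpr("explode(ner_chunk.result) AS entity").show(truncate=False)
```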

Results: Using Spark natural language processing clinical pretrained NER models on 138,250 clinical notes, we observed excellent NER precision, peaking at 0.989 (95% CI 0.977-1.000) for procedures, and an assertion model accuracy of 0.889 (95% CI 0.856-0.922). Our analysis highlighted long-tail distributions in notes per patient, note length, and entity density. Progress care notes had notably more entities per sentence than radiology and pathology notes, showing 4-fold and 16-fold differences, respectively.

Conclusions: Further research should explore the analysis of clinical notes beyond the scope of our study, including discharge summaries and psychiatric evaluation notes. Recognizing the unique linguistic characteristics of different note types underscores the importance of developing specialized NER models or natural language processing pipeline setups tailored to each type. By doing so, we can enhance their performance across a more diverse range of clinical scenarios.

{"title":"Exploring Named Entity Recognition Potential and the Value of Tailored Natural Language Processing Pipelines for Radiology, Pathology, and Progress Notes in Clinical Decision Support: Quantitative Study.","authors":"Veysel Kocaman, Fu-Yuan Cheng, Julio Bonis, Ganesh Raut, Prem Timsina, David Talby, Arash Kia","doi":"10.2196/59251","DOIUrl":"10.2196/59251","url":null,"abstract":"<p><strong>Background: </strong>Clinical notes house rich, yet unstructured, patient data, making analysis challenging due to medical jargon, abbreviations, and synonyms causing ambiguity. This complicates real-time extraction for decision support tools.</p><p><strong>Objective: </strong>This study aimed to examine the data curation, technology, and workflow of the named entity recognition (NER) pipeline, a component of a broader clinical decision support tool that identifies key entities using NER models and classifies these entities as present or absent in the patient through an NER assertion model.</p><p><strong>Methods: </strong>We gathered progress care, radiology, and pathology notes from 5000 patients, dividing them into 5 batches of 1000 patients each. Metrics such as notes and reports per patient, sentence count, token size, runtime, central processing unit, and memory use were measured per note type. We also evaluated the precision of the NER outputs and then the precision and recall of NER assertion models against manual annotations by a clinical expert.</p><p><strong>Results: </strong>Using Spark natural language processing clinical pretrained NER models on 138,250 clinical notes, we observed excellent NER precision, with a peak in procedures at 0.989 (95% CI 0.977-1.000) and an accuracy in the assertion model of 0.889 (95% CI 0.856-0.922). Our analysis highlighted long-tail distributions in notes per patient, note length, and entity density. Progress care notes had notably more entities per sentence than radiology and pathology notes, showing 4-fold and 16-fold differences, respectively.</p><p><strong>Conclusions: </strong>Further research should explore the analysis of clinical notes beyond the scope of our study, including discharge summaries and psychiatric evaluation notes. Recognizing the unique linguistic characteristics of different note types underscores the importance of developing specialized NER models or natural language processing pipeline setups tailored to each type. By doing so, we can enhance their performance across a more diverse range of clinical scenarios.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e59251"},"PeriodicalIF":2.0,"publicationDate":"2025-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12449662/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145006989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Evaluation of AI Tools Versus the PRISMA Method for Literature Search, Data Extraction, and Study Composition in Glaucoma Systematic Reviews: Content Analysis.
IF 2 Pub Date: 2025-09-05 DOI: 10.2196/68592
Laura Antonia Meliante, Giulia Coco, Alessandro Rabiolo, Stefano De Cillà, Gianluca Manni

Background: Artificial intelligence (AI) is becoming increasingly popular in the scientific field, as it can analyze extensive datasets, summarize results, and assist in writing academic papers.

Objective: This study investigates the role of AI in the process of conducting a systematic literature review (SLR), focusing on its contributions and limitations at three key stages: study selection, data extraction, and study composition. Glaucoma-related SLRs serve as case studies, with Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA)-based SLRs as benchmarks.

Methods: Four AI platforms were tested on their ability to reproduce four PRISMA-based, glaucoma-related SLRs. We used Connected Papers and Elicit to search for relevant records; then we assessed Elicit's and ChatPDF's ability to extract and organize the information contained in the retrieved records. Finally, we tested Jenni AI's capacity to compose an SLR.

Results: Neither Connected Papers nor Elicit provided the totality of the results found using the PRISMA method. On average, data extracted from Elicit were accurate in 51.40% (SD 31.45%) of cases and imprecise in 13.69% (SD 17.98%); 22.37% (SD 27.54%) of responses were missing, while 12.51% (SD 14.70%) were incorrect. Data extracted from ChatPDF were accurate in 60.33% (SD 30.72%) of cases and imprecise in 7.41% (SD 13.88%); 17.56% (SD 20.02%) of responses were missing, and 14.70% (SD 17.72%) were incorrect. Jenni AI's generated content exhibited satisfactory language fluency and technical proficiency but was insufficient in defining methods, elaborating results, and stating conclusions.

Conclusions: The PRISMA method continues to exhibit clear superiority in terms of reproducibility and accuracy during the literature search, data extraction, and study composition phases of the SLR writing process. While AI can save time and assist with repetitive tasks, the active participation of the researcher throughout the entire process is still crucial to maintain control over the quality, accuracy, and objectivity of their work.

{"title":"Evaluation of AI Tools Versus the PRISMA Method for Literature Search, Data Extraction, and Study Composition in Glaucoma Systematic Reviews: Content Analysis.","authors":"Laura Antonia Meliante, Giulia Coco, Alessandro Rabiolo, Stefano De Cillà, Gianluca Manni","doi":"10.2196/68592","DOIUrl":"10.2196/68592","url":null,"abstract":"<p><strong>Background: </strong>Artificial intelligence (AI) is becoming increasingly popular in the scientific field, as it allows for the analysis of extensive datasets, summarizes results, and assists in writing academic papers.</p><p><strong>Objective: </strong>This study investigates the role of AI in the process of conducting a systematic literature review (SLR), focusing on its contributions and limitations at three key stages of its development, study selection, data extraction, and study composition, using glaucoma-related SLRs as case studies and Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA)-based SLRs as benchmarks.</p><p><strong>Methods: </strong>Four AI platforms were tested on their ability to reproduce four PRISMA-based, glaucoma-related SLRs. We used Connected Papers and Elicit to perform research of relevant records; then we assessed Elicit and ChatPDF's ability to extract and organize information contained in the retrieved records. Finally, we tested Jenni AI's capacity to compose an SLR.</p><p><strong>Results: </strong>Neither Connected Papers nor Elicit provided the totality of the results found using the PRISMA method. On average, data extracted from Elicit were accurate in 51.40% (SD 31.45%) of cases and imprecise in 13.69% (SD 17.98%); 22.37% (SD 27.54%) of responses were missing, while 12.51% (SD 14.70%) were incorrect. Data extracted from ChatPDF were accurate in 60.33% (SD 30.72%) of cases and imprecise in 7.41% (SD 13.88%); 17.56% (SD 20.02%) of responses were missing, and 14.70% (SD 17.72%) were incorrect. Jenni AI's generated content exhibited satisfactory language fluency and technical proficiency but was insufficient in defining methods, elaborating results, and stating conclusions.</p><p><strong>Conclusions: </strong>The PRISMA method continues to exhibit clear superiority in terms of reproducibility and accuracy during the literature search, data extraction, and study composition phases of the SLR writing process. While AI can save time and assist with repetitive tasks, the active participation of the researcher throughout the entire process is still crucial to maintain control over the quality, accuracy, and objectivity of their work.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e68592"},"PeriodicalIF":2.0,"publicationDate":"2025-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12413140/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145006987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
AI-Driven Tacrolimus Dosing in Transplant Care: Cohort Study.
IF 2 Pub Date: 2025-09-02 DOI: 10.2196/67302
Mingjia Huo, Sean Perez, Linda Awdishu, Janice S Kerr, Pengtao Xie, Adnan Khan, Kristin Mekeel, Shamim Nemati

Background: Tacrolimus forms the backbone of immunosuppressive therapy in solid organ transplantation, requiring precise dosing due to its narrow therapeutic range. Maintaining therapeutic tacrolimus levels in the postoperative period is challenging due to diverse patient characteristics, donor organ factors, drug interactions, and evolving perioperative physiology.

Objective: The aim of this study is to design a machine learning model to predict next-day tacrolimus trough concentrations (C0) and guide dosing to prevent persistent under- or overdosing.

Methods: We used retrospective data from 1597 adult recipients of kidney and liver transplants at UC San Diego Health to develop a long short-term memory (LSTM) model to predict next-day tacrolimus C0 in an inpatient setting. Predictors included transplant type, demographics, comorbidities, vital signs, laboratory parameters, ordered diet, and medications. Permutation feature importance was evaluated for the model. We further implemented a classification task to evaluate the model's ability to identify underdosing, therapeutic dosing, and overdosing. Finally, we generated next-day dose recommendations that would achieve tacrolimus C0 within the target ranges.
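As a rough illustration of the model class, the sketch below wires a PyTorch LSTM over daily per-patient feature vectors to a regression head for the next-day trough. The layer sizes, sequence length, feature count, and synthetic batch are assumptions, since the abstract does not disclose architecture details.

```python
import torch
import torch.nn as nn

class TroughLSTM(nn.Module):
    """Regress next-day tacrolimus C0 from a sequence of daily features."""

    def __init__(self, n_features: int, hidden_size: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)          # x: (batch, days, n_features)
        return self.head(out[:, -1])   # predict from the last day's hidden state

model = TroughLSTM(n_features=32)
x = torch.randn(16, 7, 32)             # 16 patients x 7 inpatient days x 32 features
target = torch.rand(16) * 15           # toy trough values in ng/mL
mae = nn.L1Loss()(model(x).squeeze(-1), target)  # MAE, the metric reported
print(f"toy MAE: {mae.item():.3f} ng/mL")
```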

Results: The LSTM model provided a mean absolute error of 1.880 ng/mL when predicting next-day tacrolimus C0. Top predictive features included the recent tacrolimus C0, tacrolimus doses, transplant organ type, diet, and interactive drugs. When predicting underdosing, therapeutic dosing, and overdosing using a 3-class classification task, the model achieved a microaverage F1-score of 0.653. For dose recommendations, the best clinical outcomes were achieved when the actual total daily dose closely aligned with the model's recommended dose (within 3 mg).
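One way to turn such a predictor into dose recommendations, consistent with the abstract's description, is to sweep candidate total daily doses, rerun the model with the dose feature substituted, and keep doses whose predicted next-day C0 lands in the target range. In the sketch below, predict_c0, the 0.5 mg dose grid, and the 8-12 ng/mL target range are all hypothetical.

```python
import numpy as np

def recommend_dose(history, predict_c0, target=(8.0, 12.0)):
    """Return (dose, predicted C0) pairs whose prediction hits the target range."""
    candidates = np.arange(0.5, 12.5, 0.5)        # total daily doses in mg
    fits = []
    for dose in candidates:
        c0 = predict_c0(history, next_dose=dose)  # rerun the model, dose swapped in
        if target[0] <= c0 <= target[1]:
            fits.append((float(dose), float(c0)))
    return fits

def toy_predictor(history, next_dose):
    """Stand-in for the trained LSTM: C0 roughly proportional to dose."""
    return 1.1 * next_dose

print(recommend_dose(history=None, predict_c0=toy_predictor))
```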

Conclusions: Ours is one of the largest studies to apply artificial intelligence to tacrolimus dosing, and our LSTM model effectively predicts tacrolimus C0 and could potentially guide accurate dose recommendations. Further prospective studies are needed to evaluate the model's performance in real-world dose adjustments.

{"title":"AI-Driven Tacrolimus Dosing in Transplant Care: Cohort Study.","authors":"Mingjia Huo, Sean Perez, Linda Awdishu, Janice S Kerr, Pengtao Xie, Adnan Khan, Kristin Mekeel, Shamim Nemati","doi":"10.2196/67302","DOIUrl":"10.2196/67302","url":null,"abstract":"<p><strong>Background: </strong>Tacrolimus forms the backbone of immunosuppressive therapy in solid organ transplantation, requiring precise dosing due to its narrow therapeutic range. Maintaining therapeutic tacrolimus levels in the postoperative period is challenging due to diverse patient characteristics, donor organ factors, drug interactions, and evolving perioperative physiology.</p><p><strong>Objective: </strong>The aim of this study is to design a machine learning model to predict the next-day tacrolimus trough concentrations (C0) and guide dosing to prevent persistent under- or overdosing.</p><p><strong>Methods: </strong>We used retrospective data from 1597 adult recipients of kidney and liver transplants at UC San Diego Health to develop a long short-term memory (LSTM) model to predict next-day tacrolimus C0 in an inpatient setting. Predictors included transplant type, demographics, comorbidities, vital signs, laboratory parameters, ordered diet, and medications. Permutation feature importance was evaluated for the model. We further implemented a classification task to evaluate the model's ability to identify underdosing, therapeutic dosing, and overdosing. Finally, we generated next-day dose recommendations that would achieve tacrolimus C0 within the target ranges.</p><p><strong>Results: </strong>The LSTM model provided a mean absolute error of 1.880 ng/mL when predicting next-day tacrolimus C0. Top predictive features included the recent tacrolimus C0, tacrolimus doses, transplant organ type, diet, and interactive drugs. When predicting underdosing, therapeutic dosing, and overdosing using a 3-class classification task, the model achieved a microaverage F1-score of 0.653. For dose recommendations, the best clinical outcomes were achieved when the actual total daily dose closely aligned with the model's recommended dose (within 3 mg).</p><p><strong>Conclusions: </strong>Ours is one of the largest studies to apply artificial intelligence to tacrolimus dosing, and our LSTM model effectively predicts tacrolimus C0 and could potentially guide accurate dose recommendations. Further prospective studies are needed to evaluate the model's performance in real-world dose adjustments.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e67302"},"PeriodicalIF":2.0,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12404564/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Medical Expert Knowledge Meets AI to Enhance Symptom Checker Performance for Rare Disease Identification in Fabry Disease: Mixed Methods Study.
IF 2 Pub Date: 2025-08-28 DOI: 10.2196/55001
Anne Pankow, Nico Meißner-Bendzko, Jessica Kaufeld, Laura Fouquette, Fabienne Cotte, Stephen Gilbert, Ewelina Türk, Anibh Das, Christoph Terkamp, Gerhard-Rüdiger Burmester, Annette Doris Wagner
<p><strong>Background: </strong>Rare diseases, which affect millions of people worldwide, pose a major challenge, as it often takes years before an accurate diagnosis can be made. This delay results in substantial burdens for patients and health care systems, as misdiagnoses lead to inadequate treatment and increased costs. Artificial intelligence (AI)-powered symptom checkers (SCs) present an opportunity to flag rare diseases earlier in the diagnostic work-up. However, these tools are primarily based on published literature, which often contains incomplete data on rare diseases, resulting in compromised diagnostic accuracy. Integrating expert interview insights into SC models may enhance their performance, ensuring that rare diseases are considered sooner and diagnosed more accurately.</p><p><strong>Objective: </strong>The objectives of our study were to incorporate expert interview vignettes into AI-powered SCs, in addition to a traditional literature review, and to evaluate whether this novel approach improves diagnostic accuracy and user satisfaction for rare diseases, focusing on Fabry disease.</p><p><strong>Methods: </strong>This mixed methods prospective pilot study was conducted at Hannover Medical School, Germany. In the first phase, guided interviews were conducted with medical experts specialized in Fabry disease to create clinical vignettes that enriched the AI SC's Fabry disease model. In the second phase, adult patients with a confirmed diagnosis of Fabry disease used both the original and optimized SC versions in a randomized order. The versions, containing either the original or the optimized Fabry disease model, were evaluated based on diagnostic accuracy and user satisfaction, which were assessed through questionnaires.</p><p><strong>Results: </strong>Three medical experts with extensive experience in lysosomal storage disorder Fabry disease contributed to the creation of 5 clinical vignettes, which were integrated into the AI-powered SC. The study compared the original and optimized SC versions in 6 patients with Fabry disease. The optimized version improved diagnostic accuracy, with Fabry disease identified as the top suggestion in 33% (2/6) of cases, compared to 17% (1/6) with the original model. Additionally, overall user satisfaction was higher for the optimized version, with participants rating it more favorably in terms of symptom coverage and completeness.</p><p><strong>Conclusions: </strong>This study demonstrates that integrating expert-derived clinical vignettes into AI-powered SCs can improve diagnostic accuracy and user satisfaction, particularly for rare diseases. The optimized SC version, which incorporated these vignettes, showed improved performance in identifying Fabry disease as a top diagnostic suggestion and received higher user satisfaction ratings compared to the original version. To fully realize the potential of this approach, it is crucial to include vignettes representing atypical presentations and to
背景:影响全世界数百万人的罕见疾病是一项重大挑战,因为往往需要数年时间才能做出准确诊断。由于误诊导致治疗不足和费用增加,这种延误给患者和卫生保健系统带来了沉重负担。人工智能(AI)驱动的症状检查器(SCs)提供了在诊断工作中早期标记罕见疾病的机会。然而,这些工具主要基于已发表的文献,这些文献通常包含罕见疾病的不完整数据,导致诊断准确性受到损害。将专家访谈的见解整合到SC模型中可以提高它们的性能,确保更快地考虑罕见疾病并更准确地诊断。目的:本研究的目的是在传统的文献综述之外,将专家访谈视频纳入人工智能驱动的SCs,并评估这种新方法是否提高了罕见病的诊断准确性和用户满意度,重点是法布里病。方法:这项混合方法前瞻性先导研究在德国汉诺威医学院进行。在第一阶段,与专门研究法布里病的医学专家进行了指导访谈,以创建临床小片段,丰富人工智能SC的法布里病模型。在第二阶段,确诊为Fabry病的成年患者以随机顺序使用原始和优化的SC版本。包含原始或优化法布里疾病模型的版本,根据诊断准确性和用户满意度进行评估,并通过问卷进行评估。结果:3位在法布里病溶酶体贮积障碍方面经验丰富的医学专家参与了5个临床小片段的创建,这些小片段被整合到人工智能驱动的SC中,研究比较了6例法布里病患者的原始版本和优化版本。优化后的版本提高了诊断的准确性,33%(2/6)的病例将Fabry病确定为最高建议,而原始模型为17%(1/6)。此外,优化版本的总体用户满意度更高,参与者在症状覆盖和完整性方面对其进行了更有利的评价。结论:本研究表明,将专家衍生的临床小插曲整合到人工智能驱动的SCs中可以提高诊断的准确性和用户满意度,特别是对于罕见疾病。与原始版本相比,优化后的SC版本在将法布里病识别为最高诊断建议方面表现出更好的性能,并获得了更高的用户满意度评级。为了充分实现这种方法的潜力,包括代表非典型表现的小插曲和进行更大规模的研究来验证这些发现是至关重要的。
{"title":"Medical Expert Knowledge Meets AI to Enhance Symptom Checker Performance for Rare Disease Identification in Fabry Disease: Mixed Methods Study.","authors":"Anne Pankow, Nico Meißner-Bendzko, Jessica Kaufeld, Laura Fouquette, Fabienne Cotte, Stephen Gilbert, Ewelina Türk, Anibh Das, Christoph Terkamp, Gerhard-Rüdiger Burmester, Annette Doris Wagner","doi":"10.2196/55001","DOIUrl":"10.2196/55001","url":null,"abstract":"&lt;p&gt;&lt;strong&gt;Background: &lt;/strong&gt;Rare diseases, which affect millions of people worldwide, pose a major challenge, as it often takes years before an accurate diagnosis can be made. This delay results in substantial burdens for patients and health care systems, as misdiagnoses lead to inadequate treatment and increased costs. Artificial intelligence (AI)-powered symptom checkers (SCs) present an opportunity to flag rare diseases earlier in the diagnostic work-up. However, these tools are primarily based on published literature, which often contains incomplete data on rare diseases, resulting in compromised diagnostic accuracy. Integrating expert interview insights into SC models may enhance their performance, ensuring that rare diseases are considered sooner and diagnosed more accurately.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Objective: &lt;/strong&gt;The objectives of our study were to incorporate expert interview vignettes into AI-powered SCs, in addition to a traditional literature review, and to evaluate whether this novel approach improves diagnostic accuracy and user satisfaction for rare diseases, focusing on Fabry disease.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Methods: &lt;/strong&gt;This mixed methods prospective pilot study was conducted at Hannover Medical School, Germany. In the first phase, guided interviews were conducted with medical experts specialized in Fabry disease to create clinical vignettes that enriched the AI SC's Fabry disease model. In the second phase, adult patients with a confirmed diagnosis of Fabry disease used both the original and optimized SC versions in a randomized order. The versions, containing either the original or the optimized Fabry disease model, were evaluated based on diagnostic accuracy and user satisfaction, which were assessed through questionnaires.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Results: &lt;/strong&gt;Three medical experts with extensive experience in lysosomal storage disorder Fabry disease contributed to the creation of 5 clinical vignettes, which were integrated into the AI-powered SC. The study compared the original and optimized SC versions in 6 patients with Fabry disease. The optimized version improved diagnostic accuracy, with Fabry disease identified as the top suggestion in 33% (2/6) of cases, compared to 17% (1/6) with the original model. Additionally, overall user satisfaction was higher for the optimized version, with participants rating it more favorably in terms of symptom coverage and completeness.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Conclusions: &lt;/strong&gt;This study demonstrates that integrating expert-derived clinical vignettes into AI-powered SCs can improve diagnostic accuracy and user satisfaction, particularly for rare diseases. The optimized SC version, which incorporated these vignettes, showed improved performance in identifying Fabry disease as a top diagnostic suggestion and received higher user satisfaction ratings compared to the original version. 
To fully realize the potential of this approach, it is crucial to include vignettes representing atypical presentations and to ","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e55001"},"PeriodicalIF":2.0,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12392689/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
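The headline metric in the Fabry study above is top-suggestion accuracy: the share of cases in which the confirmed diagnosis appears first in the symptom checker's ranked differential (2/6 for the optimized model vs 1/6 for the original). A small sketch of that computation follows; the case lists and diagnosis labels are hypothetical, not the study's data.

```python
# A sketch of the top-k suggestion metric: given each case's ranked
# differential from a symptom checker, count how often the confirmed
# diagnosis appears within the top k suggestions. Example data is made up.
from typing import List

def top_k_accuracy(ranked_suggestions: List[List[str]], truth: str, k: int = 1) -> float:
    hits = sum(truth in case[:k] for case in ranked_suggestions)
    return hits / len(ranked_suggestions)

# Six hypothetical cases mirroring the reported 2/6 top-1 rate
cases = [
    ["Fabry disease", "multiple sclerosis"],
    ["rheumatoid arthritis", "Fabry disease"],
    ["Fabry disease", "lupus"],
    ["fibromyalgia", "small fiber neuropathy"],
    ["multiple sclerosis", "Fabry disease"],
    ["lupus", "rheumatoid arthritis"],
]
print(top_k_accuracy(cases, "Fabry disease", k=1))  # 0.33 (2/6 = 33%)
```

With samples this small (n=6), single cases swing the rate by 17 percentage points, which is why the abstract itself calls for larger validation studies.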
Predicting Episodes of Hypovigilance in Intensive Care Units Using Routine Physiological Parameters and Artificial Intelligence: Derivation Study. 利用常规生理参数和人工智能预测重症监护病房低警觉性发作:衍生研究。
IF 2 Pub Date : 2025-08-27 DOI: 10.2196/60885
Raphaëlle Giguère, Victor Niaussat, Monia Noël-Hunter, William Witteman, Tanya S Paul, Alexandre Marois, Philippe Després, Simon Duchesne, Patrick M Archambault
<p><strong>Background: </strong>Delirium is prevalent in intensive care units (ICUs), often leading to adverse outcomes. Hypoactive delirium is particularly difficult to detect. Despite the development of new tools, the timely identification of hypoactive delirium remains clinically challenging due to its dynamic nature, lack of human resources, lack of reliable monitoring tools, and subtle clinical signs including hypovigilance. Machine learning models could support the identification of hypoactive delirium episodes by better detecting episodes of hypovigilance.</p><p><strong>Objective: </strong>Develop an artificial intelligence prediction model capable of detecting hypovigilance events using routinely collected physiological data in the ICU.</p><p><strong>Methods: </strong>This derivation study was conducted using data from a prospective observational cohort of eligible patients admitted to the ICU in Lévis, Québec, Canada. We included patients admitted to the ICU between October 2021 and June 2022 who were aged ≥18 years and had an anticipated ICU stay of ≥48 hours. ICU nurses identified hypovigilant states every hour using the Richmond Agitation and Sedation Scale (RASS) or the Ramsay Sedation Scale (RSS). Routine vital signs (heart rate, respiratory rate, blood pressure, and oxygen saturation), as well as other physiological and clinical variables (premature ventricular contractions, intubation, use of sedative medication, and temperature), were automatically collected and stored using a CARESCAPE Gateway (General Electric) or manually collected (for sociodemographic characteristics and medication) through chart review. Time series were generated around hypovigilance episodes for analysis. Random Forest, XGBoost, and Light Gradient Boosting Machine classifiers were then used to detect hypovigilant episodes based on time series analysis. Hyperparameter optimization was performed using a random search in a 10-fold group-based cross-validation setup. To interpret the predictions of the best-performing models, we conducted a Shapley Additive Explanations (SHAP) analysis. We report the results of this study using the TRIPOD+AI (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis for machine learning models) guidelines, and potential biases were assessed using PROBAST (Prediction model Risk Of Bias ASsessment Tool).</p><p><strong>Results: </strong>Out of 136 potentially eligible participants, data from 30 patients (mean age 69 y, 63% male) were collected for analysis. Among all participants, 30% were admitted to the ICU for surgical reasons. Following data preprocessing, the study included 1493 hypovigilance episodes and 764 nonhypovigilant episodes. Among the 3 models evaluated, Light Gradient Boosting Machine demonstrated the best performance. It achieved an average accuracy of 68% to detect hypovigilant episodes, with a precision of 76%, a recall of 74%, an area under the curve (AUC) of 60%, and an F
背景:谵妄常见于重症监护病房(icu),常导致不良后果。低活动性谵妄尤其难以发现。尽管开发了新的工具,但由于其动态性,缺乏人力资源,缺乏可靠的监测工具以及包括警惕性低下在内的微妙临床症状,及时识别低活动性谵妄在临床上仍然具有挑战性。机器学习模型可以通过更好地检测低警觉性发作来支持识别低活性谵妄发作。目的:建立一种人工智能预测模型,利用常规收集的ICU生理数据检测低警觉性事件。方法:本衍生研究的数据来自加拿大魁地省魁地省的一项符合条件的ICU患者的前瞻性观察队列。我们纳入了2021年10月至2022年6月期间入住ICU的患者,年龄≥18岁,预计ICU住院时间≥48小时。ICU护士每小时使用Richmond躁动与镇静量表(RASS)或Ramsay镇静量表(RSS)识别低警惕性状态。常规生命体征(心率、呼吸频率、血压和血氧饱和度)以及其他生理和临床变量(室性早搏、插管、镇静药物使用和体温)均使用CARESCAPE Gateway(通用电气)自动收集和存储,或通过图表审查手动收集(用于社会人口统计学特征和药物)。时间序列是围绕低警惕性发作生成的,用于分析。然后使用随机森林、XGBoost和光梯度增强机分类器来检测基于时间序列分析的低警惕性事件。超参数优化使用随机搜索在10倍组为基础的交叉验证设置。为了解释表现最好的模型的预测,我们进行了Shapley加性解释(SHAP)分析。我们使用TRIPOD+AI(透明报告个体预后或机器学习模型诊断的多变量预测模型)指南报告了本研究的结果,并使用PROBAST(预测模型偏倚风险评估工具)评估了潜在的偏倚。结果:在136名可能符合条件的参与者中,收集了30名患者(平均年龄69岁,63%为男性)的数据进行分析。在所有参与者中,30%因手术原因入住ICU。数据预处理后,研究包括1493次低警惕性发作和764次非低警惕性发作。在评估的3种模型中,光梯度增强机表现出最好的性能。它检测低警惕性发作的平均准确率为68%,准确率为76%,召回率为74%,曲线下面积(AUC)为60%,f1评分为69%。SHAP分析显示,插管状态、呼吸频率和无创收缩压是模型预测的主要驱动因素。结论:所有分类器都产生了精度和召回值,显示出进一步发展的潜力,在分类低警惕性发作方面表现略有不同,但具有可比性。设计用于检测低警觉性的机器学习算法有可能支持ICU患者低活性谵妄的早期检测。
{"title":"Predicting Episodes of Hypovigilance in Intensive Care Units Using Routine Physiological Parameters and Artificial Intelligence: Derivation Study.","authors":"Raphaëlle Giguère, Victor Niaussat, Monia Noël-Hunter, William Witteman, Tanya S Paul, Alexandre Marois, Philippe Després, Simon Duchesne, Patrick M Archambault","doi":"10.2196/60885","DOIUrl":"10.2196/60885","url":null,"abstract":"&lt;p&gt;&lt;strong&gt;Background: &lt;/strong&gt;Delirium is prevalent in intensive care units (ICUs), often leading to adverse outcomes. Hypoactive delirium is particularly difficult to detect. Despite the development of new tools, the timely identification of hypoactive delirium remains clinically challenging due to its dynamic nature, lack of human resources, lack of reliable monitoring tools, and subtle clinical signs including hypovigilance. Machine learning models could support the identification of hypoactive delirium episodes by better detecting episodes of hypovigilance.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Objective: &lt;/strong&gt;Develop an artificial intelligence prediction model capable of detecting hypovigilance events using routinely collected physiological data in the ICU.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Methods: &lt;/strong&gt;This derivation study was conducted using data from a prospective observational cohort of eligible patients admitted to the ICU in Lévis, Québec, Canada. We included patients admitted to the ICU between October 2021 and June 2022 who were aged ≥18 years and had an anticipated ICU stay of ≥48 hours. ICU nurses identified hypovigilant states every hour using the Richmond Agitation and Sedation Scale (RASS) or the Ramsay Sedation Scale (RSS). Routine vital signs (heart rate, respiratory rate, blood pressure, and oxygen saturation), as well as other physiological and clinical variables (premature ventricular contractions, intubation, use of sedative medication, and temperature), were automatically collected and stored using a CARESCAPE Gateway (General Electric) or manually collected (for sociodemographic characteristics and medication) through chart review. Time series were generated around hypovigilance episodes for analysis. Random Forest, XGBoost, and Light Gradient Boosting Machine classifiers were then used to detect hypovigilant episodes based on time series analysis. Hyperparameter optimization was performed using a random search in a 10-fold group-based cross-validation setup. To interpret the predictions of the best-performing models, we conducted a Shapley Additive Explanations (SHAP) analysis. We report the results of this study using the TRIPOD+AI (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis for machine learning models) guidelines, and potential biases were assessed using PROBAST (Prediction model Risk Of Bias ASsessment Tool).&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Results: &lt;/strong&gt;Out of 136 potentially eligible participants, data from 30 patients (mean age 69 y, 63% male) were collected for analysis. Among all participants, 30% were admitted to the ICU for surgical reasons. Following data preprocessing, the study included 1493 hypovigilance episodes and 764 nonhypovigilant episodes. Among the 3 models evaluated, Light Gradient Boosting Machine demonstrated the best performance. 
It achieved an average accuracy of 68% to detect hypovigilant episodes, with a precision of 76%, a recall of 74%, an area under the curve (AUC) of 60%, and an F","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e60885"},"PeriodicalIF":2.0,"publicationDate":"2025-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12384691/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
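The hypovigilance study's evaluation hinges on group-based cross-validation: all windows from one patient stay in the same fold, so the classifier is always tested on patients it has never seen, preventing within-patient leakage from inflating the scores. A minimal sketch with LightGBM and scikit-learn's GroupKFold follows; the synthetic features, labels, and hyperparameters are assumptions for illustration, and on random data the metrics will hover near chance rather than reproduce the paper's numbers.

```python
# A sketch of group-based 10-fold cross-validation for a LightGBM
# hypovigilance classifier. Data is synthetic; episode and patient counts
# loosely mirror the abstract (1493 + 764 episodes, 30 patients).
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(42)
X = rng.normal(size=(2257, 12))               # 12 assumed time-series features
y = rng.integers(0, 2, size=2257)             # 1 = hypovigilant window (synthetic)
groups = rng.integers(0, 30, size=2257)       # patient ID per window, 30 patients

scores = {"acc": [], "prec": [], "rec": [], "f1": [], "auc": []}
for train_idx, test_idx in GroupKFold(n_splits=10).split(X, y, groups):
    model = LGBMClassifier(n_estimators=200, learning_rate=0.05)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    proba = model.predict_proba(X[test_idx])[:, 1]
    scores["acc"].append(accuracy_score(y[test_idx], pred))
    scores["prec"].append(precision_score(y[test_idx], pred))
    scores["rec"].append(recall_score(y[test_idx], pred))
    scores["f1"].append(f1_score(y[test_idx], pred))
    scores["auc"].append(roc_auc_score(y[test_idx], proba))

print({k: round(float(np.mean(v)), 2) for k, v in scores.items()})
```

Feature attribution along the lines of the paper's SHAP analysis could then be run on the fitted tree model, for example with `shap.TreeExplainer(model)`, to surface drivers such as intubation status and respiratory rate.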