Unsupervised Characterization of Temporal Dataset Shifts as an Early Indicator of AI Performance Variations: Evaluation Study Using the Medical Information Mart for Intensive Care-IV Dataset
David Fernández-Narro, Pablo Ferri, Alba Gutiérrez-Sacristán, Juan M García-Gómez, Carlos Sáez
JMIR Medical Informatics 2025;13:e78309. doi:10.2196/78309
Background: Reusing long-term data from electronic health records is essential for training reliable and effective health artificial intelligence (AI). However, intrinsic changes in health data distributions over time, known as dataset shifts (which include concept, covariate, and prior shifts), can compromise model performance, leading to model obsolescence and inaccurate decisions.
Objective: In this study, we investigate whether unsupervised, model-agnostic characterization of temporal dataset shifts, using data distribution analyses through Information Geometric Temporal (IGT) projections, can serve as an early indicator of potential AI performance variations before model development.
Methods: Using the real-world Medical Information Mart for Intensive Care-IV (MIMIC-IV) electronic health record database, encompassing data from over 40,000 patients from 2008 to 2019, we characterized its inherent dataset shift patterns through an unsupervised approach using IGT projections and data temporal heatmaps. We trained and evaluated, on an annual basis, a set of random forest and gradient boosting models to predict in-hospital mortality. To assess the impact of shifts on model performance, we tested the association between the temporal clusters found in the IGT projections and those found in an intertime embedding of model performances, using the Fisher exact test.
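For intuition, the core of this design can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the per-year shift statistic (standing in for the IGT projection output), the AUROC series, and the two-cluster split are all hypothetical; only the annual train-and-evaluate loop and the Fisher exact test mirror the described setup.

```python
import numpy as np
from scipy.stats import fisher_exact
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def yearly_auroc(frames):
    """Train on each year and evaluate on the next; frames maps year -> (X, y)."""
    years = sorted(frames)
    scores = {}
    for train_year, test_year in zip(years, years[1:]):
        X_tr, y_tr = frames[train_year]
        X_te, y_te = frames[test_year]
        clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
        scores[test_year] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    return scores

def cluster_1d(values, k=2):
    """Group years into k clusters from a one-dimensional per-year summary."""
    arr = np.asarray(values, dtype=float).reshape(-1, 1)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(arr)

# Hypothetical per-year summaries; in a real pipeline, auroc would come
# from yearly_auroc(per_year_data) and shift_stat from the IGT projection.
shift_stat = [0.10, 0.12, 0.11, 0.45, 0.48, 0.50]
auroc = [0.85, 0.84, 0.86, 0.78, 0.77, 0.76]

# Cross-tabulate shift clusters against performance clusters and test association.
table = np.zeros((2, 2), dtype=int)
for s, p in zip(cluster_1d(shift_stat), cluster_1d(auroc)):
    table[s, p] += 1
odds_ratio, p_value = fisher_exact(table)
print(f"Fisher exact test: P={p_value:.3f}")
```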
Results: Our results demonstrate a significant relationship between the unsupervised temporal shift patterns, specifically covariate and concept shifts, identified using the IGT projection method and the performance of the random forest and gradient boosting models (P<.05). We identified 2 primary temporal clusters corresponding to the periods before and after ICD-10 (International Statistical Classification of Diseases, Tenth Revision) implementation. The transition from ICD-9 (International Classification of Diseases, Ninth Revision) to ICD-10 was a major source of dataset shift and was associated with performance degradation.
Conclusions: Unsupervised, model-agnostic characterization of temporal shifts via IGT projections can serve as a proactive monitoring tool to anticipate performance shifts in clinical AI models. By incorporating early shift detection into the development pipeline, we can enhance decision-making during the training and maintenance of these models. This approach paves the way for more robust, trustworthy, and self-adapting AI systems in health care.
{"title":"Unsupervised Characterization of Temporal Dataset Shifts as an Early Indicator of AI Performance Variations: Evaluation Study Using the Medical Information Mart for Intensive Care-IV Dataset.","authors":"David Fernández-Narro, Pablo Ferri, Alba Gutiérrez-Sacristán, Juan M García-Gómez, Carlos Sáez","doi":"10.2196/78309","DOIUrl":"10.2196/78309","url":null,"abstract":"<p><strong>Background: </strong>Reusing long-term data from electronic health records is essential for training reliable and effective health artificial intelligence (AI). However, intrinsic changes in health data distributions over time-known as dataset shifts, which include concept, covariate, and prior shifts-can compromise model performance, leading to model obsolescence and inaccurate decisions.</p><p><strong>Objective: </strong>In this study, we investigate whether unsupervised, model-agnostic characterization of temporal dataset shifts using data distribution analyses through Information Geometric Temporal (IGT) projections is an early indicator of potential AI performance variations before model development.</p><p><strong>Methods: </strong>Using the real-world Medical Information Mart for Intensive Care-IV (MIMIC-IV) electronic health record database, encompassing data from over 40,000 patients from 2008 to 2019, we characterized its inherent dataset shift patterns through an unsupervised approach using IGT projections and data temporal heatmaps. We trained and evaluated annually a set of random forests and gradient boosting models to predict in-hospital mortality. To assess the impact of shifts on model performance, we checked the association between the temporal clusters found in both IGT projections and the intertime embedding of model performances using the Fisher exact test.</p><p><strong>Results: </strong>Our results demonstrate a significant relationship between the unsupervised temporal shift patterns, specifically covariate and concept shifts, identified using the IGT projection method and the performance of the random forest and gradient boosting models (P<.05). We identified 2 primary temporal clusters that correspond to the periods before and after ICD-10 (International Statistical Classification of Diseases, Tenth Revision) implementation. The transition from ICD-9 (International Classification of Diseases, Ninth Revision) to ICD-10 was a major source of dataset shift, associated with a performance degradation.</p><p><strong>Conclusions: </strong>Unsupervised, model-agnostic characterization of temporal shifts via IGT projections can serve as a proactive monitoring tool to anticipate performance shifts in clinical AI models. By incorporating early shift detection into the development pipeline, we can enhance decision-making during the training and maintenance of these models. 
This approach paves the way for more robust, trustworthy, and self-adapting AI systems in health care.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e78309"},"PeriodicalIF":3.8,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12712564/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145671034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Medical Feature Extraction From Clinical Examination Notes: Development and Evaluation of a Two-Phase Large Language Model Framework
Manal Abumelha, Abdullah Al-Malaise Al-Ghamdi, Ayman Fayoumi, Mahmoud Ragab
JMIR Medical Informatics 2025;e78432. doi:10.2196/78432
<p><strong>Background: </strong>Medical feature extraction from clinical text is challenging because of limited data availability, variability in medical terminology, and the critical need for trustworthy outputs. Large language models (LLMs) offer promising capabilities but face critical challenges with hallucination.</p><p><strong>Objective: </strong>This study aims to develop a robust framework for medical feature extraction that enhances accuracy by minimizing the risk of hallucination, even with limited training data.</p><p><strong>Methods: </strong>We developed a two-phase training approach. Phase 1 used instructing fine-tuning to teach feature extraction. Phase 2 introduced confidence-regularization fine-tuning with loss functions penalizing overconfident incorrect predictions, which were captured using bidirectional matching targeting hallucination and missing features. The model was trained using the full data of 700 patient notes and on few-shot 100 patient notes. We evaluated the framework on the United States Medical Licensing Examination Step-2 Clinical Skills dataset, testing on a public split of 200 patient notes and a private split of 1839 patient notes. Performance was assessed using precision, recall, and F<sub>1</sub>-scores, with error analysis conducted on predicted features from the private test set.</p><p><strong>Results: </strong>The framework achieved an F<sub>1</sub>-score of 0.968-0.983 on the full dataset of 700 patient notes and 0.960-0.973 with a few-shot subset of 100 of 700 patient notes (14.2%), outperforming INCITE (intelligent clinical text evaluator; F<sub>1</sub>=0.883) and DeBERTa (decoding-enhanced bidirectional encoder representations from transformers with disentangled attention; F<sub>1</sub>=0.958). It reduced hallucinations by 89.9% (from 3081 to 311 features) and missing features by 88.9% (from 6376 to 708) on the private dataset compared with the baseline LLM with few-shot in-context learning. Calibration evaluation on few-shot training (100 patient notes) showed that the expected calibration error increased from 0.060 to 0.147, whereas the Brier score improved from 0.087 to 0.036. Notably, the average model confidence remained stable at 0.84 (SD 0.003) despite F<sub>1</sub> improvements from 0.819 to 0.986.</p><p><strong>Conclusions: </strong>Our two-phase LLM framework successfully addresses critical challenges in automated medical feature extraction, achieving state-of-the-art performance while reducing hallucination and missing features. The framework's ability to achieve high performance with minimal training data (F<sub>1</sub>=0.960-0.973 with 100 samples) demonstrates strong generalization capabilities essential for resource-constrained settings in medical education. While traditional calibration metrics show misalignment, the practical benefits of confidence injection led to reduced errors, and inference-time filtering provided reliable outputs suitable for automated clinical assessment appli
{"title":"Medical Feature Extraction From Clinical Examination Notes: Development and Evaluation of a Two-Phase Large Language Model Framework.","authors":"Manal Abumelha, Abdullah Al-Malaise Al-Ghamdi, Ayman Fayoumi, Mahmoud Ragab","doi":"10.2196/78432","DOIUrl":"10.2196/78432","url":null,"abstract":"<p><strong>Background: </strong>Medical feature extraction from clinical text is challenging because of limited data availability, variability in medical terminology, and the critical need for trustworthy outputs. Large language models (LLMs) offer promising capabilities but face critical challenges with hallucination.</p><p><strong>Objective: </strong>This study aims to develop a robust framework for medical feature extraction that enhances accuracy by minimizing the risk of hallucination, even with limited training data.</p><p><strong>Methods: </strong>We developed a two-phase training approach. Phase 1 used instructing fine-tuning to teach feature extraction. Phase 2 introduced confidence-regularization fine-tuning with loss functions penalizing overconfident incorrect predictions, which were captured using bidirectional matching targeting hallucination and missing features. The model was trained using the full data of 700 patient notes and on few-shot 100 patient notes. We evaluated the framework on the United States Medical Licensing Examination Step-2 Clinical Skills dataset, testing on a public split of 200 patient notes and a private split of 1839 patient notes. Performance was assessed using precision, recall, and F<sub>1</sub>-scores, with error analysis conducted on predicted features from the private test set.</p><p><strong>Results: </strong>The framework achieved an F<sub>1</sub>-score of 0.968-0.983 on the full dataset of 700 patient notes and 0.960-0.973 with a few-shot subset of 100 of 700 patient notes (14.2%), outperforming INCITE (intelligent clinical text evaluator; F<sub>1</sub>=0.883) and DeBERTa (decoding-enhanced bidirectional encoder representations from transformers with disentangled attention; F<sub>1</sub>=0.958). It reduced hallucinations by 89.9% (from 3081 to 311 features) and missing features by 88.9% (from 6376 to 708) on the private dataset compared with the baseline LLM with few-shot in-context learning. Calibration evaluation on few-shot training (100 patient notes) showed that the expected calibration error increased from 0.060 to 0.147, whereas the Brier score improved from 0.087 to 0.036. Notably, the average model confidence remained stable at 0.84 (SD 0.003) despite F<sub>1</sub> improvements from 0.819 to 0.986.</p><p><strong>Conclusions: </strong>Our two-phase LLM framework successfully addresses critical challenges in automated medical feature extraction, achieving state-of-the-art performance while reducing hallucination and missing features. The framework's ability to achieve high performance with minimal training data (F<sub>1</sub>=0.960-0.973 with 100 samples) demonstrates strong generalization capabilities essential for resource-constrained settings in medical education. 
While traditional calibration metrics show misalignment, the practical benefits of confidence injection led to reduced errors, and inference-time filtering provided reliable outputs suitable for automated clinical assessment appli","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":" ","pages":"e78432"},"PeriodicalIF":3.8,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12712565/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145423572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Identifying Key Variances in Clinical Pathways Associated With Prolonged Hospital Stays Using Machine Learning and ePath Real-World Data: Model Development and Validation Study
Saori Tou, Koutarou Matsumoto, Asato Hashinokuchi, Fumihiko Kinoshita, Yasunobu Nohara, Takanori Yamashita, Yoshifumi Wakata, Tomoyoshi Takenaka, Hidehisa Soejima, Tomoharu Yoshizumi, Naoki Nakashima, Masahiro Kamouchi
JMIR Medical Informatics 2025;13:e71617. doi:10.2196/71617
Background: Prolonged hospital stays can lead to inefficiencies in health care delivery and unnecessary consumption of medical resources.
Objective: This study aimed to identify key clinical variances associated with prolonged length of stay (PLOS) in clinical pathways using a machine learning model trained on real-world data from the ePath system.
Methods: We analyzed data from 480 patients with lung cancer (age: mean 68.3, SD 11.2 years; n=263, 54.8% men) who underwent video-assisted thoracoscopic surgery at a university hospital between 2019 and 2023. PLOS was defined as a hospital stay exceeding 9 days after video-assisted thoracoscopic surgery. The variables collected between admission and 4 days after surgery were examined, and those that showed a significant association with PLOS in univariate analyses (P<.01) were selected as predictors. Predictive models were developed using sparse linear regression methods (Lasso, ridge, and elastic net) and decision tree ensembles (random forest and extreme gradient boosting). The data were divided into derivation (earlier study period) and testing (later period) cohorts for temporal validation. The model performance was assessed using the area under the receiver operating characteristic curve, Brier score, and calibration plots. Counterfactual analysis was used to identify key clinical factors influencing PLOS.
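As a rough sketch of the temporal validation and the best-performing ridge model, the split and metrics might be wired up as below. The column names (admission_date, plos), the split date, and the use of L2-penalized logistic regression as the "ridge" classifier for a binary outcome are assumptions, not details taken from the study.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def temporal_validation(df, feature_cols, outcome_col="plos", split_date="2022-01-01"):
    """Derivation cohort = earlier admissions; test cohort = later admissions."""
    derivation = df[df["admission_date"] < split_date]
    test = df[df["admission_date"] >= split_date]

    # Ridge-penalized (L2) logistic regression, one of the penalized linear
    # models compared in the study, on standardized predictors.
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
    )
    model.fit(derivation[feature_cols], derivation[outcome_col])

    # Discrimination (AUROC) and calibration (Brier score) in both cohorts.
    for name, cohort in [("derivation", derivation), ("test", test)]:
        p = model.predict_proba(cohort[feature_cols])[:, 1]
        print(name,
              "AUROC:", round(roc_auc_score(cohort[outcome_col], p), 2),
              "Brier:", round(brier_score_loss(cohort[outcome_col], p), 2))
    return model
```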
Results: A 3D heatmap illustrated the temporal relationships between clinical factors and PLOS based on patient demographics, comorbidities, functional status, surgical details, care processes, medications, and variances recorded from admission to 4 days after surgery. Among the 5 algorithms evaluated, the ridge regression model demonstrated the best performance in terms of both discrimination and calibration. Specifically, it achieved area under the receiver operating characteristic curve values of 0.84 and 0.82 and Brier scores of 0.16 and 0.17 in the derivation and test cohorts, respectively. In the final model, a range of variables, including blood tests, care, patient background, procedures, and clinical variances, were associated with PLOS. Among these, particular emphasis was placed on clinical variances. Counterfactual analysis using the ridge regression model identified 6 key variables strongly linked to PLOS. In order of impact, these were abnormal respiratory sounds, postoperative fever, arrhythmia, impaired ambulation, complications after drain removal, and pulmonary air leaks.
Conclusions: A machine learning-based model using ePath data effectively identified critical variances in the clinical pathways associated with PLOS. This automated tool may enhance clinical decision-making and improve patient management.
{"title":"Identifying Key Variances in Clinical Pathways Associated With Prolonged Hospital Stays Using Machine Learning and ePath Real-World Data: Model Development and Validation Study.","authors":"Saori Tou, Koutarou Matsumoto, Asato Hashinokuchi, Fumihiko Kinoshita, Yasunobu Nohara, Takanori Yamashita, Yoshifumi Wakata, Tomoyoshi Takenaka, Hidehisa Soejima, Tomoharu Yoshizumi, Naoki Nakashima, Masahiro Kamouchi","doi":"10.2196/71617","DOIUrl":"10.2196/71617","url":null,"abstract":"<p><strong>Background: </strong>Prolonged hospital stays can lead to inefficiencies in health care delivery and unnecessary consumption of medical resources.</p><p><strong>Objective: </strong>This study aimed to identify key clinical variances associated with prolonged length of stay (PLOS) in clinical pathways using a machine learning model trained on real-world data from the ePath system.</p><p><strong>Methods: </strong>We analyzed data from 480 patients with lung cancer (age: mean 68.3, SD 11.2 years; n=263, 54.8% men) who underwent video-assisted thoracoscopic surgery at a university hospital between 2019 and 2023. PLOS was defined as a hospital stay exceeding 9 days after video-assisted thoracoscopic surgery. The variables collected between admission and 4 days after surgery were examined, and those that showed a significant association with PLOS in univariate analyses (P<.01) were selected as predictors. Predictive models were developed using sparse linear regression methods (Lasso, ridge, and elastic net) and decision tree ensembles (random forest and extreme gradient boosting). The data were divided into derivation (earlier study period) and testing (later period) cohorts for temporal validation. The model performance was assessed using the area under the receiver operating characteristic curve, Brier score, and calibration plots. Counterfactual analysis was used to identify key clinical factors influencing PLOS.</p><p><strong>Results: </strong>A 3D heatmap illustrated the temporal relationships between clinical factors and PLOS based on patient demographics, comorbidities, functional status, surgical details, care processes, medications, and variances recorded from admission to 4 days after surgery. Among the 5 algorithms evaluated, the ridge regression model demonstrated the best performance in terms of both discrimination and calibration. Specifically, it achieved area under the receiver operating characteristic curve values of 0.84 and 0.82 and Brier scores of 0.16 and 0.17 in the derivation and test cohorts, respectively. In the final model, a range of variables, including blood tests, care, patient background, procedures, and clinical variances, were associated with PLOS. Among these, particular emphasis was placed on clinical variances. Counterfactual analysis using the ridge regression model identified 6 key variables strongly linked to PLOS. In order of impact, these were abnormal respiratory sounds, postoperative fever, arrhythmia, impaired ambulation, complications after drain removal, and pulmonary air leaks.</p><p><strong>Conclusions: </strong>A machine learning-based model using ePath data effectively identified critical variances in the clinical pathways associated with PLOS. 
This automated tool may enhance clinical decision-making and improve patient management.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e71617"},"PeriodicalIF":3.8,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12706448/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145656123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enabling Just-in-Time Clinical Oncology Analysis With Large Language Models: Feasibility and Validation Study Using Unstructured Synthetic Data
Peter May, Julian Greß, Christoph Seidel, Sebastian Sommer, Markus K Schuler, Sina Nokodian, Florian Schröder, Johannes Jung
JMIR Medical Informatics 2025;13:e78332. doi:10.2196/78332
<p><strong>Background: </strong>Traditional cancer registries, limited by labor-intensive manual data abstraction and rigid, predefined schemas, often hinder timely and comprehensive oncology research. While large language models (LLMs) have shown promise in automating data extraction, their potential to perform direct, just-in-time (JIT) analysis on unstructured clinical narratives-potentially bypassing intermediate structured databases for many analytical tasks-remains largely unexplored.</p><p><strong>Objective: </strong>This study aimed to evaluate whether a state-of-the-art LLM (Gemini 2.5 Pro) can enable a JIT clinical oncology analysis paradigm by assessing its ability to (1) perform high-fidelity multiparameter data extraction, (2) answer complex clinical queries directly from raw text, (3) automate multistep survival analyses including executable code generation, and (4) generate novel, clinically plausible hypotheses from free-text documentation.</p><p><strong>Methods: </strong>A synthetic dataset of 240 unstructured clinical letters from patients with stage IV non-small cell lung cancer (NSCLC), embedding 14 predefined variables, was used. Gemini 2.5 Pro was evaluated on four core JIT capabilities. Performance was measured by using the following metrics: extraction accuracy (compared to human extraction of n=40 letters and across the full n=240 dataset); numerical deviation for direct question answering (n=40 to 240 letters, 5 questions); log-rank P value and Harrell concordance index for LLM-generated versus ground-truth Kaplan-Meier survival analyses (n=160 letters, overall survival and progression-free survival); and correct justification, novelty, and a qualitative evaluation of LLM-generated hypotheses (n=80 and n=160 letters).</p><p><strong>Results: </strong>For multiparameter extraction from 40 letters, the LLM achieved >99% average accuracy, comparable to human extraction, but in significantly less time (LLM: 3.7 min vs human: 133.8 min). Across the full 240-letter dataset, LLM multiparameter extraction maintained >98% accuracy for most variables. The LLM answered multiconditional clinical queries directly from raw text with a relative deviation rarely exceeding 1.5%, even with up to 240 letters. Crucially, it autonomously performed end-to-end survival analysis, generating text-to-R-code that produced Kaplan-Meier curves statistically indistinguishable from ground truth. Consistent performance was demonstrated on a small validation cohort of 80 synthetic acute myeloid leukemia reports. Stress testing on data with simulated imperfections revealed a key role of a human-in-the-loop to resolve AI-flagged ambiguities. Furthermore, the LLM generated several correctly justified, biologically plausible, and potentially novel hypotheses from datasets up to 80 letters.</p><p><strong>Conclusions: </strong>This feasibility study demonstrated that a frontier LLM (Gemini 2.5 Pro) can successfully perform high-fidelity data extraction, multic
{"title":"Enabling Just-in-Time Clinical Oncology Analysis With Large Language Models: Feasibility and Validation Study Using Unstructured Synthetic Data.","authors":"Peter May, Julian Greß, Christoph Seidel, Sebastian Sommer, Markus K Schuler, Sina Nokodian, Florian Schröder, Johannes Jung","doi":"10.2196/78332","DOIUrl":"10.2196/78332","url":null,"abstract":"<p><strong>Background: </strong>Traditional cancer registries, limited by labor-intensive manual data abstraction and rigid, predefined schemas, often hinder timely and comprehensive oncology research. While large language models (LLMs) have shown promise in automating data extraction, their potential to perform direct, just-in-time (JIT) analysis on unstructured clinical narratives-potentially bypassing intermediate structured databases for many analytical tasks-remains largely unexplored.</p><p><strong>Objective: </strong>This study aimed to evaluate whether a state-of-the-art LLM (Gemini 2.5 Pro) can enable a JIT clinical oncology analysis paradigm by assessing its ability to (1) perform high-fidelity multiparameter data extraction, (2) answer complex clinical queries directly from raw text, (3) automate multistep survival analyses including executable code generation, and (4) generate novel, clinically plausible hypotheses from free-text documentation.</p><p><strong>Methods: </strong>A synthetic dataset of 240 unstructured clinical letters from patients with stage IV non-small cell lung cancer (NSCLC), embedding 14 predefined variables, was used. Gemini 2.5 Pro was evaluated on four core JIT capabilities. Performance was measured by using the following metrics: extraction accuracy (compared to human extraction of n=40 letters and across the full n=240 dataset); numerical deviation for direct question answering (n=40 to 240 letters, 5 questions); log-rank P value and Harrell concordance index for LLM-generated versus ground-truth Kaplan-Meier survival analyses (n=160 letters, overall survival and progression-free survival); and correct justification, novelty, and a qualitative evaluation of LLM-generated hypotheses (n=80 and n=160 letters).</p><p><strong>Results: </strong>For multiparameter extraction from 40 letters, the LLM achieved >99% average accuracy, comparable to human extraction, but in significantly less time (LLM: 3.7 min vs human: 133.8 min). Across the full 240-letter dataset, LLM multiparameter extraction maintained >98% accuracy for most variables. The LLM answered multiconditional clinical queries directly from raw text with a relative deviation rarely exceeding 1.5%, even with up to 240 letters. Crucially, it autonomously performed end-to-end survival analysis, generating text-to-R-code that produced Kaplan-Meier curves statistically indistinguishable from ground truth. Consistent performance was demonstrated on a small validation cohort of 80 synthetic acute myeloid leukemia reports. Stress testing on data with simulated imperfections revealed a key role of a human-in-the-loop to resolve AI-flagged ambiguities. 
Furthermore, the LLM generated several correctly justified, biologically plausible, and potentially novel hypotheses from datasets up to 80 letters.</p><p><strong>Conclusions: </strong>This feasibility study demonstrated that a frontier LLM (Gemini 2.5 Pro) can successfully perform high-fidelity data extraction, multic","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e78332"},"PeriodicalIF":3.8,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12670046/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145656660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gaps and Pathways to Success in Global Health Informatics Academic Collaborations: Reflecting on Current Practices
Elizabeth A Campbell, Felix Holl, Oliver J Bear Don't Walk Iv, Badisa Mosesane, Andrew S Kanter, Hamish Fraser, Amanda L Joseph, Judy Wawira Gichoya, Kabelo Leonard Mauco, Sansanee Craig
JMIR Medical Informatics 2025;13:e67326. doi:10.2196/67326
Academic global health informatics (GHI) projects are impactful collaborations between institutions in high-income and low- and middle-income countries (LMICs) and play a crucial role in enhancing health care services and access in LMICs using eHealth practices. Researchers across all involved organizations bring unique expertise to these collaborations. However, these projects often face significant obstacles, including cultural and linguistic barriers, resource limitations, and sustainability issues. The lack of representation of LMIC researchers in knowledge generation and the high costs of open-access publications further complicate efforts to ensure inclusive, accessible, and collaborative scholarship. This viewpoint describes present gaps in the literature on academic GHI collaborations and outlines a path forward for future research directions and successful research community development. Key recommendations include centering community-based participatory research, developing post-growth solutions, and creating sustainable funding models. Addressing these challenges is essential for fostering effective, scalable, and equitable GHI interventions that improve global health outcomes.
{"title":"Gaps and Pathways to Success in Global Health Informatics Academic Collaborations: Reflecting on Current Practices.","authors":"Elizabeth A Campbell, Felix Holl, Oliver J Bear Don't Walk Iv, Badisa Mosesane, Andrew S Kanter, Hamish Fraser, Amanda L Joseph, Judy Wawira Gichoya, Kabelo Leonard Mauco, Sansanee Craig","doi":"10.2196/67326","DOIUrl":"10.2196/67326","url":null,"abstract":"<p><strong>Unlabelled: </strong>Academic global health informatics (GHI) projects are impactful collaborations between institutions in high-income and low- and middle-income countries (LMICs) and play a crucial role in enhancing health care services and access in LMICs using eHealth practices. Researchers across all involved organizations bring unique expertise to these collaborations. However, these projects often face significant obstacles, including cultural and linguistic barriers, resource limitations, and sustainability issues. The lack of representation from LMIC researchers in knowledge generation and the high costs of open-access publications further complicate efforts to ensure inclusive, accessible, and collaborative scholarship. This viewpoint describes present gaps in the literature on academic GHI collaborations and describes a path forward for future research directions and successful research community development. Key recommendations include centering community-based participatory research, developing post-growth solutions, and creating sustainable funding models. Addressing these challenges is essential for fostering effective, scalable, and equitable GHI interventions that improve global health outcomes.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e67326"},"PeriodicalIF":3.8,"publicationDate":"2025-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12669914/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145656132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Predictive Performance of Radiomics-Based Machine Learning for Colorectal Cancer Recurrence Risk: Systematic Review and Meta-Analysis
Yuan Sun, Bo Li, Chuanlan Ju, Liming Hu, Huiyi Sun, Jing An, Tae-Hun Kim, Zhijun Bu, Zeyang Shi, Jianping Liu, Zhaolan Liu
JMIR Medical Informatics 2025;13:e78644. doi:10.2196/78644
<p><strong>Background: </strong>Predicting colorectal cancer (CRC) recurrence risk remains a challenge in clinical practice. Owing to the widespread use of radiomics in CRC diagnosis and treatment, some researchers recently explored the effectiveness of radiomics-based models in forecasting CRC recurrence risk. Nonetheless, the lack of systematic evidence of the efficacy of such models has hampered their clinical adoption.</p><p><strong>Objective: </strong>This study aimed to explore the value of radiomics in predicting CRC recurrence, providing a scholarly rationale for developing more specific interventions.</p><p><strong>Methods: </strong>Overall, 4 databases (Embase, PubMed, the Cochrane Library, and Web of Science) were searched for relevant articles from inception to January 1, 2025. We included studies that developed or validated radiomics-based machine learning models for predicting CRC recurrence using computed tomography or magnetic resonance imaging and provided discriminative performance metrics (c-index). Nonoriginal articles, studies that did not develop a model, and those lacking clear outcome measures were excluded from the study. The quality of the included original studies was assessed using the Radiomics Quality Score. A bivariate mixed-effects model was used to conduct a meta-analysis in which the c-index values with 95% CI were pooled. For the meta-analysis, subgroup analyses were conducted separately on the validation and training sets.</p><p><strong>Results: </strong>This meta-analysis included 17 original studies involving 4600 patients with CRC. The quality of the identified studies was low (mean Radiomics Quality Score 13.23/36, SD 2.56), with limitations in prospective design and biological validation. In the validation set, the c-index values based on clinical features, radiomics features, and radiomics features combined with clinical features were 0.73 (95% CI 0.68-0.79), 0.80 (95% CI 0.75-0.85), and 0.83 (95% CI 0.79-0.87), respectively. In the internal validation set, the c-index values based on clinical features, radiomics features, and radiomics features+clinical features were 0.70 (95% CI 0.61-0.79), 0.83 (95% CI 0.78-0.88), and 0.83 (95% CI 0.78-0.88), respectively. Finally, in the external validation set, the c-index values based on clinical features, radiomics features, and radiomics features combined with clinical features were 0.76 (95% CI 0.70-0.83), 0.75 (95% CI 0.66-0.83), and 0.83 (95% CI 0.78-0.88), respectively.</p><p><strong>Conclusions: </strong>Radiomics-based machine learning models, especially those integrating radiomics and clinical features, showed promising predictive performance for CRC recurrence risk. However, this study has several limitations, such as moderate study quality, limited sample size, and high heterogeneity in modeling approaches. These findings suggest the potential clinical value of integrated models in risk stratification and their potential to enhance personalized treatment,
{"title":"Predictive Performance of Radiomics-Based Machine Learning for Colorectal Cancer Recurrence Risk: Systematic Review and Meta-Analysis.","authors":"Yuan Sun, Bo Li, Chuanlan Ju, Liming Hu, Huiyi Sun, Jing An, Tae-Hun Kim, Zhijun Bu, Zeyang Shi, Jianping Liu, Zhaolan Liu","doi":"10.2196/78644","DOIUrl":"10.2196/78644","url":null,"abstract":"<p><strong>Background: </strong>Predicting colorectal cancer (CRC) recurrence risk remains a challenge in clinical practice. Owing to the widespread use of radiomics in CRC diagnosis and treatment, some researchers recently explored the effectiveness of radiomics-based models in forecasting CRC recurrence risk. Nonetheless, the lack of systematic evidence of the efficacy of such models has hampered their clinical adoption.</p><p><strong>Objective: </strong>This study aimed to explore the value of radiomics in predicting CRC recurrence, providing a scholarly rationale for developing more specific interventions.</p><p><strong>Methods: </strong>Overall, 4 databases (Embase, PubMed, the Cochrane Library, and Web of Science) were searched for relevant articles from inception to January 1, 2025. We included studies that developed or validated radiomics-based machine learning models for predicting CRC recurrence using computed tomography or magnetic resonance imaging and provided discriminative performance metrics (c-index). Nonoriginal articles, studies that did not develop a model, and those lacking clear outcome measures were excluded from the study. The quality of the included original studies was assessed using the Radiomics Quality Score. A bivariate mixed-effects model was used to conduct a meta-analysis in which the c-index values with 95% CI were pooled. For the meta-analysis, subgroup analyses were conducted separately on the validation and training sets.</p><p><strong>Results: </strong>This meta-analysis included 17 original studies involving 4600 patients with CRC. The quality of the identified studies was low (mean Radiomics Quality Score 13.23/36, SD 2.56), with limitations in prospective design and biological validation. In the validation set, the c-index values based on clinical features, radiomics features, and radiomics features combined with clinical features were 0.73 (95% CI 0.68-0.79), 0.80 (95% CI 0.75-0.85), and 0.83 (95% CI 0.79-0.87), respectively. In the internal validation set, the c-index values based on clinical features, radiomics features, and radiomics features+clinical features were 0.70 (95% CI 0.61-0.79), 0.83 (95% CI 0.78-0.88), and 0.83 (95% CI 0.78-0.88), respectively. Finally, in the external validation set, the c-index values based on clinical features, radiomics features, and radiomics features combined with clinical features were 0.76 (95% CI 0.70-0.83), 0.75 (95% CI 0.66-0.83), and 0.83 (95% CI 0.78-0.88), respectively.</p><p><strong>Conclusions: </strong>Radiomics-based machine learning models, especially those integrating radiomics and clinical features, showed promising predictive performance for CRC recurrence risk. However, this study has several limitations, such as moderate study quality, limited sample size, and high heterogeneity in modeling approaches. 
These findings suggest the potential clinical value of integrated models in risk stratification and their potential to enhance personalized treatment,","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e78644"},"PeriodicalIF":3.8,"publicationDate":"2025-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12669921/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145656075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Patient Attitudes Toward Ambient Voice Technology: Preimplementation Patient Survey in an Academic Medical Center
Gary Leiserowitz, Jeff Mansfield, Scott MacDonald, Melissa Jost
JMIR Medical Informatics 2025;13:e77901. doi:10.2196/77901
<p><strong>Background: </strong>Many institutions are in various stages of deploying an artificial intelligence (AI) scribe system for clinic electronic health record (EHR) documentation. In anticipation of the University of California, Davis Health's deployment of an AI scribe program, we surveyed current patients about their perceptions of this technology to inform a patient-centered implementation.</p><p><strong>Objective: </strong>We assessed patient perceptions about current clinician EHR documentation practices before implementation of the AI scribe program, and preconceptions regarding the AI scribe's introduction.</p><p><strong>Methods: </strong>We conducted a descriptive preimplementation survey as a quality improvement study. A convenience sample of 9171 patients (aged ≥18 years) who had a clinic visit within the previous year, was recruited via an email postvisit survey. Patient-identified demographics (age, gender, and race and ethnicity) were collected. The survey included rating scales on questions related to the patient perception of the AI scribe program, plus open-ended comments. Data were collated to analyze patient perceptions of including AI Scribe technology in a clinician visit.</p><p><strong>Results: </strong>In total, 1893 patients completed the survey (20% response rate), with partial responses from another 549. Sixty-three percent (n=1205) of the respondents were female, and most were 51 years and older (87%, n=1649). Most patients identified themselves as White (69%, n=1312), multirace (8%, n=154), Latinx (7%, n=130), and Black (2%, n=42). The respondents were not representative of the overall clinic populations and skewed more toward being female, ages 50 years and older, and White in comparison. Patients reacted to the current EHR documentation system, with 71% (n=1349) feeling heard or sometimes heard, but 23% (n=416) expressed frustrations that their physician focused too much on typing into the computer. When asked about their anticipated response to the use of an AI scribe, 48% (n=904) were favorable, 33% (n=630) were neutral, and 19% (n=359) were unfavorable. Younger patients (ages 18-30 years) expressed more skepticism than those aged 51 years and older. Further, 42% (655/1567) of positive comments received indicated this technology could improve human interaction during their visits. Comments supported that the use of an AI scribe would enhance patient experience by allowing the clinician to focus on the patient. However, when asked about concerns regarding the AI scribe, 39% (515/1330) and 15% (203/1330) of comments expressed concerns about documentation accuracy and privacy, respectively. Providing previsit patient education and obtaining permission were viewed as very important.</p><p><strong>Conclusions: </strong>This patient survey showed that respondents are generally open to the use of an AI scribe program for EHR documentation to allow the clinician to focus on the patient during the actual encounter ra
{"title":"Patient Attitudes Toward Ambient Voice Technology: Preimplementation Patient Survey in an Academic Medical Center.","authors":"Gary Leiserowitz, Jeff Mansfield, Scott MacDonald, Melissa Jost","doi":"10.2196/77901","DOIUrl":"10.2196/77901","url":null,"abstract":"<p><strong>Background: </strong>Many institutions are in various stages of deploying an artificial intelligence (AI) scribe system for clinic electronic health record (EHR) documentation. In anticipation of the University of California, Davis Health's deployment of an AI scribe program, we surveyed current patients about their perceptions of this technology to inform a patient-centered implementation.</p><p><strong>Objective: </strong>We assessed patient perceptions about current clinician EHR documentation practices before implementation of the AI scribe program, and preconceptions regarding the AI scribe's introduction.</p><p><strong>Methods: </strong>We conducted a descriptive preimplementation survey as a quality improvement study. A convenience sample of 9171 patients (aged ≥18 years) who had a clinic visit within the previous year, was recruited via an email postvisit survey. Patient-identified demographics (age, gender, and race and ethnicity) were collected. The survey included rating scales on questions related to the patient perception of the AI scribe program, plus open-ended comments. Data were collated to analyze patient perceptions of including AI Scribe technology in a clinician visit.</p><p><strong>Results: </strong>In total, 1893 patients completed the survey (20% response rate), with partial responses from another 549. Sixty-three percent (n=1205) of the respondents were female, and most were 51 years and older (87%, n=1649). Most patients identified themselves as White (69%, n=1312), multirace (8%, n=154), Latinx (7%, n=130), and Black (2%, n=42). The respondents were not representative of the overall clinic populations and skewed more toward being female, ages 50 years and older, and White in comparison. Patients reacted to the current EHR documentation system, with 71% (n=1349) feeling heard or sometimes heard, but 23% (n=416) expressed frustrations that their physician focused too much on typing into the computer. When asked about their anticipated response to the use of an AI scribe, 48% (n=904) were favorable, 33% (n=630) were neutral, and 19% (n=359) were unfavorable. Younger patients (ages 18-30 years) expressed more skepticism than those aged 51 years and older. Further, 42% (655/1567) of positive comments received indicated this technology could improve human interaction during their visits. Comments supported that the use of an AI scribe would enhance patient experience by allowing the clinician to focus on the patient. However, when asked about concerns regarding the AI scribe, 39% (515/1330) and 15% (203/1330) of comments expressed concerns about documentation accuracy and privacy, respectively. 
Providing previsit patient education and obtaining permission were viewed as very important.</p><p><strong>Conclusions: </strong>This patient survey showed that respondents are generally open to the use of an AI scribe program for EHR documentation to allow the clinician to focus on the patient during the actual encounter ra","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e77901"},"PeriodicalIF":3.8,"publicationDate":"2025-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12699246/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145643431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Risk Prediction of Major Adverse Cardiovascular Events Within One Year After Percutaneous Coronary Intervention in Patients With Acute Coronary Syndrome: Machine Learning-Based Time-to-Event Analysis
Hong-Jae Choi, Changhee Lee, Hack-Lyoung Kim, Youn-Jung Son
JMIR Medical Informatics 2025;13:e81778. doi:10.2196/81778
Background: Patients with acute coronary syndrome (ACS) who undergo percutaneous coronary intervention (PCI) remain at high risk for major adverse cardiovascular events (MACE). Conventional risk scores may not capture dynamic or nonlinear changes in postdischarge MACE risk, whereas machine learning (ML) approaches can improve predictive performance. However, few ML models have incorporated time-to-event analysis to reflect changes in MACE risk over time.
Objective: This study aimed to develop a time-to-event ML model for predicting MACE after PCI in patients with ACS and to identify the risk factors with time-varying contributions.
Methods: We analyzed electronic health records of 3159 patients with ACS who underwent PCI at a tertiary hospital in South Korea between 2008 and 2020. Six time-to-event ML models were developed using 54 variables. Model performance was evaluated using the time-dependent concordance index and Brier score. Variable importance was assessed using permutation importance and visualized with partial dependence plots to identify variables contributing to MACE risk over time.
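A minimal sketch of the time-to-event setup with scikit-survival is shown below. The synthetic arrays, the random survival forest (standing in for the paper's six unspecified models), and the use of the ensemble risk score with a truncated, inverse-probability-of-censoring-weighted concordance index are assumptions.

```python
import numpy as np
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import concordance_index_ipcw
from sksurv.util import Surv

# Hypothetical cohort: 54 features per patient, follow-up time in days,
# and a MACE event flag (roughly 20% event rate, as in the study).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 54))
time = rng.exponential(scale=900, size=300).clip(1, 1460)
event = rng.random(300) < 0.2

# Structured survival labels; the real study split temporally.
y = Surv.from_arrays(event=event, time=time)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

rsf = RandomSurvivalForest(n_estimators=100, random_state=0).fit(X_tr, y_tr)
risk = rsf.predict(X_te)  # ensemble risk score (higher = higher hazard)

# Time-dependent concordance at day 30 and at 1 year.
for tau in (30, 365):
    c_td = concordance_index_ipcw(y_tr, y_te, risk, tau=tau)[0]
    print(f"time-dependent C-index at day {tau}: {c_td:.3f}")
```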
Results: During a median follow-up of 3.8 years, 626 (19.8%) patients experienced MACE. The best-performing model achieved a time-dependent concordance index of 0.743 at day 30 and 0.616 at 1 year. Time-dependent Brier scores increased and remained stable across all ML models. Key predictors included contrast volume, age, medication adherence, coronary artery disease severity, and glomerular filtration rate. Contrast volume ≥300 mL, age ≥60 years, and medication adherence score ≥30 were associated with early postdischarge risk, whereas coronary artery disease severity and glomerular filtration rate became more influential beyond 60 days.
Conclusions: The proposed time-to-event ML model effectively captured dynamic risk patterns after PCI and identified key predictors with time-varying effects. These findings may support individualized postdischarge management and early intervention strategies to prevent MACE in high-risk patients.
{"title":"Risk Prediction of Major Adverse Cardiovascular Events Within One Year After Percutaneous Coronary Intervention in Patients With Acute Coronary Syndrome: Machine Learning-Based Time-to-Event Analysis.","authors":"Hong-Jae Choi, Changhee Lee, Hack-Lyoung Kim, Youn-Jung Son","doi":"10.2196/81778","DOIUrl":"10.2196/81778","url":null,"abstract":"<p><strong>Background: </strong>Patients with acute coronary syndrome (ACS) who undergo percutaneous coronary intervention (PCI) remain at high risk for major adverse cardiovascular events (MACE). Conventional risk scores may not capture dynamic or nonlinear changes in postdischarge MACE risk, whereas machine learning (ML) approaches can improve predictive performance. However, few ML models have incorporated time-to-event analysis to reflect changes in MACE risk over time.</p><p><strong>Objective: </strong>This study aimed to develop a time-to-event ML model for predicting MACE after PCI in patients with ACS and to identify the risk factors with time-varying contributions.</p><p><strong>Methods: </strong>We analyzed electronic health records of 3159 patients with ACS who underwent PCI at a tertiary hospital in South Korea between 2008 and 2020. Six time-to-event ML models were developed using 54 variables. Model performance was evaluated using the time-dependent concordance index and Brier score. Variable importance was assessed using permutation importance and visualized with partial dependence plots to identify variables contributing to MACE risk over time.</p><p><strong>Results: </strong>During a median follow-up of 3.8 years, 626 (19.8%) patients experienced MACE. The best-performing model achieved a time-dependent concordance index of 0.743 at day 30 and 0.616 at 1 year. Time-dependent Brier scores increased and remained stable across all ML models. Key predictors included contrast volume, age, medication adherence, coronary artery disease severity, and glomerular filtration rate. Contrast volume ≥300 mL, age ≥60 years, and medication adherence score ≥30 were associated with early postdischarge risk, whereas coronary artery disease severity and glomerular filtration rate became more influential beyond 60 days.</p><p><strong>Conclusions: </strong>The proposed time-to-event ML model effectively captured dynamic risk patterns after PCI and identified key predictors with time-varying effects. These findings may support individualized postdischarge management and early intervention strategies to prevent MACE in high-risk patients.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e81778"},"PeriodicalIF":3.8,"publicationDate":"2025-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12699253/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145643476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised Coverage Sampling to Enhance Clinical Chart Review Coverage for Computable Phenotype Development: Simulation and Empirical Study
Zigui Wang, Jillian H Hurst, Chuan Hong, Benjamin Alan Goldstein
JMIR Medical Informatics 2025;13:e72068. doi:10.2196/72068
Background: Developing computable phenotypes (CPs) based on electronic health record (EHR) data requires "gold-standard" labels for the outcome of interest. To generate these labels, clinicians typically chart-review a subset of patient charts. Charts to be reviewed are most often randomly sampled from the larger set of patients of interest. However, random sampling may fail to capture the diversity of the patient population, particularly if smaller subpopulations exist among those with the condition of interest. This can lead to poorly performing and biased CPs.
Objective: This study aimed to propose an unsupervised sampling approach designed to better capture a diverse patient cohort and improve the information coverage of chart review samples.
Methods: Our coverage sampling method starts by clustering the patient population of interest. We then perform stratified sampling from each cluster to ensure even representation in the chart review sample. We introduce a novel metric, the nearest neighbor distance, to evaluate the coverage of the generated sample. To evaluate our method, we first conducted a simulation study to model and compare the performance of random sampling versus our proposed coverage sampling, varying the size and number of subpopulations within the larger cohort. Finally, we applied our approach to a real-world dataset to develop a CP for hospitalization due to COVID-19. We evaluated the different sampling strategies based on information coverage as well as the performance of the learned CP, using the area under the receiver operating characteristic curve.
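A compact sketch of this procedure: cluster, sample evenly per cluster, then score coverage by the average distance from each cohort member to its nearest sampled chart. The k-means clusterer, the cluster count, and this exact operationalization of the nearest neighbor distance are assumptions, not the authors' specification.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def coverage_sample(X, n_review, k=10, random_state=0):
    """Cluster the cohort, then stratify the chart review sample evenly across clusters."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
    rng = np.random.default_rng(random_state)
    per_cluster = n_review // k
    chosen = []
    for c in range(k):
        members = np.flatnonzero(labels == c)
        take = min(per_cluster, members.size)
        chosen.extend(rng.choice(members, size=take, replace=False))
    return np.array(chosen)

def mean_nn_distance(X, sample_idx):
    """Average distance from each cohort member to its closest sampled chart.

    Lower values indicate the sample covers the cohort more evenly.
    """
    nn = NearestNeighbors(n_neighbors=1).fit(X[sample_idx])
    dist, _ = nn.kneighbors(X)
    return dist.mean()

# Hypothetical cohort with a small, distinct subpopulation that random
# sampling tends to underrepresent.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (950, 8)), rng.normal(4, 0.5, (50, 8))])

random_idx = rng.choice(len(X), size=100, replace=False)
coverage_idx = coverage_sample(X, n_review=100)
print("random NN distance:  ", round(mean_nn_distance(X, random_idx), 3))
print("coverage NN distance:", round(mean_nn_distance(X, coverage_idx), 3))
```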
Results: Our simulation studies show that the unsupervised coverage sampling approach provides broader coverage of patient populations compared to random sampling. When there are no underlying subpopulations, random and coverage sampling perform equally well for CP development. When there are subgroups, coverage sampling achieves gains in the area under the receiver operating characteristic curve of approximately 0.03-0.05 over random sampling. In the real-world application, the approach also outperformed random sampling, generating both a more representative sample and an improvement in the area under the receiver operating characteristic curve of 0.02 (95% CI -0.08 to 0.04).
Conclusions: The proposed coverage sampling method is an easy-to-implement approach that produces a chart review sample that is more representative of the source population. This allows one to learn a CP that has better performance both for subpopulations and the overall cohort. Studies that aim to develop CPs should consider alternative strategies other than randomly sampling patient charts.
{"title":"Unsupervised Coverage Sampling to Enhance Clinical Chart Review Coverage for Computable Phenotype Development: Simulation and Empirical Study.","authors":"Zigui Wang, Jillian H Hurst, Chuan Hong, Benjamin Alan Goldstein","doi":"10.2196/72068","DOIUrl":"10.2196/72068","url":null,"abstract":"<p><strong>Background: </strong>Developing computable phenotypes (CP) based on electronic health records (EHR) data requires \"gold-standard\" labels for the outcome of interest. To generate these labels, clinicians typically chart-review a subset of patient charts. Charts to be reviewed are most often randomly sampled from the larger set of patients of interest. However, random sampling may fail to capture the diversity of the patient population, particularly if smaller subpopulations exist among those with the condition of interest. This can lead to poorly performing and biased CPs.</p><p><strong>Objective: </strong>This study aimed to propose an unsupervised sampling approach designed to better capture a diverse patient cohort and improve the information coverage of chart review samples.</p><p><strong>Methods: </strong>Our coverage sampling method starts by clustering by the patient population of interest. We then perform a stratified sampling from each cluster to ensure even representation for the chart review sample. We introduce a novel metric, nearest neighbor distance, to evaluate the coverage of the generated sample. To evaluate our method, we first conducted a simulation study to model and compare the performance of random versus our proposed coverage sampling. We varied the size and number of subpopulations within the larger cohort. Finally, we apply our approach to a real-world data set to develop a CP for hospitalization due to COVID-19. We evaluate the different sampling strategies based on the information coverage as well as the performance of the learned CP, using the area under the receiver operator characteristic curve.</p><p><strong>Results: </strong>Our simulation studies show that the unsupervised coverage sampling approach provides broader coverage of patient populations compared to random sampling. When there are no underlying subpopulations, both random and coverage perform equally well for CP development. When there are subgroups, coverage sampling achieves area under the receiver operating characteristic curve gains of approximately 0.03-0.05 over random sampling. In the real-world application, the approach also outperformed random sampling, generating both a more representative sample and an area under the receiver operating characteristic curve improvement of 0.02 (95% CI -0.08 to 0.04).</p><p><strong>Conclusions: </strong>The proposed coverage sampling method is an easy-to-implement approach that produces a chart review sample that is more representative of the source population. This allows one to learn a CP that has better performance both for subpopulations and the overall cohort. 
Studies that aim to develop CPs should consider alternative strategies other than randomly sampling patient charts.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e72068"},"PeriodicalIF":3.8,"publicationDate":"2025-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12661603/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145643472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Background: The Computerized Digit Vigilance Test (CDVT) is a well-established measure of sustained attention. However, the CDVT measures only total reaction time and response accuracy, failing to capture other crucial attentional features such as eye blink rate, yawns, head movements, and eye movements. Omitting such features may yield an incomplete picture of sustained attention.
Objective: This study aimed to develop an artificial intelligence (AI)-based Computerized Digit Vigilance Test (AI-CDVT) for older adults.
Methods: Participants completed the CDVT while video recordings captured their head and face. The Montreal Cognitive Assessment (MoCA), Stroop Color Word Test (SCW), and Color Trails Test (CTT) were also administered. The AI-CDVT was developed in three steps: (1) retrieving attentional features using OpenFace AI software (CMU MultiComp Lab), (2) establishing an AI-based scoring model with the Extreme Gradient Boosting regressor, and (3) assessing the AI-CDVT's validity with Pearson r values and its test-retest reliability with intraclass correlation coefficients (ICCs).
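A minimal sketch of steps 2 and 3 follows, assuming the OpenFace attentional features have already been exported to a tabular matrix; the synthetic data, the hyperparameters, and the use of held-out reference scores are illustrative assumptions rather than the authors' exact protocol.

```python
# Hedged sketch of AI-CDVT steps 2-3: fit an Extreme Gradient Boosting
# regressor on (assumed pre-extracted) OpenFace features, then check
# convergent validity with Pearson r. All data here are synthetic stand-ins.
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
n = 153                                    # cohort size reported in the Results
X = rng.normal(size=(n, 20))               # stand-in for OpenFace features
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=n)  # stand-in attention score

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(X_tr, y_tr)
ai_scores = model.predict(X_te)

# Step 3 (validity): correlate model scores with a reference measure;
# in the study this role is played by the MoCA, SCW, and CTT.
r, p = pearsonr(ai_scores, y_te)
print(f"Pearson r = {r:.2f} (P = {p:.3g})")
```

Test-retest reliability would additionally require a second assessment session per participant, with the two sets of scores compared via an ICC.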
Results: In total, 153 participants were included. Pearson r values of the AI-CDVT were -0.42 with the MoCA, -0.31 with the SCW, and 0.46-0.61 with the CTT. The ICC of the AI-CDVT was 0.78.
Conclusions: We developed an AI-CDVT, which leveraged AI to extract attentional features from video recordings and integrated them to generate a comprehensive attention score. Our findings demonstrated good validity and test-retest reliability for the AI-CDVT, suggesting its potential as a reliable and valid tool for assessing sustained attention in older adults.
{"title":"Artificial Intelligence-Based Computerized Digit Vigilance Test in Community-Dwelling Older Adults: Development and Validation Study.","authors":"Gong-Hong Lin, Dorothy Bai, Yi-Jing Huang, Shih-Chieh Lee, Mai Thi Thuy Vu, Tsu-Hsien Chiu","doi":"10.2196/73038","DOIUrl":"10.2196/73038","url":null,"abstract":"<p><strong>Background: </strong>The Computerized Digit Vigilance Test (CDVT) is a well-established measure of sustained attention. However, the CDVT only measures the total reaction time and response accuracy and fails to capture other crucial attentional features such as the eye blink rate, yawns, head movements, and eye movements. Omitting such features might provide an incomplete representative picture of sustained attention.</p><p><strong>Objective: </strong>This study aimed to develop an artificial intelligence (AI)-based Computerized Digit Vigilance Test (AI-CDVT) for older adults.</p><p><strong>Methods: </strong>Participants were assessed by the CDVT with video recordings capturing their head and face. The Montreal Cognitive Assessment (MoCA), Stroop Color Word Test (SCW), and Color Trails Test (CTT) were also administered. The AI-CDVT was developed in three steps: (1) retrieving attentional features using OpenFace AI software (CMU MultiComp Lab), (2) establishing an AI-based scoring model with the Extreme Gradient Boosting regressor, and (3) assessing the AI-CDVT's validity by Pearson r values and test-retest reliability by intraclass correlation coefficients (ICCs).</p><p><strong>Results: </strong>In total, 153 participants were included. Pearson r values of the AI-CDVT with the MoCA were -0.42, -0.31 with the SCW, and 0.46-0.61 with the CTT. The ICC of the AI-CDVT was 0.78.</p><p><strong>Conclusions: </strong>We developed an AI-CDVT, which leveraged AI to extract attentional features from video recordings and integrated them to generate a comprehensive attention score. Our findings demonstrated good validity and test-retest reliability for the AI-CDVT, suggesting its potential as a reliable and valid tool for assessing sustained attention in older adults.</p>","PeriodicalId":56334,"journal":{"name":"JMIR Medical Informatics","volume":"13 ","pages":"e73038"},"PeriodicalIF":3.8,"publicationDate":"2025-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12670460/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145656612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}