
JMIR Medical Informatics: Latest Articles

Trends and Trajectories in the Rise of Large Language Models in Radiology: Scoping Review.
IF 3.8 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS Pub Date: 2025-12-09 DOI: 10.2196/78041
Adhari Al Zaabi, Rashid Alshibli, Abdullah AlAmri, Ibrahim AlRuheili, Syaheerah Lebai Lutfi

Background: The use of large language models (LLMs) in radiology is expanding rapidly, offering new possibilities in report generation, decision support, and workflow optimization. However, a comprehensive evaluation of their applications, performance, and limitations across the radiology domain remains limited.

Objective: This review aimed to map current applications of LLMs in radiology, evaluate their performance across key tasks, and identify prevailing limitations and directions for future research.

Methods: A scoping review was conducted in accordance with the Arksey and O'Malley framework and the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines. Three databases (PubMed, Scopus, and IEEE Xplore) were searched for peer-reviewed studies published between January 2022 and December 2024. Eligible studies included empirical evaluations of LLMs applied to radiological data or workflows. Commentaries, reviews, and technical model proposals without evaluation were excluded. Two reviewers independently screened studies and extracted data on study characteristics, LLM type, radiological use case, data modality, and evaluation metrics. A thematic synthesis was used to identify key domains of application. No formal risk-of-bias assessment was performed, but a narrative appraisal of dataset representativeness and study quality was included.

Results: A total of 67 studies were included. GPT-4 was the most frequently used model (n=28, 42%), and text-based corpora were the primary type of data used (n=43, 64%). Identified use cases fell into three thematic domains: (1) decision support (n=39, 58%), (2) report generation and summarization (n=16, 24%), and (3) workflow optimization (n=12, 18%). While LLMs demonstrated strong performance in structured-text tasks (eg, report simplification with >94% accuracy), diagnostic performance varied widely (16%-86%) and was limited by dataset bias, lack of fine-tuning, and minimal clinical validation. Most studies (n=53, 79.1%) had single-center, proof-of-concept designs with limited generalizability.

Conclusions: LLMs show strong potential for augmenting radiological workflows, particularly for structured reporting, summarization, and educational tasks. However, their diagnostic performance remains inconsistent, and current implementations lack robust external validation. Future work should prioritize prospective, multicenter validation of domain-adapted and multimodal models to support safe clinical integration.
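
The percentages in the Results can be re-derived from the reported counts. The short check below is only an arithmetic sketch: the counts and N=67 come from the abstract, while the dictionary labels and the helper name `pct` are illustrative choices.

```python
# Counts reported in the abstract; N is the number of included studies.
N = 67
counts = {
    "GPT-4 used": 28,
    "text-based corpora": 43,
    "decision support": 39,
    "report generation and summarization": 16,
    "workflow optimization": 12,
    "single-center, proof-of-concept": 53,
}


def pct(n, total, digits=1):
    """Percentage of total, rounded to the given number of decimals."""
    return round(100 * n / total, digits)


shares = {label: pct(n, N) for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label}: {counts[label]}/{N} = {share}%")

# The three thematic domains should add up to the number of included studies.
domain_total = (
    counts["decision support"]
    + counts["report generation and summarization"]
    + counts["workflow optimization"]
)
```

Rounded to whole numbers, the shares agree with the abstract (42%, 64%, 58%, 24%, 18%), and the three domain counts sum to 67.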

Citations: 0
Coagulation Risk Prediction in Patients With Liver Failure: Integrated Meta-Analysis and Machine Learning Model Study.
IF 3.8 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS Pub Date: 2025-12-08 DOI: 10.2196/76348
Hao Wang, Tao He, Liang Ren, Tingjun Zhang

Background: Liver failure often results in significant coagulation dysfunction, which is a major complication. Artificial liver support systems (ALSS) have been used to ameliorate coagulation parameters, but the dynamic nature of these improvements and the development of predictive models remain insufficiently explored.

Objective: This study aimed to evaluate the effects of ALSS on coagulation function and to develop a dynamic prediction model using machine learning techniques to predict the improvement trends of coagulation parameters.

Methods: A systematic search was conducted in PubMed, Embase, and other databases to identify relevant studies, resulting in 18 studies comprising 1771 patients. A meta-analysis was performed to assess the impact of ALSS on coagulation parameters, including international normalized ratio (INR), prothrombin time (PT), activated partial thromboplastin time (APTT), and fibrinogen levels. In addition, clinical data from the Medical Information Mart for Intensive Care database were used to construct prediction models using logistic regression, extreme gradient boosting, random forest, and long short-term memory networks.

Results: Meta-analysis results showed that ALSS significantly improved INR, PT, APTT, and fibrinogen levels (all P<.05), with the treatment efficacy varying by modality. Among the machine learning models, the random forest model demonstrated the best performance, achieving an area under the curve of 92.12%. Dynamic INR was identified as the key predictor for coagulation abnormalities.

Conclusions: This study systematically evaluated the effects of ALSS on coagulation function in patients with liver failure, demonstrating significant improvements in key parameters such as INR, PT, and APTT, with efficacy varying across different treatment modalities. Simultaneously, a machine learning model built using intensive care unit clinical data exhibited strong predictive capability for identifying the risk of coagulation dysfunction, particularly useful in supporting early-stage clinical recognition of high-risk patients and guiding personalized coagulation management strategies. It is important to emphasize that this model is positioned as a dynamic risk alert and assessment tool, intended to assist clinical baseline evaluation and nursing interventions, rather than serving as direct validation of ALSS therapeutic efficacy.
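
For readers less familiar with the metric, the area under the (receiver operating characteristic) curve reported in the Results, 92.12% or 0.9212 on a 0-1 scale, can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes that quantity by direct pairwise comparison. The function name `auc` and the toy labels and scores are invented for illustration and have nothing to do with the study data.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and risk scores, for illustration only.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1, 0.7]
toy_auc = auc(toy_labels, toy_scores)
print(toy_auc)
```

The pairwise form is quadratic in the number of cases, so it is suited to explanation rather than to large data sets.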

Citations: 0
Automated Speech Analysis for Screening and Monitoring Bipolar Depression: Machine Learning Model Development and Interpretation Study.
IF 3.8 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS Pub Date: 2025-12-04 DOI: 10.2196/79093
Sooyeon Min, Tae-Sung Yeum, Daun Shin, Sang Jin Rhee, Hyunju Lee, Han-Sung Lee, Seongmin Park, Jihwa Lee, Yong Min Ahn

Background: Depressive episodes in bipolar disorder are frequent, prolonged, and contribute substantially to functional impairment and reduced quality of life. Therefore, early and objective detection of bipolar depression is critical for timely intervention and improved outcomes. Multimodal speech analyses hold promise for capturing psychomotor, cognitive, and affective changes associated with bipolar depression.

Objective: This study aimed to develop between- and within-person classifiers to screen for bipolar depression and monitor longitudinal changes to detect depressive recurrence in patients with bipolar disorder. A secondary objective was to compare the predictive performance across speech modalities.

Methods: We collected 304 voice audio recordings obtained during semistructured interviews with 92 patients diagnosed with bipolar disorder over a 1-year period. Depression severity was assessed using the Hamilton Depression Rating Scale. Acoustic features were extracted using the openSMILE toolkit, and linguistic features were extracted using the Linguistic Inquiry and Word Count frameworks following automatic speech recognition and machine translation. Mixed-effects multivariate linear regression evaluated the associations between speech markers and Hamilton Depression Rating Scale scores, adjusting for demographic variables, diagnosis, and feature-specific covariates. Extreme gradient boosting and light gradient boosting were used as base learners. We developed a between-person classifier to detect moderate to severe depression and a within-person classifier to detect recurrence. Hyperparameter tuning and 95% CI estimation were performed using a bootstrap bias-corrected cross-validation (k=5) approach combined with a grid search. Feature contributions were interpreted using Shapley additive explanations.

Results: Patients with depression showed reduced energy modulation, prolonged monotony, and more frequent use of words related to death and negative emotions. The between-person classifier combining acoustic and linguistic features detected moderate to severe depression with an area under the curve of 0.76 compared to 0.54 for the demographic model. The within-person classifier based on speech features detected depression recurrence with an area under the curve of 0.70 compared to 0.55 for the demographic model.

Conclusions: Between- and within-person comparisons of speech markers can be leveraged in detecting and monitoring bipolar depression. We demonstrate the feasibility of applying Linguistic Inquiry and Word Count-based psycholinguistic analysis to machine-transcribed and translated speech, supporting the replicability of this approach across languages. Automated multimodal voice analysis can be integrated into digital health platforms, providing a scalable and effective approach for accessing mental health monitoring and care.
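
The abstract reports 304 recordings from 92 patients and a cross-validation with k=5, but it does not say how recordings were assigned to folds. One common precaution with repeated measures, sketched here as an assumption and not as the authors' procedure, is to keep all recordings from one patient in the same fold so that a between-person classifier is never tested on a patient it has already seen. The function name `grouped_kfold` and the per-patient recording counts are hypothetical; only the totals (92 and 304) and k=5 come from the abstract.

```python
import random
from collections import defaultdict


def grouped_kfold(patient_ids, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs so that no patient appears on both sides."""
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    # Deal shuffled patients into k folds, round-robin.
    folds = [patients[i::k] for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        test_idx = [i for pid in held_out for i in by_patient[pid]]
        train_idx = [i for pid in patients if pid not in held for i in by_patient[pid]]
        yield train_idx, test_idx


# Hypothetical layout: 92 patients, recording counts invented to total 304.
recordings_per_patient = [4] * 28 + [3] * 64  # 112 + 192 = 304
patient_ids = [pid for pid, n in enumerate(recordings_per_patient) for _ in range(n)]
splits = list(grouped_kfold(patient_ids, k=5))
```

Splitting by recording instead of by patient would let the same voice appear in training and test data, which tends to inflate between-person performance estimates.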
Citations: 0
Involving Health, Technology, and Financial Stakeholders in Co-Designing Digital Pathways for Value-Based Care.
IF 3.8 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS Pub Date: 2025-12-04 DOI: 10.2196/84885
Pieter Vandekerckhove, Benjamin H L Harris, Louis J Koizia, Steven Howard
Citations: 0
Online Clinical Calculator for Predicting 28-Day Mortality in Older Adult Patients With Sepsis-Associated Encephalopathy: Retrospective Study Using MIMIC-IV.
IF 3.8 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS Pub Date: 2025-12-04 DOI: 10.2196/76417
Guangyong Jin, Menglu Zhou, Jiayi Chen, Mengyuan Diao, Wei Hu

Background: Sepsis-associated encephalopathy (SAE) represents a critical complication of sepsis, especially among older adults. Despite its clinical relevance, there remains a lack of accessible and practical tools specifically designed to predict 28-day mortality in this vulnerable population.

Objective: We aimed to enhance the practical applicability of the model by creating a web-based tool that allows real-time, individualized mortality risk prediction, facilitating early intervention and informed decision-making in clinical practice.

Methods: Using data extracted from the MIMIC-IV (Medical Information Mart for Intensive Care IV) database, we identified older patients (≥65 years) with SAE (n=2165) and divided them into a development cohort (n=1531) and a validation cohort (n=634). Key risk factors associated with 28-day mortality were identified, and a predictive nomogram was constructed. Model performance was evaluated using the concordance index, integrated discrimination improvement, net reclassification index, and calibration curve analysis. Clinical applicability was assessed through decision curve analysis and benchmarked against traditional intensive care unit (ICU) scoring systems. Furthermore, the nomogram was deployed as a web-based application, enabling clinicians to input data and generate individualized mortality predictions.

Results: A total of 2165 older patients with SAE were included, among whom 290 (13.4%) died within 28 days of ICU admission. Multivariable logistic regression identified lower body weight (odds ratio [OR] 0.985, 95% CI 0.975-0.994; P=.001), lower systolic blood pressure (OR 0.972, 95% CI 0.957-0.986; P<.001), lower hemoglobin (OR 0.984, 95% CI 0.974-0.995; P=.005), lower PaO2 (OR 0.996, 95% CI 0.994-0.997; P<.001), and lower Glasgow Coma Scale score (OR 0.825, 95% CI 0.786-0.864; P<.001) as mortality risk factors. Higher respiratory rate (OR 1.083, 95% CI 1.029-1.141; P=.002), increased anion gap (OR 1.081, 95% CI 1.031-1.135; P=.001), elevated blood urea nitrogen (OR 1.045, 95% CI 1.016-1.076; P=.002), prolonged partial thromboplastin time (OR 1.033, 95% CI 1.016-1.050; P<.001), and reduced urine output (OR>0.99, 95% CI 0.999-1.000; P=.002) were also predictive. Patients admitted to "other" ICU types had lower mortality compared with the medical ICU reference group (OR 0.327, 95% CI 0.176-0.609; P<.001). The nomogram achieved concordance index values of 0.899 (development) and 0.897 (validation), outperforming Sequential Organ Failure Assessment (0.692), Acute Physiology Score III (0.804), Logistic Organ Dysfunction System (0.771), Simplified Acute Physiology Score II (0.704), and Oxford Acute Severity of Illness Score (0.753), with significant integrated discrimination improvement and net reclassification index improvements (all P<.001). Calibration curves confirmed good agreement between predicted and observed outcomes.

Conclusions: This study presents a novel and effective nomogram, built from routine clinical data, for predicting 28-day mortality in older patients with SAE. Deploying the model as a digital tool enhances its accessibility and usability, giving clinicians a practical resource for risk stratification and individualized patient management.
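
The odds ratios in the Results are per-unit effects, so small-looking values can compound. The sketch below checks the cohort arithmetic from the abstract and scales two of the reported odds ratios over a larger change, assuming the fitted log-odds are linear in each predictor and that the units are one Glasgow Coma Scale point and one breath per minute (the abstract does not state units). The helper name `scaled_or` and the chosen step sizes are illustrative.

```python
import math

# Figures quoted in the abstract.
total, development, validation, deaths = 2165, 1531, 634, 290
mortality_pct = round(100 * deaths / total, 1)


def scaled_or(or_per_unit, units):
    """Odds ratio for a change of `units`, assuming log-odds are linear in the predictor."""
    return math.exp(units * math.log(or_per_unit))


# Assumed units: 1 Glasgow Coma Scale point and 1 breath/min respectively.
gcs_or_5_points = scaled_or(0.825, 5)      # GCS score 5 points higher
resp_or_10_breaths = scaled_or(1.083, 10)  # respiratory rate 10 breaths/min higher

print(mortality_pct, round(gcs_or_5_points, 3), round(resp_or_10_breaths, 2))
```

Under these assumptions, a Glasgow Coma Scale score 5 points higher corresponds to roughly 0.38 times the odds of death, and a respiratory rate 10 breaths per minute higher to roughly 2.2 times the odds, holding the other predictors fixed.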
Citations: 0
Machine Learning-Based Prediction of In-Hospital Falls in Adult Inpatients: Retrospective Observational Multicenter Study.
IF 3.8 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS Pub Date: 2025-12-04 DOI: 10.2196/75958
Takuya Nishino, Kotone Matsuyama, Yasuo Miyagi, Nari Tanabe, Fumiko Yamaguchi, Hiroki Ito, Shizuka Soh, Ayako Yano, Masako Mizuno, Katsuhito Kato, Hiroshige Jinnouchi, Chol Kim, Yosuke Ishii, Hiroki Yamaguchi, Yukihiro Kondo

Background: Falls among hospitalized patients are a critical issue that often leads to prolonged hospital stays and increased health care costs. Traditional fall risk assessments typically rely on standardized scoring systems; however, these may fail to capture the complex and multifactorial nature of fall risk factors.

Objective: This retrospective observational multicenter study aimed to develop and validate a machine learning-based model to predict in-hospital falls and to evaluate its performance in terms of discrimination and calibration.

Methods: We analyzed the data of 83,917 inpatients aged 65 years and older with a hospital stay of at least 3 days. Using Diagnosis Procedure Combination data and laboratory results, we extracted demographic, clinical, functional, and pharmacological variables. Following the selection of 30 key features, 4 predictive models were constructed: logistic regression, extreme gradient boosting, light gradient boosting machine (LGBM), and categorical boosting (CatBoost). The synthetic minority oversampling technique and isotonic regression calibration were applied to improve the prediction quality and address class imbalance.

Results: Falls occurred in 2173 (2.6%) patients. CatBoost achieved the highest F1-score (0.189, 95% CI 0.162-0.215) and area under the precision-recall curve (0.112, 95% CI 0.091-0.136), whereas LGBM had the best calibration slope (0.964, 95% CI 0.858-1.070) with good discrimination (F1-score 0.182, 95% CI 0.156-0.209; area under the precision-recall curve 0.094, 95% CI 0.078-0.113). Logistic regression had the lowest discrimination (F1-score 0.120, 95% CI 0.100-0.143). Shapley Additive Explanations analysis consistently identified low albumin, impaired transfer ability, and the use of sedative-hypnotics or diabetes medications as major contributors to fall risk. In the incident report analysis (n=435), 49.2% of falls were toileting-related, peaking between 4 and 6 AM, with bedside falls predominating in the high- and very high-risk groups.

Conclusions: CatBoost and LGBM offer clinically valuable prediction performance, with CatBoost favored for high-risk patient identification and LGBM for probability-based intervention thresholds. Integrating such models into electronic health records could enable real-time risk scoring and trigger targeted interventions (eg, toileting assistance and mobility support). Future work should incorporate dynamic, time-varying patient data to improve real-time risk prediction.
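The isotonic regression calibration named in the Methods above can be illustrated with a short sketch. This is not the authors' pipeline: it is a minimal pool-adjacent-violators implementation in plain Python, the function names `fit_isotonic` and `calibrate` are hypothetical, the six scores and outcomes are made-up toy values, and the step-function lookup is a simplification (library implementations typically interpolate between steps).

```python
from bisect import bisect_left


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from raw scores to observed event
    rates with the pool-adjacent-violators algorithm (PAVA)."""
    blocks = []  # each block: [sum of labels, count, largest score in block]
    for score, label in sorted(zip(scores, labels)):
        if blocks and blocks[-1][2] == score:
            # Tied scores must share one calibrated value, so pool them first.
            blocks[-1][0] += label
            blocks[-1][1] += 1
        else:
            blocks.append([float(label), 1, score])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    uppers = [block[2] for block in blocks]
    means = [block[0] / block[1] for block in blocks]
    return uppers, means


def calibrate(score, uppers, means):
    """Return the event rate of the first block whose upper score bound is
    at least the given score (the last block for scores above every bound)."""
    idx = bisect_left(uppers, score)
    return means[min(idx, len(means) - 1)]


# Toy data (assumed for illustration): raw model scores and fall outcomes.
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
fell = [0, 0, 1, 0, 1, 1]
uppers, means = fit_isotonic(raw_scores, fell)
print(calibrate(0.35, uppers, means))  # 0.5
```

In the toy data the scores 0.3 and 0.4 are out of order with respect to their outcomes, so they are pooled into one step with an event rate of 0.5. When a model is trained on oversampled data, a mapping of this kind is usually fitted on held-out data with the original class balance, because oversampling inflates the raw scores.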

JMIR Medical Informatics, volume 13, e75958 (2025-12-04). Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12715471/pdf/
Citations: 0
Authors' Reply: Involving Health, Technology, and Financial Stakeholders in Co-Designing Digital Pathways for Value-Based Care.
IF 3.8 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS Pub Date: 2025-12-04 DOI: 10.2196/86837
Jinsong Chen, Christopher Bullen, Lan Zhang
JMIR Medical Informatics, volume 13, e86837 (2025-12-04). Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12714545/pdf/
Citations: 0
Unsupervised Characterization of Temporal Dataset Shifts as an Early Indicator of AI Performance Variations: Evaluation Study Using the Medical Information Mart for Intensive Care-IV Dataset.
IF 3.8 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS Pub Date: 2025-12-03 DOI: 10.2196/78309
David Fernández-Narro, Pablo Ferri, Alba Gutiérrez-Sacristán, Juan M García-Gómez, Carlos Sáez

Background: Reusing long-term data from electronic health records is essential for training reliable and effective health artificial intelligence (AI). However, intrinsic changes in health data distributions over time, known as dataset shifts (which include concept, covariate, and prior shifts), can compromise model performance, leading to model obsolescence and inaccurate decisions.

Objective: In this study, we investigate whether unsupervised, model-agnostic characterization of temporal dataset shifts using data distribution analyses through Information Geometric Temporal (IGT) projections is an early indicator of potential AI performance variations before model development.

Methods: Using the real-world Medical Information Mart for Intensive Care-IV (MIMIC-IV) electronic health record database, encompassing data from over 40,000 patients from 2008 to 2019, we characterized its inherent dataset shift patterns through an unsupervised approach using IGT projections and data temporal heatmaps. We trained and evaluated a set of random forest and gradient boosting models annually to predict in-hospital mortality. To assess the impact of shifts on model performance, we used the Fisher exact test to check the association between the temporal clusters found in the IGT projections and those found in the intertime embedding of model performances.

Results: Our results demonstrate a significant relationship between the unsupervised temporal shift patterns, specifically covariate and concept shifts, identified using the IGT projection method and the performance of the random forest and gradient boosting models (P<.05). We identified 2 primary temporal clusters that correspond to the periods before and after ICD-10 (International Statistical Classification of Diseases, Tenth Revision) implementation. The transition from ICD-9 (International Classification of Diseases, Ninth Revision) to ICD-10 was a major source of dataset shift and was associated with performance degradation.

Conclusions: Unsupervised, model-agnostic characterization of temporal shifts via IGT projections can serve as a proactive monitoring tool to anticipate performance shifts in clinical AI models. By incorporating early shift detection into the development pipeline, we can enhance decision-making during the training and maintenance of these models. This approach paves the way for more robust, trustworthy, and self-adapting AI systems in health care.
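The association check described in the Methods above relies on the Fisher exact test. As a rough sketch of how such a test works on a 2x2 table, the following plain-Python function computes the two-sided P value from the hypergeometric distribution. The function name and both example tables are assumptions made for illustration: they imagine 12 temporal batches cross-classified by cluster membership in the data-shift projection (rows) and in the performance embedding (columns), and they are not the study's data.

```python
from math import comb


def fisher_exact_two_sided(table):
    """Two-sided Fisher exact test P value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    # With all margins fixed, the top-left cell k follows a hypergeometric
    # distribution; integer weights keep the comparison below exact.
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    weights = {
        k: comb(row1, k) * comb(row2, col1 - k)
        for k in range(lowest, highest + 1)
    }
    observed = weights[a]
    # Add up every table that is no more likely than the observed one.
    extreme = sum(w for w in weights.values() if w <= observed)
    return extreme / comb(total, col1)


# Hypothetical case 1: the two clusterings agree on all 12 batches.
print(round(fisher_exact_two_sided([[6, 0], [0, 6]]), 4))  # 0.0022
# Hypothetical case 2: the clusterings disagree on 2 of the 12 batches.
print(round(fisher_exact_two_sided([[5, 1], [1, 5]]), 4))  # 0.0801
```

With so few batches, even a small number of disagreements moves the P value above the conventional .05 threshold, which is one reason an exact test is preferred over a chi-square approximation for tables this small.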

JMIR Medical Informatics, volume 13, e78309 (2025-12-03). Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12712564/pdf/
Citations: 0
Medical Feature Extraction From Clinical Examination Notes: Development and Evaluation of a Two-Phase Large Language Model Framework.
IF 3.8 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS Pub Date: 2025-12-03 DOI: 10.2196/78432
Manal Abumelha, Abdullah Al-Malaise Al-Ghamdi, Ayman Fayoumi, Mahmoud Ragab
Background: Medical feature extraction from clinical text is challenging because of limited data availability, variability in medical terminology, and the critical need for trustworthy outputs. Large language models (LLMs) offer promising capabilities but face critical challenges with hallucination.

Objective: This study aims to develop a robust framework for medical feature extraction that enhances accuracy by minimizing the risk of hallucination, even with limited training data.

Methods: We developed a two-phase training approach. Phase 1 used instruction fine-tuning to teach feature extraction. Phase 2 introduced confidence-regularization fine-tuning with loss functions that penalize overconfident incorrect predictions, which were captured using bidirectional matching targeting hallucinated and missing features. The model was trained on the full set of 700 patient notes and on a few-shot subset of 100 patient notes. We evaluated the framework on the United States Medical Licensing Examination Step-2 Clinical Skills dataset, testing on a public split of 200 patient notes and a private split of 1839 patient notes. Performance was assessed using precision, recall, and F1-scores, with error analysis conducted on predicted features from the private test set.

Results: The framework achieved an F1-score of 0.968-0.983 on the full dataset of 700 patient notes and 0.960-0.973 with a few-shot subset of 100 of 700 patient notes (14.3%), outperforming INCITE (intelligent clinical text evaluator; F1=0.883) and DeBERTa (decoding-enhanced bidirectional encoder representations from transformers with disentangled attention; F1=0.958). It reduced hallucinations by 89.9% (from 3081 to 311 features) and missing features by 88.9% (from 6376 to 708) on the private dataset compared with the baseline LLM with few-shot in-context learning. Calibration evaluation on few-shot training (100 patient notes) showed that the expected calibration error increased from 0.060 to 0.147, whereas the Brier score improved from 0.087 to 0.036. Notably, the average model confidence remained stable at 0.84 (SD 0.003) despite F1 improvements from 0.819 to 0.986.

Conclusions: Our two-phase LLM framework successfully addresses critical challenges in automated medical feature extraction, achieving state-of-the-art performance while reducing hallucination and missing features. The framework's ability to achieve high performance with minimal training data (F1=0.960-0.973 with 100 samples) demonstrates strong generalization capabilities essential for resource-constrained settings in medical education. While traditional calibration metrics show misalignment, the practical benefits of confidence injection led to reduced errors, and inference-time filtering provided reliable outputs suitable for automated clinical assessment applications.

Clinical Trial: Not applicable. This study did not involve a clinical trial or prospective enrollment of human participants. Only retrospective, fully deidentified data were used.
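The calibration result reported above, where the expected calibration error (ECE) rises while the Brier score improves, can look contradictory. The sketch below shows one way both can happen when confidence stays fixed and correctness improves. Everything in it is an assumption for illustration: the function names are hypothetical, the 100 synthetic predictions all carry a confidence of 0.84, and the 82% and 99% correctness rates are chosen only to loosely echo the reported figures, not to reproduce the study's evaluation.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between average confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(mean_conf - accuracy)
    return ece


def brier_score(confidences, correct):
    """Mean squared difference between confidence and the 0/1 outcome."""
    return sum((c - ok) ** 2 for c, ok in zip(confidences, correct)) / len(correct)


# Synthetic predictions: confidence is fixed, only correctness changes.
confidences = [0.84] * 100
before = [1] * 82 + [0] * 18  # 82% of predictions correct
after = [1] * 99 + [0] * 1    # 99% of predictions correct

print(round(expected_calibration_error(confidences, before), 3))  # 0.02
print(round(expected_calibration_error(confidences, after), 3))   # 0.15
print(round(brier_score(confidences, before), 3))                 # 0.148
print(round(brier_score(confidences, after), 3))                  # 0.032
```

In this toy setting the model becomes underconfident as it becomes more accurate: the gap between confidence and accuracy (ECE) widens, while the squared error of each individual prediction (Brier score) shrinks because far fewer predictions are wrong.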
JMIR Medical Informatics, e78432 (2025-12-03). Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12712565/pdf/
Citations: 0
Identifying Key Variances in Clinical Pathways Associated With Prolonged Hospital Stays Using Machine Learning and ePath Real-World Data: Model Development and Validation Study.
IF 3.8 Zone 3 (Medicine) Q2 MEDICAL INFORMATICS Pub Date: 2025-12-01 DOI: 10.2196/71617
Saori Tou, Koutarou Matsumoto, Asato Hashinokuchi, Fumihiko Kinoshita, Yasunobu Nohara, Takanori Yamashita, Yoshifumi Wakata, Tomoyoshi Takenaka, Hidehisa Soejima, Tomoharu Yoshizumi, Naoki Nakashima, Masahiro Kamouchi

Background: Prolonged hospital stays can lead to inefficiencies in health care delivery and unnecessary consumption of medical resources.

Objective: This study aimed to identify key clinical variances associated with prolonged length of stay (PLOS) in clinical pathways using a machine learning model trained on real-world data from the ePath system.

Methods: We analyzed data from 480 patients with lung cancer (age: mean 68.3, SD 11.2 years; n=263, 54.8% men) who underwent video-assisted thoracoscopic surgery at a university hospital between 2019 and 2023. PLOS was defined as a hospital stay exceeding 9 days after video-assisted thoracoscopic surgery. The variables collected between admission and 4 days after surgery were examined, and those that showed a significant association with PLOS in univariate analyses (P<.01) were selected as predictors. Predictive models were developed using regularized linear regression methods (Lasso, ridge, and elastic net) and decision tree ensembles (random forest and extreme gradient boosting). The data were divided into derivation (earlier study period) and testing (later period) cohorts for temporal validation. Model performance was assessed using the area under the receiver operating characteristic curve, Brier score, and calibration plots. Counterfactual analysis was used to identify key clinical factors influencing PLOS.

Results: A 3D heatmap illustrated the temporal relationships between clinical factors and PLOS based on patient demographics, comorbidities, functional status, surgical details, care processes, medications, and variances recorded from admission to 4 days after surgery. Among the 5 algorithms evaluated, the ridge regression model demonstrated the best performance in terms of both discrimination and calibration. Specifically, it achieved area under the receiver operating characteristic curve values of 0.84 and 0.82 and Brier scores of 0.16 and 0.17 in the derivation and test cohorts, respectively. In the final model, a range of variables, including blood tests, care, patient background, procedures, and clinical variances, were associated with PLOS. Among these, particular emphasis was placed on clinical variances. Counterfactual analysis using the ridge regression model identified 6 key variables strongly linked to PLOS. In order of impact, these were abnormal respiratory sounds, postoperative fever, arrhythmia, impaired ambulation, complications after drain removal, and pulmonary air leaks.

Conclusions: A machine learning-based model using ePath data effectively identified critical variances in the clinical pathways associated with PLOS. This automated tool may enhance clinical decision-making and improve patient management.
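The temporal validation and discrimination metric described in the Methods above can be sketched in a few lines. This is an illustrative sketch only: the helper names, the record fields (`year`, `plos`, `risk`), the cutoff year, and all eight records are hypothetical, since the abstract does not state where the derivation and testing periods were divided.

```python
def temporal_split(records, first_test_year):
    """Earlier admissions form the derivation cohort, later ones the test cohort."""
    derivation = [r for r in records if r["year"] < first_test_year]
    testing = [r for r in records if r["year"] >= first_test_year]
    return derivation, testing


def roc_auc(labels, scores):
    """Probability that a random positive case is scored above a random
    negative case, counting ties as one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical records: admission year, observed PLOS (1 = yes), predicted risk.
records = [
    {"year": 2019, "plos": 0, "risk": 0.10},
    {"year": 2019, "plos": 1, "risk": 0.70},
    {"year": 2020, "plos": 0, "risk": 0.40},
    {"year": 2021, "plos": 1, "risk": 0.35},
    {"year": 2022, "plos": 0, "risk": 0.20},
    {"year": 2022, "plos": 1, "risk": 0.90},
    {"year": 2023, "plos": 0, "risk": 0.60},
    {"year": 2023, "plos": 1, "risk": 0.60},
]

derivation, testing = temporal_split(records, first_test_year=2022)
for name, cohort in (("derivation", derivation), ("testing", testing)):
    auc = roc_auc([r["plos"] for r in cohort], [r["risk"] for r in cohort])
    print(name, auc)
# derivation 0.75
# testing 0.875
```

Splitting by calendar time rather than at random means the test cohort is always later than the data used for fitting, which is closer to how a deployed model would meet new patients.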

JMIR Medical Informatics, volume 13, e71617 (2025-12-01). Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12706448/pdf/
Citations: 0