International Journal of Medical Informatics: Latest Articles

Development and validation of an Interpretable Machine learning model for Discriminating between benign and malignant breast cancer
IF 4.1 | Zone 2, Medicine | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-20 | DOI: 10.1016/j.ijmedinf.2026.106300
Zhichun Wang, Weixiang Liu, Lin Hua, Xiang Li, Guohui Xue

Objective

Breast cancer prognosis depends on early detection. We developed and externally validated a model using routine, readily available clinical and laboratory variables to discriminate malignant from benign breast lesions, aiming to reduce unnecessary biopsies and support early decision-making.

Methods

This retrospective two-center study included a development cohort (cohort 1) from Jiujiang First People’s Hospital (N = 745; malignant 573, benign 172) and an external cohort (cohort 2) from the First Affiliated Hospital of Nanchang University (N = 221; malignant 161, benign 60). Cohort 1 was randomly split 70:30 into training and test sets. Five-fold cross-validation was used to compare multiple algorithms and lock the model and hyperparameters; the locked model was then evaluated on the fixed test set and the external cohort. The primary metric was AUC; sensitivity, specificity, F1, Brier score, calibration curves, and decision curve analysis (DCA) were also assessed, and SHAP was used for explanation.
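
The evaluation metrics named in the Methods can be illustrated with a minimal sketch. It is not the authors' code: the labels and predicted probabilities are invented toy values, the 0.5 cut-off is an assumption, and the function names are hypothetical. AUC is computed as the fraction of positive/negative pairs ranked correctly (ties count one half), and the Brier score as the mean squared difference between predicted probability and outcome.

```python
def auc(y_true, y_prob):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def sensitivity_specificity(y_true, y_prob, threshold=0.5):
    """Classify as positive when probability >= threshold (threshold is an assumption)."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data, invented for illustration only (1 = malignant, 0 = benign).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.55]

print(auc(y_true, y_prob))
print(brier_score(y_true, y_prob))
print(sensitivity_specificity(y_true, y_prob))
```

In practice a library implementation would normally be preferred; the pairwise AUC above is quadratic in sample size and is meant only to make the definition concrete.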

Results

Logistic regression was selected, using Age, TT, APTT, CEA, and Ca. Cross-validated AUCs were 0.910 (training) and 0.905 (internal validation). The fixed test set yielded an AUC of 0.865 (sensitivity 0.802; specificity 0.712; F1 0.849; Brier score 0.112). External validation achieved an AUC of 0.861, specificity of 0.883, and PPV of 0.934. DCA showed net benefit over the “treat-all” and “treat-none” strategies across threshold probabilities of 20%–95%. SHAP identified Age, TT, CEA, APTT, and Ca as the dominant contributors.
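
The net-benefit comparison behind decision curve analysis can be sketched as follows. This is an illustration, not the authors' analysis: the data are invented toy values and the function names are hypothetical. It uses the standard definition, net benefit = TP/N − (FP/N) × pt/(1 − pt) at threshold probability pt; “treat-none” has a net benefit of 0 by definition.

```python
def net_benefit_model(y_true, y_prob, pt):
    """Net benefit of intervening on everyone whose predicted risk is >= pt."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= pt and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= pt and y == 0)
    return tp / n - (fp / n) * pt / (1 - pt)


def net_benefit_treat_all(y_true, pt):
    """Net benefit of intervening on every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * pt / (1 - pt)


# Toy data, invented for illustration only (1 = malignant, 0 = benign).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.55]

for pt in (0.2, 0.5, 0.8):
    model = net_benefit_model(y_true, y_prob, pt)
    treat_all = net_benefit_treat_all(y_true, pt)
    print(f"pt={pt:.2f}  model={model:.3f}  treat-all={treat_all:.3f}  treat-none=0.000")
```

A model is useful at a given threshold only when its net benefit exceeds both reference strategies, which is the sense in which the abstract reports benefit across a range of thresholds.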

Conclusions

A logistic model based on routine laboratory variables effectively distinguishes malignant from benign breast lesions, with robust external performance and clear clinical net benefit, enabling early risk stratification and fewer unnecessary biopsies. This study proposes a tool that quantifies breast tumor malignancy risk using only objective indicators, without subjective factors. Online tool: prediction-for-bc.shinyapps.io/dynnomapp/.

Volume 210, Article 106300 | Citations: 0

Case-based reasoning for clinical trial recruitment tools in oncology: When you need patients to find patients
IF 4.1 | Zone 2, Medicine | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-19 | DOI: 10.1016/j.ijmedinf.2026.106301
Lou-Anne Guillotel, Thierry Lesimple, Oussama Zekri, Marc Cuggia, Boris Campillo-Gimenez

Background

Patient recruitment for clinical trials remains a major challenge, with 86% of trials failing to meet enrollment targets on time. In over 77% of cases, recruitment difficulties stem from matching problems between trials and patients. Case-Based Reasoning (CBR) offers a distinct patient-to-patient approach by determining eligibility through comparison with previously enrolled patients, yet this methodology remains underexplored in contemporary oncology trial matching despite its potential advantages.

Objective

To compare the performance of two CBR approaches—random forest (RF) and target patient similarity (TPS)—in predicting patient eligibility for recent oncology clinical trials using real-world electronic health record data.

Methods

We selected three breast cancer clinical trials (2019–2022) from our institutional registry. Patient data were extracted from our clinical data warehouse, including structured data (laboratory results, diagnosis codes, procedures, treatments) and unstructured clinical narratives processed using natural language processing. For each trial, we trained RF classifiers and TPS models using repeated hold-out validation (25 splits, 70/30 train-test). Performance was evaluated using discriminative metrics (AUC, positive precision, recall, F1-score) and ranking metrics (P@5, P@10, MAP, MRR, NDCG@5, NDCG@10). We analyzed model performance across varying numbers of eligible patients in the training datasets (2% to 70% of the total number of eligible patients).
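
The ranking metrics listed above can be made concrete with a small sketch. It assumes binary relevance (1 = eligible patient, 0 = not eligible) and a single ranked list; the relevance vector is invented and the function names are hypothetical. MAP and MRR would then be the means of average precision and reciprocal rank over the repeated splits.

```python
import math


def precision_at_k(rels, k):
    """Share of the top-k ranked patients who are truly eligible."""
    return sum(rels[:k]) / k


def reciprocal_rank(rels):
    """1 / rank of the first eligible patient, or 0.0 if none is retrieved."""
    for rank, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def average_precision(rels):
    """Mean of precision values taken at each rank where an eligible patient appears."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def ndcg_at_k(rels, k):
    """Discounted cumulative gain at k, normalised by the best achievable ordering."""
    def dcg(values):
        return sum(v / math.log2(rank + 1) for rank, v in enumerate(values, start=1))
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal else 0.0


# Toy ranked list, invented for illustration: position 1 is the top-ranked patient.
ranked_relevance = [0, 1, 1, 0, 1, 0, 0, 0, 0, 0]

print(precision_at_k(ranked_relevance, 5))
print(reciprocal_rank(ranked_relevance))
print(average_precision(ranked_relevance))
print(ndcg_at_k(ranked_relevance, 5))
```

Unlike AUC, these metrics reward placing eligible patients at the very top of the list, which is why two models with similar AUCs can differ sharply on P@5 or MRR.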

Results

Both approaches demonstrated strong discriminative performance across three trials, with average AUCs of 84.1% for RF and 76.4% for TPS, driven primarily by high recall (82.3% and 77.7%, respectively). However, positive precision remained low (13.3% and 9.9%), reflecting high false-positive rates due to class imbalance. RF showed superior ranking performance, particularly for the trial with the largest eligible cohort (n = 542; P@5 = 78.6%, MRR = 88.0%), compared to TPS (P@5 = 47.9%, MRR = 69.2%). Both approaches reached performance plateaus with only around 10 eligible patients in training datasets. Variable importance analysis revealed that treatment-related features, diagnostic codes, and procedures were consistently the most important predictors, with relevant patterns identified even with minimal training data.

Conclusions

CBR approaches can effectively support patient pre-screening for oncology clinical trials, with RF demonstrating moderately superior performance over TPS. Both methods show robust discriminative performance with small training datasets, though ranking performance varies substantially across trials. Our findings suggest that CBR approaches may benefit from integration with query-based or prompt-based methods during early recruitment phases when training data is scarce.

Volume 211, Article 106301 | Citations: 0

Interpretable machine learning model for predicting in-hospital mortality in elderly acute pancreatitis: Development and validation in a multicenter cohort
IF 4.1 | Zone 2, Medicine | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-18 | DOI: 10.1016/j.ijmedinf.2026.106299
Hao He, Li Luo, Lei Bai, Lei Luo, Kunming Tian, Xiaoyun Fu, Bao Fu

Background

Elderly acute pancreatitis (AP) patients face significantly higher in-hospital all-cause mortality, highlighting the need for effective risk stratification to support timely clinical decision-making.

Methods

We conducted a multicenter retrospective study of 2,728 elderly AP patients, from which we developed and validated a machine learning (ML) model for predicting in-hospital all-cause mortality. We first selected predictors of mortality using LASSO regression and the random forest–based Boruta algorithm. Seven ML models incorporating the selected predictors were then trained and evaluated using the area under the receiver operating characteristic curve (AUC).

Results

XGBoost demonstrated the highest predictive performance, achieving an AUC of 0.884 (95% CI: 0.823–0.945) in external validation and outperforming the conventional Ranson score in predicting in-hospital mortality. Shapley additive explanations ranked vasoactive drug, hospital length of stay, leukocyte count, noninvasive ventilation, and invasive mechanical ventilation as the five key predictors. An interactive web-based tool based on the optimal XGBoost model is available at https://appredction.shinyapps.io/acutepancreatitis_xgb/ to generate real-time risk predictions.

Conclusions

This study proposed a validated and interpretable ML model to support in-hospital risk stratification for elderly patients with AP, thereby facilitating clinical decision-making and optimizing intensive care unit resource allocation.

Volume 209, Article 106299 | Citations: 0

A structured decision-support framework for selecting imputation methods in clinical structured datasets: A secondary analysis
IF 4.1 | Zone 2, Medicine | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-18 | DOI: 10.1016/j.ijmedinf.2026.106298
Marziyeh Afkanpour, Mehri Momeni, Hamed Tabesh

Objective

Missing values are a common challenge in healthcare data analysis, and inadequate handling can introduce bias and undermine the validity of findings. Imputation methods offer a practical solution, but selecting an appropriate approach depends on multiple dataset-specific factors. This study proposes a structured decision-support framework that defines key prerequisites for choosing suitable imputation methods during the preprocessing of clinical structured datasets.

Methods

A secondary analysis of a previous systematic review was conducted, covering 69 studies to identify factors influencing imputation method selection. Domain experts evaluated assumptions regarding missing data characteristics and dataset structure, reaching consensus on the most relevant factors. These factors were synthesized into a structured framework designed to guide systematic and transparent imputation method selection in clinical data preprocessing workflows.

Results

Nine key factors were identified as essential for determining an appropriate imputation method. These include missing data characteristics, mechanism, pattern, and ratio, as well as dataset attributes such as data type, variable role, distribution, and correlation. The ratio of missingness was the most influential factor, followed by variable role and missing value mechanism. Most studies emphasized the combined importance of both missing data properties and dataset features in imputation selection.
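
Two of the factors named above, the ratio and the pattern of missingness, along with a coarse data type, can be profiled directly from a dataset before any imputation method is chosen. The sketch below is an illustration under stated assumptions, not the authors' framework: the records, column names, and function names are invented, and missing values are assumed to be encoded as `None`. It deliberately stops at profiling and does not recommend a method.

```python
from collections import Counter


def profile_missingness(records):
    """Per-variable missing ratio and a coarse data type (numeric vs categorical)."""
    columns = sorted({key for row in records for key in row})
    profile = {}
    for col in columns:
        values = [row.get(col) for row in records]
        observed = [v for v in values if v is not None]
        numeric = bool(observed) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in observed
        )
        profile[col] = {
            "missing_ratio": 1 - len(observed) / len(values),
            "data_type": "numeric" if numeric else "categorical",
        }
    return profile


def missingness_patterns(records):
    """Count rows by which columns are missing together (the missingness pattern)."""
    columns = sorted({key for row in records for key in row})
    return Counter(
        tuple(col for col in columns if row.get(col) is None) for row in records
    )


# Toy records, invented for illustration only.
records = [
    {"age": 71, "sex": "F", "crp": 12.5},
    {"age": 64, "sex": None, "crp": None},
    {"age": None, "sex": "M", "crp": 8.1},
    {"age": 80, "sex": "M", "crp": None},
]

for column, summary in profile_missingness(records).items():
    print(column, summary)
print(missingness_patterns(records))
```

The missingness mechanism (for example, whether values are missing at random) cannot be read off the data in this way and generally needs domain knowledge, which is consistent with the expert consensus step described in the Methods.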

Conclusions

Understanding the characteristics of missing values and dataset structure is crucial for selecting appropriate imputation methods. The proposed structured decision-support framework provides an evidence-based checklist to enhance transparency, reproducibility, and reliability in preprocessing clinical datasets within medical informatics workflows.

Volume 210, Article 106298 | Citations: 0

Clinician preferences for explainable AI in critical care: a comparative study of interpretable models and visualizations for intubation decision support
IF 4.1 | Zone 2, Medicine | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-01-18 | DOI: 10.1016/j.ijmedinf.2026.106287
Tiantian Xian, Nikolay Mehandjiev, Panos Constantinides, Yu-wang Chen, Qudamah Quboa, Gareth Kitchen

Background:

The complexity of many AI models hinders their clinical adoption because the clinicians using them do not regard them as transparent. This study addresses the lack of clinician-centered explainable AI (XAI) interfaces by designing and evaluating intuitive visual explanations for intubation prediction, testing the hypothesis that workflow-compatible designs enhance acceptance.

Objective:

This study compares three time-aware visual explanations for XAI-based intubation prediction and evaluates their acceptance, comprehension, and perceived utility among clinicians.

Methods:

Using ICU time-series data, we developed machine learning models to estimate the near-term risk of deterioration in a patient’s condition that may lead to mechanical intubation. We generated global and local explanations using SHAP and designed three customized visual formats: a temporal force plot, a temporal bar chart, and a dual-encoded SHAP heatmap. Clinicians (n = 206) evaluated comprehension and usability using objective questions and a Likert-based survey.
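
The local explanations mentioned here rest on Shapley values, which attribute one prediction to individual features. The sketch below computes exact Shapley values by brute force for a tiny model against a single baseline input. It is an illustration only and not the SHAP library or the authors' pipeline: the linear `predict` function, the feature values, and the baseline are all invented, and absent features are simply replaced by their baseline value.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one input, replacing absent features with baseline values."""
    n = len(x)

    def value(present):
        mixed = [x[i] if i in present else baseline[i] for i in range(n)]
        return predict(mixed)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                total += weight * (value(present | {i}) - value(present))
        contributions.append(total)
    return contributions


def predict(z):
    """Hypothetical risk score, invented for illustration."""
    return 0.5 + 2.0 * z[0] - 1.0 * z[1] + 0.5 * z[2]


x = [1.0, 3.0, 2.0]          # the patient being explained (toy values)
baseline = [0.0, 1.0, 0.0]   # reference input (toy values)

phi = shapley_values(predict, x, baseline)
print(phi)
print(sum(phi), predict(x) - predict(baseline))
```

The contributions sum to the difference between the prediction and the baseline prediction, which is the additivity property that force plots and heatmaps visualize. Enumeration is exponential in the number of features, so practical tools rely on approximations or model-specific algorithms.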

Results:

Using data from 4608 critically ill patients (10 medical variables over 7 hours per patient), the Random Forest (RF) model achieved the highest area under the curve (AUC), 0.94. The local explanations were then customized and evaluated by 206 clinicians through a survey conducted on the Prolific platform. A customized heatmap representation was selected as the visualization with the highest perceived clinical utility and alignment with clinical workflows.

Discussion:

The reported findings support the need for explanation formats to be tailored to clinical reasoning and task context, supporting the concept of cognitive fit. The heatmap’s close alignment with clinicians’ mental models and its graphical integrity enhance interpretability and trust. This study demonstrates that explanation effectiveness depends on contextual relevance, rather than a universal standard, and that the presentation format itself significantly shapes clinicians’ trust in XAI systems.

Conclusion:

This study advances clinical XAI by introducing a time-aware explanation framework for ICU intubation decisions. By integrating temporal trends with model reasoning, our visualizations closely align with clinicians’ cognitive workflows. Rigorous clinician-centered evaluation identified the dual-encoded SHAP heatmap as the most useful and workflow-compatible visualization, highlighting the importance of explanation design alongside predictive accuracy for clinical adoption.
背景:许多人工智能模型的复杂性阻碍了它们的临床应用,因为使用它们的临床医生并不认为它们是透明的。本研究通过设计和评估插管预测的直观视觉解释,解决了缺乏以临床为中心的可解释AI (XAI)界面的问题,验证了工作流兼容设计提高接受度的假设。目的:本研究比较了基于xai的插管预测的三种具有时间意识的视觉解释,并评估了它们在临床医生中的接受程度、理解程度和感知效用。方法:我们开发了机器学习模型来估计患者病情恶化的近期风险,这可能导致使用ICU时间序列数据进行机械插管。我们使用SHAP生成了全局和局部解释,并设计了三种定制的视觉格式——时间力图、时间条形图和双编码SHAP热图。临床医生(n = 206)使用客观问题和李克特调查评估理解和可用性。结果:基于4608例危重患者,10个医学变量,每个患者7小时的数据,随机森林(Random Forest, RF)模型的曲线下面积(AUC)最高,为0.94。此外,通过在多产平台上进行的调查,206名临床医生对当地的解释进行了定制和评估。选择自定义热图表示作为具有最高临床效用和与临床工作流程一致的可视化。讨论:报告的研究结果支持需要根据临床推理和任务背景量身定制解释格式,支持认知契合的概念。热图与临床医生的心理模型密切一致,其图形完整性增强了可解释性和信任度。本研究表明,解释的有效性取决于上下文相关性,而不是通用标准,并且演示格式本身显著地影响了临床医生对XAI系统的信任。结论:本研究通过引入ICU插管决策的时间意识解释框架来推进临床XAI。通过将时间趋势与模型推理相结合,我们的可视化与临床医生的认知工作流程紧密结合。严格的以临床医生为中心的评估确定了双编码的SHAP热图是最有用的和工作流程兼容的可视化,强调了解释设计和临床采用预测准确性的重要性。
{"title":"Clinician preferences for explainable AI in critical care: a comparative study of interpretable models and visualizations for intubation decision support","authors":"Tiantian Xian ,&nbsp;Nikolay Mehandjiev ,&nbsp;Panos Constantinides ,&nbsp;Yu-wang Chen ,&nbsp;Qudamah Quboa ,&nbsp;Gareth Kitchen","doi":"10.1016/j.ijmedinf.2026.106287","DOIUrl":"10.1016/j.ijmedinf.2026.106287","url":null,"abstract":"<div><h3>Background:</h3><div>The complexity of many AI models hinders their clinical adoption because the clinicians using them do not regard them as transparent. This study addresses the lack of clinician-centered explainable AI (XAI) interfaces by designing and evaluating intuitive visual explanations for intubation prediction, testing the hypothesis that workflow-compatible designs enhance acceptance.</div></div><div><h3>Objective:</h3><div>This study compares three, time-aware, visual explanations for XAI-based intubation prediction and evaluate their acceptance, comprehension, and perceived utility among clinicians.</div></div><div><h3>Methods:</h3><div>We developed machine learning models to estimate the near-term risk of deterioration in the patient’s condition which may lead to mechanical intubation using ICU time-series data. We generated global and local explanations using SHAP and designed three customized visual formats—a temporal force plot, a temporal bar chart, and a dual-encoded SHAP heatmap. Clinicians (<em>n</em> = 206) evaluated comprehension and usability using objective questions and a Likert-based survey.</div></div><div><h3>Results:</h3><div>Based on 4608 critically ill patients with 10 medical variables over 7 hours of data for each patient, the Random Forest (RF) model achieved the highest area under the curve (AUC): 0.94. Furthermore, the local explanations were customized and evaluated by 206 clinicians through a survey conducted on the Prolific platform. 
A customized heatmap representation was selected as the visualization with the highest perceived clinical utility and alignment with clinical workflows.</div></div><div><h3>Discussion:</h3><div>The reported findings support the need for explanation formats to be tailored to clinical reasoning and task context, supporting the concept of cognitive fit. The heatmap’s close alignment with clinicians’ mental models and its graphical integrity enhances interpretability and trust. This study demonstrates that explanation effectiveness depends on contextual relevance, rather than a universal standard, and that the presentation format itself significantly shapes clinicians’ trust in XAI systems.</div></div><div><h3>Conclusion:</h3><div>This study advances clinical XAI by introducing a time-aware explanation framework for ICU intubation decisions. By integrating temporal trends with model reasoning, our visualizations closely align with clinicians’ cognitive workflows. Rigorous clinician-centered evaluation identified the dual-encoded SHAP heatmap as the most useful and workflow-compatible visualization, highlighting the importance of explanation design alongside predictive accuracy for clinical adoption.</div></div>","PeriodicalId":54950,"journal":{"name":"International Journal of Medical Informatics","volume":"210 ","pages":"Article 106287"},"PeriodicalIF":4.1,"publicationDate":"2026-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Rule-augmented constraint learning for semantic error detection in MIMIC-III knowledge graph
IF 4.1 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-17 DOI: 10.1016/j.ijmedinf.2026.106297
Özge Noben, Ömer Durukan Kılıç, Tjitze Rienstra, Michel Dumontier, Remzi Celebi
High-quality, error-free data is essential for developing reliable data-driven models, particularly in clinical decision support systems where inaccurate predictions can have serious consequences. While knowledge graphs (KGs) offer a structured and semantically rich representation of clinical data, ensuring their consistency and correctness remains a challenge. Existing rule mining techniques can automatically extract logical constraints from KGs, but they often produce redundant or clinically irrelevant rules, especially when dealing with numeric or categorical literals such as age or lab values. KG constraints (rules intended to capture implausible or conflicting facts in the KG) can be used to spot semantic errors: facts that might conform to the underlying schema but contradict domain knowledge. In this work, we propose a novel framework for constraint learning in clinical KGs that identifies high-confidence rules and transforms them into clinically plausible constraints. We propose two approaches, based on class disjointness and literal clustering combined with rule mining. We validate the clinical relevance of the generated rules using expert-curated constraints and large language models (LLMs). The results on the MIMIC-III clinical dataset show that constraint learning based on rule filtering effectively preserves clinically meaningful rules that align with established medical knowledge. For numeric data, we achieve reliable value groupings through our clustering-based method, and the rules derived from these groupings were validated by LLMs, whose outputs confirm the clinical relevance of a portion of the discovered rules. By providing interpretable and scalable solutions to semantic inconsistencies in KGs, this study contributes to increasing KG trustworthiness and clinical usability.
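To make the idea of a class-disjointness constraint concrete, the following is a minimal sketch and not the authors' framework. The triples, the class names (`Newborn`, `Adult`), and the helper `find_disjointness_violations` are hypothetical illustrations. It shows how a fact can be schema-valid yet flagged as a semantic error because one entity carries two classes declared disjoint.

```python
from collections import defaultdict

# Hypothetical toy triples (subject, predicate, object); not taken from MIMIC-III.
TRIPLES = [
    ("patient_1", "rdf:type", "Newborn"),
    ("patient_1", "rdf:type", "Adult"),
    ("patient_2", "rdf:type", "Adult"),
]

# Hypothetical disjointness constraint: no entity may belong to both classes.
DISJOINT_CLASSES = [frozenset({"Newborn", "Adult"})]


def find_disjointness_violations(triples, disjoint_pairs):
    """Return (entity, class_a, class_b) for every violated disjointness constraint."""
    classes_of = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "rdf:type":
            classes_of[subject].add(obj)

    violations = []
    for entity, classes in sorted(classes_of.items()):
        for pair in disjoint_pairs:
            if pair <= classes:  # the entity carries both disjoint classes
                class_a, class_b = sorted(pair)
                violations.append((entity, class_a, class_b))
    return violations


violations = find_disjointness_violations(TRIPLES, DISJOINT_CLASSES)
print(violations)
```

In this toy graph only `patient_1` is flagged. In a mined setting, the disjoint pairs would come from high-confidence rules rather than being written by hand.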
Citations: 0
Enhancing diabetes monitoring systems’ reports: A novel integrated diabetes report (IDR)
IF 4.1 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-17 DOI: 10.1016/j.ijmedinf.2026.106288
Tahmineh Aldaghi, Robert Bem, Jan Muzik

Aim

Individuals with diabetes require continuous self-management. Diabetes monitoring systems generate structured reports that help individuals and healthcare providers interpret data and optimize treatment strategies. This study aimed to design and validate an Integrated Diabetes Report (IDR) that improves the clarity, usability, and clinical relevance of diabetes data visualizations.

Method

A review of 13 diabetes monitoring systems revealed five main report categories: overlay, logbook, device-specific, daily, and overview reports. While the overview report was the most frequently used, it lacked comprehensive visualization and essential clinical metrics. To address these gaps, a multidisciplinary panel of four experts collaborated to design a more integrated reporting framework.

Results

Across systems, glucose statistics were included in all reports, followed by insulin data (in 12 systems), carbohydrate intake (in 6 systems), hypo-hyperglycemic indices (in 2 systems), sleep indices (in 2 systems), and medication details (in 1 system). Key gaps included minimal data on physical activity, limited documentation of carbohydrates, and the absence of consolidated insulin visualization. The IDR introduces a complications section, an integrated graph combining AGP with basal and bolus insulin, and an advanced insulin profile comparing seven calculated indices.

Conclusion

The IDR improves clinical interpretation, supports treatment decisions, and enhances risk assessment for diabetes management.
Citations: 0
Beyond binary diagnosis: Key questions on AI accuracy, real-world applicability, and safety in clinical decision support
IF 4.1 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-17 DOI: 10.1016/j.ijmedinf.2026.106292
Jin Ye
This comment relates to Kücking et al.’s (2026) study on the bidirectional effects of artificial intelligence recommendations and healthcare provider-related factors on the accuracy of wound impregnation diagnosis. While acknowledging the valuable contributions of this research, including its distinction between correct and incorrect artificial intelligence outputs, its rigorous simulation design, and its emphasis on clinical safety, we raise key questions to enhance the interpretation of the results and their real-world translation. The main focuses include the moderating role of artificial intelligence system accuracy in automation bias, external effectiveness in real clinical environments, potential mechanisms for gender differences in diagnostic performance, the impact of visual cue design on decision-making, and the potential of explainable artificial intelligence (XAI) in risk mitigation. This review aims to promote further research and facilitate the safe and effective integration of artificial intelligence-based clinical decision support systems (CDSS) into clinical practice.
Citations: 0
Less time Coding, more time Caring: Performance evaluation of ChatGPT-5 for ICD-10 coding of radiology reports
IF 4.1 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-17 DOI: 10.1016/j.ijmedinf.2026.106296
Tristan Ruhwedel, Julian M.M. Rogasch, Paul Martin Dahlke, Seyd Shnayien, Christian Furth, Christoph Wetz, Holger Amthauer, Imke Schatka, Nick Lasse Beetz

Introduction

Radiologists worldwide face a high administrative workload. ICD-10 coding is mandatory for reimbursement in many health systems and a frequent source of billing errors. Large language models have shown promise in supporting coding-related tasks, but previous studies with earlier ChatGPT versions reported mixed results, and evidence specific to radiology reports remains scarce. We therefore aimed to investigate whether ChatGPT-5 can be consulted when assigning ICD-10 codes to radiology reports and whether this leads to a measurable time advantage.

Methods

A total of 2,738 fictitious radiology reports across multiple modalities were derived from the PARROT database. Additionally, 100 fictitious PET/CT reports were created. Each report was assigned a single, most relevant ICD-10 code using ChatGPT-5. For PARROT, ChatGPT-derived codes were compared with predefined database reference labels. For PET/CT, ChatGPT-derived codes were compared with codes assigned by an independent manual coder. Exact and character-level concordance were assessed. In cases of discordance, a blinded adjudicator selected the most accurate ICD-10 code. Coding efficiency was evaluated for PET/CT reports by measuring coding time per report.
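The difference between exact and character-level concordance can be sketched as follows. This is only an illustration under assumptions: the abstract does not define the comparison procedure, so the normalization step, the split into first / second-third / remaining characters, and the function names `normalize` and `concordance` are hypothetical choices.

```python
def normalize(code: str) -> str:
    """Upper-case an ICD-10 code and drop the dot, e.g. 'c34.1' -> 'C341'."""
    return code.strip().upper().replace(".", "")


def concordance(predicted: str, reference: str) -> dict:
    """Compare two ICD-10 codes exactly and by character position."""
    p, r = normalize(predicted), normalize(reference)
    return {
        "exact": p == r,
        "first_character": p[:1] == r[:1],           # chapter letter
        "second_third_characters": p[1:3] == r[1:3],  # category digits
        "last_characters": p[3:] == r[3:],            # subcategory after the dot
    }


# Two codes that agree on the three-character category but not the subcategory.
result = concordance("C34.1", "C34.9")
print(result)
```

A pair such as `C34.1` versus `C34.9` counts as an exact mismatch while still agreeing at the first three character positions, which is the kind of partial agreement a character-level analysis is meant to capture.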

Results

For PARROT, exact-code concordance was 1,590/2,738 (58.1 %). In a random subset of 200 mismatches, blinded adjudication preferred the ChatGPT-derived code in 123 cases and the reference label in 77 cases (p = 0.0015). Coding non-English reports resulted in significantly lower concordance (first character: p = 0.002; second/third characters: p < 0.001; last characters: p = 0.012) and longer coding times than English reports (p = 0.002). Regarding PET/CT reports, median coding time was 8 s with ChatGPT and 135 s without. The median time saved was 127 s per report.
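The headline figures can be rechecked with a short calculation. The abstract does not say which statistical test produced p = 0.0015, so the exact two-sided binomial test against a 50:50 null used below is an assumption, and `two_sided_binomial_p` is a hypothetical helper written only with the standard library.

```python
from math import comb


def two_sided_binomial_p(successes: int, trials: int) -> float:
    """Exact two-sided binomial test against a 50:50 null hypothesis."""
    # Under p = 0.5 the distribution is symmetric, so double the tail
    # that is at least as extreme as the observed count.
    k = max(successes, trials - successes)
    tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2**trials
    return min(1.0, 2 * tail)


# Exact-code concordance reported for the PARROT reports.
exact_rate = 1590 / 2738

# Adjudication outcome: 123 of 200 mismatches favoured the ChatGPT-derived code.
p_value = two_sided_binomial_p(123, 200)

print(f"exact concordance: {exact_rate:.1%}")
print(f"two-sided p: {p_value:.4f}")
```

Under this assumed test the p-value falls between 0.001 and 0.002, the same range as the reported value. A different test choice could shift the third decimal place.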

Conclusion

Applied to daily clinical care, higher code correctness might reduce billing errors, while saved time could be reallocated to patient care. Radiologists should collaborate with developers to create versions of LLMs that operate within data-secure environments.
Citations: 0
Biometric Data in Post-Traumatic Stress Disorder Detection: A Scoping Review of Digital Health Applications
IF 4.1 Zone 2 (Medicine) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2026-01-15 DOI: 10.1016/j.ijmedinf.2026.106289
Phue Thet Khaing, Masaharu Nakayama

Context

Post-traumatic stress disorder (PTSD) is mainly assessed through self-reports and clinician interviews, which can delay recognition and limit reach. Biometric markers captured using digital technologies may enable earlier and more objective detection.

Purpose

To map biometric modalities used for PTSD detection in digital health, identify underused markers, characterise machine learning (ML)/artificial intelligence (AI) approaches, and assess sex-related analyses.

Methods

Following PRISMA-ScR, we pre-registered a protocol on the Open Science Framework and searched PubMed, IEEE Xplore, and Google Scholar (2015–2025). The full search string was: (“post-traumatic stress disorder” OR “PTSD”) AND (“biometric data” OR “biosensor” OR “wearable technology”) AND (“detection” OR “screening” OR “diagnosis” OR “monitoring”) AND (“digital health” OR “mobile health” OR “AI-based” OR “machine learning”). Peer-reviewed human studies using biometric data with digital tools and/or ML/AI for PTSD detection were eligible. Of 3,312 records, 89 underwent full-text review, and 18 studies met the inclusion criteria.
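The structure of the search string (synonyms joined with OR inside each concept group, and the groups joined with AND) can be reproduced programmatically. This is a minimal sketch: the names `TERM_GROUPS` and `build_query` are hypothetical, and straight quotes replace the typographic quotes of the published string.

```python
# Concept groups taken from the search string reported in the Methods.
TERM_GROUPS = [
    ["post-traumatic stress disorder", "PTSD"],
    ["biometric data", "biosensor", "wearable technology"],
    ["detection", "screening", "diagnosis", "monitoring"],
    ["digital health", "mobile health", "AI-based", "machine learning"],
]


def build_query(groups):
    """Join synonyms with OR inside a group, then join the groups with AND."""
    clauses = []
    for terms in groups:
        quoted = " OR ".join(f'"{term}"' for term in terms)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)


query = build_query(TERM_GROUPS)
print(query)
```

Keeping the terms in a list of groups makes it easy to add a synonym or adapt the syntax per database without retyping the whole string.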

Analysis

Data were categorised by biometric modality, digital platform (wearable devices, mobile applications, ML/AI systems), study population, and performance metrics (area under the curve, sensitivity, specificity). Findings were grouped thematically (physiological, neuroimaging, behavioural, genetic, multimodal) and synthesised narratively to identify trends, gaps, and the application of sex-stratified modelling.

Results

Most studies focused on physiological (e.g., heart rate, sleep) and neuroimaging (functional magnetic resonance imaging, electroencephalography) signals; behavioural and genetic modalities were underexplored. Data were frequently captured via wearables and mobile platforms, with ML commonly applied. Performance reporting was uneven, sex-stratified analyses were rare, and several promising modalities (e.g., eye-tracking, electrodermal activity) remain underused.

Conclusion

Digital biometric approaches can detect PTSD; however, progress has been slowed by heterogeneous study designs, inconsistent reporting, and limited attention to sex differences. Establishing common reporting standards, evaluating multimodal models in real-world settings, and developing algorithms incorporating sex for more equitable screening are warranted.
Citations: 0