Unmasking the chameleons: A benchmark for out-of-distribution detection in medical tabular data
Mohammad Azizmalayeri, Ameen Abu-Hanna, Giovanni Cinà
International Journal of Medical Informatics, volume 195, Article 105762. Published 2024-12-17. Impact factor: 3.7 (JCR Q2, Computer Science, Information Systems).
DOI: 10.1016/j.ijmedinf.2024.105762
https://www.sciencedirect.com/science/article/pii/S1386505624004258
Abstract
Background
Machine learning (ML) models often struggle to generalize to data that deviates from the training distribution. This raises significant concerns about the reliability of real-world healthcare systems that encounter such inputs, known as out-of-distribution (OOD) data. These concerns can be addressed by detecting OOD inputs in real time. While numerous OOD detection approaches have been proposed in other fields, especially computer vision, it remains unclear whether similar methods effectively address the challenges posed by medical tabular data.
Objective
To answer this important question, we propose an extensive reproducible benchmark to compare different OOD detection methods in medical tabular data across a comprehensive suite of tests.
Method
To achieve this, we use four large public medical datasets, including eICU and MIMIC-IV, and consider various kinds of OOD cases within these datasets. For example, we examine OOD data originating from a dataset that is statistically different from the training set according to the membership model introduced by Debray et al. [1], as well as OOD data obtained by splitting a given dataset on the value of a distinguishing variable. To identify OOD instances, we explore 10 density-based methods that learn the marginal distribution of the data, alongside 17 post-hoc detectors that are applied on top of prediction models already trained on the data. The prediction models use three distinct architectures: MLP, ResNet, and Transformer.
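The two detector families described above can be illustrated with a minimal sketch. The abstract does not name the specific detectors used, so the two below are stand-ins chosen for simplicity: a diagonal Gaussian as the density-based detector, and one minus the maximum softmax probability as the post-hoc detector. All function names and toy values are assumptions for illustration, not the benchmark's code.

```python
import math
from statistics import fmean, pstdev


def fit_diagonal_gaussian(rows):
    """Estimate a per-feature mean and standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # A small floor avoids division by zero for constant features.
    stds = [max(pstdev(col), 1e-6) for col in columns]
    return means, stds


def density_ood_score(row, means, stds):
    """Negative log-likelihood under the fitted Gaussian; higher means more OOD."""
    nll = 0.0
    for x, mu, sigma in zip(row, means, stds):
        z = (x - mu) / sigma
        nll += 0.5 * z * z + math.log(sigma) + 0.5 * math.log(2 * math.pi)
    return nll


def msp_ood_score(logits):
    """Post-hoc score: one minus the maximum softmax probability of a classifier."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    return 1.0 - max(exps) / sum(exps)


# Toy in-distribution rows: (age in years, heart rate in bpm); values are made up.
train = [(60, 80), (65, 85), (70, 75), (55, 90), (62, 82)]
means, stds = fit_diagonal_gaussian(train)

in_dist_score = density_ood_score((63, 83), means, stds)
shifted_score = density_ood_score((5, 150), means, stds)
```

The density score needs only the feature values, whereas the post-hoc score needs the logits of a prediction model that has already been trained, which is the distinction the Method paragraph draws between the two families.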
Main results
In our experiments, when the membership model achieved an AUC of 0.98, indicating a clear distinction between the OOD data and the training set, the OOD detection methods achieved AUC values exceeding 0.95 in distinguishing OOD data. In contrast, in experiments with subtler changes in the data distribution, such as selecting OOD data based on ethnicity and age, many OOD detection methods performed similarly to a random classifier, with AUC values close to 0.5. This may suggest a correlation between separability, as indicated by the membership model, and OOD detection performance, as indicated by the AUC of the detection model. This warrants future research.
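To make the AUC figures concrete, here is a minimal sketch of how an AUC can be computed from detector scores, using the pairwise (Mann-Whitney) formulation. The scores are hypothetical and are not results from the paper; the function name `auroc` is also an assumption. The same computation would apply to a membership model, assuming its output is a score for how likely a row is to come from the second dataset.

```python
def auroc(in_scores, ood_scores):
    """Probability that a random OOD sample scores higher than a random
    in-distribution sample. Ties count as one half."""
    wins = 0.0
    for o in ood_scores:
        for i in in_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(in_scores) * len(ood_scores))


# Hypothetical detector scores (higher = more OOD); not taken from the paper.
clear_shift = auroc(in_scores=[0.1, 0.2, 0.3, 0.4],
                    ood_scores=[0.8, 0.9, 0.35, 1.0])
subtle_shift = auroc(in_scores=[0.1, 0.4, 0.6, 0.9],
                     ood_scores=[0.2, 0.3, 0.7, 0.8])

print(f"clear shift AUROC:  {clear_shift:.2f}")
print(f"subtle shift AUROC: {subtle_shift:.2f}")
```

In the first toy case the OOD scores sit almost entirely above the in-distribution scores, so the AUC is close to 1. In the second the two sets interleave, and the AUC falls to 0.5, the value expected from a random classifier.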
About the journal:
International Journal of Medical Informatics provides an international medium for dissemination of original results and interpretative reviews concerning the field of medical informatics. The Journal emphasizes the evaluation of systems in healthcare settings.
The scope of the journal covers:
Information systems, including national or international registration systems, hospital information systems, departmental and/or physician's office systems, document handling systems, electronic medical record systems, standardization, systems integration, etc.;
Computer-aided medical decision support systems using heuristic, algorithmic and/or statistical methods as exemplified in decision theory, protocol development, artificial intelligence, etc.;
Educational computer-based programs pertaining to medical informatics or medicine in general;
Organizational, economic, social, clinical impact, ethical and cost-benefit aspects of IT applications in health care.