Candelaria Mosquera, Luciana Ferrer, Diego H Milone, Daniel Luna, Enzo Ferrante
Class imbalance on medical image classification: towards better evaluation practices for discrimination and calibration performance.
European Radiology, pp. 7895-7903. Published 2024-12-01 (Epub 2024-06-11). Journal Article.
DOI: 10.1007/s00330-024-10834-0 (https://doi.org/10.1007/s00330-024-10834-0)
Abstract
Purpose: This work aims to assess standard evaluation practices used by the research community for evaluating medical imaging classifiers, with a specific focus on the implications of class imbalance. The analysis is performed on chest X-rays as a case study and encompasses a comprehensive model performance definition, considering both discriminative capabilities and model calibration.
Materials and methods: We conduct a concise literature review to examine prevailing scientific practices used when evaluating X-ray classifiers. Then, we perform a systematic experiment on two major chest X-ray datasets to showcase a didactic example of the behavior of several performance metrics under different class ratios and highlight how widely adopted metrics can conceal performance in the minority class.
Results: Our literature study confirms that: (1) even when dealing with highly imbalanced datasets, the community tends to use metrics that are dominated by the majority class; and (2) it is still uncommon to include calibration studies for chest X-ray classifiers, despite their importance in the context of healthcare. Moreover, our systematic experiments confirm that current evaluation practices may not reflect model performance in real clinical scenarios, and they suggest complementary metrics that better reflect the performance of the system in such scenarios.
Conclusion: Our analysis underscores the need for enhanced evaluation practices, particularly in the context of class-imbalanced chest X-ray classifiers. We recommend the inclusion of complementary metrics such as the area under the precision-recall curve (AUC-PR), adjusted AUC-PR, and balanced Brier score, to offer a more accurate depiction of system performance in real clinical scenarios, considering metrics that reflect both discrimination and calibration performance.
Clinical relevance statement: This study underscores the critical need for refined evaluation metrics in medical imaging classifiers, emphasizing that prevalent metrics may mask poor performance in minority classes, potentially impacting clinical diagnoses and healthcare outcomes.
Key points: Common scientific practices in papers dealing with X-ray computer-assisted diagnosis (CAD) systems may be misleading. We highlight limitations in reporting of evaluation metrics for X-ray CAD systems in highly imbalanced scenarios. We propose adopting alternative metrics based on experimental evaluation on large-scale datasets.
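As an illustration of the point made in the conclusion, the following is a minimal sketch, not the authors' experimental code. It compares majority-dominated metrics with their complementary counterparts on a toy set of scores while the number of negatives grows. Several things here are assumptions that do not come from the abstract: the scores in `POS_SCORES` and `NEG_SCORES` are invented, average precision is used as a common estimator of AUC-PR, and the balanced Brier score is taken to be the mean of the per-class Brier scores. The adjusted AUC-PR is left out because the abstract does not define it.

```python
"""Toy comparison of evaluation metrics as class imbalance grows (stdlib only)."""


def roc_auc(y_true, y_score):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_score):
    """Average precision: mean of the precision at the rank of each positive.

    Used here as an estimator of AUC-PR (an assumption). Ties between a
    positive and a negative score are not handled; the toy data has none.
    """
    order = sorted(range(len(y_true)), key=lambda i: -y_score[i])
    true_pos, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_pos += 1
            total += true_pos / rank
    return total / true_pos


def brier(y_true, y_score):
    """Standard Brier score: mean squared error over all samples."""
    return sum((s - y) ** 2 for y, s in zip(y_true, y_score)) / len(y_true)


def balanced_brier(y_true, y_score):
    """Mean of the per-class Brier scores (assumed definition)."""
    per_class = []
    for label in (0, 1):
        errors = [(s - y) ** 2 for y, s in zip(y_true, y_score) if y == label]
        per_class.append(sum(errors) / len(errors))
    return sum(per_class) / 2


# Hypothetical model outputs: the second positive is scored poorly.
POS_SCORES = [0.7, 0.3]
NEG_SCORES = [0.4, 0.2, 0.15, 0.1, 0.05]


def evaluate(neg_copies):
    """Repeat the negatives `neg_copies` times to make the data more imbalanced."""
    scores = POS_SCORES + NEG_SCORES * neg_copies
    labels = [1] * len(POS_SCORES) + [0] * (len(NEG_SCORES) * neg_copies)
    return {
        "auc_roc": roc_auc(labels, scores),
        "avg_precision": average_precision(labels, scores),
        "brier": brier(labels, scores),
        "balanced_brier": balanced_brier(labels, scores),
    }


if __name__ == "__main__":
    for copies in (1, 10):
        print(copies, {k: round(v, 4) for k, v in evaluate(copies).items()})
```

In this toy setting, going from 5 to 50 negatives leaves AUC-ROC at 0.9 and the balanced Brier score at 0.1685. Average precision falls from about 0.83 to about 0.58, and the standard Brier score improves from about 0.116 to about 0.056, even though the model is no better on the positive class. That is the kind of masking effect the abstract describes, shown here only on made-up numbers.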
About the journal:
European Radiology (ER) continuously updates scientific knowledge in radiology by publishing strong original articles and state-of-the-art reviews written by leading radiologists. A well-balanced combination of review articles, original papers, short communications from European radiological congresses and information on society matters makes ER an indispensable source for current information in this field.
This is the Journal of the European Society of Radiology, and the official journal of a number of societies.
From 2004 to 2008, supplements to European Radiology were published under its companion title, European Radiology Supplements, ISSN 1613-3749.