Class imbalance on medical image classification: towards better evaluation practices for discrimination and calibration performance.

European Radiology (IF 4.7; Zone 2, Medicine; JCR Q1, Radiology, Nuclear Medicine & Medical Imaging). Pub Date: 2024-12-01. Epub Date: 2024-06-11. DOI: 10.1007/s00330-024-10834-0
Candelaria Mosquera, Luciana Ferrer, Diego H Milone, Daniel Luna, Enzo Ferrante
Pages: 7895-7903. Publication type: Journal Article.

Abstract

Purpose: This work assesses the standard practices the research community uses to evaluate medical imaging classifiers, with a specific focus on the implications of class imbalance. The analysis uses chest X-rays as a case study and adopts a comprehensive definition of model performance that covers both discriminative capability and model calibration.

Materials and methods: We conduct a concise literature review to examine the prevailing practices used when evaluating X-ray classifiers. Then, we perform a systematic experiment on two major chest X-ray datasets to provide a didactic example of how several performance metrics behave under different class ratios and to highlight how widely adopted metrics can conceal performance in the minority class.
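The effect of class ratio on metric behavior can be illustrated with a small simulation. The sketch below is not the authors' experiment: it assumes hypothetical Gaussian score distributions for each class (the means, sample sizes, and the `simulate` helper are illustrative choices) and keeps them fixed while only the prevalence changes. Under these assumptions AUC-ROC stays roughly constant, whereas AUC-PR (computed here as average precision) drops as the positive class becomes rarer.

```python
import numpy as np


def auc_roc(y_true, y_score):
    """AUC-ROC from the Mann-Whitney rank statistic (tied scores share an average rank)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    _, inverse, counts = np.unique(y_score, return_inverse=True, return_counts=True)
    avg_rank = np.cumsum(counts) - (counts - 1) / 2.0  # 1-based average rank per unique score
    ranks = avg_rank[inverse]
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def auc_pr(y_true, y_score):
    """Average precision: mean precision at the rank of each positive (ties are not merged)."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score, dtype=float), kind="mergesort")
    hits = y_true[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return precision_at_k[hits == 1].mean()


def simulate(prevalence, n_neg=50_000, seed=0):
    """Draw scores from fixed per-class distributions; only the class ratio changes."""
    rng = np.random.default_rng(seed)
    n_pos = int(round(n_neg * prevalence / (1.0 - prevalence)))
    neg = rng.normal(0.0, 1.0, n_neg)  # hypothetical scores for negative images
    pos = rng.normal(1.5, 1.0, n_pos)  # hypothetical scores for positive images
    y_true = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
    y_score = np.concatenate([neg, pos])
    return auc_roc(y_true, y_score), auc_pr(y_true, y_score)


for prevalence in (0.5, 0.1, 0.02):
    roc, pr = simulate(prevalence)
    print(f"prevalence={prevalence:.2f}  AUC-ROC={roc:.3f}  AUC-PR={pr:.3f}")
```

Because AUC-ROC depends only on how positives rank against negatives, it is insensitive to the class ratio, while precision includes the number of false positives, which grows with the size of the majority class.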

Results: Our literature study confirms that: (1) even when dealing with highly imbalanced datasets, the community tends to use metrics that are dominated by the majority class; and (2) it is still uncommon to include calibration studies for chest X-ray classifiers, despite their importance in the context of healthcare. Moreover, our systematic experiments confirm that current evaluation practices may not reflect model performance in real clinical scenarios, and they suggest complementary metrics that better reflect the performance of the system in such scenarios.
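As an illustration of what a basic calibration check can look like, the sketch below computes a binned expected calibration error (ECE). The abstract does not say which calibration measures the reviewed papers or the authors use, so ECE, the bin count, and the simulated data are all assumptions made for this example. It compares calibrated probabilities with a monotonically inflated version of the same scores: the ranking, and therefore AUC-ROC, is identical, but the calibration differs.

```python
import numpy as np


def expected_calibration_error(y_true, p_pos, n_bins=10):
    """Binned ECE: weighted mean gap between observed frequency and mean predicted probability."""
    y_true = np.asarray(y_true, dtype=float)
    p_pos = np.asarray(p_pos, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Inner edges give bin indices 0 .. n_bins - 1; p = 1.0 falls in the last bin.
    bin_idx = np.clip(np.digitize(p_pos, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_idx == b
        if in_bin.any():
            gap = abs(y_true[in_bin].mean() - p_pos[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)


rng = np.random.default_rng(0)
p_true = rng.beta(0.5, 5.0, size=20_000)            # hypothetical true risk per image
y = (rng.random(p_true.size) < p_true).astype(int)  # labels drawn from that risk
p_inflated = np.sqrt(p_true)                        # same ranking, inflated probabilities

print(f"ECE, calibrated scores: {expected_calibration_error(y, p_true):.3f}")
print(f"ECE, inflated scores:   {expected_calibration_error(y, p_inflated):.3f}")
```

A discrimination metric alone cannot separate these two models, which is why a calibration measure is reported alongside it.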

Conclusion: Our analysis underscores the need for enhanced evaluation practices, particularly in the context of class-imbalanced chest X-ray classifiers. We recommend including complementary metrics such as the area under the precision-recall curve (AUC-PR), adjusted AUC-PR, and balanced Brier score, which together reflect both discrimination and calibration performance and offer a more accurate depiction of system performance in real clinical scenarios.
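The abstract names these metrics without defining them, so the sketch below relies on assumed definitions that may differ from the ones in the paper. Here `balanced_brier_score` is taken to be the unweighted mean of the per-class Brier scores, and `adjusted_auc_pr` is taken to be average precision with precision recomputed from the true and false positive rates at a chosen reference prevalence (`ref_prevalence` is a hypothetical parameter name).

```python
import numpy as np


def brier_score(y_true, p_pos):
    """Standard Brier score: mean squared error of the predicted probability."""
    y_true = np.asarray(y_true, dtype=float)
    p_pos = np.asarray(p_pos, dtype=float)
    return float(np.mean((p_pos - y_true) ** 2))


def balanced_brier_score(y_true, p_pos):
    """ASSUMED definition: unweighted mean of the Brier scores of each class."""
    y_true = np.asarray(y_true, dtype=float)
    p_pos = np.asarray(p_pos, dtype=float)
    sq_err = (p_pos - y_true) ** 2
    return float(0.5 * (sq_err[y_true == 1].mean() + sq_err[y_true == 0].mean()))


def adjusted_auc_pr(y_true, y_score, ref_prevalence=0.5):
    """ASSUMED definition: average precision with precision evaluated at a reference prevalence."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score, dtype=float), kind="mergesort")
    hits = y_true[order]
    n_pos = hits.sum()
    n_neg = len(hits) - n_pos
    tpr = np.cumsum(hits) / n_pos
    fpr = np.cumsum(1 - hits) / n_neg
    # Precision expressed through TPR and FPR, with the prevalence fixed to the reference value.
    precision = ref_prevalence * tpr / (ref_prevalence * tpr + (1.0 - ref_prevalence) * fpr)
    return float(precision[hits == 1].mean())


# A classifier that always predicts probability 0 on a test set with 1% positives.
y = np.array([0] * 990 + [1] * 10)
p = np.zeros(len(y))
print(f"Brier score:          {brier_score(y, p):.3f}")
print(f"Balanced Brier score: {balanced_brier_score(y, p):.3f}")
```

In this toy case the standard Brier score looks excellent because the majority class dominates the average, while the balanced variant exposes that every positive case is scored as badly as possible. When `ref_prevalence` equals the actual prevalence of the test set, `adjusted_auc_pr` reduces to ordinary average precision.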

Clinical relevance statement: This study underscores the critical need for refined evaluation metrics in medical imaging classifiers, emphasizing that prevalent metrics may mask poor performance in minority classes, potentially impacting clinical diagnoses and healthcare outcomes.

Key points: Common scientific practices in papers dealing with X-ray computer-assisted diagnosis (CAD) systems may be misleading. We highlight limitations in the reporting of evaluation metrics for X-ray CAD systems in highly imbalanced scenarios. We propose adopting alternative metrics, based on experimental evaluation on large-scale datasets.
