Multi-centre benchmarking of deep learning models for COVID-19 detection in chest x-rays

Frontiers in radiology (IF 2.3) | Pub Date: 2024-05-21 | DOI: 10.3389/fradi.2024.1386906

Rachael Harkness, A. F. Frangi, K. Zucker, Nishant Ravikumar


Citation count: 0

Abstract

This study is a retrospective evaluation of deep learning models developed to detect COVID-19 from chest x-rays, undertaken to assess the suitability of such systems as clinical decision support tools. Models were trained on the National COVID-19 Chest Imaging Database (NCCID), a UK-wide multi-centre dataset from 26 different NHS hospitals, and evaluated on independent multi-national clinical datasets. The evaluation considers clinical and technical contributors to model error and potential model bias. Model predictions are examined for spurious feature correlations using techniques for explainable prediction. Models performed adequately on NHS populations, with performance comparable to radiologists, but generalised poorly to international populations. Models performed better in males than in females, and performance varied across age groups. Alarmingly, models routinely failed when applied to complex clinical cases with confounding pathologies and when applied to radiologist-defined “mild” cases. This comprehensive benchmarking study examines the pitfalls in current practices that have led to impractical model development. Key findings highlight the need for clinician involvement at all stages of model development, from data curation and label definition to model evaluation, to ensure that all clinical factors and disease features are appropriately considered during model design. This is imperative to ensure that automated approaches developed for disease detection are fit for purpose in a clinical setting.
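The subgroup comparisons mentioned in the abstract (performance by sex and by age group) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `subgroup_metrics`, the record fields `label`, `pred`, and `sex`, the choice of sensitivity and specificity as metrics, and the toy data are all assumptions made purely for illustration.

```python
from collections import defaultdict


def subgroup_metrics(records, group_key):
    """Compute sensitivity and specificity separately for each subgroup.

    records: iterable of dicts holding 'label' (1 = COVID-19 positive),
    'pred' (1 = model predicts positive) and the grouping field.
    All field names here are hypothetical, not taken from the paper.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for r in records:
        c = counts[r[group_key]]
        if r["label"] == 1:
            c["tp" if r["pred"] == 1 else "fn"] += 1
        else:
            c["tn" if r["pred"] == 0 else "fp"] += 1

    results = {}
    for group, c in counts.items():
        pos = c["tp"] + c["fn"]
        neg = c["tn"] + c["fp"]
        results[group] = {
            "n": pos + neg,
            # None marks a metric that is undefined for this subgroup
            "sensitivity": c["tp"] / pos if pos else None,
            "specificity": c["tn"] / neg if neg else None,
        }
    return results


# Toy records invented for illustration only; not data from the study.
toy = [
    {"sex": "M", "label": 1, "pred": 1},
    {"sex": "M", "label": 1, "pred": 1},
    {"sex": "M", "label": 0, "pred": 0},
    {"sex": "M", "label": 0, "pred": 1},
    {"sex": "F", "label": 1, "pred": 1},
    {"sex": "F", "label": 1, "pred": 0},
    {"sex": "F", "label": 0, "pred": 0},
    {"sex": "F", "label": 0, "pred": 0},
]
by_sex = subgroup_metrics(toy, "sex")
```

Reporting each subgroup's sample size `n` alongside its metrics helps readers judge whether an apparent gap between groups could simply reflect a small subgroup.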


Source journal

Frontiers in radiology

CiteScore

1.20

Self-citation rate

0.00%

Number of articles published