On the differences between CNNs and vision transformers for COVID-19 diagnosis using CT and chest x-ray mono- and multimodality
Sara El-Ateif, Ali Idri, José Luis Fernández-Alemán
Data Technologies and Applications. Published 10 January 2024. DOI: 10.1108/dta-01-2023-0005
Abstract
Purpose
COVID-19 continues to spread and cause increasing numbers of deaths. Physicians diagnose COVID-19 using not only real-time polymerase chain reaction but also the computed tomography (CT) and chest x-ray (CXR) modalities, depending on the stage of infection. However, with so many patients and so few doctors, it has become difficult to keep pace with the disease. Deep learning models have been developed to assist in this respect, and vision transformers are the current state-of-the-art methods, but most existing techniques focus on only one modality (CXR).
Design/methodology/approach
This work aims to leverage the benefits of both CT and CXR to improve COVID-19 diagnosis. This paper studies the differences between the convolutional MobileNetV2, the ViT DeiT and the Swin Transformer models when trained from scratch and when pretrained on the MedNIST medical dataset rather than the ImageNet dataset of natural images. The comparison is made by reporting six performance metrics, the Scott–Knott Effect Size Difference test, the Wilcoxon statistical test and the Borda count method. We also use the Grad-CAM algorithm to study the models' interpretability. Finally, model robustness is tested by evaluating the models on images corrupted with Gaussian noise.
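A minimal sketch of the comparison protocol described above, not the authors' code: the three architectures are instantiated via the timm library and their accuracy is measured on clean inputs and on inputs corrupted with additive Gaussian noise. The specific timm model names, the noise level (sigma), the two-class head and the dummy batch are illustrative assumptions, not details taken from the paper.

```python
# Sketch of the model-comparison and noise-robustness protocol (assumptions:
# timm model variants, sigma=0.1, binary classification head, random data).
import torch
import timm

MODEL_NAMES = {
    "MobileNetV2": "mobilenetv2_100",
    "DeiT": "deit_tiny_patch16_224",
    "Swin": "swin_tiny_patch4_window7_224",
}

def build_model(timm_name: str, pretrained: bool, num_classes: int = 2):
    # pretrained=False corresponds to training from scratch; the paper's
    # MedNIST-pretraining variant would instead load MedNIST-trained weights.
    return timm.create_model(timm_name, pretrained=pretrained,
                             num_classes=num_classes)

@torch.no_grad()
def accuracy(model, images: torch.Tensor, labels: torch.Tensor,
             sigma: float = 0.0) -> float:
    # Optionally corrupt the batch with Gaussian noise to probe robustness.
    if sigma > 0:
        images = images + sigma * torch.randn_like(images)
    preds = model(images).argmax(dim=1)
    return (preds == labels).float().mean().item()

if __name__ == "__main__":
    # Dummy batch standing in for preprocessed 224x224 CT or CXR images.
    x = torch.randn(8, 3, 224, 224)
    y = torch.randint(0, 2, (8,))
    for name, timm_name in MODEL_NAMES.items():
        model = build_model(timm_name, pretrained=False).eval()
        print(f"{name}: clean={accuracy(model, x, y):.2f} "
              f"noisy={accuracy(model, x, y, sigma=0.1):.2f}")
```

In the paper's full protocol, this per-model evaluation would be repeated per modality (CT and CXR) and the resulting metric rankings aggregated with the Scott–Knott ESD, Wilcoxon and Borda count procedures.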
Findings
Although the pretrained MobileNetV2 achieved the best raw performance, the best model overall in terms of performance, interpretability and robustness to noise is the Swin Transformer trained from scratch, using the CXR (accuracy = 93.21 per cent) and CT (accuracy = 94.14 per cent) modalities.
Originality/value
The models compared are pretrained on MedNIST and leverage both the CT and CXR modalities.