Cluster analysis and ensemble transfer learning for COVID-19 classification from computed tomography scans

International Journal of Advances in Intelligent Informatics. Pub Date: 2022-07-31. DOI: 10.26555/ijain.v8i2.817

Lyubomir Gotsev, I. Mitkov, E. Kovatcheva, Boyan Jekov, R. Nikolov, E. Shoikova, Milena Petkova

Cluster analysis and ensemble transfer learning for COVID-19 classification from computed tomography scans

The paper presents a brief analysis of publications that use the public SARS-CoV-2 dataset, which consists of patients’ computed tomography (CT) scans collected from hospitals in Brazil, together with an experimental setup that addresses the data challenges identified. The analysis shows that all protocols, with one exception, suffer from data leakage because the dataset does not group images by patient. Each patient is represented by several scans, so data from the same individual may occur in both the training and test sets, which can produce misleading results. Furthermore, only one paper proposed ensemble learning, using VGG-16, ResNet50, and Xception as base models. We therefore proposed and experimented with the following strategy to mitigate the identified risks of bias: data standardization and normalization to achieve proper contrast and resolution; k-means and group shuffle split to avoid data leakage; and augmentation and ensemble transfer learning to deal with limited sample size and over-fitting. Compared with the earlier ensemble approach, the current one stacks VGG-16, DenseNet-201, and Inception v3 and achieves higher accuracy (99.3%), the second highest in the related work. Most significantly, it applies augmentation and clustering analysis to avoid overestimation. The paper also reports metrics that are critical in the medical domain: negative predictive value (99.55%), false positive rate (0.89%), false negative rate (0.42%), and false discovery rate (0.83%). The strategy has two main advantages: it reduces data pitfalls and decreases generalization error. It can serve as a baseline for improving performance and mitigating the risk of bias in the field.
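
The two ideas in the abstract that are easiest to misread are the group-aware split and the four medical metrics, so a minimal sketch may help. It is not the authors' code. It assumes each scan already carries a group label (hypothetically, a k-means cluster ID standing in for a patient ID), and it uses only the Python standard library to mimic the idea behind a group shuffle split such as scikit-learn's `GroupShuffleSplit`. The function names, the toy labels, and the confusion-matrix counts are all illustrative assumptions, not values from the paper.

```python
import random
from collections import defaultdict


def group_shuffle_split(groups, test_fraction=0.2, seed=0):
    """Split sample indices so that no group lands in both train and test.

    groups[i] is the group label of sample i (assumed here to be a
    cluster ID that stands in for a patient ID).
    """
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)

    unique_groups = sorted(by_group)
    if len(unique_groups) < 2:
        raise ValueError("need at least two groups to split")
    random.Random(seed).shuffle(unique_groups)

    # Hold out whole groups: at least one, and never all of them.
    n_test = max(1, round(test_fraction * len(unique_groups)))
    n_test = min(n_test, len(unique_groups) - 1)

    test = [i for g in unique_groups[:n_test] for i in by_group[g]]
    train = [i for g in unique_groups[n_test:] for i in by_group[g]]
    return sorted(train), sorted(test)


def clinical_metrics(tp, fp, tn, fn):
    """Rates derived from a binary confusion matrix (positive = COVID-19)."""
    return {
        "npv": tn / (tn + fn),  # negative predictive value
        "fpr": fp / (fp + tn),  # false positive rate
        "fnr": fn / (fn + tp),  # false negative rate
        "fdr": fp / (fp + tp),  # false discovery rate
    }


# Toy data: 10 scans belonging to 4 hypothetical groups.
groups = ["a", "a", "a", "b", "b", "c", "c", "c", "d", "d"]
train_idx, test_idx = group_shuffle_split(groups, test_fraction=0.25, seed=42)

train_groups = {groups[i] for i in train_idx}
test_groups = {groups[i] for i in test_idx}
print("groups shared by train and test:", train_groups & test_groups)

# Made-up counts, only to show how the four rates are computed.
print(clinical_metrics(tp=90, fp=5, tn=95, fn=10))
```

Because whole groups are held out, a scan can never share a group with a scan on the other side of the split. A per-image random split cannot guarantee this, which is the leakage the abstract describes.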
