Addressing label noise in leukemia image classification using small loss approach and pLOF with weighted-average ensemble

Egyptian Informatics Journal, Volume 26, Article 100479. Pub Date: 2024-06-01. Epub Date: 2024-05-07. DOI: 10.1016/j.eij.2024.100479. IF 4.3. Zone 3 (Computer Science). JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE.

Md. Tarek Aziz, S.M. Hasan Mahmud, Kah Ong Michael Goh, Dip Nandi

Citations: 0



Addressing label noise in leukemia image classification using small loss approach and pLOF with weighted-average ensemble

Machine learning (ML) and deep learning (DL) models have been extensively explored for the early diagnosis of various cancers, including leukemia, with many of them achieving performance comparable to that of human experts. However, challenges such as limited image data, inaccurate annotations, and unreliable predictions still hinder their broad adoption in trustworthy computer-aided diagnosis (CAD) systems. This paper introduces a novel weighted-average ensemble model for classifying Acute Lymphoblastic Leukemia, along with a reliable CAD system that combines the strengths of both ML and DL approaches. First, a variety of filtering methods are analyzed to determine the most suitable image representation, and data augmentation techniques are then applied to expand the training data. Second, a modified, fine-tuned VGG-19 model is used as a feature extractor to obtain meaningful features from the training samples. Third, a small-loss approach and the probabilistic local outlier factor (pLOF) are developed on the extracted features to address the label noise issue. Fourth, we propose a weighted-average ensemble model that uses the top five models as base learners, with weights calculated from their model uncertainty to ensure reliable predictions. Fifth, we calculate Shapley values, which are based on cooperative game theory, using SHAP and perform feature selection over different feature combinations to determine the optimal number of features. Finally, we integrate these strategies to develop an interpretable CAD system. This system not only predicts the disease but also generates Grad-CAM images to visualize potentially affected areas, enhancing both clarity and diagnostic insight. All of our code is provided in the following repository: https://github.com/taareek/leukemia-classification
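
The abstract names a small-loss approach and an uncertainty-weighted ensemble but gives no formulas, so the following is only a minimal sketch of the general ideas, not the authors' implementation. It assumes that "small loss" means keeping the samples with the lowest per-sample cross-entropy, and that "model uncertainty" can be approximated by mean predictive entropy, with weights proportional to its inverse. The function names, the `keep_fraction` parameter, and the entropy-based weighting are all hypothetical choices made for illustration; pLOF is not shown.

```python
import numpy as np

EPS = 1e-12  # guards against log(0) and division by zero


def small_loss_mask(probs, labels, keep_fraction=0.8):
    """Flag the samples with the smallest cross-entropy loss as likely clean.

    probs:  (n_samples, n_classes) predicted class probabilities
    labels: (n_samples,) integer labels, possibly noisy
    keep_fraction is an assumed hyperparameter, not a value from the paper.
    """
    losses = -np.log(probs[np.arange(len(labels)), labels] + EPS)
    n_keep = int(np.ceil(keep_fraction * len(labels)))
    keep_idx = np.argsort(losses)[:n_keep]
    mask = np.zeros(len(labels), dtype=bool)
    mask[keep_idx] = True
    return mask, losses


def uncertainty_weights(member_probs):
    """Give each base learner a weight that shrinks as its uncertainty grows.

    member_probs: (n_models, n_samples, n_classes)
    Uncertainty is approximated here by mean predictive entropy (an assumption).
    """
    entropy = -np.sum(member_probs * np.log(member_probs + EPS), axis=2)
    mean_entropy = entropy.mean(axis=1)          # one value per model
    inverse = 1.0 / (mean_entropy + EPS)
    return inverse / inverse.sum()               # weights sum to 1


def weighted_average_ensemble(member_probs, weights):
    """Combine base-learner probabilities into (n_samples, n_classes)."""
    return np.tensordot(weights, member_probs, axes=1)


if __name__ == "__main__":
    probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
    labels = np.array([0, 1, 1, 0])              # last two disagree with the model
    mask, _ = small_loss_mask(probs, labels, keep_fraction=0.5)
    print("kept as clean:", mask)

    members = np.array([
        [[0.9, 0.1], [0.1, 0.9]],                # confident base learner
        [[0.5, 0.5], [0.5, 0.5]],                # maximally uncertain base learner
    ])
    w = uncertainty_weights(members)
    print("weights:", w)
    print("ensemble:", weighted_average_ensemble(members, w))
```

In this toy run the two samples whose labels disagree with a confident prediction have the largest losses and are filtered out, and the confident base learner receives the larger ensemble weight. Other uncertainty estimates could replace entropy without changing the overall structure.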


Source journal

Egyptian Informatics Journal (Decision Sciences: Management Science and Operations Research)

CiteScore: 11.10

Self-citation rate: 1.90%

Articles published: (not listed)

Review time: 110 days

About the journal: The Egyptian Informatics Journal is published by the Faculty of Computers and Artificial Intelligence, Cairo University. The Journal provides a forum for state-of-the-art research and development in the fields of computing, including computer sciences, information technologies, information systems, operations research and decision support. It encourages the submission of innovative, previously unpublished work in the subjects it covers, whether from academic, research or commercial sources.