Md. Tarek Aziz, S.M. Hasan Mahmud, Kah Ong Michael Goh, Dip Nandi
Title: Addressing label noise in leukemia image classification using small loss approach and pLOF with weighted-average ensemble
Journal: Egyptian Informatics Journal (impact factor 5.0; JCR Q1, Computer Science, Artificial Intelligence)
DOI: 10.1016/j.eij.2024.100479
Published: 2024-05-07
Article page: https://www.sciencedirect.com/science/article/pii/S1110866524000422
Citations: 0
Abstract
Machine learning (ML) and deep learning (DL) models have been extensively explored for the early diagnosis of various cancers, including leukemia, with many achieving performance comparable to that of human experts. However, challenges such as limited image data, inaccurate annotations, and unreliable predictions still hinder their broad adoption in trustworthy computer-aided diagnosis (CAD) systems. This paper introduces a novel weighted-average ensemble model for classifying acute lymphoblastic leukemia (ALL), along with a reliable CAD system that combines the strengths of ML and DL approaches. First, a variety of filtering methods are analyzed to determine the most suitable image representation, and data augmentation is then applied to expand the training data. Second, a modified, fine-tuned VGG-19 model is used as a feature extractor to obtain meaningful features from the training samples. Third, a small-loss approach and a probabilistic local outlier factor (pLOF) are developed on the extracted features to address label noise. Fourth, we propose a weighted-average ensemble built on the top five models as base learners, with weights calculated from their model uncertainty to ensure reliable predictions. Fifth, we calculate Shapley values, which are grounded in cooperative game theory, using SHAP and perform feature selection over different feature combinations to determine the optimal number of features. Finally, we integrate these strategies into an interpretable CAD system. This system not only predicts the disease but also generates Grad-CAM images to visualize potentially affected areas, enhancing both clarity and diagnostic insight. All of our code is available in the following repository: https://github.com/taareek/leukemia-classification
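The abstract does not give the formulas behind the small-loss filter or the uncertainty-based ensemble weights, so the following is only a minimal sketch of how those two steps could look, not the authors' implementation. It assumes that "small loss" means keeping the samples with the lowest cross-entropy loss, and that model uncertainty is measured as mean predictive entropy with weights inversely proportional to it. All function names and the `keep_ratio` parameter are hypothetical; the pLOF and SHAP steps are not covered.

```python
import math

EPS = 1e-12  # guards against log(0) and division by zero


def small_loss_indices(probs, labels, keep_ratio=0.8):
    """Return the indices of the keep_ratio fraction of samples with the
    smallest cross-entropy loss (assumption: high-loss samples are the
    ones most likely to carry a noisy label)."""
    losses = [-math.log(p[y] + EPS) for p, y in zip(probs, labels)]
    n_keep = math.ceil(keep_ratio * len(labels))
    ranked = sorted(range(len(labels)), key=lambda i: losses[i])
    return sorted(ranked[:n_keep])


def mean_entropy(probs):
    """Average predictive entropy of one model over a set of samples."""
    per_sample = [-sum(p * math.log(p + EPS) for p in row) for row in probs]
    return sum(per_sample) / len(per_sample)


def uncertainty_weights(model_probs):
    """Assumed weighting rule: inverse mean entropy, normalized to sum to 1,
    so that less uncertain base learners get larger weights."""
    inverse = [1.0 / (mean_entropy(p) + EPS) for p in model_probs]
    total = sum(inverse)
    return [v / total for v in inverse]


def weighted_average(model_probs, weights):
    """Combine per-model class probabilities into one weighted average."""
    n_samples = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    return [
        [
            sum(w * probs[i][c] for w, probs in zip(weights, model_probs))
            for c in range(n_classes)
        ]
        for i in range(n_samples)
    ]
```

In this sketch, each element of `model_probs` is one base learner's list of per-sample class-probability rows, for example the softmax outputs of the top five models on a held-out set. Because the weights sum to 1, each ensemble row remains a valid probability distribution.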
About the journal:
The Egyptian Informatics Journal is published by the Faculty of Computers and Artificial Intelligence, Cairo University. The Journal provides a forum for state-of-the-art research and development in the fields of computing, including computer sciences, information technologies, information systems, operations research and decision support. It encourages the submission of innovative, previously unpublished work in the subjects it covers, whether from academic, research or commercial sources.