Semantically redundant training data removal and deep model classification performance: A study with chest X-rays
Authors: Sivaramakrishnan Rajaraman, Ghada Zamzmi, Feng Yang, Zhaohui Liang, Zhiyun Xue, Sameer Antani
Journal: Computerized Medical Imaging and Graphics, volume 115, Article 102379
Publication date: 2024-04-09
DOI: 10.1016/j.compmedimag.2024.102379
URL: https://www.sciencedirect.com/science/article/pii/S0895611124000569
Citations: 0
Abstract
Deep learning (DL) has demonstrated the capacity to independently learn hierarchical features from complex, multi-dimensional data. It is commonly understood that DL performance scales with the amount of training data. However, the data must also exhibit variety to enable improved learning. In medical imaging data, semantic redundancy, which is the presence of similar or repetitive information, can occur when multiple images have highly similar presentations for the disease of interest. Also, the common use of augmentation methods to generate variety in DL training could limit performance when applied indiscriminately to such data. We therefore hypothesize that semantic redundancy tends to lower performance and limit generalizability to unseen data, and we question its impact on classifier performance even with large data. We propose an entropy-based sample scoring approach to identify and remove semantically redundant training data. Using the publicly available NIH chest X-ray dataset, we demonstrate that the model trained on the resulting informative subset of training data significantly outperforms the model trained on the full training set during both internal testing (recall: 0.7164 vs 0.6597, p<0.05) and external testing (recall: 0.3185 vs 0.2589, p<0.05). Our findings emphasize the importance of information-oriented training sample selection as opposed to the conventional practice of using all available training data.
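The abstract does not say how the entropy-based score is computed, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes each training sample is scored by the Shannon entropy of a classifier's predicted class probabilities, and that low-entropy (confidently predicted) samples are treated as redundant. The function names, the `keep_fraction` parameter, and the toy probabilities are all hypothetical.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_informative(sample_probs, keep_fraction=0.5):
    """Return indices of the highest-entropy samples, in original order.

    sample_probs: list of per-sample class-probability lists.
    keep_fraction: hypothetical hyperparameter, not taken from the paper.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    scores = [prediction_entropy(p) for p in sample_probs]
    n_keep = max(1, math.ceil(keep_fraction * len(scores)))
    # Rank sample indices from most to least uncertain.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy predictions for four training images (made-up values).
probs = [
    [0.99, 0.01],  # confidently predicted, assumed redundant
    [0.55, 0.45],  # uncertain, assumed informative
    [0.98, 0.02],
    [0.70, 0.30],
]
kept = select_informative(probs, keep_fraction=0.5)
print(kept)
```

In this sketch the retained indices would define the informative training subset. How the scores are actually derived and where the cutoff sits are details that have to come from the full paper.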
About the journal:
The purpose of the journal Computerized Medical Imaging and Graphics is to act as a source for the exchange of research results concerning algorithmic advances, development, and application of digital imaging in disease detection, diagnosis, intervention, prevention, precision medicine, and population health. Included in the journal will be articles on novel computerized imaging or visualization techniques, including artificial intelligence and machine learning, augmented reality for surgical planning and guidance, big biomedical data visualization, computer-aided diagnosis, computerized-robotic surgery, image-guided therapy, imaging scanning and reconstruction, mobile and tele-imaging, radiomics, and imaging integration and modeling with other information relevant to digital health. The types of biomedical imaging include: magnetic resonance, computed tomography, ultrasound, nuclear medicine, X-ray, microwave, optical and multi-photon microscopy, video and sensory imaging, and the convergence of biomedical images with other non-imaging datasets.