Boosting accuracy of student models via Masked Adaptive Self-Distillation

IF 6.5 · JCR Q1 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE) · CAS Region 2 (Computer Science) · Neurocomputing · Pub Date: 2025-07-07 · Epub Date: 2025-03-26 · DOI: 10.1016/j.neucom.2025.129988
Haoran Zhao, Shuwen Tian, Jinlong Wang, Zhaopeng Deng, Xin Sun, Junyu Dong
{"title":"Boosting accuracy of student models via Masked Adaptive Self-Distillation","authors":"Haoran Zhao ,&nbsp;Shuwen Tian ,&nbsp;Jinlong Wang ,&nbsp;Zhaopeng Deng ,&nbsp;Xin Sun ,&nbsp;Junyu Dong","doi":"10.1016/j.neucom.2025.129988","DOIUrl":null,"url":null,"abstract":"<div><div>Knowledge distillation (KD) has achieved impressive success, yet conventional KD approaches are time-consuming and computationally costly. In contrast, self-distillation methods provide an efficient alternative. However, existing self-distillation methods mostly suffer from information redundancy due to the same network architecture from the teacher and student models. Additionally, they simultaneously face the inherent limitation of lacking a high-capacity teacher model. To cope with the above challenges, we propose a novel and efficient method named Masked Adaptive Self-Distillation (MASD). Specifically, we first introduce the Mask Generation Module, which masks random pixels of the feature maps and force it to reconstruct and refine more valuable features on different layers. Moreover, the Adaptive Weighting Mechanism is designed to dynamically adjust and optimize the weights of supervisory signals utilizing the probabilities from the mutual masked supervisory signals, thereby compensating the absence of high-capacity teacher model. We demonstrate the effectiveness of our MASD method on conventional image classification datasets and fine-grained datasets using state-of-the-art CNN architectures, and show that MASD significantly enhances the generalization of various backbone networks. For instance, on the CIFAR-100 classification benchmark, the proposed MASD method achieves an accuracy of 80.40% with the ResNet-18 architecture, surpassing the baseline with a 4.16% margin in Top-1 accuracy.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"637 ","pages":"Article 129988"},"PeriodicalIF":6.5000,"publicationDate":"2025-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231225006605","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/3/26 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Knowledge distillation (KD) has achieved impressive success, yet conventional KD approaches are time-consuming and computationally costly. Self-distillation methods provide an efficient alternative. However, most existing self-distillation methods suffer from information redundancy because the teacher and student models share the same network architecture, and they face the inherent limitation of lacking a high-capacity teacher model. To cope with these challenges, we propose a novel and efficient method named Masked Adaptive Self-Distillation (MASD). Specifically, we first introduce the Mask Generation Module, which masks random pixels of the feature maps and forces the network to reconstruct and refine more valuable features at different layers. Moreover, an Adaptive Weighting Mechanism is designed to dynamically adjust and optimize the weights of the supervisory signals using the probabilities derived from the mutually masked supervisory signals, thereby compensating for the absence of a high-capacity teacher model. We demonstrate the effectiveness of MASD on conventional image classification datasets and fine-grained datasets using state-of-the-art CNN architectures, and show that it significantly enhances the generalization of various backbone networks. For instance, on the CIFAR-100 classification benchmark, MASD achieves 80.40% Top-1 accuracy with the ResNet-18 architecture, surpassing the baseline by a margin of 4.16%.
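
The abstract describes the two components only at a high level. As a rough illustration (not the authors' released implementation), the following PyTorch sketch shows what random pixel masking of a feature map and probability-based weighting of two mutually masked supervisory signals could look like; the function names, the mask ratio, and the exact weighting rule are assumptions made for illustration.

    import torch
    import torch.nn.functional as F

    def mask_feature_map(feat: torch.Tensor, mask_ratio: float = 0.5) -> torch.Tensor:
        """Zero out a random subset of spatial positions in a (B, C, H, W) feature
        map, so a later branch is forced to reconstruct the missing information.
        The 0.5 default ratio is an assumption, not a value from the paper."""
        b, _, h, w = feat.shape
        keep = (torch.rand(b, 1, h, w, device=feat.device) > mask_ratio).float()
        return feat * keep  # broadcast the (B, 1, H, W) mask over all channels

    def adaptive_weights(logits_a: torch.Tensor, logits_b: torch.Tensor,
                         labels: torch.Tensor) -> torch.Tensor:
        """One plausible weighting rule: weight each of two supervisory signals
        by the probability its branch assigns to the ground-truth class,
        normalized per sample so the two weights sum to one."""
        p_a = F.softmax(logits_a, dim=1).gather(1, labels.unsqueeze(1))  # (B, 1)
        p_b = F.softmax(logits_b, dim=1).gather(1, labels.unsqueeze(1))  # (B, 1)
        w = torch.cat([p_a, p_b], dim=1)                                 # (B, 2)
        return w / w.sum(dim=1, keepdim=True)

For example, mask_feature_map(torch.randn(8, 64, 32, 32)) returns a feature map of the same shape with roughly half of its spatial positions zeroed across all channels, which is the kind of corruption a reconstruction branch would then be trained to undo.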
Source journal

Neurocomputing (Engineering & Technology – Computer Science: Artificial Intelligence)
CiteScore: 13.10
Self-citation rate: 10.00%
Articles published per year: 1382
Review time: 70 days

Journal introduction: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.