Title: Improving ViT interpretability with patch-level mask prediction
Authors: Junyong Kang, Byeongho Heo, Junsuk Choe
DOI: 10.1016/j.patrec.2024.11.018
Journal: Pattern Recognition Letters, Volume 187, Pages 73-79
Publication date: 2024-11-22 (Journal Article)
Impact Factor: 3.9 (JCR Q2, Computer Science, Artificial Intelligence)
URL: https://www.sciencedirect.com/science/article/pii/S0167865524003246
Citations: 0
Abstract
Vision Transformers (ViTs) have demonstrated remarkable performance on various computer vision tasks. Attention scores are often used to explain the decision-making process of ViTs, showing which tokens are more important than others. However, attention scores have several limitations as explanations for ViTs, such as conflicting with other explainability methods or highlighting unrelated tokens. To address these limitations, we propose a novel method for generating a visual explanation map from ViTs. Unlike previous approaches that rely on attention scores, our method leverages ViT features and conducts a single forward pass through our Patch-level Mask prediction (PM) module. Our visual explanation map provides a class-dependent and probabilistic interpretation that can identify regions crucial to model decisions. Experimental results demonstrate that our approach outperforms previous techniques in both classification and interpretability. Additionally, it can be applied to weakly-supervised object localization (WSOL) tasks using pseudo mask labels. Our method requires no extra parameters and needs minimal locality supervision, utilizing less than 1% of the ImageNet-1k training dataset.
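The abstract only describes the PM module at a high level (a single forward pass over ViT patch features yielding a class-dependent, probabilistic map), so the following is a minimal sketch under assumed details: a hypothetical per-class linear head over the patch tokens (CLS token excluded), with a sigmoid giving per-patch mask probabilities on the 14×14 grid of a ViT-B/16 at 224×224 input. The function and weight names are illustrative, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def patch_mask_map(patch_features, class_weights, target_class, grid=14):
    """Sketch of a patch-level mask prediction head (assumed form, not the paper's).

    patch_features: (N, D) ViT patch tokens, CLS excluded (N = grid * grid).
    class_weights:  (C, D) hypothetical per-class predictor weights.
    Returns a (grid, grid) map of per-patch probabilities for target_class.
    """
    logits = patch_features @ class_weights[target_class]  # one pass: (N,)
    probs = sigmoid(logits)                                # probabilistic interpretation
    return probs.reshape(grid, grid)

# Toy inputs standing in for real ViT-B/16 features (196 patches, 768 dims).
rng = np.random.default_rng(0)
feats = rng.standard_normal((196, 768))
weights = rng.standard_normal((1000, 768)) * 0.01
explanation = patch_mask_map(feats, weights, target_class=3)
print(explanation.shape)  # (14, 14)
```

Thresholding such a map would give the pseudo mask labels the abstract mentions for the WSOL setting; the upsampling back to image resolution and the training of the head are beyond what the abstract specifies.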
Journal information:
Pattern Recognition Letters aims at rapid publication of concise articles of a broad interest in pattern recognition.
Subject areas include all the current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, and other developing themes involving learning and recognition.