Improving ViT interpretability with patch-level mask prediction

Pattern Recognition Letters · Published: 2024-11-22 · DOI: 10.1016/j.patrec.2024.11.018 · Impact Factor: 3.9 · CAS Tier 3 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence)
Junyong Kang, Byeongho Heo, Junsuk Choe
{"title":"Improving ViT interpretability with patch-level mask prediction","authors":"Junyong Kang ,&nbsp;Byeongho Heo ,&nbsp;Junsuk Choe","doi":"10.1016/j.patrec.2024.11.018","DOIUrl":null,"url":null,"abstract":"<div><div>Vision Transformers (ViTs) have demonstrated remarkable performances on various computer vision tasks. Attention scores are often used to explain the decision-making process of ViTs, showing which tokens are more important than others. However, the attention scores have several limitations as an explanation for ViT, such as conflicting with other explainable methods or highlighting unrelated tokens. In order to address this limitation, we propose a novel method for generating a visual explanation map from ViTs. Unlike previous approaches that rely on attention scores, our method leverages ViT features and conducts a single forward pass through our Patch-level Mask prediction (PM) module. Our visual explanation map provides class-dependent and probabilistic interpretation that can identify crucial regions of model decisions. Experimental results demonstrate that our approach outperforms previous techniques in both classification and interpretability aspects. Additionally, it can be applied to the weakly-supervised object localization (WSOL) tasks using pseudo mask labels. Our method requires no extra parameters and necessitates minimal locality supervision, utilizing less than 1% of the ImageNet-1k training dataset.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"187 ","pages":"Pages 73-79"},"PeriodicalIF":3.9000,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition Letters","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167865524003246","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Vision Transformers (ViTs) have demonstrated remarkable performance on various computer vision tasks. Attention scores are often used to explain the decision-making process of ViTs, showing which tokens are more important than others. However, attention scores have several limitations as explanations for ViTs, such as conflicting with other explainability methods or highlighting unrelated tokens. To address these limitations, we propose a novel method for generating a visual explanation map from ViTs. Unlike previous approaches that rely on attention scores, our method leverages ViT features and conducts a single forward pass through our Patch-level Mask prediction (PM) module. Our visual explanation map provides a class-dependent and probabilistic interpretation that can identify regions crucial to the model's decisions. Experimental results demonstrate that our approach outperforms previous techniques in both classification and interpretability. Additionally, it can be applied to weakly-supervised object localization (WSOL) tasks using pseudo mask labels. Our method requires no extra parameters and needs minimal locality supervision, utilizing less than 1% of the ImageNet-1k training dataset.
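The abstract only sketches the architecture, so the following is a minimal, hypothetical illustration of the general idea: producing a class-dependent, probabilistic per-patch map from ViT features in a single forward pass, without adding parameters. It is not the authors' PM module (which, per the abstract, is trained with pseudo mask labels and minimal locality supervision); here we simply reuse a pretrained ViT's existing classification head on each patch token. The model name, the timm usage, and the sigmoid readout are all assumptions made for this sketch.

```python
# Hypothetical sketch, NOT the paper's PM module: reuse a pretrained ViT's
# classification head on patch tokens to get a per-patch, class-dependent,
# probabilistic map in one forward pass (assumes a recent timm release where
# forward_features returns the full token sequence).
import torch
import timm

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

@torch.no_grad()
def patch_explanation_map(images: torch.Tensor, target_class: int) -> torch.Tensor:
    """images: (B, 3, 224, 224) float tensor -> (B, 14, 14) per-patch probabilities."""
    tokens = model.forward_features(images)           # (B, 1 + 196, D): [CLS] + patches
    patches = tokens[:, model.num_prefix_tokens:, :]  # drop [CLS]/prefix tokens
    logits = model.head(patches)                      # per-patch class logits (no new params)
    probs = torch.sigmoid(logits[..., target_class])  # probabilistic, class-dependent scores
    side = int(probs.shape[1] ** 0.5)                 # 196 patches -> 14 x 14 grid
    return probs.reshape(-1, side, side)
```

Upsampling the 14×14 map to the input resolution (e.g., with torch.nn.functional.interpolate) yields a heatmap that can be overlaid on the image, or thresholded into a pseudo mask for WSOL-style localization as the abstract suggests.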
Source journal: Pattern Recognition Letters (Engineering & Technology – Computer Science: Artificial Intelligence)
CiteScore: 12.40
Self-citation rate: 5.90%
Articles per year: 287
Review time: 9.1 months
About the journal: Pattern Recognition Letters aims at rapid publication of concise articles of broad interest in pattern recognition. Subject areas include all the current fields of interest represented by the Technical Committees of the International Association for Pattern Recognition, and other developing themes involving learning and recognition.