Explainability Enhanced Object Detection Transformer With Feature Disentanglement

Wenlong Yu;Ruonan Liu;Dongyue Chen;Qinghua Hu
DOI: 10.1109/TIP.2024.3492733
Journal: IEEE Transactions on Image Processing, vol. 33, pp. 6439-6454
Published: 2024-11-12
URL: https://ieeexplore.ieee.org/document/10751766/

Abstract

Explainability is a pivotal factor in determining whether a deep learning model can be authorized for critical applications. To enhance the explainability of end-to-end object DEtection with TRansformer (DETR) models, we introduce a disentanglement method that constrains the feature learning process, following a divide-and-conquer decoupling paradigm similar to how people break down complex real-world problems. We first demonstrate the entangled property of the features between the extractor and the detector, and find that the regression function is a key factor contributing to the deterioration of disentangled feature activation. These highly entangled features tend to activate only local characteristics, making it difficult to cover the semantic information of an object, which also reduces the interpretability of single-backbone object detection models. We therefore propose an Explainability Enhanced object detection Transformer with feature Disentanglement (DETD) model, in which Tensor Singular Value Decomposition (T-SVD) is used to produce feature bases and a Batch-averaged Feature Spectral Penalization (BFSP) loss is introduced to constrain feature disentanglement and balance semantic activation. The proposed method is applied across three prominent backbones, two DETR variants, and a CNN-based model. Combining two optimization techniques, extensive experiments on two datasets consistently demonstrate that the DETD model outperforms its counterparts in both object detection performance and feature disentanglement. Grad-CAM visualizations confirm the improved explainability of feature learning from the disentanglement perspective.
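To give a rough intuition for the spectral-penalization idea described above, the sketch below is a simplified, hypothetical stand-in: it uses a plain matrix SVD on a batch-averaged 2-D feature map rather than the paper's tensor T-SVD, and the function name `bfsp_loss` and the top-`k` energy formulation are illustrative assumptions, not the authors' actual loss.

```python
import numpy as np

def bfsp_loss(features, k=4):
    """Hypothetical sketch of a batch-averaged feature spectral
    penalization: average feature maps over the batch, take an SVD,
    and penalize singular-value energy outside the top-k components,
    encouraging activations to concentrate on a few feature bases.
    (Simplified stand-in for the paper's T-SVD-based BFSP loss.)"""
    # features: (batch, height, width) feature maps -- illustrative shape
    mean_map = features.mean(axis=0)                 # batch-averaged map
    s = np.linalg.svd(mean_map, compute_uv=False)    # singular values, descending
    energy = s ** 2
    tail = energy[k:].sum()                          # energy beyond the top-k bases
    return tail / (energy.sum() + 1e-12)             # ~0 when effective rank <= k
```

A feature map that is already well approximated by a few bases (low effective rank) incurs near-zero penalty, while a diffuse, full-rank map is penalized, which matches the abstract's goal of constraining entangled activations.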