Quantifying and Learning Static vs. Dynamic Information in Deep Spatiotemporal Networks

Matthew Kowal, Mennatullah Siam, Md Amirul Islam, Neil D. B. Bruce, Richard P. Wildes, Konstantinos G. Derpanis
{"title":"Quantifying and Learning Static vs. Dynamic Information in Deep Spatiotemporal Networks","authors":"Matthew Kowal;Mennatullah Siam;Md Amirul Islam;Neil D. B. Bruce;Richard P. Wildes;Konstantinos G. Derpanis","doi":"10.1109/TPAMI.2024.3462291","DOIUrl":null,"url":null,"abstract":"There is limited understanding of the information captured by deep spatiotemporal models in their intermediate representations. For example, while evidence suggests that action recognition algorithms are heavily influenced by visual appearance in single frames, no quantitative methodology exists for evaluating such static bias in the latent representation compared to bias toward dynamics. We tackle this challenge by proposing an approach for quantifying the static and dynamic biases of any spatiotemporal model, and apply our approach to three tasks, action recognition, automatic video object segmentation (AVOS) and video instance segmentation (VIS). Our key findings are: (i) Most examined models are biased toward static information. (ii) Some datasets that are assumed to be biased toward dynamics are actually biased toward static information. (iii) Individual channels in an architecture can be biased toward static, dynamic or jointly encode a combination static and dynamic information. (iv) Most models converge to their culminating biases in the first half of training. We then explore how these biases affect performance on dynamically biased datasets. For action recognition, we propose StaticDropout, a semantically guided dropout that debiases a model from static information toward dynamics. For AVOS, we design a better combination of fusion and cross connection layers compared with previous architectures.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 1","pages":"190-205"},"PeriodicalIF":18.6000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10682100/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

There is limited understanding of the information captured by deep spatiotemporal models in their intermediate representations. For example, while evidence suggests that action recognition algorithms are heavily influenced by visual appearance in single frames, no quantitative methodology exists for evaluating such static bias in the latent representation compared to bias toward dynamics. We tackle this challenge by proposing an approach for quantifying the static and dynamic biases of any spatiotemporal model, and apply our approach to three tasks: action recognition, automatic video object segmentation (AVOS), and video instance segmentation (VIS). Our key findings are: (i) Most examined models are biased toward static information. (ii) Some datasets that are assumed to be biased toward dynamics are actually biased toward static information. (iii) Individual channels in an architecture can be biased toward static information, biased toward dynamic information, or jointly encode a combination of the two. (iv) Most models converge to their culminating biases in the first half of training. We then explore how these biases affect performance on dynamically biased datasets. For action recognition, we propose StaticDropout, a semantically guided dropout that debiases a model from static information toward dynamics. For AVOS, we design a better combination of fusion and cross connection layers compared with previous architectures.
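To make the StaticDropout idea concrete, below is a minimal PyTorch sketch. It assumes a precomputed per-channel static-bias score (here a placeholder random tensor) produced by some bias-quantification step, and zeroes out the most static-biased channels during training. The class name, the `drop_fraction` parameter, and the scoring tensor are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Hypothetical sketch of a StaticDropout-style layer: drop the channels
# most biased toward static information so training must rely on dynamics.
import torch
import torch.nn as nn


class StaticDropout(nn.Module):
    """Zeroes the most static-biased channels during training.

    `static_bias` is an assumed per-channel score (higher = more static-
    biased), estimated beforehand by a bias-quantification procedure.
    """

    def __init__(self, static_bias: torch.Tensor, drop_fraction: float = 0.25):
        super().__init__()
        self.register_buffer("static_bias", static_bias)
        # Drop the top `drop_fraction` most static-biased channels:
        # the threshold is the k-th largest bias score.
        k = max(1, int(drop_fraction * static_bias.numel()))
        threshold = torch.topk(static_bias, k).values.min()
        self.register_buffer("keep_mask", (static_bias < threshold).float())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width) features from a
        # spatiotemporal backbone. Like standard dropout, the layer is
        # an identity at inference time.
        if not self.training:
            return x
        mask = self.keep_mask.view(1, -1, 1, 1, 1)
        # Rescale surviving channels to preserve expected magnitude.
        scale = self.keep_mask.numel() / self.keep_mask.sum().clamp(min=1.0)
        return x * mask * scale


# Usage: a backbone stage with 256 channels and placeholder bias scores.
scores = torch.rand(256)                # stand-in for estimated bias scores
layer = StaticDropout(scores, drop_fraction=0.25)
feats = torch.randn(2, 256, 8, 14, 14)  # (B, C, T, H, W)
layer.train()
out = layer(feats)
```

As with standard dropout, the surviving channels are rescaled during training so their expected magnitude matches inference, where the layer passes features through unchanged.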