Quantifying and Learning Static vs. Dynamic Information in Deep Spatiotemporal Networks

Matthew Kowal, Mennatullah Siam, Md Amirul Islam, Neil D. B. Bruce, Richard P. Wildes, Konstantinos G. Derpanis
{"title":"Quantifying and Learning Static vs. Dynamic Information in Deep Spatiotemporal Networks","authors":"Matthew Kowal;Mennatullah Siam;Md Amirul Islam;Neil D. B. Bruce;Richard P. Wildes;Konstantinos G. Derpanis","doi":"10.1109/TPAMI.2024.3462291","DOIUrl":null,"url":null,"abstract":"There is limited understanding of the information captured by deep spatiotemporal models in their intermediate representations. For example, while evidence suggests that action recognition algorithms are heavily influenced by visual appearance in single frames, no quantitative methodology exists for evaluating such static bias in the latent representation compared to bias toward dynamics. We tackle this challenge by proposing an approach for quantifying the static and dynamic biases of any spatiotemporal model, and apply our approach to three tasks, action recognition, automatic video object segmentation (AVOS) and video instance segmentation (VIS). Our key findings are: (i) Most examined models are biased toward static information. (ii) Some datasets that are assumed to be biased toward dynamics are actually biased toward static information. (iii) Individual channels in an architecture can be biased toward static, dynamic or jointly encode a combination static and dynamic information. (iv) Most models converge to their culminating biases in the first half of training. We then explore how these biases affect performance on dynamically biased datasets. For action recognition, we propose StaticDropout, a semantically guided dropout that debiases a model from static information toward dynamics. For AVOS, we design a better combination of fusion and cross connection layers compared with previous architectures.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 1","pages":"190-205"},"PeriodicalIF":18.6000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10682100/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

There is limited understanding of the information captured by deep spatiotemporal models in their intermediate representations. For example, while evidence suggests that action recognition algorithms are heavily influenced by visual appearance in single frames, no quantitative methodology exists for evaluating such static bias in the latent representation compared to bias toward dynamics. We tackle this challenge by proposing an approach for quantifying the static and dynamic biases of any spatiotemporal model, and apply our approach to three tasks: action recognition, automatic video object segmentation (AVOS), and video instance segmentation (VIS). Our key findings are: (i) Most examined models are biased toward static information. (ii) Some datasets that are assumed to be biased toward dynamics are actually biased toward static information. (iii) Individual channels in an architecture can be biased toward static information, biased toward dynamic information, or jointly encode a combination of the two. (iv) Most models converge to their culminating biases in the first half of training. We then explore how these biases affect performance on dynamically biased datasets. For action recognition, we propose StaticDropout, a semantically guided dropout that debiases a model from static information toward dynamics. For AVOS, we design a better combination of fusion and cross connection layers compared with previous architectures.
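To make the StaticDropout idea concrete, below is a minimal PyTorch sketch. It assumes a precomputed per-channel static-bias score (here a placeholder random tensor) produced by some bias-quantification step, and zeroes out the most static-biased channels during training. The class name, the `drop_fraction` parameter, and the scoring tensor are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Hypothetical sketch of a StaticDropout-style layer: drop the channels
# most biased toward static information so training must rely on dynamics.
import torch
import torch.nn as nn


class StaticDropout(nn.Module):
    """Zeroes the most static-biased channels during training.

    `static_bias` is an assumed per-channel score (higher = more static-
    biased), estimated beforehand by a bias-quantification procedure.
    """

    def __init__(self, static_bias: torch.Tensor, drop_fraction: float = 0.25):
        super().__init__()
        self.register_buffer("static_bias", static_bias)
        # Drop the top `drop_fraction` most static-biased channels:
        # the threshold is the k-th largest bias score.
        k = max(1, int(drop_fraction * static_bias.numel()))
        threshold = torch.topk(static_bias, k).values.min()
        self.register_buffer("keep_mask", (static_bias < threshold).float())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width) features from a
        # spatiotemporal backbone. Like standard dropout, the layer is
        # an identity at inference time.
        if not self.training:
            return x
        mask = self.keep_mask.view(1, -1, 1, 1, 1)
        # Rescale surviving channels to preserve expected magnitude.
        scale = self.keep_mask.numel() / self.keep_mask.sum().clamp(min=1.0)
        return x * mask * scale


# Usage: a backbone stage with 256 channels and placeholder bias scores.
scores = torch.rand(256)                # stand-in for estimated bias scores
layer = StaticDropout(scores, drop_fraction=0.25)
feats = torch.randn(2, 256, 8, 14, 14)  # (B, C, T, H, W)
layer.train()
out = layer(feats)
```

As with standard dropout, the surviving channels are rescaled during training so their expected magnitude matches inference, where the layer passes features through unchanged.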