Uni-AdaFocus: Spatial-Temporal Dynamic Computation for Video Recognition

Yulin Wang;Haoji Zhang;Yang Yue;Shiji Song;Chao Deng;Junlan Feng;Gao Huang
{"title":"Uni-AdaFocus: Spatial-Temporal Dynamic Computation for Video Recognition","authors":"Yulin Wang;Haoji Zhang;Yang Yue;Shiji Song;Chao Deng;Junlan Feng;Gao Huang","doi":"10.1109/TPAMI.2024.3514654","DOIUrl":null,"url":null,"abstract":"This paper presents a comprehensive exploration of the phenomenon of data redundancy in video understanding, with the aim to improve computational efficiency. Our investigation commences with an examination of <italic>spatial redundancy</i>, which refers to the observation that the most informative region in each video frame usually corresponds to a small image patch, whose shape, size and location shift smoothly across frames. Motivated by this phenomenon, we formulate the patch localization problem as a dynamic decision task, and introduce a spatially adaptive video recognition approach, termed AdaFocus. In specific, a lightweight encoder is first employed to quickly process the full video sequence, whose features are then utilized by a policy network to identify the most task-relevant regions. Subsequently, the selected patches are inferred by a high-capacity deep network for the final prediction. The complete model can be trained conveniently in an end-to-end manner. During inference, once the informative patch sequence has been generated, the bulk of computation can be executed in parallel, rendering it efficient on modern GPU devices. Furthermore, we demonstrate that AdaFocus can be easily extended by further considering the <italic>temporal</i> and <italic>sample-wise</i> redundancies, i.e., allocating the majority of computation to the most task-relevant video frames, and minimizing the computation spent on relatively “easier” videos. Our resulting algorithm, Uni-AdaFocus, establishes a comprehensive framework that seamlessly integrates spatial, temporal, and sample-wise dynamic computation, while it preserves the merits of AdaFocus in terms of efficient end-to-end training and hardware friendliness. In addition, Uni-AdaFocus is general and flexible as it is compatible with off-the-shelf backbone models (e.g., TSM and X3D), which can be readily deployed as our feature extractor, yielding a significantly improved computational efficiency. Empirically, extensive experiments based on seven widely-used benchmark datasets (i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V1&V2, Jester, and Kinetics-400) and three real-world application scenarios (i.e., fine-grained diving action classification, Alzheimer’s and Parkinson’s diseases diagnosis with brain magnetic resonance images (MRI), and violence recognition for online videos) substantiate that Uni-AdaFocus is considerably more efficient than the competitive baselines.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 3","pages":"1782-1799"},"PeriodicalIF":18.6000,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10787270/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

This paper presents a comprehensive exploration of data redundancy in video understanding, with the aim of improving computational efficiency. Our investigation begins with spatial redundancy: the most informative region in each video frame usually corresponds to a small image patch, whose shape, size, and location shift smoothly across frames. Motivated by this observation, we formulate patch localization as a dynamic decision task and introduce a spatially adaptive video recognition approach, termed AdaFocus. Specifically, a lightweight encoder first processes the full video sequence quickly, and its features are used by a policy network to identify the most task-relevant regions. The selected patches are then processed by a high-capacity deep network to produce the final prediction. The complete model can be trained conveniently in an end-to-end manner. During inference, once the informative patch sequence has been generated, the bulk of the computation can be executed in parallel, making the approach efficient on modern GPU devices. Furthermore, we demonstrate that AdaFocus can be easily extended to exploit temporal and sample-wise redundancy, i.e., allocating the majority of computation to the most task-relevant video frames and minimizing the computation spent on relatively "easier" videos. The resulting algorithm, Uni-AdaFocus, establishes a comprehensive framework that seamlessly integrates spatial, temporal, and sample-wise dynamic computation, while preserving the merits of AdaFocus in terms of efficient end-to-end training and hardware friendliness. In addition, Uni-AdaFocus is general and flexible: it is compatible with off-the-shelf backbone models (e.g., TSM and X3D), which can readily be deployed as its feature extractor, yielding significantly improved computational efficiency. Empirically, extensive experiments on seven widely used benchmark datasets (ActivityNet, FCVID, Mini-Kinetics, Something-Something V1&V2, Jester, and Kinetics-400) and three real-world application scenarios (fine-grained diving action classification, diagnosis of Alzheimer's and Parkinson's diseases from brain magnetic resonance imaging (MRI), and violence recognition in online videos) substantiate that Uni-AdaFocus is considerably more efficient than competitive baselines.
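To make the spatial pipeline concrete, the sketch below illustrates the global-then-local flow described in the abstract: a lightweight encoder scans every frame, a policy head predicts where the informative patch lies, and a high-capacity encoder classifies only the cropped patches. All module architectures, the differentiable crop via affine grid sampling, and the names used here (`AdaFocusSketch`, `global_encoder`, `policy`, `local_encoder`) are illustrative assumptions for exposition, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaFocusSketch(nn.Module):
    """Conceptual sketch of the spatial AdaFocus pipeline (assumed design)."""

    def __init__(self, num_classes: int = 200, patch_size: int = 96):
        super().__init__()
        self.patch_size = patch_size
        # Lightweight global encoder: cheap per-frame features (assumed architecture).
        self.global_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Policy head: predicts the (x, y) centre of the informative patch per frame.
        self.policy = nn.Linear(32, 2)
        # High-capacity local encoder: sees only the selected patches (assumed architecture).
        self.local_encoder = nn.Sequential(
            nn.Conv2d(3, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(256, num_classes)

    def crop(self, frames: torch.Tensor, centres: torch.Tensor) -> torch.Tensor:
        # Differentiable square crop via affine grid sampling (assumes square frames),
        # one common way to keep patch selection trainable end-to-end.
        b, _, h, w = frames.shape
        scale = self.patch_size / h
        theta = torch.zeros(b, 2, 3, device=frames.device)
        theta[:, 0, 0] = scale
        theta[:, 1, 1] = scale
        theta[:, :, 2] = centres  # patch centres in [-1, 1] normalised coordinates
        grid = F.affine_grid(theta, (b, 3, self.patch_size, self.patch_size),
                             align_corners=False)
        return F.grid_sample(frames, grid, align_corners=False)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, time, 3, H, W)
        b, t, c, h, w = video.shape
        frames = video.reshape(b * t, c, h, w)
        # Cheap pass over all frames, then patch-centre prediction.
        centres = torch.tanh(self.policy(self.global_encoder(frames)))
        patches = self.crop(frames, centres)
        # Expensive pass runs only on the small patches; once the patch sequence is
        # fixed it can be batched in parallel, which is the hardware-friendliness
        # the abstract refers to.
        feats = self.local_encoder(patches).reshape(b, t, -1).mean(dim=1)
        return self.classifier(feats)


if __name__ == "__main__":
    model = AdaFocusSketch()
    logits = model(torch.randn(2, 8, 3, 224, 224))  # 2 clips, 8 frames each
    print(logits.shape)  # torch.Size([2, 200])
```

The temporal and sample-wise extensions described in the abstract would, under the same assumptions, sit on top of this sketch: the global features could additionally score frames so that only the most task-relevant ones reach the local encoder, and an early-exit criterion on the running prediction could terminate computation for "easier" videos.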