Local Compressed Video Stream Learning for Generic Event Boundary Detection

IF 9.3 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | International Journal of Computer Vision | Pub Date: 2023-11-01 | DOI: 10.1007/s11263-023-01921-8

Libo Zhang, Xin Gu, Congcong Li, Tiejian Luo, Heng Fan


Citations: 0

Abstract

Generic event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks. Existing methods typically require video frames to be decoded before they are fed into the network; the decoded frames contain significant spatio-temporal redundancy, and processing them demands considerable computational power and storage space. To remedy these issues, we propose a novel, fully end-to-end compressed video representation learning method for event boundary detection that leverages the rich information in the compressed domain, i.e., RGB, motion vectors, residuals, and the internal group of pictures (GOP) structure, without fully decoding the video. Specifically, we use lightweight ConvNets to extract features of the P-frames in the GOPs, and we design a spatial-channel attention module (SCAM) that refines the feature representations of the P-frames based on the compressed information with bidirectional information flow. To learn a suitable representation for boundary detection, we construct a local frames bag for each candidate frame and use a long short-term memory (LSTM) module to capture temporal relationships. We then compute frame differences with group similarities in the temporal domain. This module is applied only within a local window, which is critical for event boundary detection. Finally, a simple classifier determines the event boundaries of video sequences based on the learned feature representation. To remedy the ambiguities of annotations and speed up the training process, we use a Gaussian kernel to preprocess the ground-truth event boundaries. Extensive experiments conducted on the Kinetics-GEBD and TAPOS datasets demonstrate that the proposed method achieves considerable improvements compared with previous end-to-end approaches while running at the same speed. The code is available at https://github.com/GX77/LCVSL.
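As an illustration of the Gaussian preprocessing step mentioned in the abstract, the following is a minimal sketch of how hard boundary annotations could be turned into soft per-frame targets. The function name `gaussian_soft_labels`, the `sigma` parameter, and the choice to merge overlapping boundaries with a maximum are assumptions made for this example; the abstract does not specify them, and the authors' released code may differ.

```python
import math


def gaussian_soft_labels(num_frames, boundaries, sigma=1.0):
    """Return one soft label in [0, 1] per frame, peaking at each boundary.

    Hypothetical helper: sigma and the max-merge rule are illustrative
    assumptions, not values taken from the paper.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    labels = [0.0] * num_frames
    for t in range(num_frames):
        for b in boundaries:
            # Unnormalized Gaussian centered on the annotated boundary b.
            weight = math.exp(-((t - b) ** 2) / (2.0 * sigma ** 2))
            # Keep the strongest response when several boundaries are near.
            labels[t] = max(labels[t], weight)
    return labels


# Example: a 10-frame clip with annotated boundaries at frames 3 and 7.
labels = gaussian_soft_labels(num_frames=10, boundaries=[3, 7], sigma=1.0)
```

Frames at an annotated boundary receive a target of 1, and the target decays smoothly with distance, so a prediction one frame away from an ambiguous annotation is penalized less than one far away.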




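Source journal

International Journal of Computer Vision (Engineering and Technology: Computer Science, Artificial Intelligence)

CiteScore: 29.80

Self-citation rate: 2.10%

Articles published: 163

Review time: 6 months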

About the journal: The International Journal of Computer Vision (IJCV) serves as a platform for sharing new research findings in the rapidly growing field of computer vision. It publishes 12 issues annually and presents high-quality, original contributions to the science and engineering of computer vision. The journal accepts several types of articles. Regular articles, which span up to 25 journal pages, focus on significant technical advancements that are of broad interest to the field. Short articles, limited to 10 pages, offer a swift publication path for novel research outcomes. Survey articles, comprising up to 30 pages, offer critical evaluations of the current state of the art in computer vision or tutorial presentations of relevant topics. In addition to technical articles, the journal also includes book reviews, position papers, and editorials by prominent scientific figures, which complement the technical content. The journal encourages authors to include supplementary material online, such as images, video sequences, data sets, and software, to enhance the understanding and reproducibility of the published research.
