{"title":"End-to-End Streaming Video Temporal Action Segmentation With Reinforcement Learning","authors":"Jin-Rong Zhang;Wu-Jun Wen;Sheng-Lan Liu;Gao Huang;Yun-Heng Li;Qi-Feng Li;Lin Feng","doi":"10.1109/TNNLS.2025.3550910","DOIUrl":null,"url":null,"abstract":"The streaming temporal action segmentation (STAS) task, a supplementary task of temporal action segmentation (TAS), has not received adequate attention in the field of video understanding. Existing TAS methods are constrained to offline scenarios due to their heavy reliance on multimodal features and complete contextual information. The STAS task requires the model to classify each frame of the entire untrimmed video sequence clip by clip in time, thereby extending the applicability of TAS methods to online scenarios. However, directly applying existing TAS methods to SATS tasks results in significantly poor segmentation outcomes. In this article, we thoroughly analyze the fundamental differences between STAS tasks and TAS tasks, attributing the severe performance degradation when transferring models to model bias and optimization dilemmas. We introduce an end-to-end streaming video TAS model with reinforcement learning (SVTAS-RL). The end-to-end modeling method mitigates the modeling bias introduced by the change in task nature and enhances the feasibility of online solutions. Reinforcement learning (RL) is utilized to alleviate the optimization dilemma. Through extensive experiments, the SVTAS-RL model significantly outperforms existing STAS models and achieves competitive performance to the state-of-the-art (SOTA) TAS model on multiple datasets under the same evaluation criteria, demonstrating notable advantages on the ultralong video dataset EGTEA. Our code is publicly available at <uri>https://github.com/Thinksky5124/SVTAS</uri>.","PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"36 8","pages":"15449-15462"},"PeriodicalIF":8.9000,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on neural networks and learning systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10963907/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
The streaming temporal action segmentation (STAS) task, a supplementary task of temporal action segmentation (TAS), has not received adequate attention in the field of video understanding. Existing TAS methods are constrained to offline scenarios because of their heavy reliance on multimodal features and complete contextual information. The STAS task requires the model to classify each frame of an entire untrimmed video sequence clip by clip in temporal order, thereby extending the applicability of TAS methods to online scenarios. However, directly applying existing TAS methods to STAS tasks yields significantly poor segmentation results. In this article, we thoroughly analyze the fundamental differences between STAS and TAS tasks, attributing the severe performance degradation observed when transferring models to modeling bias and optimization dilemmas. We introduce an end-to-end streaming video TAS model with reinforcement learning (SVTAS-RL). The end-to-end modeling approach mitigates the modeling bias introduced by the change in task nature and enhances the feasibility of online solutions, while reinforcement learning (RL) is used to alleviate the optimization dilemma. Extensive experiments show that the SVTAS-RL model significantly outperforms existing STAS models and achieves performance competitive with state-of-the-art (SOTA) TAS models on multiple datasets under the same evaluation criteria, with notable advantages on the ultralong video dataset EGTEA. Our code is publicly available at https://github.com/Thinksky5124/SVTAS.
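To make the clip-by-clip constraint concrete, the sketch below illustrates a generic streaming inference loop in which a causal model labels each incoming clip using only past context, never future frames. This is a minimal illustration under assumed names (`StreamingSegmenter`, `clip_size`, a GRU backbone over pre-extracted features); it is not the SVTAS-RL architecture or the reinforcement-learning component described in the paper.

```python
# Minimal sketch of the streaming (clip-by-clip) setting: every frame of an
# untrimmed video is classified one clip at a time, with only past context.
# Class and parameter names are illustrative assumptions, not from the paper.
import torch
import torch.nn as nn


class StreamingSegmenter(nn.Module):
    """Labels every frame of a video one clip at a time."""

    def __init__(self, feat_dim: int = 2048, hidden_dim: int = 256, num_classes: int = 19):
        super().__init__()
        # A causal recurrent backbone stands in for any online temporal model.
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, clip_feats: torch.Tensor, state=None):
        # clip_feats: (batch, clip_len, feat_dim) -- frames of the current clip only.
        out, state = self.rnn(clip_feats, state)  # no access to future clips
        logits = self.head(out)                   # (batch, clip_len, num_classes)
        return logits, state


def stream_inference(model, video_feats: torch.Tensor, clip_size: int = 32):
    """Emit per-frame labels clip by clip, as the STAS setting requires."""
    model.eval()
    state, labels = None, []
    with torch.no_grad():
        for start in range(0, video_feats.shape[1], clip_size):
            clip = video_feats[:, start:start + clip_size]  # current clip only
            logits, state = model(clip, state)              # state carries past context
            labels.append(logits.argmax(dim=-1))
    return torch.cat(labels, dim=1)  # per-frame predictions for the whole video


if __name__ == "__main__":
    model = StreamingSegmenter()
    video = torch.randn(1, 200, 2048)   # 200 frames of pre-extracted features
    preds = stream_inference(model, video)
    print(preds.shape)                  # torch.Size([1, 200])
```

In contrast to this online loop, offline TAS methods process the complete feature sequence at once, which is exactly the contextual dependence the paper identifies as the obstacle to transferring them to streaming scenarios.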
Journal Introduction:
The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.