Recognizing Video Activities in the Wild via View-to-Scene Joint Learning

IEEE Transactions on Automation Science and Engineering · Impact Factor 6.4 · JCR Q1 (Automation & Control Systems) · CAS Tier 2 (Computer Science) · Published: 2024-07-23 · DOI: 10.1109/TASE.2024.3431128
Jiahui Yu;Yifan Chen;Xuna Wang;Xu Cheng;Zhaojie Ju;Yingke Xu
Volume 22, pages 5816-5827. Full text: https://ieeexplore.ieee.org/document/10608038/
Citations: 0

Abstract

Recognizing video actions in the wild is challenging for visual control systems. In-the-wild videos contain actions absent from the training data, recorded from varied angles and scenes yet sharing the same labels. Most existing methods address this challenge by building complex frameworks to extract spatiotemporal features. To achieve view robustness and scene generalization cost-effectively, we explore view consistency and joint scene understanding. Based on this, we propose a neural network (called Wild-VAR) that learns view and scene information jointly without any 3D pose ground-truth labels, a new approach to recognizing video actions in the wild. Unlike most existing methods, we first propose a Cubing module that self-learns body consistency between views rather than comprehensive image features, boosting generalization in cross-view settings. Specifically, we map 3D representations to multiple 2D features and then adopt a self-adaptive scheme to constrain the 2D features from different perspectives. Moreover, we propose a temporal neural network (called T-Scene) to build the recognition framework, enabling Wild-VAR to flexibly learn scenes across time, including key interactors and context, in video sequences. Extensive experiments show that Wild-VAR consistently outperforms state-of-the-art methods on four benchmarks. Notably, with only half the computation cost, Wild-VAR improves accuracy by 2.2% and 1.3% on the Kinetics-400 and Something-Something V2 datasets, respectively.

Note to Practitioners: In human-robot interaction tasks, video action recognition is a prerequisite for visual control. In real applications, humans move freely in 3D space, producing significant changes in the capture viewpoint and constantly changing scenes. Deep neural networks are limited by the perspectives and scenarios contained in their training data, so most existing methods are effective only at identifying actions from 2-4 fixed views against a single background. As a result, existing models often generalize poorly to unconstrained application environments. Human-view and video-scene understanding are also usually treated separately. Inspired by the human visual system, this paper proposes a cost-efficient view-to-scene video processing method. In real-world applications, this lightweight method can be integrated into robots to help identify human behavior in complex environments. The smaller parameter count means the method can easily be migrated to different types of behaviors, and the reduced computational cost makes real-time performance achievable on limited hardware.
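The abstract describes mapping a 3D representation to multiple 2D per-view features and constraining those features to agree across perspectives. The paper's actual architecture is not reproduced here; the following is a minimal, hypothetical NumPy sketch of that idea, where "projection" is a simple mean over each axis and the cross-view constraint is a toy variance penalty on per-view summaries (the names `project_to_views` and `view_consistency_loss` are illustrative, not from the paper):

```python
import numpy as np

def project_to_views(volume):
    """Collapse a 3D feature volume (D, H, W) into three 2D feature maps,
    one per axis -- a stand-in for mapping a 3D representation to
    multiple 2D per-view features."""
    return [volume.mean(axis=axis) for axis in range(volume.ndim)]

def view_consistency_loss(views):
    """Penalize disagreement between per-view summary statistics.
    A simplified, hypothetical stand-in for a cross-view consistency
    constraint: each view is summarized by its standard deviation and
    the loss is the variance of those summaries (0 = perfect agreement)."""
    summaries = np.array([v.std() for v in views])
    return float(((summaries - summaries.mean()) ** 2).mean())

rng = np.random.default_rng(0)
volume = rng.standard_normal((8, 8, 8))   # toy 3D feature volume
views = project_to_views(volume)          # three 8x8 "view" features
loss = view_consistency_loss(views)       # >= 0; lower = more consistent
```

In a trained model this loss term would be minimized jointly with the recognition objective, pushing features extracted from different viewpoints of the same body toward agreement, which is the intuition the Cubing module's self-adaptive scheme formalizes.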
Source journal
IEEE Transactions on Automation Science and Engineering (Engineering Technology: Automation & Control Systems)
CiteScore: 12.50 · Self-citation rate: 14.30% · Articles per year: 404 · Review time: 3.0 months
Journal description: The IEEE Transactions on Automation Science and Engineering (T-ASE) publishes fundamental papers on Automation, emphasizing scientific results that advance efficiency, quality, productivity, and reliability. T-ASE encourages interdisciplinary approaches from computer science, control systems, electrical engineering, mathematics, mechanical engineering, operations research, and other fields. T-ASE welcomes results relevant to industries such as agriculture, biotechnology, healthcare, home automation, maintenance, manufacturing, pharmaceuticals, retail, security, service, supply chains, and transportation. T-ASE addresses a research community willing to integrate knowledge across disciplines and industries. For this purpose, each paper includes a Note to Practitioners that summarizes how its results can be applied or how they might be extended to apply in practice.
Latest articles in this journal
- Automation 5.0: The Step to Systems Intelligence for a Sustainable Future
- Dual-Layer Bumpless Transfer Control for Markovian Jump Systems and Its Applications: A Weighted Coefficient Mechanism
- Continuous-time/event-triggered decentralized output feedback prescribed-time control applied to a 2-DOF helicopter
- High-Precision Tracking Control of Multi-Axis Electro-Optical Systems Based on Accurate Identification of Dynamic Parameters
- Distributed Adaptive Secondary Control of DC Microgrids with Uncertainties: A Real-time Parameter Estimation Approach