React: recognize every action everywhere all at once

IF 2.3 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Machine Vision and Applications Pub Date : 2024-07-20 DOI:10.1007/s00138-024-01561-z

Naga V. S. Raviteja Chappa, Pha Nguyen, Page Daniel Dobbs, Khoa Luu

{"title":"React: recognize every action everywhere all at once","authors":"Naga V. S. Raviteja Chappa, Pha Nguyen, Page Daniel Dobbs, Khoa Luu","doi":"10.1007/s00138-024-01561-z","DOIUrl":null,"url":null,"abstract":"In the realm of computer vision, Group Activity Recognition (GAR) plays a vital role, finding applications in sports video analysis, surveillance, and social scene understanding. This paper introduces Recognize Every Action Everywhere All At Once (REACT), a novel architecture designed to model complex contextual relationships within videos. REACT leverages advanced transformer-based models for encoding intricate contextual relationships, enhancing understanding of group dynamics. Integrated Vision-Language Encoding facilitates efficient capture of spatiotemporal interactions and multi-modal information, enabling comprehensive scene understanding. The model’s precise action localization refines joint understanding of text and video data, enabling precise bounding box retrieval and enhancing semantic links between textual descriptions and visual reality. Actor-Specific Fusion strikes a balance between actor-specific details and contextual information, improving model specificity and robustness in recognizing group activities. Experimental results demonstrate REACT’s superiority over state-of-the-art GAR approaches, achieving higher accuracy in recognizing and understanding group activities across diverse datasets. This work significantly advances group activity recognition, offering a robust framework for nuanced scene comprehension.","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"24 1","pages":""},"PeriodicalIF":2.3000,"publicationDate":"2024-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine Vision and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s00138-024-01561-z","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

In the realm of computer vision, Group Activity Recognition (GAR) plays a vital role, finding applications in sports video analysis, surveillance, and social scene understanding. This paper introduces Recognize Every Action Everywhere All At Once (REACT), a novel architecture designed to model complex contextual relationships within videos. REACT leverages advanced transformer-based models for encoding intricate contextual relationships, enhancing understanding of group dynamics. Integrated Vision-Language Encoding facilitates efficient capture of spatiotemporal interactions and multi-modal information, enabling comprehensive scene understanding. The model’s precise action localization refines joint understanding of text and video data, enabling precise bounding box retrieval and enhancing semantic links between textual descriptions and visual reality. Actor-Specific Fusion strikes a balance between actor-specific details and contextual information, improving model specificity and robustness in recognizing group activities. Experimental results demonstrate REACT’s superiority over state-of-the-art GAR approaches, achieving higher accuracy in recognizing and understanding group activities across diverse datasets. This work significantly advances group activity recognition, offering a robust framework for nuanced scene comprehension.

Abstract Image

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

反应：一次性识别各地的每一个动作

在计算机视觉领域，群体活动识别（GAR）发挥着至关重要的作用，可应用于体育视频分析、监控和社会场景理解。本文介绍了 "一次识别所有地方的所有动作"（REACT），这是一种新颖的架构，旨在为视频中复杂的上下文关系建模。REACT 利用先进的基于变压器的模型来编码复杂的上下文关系，从而增强对群体动态的理解。综合视觉语言编码有助于有效捕捉时空互动和多模态信息，从而实现全面的场景理解。该模型的精确动作定位功能可完善对文本和视频数据的共同理解，从而实现精确的边界框检索，并增强文本描述与视觉现实之间的语义联系。特定演员融合在特定演员细节和上下文信息之间取得了平衡，提高了模型的特定性和识别群体活动的鲁棒性。实验结果表明，与最先进的 GAR 方法相比，REACT 在识别和理解不同数据集的群体活动方面具有更高的准确性。这项工作极大地推动了群体活动识别，为细致入微的场景理解提供了一个强大的框架。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Machine Vision and Applications 工程技术-工程：电子与电气

CiteScore

6.30

自引率

3.00%

发文量

审稿时长

8.7 months

期刊介绍： Machine Vision and Applications publishes high-quality technical contributions in machine vision research and development. Specifically, the editors encourage submittals in all applications and engineering aspects of image-related computing. In particular, original contributions dealing with scientific, commercial, industrial, military, and biomedical applications of machine vision, are all within the scope of the journal. Particular emphasis is placed on engineering and technology aspects of image processing and computer vision. The following aspects of machine vision applications are of interest: algorithms, architectures, VLSI implementations, AI techniques and expert systems for machine vision, front-end sensing, multidimensional and multisensor machine vision, real-time techniques, image databases, virtual reality and visualization. Papers must include a significant experimental validation component.