Naga V. S. Raviteja Chappa, Pha Nguyen, Page Daniel Dobbs, Khoa Luu
{"title":"React: recognize every action everywhere all at once","authors":"Naga V. S. Raviteja Chappa, Pha Nguyen, Page Daniel Dobbs, Khoa Luu","doi":"10.1007/s00138-024-01561-z","DOIUrl":null,"url":null,"abstract":"<p>In the realm of computer vision, Group Activity Recognition (GAR) plays a vital role, finding applications in sports video analysis, surveillance, and social scene understanding. This paper introduces <b>R</b>ecognize <b>E</b>very <b>Act</b>ion Everywhere All At Once (REACT), a novel architecture designed to model complex contextual relationships within videos. REACT leverages advanced transformer-based models for encoding intricate contextual relationships, enhancing understanding of group dynamics. Integrated Vision-Language Encoding facilitates efficient capture of spatiotemporal interactions and multi-modal information, enabling comprehensive scene understanding. The model’s precise action localization refines joint understanding of text and video data, enabling precise bounding box retrieval and enhancing semantic links between textual descriptions and visual reality. Actor-Specific Fusion strikes a balance between actor-specific details and contextual information, improving model specificity and robustness in recognizing group activities. Experimental results demonstrate REACT’s superiority over state-of-the-art GAR approaches, achieving higher accuracy in recognizing and understanding group activities across diverse datasets. This work significantly advances group activity recognition, offering a robust framework for nuanced scene comprehension.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"24 1","pages":""},"PeriodicalIF":2.4000,"publicationDate":"2024-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine Vision and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s00138-024-01561-z","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
In the realm of computer vision, Group Activity Recognition (GAR) plays a vital role, finding applications in sports video analysis, surveillance, and social scene understanding. This paper introduces Recognize Every Action Everywhere All At Once (REACT), a novel architecture designed to model complex contextual relationships within videos. REACT leverages advanced transformer-based models for encoding intricate contextual relationships, enhancing understanding of group dynamics. Integrated Vision-Language Encoding facilitates efficient capture of spatiotemporal interactions and multi-modal information, enabling comprehensive scene understanding. The model’s precise action localization refines joint understanding of text and video data, enabling precise bounding box retrieval and enhancing semantic links between textual descriptions and visual reality. Actor-Specific Fusion strikes a balance between actor-specific details and contextual information, improving model specificity and robustness in recognizing group activities. Experimental results demonstrate REACT’s superiority over state-of-the-art GAR approaches, achieving higher accuracy in recognizing and understanding group activities across diverse datasets. This work significantly advances group activity recognition, offering a robust framework for nuanced scene comprehension.
期刊介绍:
Machine Vision and Applications publishes high-quality technical contributions in machine vision research and development. Specifically, the editors encourage submittals in all applications and engineering aspects of image-related computing. In particular, original contributions dealing with scientific, commercial, industrial, military, and biomedical applications of machine vision, are all within the scope of the journal.
Particular emphasis is placed on engineering and technology aspects of image processing and computer vision.
The following aspects of machine vision applications are of interest: algorithms, architectures, VLSI implementations, AI techniques and expert systems for machine vision, front-end sensing, multidimensional and multisensor machine vision, real-time techniques, image databases, virtual reality and visualization. Papers must include a significant experimental validation component.