{"title":"Exploring global context and position-aware representation for group activity recognition","authors":"","doi":"10.1016/j.imavis.2024.105181","DOIUrl":null,"url":null,"abstract":"<div><p>This paper explores the context and position information in the scene for group activity understanding. Firstly, previous group activity recognition methods strive to reason on individual features without considering the information in the scene. Besides correlations among actors, we argue that integrating the scene context simultaneously can afford us more useful and supplementary cues. Therefore, we propose a new network, termed Contextual Transformer Network (CTN), to incorporate global contextual information into individual representations. In addition, the position of individuals also plays a vital role in group activity understanding. Unlike previous methods that explore correlations among individuals semantically, we propose Clustered Position Embedding (CPE) to integrate the spatial structure of actors and produce position-aware representations. Experimental results on two widely used datasets for sports video and social activity (i.e., Volleyball and Collective Activity datasets) show that the proposed method outperforms state-of-the-art approaches. Especially, when using ResNet-18 as the backbone, our method achieves 93.6/93.9% MCA/MPCA on the Volleyball dataset and 95.4/96.3% MCA/MPCA on the Collective Activity dataset.</p></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":null,"pages":null},"PeriodicalIF":4.2000,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885624002865","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
This paper explores the context and position information in the scene for group activity understanding. First, previous group activity recognition methods strive to reason over individual features without considering the information in the scene. Beyond correlations among actors, we argue that simultaneously integrating the scene context can provide useful, supplementary cues. Therefore, we propose a new network, termed Contextual Transformer Network (CTN), to incorporate global contextual information into individual representations. In addition, the position of individuals also plays a vital role in group activity understanding. Unlike previous methods that explore correlations among individuals semantically, we propose Clustered Position Embedding (CPE) to integrate the spatial structure of actors and produce position-aware representations. Experimental results on two widely used datasets for sports video and social activity (i.e., the Volleyball and Collective Activity datasets) show that the proposed method outperforms state-of-the-art approaches. In particular, with ResNet-18 as the backbone, our method achieves 93.6/93.9% MCA/MPCA on the Volleyball dataset and 95.4/96.3% MCA/MPCA on the Collective Activity dataset.
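The abstract does not specify how CTN and CPE are implemented, so the PyTorch sketch below is only an illustrative reading of the two ideas, not the authors' method: actor positions are assigned to learned cluster centres and receive that cluster's embedding (one plausible form of "clustered" position embedding), and actor tokens cross-attend to flattened backbone features as global scene context before self-attending among themselves. All module names, dimensions, and the number of clusters are assumptions.

```python
# Hypothetical sketch only: every design choice here is an assumption,
# since the abstract gives no architectural details of CTN or CPE.
import torch
import torch.nn as nn


class ClusteredPositionEmbedding(nn.Module):
    """Assign each actor's normalised (x, y) centre to the nearest of K
    learned cluster centres and return that cluster's embedding."""

    def __init__(self, num_clusters: int = 8, dim: int = 256):
        super().__init__()
        self.centres = nn.Parameter(torch.rand(num_clusters, 2))  # in [0, 1]^2
        self.embed = nn.Embedding(num_clusters, dim)

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        # positions: (B, N, 2) normalised actor box centres
        centres = self.centres.unsqueeze(0).expand(positions.size(0), -1, -1)
        idx = torch.cdist(positions, centres).argmin(dim=-1)  # (B, N) cluster id
        return self.embed(idx)                                 # (B, N, dim)


class ContextualTransformerBlock(nn.Module):
    """Cross-attention from actor tokens to global scene-context tokens,
    then self-attention among actors (assumed layout)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, actors: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # actors:  (B, N, dim)  actor features + position embedding
        # context: (B, HW, dim) flattened backbone feature map (scene context)
        x = self.norm1(actors + self.cross_attn(actors, context, context)[0])
        x = self.norm2(x + self.self_attn(x, x, x)[0])
        return self.norm3(x + self.ffn(x))


if __name__ == "__main__":
    B, N, HW, dim = 2, 12, 49, 256
    cpe = ClusteredPositionEmbedding(num_clusters=8, dim=dim)
    block = ContextualTransformerBlock(dim=dim)
    actor_feats = torch.randn(B, N, dim)   # e.g. RoI-pooled per-actor features
    positions = torch.rand(B, N, 2)        # normalised box centres
    scene_ctx = torch.randn(B, HW, dim)    # flattened 7x7 backbone feature map
    out = block(actor_feats + cpe(positions), scene_ctx)
    print(out.shape)                       # torch.Size([2, 12, 256])
```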
Journal overview:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to foster a deeper understanding of the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.