{"title":"Psychology-Guided Environment Aware Network for Discovering Social Interaction Groups from Videos","authors":"Jiaqi Yu, Jinhai Yang, Hua Yang, Renjie Pan, Pingrui Lai, Guangtao Zhai","doi":"10.1145/3657295","DOIUrl":null,"url":null,"abstract":"<p>Social interaction is a common phenomenon in human societies. Different from discovering groups based on the similarity of individuals’ actions, social interaction focuses more on the mutual influence between people. Although people can easily judge whether or not there are social interactions in a real-world scene, it is difficult for an intelligent system to discover social interactions. Initiating and concluding social interactions are greatly influenced by an individual’s social cognition and the surrounding environment, which are closely related to psychology. Thus, converting the psychological factors that impact social interactions into quantifiable visual representations and creating a model for interaction relationships poses a significant challenge. To this end, we propose a Psychology-Guided Environment Aware Network (PEAN) that models social interaction among people in videos using supervised learning. Specifically, we divide the surrounding environment into scene-aware visual-based and human-aware visual-based descriptions. For the scene-aware visual clue, we utilize 3D features as global visual representations. For the human-aware visual clue, we consider instance-based location and behaviour-related visual representations to map human-centered interaction elements in social psychology: distance, openness and orientation. In addition, we design an environment aware mechanism to integrate features from visual clues, with a Transformer to explore the relation between individuals and construct pairwise interaction strength features. The interaction intensity matrix reflecting the mutual nature of the interaction is obtained by processing the interaction strength features with the interaction discovery module. An interaction constrained loss function composed of interaction critical loss function and smooth <i>F<sub>β</sub></i> loss function is proposed to optimize the whole framework to improve the distinction of the interaction matrix and alleviate class imbalance caused by pairwise interaction sparsity. Given the diversity of real-world interactions, we collect a new dataset named Social Basketball Activity Dataset (Soical-BAD), covering complex social interactions. Our method achieves the best performance among social-CAD, social-BAD, and their combined dataset named Video Social Interaction Dataset (VSID).</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"44 1","pages":""},"PeriodicalIF":5.2000,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Multimedia Computing Communications and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3657295","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Social interaction is a common phenomenon in human societies. Unlike group discovery based on the similarity of individuals’ actions, social interaction focuses on the mutual influence between people. Although people can easily judge whether social interactions are present in a real-world scene, it is difficult for an intelligent system to discover them. Initiating and concluding social interactions are strongly influenced by an individual’s social cognition and the surrounding environment, both of which are closely related to psychology. Converting the psychological factors that shape social interactions into quantifiable visual representations, and modelling the resulting interaction relationships, therefore poses a significant challenge. To this end, we propose a Psychology-Guided Environment Aware Network (PEAN) that models social interaction among people in videos using supervised learning. Specifically, we divide the surrounding environment into scene-aware and human-aware visual descriptions. For the scene-aware visual clue, we utilize 3D features as global visual representations. For the human-aware visual clue, we consider instance-based location and behaviour-related visual representations to map the human-centered interaction elements of social psychology: distance, openness and orientation. In addition, we design an environment aware mechanism to integrate features from these visual clues, with a Transformer to explore the relations between individuals and construct pairwise interaction strength features. An interaction intensity matrix reflecting the mutual nature of interaction is obtained by processing the interaction strength features with the interaction discovery module. An interaction constrained loss function, composed of an interaction critical loss and a smooth Fβ loss, is proposed to optimize the whole framework, improving the distinctiveness of the interaction matrix and alleviating the class imbalance caused by pairwise interaction sparsity. Given the diversity of real-world interactions, we collect a new dataset named the Social Basketball Activity Dataset (Social-BAD), covering complex social interactions. Our method achieves the best performance on Social-CAD, Social-BAD, and their combination, the Video Social Interaction Dataset (VSID).
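The abstract does not give implementation details, so the following is only a minimal PyTorch sketch of the two most concrete ideas it describes: a Transformer over per-person features that yields pairwise interaction strengths, a symmetric interaction intensity matrix reflecting the mutual nature of interaction, and a smooth Fβ-style loss for sparse pairwise labels. The module names (PairwiseStrength, SmoothFBetaLoss, interaction_intensity), the bilinear pairwise scoring, the averaging-based symmetrization, and the soft-count Fβ formulation are assumptions made for illustration, not the paper's actual PEAN architecture or loss definition.

```python
import torch
import torch.nn as nn


class PairwiseStrength(nn.Module):
    """Toy relation module: a Transformer encoder over per-person feature tokens,
    followed by a bilinear score for every ordered pair of people.
    Illustrative stand-in, not the paper's environment aware mechanism or
    interaction discovery module."""

    def __init__(self, dim: int = 128, heads: int = 4, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.bilinear = nn.Bilinear(dim, dim, 1)

    def forward(self, person_feats: torch.Tensor) -> torch.Tensor:
        # person_feats: (B, N, D) fused scene-aware + human-aware features
        x = self.encoder(person_feats)                        # relation-aware tokens
        b, n, d = x.shape
        xi = x.unsqueeze(2).expand(b, n, n, d).reshape(-1, d)
        xj = x.unsqueeze(1).expand(b, n, n, d).reshape(-1, d)
        return self.bilinear(xi, xj).view(b, n, n)            # raw pairwise strengths


def interaction_intensity(pair_strength: torch.Tensor) -> torch.Tensor:
    """Map raw pairwise strengths (B, N, N) to a symmetric interaction intensity
    matrix in [0, 1]; symmetrizing by averaging is an assumption used here to
    reflect that interaction is mutual."""
    probs = torch.sigmoid(pair_strength)
    sym = 0.5 * (probs + probs.transpose(-1, -2))
    eye = torch.eye(sym.size(-1), device=sym.device, dtype=torch.bool)
    return sym.masked_fill(eye, 0.0)  # a person does not interact with themself


class SmoothFBetaLoss(nn.Module):
    """1 - F_beta computed from soft counts, a common differentiable surrogate
    for sparse binary pairwise labels; the paper's smooth Fβ loss may be
    defined differently."""

    def __init__(self, beta: float = 1.0, eps: float = 1e-6):
        super().__init__()
        self.beta2 = beta * beta
        self.eps = eps

    def forward(self, probs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        tp = (probs * targets).sum()             # soft true positives
        fp = (probs * (1.0 - targets)).sum()     # soft false positives
        fn = ((1.0 - probs) * targets).sum()     # soft false negatives
        f_beta = (1.0 + self.beta2) * tp / (
            (1.0 + self.beta2) * tp + self.beta2 * fn + fp + self.eps
        )
        return 1.0 - f_beta


if __name__ == "__main__":
    torch.manual_seed(0)
    feats = torch.randn(2, 6, 128)                # 2 clips, 6 people each (toy data)
    labels = (torch.rand(2, 6, 6) > 0.8).float()  # sparse pairwise interaction labels
    strengths = PairwiseStrength()(feats)
    intensity = interaction_intensity(strengths)
    loss = SmoothFBetaLoss(beta=2.0)(intensity, labels)
    print(intensity.shape, float(loss))
```

Because truly interacting pairs are sparse, an Fβ-style objective trades off precision and recall through β instead of weighting every pair equally, which is one common way to counter the class imbalance the abstract mentions.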
Journal Introduction
The ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) is the flagship publication of the ACM Special Interest Group on Multimedia (SIGMM). It solicits paper submissions on all aspects of multimedia. Papers on single media (for instance, audio, video, animation) and their processing are also welcome.
TOMM is a peer-reviewed, archival journal, available in both print and digital form. The journal is published quarterly, with roughly seven 23-page articles in each issue. In addition, all Special Issues are published online-only to ensure timely publication. The transactions consist primarily of research papers. As an archival journal, it is intended that the papers will have lasting importance and value over time. In general, papers whose primary focus is on particular multimedia products or the current state of the industry will not be included.