In pursuit of an immersive virtual experience within Cyber-Physical Metaverse Systems (CPMS), the construction of Avatars often requires a significant amount of real-world data. Mobile Crowd Sensing (MCS) has emerged as an efficient method for collecting such data. While progress has been made in protecting the privacy of workers, little attention has been paid to safeguarding task privacy, which can expose the intentions of applications and pose risks to the development of the Metaverse. Additionally, existing privacy protection schemes hinder the exchange of information among entities, inadvertently compromising the quality of the collected data. To this end, we propose a Quality-aware and Obfuscation-based Task Privacy-Preserving (QOTPP) scheme, which protects task privacy and enhances data quality without third-party involvement. The QOTPP scheme first applies the insight of “showing the fake and hiding the real,” using differential privacy techniques to create fake tasks and conceal genuine ones. We then introduce a two-tier truth discovery mechanism based on Deep Matrix Factorization (DMF) to efficiently identify high-quality workers, and a Combinatorial Multi-Armed Bandit (CMAB)-based worker incentive and selection mechanism to improve the quality of data collection. Theoretical analysis confirms that QOTPP satisfies essential properties such as truthfulness, individual rationality, and ϵ-differential privacy. Extensive simulation experiments show that QOTPP achieves state-of-the-art performance.
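The abstract does not detail the CMAB mechanism, so as a rough, generic illustration of how a combinatorial bandit trades off exploring unknown workers against exploiting known high-quality ones, here is a minimal combinatorial-UCB sketch in Python. The reward model, the function names (`select_workers`, `run_cmab`), and all parameters are illustrative assumptions, not the QOTPP algorithm.

```python
import numpy as np

def select_workers(means, counts, t, k):
    """Pick k workers by upper confidence bound (CUCB-style)."""
    # Unplayed arms get an infinite index so each worker is tried at least once.
    ucb = np.where(counts > 0,
                   means + np.sqrt(2.0 * np.log(max(t, 1)) / np.maximum(counts, 1)),
                   np.inf)
    return np.argsort(ucb)[-k:]          # indices of the k largest UCB scores

def run_cmab(true_quality, rounds=2000, k=3, seed=0):
    rng = np.random.default_rng(seed)
    n = len(true_quality)
    means, counts = np.zeros(n), np.zeros(n)
    for t in range(1, rounds + 1):
        chosen = select_workers(means, counts, t, k)
        # Observed data quality is a noisy Bernoulli draw around each worker's true quality.
        rewards = rng.random(len(chosen)) < true_quality[chosen]
        counts[chosen] += 1
        means[chosen] += (rewards - means[chosen]) / counts[chosen]
    return means

if __name__ == "__main__":
    est = run_cmab(np.array([0.9, 0.8, 0.6, 0.4, 0.3]))
    print("estimated worker quality:", np.round(est, 2))
```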
{"title":"A Quality-Aware and Obfuscation-Based Data Collection Scheme for Cyber-Physical Metaverse Systems","authors":"Jianheng Tang, Kejia Fan, Wenjie Yin, Shihao Yang, Yajiang Huang, Anfeng Liu, Neal N. Xiong, Mianxiong Dong, Tian Wang, Shaobo Zhang","doi":"10.1145/3659582","DOIUrl":"https://doi.org/10.1145/3659582","url":null,"abstract":"<p>In pursuit of an immersive virtual experience within the Cyber-Physical Metaverse Systems (CPMS), the construction of Avatars often requires a significant amount of real-world data. Mobile Crowd Sensing (MCS) has emerged as an efficient method for collecting data for CPMS. While progress has been made in protecting the privacy of workers, little attention has been given to safeguarding task privacy, potentially exposing the intentions of applications and posing risks to the development of the Metaverse. Additionally, existing privacy protection schemes hinder the exchange of information among entities, inadvertently compromising the quality of the collected data. To this end, we propose a Quality-aware and Obfuscation-based Task Privacy-Preserving (QOTPP) scheme, which protects task privacy and enhances data quality without third-party involvement. The QOTPP scheme initially employs the insight of “showing the fake, and hiding the real” by employing differential privacy techniques to create fake tasks and conceal genuine ones. Additionally, we introduce a two-tier truth discovery mechanism using Deep Matrix Factorization (DMF) to efficiently identify high-quality workers. Furthermore, we propose a Combinatorial Multi-Armed Bandit (CMAB)-based worker incentive and selection mechanism to improve the quality of data collection. Theoretical analysis confirms that our QOTPP scheme satisfies essential properties such as truthfulness, individual rationality, and ϵ-differential privacy. Extensive simulation experiments validate the state-of-the-art performance achieved by QOTPP.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"63 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140590527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The large and growing amount of digital data creates a pressing need for approaches capable of indexing and retrieving multimedia content. A traditional and fundamental challenge consists of effectively and efficiently performing nearest-neighbor searches. After decades of research, several different methods are available, including trees, hashing, and graph-based approaches. Most current methods exploit learning-to-hash approaches based on deep learning. Despite their effective results and compact codes, such methods often require a significant amount of labeled data for training. Unsupervised approaches also rely on expensive training procedures, usually based on a huge amount of data. In this work, we propose an unsupervised, data-independent approach for nearest neighbor searches, which can be used with different features, including deep features trained by transfer learning. The method uses a rank-based formulation and exploits a hashing approach for efficient ranked list computation at query time. A comprehensive experimental evaluation was conducted on seven public datasets, considering deep features based on CNNs and Transformers. Both effectiveness and efficiency aspects were evaluated. The proposed approach achieves remarkable results in comparison to traditional and state-of-the-art methods. Hence, it is an attractive and innovative solution, especially when costly training procedures need to be avoided.
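As a generic illustration of hashing-plus-ranking retrieval (bucket candidates by a hash code, then compute a ranked list over the candidates at query time), here is a small random-hyperplane LSH sketch. It is not the paper's rank-based formulation; the class name, bit width, and fallback behavior are assumptions.

```python
import numpy as np

class HashedIndex:
    """Random-hyperplane LSH with exact re-ranking of bucket candidates."""
    def __init__(self, features, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.features = features / np.linalg.norm(features, axis=1, keepdims=True)
        self.planes = rng.standard_normal((features.shape[1], n_bits))
        self.codes = self._hash(self.features)
        self.buckets = {}
        for i, c in enumerate(self.codes):
            self.buckets.setdefault(c, []).append(i)

    def _hash(self, x):
        bits = (x @ self.planes) > 0
        # Pack each bit vector into a single integer bucket key.
        return [int("".join("1" if b else "0" for b in row), 2) for row in bits]

    def query(self, q, top_k=10):
        q = q / np.linalg.norm(q)
        cand = self.buckets.get(self._hash(q[None, :])[0], [])
        if not cand:                      # fall back to exhaustive search on an empty bucket
            cand = list(range(len(self.features)))
        sims = self.features[cand] @ q    # cosine similarity (unit-norm features)
        order = np.argsort(-sims)[:top_k]
        return [cand[i] for i in order]

feats = np.random.default_rng(1).standard_normal((1000, 128)).astype(np.float32)
index = HashedIndex(feats)
print(index.query(feats[42], top_k=5))   # the query item itself should rank first
```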
{"title":"Rank-based Hashing for Effective and Efficient Nearest Neighbor Search for Image Retrieval","authors":"Vinicius Sato Kawai, Lucas Pascotti Valem, Alexandro Baldassin, Edson Borin, Daniel Carlos Guimarães Pedronette, Longin Jan Latecki","doi":"10.1145/3659580","DOIUrl":"https://doi.org/10.1145/3659580","url":null,"abstract":"<p>The large and growing amount of digital data creates a pressing need for approaches capable of indexing and retrieving multimedia content. A traditional and fundamental challenge consists of effectively and efficiently performing nearest-neighbor searches. After decades of research, several different methods are available, including trees, hashing, and graph-based approaches. Most of the current methods exploit learning to hash approaches based on deep learning. In spite of effective results and compact codes obtained, such methods often require a significant amount of labeled data for training. Unsupervised approaches also rely on expensive training procedures usually based on a huge amount of data. In this work, we propose an unsupervised data-independent approach for nearest neighbor searches, which can be used with different features, including deep features trained by transfer learning. The method uses a rank-based formulation and exploits a hashing approach for efficient ranked list computation at query time. A comprehensive experimental evaluation was conducted on 7 public datasets, considering deep features based on CNNs and Transformers. Both effectiveness and efficiency aspects were evaluated. The proposed approach achieves remarkable results in comparison to traditional and state-of-the-art methods. Hence, it is an attractive and innovative solution, especially when costly training procedures need to be avoided.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"17 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140590537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semi-supervised action recognition is a challenging yet promising task due to its low reliance on costly labeled videos. One high-profile solution is to explore frame-level weak/strong augmentations for learning abundant representations, inspired by the FixMatch framework that dominates semi-supervised image classification. However, such a solution mainly introduces perturbations in texture and scale, which limits the learning of action representations in videos with spatiotemporal redundancy and complexity. Therefore, we revisit the weak/strong augmentation trick of FixMatch and propose a novel Frame- and Feature-level augmentation FixMatch (dubbed F2-FixMatch) framework to learn more abundant action representations that are robust to complex and dynamic video scenarios. Specifically, we design a new Progressive Augmentation (P-Aug) mechanism that applies the weak/strong augmentations first at the frame level and then perturbs the features at the feature level, yielding four types of augmented features in broader perturbation spaces. Moreover, we present an evolved Multihead Pseudo-Labeling (MPL) scheme to promote the consistency of features across different augmented versions based on the pseudo labels. We conduct extensive experiments on several public datasets to demonstrate that F2-FixMatch achieves performance gains over current state-of-the-art methods. The source code of F2-FixMatch is publicly available at https://github.com/zwtu/F2FixMatch.
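Since the framework builds on FixMatch, a short numpy sketch of the standard FixMatch consistency objective (pseudo-labels from the weakly augmented view, confidence-thresholded cross-entropy on the strongly augmented view) may help. It covers only the baseline trick, not the P-Aug or MPL modules, and the threshold value is an assumption.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fixmatch_unlabeled_loss(logits_weak, logits_strong, tau=0.95):
    """FixMatch-style consistency loss on unlabeled clips.

    Pseudo-labels come from the weakly augmented view; only confident ones
    (max probability >= tau) contribute cross-entropy on the strong view.
    """
    p_weak = softmax(logits_weak)
    pseudo = p_weak.argmax(axis=1)
    mask = p_weak.max(axis=1) >= tau
    if not mask.any():
        return 0.0
    p_strong = softmax(logits_strong)
    ce = -np.log(p_strong[np.arange(len(pseudo)), pseudo] + 1e-12)
    return float((ce * mask).sum() / mask.sum())

# Toy check: confident, consistent predictions give a small loss.
rng = np.random.default_rng(0)
weak = rng.standard_normal((8, 5)) * 5.0
strong = weak + rng.standard_normal((8, 5)) * 0.1
print(fixmatch_unlabeled_loss(weak, strong))
```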
{"title":"Leveraging Frame- and Feature-Level Progressive Augmentation for Semi-supervised Action Recognition","authors":"Zhewei Tu, Xiangbo Shu, Peng Huang, Rui Yan, Zhenxing Liu, Jiachao Zhang","doi":"10.1145/3655025","DOIUrl":"https://doi.org/10.1145/3655025","url":null,"abstract":"<p>Semi-supervised action recognition is a challenging yet prospective task due to its low reliance on costly labeled videos. One high-profile solution is to explore frame-level weak/strong augmentations for learning abundant representations, inspired by the FixMatch framework dominating the semi-supervised image classification task. However, such a solution mainly brings perturbations in terms of texture and scale, leading to the limitation in learning action representations in videos with spatiotemporal redundancy and complexity. Therefore, we revisit the creative trick of weak/strong augmentations in FixMatch, and then propose a novel Frame- and Feature-level augmentation FixMatch (dubbed as F<sup>2</sup>-FixMatch) framework to learn more abundant action representations for being robust to complex and dynamic video scenarios. Specifically, we design a new Progressive Augmentation (P-Aug) mechanism that implements the weak/strong augmentations first at the frame level, and further implements the perturbation at the feature level, to obtain abundant four types of augmented features in broader perturbation spaces. Moreover, we present an evolved Multihead Pseudo-Labeling (MPL) scheme to promote the consistency of features across different augmented versions based on the pseudo labels. We conduct extensive experiments on several public datasets to demonstrate that our F<sup>2</sup>-FixMatch achieves the performance gain compared with current state-of-the-art methods. The source codes of F<sup>2</sup>-FixMatch are publicly available at https://github.com/zwtu/F2FixMatch.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"36 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140602711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper focuses on the creation and evaluation of synthetic data to address the challenges of imbalanced datasets in machine learning (ML) applications, using fake news detection as a case study. We conducted a thorough literature review on generative adversarial networks (GANs) for tabular data, synthetic data generation methods, and synthetic data quality assessment. By augmenting a public news dataset with synthetic data generated by different GAN architectures, we demonstrate the potential of synthetic data to improve ML models’ performance in fake news detection. Our results show a significant improvement in classification performance, especially in the underrepresented class. We also modify and extend a data usage approach to evaluate the quality of synthetic data and investigate the relationship between synthetic data quality and data augmentation performance in classification tasks. We found a positive correlation between synthetic data quality and performance in the underrepresented class, highlighting the importance of high-quality synthetic data for effective data augmentation.
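As a hedged sketch of the augment-then-evaluate workflow the abstract describes, the snippet below stands in a Gaussian stub for GAN-generated minority-class rows (a real pipeline would draw them from a tabular GAN such as CTGAN) and compares minority-class F1 with and without augmentation. All data shapes and model choices are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: class 1 (fake news) is underrepresented.
n_maj, n_min = 1800, 200
X_maj = rng.standard_normal((n_maj, 20))
X_min = rng.standard_normal((n_min, 20)) + 0.8
X_real = np.vstack([X_maj, X_min])
y_real = np.concatenate([np.zeros(n_maj, dtype=int), np.ones(n_min, dtype=int)])

# Placeholder for GAN output: in practice these rows would come from a tabular
# GAN trained on the minority class, not from this shifted-Gaussian stub.
X_syn = rng.standard_normal((600, 20)) + 0.8
y_syn = np.ones(600, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(X_real, y_real, test_size=0.3,
                                          stratify=y_real, random_state=0)

base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
aug = LogisticRegression(max_iter=1000).fit(np.vstack([X_tr, X_syn]),
                                            np.concatenate([y_tr, y_syn]))

print("minority-class F1 without augmentation:",
      f1_score(y_te, base.predict(X_te), pos_label=1))
print("minority-class F1 with synthetic augmentation:",
      f1_score(y_te, aug.predict(X_te), pos_label=1))
```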
{"title":"GANs in the Panorama of Synthetic Data Generation Methods: Application and Evaluation: Enhancing Fake News Detection with GAN-Generated Synthetic Data: ACM Transactions on Multimedia Computing, Communications, and Applications: Vol 0, No ja","authors":"Bruno Vaz, Álvaro Figueira","doi":"10.1145/3657294","DOIUrl":"https://doi.org/10.1145/3657294","url":null,"abstract":"<p>This paper focuses on the creation and evaluation of synthetic data to address the challenges of imbalanced datasets in machine learning applications (ML), using fake news detection as a case study. We conducted a thorough literature review on generative adversarial networks (GANs) for tabular data, synthetic data generation methods, and synthetic data quality assessment. By augmenting a public news dataset with synthetic data generated by different GAN architectures, we demonstrate the potential of synthetic data to improve ML models’ performance in fake news detection. Our results show a significant improvement in classification performance, especially in the underrepresented class. We also modify and extend a data usage approach to evaluate the quality of synthetic data and investigate the relationship between synthetic data quality and data augmentation performance in classification tasks. We found a positive correlation between synthetic data quality and performance in the underrepresented class, highlighting the importance of high-quality synthetic data for effective data augmentation.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"215 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140590525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Existing magnetic resonance imaging (MRI) translation models rely on Generative Adversarial Networks, primarily employing simple convolutional neural networks. Unfortunately, these networks struggle to capture global representations and contextual relationships within MRI images. While the advent of Transformers enables capturing long-range feature dependencies, they often compromise the preservation of local feature details. To address these limitations and enhance both local and global representations, we introduce a novel Dual-Branch Generative Adversarial Network (DBGAN). In this framework, the Transformer branch comprises sparse attention blocks and dense self-attention blocks, allowing a wider receptive field while simultaneously capturing local and global information. The CNN branch, built with integrated residual convolutional layers, enhances local modeling capabilities. Additionally, we propose a fusion module that integrates the features extracted from both branches. Extensive experimentation on two public datasets and one clinical dataset validates significant performance improvements with DBGAN. On BraTS2018, it achieves a 10% improvement in MAE, 3.2% in PSNR, and 4.8% in SSIM for image generation tasks compared to RegGAN. Notably, the generated MRIs receive positive feedback from radiologists, underscoring the potential of our proposed method as a valuable tool in clinical settings.
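To make the reported numbers concrete, here is a small sketch of how MAE and PSNR are typically computed for generated images; the [0, 1] intensity range and the toy inputs are assumptions, and SSIM is omitted for brevity.

```python
import numpy as np

def mae(pred, target):
    """Mean absolute error between two images scaled to [0, 1]."""
    return float(np.mean(np.abs(pred - target)))

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the target."""
    mse = float(np.mean((pred - target) ** 2))
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

rng = np.random.default_rng(0)
target = rng.random((256, 256))
pred = np.clip(target + rng.normal(0, 0.05, target.shape), 0.0, 1.0)
print(f"MAE:  {mae(pred, target):.4f}")
print(f"PSNR: {psnr(pred, target):.2f} dB")
```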
{"title":"DBGAN: Dual Branch Generative Adversarial Network for Multi-modal MRI Translation","authors":"Jun Lyu, Shouang Yan, M. Shamim Hossain","doi":"10.1145/3657298","DOIUrl":"https://doi.org/10.1145/3657298","url":null,"abstract":"<p>Existing Magnetic resonance imaging (MRI) translation models rely on Generative Adversarial Networks, primarily employing simple convolutional neural networks. Unfortunately, these networks struggle to capture global representations and contextual relationships within MRI images. While the advent of Transformers enables capturing long-range feature dependencies, they often compromise the preservation of local feature details. To address these limitations and enhance both local and global representations, we introduce a novel Dual-Branch Generative Adversarial Network (DBGAN). In this framework, the Transformer branch comprises sparse attention blocks and dense self-attention blocks, allowing for a wider receptive field while simultaneously capturing local and global information. The CNN branch, built with integrated residual convolutional layers, enhances local modeling capabilities. Additionally, we propose a fusion module that cleverly integrates features extracted from both branches. Extensive experimentation on two public datasets and one clinical dataset validates significant performance improvements with DBGAN. On Brats2018, it achieves a 10%improvement in MAE, 3.2% in PSNR, and 4.8% in SSIM for image generation tasks compablack to RegGAN. Notably, the generated MRIs receive positive feedback from radiologists, underscoring the potential of our proposed method as a valuable tool in clinical settings.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"68 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140590526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-modal human action segmentation is a critical and challenging task with a wide range of applications. Most current approaches concentrate on the fusion of dense signals (i.e., RGB, optical flow, and depth maps). However, the potential contribution of sparse IoT sensor signals, which can be crucial for achieving accurate recognition, has not been fully explored. To address this, we introduce a Sparse signal-guided Transformer (SigFormer) that combines both dense and sparse signals. We employ mask attention to fuse localized features by constraining cross-attention to the regions where sparse signals are valid. However, since sparse signals are discrete, they lack sufficient information about temporal action boundaries. Therefore, in SigFormer, we propose to emphasize boundary information at two stages to alleviate this problem. In the first feature extraction stage, we introduce an intermediate bottleneck module to jointly learn both category and boundary features of each dense modality through inner loss functions. After the fusion of dense modalities and sparse signals, we then devise a two-branch architecture that explicitly models the interrelationship between action category and temporal boundary. Experimental results demonstrate that SigFormer outperforms state-of-the-art approaches on a multi-modal action segmentation dataset from real industrial environments, reaching an outstanding F1 score of 0.958. The code and pre-trained models are available at https://github.com/LIUQI-creat/SigFormer.
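As a rough illustration of mask attention (cross-attention whose scores are suppressed wherever the sparse signal is invalid), here is a minimal numpy sketch; the tensor shapes, the single-head formulation, and the validity pattern are assumptions rather than SigFormer's actual implementation.

```python
import numpy as np

def masked_cross_attention(queries, keys, values, valid):
    """Scaled dot-product cross-attention restricted to valid key positions.

    `valid` is a boolean vector over key timesteps (e.g., where a sparse IoT
    signal exists); invalid positions receive -inf scores before the softmax.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (Tq, Tk)
    scores = np.where(valid[None, :], scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ values                         # (Tq, d_v)

rng = np.random.default_rng(0)
Tq, Tk, d = 6, 10, 16
q, k, v = (rng.standard_normal(s) for s in [(Tq, d), (Tk, d), (Tk, d)])
valid = np.zeros(Tk, dtype=bool)
valid[[2, 3, 7]] = True                             # sparse signal present at three timesteps
out = masked_cross_attention(q, k, v, valid)
print(out.shape)                                    # (6, 16)
```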
{"title":"SigFormer: Sparse Signal-Guided Transformer for Multi-Modal Action Segmentation","authors":"Qi Liu, Xinchen Liu, Kun Liu, Xiaoyan Gu, Wu Liu","doi":"10.1145/3657296","DOIUrl":"https://doi.org/10.1145/3657296","url":null,"abstract":"<p>Multi-modal human action segmentation is a critical and challenging task with a wide range of applications. Nowadays, the majority of approaches concentrate on the fusion of dense signals (i.e., RGB, optical flow, and depth maps). However, the potential contributions of sparse IoT sensor signals, which can be crucial for achieving accurate recognition, have not been fully explored. To make up for this, we introduce a <b>S</b>parse s<b>i</b>gnal-<b>g</b>uided Transformer (<b>SigFormer</b>) to combine both dense and sparse signals. We employ mask attention to fuse localized features by constraining cross-attention within the regions where sparse signals are valid. However, since sparse signals are discrete, they lack sufficient information about the temporal action boundaries. Therefore, in SigFormer, we propose to emphasize the boundary information at two stages to alleviate this problem. In the first feature extraction stage, we introduce an intermediate bottleneck module to jointly learn both category and boundary features of each dense modality through the inner loss functions. After the fusion of dense modalities and sparse signals, we then devise a two-branch architecture that explicitly models the interrelationship between action category and temporal boundary. Experimental results demonstrate that SigFormer outperforms the state-of-the-art approaches on a multi-modal action segmentation dataset from real industrial environments, reaching an outstanding F1 score of 0.958. The codes and pre-trained models have been available at https://github.com/LIUQI-creat/SigFormer.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"66 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140590586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Social interaction is a common phenomenon in human societies. Unlike discovering groups based on the similarity of individuals’ actions, social interaction focuses on the mutual influence between people. Although people can easily judge whether social interactions are present in a real-world scene, it is difficult for an intelligent system to discover them. Initiating and concluding social interactions are greatly influenced by an individual’s social cognition and the surrounding environment, which are closely related to psychology. Converting the psychological factors that affect social interactions into quantifiable visual representations and modeling the interaction relationships therefore poses a significant challenge. To this end, we propose a Psychology-Guided Environment Aware Network (PEAN) that models social interaction among people in videos using supervised learning. Specifically, we divide the surrounding environment into scene-aware and human-aware visual descriptions. For the scene-aware visual clue, we utilize 3D features as global visual representations. For the human-aware visual clue, we consider instance-based location and behaviour-related visual representations to map human-centered interaction elements from social psychology: distance, openness, and orientation. In addition, we design an environment aware mechanism to integrate features from these visual clues, with a Transformer to explore the relations between individuals and construct pairwise interaction strength features. The interaction intensity matrix, which reflects the mutual nature of interaction, is obtained by processing the interaction strength features with the interaction discovery module. An interaction-constrained loss function, composed of an interaction critical loss and a smooth Fβ loss, is proposed to optimize the whole framework, improving the distinctiveness of the interaction matrix and alleviating the class imbalance caused by pairwise interaction sparsity. Given the diversity of real-world interactions, we collect a new dataset named the Social Basketball Activity Dataset (Social-BAD), covering complex social interactions. Our method achieves the best performance on Social-CAD, Social-BAD, and their combined dataset, the Video Social Interaction Dataset (VSID).
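The smooth Fβ term can be illustrated with a soft (differentiable) Fβ computed on a pairwise interaction matrix, as in the hedged numpy sketch below; the β value, matrix sizes, and the loss form 1 − Fβ are assumptions, not PEAN's exact loss.

```python
import numpy as np

def soft_f_beta(pred, target, beta=1.0, eps=1e-8):
    """Soft F-beta on a pairwise interaction matrix.

    `pred` holds interaction probabilities in [0, 1]; `target` is the binary
    ground-truth matrix. Soft TP/FP/FN keep the score differentiable, which is
    what a smooth F-beta objective needs when pairwise positives are sparse.
    """
    tp = (pred * target).sum()
    fp = (pred * (1.0 - target)).sum()
    fn = ((1.0 - pred) * target).sum()
    b2 = beta ** 2
    return float((1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp + eps))

rng = np.random.default_rng(0)
n = 8                                   # people in the scene
target = np.zeros((n, n))
target[1, 4] = target[4, 1] = 1.0       # one interacting pair
pred = np.clip(target + rng.normal(0, 0.1, (n, n)), 0, 1)
print("soft F1:", round(soft_f_beta(pred, target), 3))
print("loss   :", round(1.0 - soft_f_beta(pred, target), 3))
```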
{"title":"Psychology-Guided Environment Aware Network for Discovering Social Interaction Groups from Videos","authors":"Jiaqi Yu, Jinhai Yang, Hua Yang, Renjie Pan, Pingrui Lai, Guangtao Zhai","doi":"10.1145/3657295","DOIUrl":"https://doi.org/10.1145/3657295","url":null,"abstract":"<p>Social interaction is a common phenomenon in human societies. Different from discovering groups based on the similarity of individuals’ actions, social interaction focuses more on the mutual influence between people. Although people can easily judge whether or not there are social interactions in a real-world scene, it is difficult for an intelligent system to discover social interactions. Initiating and concluding social interactions are greatly influenced by an individual’s social cognition and the surrounding environment, which are closely related to psychology. Thus, converting the psychological factors that impact social interactions into quantifiable visual representations and creating a model for interaction relationships poses a significant challenge. To this end, we propose a Psychology-Guided Environment Aware Network (PEAN) that models social interaction among people in videos using supervised learning. Specifically, we divide the surrounding environment into scene-aware visual-based and human-aware visual-based descriptions. For the scene-aware visual clue, we utilize 3D features as global visual representations. For the human-aware visual clue, we consider instance-based location and behaviour-related visual representations to map human-centered interaction elements in social psychology: distance, openness and orientation. In addition, we design an environment aware mechanism to integrate features from visual clues, with a Transformer to explore the relation between individuals and construct pairwise interaction strength features. The interaction intensity matrix reflecting the mutual nature of the interaction is obtained by processing the interaction strength features with the interaction discovery module. An interaction constrained loss function composed of interaction critical loss function and smooth <i>F<sub>β</sub></i> loss function is proposed to optimize the whole framework to improve the distinction of the interaction matrix and alleviate class imbalance caused by pairwise interaction sparsity. Given the diversity of real-world interactions, we collect a new dataset named Social Basketball Activity Dataset (Soical-BAD), covering complex social interactions. Our method achieves the best performance among social-CAD, social-BAD, and their combined dataset named Video Social Interaction Dataset (VSID).</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"44 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140590533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we propose to synthesize realistic underwater images with a novel image formation model that considers both downwelling depth and line-of-sight (LOS) distance as cues, and we call it the Realistic Synthetic Underwater Image Generation Model (RSUIGM). Light interaction in the ocean is a complex process and demands specific modeling of the direct and backscattering phenomena to capture the degradations. Most image formation models rely on complex radiative transfer models and in-situ measurements for synthesizing and restoring underwater images. Typical image formation models consider only the line-of-sight distance z and ignore the downwelling depth d when estimating the effect of direct light scattering. Unlike state-of-the-art image formation models, we derive the dependency of direct light estimation on downwelling irradiance for generating synthetic underwater images. We incorporate the derived downwelling irradiance into the estimation of direct light scattering to model the image formation process, generate realistic synthetic underwater images with the proposed RSUIGM, and name the result the RSUIGM dataset. We demonstrate the effectiveness of the proposed RSUIGM by using the RSUIGM dataset to train deep-learning-based restoration methods. We compare the quality of restored images with state-of-the-art methods on benchmark real underwater image datasets and achieve improved results. In addition, we validate the distribution of realistic synthetic underwater images against real underwater images both qualitatively and quantitatively. The proposed RSUIGM dataset is available here.
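As a hedged illustration of the kind of formation model the abstract refers to, the sketch below degrades a clean image with a simplified single-scattering model in which direct transmission is attenuated by both line-of-sight distance z and downwelling depth d, plus a backscatter term. The equation form, coefficient values, and function name are assumptions and do not reproduce the paper's derived RSUIGM model.

```python
import numpy as np

def synthesize_underwater(J, z, d, beta, K_d, B_inf):
    """Degrade a clean image J with a simplified underwater formation model.

    I_c = J_c * exp(-beta_c * z) * exp(-K_d,c * d) + B_inf,c * (1 - exp(-beta_c * z))

    J      : clean image, H x W x 3 in [0, 1]
    z      : line-of-sight distance map, H x W (meters)
    d      : downwelling depth in meters (scalar here for simplicity)
    beta   : per-channel attenuation along the line of sight
    K_d    : per-channel downwelling attenuation
    B_inf  : per-channel veiling-light (background) color
    All coefficients are illustrative; the paper derives its own formulation.
    """
    t_los = np.exp(-beta[None, None, :] * z[..., None])       # direct transmission
    t_down = np.exp(-K_d * d)                                  # depth-dependent irradiance
    direct = J * t_los * t_down[None, None, :]
    backscatter = B_inf[None, None, :] * (1.0 - t_los)
    return np.clip(direct + backscatter, 0.0, 1.0)

rng = np.random.default_rng(0)
J = rng.random((240, 320, 3))
z = np.full((240, 320), 4.0)                                   # 4 m line of sight
I = synthesize_underwater(
    J, z, d=8.0,
    beta=np.array([0.20, 0.08, 0.05]),                         # red attenuates fastest
    K_d=np.array([0.30, 0.12, 0.08]),
    B_inf=np.array([0.05, 0.35, 0.45]),                        # greenish-blue veiling light
)
print(I.shape, I.min(), I.max())
```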
{"title":"RSUIGM: Realistic Synthetic Underwater Image Generation with Image Formation Model","authors":"Chaitra Desai, Sujay Benur, Ujwala Patil, Uma Mudenagudi","doi":"10.1145/3656473","DOIUrl":"https://doi.org/10.1145/3656473","url":null,"abstract":"<p>In this paper, we propose to synthesize realistic underwater images with a novel image formation model, considering both downwelling depth and line of sight (LOS) distance as cue and call it as Realistic Synthetic Underwater Image Generation Model, RSUIGM. The light interaction in the ocean is a complex process and demands specific modeling of direct and backscattering phenomenon to capture the degradations. Most of the image formation models rely on complex radiative transfer models and in-situ measurements for synthesizing and restoration of underwater images. Typical image formation models consider only line of sight distance <i>z</i> and ignore downwelling depth <i>d</i> in the estimation of effect of direct light scattering. We derive the dependencies of downwelling irradiance in direct light estimation for generation of synthetic underwater images unlike state-of-the-art image formation models. We propose to incorporate the derived downwelling irradiance in estimation of direct light scattering for modeling the image formation process and generate realistic synthetic underwater images with the proposed RSUIGM, and name it as <i>RSUIGM dataset</i>. We demonstrate the effectiveness of the proposed RSUIGM by using RSUIGM dataset in training deep learning based restoration methods. We compare the quality of restored images with state-of-the-art methods using benchmark real underwater image datasets and achieve improved results. In addition, we validate the distribution of realistic synthetic underwater images versus real underwater images both qualitatively and quantitatively. The proposed RSUIGM dataset is available here.\u0000</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"1 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140590438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we address the challenging makeup transfer task, aiming to transfer makeup from a reference image to a source image while preserving facial geometry and background consistency. Existing deep neural network-based methods have shown promising results in aligning facial parts and transferring makeup textures. However, they often neglect the facial geometry of the source image, leading to two adverse effects: (1) alterations in geometrically relevant facial features, causing face flattening and loss of personality, and (2) difficulties in maintaining background consistency, as networks cannot clearly determine the face-background boundary. To jointly tackle these issues, we propose the High Fidelity Makeup via 2D and 3D Identity Preservation Network (IP23-Net), a novel framework that leverages facial geometry information to generate more realistic results. Our method comprises a 3D Shape Identity Encoder, which extracts identity and 3D shape features. We incorporate a 3D face reconstruction model to ensure the three-dimensional effect of face makeup, thereby preserving the characters’ depth and natural appearance. To preserve background consistency, our Background Correction Decoder automatically predicts an adaptive mask for the source image, distinguishing the foreground and background. In addition to popular benchmarks, we introduce a new large-scale High Resolution Synthetic Makeup Dataset containing 335,230 diverse high-resolution face images, to evaluate our method’s generalization ability. Experiments demonstrate that IP23-Net achieves high-fidelity makeup transfer while effectively preserving background consistency. The code will be made publicly available.
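A tiny sketch of the background-preservation idea, assuming a predicted soft foreground mask: the generated face is composited over the source so background pixels are untouched. The function name and shapes are illustrative; this is not IP23-Net's Background Correction Decoder.

```python
import numpy as np

def compose_with_mask(generated, source, mask):
    """Blend a generated (makeup-transferred) image with the source background.

    `mask` is a predicted soft foreground (face) mask in [0, 1]; background
    pixels keep the source values so the scene outside the face is untouched.
    """
    m = mask[..., None]                       # broadcast over the color channels
    return m * generated + (1.0 - m) * source

rng = np.random.default_rng(0)
source = rng.random((128, 128, 3))
generated = rng.random((128, 128, 3))
mask = np.zeros((128, 128))
mask[32:96, 32:96] = 1.0                      # pretend the face occupies the center
out = compose_with_mask(generated, source, mask)
assert np.allclose(out[0, 0], source[0, 0])   # a background pixel comes from the source
```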
{"title":"High Fidelity Makeup via 2D and 3D Identity Preservation Net","authors":"Jinliang Liu, Zhedong Zheng, Zongxin Yang, Yi Yang","doi":"10.1145/3656475","DOIUrl":"https://doi.org/10.1145/3656475","url":null,"abstract":"<p>In this paper, we address the challenging makeup transfer task, aiming to transfer makeup from a reference image to a source image while preserving facial geometry and background consistency. Existing deep neural network-based methods have shown promising results in aligning facial parts and transferring makeup textures. However, they often neglect the facial geometry of the source image, leading to two adverse effects: (1) alterations in geometrically relevant facial features, causing face flattening and loss of personality, and (2) difficulties in maintaining background consistency, as networks cannot clearly determine the face-background boundary. To jointly tackle these issues, we propose the High Fidelity Makeup via 2D and 3D Identity Preservation Network (IP23-Net), a novel framework that leverages facial geometry information to generate more realistic results. Our method comprises a 3D Shape Identity Encoder, which extracts identity and 3D shape features. We incorporate a 3D face reconstruction model to ensure the three-dimensional effect of face makeup, thereby preserving the characters’ depth and natural appearance. To preserve background consistency, our Background Correction Decoder automatically predicts an adaptive mask for the source image, distinguishing the foreground and background. In addition to popular benchmarks, we introduce a new large-scale High Resolution Synthetic Makeup Dataset containing 335,230 diverse high-resolution face images, to evaluate our method’s generalization ability. Experiments demonstrate that IP23-Net achieves high-fidelity makeup transfer while effectively preserving background consistency. The code will be made publicly available.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"56 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140603157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Non-fungible tokens (NFTs) have become a fundamental part of the metaverse ecosystem due to their uniqueness and immutability. However, existing copyright protection schemes for NFT image art rely on the NFT itself, minted by third-party platforms. A minted NFT only tracks and verifies the transaction process; the legitimacy of the source and the ownership of the digital image art it maps to cannot be determined. The original author or authorized publisher lacks an active defense mechanism to prove ownership of the digital image art mapped by an unauthorized NFT. Therefore, in this paper we propose a self-defense copyright protection scheme for NFT image art based on information embedding, called SDCP-IE. The original author or authorized publisher can embed copyright information into the digital image art before publication without damaging its visual quality. Unlike existing information embedding works, the proposed SDCP-IE generally enhances the invisibility of the copyright information across different embedding capacities. Furthermore, considering the scenario in which the copyright information is discovered or even destroyed by unauthorized parties, SDCP-IE can efficiently generate enhanced digital image art to improve the security of the embedded image, resisting multiple known and unknown detection models simultaneously. Experimental results show that the PSNR values of the enhanced embedded images exceed 57 dB on three datasets: BOSSBase, BOWS2, and ALASKA#2. Moreover, compared with existing information embedding works, the enhanced embedded images generated by SDCP-IE achieve the best transferability performance against advanced CNN-based detection models. When the target detector is the pre-trained SRNet at 0.4 bpp, the test error rate of SDCP-IE at 0.4 bpp on the evaluated detection model YeNet reaches 53.38%, which is 4.92%, 28.62%, and 7.05% higher than that of UTGAN, SPS-ENH, and Xie-Model, respectively.
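To make the embedding-quality numbers (PSNR above 57 dB at 0.4 bpp) concrete, here is a generic least-significant-bit embedding sketch with a PSNR check. LSB embedding is only a stand-in and is easily detectable, unlike the learned, detection-resistant SDCP-IE scheme; all sizes and payload rates here are assumptions.

```python
import numpy as np

def embed_lsb(cover, bits):
    """Write a bit string into the least significant bits of a grayscale image."""
    flat = cover.flatten().astype(np.uint8)   # flatten() copies, so the cover is untouched
    if len(bits) > flat.size:
        raise ValueError("message longer than cover capacity")
    flat[:len(bits)] = (flat[:len(bits)] & 0xFE) | np.array(bits, dtype=np.uint8)
    return flat.reshape(cover.shape)

def extract_lsb(stego, n_bits):
    return (stego.flatten()[:n_bits] & 1).astype(np.uint8)

def psnr(a, b, peak=255.0):
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
cover = rng.integers(0, 256, size=(512, 512), dtype=np.uint8)
message = rng.integers(0, 2, size=int(0.4 * cover.size))      # 0.4 bpp payload
stego = embed_lsb(cover, message)

assert np.array_equal(extract_lsb(stego, len(message)), message)
print(f"PSNR of stego vs. cover: {psnr(cover, stego):.2f} dB")
```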
{"title":"A Self-Defense Copyright Protection Scheme for NFT Image Art Based on Information Embedding","authors":"Fan Wang, Zhangjie Fu, Xiang Zhang","doi":"10.1145/3652519","DOIUrl":"https://doi.org/10.1145/3652519","url":null,"abstract":"<p>Non-convertible tokens (NFTs) have become a fundamental part of the metaverse ecosystem due to its uniqueness and immutability. However, existing copyright protection schemes of NFT image art relied on the NFTs itself minted by third-party platforms. A minted NFT image art only tracks and verifies the entire transaction process, but the legitimacy of the source and ownership of its mapped digital image art cannot be determined. The original author or authorized publisher lack an active defense mechanism to prove ownership of the digital image art mapped by the unauthorized NFT. Therefore, we propose a self-defense copyright protection scheme for NFT image art based on information embedding in this paper, called SDCP-IE. The original author or authorized publisher can embed the copyright information into the published digital image art without damaging its visual effect in advance. Different from the existing information embedding works, the proposed SDCP-IE can generally enhance the invisibility of copyright information with different embedding capacity. Furthermore, considering the scenario of copyright information being discovered or even destroyed by unauthorized parties, the designed SDCP-IE can efficiently generate enhanced digital image art to improve the security performance of embedded image, thus resisting the detection of multiple known and unknown detection models simultaneously. The experimental results have also shown that the PSNR values of enhanced embedded image are all over 57db on three datasets BOSSBase, BOWS2 and ALASKA#2. Moreover, compared with existing information embedding works, the enhanced embedded images generated by SDCP-IE reaches the best transferability performance on the advanced CNN-based detection models. When the target detector is the pre-trained SRNet at 0.4bpp, the test error rate of SDCP-IE at 0.4bpp on the evaluated detection model YeNet reaches 53.38%, which is 4.92%, 28.62% and 7.05% higher than that of the UTGAN, SPS-ENH and Xie-Model, respectively.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"82 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140575166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}