Mrinmoy Bhattacharjee, Prasanna Mahadeva S. R., Prithwijit Guha
Movie genre prediction from trailers is mostly attempted in a multi-modal manner. However, the characteristics of movie trailer audio indicate that this modality alone might be highly effective in genre prediction. Movie trailer audio predominantly consists of speech and music signals in isolation or overlapping conditions. This work hypothesizes that the genre labels of movie trailers might relate to the composition of their audio component. In this regard, speech-music confidence sequences for the trailer audio are used as a feature. In addition, two other features previously proposed for discriminating speech-music are also adopted in the current task. This work proposes a time and channel Attention Convolutional Neural Network (ACNN) classifier for the genre classification task. The convolutional layers in ACNN learn the spatial relationships in the input features. The time and channel attention layers learn to focus on crucial time steps and CNN kernel outputs, respectively. The Moviescope dataset is used to perform the experiments, and two audio-based baseline methods are employed to benchmark this work. The proposed feature set with the ACNN classifier improves the genre classification performance over the baselines. Moreover, decent generalization performance is obtained for genre prediction of movies with different cultural influences (EmoGDB).
Title: Exploration of Speech and Music Information for Movie Genre Classification. Authors: Mrinmoy Bhattacharjee, Prasanna Mahadeva S. R., Prithwijit Guha. Journal: ACM Transactions on Multimedia Computing Communications and Applications. Published: 2024-05-07. DOI: https://doi.org/10.1145/3664197
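The time and channel attention idea in the abstract above can be illustrated with a small sketch. This is not the authors' ACNN: the scoring functions here are plain averages rather than learned layers, and the names `channel_attention`, `time_attention`, and `apply_attention` are hypothetical. It only shows how one set of weights can emphasise CNN kernel outputs (channels) while another emphasises time steps.

```python
import math
from typing import List

Matrix = List[List[float]]  # feature map indexed as [channel][time]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def channel_attention(fmap: Matrix) -> List[float]:
    """One weight per channel. Assumption: score = mean over time (not learned)."""
    return softmax([sum(row) / len(row) for row in fmap])


def time_attention(fmap: Matrix) -> List[float]:
    """One weight per time step. Assumption: score = mean across channels."""
    n_channels = len(fmap)
    n_steps = len(fmap[0])
    return softmax([sum(fmap[c][t] for c in range(n_channels)) / n_channels
                    for t in range(n_steps)])


def apply_attention(fmap: Matrix) -> Matrix:
    """Rescale every entry by its channel weight and its time-step weight."""
    cw = channel_attention(fmap)
    tw = time_attention(fmap)
    return [[fmap[c][t] * cw[c] * tw[t] for t in range(len(tw))]
            for c in range(len(cw))]


# Toy feature map: 2 channels (kernel outputs) x 3 time steps.
features = [[0.0, 1.0, 0.0],
            [0.0, 3.0, 0.0]]
channel_weights = channel_attention(features)
time_weights = time_attention(features)
attended = apply_attention(features)
```

In a trained network the two score vectors would presumably come from learned layers; the softmax normalisation and the element-wise rescaling are the parts this sketch is meant to show.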
Video anomaly detection (VAD) aims to identify events or scenes in videos that deviate from typical patterns. Existing approaches primarily focus on reconstructing or predicting frames to detect anomalies and have shown improved performance in recent years. However, they often depend heavily on local spatio-temporal information and model object features insufficiently. To address these issues, this paper proposes a video anomaly detection framework with Enhanced Object Information and Global Temporal Dependencies (EOGT). The main novelties are: (1) A Local Object Anomaly Stream (LOAS) is proposed to extract local multimodal spatio-temporal anomaly features at the object level. LOAS integrates two modules: a Diffusion-based Object Reconstruction Network (DORN) with multimodal conditions, which detects anomalies from object RGB information, and an Object Pose Anomaly Refiner (OPA), which discovers anomalies from human pose information. (2) A Global Temporal Strengthening Stream (GTSS) is proposed, which leverages video-level temporal dependencies to identify long-term and video-specific anomalies effectively. Both streams are jointly employed in EOGT to learn multimodal and multi-scale spatio-temporal anomaly features for VAD, and the anomaly features and scores are finally fused to detect anomalies at the frame level. Extensive experiments are conducted to verify the performance of EOGT on three public datasets: ShanghaiTech Campus, CUHK Avenue, and UCSD Ped2.
Title: EOGT: Video Anomaly Detection with Enhanced Object Information and Global Temporal Dependency. Authors: Ruoyan Pi, Peng Wu, Xiangteng He, Yuxin Peng. Journal: ACM Transactions on Multimedia Computing Communications and Applications. Published: 2024-05-06. DOI: https://doi.org/10.1145/3662185
Machine unlearning is an emerging paradigm that aims to make machine learning models “forget” what they have learned about particular data. It fulfills the requirements of privacy legislation (e.g., GDPR), which stipulates that individuals have the autonomy to determine how their personal data is used. However, despite these achievements, there are still loopholes in machine unlearning that may cause significant losses for the system, especially in edge computing. Edge computing is a distributed computing paradigm that moves data processing tasks closer to terminal devices. While various machine unlearning approaches have been proposed to erase the influence of data sample(s), we claim that it might be dangerous to apply them directly in edge computing. A malicious edge node may broadcast (possibly fake) unlearning requests for target data sample(s) and then analyze the behavior of edge devices to infer useful information. In this paper, we exploit the vulnerabilities of current machine unlearning strategies in edge computing and propose a new inference attack to highlight the potential privacy risk. Furthermore, we develop a defense method against this particular type of attack and propose the price of unlearning (PoU) as a means to evaluate the inefficiency it brings to an edge computing system. We provide theoretical analyses that establish an upper bound on the PoU using tools borrowed from game theory. Experimental results on real-world datasets demonstrate that the proposed defense strategy is effective and capable of preventing an adversary from deducing useful information.
Title: The Price of Unlearning: Identifying Unlearning Risk in Edge Computing. Authors: Lefeng Zhang, Tianqing Zhu, Ping Xiong, Wanlei Zhou. Journal: ACM Transactions on Multimedia Computing Communications and Applications. Published: 2024-05-06. DOI: https://doi.org/10.1145/3662184
The surge of online video platforms has raised an urgent need for social interaction recognition techniques. Compared with simple short-term actions, long-term social interactions in semantic-rich videos can reflect more complicated semantics such as character relationships or emotions, which better support various downstream applications, e.g., story summarization and fine-grained clip retrieval. However, social interactions last longer, overlap heavily with one another, and involve multiple characters, dynamic scenes, and multi-modal cues, among other factors, so traditional solutions for short-term action recognition are likely to fail in this task. To address these challenges, we propose a hierarchical graph-based system, named InteractNet, to recognize social interactions from a multi-modal perspective. Specifically, our approach first generates a semantic graph for each sampled frame by integrating multi-modal cues, and then learns the node representations as short-term interaction patterns via an adapted GCN module. Next, global interaction representations are accumulated through a sub-clip identification module, which filters out irrelevant information and resolves temporal overlaps between interactions. Finally, the associations among simultaneous interactions are captured and modelled by constructing a global-level character-pair graph to predict the final social interactions. Comprehensive experiments on publicly available datasets demonstrate the effectiveness of our approach compared with state-of-the-art baseline methods.
Title: InteractNet: Social Interaction Recognition for Semantic-rich Videos. Authors: Yuanjie Lyu, Penggang Qin, Tong Xu, Chen Zhu, Enhong Chen. Journal: ACM Transactions on Multimedia Computing Communications and Applications. Published: 2024-05-03. DOI: https://doi.org/10.1145/3663668
Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Alessandro Nicolosi, Rita Cucchiara
The objective of image captioning models is to bridge the gap between the visual and linguistic modalities by generating natural language descriptions that accurately reflect the content of input images. In recent years, researchers have leveraged deep learning-based models and made advances in the extraction of visual features and the design of multimodal connections to tackle this task. This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process. Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities, a differentiable encoder to represent input images, and a kNN-augmented language model to predict tokens based on contextual cues and text retrieved from the external memory. We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions, especially with a larger retrieval corpus. This work provides valuable insights into retrieval-augmented captioning models and opens up new avenues for improving image captioning at a larger scale.
Title: Towards Retrieval-Augmented Architectures for Image Captioning. Authors: Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Alessandro Nicolosi, Rita Cucchiara. Journal: ACM Transactions on Multimedia Computing Communications and Applications. Published: 2024-05-03. DOI: https://doi.org/10.1145/3663667
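The retrieval step described in the abstract above (a knowledge retriever based on visual similarities over an external kNN memory) might look roughly like the following. This is a minimal sketch under stated assumptions: the memory is a plain list of (image embedding, caption) pairs, similarity is cosine, and `retrieve_captions` is a hypothetical helper. The embeddings are made-up two-dimensional vectors, and the paper's encoder and kNN-augmented language model are not reproduced.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_captions(query: Vector,
                      memory: List[Tuple[Vector, str]],
                      k: int) -> List[str]:
    """Return the captions of the k memory entries most similar to the query."""
    ranked = sorted(memory, key=lambda item: cosine(query, item[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]


# Hypothetical external memory: (image embedding, caption) pairs.
memory = [
    ([1.0, 0.0], "a dog running on grass"),
    ([0.9, 0.1], "a puppy playing in a park"),
    ([0.0, 1.0], "a plate of pasta"),
]

# Embedding of the input image (invented for illustration).
neighbours = retrieve_captions([1.0, 0.05], memory, k=2)
```

The retrieved captions would then be handed to the language model as additional context. A brute-force sort is used only for clarity; a larger retrieval corpus would normally call for an approximate nearest-neighbour index.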
Kayhan Latifzadeh, Nima Gozalpour, V. Javier Traver, Tuukka Ruotsalo, Aleksandra Kawala-Sterniuk, Luis A Leiva
Affect decoding through brain-computer interfacing (BCI) holds great potential to capture users’ feelings and emotional responses via non-invasive electroencephalogram (EEG) sensing. Yet, little research has been conducted to understand efficient decoding when users are exposed to dynamic audiovisual content. In this regard, we study EEG-based affect decoding from videos in arousal and valence classification tasks, considering the impact of signal length, window size for feature extraction, and frequency bands. We train both classic Machine Learning models (SVMs and k-NNs) and modern Deep Learning models (FCNNs and GTNs). Our results show that: (1) affect can be effectively decoded using less than 1 minute of EEG signal; (2) temporal windows of 6 and 10 seconds provide the best classification performance for classic Machine Learning models, whereas Deep Learning models benefit from much shorter windows of 2 seconds; and (3) any model trained on the Beta band alone achieves performance similar to (and sometimes better than) the same model trained on all frequency bands. Taken together, our results indicate that affect decoding can work in more realistic conditions than currently assumed, thus becoming a viable technology for creating better interfaces and user models.
Title: Efficient Decoding of Affective States from Video-elicited EEG Signals: An Empirical Investigation. Authors: Kayhan Latifzadeh, Nima Gozalpour, V. Javier Traver, Tuukka Ruotsalo, Aleksandra Kawala-Sterniuk, Luis A Leiva. Journal: ACM Transactions on Multimedia Computing Communications and Applications. Published: 2024-05-03. DOI: https://doi.org/10.1145/3663669
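To make the window-size comparison in the abstract above concrete, the sketch below splits one minute of signal into non-overlapping windows of 2, 6, and 10 seconds and computes one feature per window. Several things are assumptions made for illustration only: the 128 Hz sample rate, the synthetic sawtooth signal, the non-overlapping segmentation, and mean power as the feature. The abstract does not state the study's actual features or preprocessing.

```python
from typing import List


def split_windows(signal: List[float], sample_rate: int,
                  window_seconds: float) -> List[List[float]]:
    """Cut the signal into consecutive, non-overlapping, full-length windows."""
    size = int(sample_rate * window_seconds)
    if size <= 0:
        raise ValueError("window must contain at least one sample")
    return [signal[i:i + size]
            for i in range(0, len(signal) - size + 1, size)]


def mean_power(window: List[float]) -> float:
    """Average squared amplitude of one window (a stand-in feature)."""
    return sum(x * x for x in window) / len(window)


SAMPLE_RATE = 128  # assumed sampling rate in Hz, not taken from the paper

# 60 seconds of a synthetic sawtooth standing in for one EEG channel.
signal = [((i % 8) - 3.5) / 3.5 for i in range(SAMPLE_RATE * 60)]

features_2s = [mean_power(w) for w in split_windows(signal, SAMPLE_RATE, 2)]
features_6s = [mean_power(w) for w in split_windows(signal, SAMPLE_RATE, 6)]
features_10s = [mean_power(w) for w in split_windows(signal, SAMPLE_RATE, 10)]
```

Shorter windows produce more feature vectors from the same minute of signal (30 at 2 seconds versus 6 at 10 seconds), which is one plausible reason why data-hungry models and classic models could prefer different window sizes.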
Shengeng Tang, Feng Xue, Jingjing Wu, Shuo Wang, Richang Hong
Sign Language Production (SLP) aims to convert text or audio sentences into sign language videos corresponding to their semantics, which is challenging due to the diversity and complexity of sign languages and to cross-modal semantic mapping issues. In this work, we propose a Gloss-driven Conditional Diffusion Model (GCDM) for SLP. The core of the GCDM is a diffusion model architecture, in which the sign gloss sequence is encoded by a Transformer-based encoder and input into the diffusion model as a semantic prior condition. During sign pose generation, the textual semantic priors carried in the encoded gloss features are integrated into the embedded Gaussian noise via cross-attention. Subsequently, the model converts the fused features into sign language pose sequences through T denoising steps. During training, the model starts from the ground-truth sign poses, turns them into Gaussian noise through T rounds of noising, and then performs T rounds of denoising to approximate the real sign language gestures. The entire process is constrained by the MAE loss function to ensure that the generated sign language gestures are as close as possible to the ground truth. In the inference phase, the model randomly samples a set of Gaussian noise, generates multiple sign language gesture sequence hypotheses under the guidance of the gloss sequence, and outputs a high-confidence sign language gesture video by averaging the hypotheses. Experimental results on the Phoenix2014T dataset show that the proposed GCDM method is competitive in both quantitative performance and qualitative visualization.
Title: Gloss-driven Conditional Diffusion Models for Sign Language Production. Authors: Shengeng Tang, Feng Xue, Jingjing Wu, Shuo Wang, Richang Hong. Journal: ACM Transactions on Multimedia Computing Communications and Applications. Published: 2024-05-03. DOI: https://doi.org/10.1145/3663572
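The noising side of the training procedure described in the abstract above, together with the MAE constraint, can be sketched as follows. This uses the standard closed-form forward process of denoising diffusion models, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The abstract does not confirm that GCDM uses exactly this form, so it should be read as an assumption. The linear noise schedule, its end points, the three-value toy pose, and all function names are likewise illustrative, and the gloss-conditioned denoiser is not modelled.

```python
import math
import random
from typing import List


def linear_betas(steps: int, start: float = 1e-4, end: float = 0.02) -> List[float]:
    """Assumed linear noise schedule with `steps` rounds of noising."""
    if steps == 1:
        return [start]
    return [start + (end - start) * t / (steps - 1) for t in range(steps)]


def cumulative_alphas(betas: List[float]) -> List[float]:
    """alpha_bar_t = product of (1 - beta_s) for s up to t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def add_noise(pose: List[float], alpha_bar: float, rng: random.Random) -> List[float]:
    """Sample x_t directly from x_0 using the closed-form forward process."""
    return [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
            for x in pose]


def mae(pred: List[float], target: List[float]) -> float:
    """Mean absolute error between a predicted and a ground-truth pose."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_betas(steps=1000))

pose = [0.2, -0.1, 0.4]                                 # toy ground-truth pose values
slightly_noisy = add_noise(pose, alpha_bars[0], rng)    # after the first round
mostly_noise = add_noise(pose, alpha_bars[-1], rng)     # after all T rounds
perfect_loss = mae(pose, pose)
```

After the final round almost none of the original pose remains (`alpha_bars[-1]` is close to zero), which is why inference can start from pure Gaussian noise and rely on the gloss condition to steer the denoising.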
Text-to-Video Retrieval is a typical cross-modal retrieval task that has been studied extensively under a conventional supervised setting. Recently, some works have sought to extend the problem to a weakly supervised formulation, which can be more consistent with real-life scenarios and cheaper to annotate. In this context, a new task called Partially Relevant Video Retrieval (PRVR) has been proposed, which aims to retrieve videos that are partially relevant to a given textual query, i.e., videos containing at least one semantically relevant moment. Formulating the task as a Multiple Instance Learning (MIL) ranking problem, prior work relies on heuristic algorithms such as a simple greedy search strategy and deals with each query independently. Although these early explorations have achieved decent performance, they may not fully utilize the bag-level label and consider only the local optimum, which could result in suboptimal solutions and inferior final retrieval performance. To address this problem, we propose to exploit the relationships between instances to boost retrieval performance. Based on this idea, we put forward: 1) a new matching scheme for pairing queries and their related moments in the video; 2) a new loss function to facilitate cross-modal alignment between two views of an instance. Extensive validation on three publicly available datasets demonstrates the effectiveness of our solution and verifies our hypothesis that modeling instance-level relationships is beneficial in the MIL ranking setting. Our code will be publicly available at https://github.com/xjtupanda/BGM-Net.
Title: Exploiting Instance-level Relationships in Weakly Supervised Text-to-Video Retrieval. Authors: Shukang Yin, Sirui Zhao, Hao Wang, Tong Xu, Enhong Chen. Journal: ACM Transactions on Multimedia Computing Communications and Applications. Published: 2024-05-03. DOI: https://doi.org/10.1145/3663571
Grounding temporal video segments described by natural language queries effectively and efficiently is a crucial capability in vision-and-language research. In this paper, we address the fast video temporal grounding (FVTG) task, which aims to localize the target segment at high speed and with favorable accuracy. Most existing approaches adopt elaborately designed cross-modal interaction modules to improve grounding performance, but these modules create a test-time bottleneck. Although several common space-based methods are fast at inference, they can hardly capture comprehensive and explicit relations between the visual and textual modalities. To tackle this speed-accuracy tradeoff, we propose a commonsense-aware cross-modal alignment network (C2AN), which incorporates commonsense-guided visual and text representations into a complementary common space for fast video temporal grounding. Specifically, commonsense concepts are explored and exploited by extracting structural semantic information from a language corpus. Then, a commonsense-aware interaction module is designed to obtain bridged visual and text features by utilizing the learned commonsense concepts. Finally, to maintain the original semantic information of textual queries, a cross-modal complementary common space is optimized to obtain matching scores for performing FVTG. Extensive results on two challenging benchmarks show that our C2AN method performs favorably against state-of-the-art methods while running at high speed. Our code is available at https://github.com/ZiyueWu59/CCA.
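The speed advantage of common space-based methods mentioned above can be illustrated with a short sketch: once candidate moments and the query are embedded in a shared space, grounding reduces to a similarity lookup with no cross-modal interaction at test time. This is a generic illustration, not the C2AN architecture. The function names, candidate spans, and embeddings are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returns a copy if the norm is zero)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def ground_query(query_emb, moments):
    """moments: list of ((start_sec, end_sec), embedding) pairs.

    Returns the span whose embedding has the highest cosine similarity
    with the query, together with that score.
    """
    q = l2_normalize(query_emb)
    best_span, best_score = None, float("-inf")
    for span, emb in moments:
        score = sum(a * b for a, b in zip(q, l2_normalize(emb)))
        if score > best_score:
            best_span, best_score = span, score
    return best_span, best_score

# Hypothetical candidate moments with made-up 3-D embeddings.
moments = [
    ((0.0, 5.0), [1.0, 0.0, 0.0]),
    ((5.0, 12.0), [0.2, 0.9, 0.1]),
    ((12.0, 20.0), [0.0, 0.1, 1.0]),
]
span, score = ground_query([0.1, 1.0, 0.0], moments)
```

In a real system the moment embeddings would typically be normalized and cached offline, so only the query needs to be encoded per request.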
{"title":"Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding","authors":"Ziyue Wu, Junyu Gao, Shucheng Huang, Changsheng Xu","doi":"10.1145/3663368","DOIUrl":"https://doi.org/10.1145/3663368","url":null,"abstract":"<p>Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields. In this paper, we deal with the fast video temporal grounding (FVTG) task, aiming at localizing the target segment with high speed and favorable accuracy. Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance, which suffer from the test-time bottleneck. Although several common space-based methods enjoy the high-speed merit during inference, they can hardly capture the comprehensive and explicit relations between visual and textual modalities. In this paper, to tackle the dilemma of speed-accuracy tradeoff, we propose a commonsense-aware cross-modal alignment network (C<sub>2</sub>AN), which incorporates commonsense-guided visual and text representations into a complementary common space for fast video temporal grounding. Specifically, the commonsense concepts are explored and exploited by extracting the structural semantic information from a language corpus. Then, a commonsense-aware interaction module is designed to obtain bridged visual and text features by utilizing the learned commonsense concepts. Finally, to maintain the original semantic information of textual queries, a cross-modal complementary common space is optimized to obtain matching scores for performing FVTG. Extensive results on two challenging benchmarks show that our C<sub>2</sub>AN method performs favorably against state-of-the-arts while running at high speed. 
Our code is available at https://github.com/ZiyueWu59/CCA.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"102 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140830533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper focuses on motion prediction for point cloud sequences in the challenging case of deformable 3D objects, such as human body motion. First, we investigate the challenges caused by the deformable shapes and complex motions present in this type of representation, with the ultimate goal of understanding the technical limitations of state-of-the-art models. From this understanding, we propose an improved architecture for point cloud prediction of deformable 3D objects. Specifically, to handle deformable shapes, we propose a graph-based approach that learns and exploits the spatial structure of point clouds to extract more representative features. Then, we propose a module that combines the learned features in an adaptative manner according to the point cloud movements. The proposed adaptative module controls the composition of local and global motions for each point, enabling the network to model complex motions in deformable 3D objects more effectively. We tested the proposed method on the following datasets: MNIST moving digits, Mixamo human body motions [15], and the JPEG [5] and CWIPC-SXR [32] real-world dynamic bodies. Simulation results demonstrate that our method outperforms the current baseline methods, given its improved ability to model complex movements and preserve point cloud shape. Furthermore, we demonstrate the generalizability of the proposed framework for dynamic feature learning by testing it for action recognition on the MSRAction3D dataset [19], achieving results on par with state-of-the-art methods.
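As a rough illustration of what a graph over a point cloud can look like, the sketch below builds a k-nearest-neighbor graph from raw 3D coordinates. The abstract does not state how the graph is constructed (it describes the structure as learned), so the k-NN rule, the function name, and the toy points are assumptions made purely for illustration.

```python
import math

def knn_graph(points, k):
    """Return {i: [indices of the k nearest other points]} by Euclidean distance.

    Ties are broken by the lower point index.
    """
    graph = {}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        graph[i] = [j for _, j in dists[:k]]
    return graph

# Hypothetical toy cloud: three nearby points and one distant outlier.
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (5, 5, 5)]
graph = knn_graph(points, k=2)
```

Features aggregated over such neighborhoods describe local shape. This brute-force construction is quadratic in the number of points, so larger clouds would normally use a spatial index instead.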
{"title":"AGAR - Attention Graph-RNN for Adaptative Motion Prediction of Point Clouds of Deformable Objects","authors":"Pedro de Medeiros Gomes, Silvia Rossi, Laura Toni","doi":"10.1145/3662183","DOIUrl":"https://doi.org/10.1145/3662183","url":null,"abstract":"<p>This paper focuses on motion prediction for point cloud sequences in the challenging case of deformable 3D objects, such as human body motion. First, we investigate the challenges caused by deformable shapes and complex motions present in this type of representation, with the ultimate goal of understanding the technical limitations of state-of-the-art models. From this understanding, we propose an improved architecture for point cloud prediction of deformable 3D objects. Specifically, to handle deformable shapes, we propose a graph-based approach that learns and exploits the spatial structure of point clouds to extract more representative features. Then, we propose a module able to combine the learned features in a <i>adaptative</i> manner according to the point cloud movements. The proposed adaptative module controls the composition of local and global motions for each point, enabling the network to model complex motions in deformable 3D objects more effectively. We tested the proposed method on the following datasets: MNIST moving digits, the <i>Mixamo</i> human bodies motions [15], JPEG [5] and CWIPC-SXR [32] real-world dynamic bodies. Simulation results demonstrate that our method outperforms the current baseline methods given its improved ability to model complex movements as well as preserve point cloud shape. 
Furthermore, we demonstrate the generalizability of the proposed framework for dynamic feature learning by testing the framework for action recognition on the MSRAction3D dataset [19] and achieving results on par with state-of-the-art methods.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"216 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140830793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}