
Latest Publications in ACM Transactions on Multimedia Computing Communications and Applications

Exploration of Speech and Music Information for Movie Genre Classification
IF 5.1 | CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-05-07 | DOI: 10.1145/3664197
Mrinmoy Bhattacharjee, Prasanna Mahadeva S. R., Prithwijit Guha

Movie genre prediction from trailers is mostly attempted in a multi-modal manner. However, the characteristics of movie trailer audio indicate that this modality alone might be highly effective in genre prediction. Movie trailer audio predominantly consists of speech and music signals, either in isolation or in overlapping conditions. This work hypothesizes that the genre labels of movie trailers might relate to the composition of their audio component. In this regard, speech-music confidence sequences for the trailer audio are used as a feature. In addition, two other features previously proposed for discriminating speech from music are also adopted in the current task. This work proposes a time and channel Attention Convolutional Neural Network (ACNN) classifier for the genre classification task. The convolutional layers in ACNN learn the spatial relationships in the input features. The time and channel attention layers learn to focus on crucial time steps and CNN kernel outputs, respectively. The Moviescope dataset is used to perform the experiments, and two audio-based baseline methods are employed to benchmark this work. The proposed feature set with the ACNN classifier improves the genre classification performance over the baselines. Moreover, decent generalization performance is obtained for genre prediction of movies with different cultural influences (EmoGDB).
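As a rough illustration of the time- and channel-attention design described above, the sketch below wires a small ACNN-style classifier over speech-music confidence sequences in PyTorch. All layer sizes, the attention formulation, and the 13-genre output are assumptions made for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class TimeChannelAttentionCNN(nn.Module):
    """Sketch of a time- and channel-attention CNN (ACNN-style) classifier.

    Input: (batch, feat_dim, time) speech-music confidence sequences.
    All sizes are illustrative assumptions, not the paper's configuration.
    """
    def __init__(self, feat_dim=2, num_genres=13, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2), nn.ReLU(),
        )
        # Time attention: one scalar weight per time step.
        self.time_att = nn.Conv1d(channels, 1, kernel_size=1)
        # Channel attention: one scalar gate per CNN kernel output.
        self.chan_att = nn.Linear(channels, channels)
        self.classifier = nn.Linear(channels, num_genres)

    def forward(self, x):                               # x: (B, feat_dim, T)
        h = self.conv(x)                                # (B, C, T)
        t_w = torch.softmax(self.time_att(h), dim=-1)   # (B, 1, T)
        pooled = (h * t_w).sum(dim=-1)                  # attention-weighted pooling -> (B, C)
        c_w = torch.sigmoid(self.chan_att(pooled))      # (B, C) channel gates
        return self.classifier(pooled * c_w)            # (B, num_genres) genre logits

# Toy usage: 8 trailers, 2-dim speech/music confidences over 300 frames.
logits = TimeChannelAttentionCNN()(torch.randn(8, 2, 300))
print(logits.shape)  # torch.Size([8, 13])
```

Attention-weighted pooling over time plus a sigmoid channel gate is one simple way to realize "focus on crucial time steps and kernel outputs"; the paper's exact attention layers may differ.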

Citations: 0
EOGT: Video Anomaly Detection with Enhanced Object Information and Global Temporal Dependency
IF 5.1 | CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-05-06 | DOI: 10.1145/3662185
Ruoyan Pi, Peng Wu, Xiangteng He, Yuxin Peng

Video anomaly detection (VAD) aims to identify events or scenes in videos that deviate from typical patterns. Existing approaches primarily focus on reconstructing or predicting frames to detect anomalies and have shown improved performance in recent years. However, they often depend highly on local spatio-temporal information and face the challenge of insufficient object feature modeling. To address the above issues, this paper proposes a video anomaly detection framework with Enhanced Object Information and Global Temporal Dependencies (EOGT), whose main novelties are: (1) A Local Object Anomaly Stream (LOAS) is proposed to extract local multimodal spatio-temporal anomaly features at the object level. LOAS integrates two modules: a Diffusion-based Object Reconstruction Network (DORN) with multimodal conditions detects anomalies from object RGB information, and an Object Pose Anomaly Refiner (OPA) discovers anomalies from human pose information. (2) A Global Temporal Strengthening Stream (GTSS) is proposed, which leverages video-level temporal dependencies to identify long-term and video-specific anomalies effectively. Both streams are jointly employed in EOGT to learn multimodal and multi-scale spatio-temporal anomaly features for VAD, and we finally fuse the anomaly features and scores to detect anomalies at the frame level. Extensive experiments are conducted to verify the performance of EOGT on three public datasets: ShanghaiTech Campus, CUHK Avenue, and UCSD Ped2.
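To make the two-stream idea concrete, here is a minimal, hedged sketch of a GTSS-like stream (self-attention over frame features to capture video-level temporal dependencies) and a late fusion of its frame scores with object-level scores from a LOAS-like stream. The diffusion-based reconstruction network and pose refiner are not reproduced; all dimensions and the fusion weighting are assumptions.

```python
import torch
import torch.nn as nn

class GlobalTemporalStream(nn.Module):
    """Toy stand-in for a GTSS-like stream: self-attention over frame features
    to capture video-level temporal dependencies, then per-frame anomaly scores.
    Dimensions and the scoring head are assumptions for illustration only.
    """
    def __init__(self, feat_dim=256, heads=4):
        super().__init__()
        self.att = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, frames):                          # frames: (B, T, feat_dim)
        ctx, _ = self.att(frames, frames, frames)       # video-level temporal context
        return torch.sigmoid(self.score(ctx)).squeeze(-1)  # (B, T) frame scores

def fuse_scores(local_obj_scores, global_scores, alpha=0.5):
    """Late fusion of object-level (local) and video-level (global) anomaly
    scores into frame-level scores; the weighting is a placeholder."""
    return alpha * local_obj_scores + (1 - alpha) * global_scores

frames = torch.randn(2, 64, 256)                        # 2 clips, 64 frames each
global_s = GlobalTemporalStream()(frames)
local_s = torch.rand(2, 64)                             # pretend LOAS output per frame
print(fuse_scores(local_s, global_s).shape)             # torch.Size([2, 64])
```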

Citations: 0
The Price of Unlearning: Identifying Unlearning Risk in Edge Computing
IF 5.1 | CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-05-06 | DOI: 10.1145/3662184
Lefeng Zhang, Tianqing Zhu, Ping Xiong, Wanlei Zhou

Machine unlearning is an emerging paradigm that aims to make machine learning models “forget” what they have learned about particular data. It fulfills the requirements of privacy legislation (e.g., GDPR), which stipulates that individuals have the autonomy to determine the usage of their personal data. However, alongside all the achievements, there are still loopholes in machine unlearning that may cause significant losses for the system, especially in edge computing. Edge computing is a distributed computing paradigm with the purpose of migrating data processing tasks closer to terminal devices. While various machine unlearning approaches have been proposed to erase the influence of data sample(s), we claim that it might be dangerous to directly apply them in the realm of edge computing. A malicious edge node may broadcast (possibly fake) unlearning requests for target data sample(s) and then analyze the behavior of edge devices to infer useful information. In this paper, we exploit the vulnerabilities of current machine unlearning strategies in edge computing and propose a new inference attack to highlight the potential privacy risk. Furthermore, we develop a defense method against this particular type of attack and propose the price of unlearning (PoU) as a means to evaluate the inefficiency it brings to an edge computing system. We provide theoretical analyses to show the upper bound of the PoU using tools borrowed from game theory. The experimental results on real-world datasets demonstrate that the proposed defense strategy is effective and capable of preventing an adversary from deducing useful information.
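The abstract defines the price of unlearning (PoU) game-theoretically; as a loose numeric illustration only, the toy snippet below measures the overhead an edge node pays when it also services fake unlearning requests. The data structures, names, and cost ratio are hypothetical and are not the paper's formal definition.

```python
from dataclasses import dataclass

@dataclass
class UnlearningRound:
    """One batch of unlearning requests handled by an edge node (toy model)."""
    requests: int            # number of unlearning requests received
    genuine: int             # how many were legitimate
    cost_per_request: float  # e.g. extra retraining time or energy per request

def price_of_unlearning(rounds, baseline_cost):
    """Illustrative ratio: total cost actually spent on unlearning (including
    work triggered by fake requests) over the cost the system would pay if it
    only served genuine requests. A toy formulation, not the paper's
    game-theoretic definition of PoU.
    """
    spent = sum(r.requests * r.cost_per_request for r in rounds)
    needed = sum(r.genuine * r.cost_per_request for r in rounds)
    return spent / max(needed, baseline_cost)

rounds = [UnlearningRound(100, 60, 0.2), UnlearningRound(50, 10, 0.2)]
print(round(price_of_unlearning(rounds, baseline_cost=1.0), 2))  # 2.14
```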

Citations: 0
InteractNet: Social Interaction Recognition for Semantic-rich Videos
IF 5.1 | CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-05-03 | DOI: 10.1145/3663668
Yuanjie Lyu, Penggang Qin, Tong Xu, Chen Zhu, Enhong Chen

The overwhelming surge of online video platforms has raised an urgent need for social interaction recognition techniques. Compared with simple short-term actions, long-term social interactions in semantic-rich videos can reflect more complicated semantics such as character relationships or emotions, which better support various downstream applications, e.g., story summarization and fine-grained clip retrieval. However, considering the longer duration of social interactions, their severe mutual overlap, and the involvement of multiple characters, dynamic scenes and multi-modal cues, among other factors, traditional solutions for short-term action recognition are likely to fail in this task. To address these challenges, in this paper, we propose a hierarchical graph-based system, named InteractNet, to recognize social interactions from a multi-modal perspective. Specifically, our approach first generates a semantic graph for each sampled frame by integrating multi-modal cues, and then learns the node representations as short-term interaction patterns via an adapted GCN module. Along this line, global interaction representations are accumulated through a sub-clip identification module, effectively filtering out irrelevant information and resolving temporal overlaps between interactions. Finally, the associations among simultaneous interactions are captured and modelled by constructing a global-level character-pair graph to predict the final social interactions. Comprehensive experiments on publicly available datasets demonstrate the effectiveness of our approach compared with state-of-the-art baseline methods.
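As a concrete reference point for the graph-based part of the pipeline, the sketch below runs one plain graph-convolution step over a per-frame character graph, standing in for the adapted GCN module that learns short-term interaction patterns. The node features, adjacency, and dimensions are invented for illustration and do not come from the paper.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph-convolution step over a per-frame character graph, as a toy
    stand-in for the adapted GCN module; normalization and sizes are assumed."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):                        # x: (N, in_dim), adj: (N, N)
        a_hat = adj + torch.eye(adj.size(0))          # add self-loops
        deg = a_hat.sum(dim=-1, keepdim=True)
        return torch.relu(self.lin(a_hat / deg @ x))  # mean-aggregate, then project

# Toy frame: 4 character/object nodes with fused visual+text cues (16-dim);
# edges mark which nodes co-occur or interact in the frame.
x = torch.randn(4, 16)
adj = torch.tensor([[0, 1, 1, 0],
                    [1, 0, 0, 0],
                    [1, 0, 0, 1],
                    [0, 0, 1, 0]], dtype=torch.float)
node_repr = SimpleGCNLayer(16, 32)(x, adj)            # short-term interaction patterns
print(node_repr.shape)                                # torch.Size([4, 32])
```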

Citations: 0
Towards Retrieval-Augmented Architectures for Image Captioning
IF 5.1 | CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-05-03 | DOI: 10.1145/3663667
Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Alessandro Nicolosi, Rita Cucchiara

The objective of image captioning models is to bridge the gap between the visual and linguistic modalities by generating natural language descriptions that accurately reflect the content of input images. In recent years, researchers have leveraged deep learning-based models and made advances in the extraction of visual features and the design of multimodal connections to tackle this task. This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process. Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities, a differentiable encoder to represent input images, and a kNN-augmented language model to predict tokens based on contextual cues and text retrieved from the external memory. We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions, especially with a larger retrieval corpus. This work provides valuable insights into retrieval-augmented captioning models and opens up new avenues for improving image captioning at a larger scale.
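To illustrate how retrieved text can steer token prediction, the sketch below interpolates a captioner's next-token distribution with a distribution built from tokens of retrieved captions, in the spirit of kNN-LM-style augmentation. The function name, shapes, and interpolation rule are assumptions; the paper's two model variants are more elaborate than this.

```python
import torch

def knn_augmented_next_token(lm_logits, retrieved_keys, retrieved_token_ids,
                             query, vocab_size, k=4, lam=0.3, temp=10.0):
    """Blend the captioner's next-token distribution with a distribution built
    from tokens of captions retrieved from an external memory (kNN-LM-style
    interpolation). Shapes and the interpolation rule are assumptions, meant
    only to illustrate the retrieval-augmented idea.
    """
    p_lm = torch.softmax(lm_logits, dim=-1)               # (vocab,)
    dists = torch.cdist(query[None], retrieved_keys)[0]   # (num_retrieved,)
    knn_d, idx = dists.topk(k, largest=False)             # k nearest memory entries
    w = torch.softmax(-knn_d / temp, dim=-1)               # closer -> heavier weight
    p_knn = torch.zeros(vocab_size)
    p_knn.index_add_(0, retrieved_token_ids[idx], w)       # mass on retrieved tokens
    return (1 - lam) * p_lm + lam * p_knn

vocab = 50
p = knn_augmented_next_token(torch.randn(vocab), torch.randn(16, 8),
                             torch.randint(0, vocab, (16,)), torch.randn(8), vocab)
print(p.sum())  # ~1.0, still a valid distribution
```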

Citations: 0
Efficient Decoding of Affective States from Video-elicited EEG Signals: An Empirical Investigation
IF 5.1 | CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-05-03 | DOI: 10.1145/3663669
Kayhan Latifzadeh, Nima Gozalpour, V. Javier Traver, Tuukka Ruotsalo, Aleksandra Kawala-Sterniuk, Luis A Leiva

Affect decoding through brain-computer interfacing (BCI) holds great potential to capture users’ feelings and emotional responses via non-invasive electroencephalogram (EEG) sensing. Yet, little research has been conducted to understand efficient decoding when users are exposed to dynamic audiovisual content. In this regard, we study EEG-based affect decoding from videos in arousal and valence classification tasks, considering the impact of signal length, window size for feature extraction, and frequency bands. We train both classic Machine Learning models (SVMs and k-NNs) and modern Deep Learning models (FCNNs and GTNs). Our results show that: (1) affect can be effectively decoded using less than 1 minute of EEG signal; (2) temporal windows of 6 and 10 seconds provide the best classification performance for classic Machine Learning models, but Deep Learning models benefit from much shorter windows of 2 seconds; and (3) any model trained on the Beta band alone achieves performance similar to, and sometimes better than, training on all frequency bands. Taken together, our results indicate that affect decoding can work in more realistic conditions than currently assumed, thus becoming a viable technology for creating better interfaces and user models.
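A minimal sketch of the classic-model setting follows: Beta-band band-power features over 6-second windows (the window length and band the abstract reports as effective for classic models) fed to an SVM. The filter order, the band-power feature choice, and the synthetic data are assumptions rather than the study's exact pipeline.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt
from sklearn.svm import SVC

def beta_band_power(eeg, fs=128, lo=13.0, hi=30.0, win_s=6.0):
    """Band-power features from the Beta band over short windows.

    eeg: (channels, samples). The 6-s window and 13-30 Hz band follow the
    abstract's findings for classic models; the filter order and the
    band-power feature itself are assumptions.
    """
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    beta = sosfiltfilt(sos, eeg, axis=-1)
    win = int(win_s * fs)
    n_win = beta.shape[-1] // win
    windows = beta[:, : n_win * win].reshape(beta.shape[0], n_win, win)
    return (windows ** 2).mean(axis=-1).T               # (n_win, channels)

# Toy data: 60 s of 32-channel EEG, binary valence labels per 6-s window.
rng = np.random.default_rng(0)
feats = beta_band_power(rng.standard_normal((32, 60 * 128)))
labels = rng.integers(0, 2, size=feats.shape[0])
clf = SVC(kernel="rbf").fit(feats, labels)
print(clf.score(feats, labels))
```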

Citations: 0
Gloss-driven Conditional Diffusion Models for Sign Language Production
IF 5.1 | CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-05-03 | DOI: 10.1145/3663572
Shengeng Tang, Feng Xue, Jingjing Wu, Shuo Wang, Richang Hong

Sign Language Production (SLP) aims to convert text or audio sentences into sign language videos corresponding to their semantics, which is challenging due to the diversity and complexity of sign languages and to cross-modal semantic mapping issues. In this work, we propose a Gloss-driven Conditional Diffusion Model (GCDM) for SLP. The core of the GCDM is a diffusion model architecture, in which the sign gloss sequence is encoded by a Transformer-based encoder and input into the diffusion model as a semantic prior condition. In the process of sign pose generation, the textual semantic priors carried in the encoded gloss features are integrated into the embedded Gaussian noise via cross-attention. Subsequently, the model converts the fused features into sign language pose sequences through T-round denoising steps. During training, the model starts from the ground-truth sign poses, corrupts them with Gaussian noise over T noising rounds, and then performs T rounds of denoising to approximate the real sign language gestures. The entire process is constrained by the MAE loss function to ensure that the generated sign language gestures are as close as possible to the real labels. In the inference phase, the model directly samples a set of Gaussian noise, generates multiple sign language gesture sequence hypotheses under the guidance of the gloss sequence, and outputs a high-confidence sign language gesture video by averaging the hypotheses. Experimental results on the Phoenix2014T dataset show that the proposed GCDM method achieves competitive results in both quantitative performance and qualitative visualization.
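The sketch below shows the general shape of gloss-conditioned denoising: a cross-attention denoiser that conditions noisy pose sequences on encoded gloss features, and an inference loop that averages several sampled hypotheses. The update rule is deliberately simplified and is not a faithful diffusion schedule; all sizes, the pose dimensionality, and the hypothesis count are assumptions.

```python
import torch
import torch.nn as nn

class GlossConditionedDenoiser(nn.Module):
    """One denoising step: the noisy pose sequence cross-attends to encoded
    gloss features (the semantic prior) and predicts the noise to remove.
    A minimal sketch; the real GCDM architecture and schedule are richer.
    """
    def __init__(self, pose_dim=150, d_model=128, heads=4):
        super().__init__()
        self.pose_in = nn.Linear(pose_dim, d_model)
        self.cross_att = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.out = nn.Linear(d_model, pose_dim)

    def forward(self, noisy_poses, gloss_feats):   # (B, T, pose_dim), (B, L, d_model)
        h = self.pose_in(noisy_poses)
        h, _ = self.cross_att(h, gloss_feats, gloss_feats)
        return self.out(h)                         # predicted noise

@torch.no_grad()
def sample_pose_sequence(denoiser, gloss_feats, steps=50, frames=64,
                         pose_dim=150, n_hypotheses=5):
    """Toy reverse process: start from Gaussian noise, repeatedly subtract a
    fraction of the predicted noise, and average several hypotheses into one
    high-confidence sequence (as the abstract describes at inference time)."""
    outs = []
    for _ in range(n_hypotheses):
        x = torch.randn(1, frames, pose_dim)
        for _ in range(steps):
            x = x - denoiser(x, gloss_feats) / steps   # simplified update rule
        outs.append(x)
    return torch.stack(outs).mean(dim=0)

gloss = torch.randn(1, 12, 128)                    # encoded gloss sequence
poses = sample_pose_sequence(GlossConditionedDenoiser(), gloss)
print(poses.shape)                                 # torch.Size([1, 64, 150])
```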

Citations: 0
Exploiting Instance-level Relationships in Weakly Supervised Text-to-Video Retrieval
IF 5.1 | CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-05-03 | DOI: 10.1145/3663571
Shukang Yin, Sirui Zhao, Hao Wang, Tong Xu, Enhong Chen

Text-to-Video Retrieval is a typical cross-modal retrieval task that has been studied extensively under a conventional supervised setting. Recently, some works have sought to extend the problem to a weakly supervised formulation, which can be more consistent with real-life scenarios and more efficient in annotation cost. In this context, a new task called Partially Relevant Video Retrieval (PRVR) is proposed, which aims to retrieve videos that are partially relevant to a given textual query, i.e., videos containing at least one semantically relevant moment. Formulating the task as a Multiple Instance Learning (MIL) ranking problem, prior art relies on heuristic algorithms, such as a simple greedy search strategy, and deals with each query independently. Although these early explorations have achieved decent performance, they may not fully utilize the bag-level label and consider only the local optimum, which could result in suboptimal solutions and inferior final retrieval performance. To address this problem, in this paper, we propose to exploit the relationships between instances to boost retrieval performance. Based on this idea, we creatively put forward: 1) a new matching scheme for pairing queries and their related moments in the video; 2) a new loss function to facilitate cross-modal alignment between two views of an instance. Extensive validations on three publicly available datasets have demonstrated the effectiveness of our solution and verified our hypothesis that modeling instance-level relationships is beneficial in the MIL ranking setting. Our code will be publicly available at https://github.com/xjtupanda/BGM-Net.
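For orientation, here is the standard MIL-ranking baseline this line of work builds on: a video (a bag of moments) is scored by its best-matching moment, and a triplet loss ranks partially relevant videos above irrelevant ones. The cosine similarity, max-pooling, and margin are common baseline choices, not the paper's proposed matching scheme or loss.

```python
import torch
import torch.nn.functional as F

def bag_similarity(query_emb, moment_embs):
    """MIL-style bag score: a video is scored by its best-matching moment.
    query_emb: (d,), moment_embs: (num_moments, d). Cosine similarity with
    max-pooling is a common baseline, not necessarily the paper's final
    matching scheme.
    """
    sims = F.cosine_similarity(query_emb[None], moment_embs, dim=-1)
    return sims.max()

def triplet_ranking_loss(query, pos_moments, neg_moments, margin=0.2):
    """Rank the partially relevant (positive) video above an irrelevant one."""
    pos = bag_similarity(query, pos_moments)
    neg = bag_similarity(query, neg_moments)
    return F.relu(margin + neg - pos)

q = torch.randn(256)
loss = triplet_ranking_loss(q, torch.randn(32, 256), torch.randn(32, 256))
print(loss.item())
```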

Citations: 0
Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding
IF 5.1 | CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-05-01 | DOI: 10.1145/3663368
Ziyue Wu, Junyu Gao, Shucheng Huang, Changsheng Xu

Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields. In this paper, we deal with the fast video temporal grounding (FVTG) task, aiming at localizing the target segment with high speed and favorable accuracy. Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance, which suffer from a test-time bottleneck. Although several common-space-based methods enjoy high speed during inference, they can hardly capture the comprehensive and explicit relations between visual and textual modalities. To tackle the speed-accuracy tradeoff, we propose a commonsense-aware cross-modal alignment network (C2AN), which incorporates commonsense-guided visual and text representations into a complementary common space for fast video temporal grounding. Specifically, the commonsense concepts are explored and exploited by extracting the structural semantic information from a language corpus. Then, a commonsense-aware interaction module is designed to obtain bridged visual and text features by utilizing the learned commonsense concepts. Finally, to maintain the original semantic information of textual queries, a cross-modal complementary common space is optimized to obtain matching scores for performing FVTG. Extensive results on two challenging benchmarks show that our C2AN method performs favorably against the state of the art while running at high speed. Our code is available at https://github.com/ZiyueWu59/CCA.
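The speed advantage of common-space methods comes from encoding video moments and the query independently, so test-time grounding reduces to a dot product and an argmax. The sketch below shows that skeleton with plain linear projections in place of C2AN's commonsense-guided encoders; the feature dimensions and proposal handling are assumptions.

```python
import torch
import torch.nn as nn

class CommonSpaceGrounder(nn.Module):
    """Sketch of the fast, common-space grounding setup: video moments and the
    query are embedded independently, so at test time grounding is just a dot
    product and an argmax. The commonsense-guided encoders of C2AN are replaced
    here by plain linear projections for illustration.
    """
    def __init__(self, vid_dim=500, txt_dim=300, d=256):
        super().__init__()
        self.vid_proj = nn.Linear(vid_dim, d)
        self.txt_proj = nn.Linear(txt_dim, d)

    def forward(self, moment_feats, query_feat):         # (M, vid_dim), (txt_dim,)
        v = nn.functional.normalize(self.vid_proj(moment_feats), dim=-1)
        q = nn.functional.normalize(self.txt_proj(query_feat), dim=-1)
        scores = v @ q                                   # (M,) matching scores
        return scores, scores.argmax()                   # best moment index

moments = torch.randn(20, 500)                           # 20 candidate moments
scores, best = CommonSpaceGrounder()(moments, torch.randn(300))
print(best.item(), scores.shape)
```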

Citations: 0
AGAR - Attention Graph-RNN for Adaptative Motion Prediction of Point Clouds of Deformable Objects
IF 5.1 | CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-05-01 | DOI: 10.1145/3662183
Pedro de Medeiros Gomes, Silvia Rossi, Laura Toni

This paper focuses on motion prediction for point cloud sequences in the challenging case of deformable 3D objects, such as human body motion. First, we investigate the challenges caused by deformable shapes and complex motions present in this type of representation, with the ultimate goal of understanding the technical limitations of state-of-the-art models. From this understanding, we propose an improved architecture for point cloud prediction of deformable 3D objects. Specifically, to handle deformable shapes, we propose a graph-based approach that learns and exploits the spatial structure of point clouds to extract more representative features. Then, we propose a module able to combine the learned features in an adaptative manner according to the point cloud movements. The proposed adaptative module controls the composition of local and global motions for each point, enabling the network to model complex motions in deformable 3D objects more effectively. We tested the proposed method on the following datasets: MNIST moving digits, the Mixamo human body motions [15], and the JPEG [5] and CWIPC-SXR [32] real-world dynamic bodies. Simulation results demonstrate that our method outperforms the current baseline methods given its improved ability to model complex movements as well as preserve point cloud shape. Furthermore, we demonstrate the generalizability of the proposed framework for dynamic feature learning by testing the framework for action recognition on the MSRAction3D dataset [19] and achieving results on par with state-of-the-art methods.
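To illustrate the adaptative composition of local and global motion, the toy head below predicts a per-point displacement and a cloud-level displacement and mixes them with a learned per-point gate. The Graph-RNN feature extractor is assumed to be given, and all sizes are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class AdaptiveMotionHead(nn.Module):
    """Toy version of the adaptative composition idea: each point predicts a
    local displacement, the cloud predicts one global displacement, and a
    per-point gate mixes the two. Feature extraction (the Graph-RNN part) is
    replaced by a given per-point feature tensor; sizes are assumptions.
    """
    def __init__(self, feat_dim=64):
        super().__init__()
        self.local_motion = nn.Linear(feat_dim, 3)
        self.global_motion = nn.Linear(feat_dim, 3)
        self.gate = nn.Linear(feat_dim, 1)

    def forward(self, points, feats):                   # (B, N, 3), (B, N, feat_dim)
        local = self.local_motion(feats)                # per-point motion
        global_ = self.global_motion(feats.mean(dim=1, keepdim=True))  # (B, 1, 3)
        g = torch.sigmoid(self.gate(feats))             # (B, N, 1) mixing weight
        return points + g * local + (1 - g) * global_   # predicted next frame

pts, feats = torch.randn(2, 1024, 3), torch.randn(2, 1024, 64)
print(AdaptiveMotionHead()(pts, feats).shape)           # torch.Size([2, 1024, 3])
```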

Citations: 0