
Latest publications from ACM Transactions on Multimedia Computing Communications and Applications

Multimodal PEAR Chain-of-Thought Reasoning for Multimodal Sentiment Analysis
IF 5.1 | Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-11 | DOI: 10.1145/3672398
Yan Li, Xiangyuan Lan, Haifeng Chen, Ke Lu, Dongmei Jiang

Multimodal sentiment analysis aims to predict sentiments from multimodal signals such as audio, video, and text. Existing methods often rely on Pre-trained Language Models (PLMs) to extract semantic information from textual data, lacking an in-depth understanding of the logical relationships within the text modality. This paper introduces Multimodal PEAR Chain-of-Thought (MM-PEAR-CoT) reasoning for multimodal sentiment analysis. Inspired by the human thought process when solving complex problems, the PEAR (Preliminaries, quEstion, Answer, Reason) chain-of-thought prompt is first proposed to induce Large Language Models (LLMs) to generate text-based reasoning processes and zero-shot sentiment prediction results. However, text-based chain-of-thought reasoning is not always reliable and might contain irrational steps due to the hallucinations of large language models. To address this, we further design the Cross-Modal Filtering and Fusion (CMFF) module. The filtering submodule utilizes audio and visual modalities to suppress irrational steps in the chain of thought, while the fusion submodule integrates high-level reasoning information and cross-modal complementary information in the process of semantic representation learning. Experimental results on two multimodal sentiment analysis benchmark datasets show that high-level reasoning information helps learn discriminative text representations, and cross-modal complementary information prevents the model from being misled by unreasonable steps in the chain of thought. MM-PEAR-CoT achieves the best results on both datasets, with improvements of 2.2% and 1.7% in binary classification accuracy on the CMU-MOSI and CMU-MOSEI datasets, respectively. To the best of our knowledge, this is the first study to apply chain-of-thought reasoning to multimodal sentiment analysis.
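To make the prompt structure concrete, below is a minimal sketch of how a PEAR-style (Preliminaries, quEstion, Answer, Reason) prompt could be assembled for zero-shot sentiment prediction; the exact wording and the `query_llm` callable are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of a PEAR-style chain-of-thought prompt for zero-shot sentiment
# prediction. The field wording and the query_llm helper are assumptions.

def build_pear_prompt(utterance: str) -> str:
    """Compose a PEAR chain-of-thought prompt from a text utterance."""
    return (
        "Preliminaries: You are analysing the sentiment of a spoken utterance.\n"
        f"Question: What is the sentiment (positive or negative) of: \"{utterance}\"?\n"
        "Answer: State the sentiment label first.\n"
        "Reason: Then explain, step by step, which words or phrases support the label."
    )

def zero_shot_sentiment(utterance: str, query_llm) -> str:
    """query_llm is any callable that sends a prompt to an LLM and returns its text."""
    response = query_llm(build_pear_prompt(utterance))
    # The first line is taken as the label; the rest is the reasoning chain that
    # downstream cross-modal filtering could inspect.
    return response

# Example (with a hypothetical LLM client):
# print(zero_shot_sentiment("I absolutely loved the ending.", my_llm))
```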

Mustang: Improving QoE for Real-Time Video in Cellular Networks by Masking Jitter
IF 5.1 | Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-10 | DOI: 10.1145/3672399
Encheng Yu, Jianer Zhou, Zhenyu Li, Gareth Tyson, Weichao Li, Xinyi Zhang, Zhiwei Xu, Gaogang Xie

The advent of 5G and interactive live broadcasting has led to a growing trend of people preferring real-time interactive video services on mobile devices, particularly mobile phones. In this work, we measure the performance of Google congestion control (GCC) in cellular networks, which is the default congestion control algorithm for Web Real-Time Communications (WebRTC). Our measurements show that GCC sometimes makes bitrate decisions that are harmful to quality of experience (QoE) in cellular networks with high jitter. We further find that the frame delivery time (FDT) in the player can mitigate network jitter and maintain QoE. Moreover, the receiving rate reflects network congestion better than RTT in cellular networks. Based on these measurements and findings, we propose Mustang, an algorithm designed to overcome jitter in cellular networks. Mustang uses the FDT and receiving rate as feedback to the sender, which then adjusts its sending rate based on this information to guarantee QoE. We have implemented Mustang in WebRTC and evaluated it in both emulated and real cellular networks. The experimental results show that Mustang can improve both WebRTC's QoS and QoE performance. For QoS, Mustang increases the sending rate by 72.1% with similar RTT and packet loss compared with GCC, while it is about 30% better for QoE.
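As an illustration of the feedback loop described above, the sketch below shows a sender-side rate adjustment driven by the receiver-reported FDT and receiving rate; the thresholds and step sizes are assumptions for illustration, not Mustang's actual parameters.

```python
# A minimal sketch of FDT- and receiving-rate-driven rate control. Thresholds,
# back-off and probing factors are illustrative assumptions.

def adjust_sending_rate(current_rate_kbps: float,
                        fdt_ms: float,
                        receiving_rate_kbps: float,
                        fdt_target_ms: float = 33.0) -> float:
    """Return an updated sending rate based on receiver feedback."""
    if fdt_ms > fdt_target_ms or receiving_rate_kbps < 0.9 * current_rate_kbps:
        # Frames are arriving late or the receiver cannot keep up: back off.
        return max(0.85 * current_rate_kbps, 100.0)
    # Otherwise probe upward gently to use spare capacity.
    return 1.05 * current_rate_kbps

# Example: rate = adjust_sending_rate(2000.0, fdt_ms=45.0, receiving_rate_kbps=1500.0)
```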

Mix-DDPM: Enhancing Diffusion Models through Fitting Mixture Noise with Global Stochastic Offset
IF 5.1 | Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-07 | DOI: 10.1145/3672080
Hanzhang Wang, Deming Zhai, Xiong Zhou, Junjun Jiang, Xianming Liu

Denoising diffusion probabilistic models (DDPM) have shown impressive performance in various domains as a class of deep generative models. In this paper, we introduce the Mixture noise-based DDPM (Mix-DDPM), which considers the Markov diffusion posterior as a Gaussian mixture model. Specifically, Mix-DDPM randomly selects a Gaussian component and then adds the chosen Gaussian noise, which proves to be a more efficient way to perturb the signals into a simple known distribution. We further define the reverse probabilistic model as a parameterized Gaussian mixture kernel. Due to the intractability of calculating the KL divergence between Gaussian mixture models, we derive a variational bound to maximize the likelihood, offering a concise formulation for optimizing the denoising model and valuable insights for designing the sampling strategies. Our theoretical derivation highlights that Mix-DDPM need only shift the image by including a global stochastic offset in both the diffusion and reverse processes, which can be efficiently implemented with just a few lines of code. The global stochastic offset effectively fits a Gaussian mixture distribution, enhancing the degrees of freedom of the entire diffusion model. Furthermore, we present three streamlined sampling strategies that interface with diverse fast dedicated solvers for diffusion ordinary differential equations, boosting the efficacy of image representation in the sampling phase and alleviating the issue of slow generation speed, thereby enhancing both efficiency and accuracy. Extensive experiments on benchmark datasets demonstrate the effectiveness of Mix-DDPM and its superiority over the original DDPM.
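Since the paper notes the offset can be implemented in a few lines, here is a hedged sketch of one forward diffusion draw with a randomly selected mixture component acting as a global offset; the component means, noise schedule, and the exact way the offset enters the noise term are illustrative assumptions, not the paper's configuration.

```python
# A minimal sketch of the "global stochastic offset" idea: a mixture component is
# sampled and its mean is added as a global shift on top of standard Gaussian noise.

import numpy as np

def mix_ddpm_forward(x0: np.ndarray,
                     alpha_bar_t: float,
                     component_means: np.ndarray,
                     rng: np.random.Generator) -> np.ndarray:
    """One forward diffusion draw q(x_t | x_0) with a mixture-noise offset."""
    k = rng.integers(len(component_means))     # pick a Gaussian component at random
    offset = component_means[k]                # global stochastic offset (a scalar shift)
    noise = rng.standard_normal(x0.shape)      # standard Gaussian noise
    return (np.sqrt(alpha_bar_t) * x0
            + np.sqrt(1.0 - alpha_bar_t) * (noise + offset))

# Example with two symmetric component means:
# rng = np.random.default_rng(0)
# xt = mix_ddpm_forward(np.zeros((3, 32, 32)), 0.5, np.array([-1.0, 1.0]), rng)
```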

SCAE: Structural Contrastive Auto-encoder for Incomplete Multi-view Representation Learning
IF 5.1 | Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-07 | DOI: 10.1145/3672078
Mengran Li, Ronghui Zhang, Yong Zhang, Xinglin Piao, Shiyu Zhao, Baocai Yin

Describing an object from multiple perspectives often leads to incomplete data representation. Consequently, learning consistent representations for missing data from multiple views has emerged as a key focus in the realm of Incomplete Multi-view Representation Learning (IMRL). In recent years, various strategies such as subspace learning, matrix decomposition, and deep learning have been harnessed to develop numerous IMRL methods. In this paper, our primary research revolves around IMRL, with a particular emphasis on addressing two main challenges. First, we investigate how to effectively integrate intra-view similarity and contextual structure into a unified framework. Second, we explore how to effectively facilitate information exchange and fusion across multiple views. To tackle these issues, we propose a deep learning approach, the Structural Contrastive Auto-encoder (SCAE). SCAE comprises two major components: Intra-View Structural Representation Learning and Inter-View Contrastive Representation Learning. The former captures intra-view similarity by minimizing the Dirichlet energy of the feature matrix, while also applying spatial dispersion regularization to capture intra-view contextual structure. The latter encourages maximizing the mutual information of inter-view representations, facilitating information exchange and fusion across views. Experimental results demonstrate the efficacy of our approach in significantly enhancing model accuracy and robustly addressing IMRL problems. The code is available at https://github.com/limengran98/SCAE.
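As a concrete reference for the intra-view structural term, the sketch below computes the Dirichlet energy of a feature matrix over a given adjacency graph; the graph construction and how the term is weighted in training are assumptions, and the inter-view contrastive part is omitted.

```python
# A minimal sketch of the Dirichlet-energy term, which penalises feature differences
# between connected samples within a view: 0.5 * sum_{ij} A_ij * ||f_i - f_j||^2,
# equivalently trace(F^T L F) with graph Laplacian L.

import numpy as np

def dirichlet_energy(features: np.ndarray, adjacency: np.ndarray) -> float:
    """features: (N, D) per-sample features; adjacency: (N, N) symmetric graph weights."""
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    return float(np.trace(features.T @ laplacian @ features))

# Example: three samples, two of them connected; the energy stays small because
# only the similar pair is linked.
# F = np.array([[0.0, 1.0], [0.1, 0.9], [5.0, 5.0]])
# A = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)
# print(dirichlet_energy(F, A))
```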

Towards Long Form Audio-visual Video Understanding
IF 5.1 | Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-07 | DOI: 10.1145/3672079
Wenxuan Hou, Guangyao Li, Yapeng Tian, Di Hu

We live in a world filled with never-ending streams of multimodal information. As a more natural recording of real scenarios, long form audio-visual videos are expected to serve as an important bridge for better exploring and understanding the world. In this paper, we propose the multisensory temporal event localization task in long form videos and strive to tackle the associated challenges. To facilitate this study, we first collect a large-scale Long Form Audio-visual Video (LFAV) dataset with 5,175 videos and an average video length of 210 seconds. Each collected video is elaborately annotated with diverse modality-aware events in a long-range temporal sequence. We then propose an event-centric framework for localizing multisensory events as well as understanding their relations in long form videos. It includes three phases at different levels: a snippet prediction phase to learn snippet features, an event extraction phase to extract event-level features, and an event interaction phase to study event relations. Experiments demonstrate that the proposed method, utilizing the new LFAV dataset, exhibits considerable effectiveness in localizing multiple modality-aware events within long form videos. We hope that our newly collected dataset and novel approach serve as a cornerstone for furthering research in the realm of long form audio-visual video understanding. Project page: https://gewu-lab.github.io/LFAV/.
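To illustrate the snippet-to-event idea behind the first two phases, the sketch below thresholds per-snippet scores and merges consecutive positive snippets into event segments; the threshold, snippet length, and scoring model are illustrative assumptions rather than the paper's architecture.

```python
# A minimal sketch of turning per-snippet event scores into temporal event segments
# by merging consecutive above-threshold snippets.

from typing import List, Tuple

def snippets_to_events(scores: List[float],
                       snippet_len_s: float = 1.0,
                       threshold: float = 0.5) -> List[Tuple[float, float]]:
    """Merge consecutive above-threshold snippets into (start_s, end_s) events."""
    events, start = [], None
    for i, s in enumerate(scores + [0.0]):          # sentinel flushes the last run
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            events.append((start * snippet_len_s, i * snippet_len_s))
            start = None
    return events

# Example: print(snippets_to_events([0.1, 0.8, 0.9, 0.2, 0.7]))
# -> [(1.0, 3.0), (4.0, 5.0)]
```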

Style Variable and Irrelevant Learning for Generalizable Person Re-identification
IF 5.1 | Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-06-06 | DOI: 10.1145/3671003
Kai Lv, Haobo Chen, Chuyang Zhao, Kai Tu, Junru Chen, Yadong Li, Boxun Li, Youfang Lin

Domain Generalization person Re-identification (DG-ReID) has gained much attention recently due to the poor performance of supervised re-identification on unseen domains. The goal of domain generalization is to develop a model that is insensitive to domain bias and can perform well across different domains. In this paper, we conduct experiments to verify the importance of style factors in domain bias. Specifically, the experiments affirm that style bias across different domains significantly contributes to domain bias. Based on this observation, we propose Style Variable and Irrelevant Learning (SVIL) to eliminate the influence of style factors on the model. Specifically, we employ a Style Jitter Module (SJM) that enhances the style diversity of a specific source domain and reduces the style differences among various source domains. This allows the model to focus on identity-relevant information and be robust to style changes. We also integrate the SJM module with a meta-learning algorithm to further enhance the model's generalization ability. Notably, our SJM module is easy to implement and does not add any inference cost. Our extensive experiments demonstrate the effectiveness of our approach, which outperforms existing methods on DG-ReID benchmarks.
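The abstract does not detail SJM's internals; one common way to realize feature-level style jitter is to perturb channel-wise statistics, as in the hedged sketch below, which may differ from the paper's exact module.

```python
# A hedged sketch of feature-level style jitter: re-normalise (B, C, H, W) features
# with randomly perturbed channel-wise mean and standard deviation. The perturbation
# scheme and noise scale are assumptions, not SVIL's published design.

import torch

def style_jitter(x: torch.Tensor, noise_std: float = 0.1) -> torch.Tensor:
    """Return features whose style statistics (mean/std) are randomly jittered."""
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True) + 1e-6
    normalized = (x - mu) / sigma
    # Jitter only the style statistics; the normalised content is left untouched.
    mu_j = mu * (1 + noise_std * torch.randn_like(mu))
    sigma_j = sigma * (1 + noise_std * torch.randn_like(sigma))
    return normalized * sigma_j + mu_j

# Example: feats = style_jitter(torch.randn(8, 64, 32, 32))
```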

Towards Attribute-Controlled Fashion Image Captioning
IF 5.1 | Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-06-05 | DOI: 10.1145/3671000
Chen Cai, Kim-Hui Yap, Suchen Wang

Fashion image captioning is a critical task in the fashion industry that aims to automatically generate product descriptions for fashion items. However, existing fashion image captioning models predict a fixed caption for a particular fashion item once deployed, which does not cater to unique preferences. We explore a controllable way of fashion image captioning that allows users to specify a few semantic attributes to guide the caption generation. Our approach utilizes semantic attributes as a control signal, giving users the ability to specify particular fashion attributes (e.g., stitch, knit, sleeve, etc.) and styles (e.g., cool, classic, fresh, etc.) that they want the model to incorporate when generating captions. By providing this level of customization, our approach creates more personalized and targeted captions that suit individual preferences. To evaluate the effectiveness of our proposed approach, we clean, filter, and assemble a new fashion image caption dataset called FACAD170K from the current FACAD dataset. This dataset facilitates learning and enables us to investigate the effectiveness of our approach. Our results demonstrate that our proposed approach outperforms existing fashion image captioning models as well as conventional captioning methods. In addition, we validate the effectiveness of the proposed method on the MSCOCO and Flickr30K captioning datasets and achieve competitive performance.
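As an illustration of attribute control, the sketch below serializes user-chosen attributes and styles into a control prefix that conditions a captioner; the token format and the `captioner` callable are assumptions, not the paper's interface.

```python
# A minimal sketch of attribute-controlled captioning: control signals are turned
# into a text prefix that is passed alongside the image features.

from typing import List

def build_control_prefix(attributes: List[str], styles: List[str]) -> str:
    """Serialise control signals, e.g. '[attr] knit sleeve [style] cool classic'."""
    return "[attr] " + " ".join(attributes) + " [style] " + " ".join(styles)

def controlled_caption(image_features, attributes: List[str], styles: List[str],
                       captioner) -> str:
    """captioner is any callable taking (image_features, control_text) -> caption."""
    return captioner(image_features, build_control_prefix(attributes, styles))

# Example (with a hypothetical captioner):
# caption = controlled_caption(feats, ["knit", "sleeve"], ["cool"], my_captioner)
```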

VoiceStyle: Voice-based Face Generation Via Cross-modal Prototype Contrastive Learning
IF 5.1 | Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-06-05 | DOI: 10.1145/3671002
Wuyang Chen, Boqing Zhu, Kele Xu, Yong Dou, Dawei Feng

Can we predict a person’s appearance solely based on their voice? This paper explores this question by focusing on generating a face from an unheard voice segment. Our proposed method, VoiceStyle, combines cross-modal representation learning with generation modeling, enabling us to incorporate voice semantic cues into the generated face. In the first stage, we introduce cross-modal prototype contrastive learning (CMPC) to establish the association between voice and face. Recognizing the presence of false negative and deviate positive instances in real-world unlabeled data, we not only use voice-face pairs in the same video but also construct additional semantic positive pairs through unsupervised clustering, enhancing the learning process. Moreover, we recalibrate instances based on their similarity to cluster centers in the other modality. In the second stage, we harness the powerful generative capabilities of StyleGAN to produce faces. We optimize the latent code in StyleGAN’s latent space, guided by the learned voice-face alignment. To address the importance of selecting an appropriate starting point for optimization, we aim to automatically find an optimal starting point by utilizing the face prototype derived from the voice input. The entire pipeline can be implemented in a self-supervised manner, eliminating the need for manually labeled annotations. Through extensive experiments, we demonstrate the effectiveness and performance of our VoiceStyle method in both cross-modal representation learning and voice-based face generation.
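To make the cross-modal objective concrete, the sketch below computes a symmetric InfoNCE-style loss between matched voice and face embeddings; the prototype clustering and recalibration steps are omitted, and the temperature is an assumed value.

```python
# A minimal sketch of a symmetric voice-face contrastive loss: row i of each batch
# is a matched pair, and mismatched rows act as negatives.

import torch
import torch.nn.functional as F

def voice_face_infonce(voice_emb: torch.Tensor,
                       face_emb: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """voice_emb, face_emb: (B, D) embeddings where row i of each is a matched pair."""
    v = F.normalize(voice_emb, dim=1)
    f = F.normalize(face_emb, dim=1)
    logits = v @ f.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric loss: voice -> face and face -> voice.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example: loss = voice_face_infonce(torch.randn(16, 256), torch.randn(16, 256))
```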

ANAGL: A Noise-resistant and Anti-sparse Graph Learning for micro-video recommendation
IF 5.1 | Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-06-03 | DOI: 10.1145/3670407
Jingwei Ma, Kangkang Bian, Yang Xu, Lei Zhu

In recent years, Graph Convolutional Networks (GCNs) have seen widespread utilization within micro-video recommendation systems, facilitating the understanding of user preferences through interactions with micro-videos. Despite the commendable performance exhibited by GCN-based methodologies, several persistent issues demand further scrutiny. Primarily, most user-micro-video interactions involve implicit behaviors, such as clicks or abstentions, which may inadvertently capture irrelevant micro-video content, thereby introducing significant noise (false touches, low watch-ratio, low ratings) into users’ histories. Consequently, this noise undermines the efficacy of micro-video recommendations. Moreover, the abundance of micro-videos has resulted in fewer interactions between users and micro-video content. To tackle these challenges, we propose a noise-resistant and anti-sparse graph learning framework for micro-video recommendation. Initially, we construct a denoiser that leverages implicit multi-attribute information (e.g., watch-ratio, timestamp, ratings, etc.) to filter noisy data from user interaction histories. This process yields high-fidelity micro-video information, enabling a more precise modeling of users’ feature preferences. Subsequently, we employ a multi-view reconstruction approach and utilize cross-view self-supervised learning to gain insights into user and micro-video features. This strategic approach effectively mitigates the issue of data sparsity. Extensive experiments conducted on two publicly available micro-video recommendation datasets validate the effectiveness of our proposed method. For in-depth details and access to the code, please refer to our repository at “https://github.com/kbk12/ANAGL.git.”
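As an illustration of the denoising step, the sketch below filters interaction records using implicit multi-attribute signals; the field names and thresholds are assumptions rather than ANAGL's actual rules.

```python
# A minimal sketch of interaction denoising: records whose implicit signals
# (watch-ratio, rating) fall below thresholds are dropped before graph construction.

from typing import Dict, List

def denoise_interactions(interactions: List[Dict],
                         min_watch_ratio: float = 0.3,
                         min_rating: float = 2.0) -> List[Dict]:
    """Keep only interactions that look like genuine engagement."""
    return [
        it for it in interactions
        if it.get("watch_ratio", 0.0) >= min_watch_ratio
        and it.get("rating", 5.0) >= min_rating
    ]

# Example:
# logs = [{"user": 1, "item": 7, "watch_ratio": 0.05, "rating": 1.0},
#         {"user": 1, "item": 9, "watch_ratio": 0.8,  "rating": 4.5}]
# print(denoise_interactions(logs))   # only the second record survives
```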

Multi Fine-Grained Fusion Network for Depression Detection
IF 5.1 | Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-06-01 | DOI: 10.1145/3665247
Li Zhou, Zhenyu Liu, Yutong Li, Yuchi Duan, Huimin Yu, Bin Hu

Depression is an illness that involves emotional and mental health. Currently, depression detection through interviews is the most common approach. With advances in natural language processing and sentiment analysis, automated interview-based depression detection has gained strong support. However, current multimodal depression detection models fail to adequately capture the fine-grained features of depressive behaviors, making it difficult for the models to accurately characterize subtle changes in depressive symptoms. To address this problem, we propose a Multi Fine-Grained Fusion Network (MFFNet). The core idea of this model is to extract and fuse information from feature pairs at different scales through a Multi-Scale Fastformer (MSfastformer), and then use a Recurrent Pyramid Model (RPM) to integrate features at different resolutions, promoting the interaction of multi-level information. Through the interaction of multi-scale and multi-resolution features, it aims to explore richer feature representations. To validate the effectiveness of our proposed MFFNet model, we conduct experiments on two depression interview datasets. The experimental results show that the MFFNet model performs better in depression detection than other benchmark multimodal models.
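To illustrate the multi-scale fusion idea, the sketch below pools the same sequence at several temporal scales and concatenates the scale-wise summaries; the attention internals of the Multi-Scale Fastformer are abstracted away, and the scales are assumed values.

```python
# A minimal sketch of multi-scale feature fusion over a (B, T, D) sequence:
# the sequence is average-pooled at several temporal scales and the scale-wise
# summaries are concatenated into one fused vector.

import torch
import torch.nn.functional as F

def multi_scale_fuse(x: torch.Tensor, scales=(1, 2, 4)) -> torch.Tensor:
    """x: (B, T, D) sequence features -> (B, len(scales) * D) fused representation."""
    pooled = []
    for s in scales:
        # Average-pool along time with window s, then summarise the whole sequence.
        xs = F.avg_pool1d(x.transpose(1, 2), kernel_size=s, stride=s)  # (B, D, T//s)
        pooled.append(xs.mean(dim=2))                                  # (B, D)
    return torch.cat(pooled, dim=1)

# Example: fused = multi_scale_fuse(torch.randn(4, 100, 64))   # -> shape (4, 192)
```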
