Multimodal PEAR Chain-of-Thought Reasoning for Multimodal Sentiment Analysis
Yan Li, Xiangyuan Lan, Haifeng Chen, Ke Lu, Dongmei Jiang
Multimodal sentiment analysis aims to predict sentiments from multimodal signals such as audio, video, and text. Existing methods often rely on Pre-trained Language Models (PLMs) to extract semantic information from textual data, but lack an in-depth understanding of the logical relationships within the text modality. This paper introduces Multimodal PEAR Chain-of-Thought (MM-PEAR-CoT) reasoning for multimodal sentiment analysis. Inspired by the human thought process when solving complex problems, we first propose the PEAR (Preliminaries, quEstion, Answer, Reason) chain-of-thought prompt, which induces Large Language Models (LLMs) to generate text-based reasoning processes and zero-shot sentiment predictions. However, text-based chain-of-thought reasoning is not always reliable and might contain irrational steps due to the hallucinations of large language models. To address this, we further design the Cross-Modal Filtering and Fusion (CMFF) module: the filtering submodule uses the audio and visual modalities to suppress irrational steps in the chain of thought, while the fusion submodule integrates high-level reasoning information and cross-modal complementary information during semantic representation learning. Experimental results on two multimodal sentiment analysis benchmark datasets show that high-level reasoning information helps learn discriminative text representations, and that cross-modal complementary information prevents the model from being misled by unreasonable steps in the chain of thought. MM-PEAR-CoT achieves the best results on both datasets, with improvements of 2.2% and 1.7% in binary classification accuracy on CMU-MOSI and CMU-MOSEI, respectively. To the best of our knowledge, this is the first study to apply chain-of-thought reasoning to multimodal sentiment analysis.
{"title":"Multimodal PEAR Chain-of-Thought Reasoning for Multimodal Sentiment Analysis","authors":"Yan Li, Xiangyuan Lan, Haifeng Chen, Ke Lu, Dongmei Jiang","doi":"10.1145/3672398","DOIUrl":"https://doi.org/10.1145/3672398","url":null,"abstract":"<p>Multimodal sentiment analysis aims to predict sentiments from multimodal signals such as audio, video, and text. Existing methods often rely on Pre-trained Language Models (PLMs) to extract semantic information from textual data, <b>lacking an in-depth understanding of the logical relationships within the text modality</b>. This paper introduces the Multimodal PEAR Chain-of-Thought (MM-PEAR-CoT) reasoning for multimodal sentiment analysis. Inspired by the human thought process when solving complex problems, the PEAR (Preliminaries, quEstion, Answer, Reason) chain-of-thought prompt is first proposed to induce Large Language Models (LLMs) to generate text-based reasoning processes and zero-shot sentiment prediction results. However, <b>text-based chain-of-thought reasoning is not always reliable and might contain irrational steps due to the hallucinations of large language models</b>. To address this, we further design the Cross-Modal Filtering and Fusion (CMFF) module. The filtering submodule utilizes audio and visual modalities to suppress irrational steps in the chain of thought, while the fusion submodule integrates high-level reasoning information and cross-modal complementary information in the process of semantic representation learning. Experimental results on two multimodal sentiment analysis benchmark datasets show that high-level reasoning information can help learn discriminative text representation, and cross-modal complementary information can avoid misleading by unreasonable steps in the chain of thought. MM-PEAR-CoT achieves the best results on both datasets, with improvements of 2.2% and 1.7% in binary classification accuracy on the CMU-MOSI and CMU-MOSEI datasets, respectively. To the best of our knowledge, this is the first study to apply chain-of-thought reasoning to multimodal sentiment analysis.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"144 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141505725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Mustang: Improving QoE for Real-Time Video in Cellular Networks by Masking Jitter
Encheng Yu, Jianer Zhou, Zhenyu Li, Gareth Tyson, Weichao Li, Xinyi Zhang, Zhiwei Xu, Gaogang Xie
The advent of 5G and interactive live broadcasting has led to a growing trend of people preferring real-time interactive video services on mobile devices, particularly mobile phones. In this work, we measure the performance of Google congestion control (GCC), the default congestion control algorithm for Web Real-Time Communications (WebRTC), in cellular networks. Our measurements show that GCC sometimes makes bitrate decisions that are harmful to quality of experience (QoE) in cellular networks with high jitter. We further find that the frame delivery time (FDT) in the player can be used to mask network jitter and maintain QoE, and that the receiving rate reflects network congestion better than RTT does in cellular networks. Based on these measurements and findings, we propose Mustang, an algorithm designed to overcome jitter in cellular networks. Mustang feeds the FDT and the receiving rate back to the sender, which then adjusts its sending rate based on this information to guarantee QoE. We have implemented Mustang in WebRTC and evaluated it in both emulated and real cellular networks. The experimental results show that Mustang improves both the QoS and QoE performance of WebRTC: it increases the sending rate by 72.1% with similar RTT and packet loss compared with GCC, while improving QoE by about 30%.
{"title":"Mustang: Improving QoE for Real-Time Video in Cellular Networks by Masking Jitter","authors":"Encheng Yu, Jianer Zhou, Zhenyu Li, Gareth Tyson, Weichao Li, Xinyi Zhang, Zhiwei Xu, Gaogang Xie","doi":"10.1145/3672399","DOIUrl":"https://doi.org/10.1145/3672399","url":null,"abstract":"<p>The advent of 5G and interactive live broadcasting has led to a growing trend of people preferring real-time interactive video services on mobile devices, particularly mobile phones. In this work, we measure the performance of Google congestion control (GCC) in cellular networks, which is the default congestion control algorithm for Web Real-Time Communications (WebRTC). Our measurements show that GCC sometimes makes bitrate decisions which are harmful to quality of experience (QoE) in cellular networks with high jitter. We further find that the frame delivery time (FDT) in the player can mitigate network jitter and maintain QoE. Moreover, the receiving rate is better to reflect the network congestion than RTT in cellular networks. Based on these measurements and findings, we propose Mustang, an algorithm designed to overcome the jitter in cellular networks. Mustang makes use of the FDT and receiving rate as feedback information to the sender. Then the sender adjusts its sending rate based on the information to guarantee QoE. We have implemented Mustang in WebRTC and evaluated it in both emulated and real cellular networks. The experimental results show that Mustang can improve WebRTC’s both QoS and QoE performance. For QoS, Mustang increases the sending rate by 72.1% and has similar RTT and packet loss when compared with GCC, while it is about 30% better for QoE.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"15 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141505726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Mix-DDPM: Enhancing Diffusion Models through Fitting Mixture Noise with Global Stochastic Offset
Hanzhang Wang, Deming Zhai, Xiong Zhou, Junjun Jiang, Xianming Liu
Denoising diffusion probabilistic models (DDPMs) have shown impressive performance in various domains as a class of deep generative models. In this paper, we introduce the Mixture noise-based DDPM (Mix-DDPM), which models the Markov diffusion posterior as a Gaussian mixture. Specifically, Mix-DDPM randomly selects a Gaussian component and then adds the corresponding Gaussian noise, which we show to be a more efficient way to perturb the signal into a simple known distribution. We further define the reverse probabilistic model as a parameterized Gaussian mixture kernel. Because the KL divergence between Gaussian mixture models is intractable, we derive a variational bound on the likelihood, offering a concise formulation for optimizing the denoising model and valuable insights for designing sampling strategies. Our theoretical derivation highlights that Mix-DDPM only needs to shift the image by a global stochastic offset in both the diffusion and reverse processes, which can be implemented with just a few lines of code. The global stochastic offset effectively fits a Gaussian mixture distribution, enhancing the degrees of freedom of the entire diffusion model. Furthermore, we present three streamlined sampling strategies that interface with diverse fast dedicated solvers for diffusion ordinary differential equations, boosting the efficacy of image representation in the sampling phase and alleviating the issue of slow generation, thereby enhancing both efficiency and accuracy. Extensive experiments on benchmark datasets demonstrate the effectiveness of Mix-DDPM and its superiority over the original DDPM.
{"title":"Mix-DDPM: Enhancing Diffusion Models through Fitting Mixture Noise with Global Stochastic Offset","authors":"Hanzhang Wang, Deming Zhai, Xiong Zhou, Junjun Jiang, Xianming Liu","doi":"10.1145/3672080","DOIUrl":"https://doi.org/10.1145/3672080","url":null,"abstract":"<p>Denoising diffusion probabilistic models (DDPM) have shown impressive performance in various domains as a class of deep generative models. In this paper, we introduce the Mixture noise-based DDPM (Mix-DDPM), which considers the Markov diffusion posterior as a Gaussian mixture model. Specifically, Mix-DDPM randomly selects a Gaussian component and then adds the chosen Gaussian noise, which can be demonstrated as a more efficient way to perturb the signals into a simple known distribution. We further define the reverse probabilistic model as a parameterized Gaussian mixture kernel. Due to the intractability in calculating the KL divergence between Gaussian mixture models, we derive a variational bound to maximize the likelihood, offering a concise formulation for optimizing the denoising model and valuable insights for designing the sampling strategies. Our theoretical derivation highlights that <i>Mix-DDPM need only shift image which requires the inclusion of a global stochastic offset in both the diffusion and reverse processes</i>, which can be efficiently implemented with just several lines of code. The global stochastic offset effectively fits a Gaussian mixture distribution enhancing the degrees of freedom of the entire diffusion model. Furthermore, we present three streamlined sampling strategies that interface with diverse fast dedicated solvers for diffusion ordinary differential equations, boosting the efficacy of image representation in the sampling phase and alleviating the issue of slow generation speed, thereby enhancing both efficiency and accuracy. Extensive experiments on benchmark datasets demonstrate the effectiveness of Mix-DDPM and its superiority over the original DDPM.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"11 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141513722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

SCAE: Structural Contrastive Auto-encoder for Incomplete Multi-view Representation Learning
Mengran Li, Ronghui Zhang, Yong Zhang, Xinglin Piao, Shiyu Zhao, Baocai Yin
Describing an object from multiple perspectives often leads to incomplete data representation. Consequently, learning consistent representations for missing data from multiple views has emerged as a key focus in the realm of Incomplete Multi-view Representation Learning (IMRL). In recent years, various strategies such as subspace learning, matrix decomposition, and deep learning have been harnessed to develop numerous IMRL methods. In this paper, our primary research revolves around IMRL, with a particular emphasis on addressing two main challenges. Firstly, we delve into the effective integration of intra-view similarity and contextual structure into a unified framework. Secondly, we explore the effective facilitation of information exchange and fusion across multiple views. To tackle these issues, we propose a deep learning approach known as Structural Contrastive Auto-encoder (SCAE) to solve the challenges of IMRL. SCAE comprises two major components: Intra-View Structural Representation Learning and Inter-View Contrastive Representation Learning. The former involves capturing intra-view similarity by minimizing the Dirichlet energy of the feature matrix, while also applying spatial dispersion regularization to capture intra-view contextual structure. The latter encourages maximizing the mutual information of inter-view representations, facilitating information exchange and fusion across views. Experimental results demonstrate the efficacy of our approach in significantly enhancing model accuracy and robustly addressing IMRL problems. The code is available at https://github.com/limengran98/SCAE.
{"title":"SCAE: Structural Contrastive Auto-encoder for Incomplete Multi-view Representation Learning","authors":"Mengran Li, Ronghui Zhang, Yong Zhang, Xinglin Piao, Shiyu Zhao, Baocai Yin","doi":"10.1145/3672078","DOIUrl":"https://doi.org/10.1145/3672078","url":null,"abstract":"<p>Describing an object from multiple perspectives often leads to incomplete data representation. Consequently, learning consistent representations for missing data from multiple views has emerged as a key focus in the realm of Incomplete Multi-view Representation Learning (IMRL). In recent years, various strategies such as subspace learning, matrix decomposition, and deep learning have been harnessed to develop numerous IMRL methods. In this paper, our primary research revolves around IMRL, with a particular emphasis on addressing two main challenges. Firstly, we delve into the effective integration of intra-view similarity and contextual structure into a unified framework. Secondly, we explore the effective facilitation of information exchange and fusion across multiple views. To tackle these issues, we propose a deep learning approach known as Structural Contrastive Auto-encoder (SCAE) to solve the challenges of IMRL. SCAE comprises two major components: Intra-View Structural Representation Learning and Inter-View Contrastive Representation Learning. The former involves capturing intra-view similarity by minimizing the Dirichlet energy of the feature matrix, while also applying spatial dispersion regularization to capture intra-view contextual structure. The latter encourages maximizing the mutual information of inter-view representations, facilitating information exchange and fusion across views. Experimental results demonstrate the efficacy of our approach in significantly enhancing model accuracy and robustly addressing IMRL problems. The code is available at https://github.com/limengran98/SCAE.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"26 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141513720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Towards Long Form Audio-visual Video Understanding
Wenxuan Hou, Guangyao Li, Yapeng Tian, Di Hu
We live in a world filled with never-ending streams of multimodal information. As a more natural recording of real scenarios, long form audio-visual videos are expected to serve as an important bridge for better exploring and understanding the world. In this paper, we propose the multisensory temporal event localization task in long form videos and strive to tackle the associated challenges. To facilitate this study, we first collect a large-scale Long Form Audio-visual Video (LFAV) dataset with 5,175 videos and an average video length of 210 seconds. Each collected video is elaborately annotated with diverse modality-aware events in a long-range temporal sequence. We then propose an event-centric framework for localizing multisensory events as well as understanding their relations in long form videos. It comprises three phases at different levels: a snippet prediction phase to learn snippet features, an event extraction phase to extract event-level features, and an event interaction phase to study event relations. Experiments demonstrate that the proposed method, utilizing the new LFAV dataset, exhibits considerable effectiveness in localizing multiple modality-aware events within long form videos. We hope that our newly collected dataset and novel approach serve as a cornerstone for furthering research in the realm of long form audio-visual video understanding. Project page: https://gewu-lab.github.io/LFAV/.
{"title":"Towards Long Form Audio-visual Video Understanding","authors":"Wenxuan Hou, Guangyao Li, Yapeng Tian, Di Hu","doi":"10.1145/3672079","DOIUrl":"https://doi.org/10.1145/3672079","url":null,"abstract":"<p>We live in a world filled with never-ending streams of multimodal information. As a more natural recording of the real scenario, long form audio-visual videos are expected as an important bridge for better exploring and understanding the world. In this paper, we propose the multisensory temporal event localization task in long form videos and strive to tackle the associated challenges. To facilitate this study, we first collect a large-scale Long Form Audio-visual Video (LFAV) dataset with 5,175 videos and an average video length of 210 seconds. Each collected video is elaborately annotated with diversified modality-aware events, in a long-range temporal sequence. We then propose an event-centric framework for localizing multisensory events as well as understanding their relations in long form videos. It includes three phases in different levels: snippet prediction phase to learn snippet features, event extraction phase to extract event-level features, and event interaction phase to study event relations. Experiments demonstrate that the proposed method, utilizing the new LFAV dataset, exhibits considerable effectiveness in localizing multiple modality-aware events within long form videos. We hope that our newly collected dataset and novel approach serve as a cornerstone for furthering research in the realm of long form audio-visual video understanding. Project page: https://gewu-lab.github.io/LFAV/.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"134 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141513721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Style Variable and Irrelevant Learning for Generalizable Person Re-identification
Kai Lv, Haobo Chen, Chuyang Zhao, Kai Tu, Junru Chen, Yadong Li, Boxun Li, Youfang Lin
Domain Generalization person Re-identification (DG-ReID) has gained much attention recently due to the poor performance of supervised re-identification on unseen domains. The goal of domain generalization is to develop a model that is insensitive to domain bias and can perform well across different domains. In this paper, we conduct experiments to verify the importance of style factors in domain bias; in particular, these experiments confirm that style bias across different domains contributes significantly to domain bias. Based on this observation, we propose Style Variable and Irrelevant Learning (SVIL) to eliminate the influence of style factors on the model. Specifically, we employ a Style Jitter Module (SJM) that enhances the style diversity of a specific source domain and reduces the style differences among the various source domains. This allows the model to focus on identity-relevant information and to be robust to style changes. We also integrate the SJM with a meta-learning algorithm to further enhance the model’s generalization ability. Notably, the SJM is easy to implement and adds no inference cost. Our extensive experiments demonstrate the effectiveness of our approach, which outperforms existing methods on DG-ReID benchmarks.
{"title":"Style Variable and Irrelevant Learning for Generalizable Person Re-identification","authors":"Kai Lv, Haobo Chen, Chuyang Zhao, Kai Tu, Junru Chen, Yadong Li, Boxun Li, Youfang Lin","doi":"10.1145/3671003","DOIUrl":"https://doi.org/10.1145/3671003","url":null,"abstract":"<p>Domain Generalization person Re-identification (DG-ReID) has gained much attention recently due to the poor performance of supervised re-identification on unseen domains. The goal of domain generalization is to develop a model that is insensitive to domain bias and can perform well across different domains. In this paper, We conduct experiments to verify the importance of style factors in domain bias. Specifically, the experiments are to affirm that style bias across different domains significantly contributes to domain bias. Based on this observation, we propose Style Variable and Irrelevant Learning (SVIL) to eliminate the influence of style factors on the model. Specifically, we employ a Style Jitter Module (SJM) that enhances the style diversity of a specific source domain and reduces the style differences among various source domains. This allows the model to focus on identity-relevant information and be robust to style changes. We also integrate the SJM module with a meta-learning algorithm to further enhance the model’s generalization ability. Notably, our SJM module is easy to implement and does not add any inference cost. Our extensive experiments demonstrate the effectiveness of our approach, which outperforms existing methods on DG-ReID benchmarks.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"26 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141513723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Towards Attribute-Controlled Fashion Image Captioning
Chen Cai, Kim-Hui Yap, Suchen Wang
Fashion image captioning is a critical task in the fashion industry that aims to automatically generate product descriptions for fashion items. However, existing fashion image captioning models predict a fixed caption for a particular fashion item once deployed, which does not cater to individual preferences. We explore a controllable approach to fashion image captioning that allows users to specify a few semantic attributes to guide caption generation. Our approach utilizes semantic attributes as a control signal, giving users the ability to specify particular fashion attributes (e.g., stitch, knit, sleeve) and styles (e.g., cool, classic, fresh) that they want the model to incorporate when generating captions. By providing this level of customization, our approach creates more personalized and targeted captions that suit individual preferences. To evaluate the effectiveness of our proposed approach, we clean, filter, and assemble a new fashion image caption dataset called FACAD170K from the current FACAD dataset. This dataset facilitates learning and enables us to investigate the effectiveness of our approach. Our results demonstrate that our proposed approach outperforms existing fashion image captioning models as well as conventional captioning methods. In addition, we validate the effectiveness of the proposed method on the MSCOCO and Flickr30K captioning datasets and achieve competitive performance.
{"title":"Towards Attribute-Controlled Fashion Image Captioning","authors":"Chen Cai, Kim-Hui Yap, Suchen Wang","doi":"10.1145/3671000","DOIUrl":"https://doi.org/10.1145/3671000","url":null,"abstract":"<p>Fashion image captioning is a critical task in the fashion industry that aims to automatically generate product descriptions for fashion items. However, existing fashion image captioning models predict a fixed caption for a particular fashion item once deployed, which does not cater to unique preferences. We explore a controllable way of fashion image captioning that allows the users to specify a few semantic attributes to guide the caption generation. Our approach utilizes semantic attributes as a control signal, giving users the ability to specify particular fashion attributes (e.g., stitch, knit, sleeve, etc.) and styles (e.g., cool, classic, fresh, etc.) that they want the model to incorporate when generating captions. By providing this level of customization, our approach creates more personalized and targeted captions that suit individual preferences. To evaluate the effectiveness of our proposed approach, we clean, filter, and assemble a new fashion image caption dataset called FACAD170K from the current FACAD dataset. This dataset facilitates learning and enables us to investigate the effectiveness of our approach. Our results demonstrate that our proposed approach outperforms existing fashion image captioning models as well as conventional captioning methods. Besides, we further validate the effectiveness of the proposed method on the MSCOCO and Flickr30K captioning datasets and achieve competitive performance.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"43 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141252683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

VoiceStyle: Voice-based Face Generation Via Cross-modal Prototype Contrastive Learning
Wuyang Chen, Boqing Zhu, Kele Xu, Yong Dou, Dawei Feng
Can we predict a person’s appearance solely from their voice? This paper explores this question by focusing on generating a face from an unheard voice segment. Our proposed method, VoiceStyle, combines cross-modal representation learning with generative modeling, enabling us to incorporate voice semantic cues into the generated face. In the first stage, we introduce cross-modal prototype contrastive learning (CMPC) to establish the association between voice and face. Recognizing the presence of false-negative and deviated-positive instances in real-world unlabeled data, we not only use voice-face pairs from the same video but also construct additional semantic positive pairs through unsupervised clustering, enhancing the learning process. Moreover, we recalibrate instances based on their similarity to cluster centers in the other modality. In the second stage, we harness the powerful generative capabilities of StyleGAN to produce faces. We optimize the latent code in StyleGAN’s latent space, guided by the learned voice-face alignment. Because selecting an appropriate starting point for this optimization is important, we automatically find one by utilizing the face prototype derived from the voice input. The entire pipeline can be implemented in a self-supervised manner, eliminating the need for manually labeled annotations. Through extensive experiments, we demonstrate the effectiveness and performance of VoiceStyle in both cross-modal representation learning and voice-based face generation.
{"title":"VoiceStyle: Voice-based Face Generation Via Cross-modal Prototype Contrastive Learning","authors":"Wuyang Chen, Boqing Zhu, Kele Xu, Yong Dou, Dawei Feng","doi":"10.1145/3671002","DOIUrl":"https://doi.org/10.1145/3671002","url":null,"abstract":"<p>Can we predict a person’s appearance solely based on their voice? This paper explores this question by focusing on generating a face from an unheard voice segment. Our proposed method, VoiceStyle, combines cross-modal representation learning with generation modeling, enabling us to incorporate voice semantic cues into the generated face. In the first stage, we introduce cross-modal prototype contrastive learning (CMPC) to establish the association between voice and face. Recognizing the presence of false negative and deviate positive instances in real-world unlabeled data, we not only use voice-face pairs in the same video but also construct additional semantic positive pairs through unsupervised clustering, enhancing the learning process. Moreover, we recalibrate instances based on their similarity to cluster centers in the other modality. In the second stage, we harness the powerful generative capabilities of StyleGAN to produce faces. We optimize the latent code in StyleGAN’s latent space, guided by the learned voice-face alignment. To address the importance of selecting an appropriate starting point for optimization, we aim to automatically find an optimal starting point by utilizing the face prototype derived from the voice input. The entire pipeline can be implemented in a self-supervised manner, eliminating the need for manually labeled annotations. Through extensive experiments, we demonstrate the effectiveness and performance of our VoiceStyle method in both cross-modal representation learning and voice-based face generation.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"67 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141252994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

ANAGL: A Noise-resistant and Anti-sparse Graph Learning for Micro-video Recommendation
Jingwei Ma, Kangkang Bian, Yang Xu, Lei Zhu
In recent years, Graph Convolutional Networks (GCNs) have seen widespread utilization within micro-video recommendation systems, facilitating the understanding of user preferences through interactions with micro-videos. Despite the commendable performance exhibited by GCN-based methodologies, several persistent issues demand further scrutiny. Primarily, most user-micro-video interactions involve implicit behaviors, such as clicks or abstentions, which may inadvertently capture irrelevant micro-video content, thereby introducing significant noise (false touches, low watch-ratio, low ratings) into users’ histories. Consequently, this noise undermines the efficacy of micro-video recommendations. Moreover, the abundance of micro-videos has resulted in fewer interactions between users and micro-video content. To tackle these challenges, we propose a noise-resistant and anti-sparse graph learning framework for micro-video recommendation. Initially, we construct a denoiser that leverages implicit multi-attribute information (e.g., watch-ratio, timestamp, ratings) to filter noisy data from user interaction histories. This process yields high-fidelity micro-video information, enabling a more precise modeling of users’ feature preferences. Subsequently, we employ a multi-view reconstruction approach and utilize cross-view self-supervised learning to gain insights into user and micro-video features. This strategic approach effectively mitigates the issue of data sparsity. Extensive experiments conducted on two publicly available micro-video recommendation datasets validate the effectiveness of our proposed method. For in-depth details and access to the code, please refer to our repository at https://github.com/kbk12/ANAGL.git.
{"title":"ANAGL: A Noise-resistant and Anti-sparse Graph Learning for micro-video recommendation","authors":"Jingwei Ma, Kangkang Bian, Yang Xu, Lei Zhu","doi":"10.1145/3670407","DOIUrl":"https://doi.org/10.1145/3670407","url":null,"abstract":"<p>In recent years, Graph Convolutional Networks (GCNs) have seen widespread utilization within micro-video recommendation systems, facilitating the understanding of user preferences through interactions with micro-videos. Despite the commendable performance exhibited by GCN-based methodologies, several persistent issues demand further scrutiny. Primarily, most user-micro-video interactions involve implicit behaviors, such as clicks or abstentions, which may inadvertently capture irrelevant micro-video content, thereby introducing significant noise (false touches, low watch-ratio, low ratings) into users’ histories. Consequently, this noise undermines the efficacy of micro-video recommendations. Moreover, the abundance of micro-videos has resulted in fewer interactions between users and micro-video content. To tackle these challenges, we propose a noise-resistant and anti-sparse graph learning framework for micro-video recommendation. Initially, we construct a denoiser that leverages implicit multi-attribute information (e.g., watch-ratio, timestamp, ratings, etc.) to filter noisy data from user interaction histories. This process yields high-fidelity micro-video information, enabling a more precise modeling of users’ feature preferences. Subsequently, we employ a multi-view reconstruction approach and utilize cross-view self-supervised learning to gain insights into user and micro-video features. This strategic approach effectively mitigates the issue of data sparsity. Extensive experiments conducted on two publicly available micro-video recommendation datasets validate the effectiveness of our proposed method. For in-depth details and access to the code, please refer to our repository at “https://github.com/kbk12/ANAGL.git.”</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"34 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141252604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Multi Fine-Grained Fusion Network for Depression Detection
Li Zhou, Zhenyu Liu, Yutong Li, Yuchi Duan, Huimin Yu, Bin Hu
Depression is an illness that involves emotional and mental health. Currently, depression detection through interviews is the most widely used approach. Advances in natural language processing and sentiment analysis provide strong support for automated interview-based depression detection. However, current multimodal depression detection models fail to adequately capture the fine-grained features of depressive behaviors, making it difficult for them to accurately characterize subtle changes in depressive symptoms. To address this problem, we propose a Multi Fine-Grained Fusion Network (MFFNet). The core idea of this model is to extract and fuse information from feature pairs at different scales through a Multi-Scale Fastformer (MSfastformer), and then use a Recurrent Pyramid Model (RPM) to integrate features of different resolutions, promoting the interaction of multi-level information. Through the interaction of multi-scale and multi-resolution features, the model explores richer feature representations. To validate the effectiveness of the proposed MFFNet, we conduct experiments on two depression interview datasets. The experimental results show that MFFNet performs better in depression detection than other benchmark multimodal models.
{"title":"Multi Fine-Grained Fusion Network for Depression Detection","authors":"Li Zhou, Zhenyu Liu, Yutong Li, Yuchi Duan, Huimin Yu, Bin Hu","doi":"10.1145/3665247","DOIUrl":"https://doi.org/10.1145/3665247","url":null,"abstract":"<p>Depression is an illness that involves emotional and mental health. Currently, depression detection through interviews is the most popular way. With the advancement of natural language processing and sentiment analysis, automated interview-based depression detection is strongly supported. However, current multimodal depression detection models fail to adequately capture the fine-grained features of depressive behaviors, making it difficult for the models to accurately characterize the subtle changes in depressive symptoms. To address this problem, we propose a Multi Fine-Grained Fusion Network (MFFNet). The core idea of this model is to extract and fuse the information of different scale feature pairs through a Multi-Scale Fastformer (MSfastformer), and then use the Recurrent Pyramid Model (RPM) to integrate the features of different resolutions, promoting the interaction of multi-level information. Through the interaction of multi-scale and multi-resolution features, it aims to explore richer feature representations. To validate the effectiveness of our proposed MFFNet model, we conduct experiments on two depression interview datasets. The experimental results show that the MFFNet model performs better in depression detection compared to other benchmark multimodal models.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"50 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141189953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}