Shixuan Gao, Pingping Zhang, Tianyu Yan, Huchuan Lu
Salient Object Detection (SOD) aims to identify and segment the most prominent objects in images. Advanced SOD methods often utilize various Convolutional Neural Networks (CNNs) or Transformers for deep feature extraction. However, these methods still show limited performance and poor generalization in complex cases. Recently, the Segment Anything Model (SAM) has been proposed as a visual foundation model with strong segmentation and generalization capabilities. Nonetheless, SAM requires accurate prompts for target objects, which are unavailable in SOD. Additionally, SAM does not exploit multi-scale and multi-level information or incorporate fine-grained details. To address these shortcomings, we propose a Multi-scale and Detail-enhanced SAM (MDSAM) for SOD. Specifically, we first introduce a Lightweight Multi-Scale Adapter (LMSA), which allows SAM to learn multi-scale information with very few trainable parameters. Then, we propose a Multi-Level Fusion Module (MLFM) to comprehensively utilize the multi-level information from SAM's encoder. Finally, we propose a Detail Enhancement Module (DEM) to inject fine-grained details into SAM. Experimental results demonstrate the superior performance of our model on multiple SOD datasets and its strong generalization to other segmentation tasks. The source code is released at https://github.com/BellyBeauty/MDSAM.
{"title":"Multi-Scale and Detail-Enhanced Segment Anything Model for Salient Object Detection","authors":"Shixuan Gao, Pingping Zhang, Tianyu Yan, Huchuan Lu","doi":"arxiv-2408.04326","DOIUrl":"https://doi.org/arxiv-2408.04326","url":null,"abstract":"Salient Object Detection (SOD) aims to identify and segment the most\u0000prominent objects in images. Advanced SOD methods often utilize various\u0000Convolutional Neural Networks (CNN) or Transformers for deep feature\u0000extraction. However, these methods still deliver low performance and poor\u0000generalization in complex cases. Recently, Segment Anything Model (SAM) has\u0000been proposed as a visual fundamental model, which gives strong segmentation\u0000and generalization capabilities. Nonetheless, SAM requires accurate prompts of\u0000target objects, which are unavailable in SOD. Additionally, SAM lacks the\u0000utilization of multi-scale and multi-level information, as well as the\u0000incorporation of fine-grained details. To address these shortcomings, we\u0000propose a Multi-scale and Detail-enhanced SAM (MDSAM) for SOD. Specifically, we\u0000first introduce a Lightweight Multi-Scale Adapter (LMSA), which allows SAM to\u0000learn multi-scale information with very few trainable parameters. Then, we\u0000propose a Multi-Level Fusion Module (MLFM) to comprehensively utilize the\u0000multi-level information from the SAM's encoder. Finally, we propose a Detail\u0000Enhancement Module (DEM) to incorporate SAM with fine-grained details.\u0000Experimental results demonstrate the superior performance of our model on\u0000multiple SOD datasets and its strong generalization on other segmentation\u0000tasks. The source code is released at https://github.com/BellyBeauty/MDSAM.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haoxuan Li, Zhengmao Yang, Yunshan Ma, Yi Bin, Yang Yang, Tat-Seng Chua
We study an emerging and intriguing problem of multimodal temporal event forecasting with large language models. Compared with the text and graph modalities, the use of images for temporal event forecasting has not been fully explored, especially in the era of large language models (LLMs). To bridge this gap, we are particularly interested in two key questions: 1) why images help in temporal event forecasting, and 2) how to integrate images into an LLM-based forecasting framework. To answer these research questions, we identify two essential functions that images play in temporal event forecasting, i.e., the highlighting and complementary functions. Then, we develop a novel framework, named MM-Forecast. It employs an Image Function Identification module to recognize these functions as verbal descriptions using multimodal large language models (MLLMs), and subsequently incorporates these function descriptions into LLM-based forecasting models. To evaluate our approach, we construct a new multimodal dataset, MidEast-TE-mm, by extending the existing event dataset MidEast-TE-mini with images. Empirical studies demonstrate that MM-Forecast can correctly identify the image functions and, furthermore, that incorporating these verbal function descriptions significantly improves forecasting performance. The dataset, code, and prompts are available at https://github.com/LuminosityX/MM-Forecast.
{"title":"MM-Forecast: A Multimodal Approach to Temporal Event Forecasting with Large Language Models","authors":"Haoxuan Li, Zhengmao Yang, Yunshan Ma, Yi Bin, Yang Yang, Tat-Seng Chua","doi":"arxiv-2408.04388","DOIUrl":"https://doi.org/arxiv-2408.04388","url":null,"abstract":"We study an emerging and intriguing problem of multimodal temporal event\u0000forecasting with large language models. Compared to using text or graph\u0000modalities, the investigation of utilizing images for temporal event\u0000forecasting has not been fully explored, especially in the era of large\u0000language models (LLMs). To bridge this gap, we are particularly interested in\u0000two key questions of: 1) why images will help in temporal event forecasting,\u0000and 2) how to integrate images into the LLM-based forecasting framework. To\u0000answer these research questions, we propose to identify two essential functions\u0000that images play in the scenario of temporal event forecasting, i.e.,\u0000highlighting and complementary. Then, we develop a novel framework, named\u0000MM-Forecast. It employs an Image Function Identification module to recognize\u0000these functions as verbal descriptions using multimodal large language models\u0000(MLLMs), and subsequently incorporates these function descriptions into\u0000LLM-based forecasting models. To evaluate our approach, we construct a new\u0000multimodal dataset, MidEast-TE-mm, by extending an existing event dataset\u0000MidEast-TE-mini with images. Empirical studies demonstrate that our MM-Forecast\u0000can correctly identify the image functions, and further more, incorporating\u0000these verbal function descriptions significantly improves the forecasting\u0000performance. The dataset, code, and prompts are available at\u0000https://github.com/LuminosityX/MM-Forecast.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pedro Neto, Martin Hartmann, Geoff Luck, Petri Toiviainen
Based on a review of anecdotal beliefs, we explored patterns of track-sequencing within professional music albums. We found that songs with high levels of valence, energy and loudness are more likely to be positioned at the beginning of each album. We also found that transitions between consecutive tracks tend to alternate between increases and decreases of valence and energy. These findings were used to build a system which automates the process of album-sequencing. Our results and hypotheses have both practical and theoretical applications. Practically, sequencing regularities can be used to inform playlist generation systems. Theoretically, we find weak to moderate support for the idea that music is perceived in both global and local contexts.
{"title":"The algorithmic nature of song-sequencing: statistical regularities in music albums","authors":"Pedro Neto, Martin Hartmann, Geoff Luck, Petri Toiviainen","doi":"arxiv-2408.04383","DOIUrl":"https://doi.org/arxiv-2408.04383","url":null,"abstract":"Based on a review of anecdotal beliefs, we explored patterns of\u0000track-sequencing within professional music albums. We found that songs with\u0000high levels of valence, energy and loudness are more likely to be positioned at\u0000the beginning of each album. We also found that transitions between consecutive\u0000tracks tend to alternate between increases and decreases of valence and energy.\u0000These findings were used to build a system which automates the process of\u0000album-sequencing. Our results and hypothesis have both practical and\u0000theoretical applications. Practically, sequencing regularities can be used to\u0000inform playlist generation systems. Theoretically, we show weak to moderate\u0000support for the idea that music is perceived in both global and local contexts.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"2012 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the exponential growth of multimedia data, leveraging multimodal sensors presents a promising approach for improving accuracy in human activity recognition. Nevertheless, accurately identifying these activities using both video data and wearable sensor data is challenging due to labor-intensive data annotation and reliance on external pretrained models or additional data. To address these challenges, we introduce Multimodal Masked Autoencoders-Based One-Shot Learning (Mu-MAE). Mu-MAE integrates a multimodal masked autoencoder with a synchronized masking strategy tailored for wearable sensors. This masking strategy compels the networks to capture more meaningful spatiotemporal features, which enables effective self-supervised pretraining without the need for external data. Furthermore, Mu-MAE leverages the representation extracted from the multimodal masked autoencoder as prior information input to a cross-attention multimodal fusion layer. This fusion layer emphasizes spatiotemporal features requiring attention across different modalities while highlighting differences from other classes, aiding the classification of various classes in metric-based one-shot learning. Comprehensive evaluations on MMAct one-shot classification show that Mu-MAE outperforms all the evaluated approaches, achieving up to 80.17% accuracy for five-way one-shot multimodal classification without the use of additional data.
{"title":"MU-MAE: Multimodal Masked Autoencoders-Based One-Shot Learning","authors":"Rex Liu, Xin Liu","doi":"arxiv-2408.04243","DOIUrl":"https://doi.org/arxiv-2408.04243","url":null,"abstract":"With the exponential growth of multimedia data, leveraging multimodal sensors\u0000presents a promising approach for improving accuracy in human activity\u0000recognition. Nevertheless, accurately identifying these activities using both\u0000video data and wearable sensor data presents challenges due to the\u0000labor-intensive data annotation, and reliance on external pretrained models or\u0000additional data. To address these challenges, we introduce Multimodal Masked\u0000Autoencoders-Based One-Shot Learning (Mu-MAE). Mu-MAE integrates a multimodal\u0000masked autoencoder with a synchronized masking strategy tailored for wearable\u0000sensors. This masking strategy compels the networks to capture more meaningful\u0000spatiotemporal features, which enables effective self-supervised pretraining\u0000without the need for external data. Furthermore, Mu-MAE leverages the\u0000representation extracted from multimodal masked autoencoders as prior\u0000information input to a cross-attention multimodal fusion layer. This fusion\u0000layer emphasizes spatiotemporal features requiring attention across different\u0000modalities while highlighting differences from other classes, aiding in the\u0000classification of various classes in metric-based one-shot learning.\u0000Comprehensive evaluations on MMAct one-shot classification show that Mu-MAE\u0000outperforms all the evaluated approaches, achieving up to an 80.17% accuracy\u0000for five-way one-shot multimodal classification, without the use of additional\u0000data.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"65 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuqi Chu, Lizi Liao, Zhiyuan Zhou, Chong-Wah Ngo, Richang Hong
The integration of conversational artificial intelligence (AI) into mental health care promises a new horizon for therapist-client interactions, aiming to closely emulate the depth and nuance of human conversations. Despite the potential, the current landscape of conversational AI is markedly limited by its reliance on single-modal data, constraining the systems' ability to empathize and provide effective emotional support. This limitation stems from a paucity of resources that encapsulate the multimodal nature of human communication essential for therapeutic counseling. To address this gap, we introduce the Multimodal Emotional Support Conversation (MESC) dataset, a first-of-its-kind resource enriched with comprehensive annotations across text, audio, and video modalities. This dataset captures the intricate interplay of user emotions, system strategies, system emotion, and system responses, setting a new precedent in the field. Leveraging the MESC dataset, we propose a general Sequential Multimodal Emotional Support framework (SMES) grounded in Therapeutic Skills Theory. Tailored for multimodal dialogue systems, the SMES framework incorporates an LLM-based reasoning model that sequentially generates user emotion recognition, system strategy prediction, system emotion prediction, and response generation. Our rigorous evaluations demonstrate that this framework significantly enhances the capability of AI systems to mimic therapist behaviors with heightened empathy and strategic responsiveness. By integrating multimodal data in this innovative manner, we bridge the critical gap between emotion recognition and emotional support, marking a significant advancement in conversational AI for mental health support.
{"title":"Towards Multimodal Emotional Support Conversation Systems","authors":"Yuqi Chu, Lizi Liao, Zhiyuan Zhou, Chong-Wah Ngo, Richang Hong","doi":"arxiv-2408.03650","DOIUrl":"https://doi.org/arxiv-2408.03650","url":null,"abstract":"The integration of conversational artificial intelligence (AI) into mental\u0000health care promises a new horizon for therapist-client interactions, aiming to\u0000closely emulate the depth and nuance of human conversations. Despite the\u0000potential, the current landscape of conversational AI is markedly limited by\u0000its reliance on single-modal data, constraining the systems' ability to\u0000empathize and provide effective emotional support. This limitation stems from a\u0000paucity of resources that encapsulate the multimodal nature of human\u0000communication essential for therapeutic counseling. To address this gap, we\u0000introduce the Multimodal Emotional Support Conversation (MESC) dataset, a\u0000first-of-its-kind resource enriched with comprehensive annotations across text,\u0000audio, and video modalities. This dataset captures the intricate interplay of\u0000user emotions, system strategies, system emotion, and system responses, setting\u0000a new precedent in the field. Leveraging the MESC dataset, we propose a general\u0000Sequential Multimodal Emotional Support framework (SMES) grounded in\u0000Therapeutic Skills Theory. Tailored for multimodal dialogue systems, the SMES\u0000framework incorporates an LLM-based reasoning model that sequentially generates\u0000user emotion recognition, system strategy prediction, system emotion\u0000prediction, and response generation. Our rigorous evaluations demonstrate that\u0000this framework significantly enhances the capability of AI systems to mimic\u0000therapist behaviors with heightened empathy and strategic responsiveness. By\u0000integrating multimodal data in this innovative manner, we bridge the critical\u0000gap between emotion recognition and emotional support, marking a significant\u0000advancement in conversational AI for mental health support.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present TALE, a novel training-free framework harnessing the generative capabilities of text-to-image diffusion models to address the cross-domain image composition task, which focuses on flawlessly incorporating user-specified objects into a designated visual context regardless of domain disparity. Previous methods often involve either training auxiliary networks or finetuning diffusion models on customized datasets, which are expensive and may undermine the robust textual and visual priors of pre-trained diffusion models. Some recent works attempt to break the barrier by proposing training-free workarounds that rely on manipulating attention maps to tame the denoising process implicitly. However, composing via attention maps does not necessarily yield the desired compositional outcomes. These approaches can only retain some semantic information and usually fall short in preserving the identity characteristics of input objects, or exhibit limited background-object style adaptation in generated images. In contrast, TALE operates directly on the latent space to provide explicit and effective guidance for the composition process and thereby resolve these problems. Specifically, we equip TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided Latent Optimization. The former formulates noisy latents conducive to initiating and steering the composition process by directly leveraging background and foreground latents at corresponding timesteps, and the latter exploits designated energy functions to further optimize intermediate latents conforming to specific conditions that complement the former to generate the desired final results. Our experiments demonstrate that TALE surpasses prior baselines and attains state-of-the-art performance in image-guided composition across various photorealistic and artistic domains.
{"title":"TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization","authors":"Kien T. Pham, Jingye Chen, Qifeng Chen","doi":"arxiv-2408.03637","DOIUrl":"https://doi.org/arxiv-2408.03637","url":null,"abstract":"We present TALE, a novel training-free framework harnessing the generative\u0000capabilities of text-to-image diffusion models to address the cross-domain\u0000image composition task that focuses on flawlessly incorporating user-specified\u0000objects into a designated visual contexts regardless of domain disparity.\u0000Previous methods often involve either training auxiliary networks or finetuning\u0000diffusion models on customized datasets, which are expensive and may undermine\u0000the robust textual and visual priors of pre-trained diffusion models. Some\u0000recent works attempt to break the barrier by proposing training-free\u0000workarounds that rely on manipulating attention maps to tame the denoising\u0000process implicitly. However, composing via attention maps does not necessarily\u0000yield desired compositional outcomes. These approaches could only retain some\u0000semantic information and usually fall short in preserving identity\u0000characteristics of input objects or exhibit limited background-object style\u0000adaptation in generated images. In contrast, TALE is a novel method that\u0000operates directly on latent space to provide explicit and effective guidance\u0000for the composition process to resolve these problems. Specifically, we equip\u0000TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided\u0000Latent Optimization. The former formulates noisy latents conducive to\u0000initiating and steering the composition process by directly leveraging\u0000background and foreground latents at corresponding timesteps, and the latter\u0000exploits designated energy functions to further optimize intermediate latents\u0000conforming to specific conditions that complement the former to generate\u0000desired final results. Our experiments demonstrate that TALE surpasses prior\u0000baselines and attains state-of-the-art performance in image-guided composition\u0000across various photorealistic and artistic domains.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"100 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Juho Jung, Chaewon Kang, Jeewoo Yoon, Seungbae Kim, Jinyoung Han
The utilization of automated depression detection significantly enhances early intervention for individuals experiencing depression. Despite numerous proposals on automated depression detection using recorded clinical interview videos, limited attention has been paid to the hierarchical structure of the interview questions. In clinical interviews for diagnosing depression, clinicians use a structured questionnaire that includes routine baseline questions and follow-up questions to assess the interviewee's condition. This paper introduces HiQuE (Hierarchical Question Embedding network), a novel depression detection framework that leverages the hierarchical relationship between primary and follow-up questions in clinical interviews. HiQuE can effectively capture the importance of each question in diagnosing depression by learning mutual information across multiple modalities. We conduct extensive experiments on the widely used clinical interview dataset DAIC-WOZ, where our model outperforms other state-of-the-art multimodal depression detection models and emotion recognition models, showcasing its clinical utility in depression detection.
{"title":"HiQuE: Hierarchical Question Embedding Network for Multimodal Depression Detection","authors":"Juho Jung, Chaewon Kang, Jeewoo Yoon, Seungbae Kim, Jinyoung Han","doi":"arxiv-2408.03648","DOIUrl":"https://doi.org/arxiv-2408.03648","url":null,"abstract":"The utilization of automated depression detection significantly enhances\u0000early intervention for individuals experiencing depression. Despite numerous\u0000proposals on automated depression detection using recorded clinical interview\u0000videos, limited attention has been paid to considering the hierarchical\u0000structure of the interview questions. In clinical interviews for diagnosing\u0000depression, clinicians use a structured questionnaire that includes routine\u0000baseline questions and follow-up questions to assess the interviewee's\u0000condition. This paper introduces HiQuE (Hierarchical Question Embedding\u0000network), a novel depression detection framework that leverages the\u0000hierarchical relationship between primary and follow-up questions in clinical\u0000interviews. HiQuE can effectively capture the importance of each question in\u0000diagnosing depression by learning mutual information across multiple\u0000modalities. We conduct extensive experiments on the widely-used clinical\u0000interview data, DAIC-WOZ, where our model outperforms other state-of-the-art\u0000multimodal depression detection models and emotion recognition models,\u0000showcasing its clinical utility in depression detection.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"372 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wei He, Xiang Li, Shengtian Xu, Yuzheng Chen, Chan-In Sio, Ge Lin Kan, Lik-Hang Lee
The preservation of cultural heritage, as mandated by the United Nations Sustainable Development Goals (SDGs), is integral to sustainable urban development. This paper focuses on the Dragon Boat Festival, a prominent event in Chinese cultural heritage, and proposes leveraging Virtual Reality (VR) to enhance its preservation and accessibility. Traditionally, participation in the festival's dragon boat races was limited to elite athletes, excluding broader demographics. Our proposed solution, named MetaDragonBoat, enables virtual participation in dragon boat racing, offering immersive experiences that replicate physical exertion through a cultural journey. To this end, we build a digital twin of a university campus located in a region with a rich dragon boat racing tradition. Coupled with three paddling techniques, enabled by either commercial controllers or physical paddle controllers with haptic feedback, diversified users can engage in realistic rowing experiences. Our results demonstrate that by integrating resistance into the paddle controls, users can simulate the physical effort of dragon boat racing, promoting a deeper understanding and appreciation of this cultural heritage.
{"title":"MetaDragonBoat: Exploring Paddling Techniques of Virtual Dragon Boating in a Metaverse Campus","authors":"Wei He, Xiang Li, Shengtian Xu, Yuzheng Chen, Chan-In Sio, Ge Lin Kan, Lik-Hang Lee","doi":"arxiv-2408.04013","DOIUrl":"https://doi.org/arxiv-2408.04013","url":null,"abstract":"The preservation of cultural heritage, as mandated by the United Nations\u0000Sustainable Development Goals (SDGs), is integral to sustainable urban\u0000development. This paper focuses on the Dragon Boat Festival, a prominent event\u0000in Chinese cultural heritage, and proposes leveraging Virtual Reality (VR), to\u0000enhance its preservation and accessibility. Traditionally, participation in the\u0000festival's dragon boat races was limited to elite athletes, excluding broader\u0000demographics. Our proposed solution, named MetaDragonBoat, enables virtual\u0000participation in dragon boat racing, offering immersive experiences that\u0000replicate physical exertion through a cultural journey. Thus, we build a\u0000digital twin of a university campus located in a region with a rich dragon boat\u0000racing tradition. Coupled with three paddling techniques that are enabled by\u0000either commercial controllers or physical paddle controllers with haptic\u0000feedback, diversified users can engage in realistic rowing experiences. Our\u0000results demonstrate that by integrating resistance into the paddle controls,\u0000users could simulate the physical effort of dragon boat racing, promoting a\u0000deeper understanding and appreciation of this cultural heritage.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"79 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zebin Yao, Fangxiang Feng, Ruifan Li, Xiaojie Wang
The customization of text-to-image models has seen significant advancements, yet generating multiple personalized concepts remains a challenging task. Current methods struggle with attribute leakage and layout confusion when handling multiple concepts, leading to reduced concept fidelity and semantic consistency. In this work, we introduce a novel training-free framework, Concept Conductor, designed to ensure visual fidelity and correct layout in multi-concept customization. Concept Conductor isolates the sampling processes of multiple custom models to prevent attribute leakage between different concepts and corrects erroneous layouts through self-attention-based spatial guidance. Additionally, we present a concept injection technique that employs shape-aware masks to specify the generation area for each concept. This technique injects the structure and appearance of personalized concepts through feature fusion in the attention layers, ensuring harmony in the final image. Extensive qualitative and quantitative experiments demonstrate that Concept Conductor can consistently generate composite images with accurate layouts while preserving the visual details of each concept. Compared to existing baselines, Concept Conductor shows significant performance improvements. Our method supports the combination of any number of concepts and maintains high fidelity even when dealing with visually similar concepts. The code and models are available at https://github.com/Nihukat/Concept-Conductor.
{"title":"Concept Conductor: Orchestrating Multiple Personalized Concepts in Text-to-Image Synthesis","authors":"Zebin Yao, Fangxiang Feng, Ruifan Li, Xiaojie Wang","doi":"arxiv-2408.03632","DOIUrl":"https://doi.org/arxiv-2408.03632","url":null,"abstract":"The customization of text-to-image models has seen significant advancements,\u0000yet generating multiple personalized concepts remains a challenging task.\u0000Current methods struggle with attribute leakage and layout confusion when\u0000handling multiple concepts, leading to reduced concept fidelity and semantic\u0000consistency. In this work, we introduce a novel training-free framework,\u0000Concept Conductor, designed to ensure visual fidelity and correct layout in\u0000multi-concept customization. Concept Conductor isolates the sampling processes\u0000of multiple custom models to prevent attribute leakage between different\u0000concepts and corrects erroneous layouts through self-attention-based spatial\u0000guidance. Additionally, we present a concept injection technique that employs\u0000shape-aware masks to specify the generation area for each concept. This\u0000technique injects the structure and appearance of personalized concepts through\u0000feature fusion in the attention layers, ensuring harmony in the final image.\u0000Extensive qualitative and quantitative experiments demonstrate that Concept\u0000Conductor can consistently generate composite images with accurate layouts\u0000while preserving the visual details of each concept. Compared to existing\u0000baselines, Concept Conductor shows significant performance improvements. Our\u0000method supports the combination of any number of concepts and maintains high\u0000fidelity even when dealing with visually similar concepts. The code and models\u0000are available at https://github.com/Nihukat/Concept-Conductor.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ruixiang Zhao, Jian Jia, Yan Li, Xuehan Bai, Quan Chen, Han Li, Peng Jiang, Xirong Li
E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain manner as images, short videos, or live-stream promotions. A unified and vectorized cross-domain product representation is therefore essential. Due to large intra-product variance and high inter-product similarity in the broad-domain scenario, a visual-only representation is inadequate. While Automatic Speech Recognition (ASR) text derived from the short or live-stream videos is readily accessible, how to de-noise this excessively noisy text for multimodal representation learning remains largely unexplored. We propose ASR-enhanced Multimodal Product Representation Learning (AMPere). To extract product-specific information from the raw ASR text, AMPere uses an easy-to-implement LLM-based ASR text summarizer. The LLM-summarized text, together with visual data, is then fed into a multi-branch network to generate compact multimodal embeddings. Extensive experiments on a large-scale tri-domain dataset verify the effectiveness of AMPere in obtaining a unified multimodal product representation that clearly improves cross-domain product retrieval.
{"title":"ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval","authors":"Ruixiang Zhao, Jian Jia, Yan Li, Xuehan Bai, Quan Chen, Han Li, Peng Jiang, Xirong Li","doi":"arxiv-2408.02978","DOIUrl":"https://doi.org/arxiv-2408.02978","url":null,"abstract":"E-commerce is increasingly multimedia-enriched, with products exhibited in a\u0000broad-domain manner as images, short videos, or live stream promotions. A\u0000unified and vectorized cross-domain production representation is essential. Due\u0000to large intra-product variance and high inter-product similarity in the\u0000broad-domain scenario, a visual-only representation is inadequate. While\u0000Automatic Speech Recognition (ASR) text derived from the short or live-stream\u0000videos is readily accessible, how to de-noise the excessively noisy text for\u0000multimodal representation learning is mostly untouched. We propose ASR-enhanced\u0000Multimodal Product Representation Learning (AMPere). In order to extract\u0000product-specific information from the raw ASR text, AMPere uses an\u0000easy-to-implement LLM-based ASR text summarizer. The LLM-summarized text,\u0000together with visual data, is then fed into a multi-branch network to generate\u0000compact multimodal embeddings. Extensive experiments on a large-scale\u0000tri-domain dataset verify the effectiveness of AMPere in obtaining a unified\u0000multimodal product representation that clearly improves cross-domain product\u0000retrieval.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}