Shixuan Gao, Pingping Zhang, Tianyu Yan, Huchuan Lu
Salient Object Detection (SOD) aims to identify and segment the most prominent objects in images. Advanced SOD methods often utilize various Convolutional Neural Networks (CNNs) or Transformers for deep feature extraction. However, these methods still show limited performance and poor generalization in complex cases. Recently, the Segment Anything Model (SAM) has been proposed as a visual foundation model with strong segmentation and generalization capabilities. Nonetheless, SAM requires accurate prompts for target objects, which are unavailable in SOD. Additionally, SAM does not exploit multi-scale and multi-level information or incorporate fine-grained details. To address these shortcomings, we propose a Multi-scale and Detail-enhanced SAM (MDSAM) for SOD. Specifically, we first introduce a Lightweight Multi-Scale Adapter (LMSA), which allows SAM to learn multi-scale information with very few trainable parameters. Then, we propose a Multi-Level Fusion Module (MLFM) to comprehensively utilize the multi-level information from SAM's encoder. Finally, we propose a Detail Enhancement Module (DEM) to inject fine-grained details into SAM. Experimental results demonstrate the superior performance of our model on multiple SOD datasets and its strong generalization to other segmentation tasks. The source code is released at https://github.com/BellyBeauty/MDSAM.
{"title":"Multi-Scale and Detail-Enhanced Segment Anything Model for Salient Object Detection","authors":"Shixuan Gao, Pingping Zhang, Tianyu Yan, Huchuan Lu","doi":"arxiv-2408.04326","DOIUrl":"https://doi.org/arxiv-2408.04326","url":null,"abstract":"Salient Object Detection (SOD) aims to identify and segment the most\u0000prominent objects in images. Advanced SOD methods often utilize various\u0000Convolutional Neural Networks (CNN) or Transformers for deep feature\u0000extraction. However, these methods still deliver low performance and poor\u0000generalization in complex cases. Recently, Segment Anything Model (SAM) has\u0000been proposed as a visual fundamental model, which gives strong segmentation\u0000and generalization capabilities. Nonetheless, SAM requires accurate prompts of\u0000target objects, which are unavailable in SOD. Additionally, SAM lacks the\u0000utilization of multi-scale and multi-level information, as well as the\u0000incorporation of fine-grained details. To address these shortcomings, we\u0000propose a Multi-scale and Detail-enhanced SAM (MDSAM) for SOD. Specifically, we\u0000first introduce a Lightweight Multi-Scale Adapter (LMSA), which allows SAM to\u0000learn multi-scale information with very few trainable parameters. Then, we\u0000propose a Multi-Level Fusion Module (MLFM) to comprehensively utilize the\u0000multi-level information from the SAM's encoder. Finally, we propose a Detail\u0000Enhancement Module (DEM) to incorporate SAM with fine-grained details.\u0000Experimental results demonstrate the superior performance of our model on\u0000multiple SOD datasets and its strong generalization on other segmentation\u0000tasks. The source code is released at https://github.com/BellyBeauty/MDSAM.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haoxuan Li, Zhengmao Yang, Yunshan Ma, Yi Bin, Yang Yang, Tat-Seng Chua
We study an emerging and intriguing problem of multimodal temporal event forecasting with large language models. Compared with the text and graph modalities, the use of images for temporal event forecasting has not been fully explored, especially in the era of large language models (LLMs). To bridge this gap, we are particularly interested in two key questions: 1) why images help in temporal event forecasting, and 2) how to integrate images into an LLM-based forecasting framework. To answer these research questions, we identify two essential functions that images play in temporal event forecasting, i.e., the highlighting and complementary functions. Then, we develop a novel framework, named MM-Forecast. It employs an Image Function Identification module to recognize these functions as verbal descriptions using multimodal large language models (MLLMs), and subsequently incorporates these function descriptions into LLM-based forecasting models. To evaluate our approach, we construct a new multimodal dataset, MidEast-TE-mm, by extending the existing event dataset MidEast-TE-mini with images. Empirical studies demonstrate that MM-Forecast can correctly identify the image functions and, furthermore, that incorporating these verbal function descriptions significantly improves forecasting performance. The dataset, code, and prompts are available at https://github.com/LuminosityX/MM-Forecast.
{"title":"MM-Forecast: A Multimodal Approach to Temporal Event Forecasting with Large Language Models","authors":"Haoxuan Li, Zhengmao Yang, Yunshan Ma, Yi Bin, Yang Yang, Tat-Seng Chua","doi":"arxiv-2408.04388","DOIUrl":"https://doi.org/arxiv-2408.04388","url":null,"abstract":"We study an emerging and intriguing problem of multimodal temporal event\u0000forecasting with large language models. Compared to using text or graph\u0000modalities, the investigation of utilizing images for temporal event\u0000forecasting has not been fully explored, especially in the era of large\u0000language models (LLMs). To bridge this gap, we are particularly interested in\u0000two key questions of: 1) why images will help in temporal event forecasting,\u0000and 2) how to integrate images into the LLM-based forecasting framework. To\u0000answer these research questions, we propose to identify two essential functions\u0000that images play in the scenario of temporal event forecasting, i.e.,\u0000highlighting and complementary. Then, we develop a novel framework, named\u0000MM-Forecast. It employs an Image Function Identification module to recognize\u0000these functions as verbal descriptions using multimodal large language models\u0000(MLLMs), and subsequently incorporates these function descriptions into\u0000LLM-based forecasting models. To evaluate our approach, we construct a new\u0000multimodal dataset, MidEast-TE-mm, by extending an existing event dataset\u0000MidEast-TE-mini with images. Empirical studies demonstrate that our MM-Forecast\u0000can correctly identify the image functions, and further more, incorporating\u0000these verbal function descriptions significantly improves the forecasting\u0000performance. The dataset, code, and prompts are available at\u0000https://github.com/LuminosityX/MM-Forecast.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pedro Neto, Martin Hartmann, Geoff Luck, Petri Toiviainen
Based on a review of anecdotal beliefs, we explored patterns of track-sequencing within professional music albums. We found that songs with high levels of valence, energy and loudness are more likely to be positioned at the beginning of each album. We also found that transitions between consecutive tracks tend to alternate between increases and decreases of valence and energy. These findings were used to build a system which automates the process of album-sequencing. Our results and hypotheses have both practical and theoretical applications. Practically, sequencing regularities can be used to inform playlist generation systems. Theoretically, we find weak to moderate support for the idea that music is perceived in both global and local contexts.
{"title":"The algorithmic nature of song-sequencing: statistical regularities in music albums","authors":"Pedro Neto, Martin Hartmann, Geoff Luck, Petri Toiviainen","doi":"arxiv-2408.04383","DOIUrl":"https://doi.org/arxiv-2408.04383","url":null,"abstract":"Based on a review of anecdotal beliefs, we explored patterns of\u0000track-sequencing within professional music albums. We found that songs with\u0000high levels of valence, energy and loudness are more likely to be positioned at\u0000the beginning of each album. We also found that transitions between consecutive\u0000tracks tend to alternate between increases and decreases of valence and energy.\u0000These findings were used to build a system which automates the process of\u0000album-sequencing. Our results and hypothesis have both practical and\u0000theoretical applications. Practically, sequencing regularities can be used to\u0000inform playlist generation systems. Theoretically, we show weak to moderate\u0000support for the idea that music is perceived in both global and local contexts.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"2012 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the exponential growth of multimedia data, leveraging multimodal sensors presents a promising approach for improving accuracy in human activity recognition. Nevertheless, accurately identifying these activities using both video data and wearable sensor data is challenging due to labor-intensive data annotation and reliance on external pretrained models or additional data. To address these challenges, we introduce Multimodal Masked Autoencoders-Based One-Shot Learning (Mu-MAE). Mu-MAE integrates a multimodal masked autoencoder with a synchronized masking strategy tailored for wearable sensors. This masking strategy compels the networks to capture more meaningful spatiotemporal features, which enables effective self-supervised pretraining without the need for external data. Furthermore, Mu-MAE leverages the representation extracted from the multimodal masked autoencoder as prior information input to a cross-attention multimodal fusion layer. This fusion layer emphasizes spatiotemporal features requiring attention across different modalities while highlighting differences from other classes, aiding the classification of various classes in metric-based one-shot learning. Comprehensive evaluations on MMAct one-shot classification show that Mu-MAE outperforms all the evaluated approaches, achieving up to 80.17% accuracy for five-way one-shot multimodal classification without the use of additional data.
{"title":"MU-MAE: Multimodal Masked Autoencoders-Based One-Shot Learning","authors":"Rex Liu, Xin Liu","doi":"arxiv-2408.04243","DOIUrl":"https://doi.org/arxiv-2408.04243","url":null,"abstract":"With the exponential growth of multimedia data, leveraging multimodal sensors\u0000presents a promising approach for improving accuracy in human activity\u0000recognition. Nevertheless, accurately identifying these activities using both\u0000video data and wearable sensor data presents challenges due to the\u0000labor-intensive data annotation, and reliance on external pretrained models or\u0000additional data. To address these challenges, we introduce Multimodal Masked\u0000Autoencoders-Based One-Shot Learning (Mu-MAE). Mu-MAE integrates a multimodal\u0000masked autoencoder with a synchronized masking strategy tailored for wearable\u0000sensors. This masking strategy compels the networks to capture more meaningful\u0000spatiotemporal features, which enables effective self-supervised pretraining\u0000without the need for external data. Furthermore, Mu-MAE leverages the\u0000representation extracted from multimodal masked autoencoders as prior\u0000information input to a cross-attention multimodal fusion layer. This fusion\u0000layer emphasizes spatiotemporal features requiring attention across different\u0000modalities while highlighting differences from other classes, aiding in the\u0000classification of various classes in metric-based one-shot learning.\u0000Comprehensive evaluations on MMAct one-shot classification show that Mu-MAE\u0000outperforms all the evaluated approaches, achieving up to an 80.17% accuracy\u0000for five-way one-shot multimodal classification, without the use of additional\u0000data.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"65 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuqi Chu, Lizi Liao, Zhiyuan Zhou, Chong-Wah Ngo, Richang Hong
The integration of conversational artificial intelligence (AI) into mental health care promises a new horizon for therapist-client interactions, aiming to closely emulate the depth and nuance of human conversations. Despite the potential, the current landscape of conversational AI is markedly limited by its reliance on single-modal data, constraining the systems' ability to empathize and provide effective emotional support. This limitation stems from a paucity of resources that encapsulate the multimodal nature of human communication essential for therapeutic counseling. To address this gap, we introduce the Multimodal Emotional Support Conversation (MESC) dataset, a first-of-its-kind resource enriched with comprehensive annotations across text, audio, and video modalities. This dataset captures the intricate interplay of user emotions, system strategies, system emotion, and system responses, setting a new precedent in the field. Leveraging the MESC dataset, we propose a general Sequential Multimodal Emotional Support framework (SMES) grounded in Therapeutic Skills Theory. Tailored for multimodal dialogue systems, the SMES framework incorporates an LLM-based reasoning model that sequentially generates user emotion recognition, system strategy prediction, system emotion prediction, and response generation. Our rigorous evaluations demonstrate that this framework significantly enhances the capability of AI systems to mimic therapist behaviors with heightened empathy and strategic responsiveness. By integrating multimodal data in this innovative manner, we bridge the critical gap between emotion recognition and emotional support, marking a significant advancement in conversational AI for mental health support.
{"title":"Towards Multimodal Emotional Support Conversation Systems","authors":"Yuqi Chu, Lizi Liao, Zhiyuan Zhou, Chong-Wah Ngo, Richang Hong","doi":"arxiv-2408.03650","DOIUrl":"https://doi.org/arxiv-2408.03650","url":null,"abstract":"The integration of conversational artificial intelligence (AI) into mental\u0000health care promises a new horizon for therapist-client interactions, aiming to\u0000closely emulate the depth and nuance of human conversations. Despite the\u0000potential, the current landscape of conversational AI is markedly limited by\u0000its reliance on single-modal data, constraining the systems' ability to\u0000empathize and provide effective emotional support. This limitation stems from a\u0000paucity of resources that encapsulate the multimodal nature of human\u0000communication essential for therapeutic counseling. To address this gap, we\u0000introduce the Multimodal Emotional Support Conversation (MESC) dataset, a\u0000first-of-its-kind resource enriched with comprehensive annotations across text,\u0000audio, and video modalities. This dataset captures the intricate interplay of\u0000user emotions, system strategies, system emotion, and system responses, setting\u0000a new precedent in the field. Leveraging the MESC dataset, we propose a general\u0000Sequential Multimodal Emotional Support framework (SMES) grounded in\u0000Therapeutic Skills Theory. Tailored for multimodal dialogue systems, the SMES\u0000framework incorporates an LLM-based reasoning model that sequentially generates\u0000user emotion recognition, system strategy prediction, system emotion\u0000prediction, and response generation. Our rigorous evaluations demonstrate that\u0000this framework significantly enhances the capability of AI systems to mimic\u0000therapist behaviors with heightened empathy and strategic responsiveness. By\u0000integrating multimodal data in this innovative manner, we bridge the critical\u0000gap between emotion recognition and emotional support, marking a significant\u0000advancement in conversational AI for mental health support.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present TALE, a novel training-free framework harnessing the generative capabilities of text-to-image diffusion models to address the cross-domain image composition task, which focuses on flawlessly incorporating user-specified objects into a designated visual context regardless of domain disparity. Previous methods often involve either training auxiliary networks or finetuning diffusion models on customized datasets, which are expensive and may undermine the robust textual and visual priors of pre-trained diffusion models. Some recent works attempt to break the barrier by proposing training-free workarounds that rely on manipulating attention maps to tame the denoising process implicitly. However, composing via attention maps does not necessarily yield the desired compositional outcomes. These approaches can only retain some semantic information and usually fall short in preserving the identity characteristics of input objects, or exhibit limited background-object style adaptation in generated images. In contrast, TALE operates directly on the latent space to provide explicit and effective guidance for the composition process and thereby resolve these problems. Specifically, we equip TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided Latent Optimization. The former formulates noisy latents conducive to initiating and steering the composition process by directly leveraging background and foreground latents at corresponding timesteps, and the latter exploits designated energy functions to further optimize intermediate latents conforming to specific conditions that complement the former to generate the desired final results. Our experiments demonstrate that TALE surpasses prior baselines and attains state-of-the-art performance in image-guided composition across various photorealistic and artistic domains.
{"title":"TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization","authors":"Kien T. Pham, Jingye Chen, Qifeng Chen","doi":"arxiv-2408.03637","DOIUrl":"https://doi.org/arxiv-2408.03637","url":null,"abstract":"We present TALE, a novel training-free framework harnessing the generative\u0000capabilities of text-to-image diffusion models to address the cross-domain\u0000image composition task that focuses on flawlessly incorporating user-specified\u0000objects into a designated visual contexts regardless of domain disparity.\u0000Previous methods often involve either training auxiliary networks or finetuning\u0000diffusion models on customized datasets, which are expensive and may undermine\u0000the robust textual and visual priors of pre-trained diffusion models. Some\u0000recent works attempt to break the barrier by proposing training-free\u0000workarounds that rely on manipulating attention maps to tame the denoising\u0000process implicitly. However, composing via attention maps does not necessarily\u0000yield desired compositional outcomes. These approaches could only retain some\u0000semantic information and usually fall short in preserving identity\u0000characteristics of input objects or exhibit limited background-object style\u0000adaptation in generated images. In contrast, TALE is a novel method that\u0000operates directly on latent space to provide explicit and effective guidance\u0000for the composition process to resolve these problems. Specifically, we equip\u0000TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided\u0000Latent Optimization. The former formulates noisy latents conducive to\u0000initiating and steering the composition process by directly leveraging\u0000background and foreground latents at corresponding timesteps, and the latter\u0000exploits designated energy functions to further optimize intermediate latents\u0000conforming to specific conditions that complement the former to generate\u0000desired final results. Our experiments demonstrate that TALE surpasses prior\u0000baselines and attains state-of-the-art performance in image-guided composition\u0000across various photorealistic and artistic domains.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"100 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Juho Jung, Chaewon Kang, Jeewoo Yoon, Seungbae Kim, Jinyoung Han
The utilization of automated depression detection significantly enhances early intervention for individuals experiencing depression. Despite numerous proposals on automated depression detection using recorded clinical interview videos, limited attention has been paid to the hierarchical structure of the interview questions. In clinical interviews for diagnosing depression, clinicians use a structured questionnaire that includes routine baseline questions and follow-up questions to assess the interviewee's condition. This paper introduces HiQuE (Hierarchical Question Embedding network), a novel depression detection framework that leverages the hierarchical relationship between primary and follow-up questions in clinical interviews. HiQuE can effectively capture the importance of each question in diagnosing depression by learning mutual information across multiple modalities. We conduct extensive experiments on the widely used clinical interview dataset DAIC-WOZ, where our model outperforms other state-of-the-art multimodal depression detection models and emotion recognition models, showcasing its clinical utility in depression detection.
{"title":"HiQuE: Hierarchical Question Embedding Network for Multimodal Depression Detection","authors":"Juho Jung, Chaewon Kang, Jeewoo Yoon, Seungbae Kim, Jinyoung Han","doi":"arxiv-2408.03648","DOIUrl":"https://doi.org/arxiv-2408.03648","url":null,"abstract":"The utilization of automated depression detection significantly enhances\u0000early intervention for individuals experiencing depression. Despite numerous\u0000proposals on automated depression detection using recorded clinical interview\u0000videos, limited attention has been paid to considering the hierarchical\u0000structure of the interview questions. In clinical interviews for diagnosing\u0000depression, clinicians use a structured questionnaire that includes routine\u0000baseline questions and follow-up questions to assess the interviewee's\u0000condition. This paper introduces HiQuE (Hierarchical Question Embedding\u0000network), a novel depression detection framework that leverages the\u0000hierarchical relationship between primary and follow-up questions in clinical\u0000interviews. HiQuE can effectively capture the importance of each question in\u0000diagnosing depression by learning mutual information across multiple\u0000modalities. We conduct extensive experiments on the widely-used clinical\u0000interview data, DAIC-WOZ, where our model outperforms other state-of-the-art\u0000multimodal depression detection models and emotion recognition models,\u0000showcasing its clinical utility in depression detection.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"372 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wei He, Xiang Li, Shengtian Xu, Yuzheng Chen, Chan-In Sio, Ge Lin Kan, Lik-Hang Lee
The preservation of cultural heritage, as mandated by the United Nations Sustainable Development Goals (SDGs), is integral to sustainable urban development. This paper focuses on the Dragon Boat Festival, a prominent event in Chinese cultural heritage, and proposes leveraging Virtual Reality (VR) to enhance its preservation and accessibility. Traditionally, participation in the festival's dragon boat races was limited to elite athletes, excluding broader demographics. Our proposed solution, named MetaDragonBoat, enables virtual participation in dragon boat racing, offering immersive experiences that replicate physical exertion through a cultural journey. To this end, we build a digital twin of a university campus located in a region with a rich dragon boat racing tradition. Coupled with three paddling techniques, enabled by either commercial controllers or physical paddle controllers with haptic feedback, diversified users can engage in realistic rowing experiences. Our results demonstrate that by integrating resistance into the paddle controls, users can simulate the physical effort of dragon boat racing, promoting a deeper understanding and appreciation of this cultural heritage.
{"title":"MetaDragonBoat: Exploring Paddling Techniques of Virtual Dragon Boating in a Metaverse Campus","authors":"Wei He, Xiang Li, Shengtian Xu, Yuzheng Chen, Chan-In Sio, Ge Lin Kan, Lik-Hang Lee","doi":"arxiv-2408.04013","DOIUrl":"https://doi.org/arxiv-2408.04013","url":null,"abstract":"The preservation of cultural heritage, as mandated by the United Nations\u0000Sustainable Development Goals (SDGs), is integral to sustainable urban\u0000development. This paper focuses on the Dragon Boat Festival, a prominent event\u0000in Chinese cultural heritage, and proposes leveraging Virtual Reality (VR), to\u0000enhance its preservation and accessibility. Traditionally, participation in the\u0000festival's dragon boat races was limited to elite athletes, excluding broader\u0000demographics. Our proposed solution, named MetaDragonBoat, enables virtual\u0000participation in dragon boat racing, offering immersive experiences that\u0000replicate physical exertion through a cultural journey. Thus, we build a\u0000digital twin of a university campus located in a region with a rich dragon boat\u0000racing tradition. Coupled with three paddling techniques that are enabled by\u0000either commercial controllers or physical paddle controllers with haptic\u0000feedback, diversified users can engage in realistic rowing experiences. Our\u0000results demonstrate that by integrating resistance into the paddle controls,\u0000users could simulate the physical effort of dragon boat racing, promoting a\u0000deeper understanding and appreciation of this cultural heritage.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"79 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zebin Yao, Fangxiang Feng, Ruifan Li, Xiaojie Wang
The customization of text-to-image models has seen significant advancements, yet generating multiple personalized concepts remains a challenging task. Current methods struggle with attribute leakage and layout confusion when handling multiple concepts, leading to reduced concept fidelity and semantic consistency. In this work, we introduce a novel training-free framework, Concept Conductor, designed to ensure visual fidelity and correct layout in multi-concept customization. Concept Conductor isolates the sampling processes of multiple custom models to prevent attribute leakage between different concepts and corrects erroneous layouts through self-attention-based spatial guidance. Additionally, we present a concept injection technique that employs shape-aware masks to specify the generation area for each concept. This technique injects the structure and appearance of personalized concepts through feature fusion in the attention layers, ensuring harmony in the final image. Extensive qualitative and quantitative experiments demonstrate that Concept Conductor can consistently generate composite images with accurate layouts while preserving the visual details of each concept. Compared to existing baselines, Concept Conductor shows significant performance improvements. Our method supports the combination of any number of concepts and maintains high fidelity even when dealing with visually similar concepts. The code and models are available at https://github.com/Nihukat/Concept-Conductor.
{"title":"Concept Conductor: Orchestrating Multiple Personalized Concepts in Text-to-Image Synthesis","authors":"Zebin Yao, Fangxiang Feng, Ruifan Li, Xiaojie Wang","doi":"arxiv-2408.03632","DOIUrl":"https://doi.org/arxiv-2408.03632","url":null,"abstract":"The customization of text-to-image models has seen significant advancements,\u0000yet generating multiple personalized concepts remains a challenging task.\u0000Current methods struggle with attribute leakage and layout confusion when\u0000handling multiple concepts, leading to reduced concept fidelity and semantic\u0000consistency. In this work, we introduce a novel training-free framework,\u0000Concept Conductor, designed to ensure visual fidelity and correct layout in\u0000multi-concept customization. Concept Conductor isolates the sampling processes\u0000of multiple custom models to prevent attribute leakage between different\u0000concepts and corrects erroneous layouts through self-attention-based spatial\u0000guidance. Additionally, we present a concept injection technique that employs\u0000shape-aware masks to specify the generation area for each concept. This\u0000technique injects the structure and appearance of personalized concepts through\u0000feature fusion in the attention layers, ensuring harmony in the final image.\u0000Extensive qualitative and quantitative experiments demonstrate that Concept\u0000Conductor can consistently generate composite images with accurate layouts\u0000while preserving the visual details of each concept. Compared to existing\u0000baselines, Concept Conductor shows significant performance improvements. Our\u0000method supports the combination of any number of concepts and maintains high\u0000fidelity even when dealing with visually similar concepts. The code and models\u0000are available at https://github.com/Nihukat/Concept-Conductor.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ruixiang Zhao, Jian Jia, Yan Li, Xuehan Bai, Quan Chen, Han Li, Peng Jiang, Xirong Li
E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain manner as images, short videos, or live-stream promotions. A unified and vectorized cross-domain product representation is therefore essential. Due to large intra-product variance and high inter-product similarity in the broad-domain scenario, a visual-only representation is inadequate. While Automatic Speech Recognition (ASR) text derived from the short or live-stream videos is readily accessible, how to de-noise this excessively noisy text for multimodal representation learning remains largely unexplored. We propose ASR-enhanced Multimodal Product Representation Learning (AMPere). To extract product-specific information from the raw ASR text, AMPere uses an easy-to-implement LLM-based ASR text summarizer. The LLM-summarized text, together with visual data, is then fed into a multi-branch network to generate compact multimodal embeddings. Extensive experiments on a large-scale tri-domain dataset verify the effectiveness of AMPere in obtaining a unified multimodal product representation that clearly improves cross-domain product retrieval.
{"title":"ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval","authors":"Ruixiang Zhao, Jian Jia, Yan Li, Xuehan Bai, Quan Chen, Han Li, Peng Jiang, Xirong Li","doi":"arxiv-2408.02978","DOIUrl":"https://doi.org/arxiv-2408.02978","url":null,"abstract":"E-commerce is increasingly multimedia-enriched, with products exhibited in a\u0000broad-domain manner as images, short videos, or live stream promotions. A\u0000unified and vectorized cross-domain production representation is essential. Due\u0000to large intra-product variance and high inter-product similarity in the\u0000broad-domain scenario, a visual-only representation is inadequate. While\u0000Automatic Speech Recognition (ASR) text derived from the short or live-stream\u0000videos is readily accessible, how to de-noise the excessively noisy text for\u0000multimodal representation learning is mostly untouched. We propose ASR-enhanced\u0000Multimodal Product Representation Learning (AMPere). In order to extract\u0000product-specific information from the raw ASR text, AMPere uses an\u0000easy-to-implement LLM-based ASR text summarizer. The LLM-summarized text,\u0000together with visual data, is then fed into a multi-branch network to generate\u0000compact multimodal embeddings. Extensive experiments on a large-scale\u0000tri-domain dataset verify the effectiveness of AMPere in obtaining a unified\u0000multimodal product representation that clearly improves cross-domain product\u0000retrieval.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}