TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization

Kien T. Pham, Jingye Chen, Qifeng Chen
{"title":"TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization","authors":"Kien T. Pham, Jingye Chen, Qifeng Chen","doi":"arxiv-2408.03637","DOIUrl":null,"url":null,"abstract":"We present TALE, a novel training-free framework harnessing the generative\ncapabilities of text-to-image diffusion models to address the cross-domain\nimage composition task that focuses on flawlessly incorporating user-specified\nobjects into a designated visual contexts regardless of domain disparity.\nPrevious methods often involve either training auxiliary networks or finetuning\ndiffusion models on customized datasets, which are expensive and may undermine\nthe robust textual and visual priors of pre-trained diffusion models. Some\nrecent works attempt to break the barrier by proposing training-free\nworkarounds that rely on manipulating attention maps to tame the denoising\nprocess implicitly. However, composing via attention maps does not necessarily\nyield desired compositional outcomes. These approaches could only retain some\nsemantic information and usually fall short in preserving identity\ncharacteristics of input objects or exhibit limited background-object style\nadaptation in generated images. In contrast, TALE is a novel method that\noperates directly on latent space to provide explicit and effective guidance\nfor the composition process to resolve these problems. Specifically, we equip\nTALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided\nLatent Optimization. The former formulates noisy latents conducive to\ninitiating and steering the composition process by directly leveraging\nbackground and foreground latents at corresponding timesteps, and the latter\nexploits designated energy functions to further optimize intermediate latents\nconforming to specific conditions that complement the former to generate\ndesired final results. Our experiments demonstrate that TALE surpasses prior\nbaselines and attains state-of-the-art performance in image-guided composition\nacross various photorealistic and artistic domains.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"100 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.03637","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

We present TALE, a novel training-free framework harnessing the generative capabilities of text-to-image diffusion models to address the cross-domain image composition task, which focuses on flawlessly incorporating user-specified objects into designated visual contexts regardless of domain disparity. Previous methods often involve either training auxiliary networks or fine-tuning diffusion models on customized datasets, which are expensive and may undermine the robust textual and visual priors of pre-trained diffusion models. Some recent works attempt to break this barrier by proposing training-free workarounds that rely on manipulating attention maps to tame the denoising process implicitly. However, composing via attention maps does not necessarily yield the desired compositional outcomes. These approaches can retain only some semantic information and usually fall short in preserving the identity characteristics of input objects, or exhibit limited background-object style adaptation in generated images. In contrast, TALE operates directly in latent space to provide explicit and effective guidance for the composition process and resolve these problems. Specifically, we equip TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided Latent Optimization. The former formulates noisy latents conducive to initiating and steering the composition process by directly leveraging background and foreground latents at corresponding timesteps, while the latter exploits designated energy functions to further optimize intermediate latents so that they conform to specific conditions, complementing the former to generate the desired final results. Our experiments demonstrate that TALE surpasses prior baselines and attains state-of-the-art performance in image-guided composition across various photorealistic and artistic domains.
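The abstract describes the two mechanisms only at a high level. The following is a minimal, illustrative PyTorch sketch of what they could look like in code; it is not the authors' implementation, and the function names, latent shapes, blending weight, mask placement, and toy energy function are all assumptions made purely for demonstration.

import torch

def adaptive_latent_manipulation(bg_latent, fg_latent, mask, alpha=0.8):
    """Compose a noisy latent by blending the masked foreground latent into
    the background latent at the same diffusion timestep (hypothetical form).

    bg_latent, fg_latent: (B, C, H, W) noisy latents at timestep t
    mask: (B, 1, H, W) binary mask locating the object region (1 = object)
    """
    # Inside the mask, blend toward the foreground latent; elsewhere keep the
    # background latent untouched.
    fused = alpha * fg_latent + (1 - alpha) * bg_latent
    return mask * fused + (1 - mask) * bg_latent

def energy_guided_step(latent, energy_fn, step_size=0.1):
    """One gradient step that nudges an intermediate latent toward lower
    energy, i.e. toward better agreement with the composition conditions."""
    latent = latent.detach().requires_grad_(True)
    energy = energy_fn(latent)              # scalar measuring condition violation
    grad = torch.autograd.grad(energy, latent)[0]
    return (latent - step_size * grad).detach()

if __name__ == "__main__":
    B, C, H, W = 1, 4, 64, 64               # typical latent-diffusion latent shape
    bg = torch.randn(B, C, H, W)
    fg = torch.randn(B, C, H, W)
    mask = torch.zeros(B, 1, H, W)
    mask[..., 16:48, 16:48] = 1.0            # hypothetical object placement box

    z = adaptive_latent_manipulation(bg, fg, mask)

    # A toy energy: pull the composite latent (inside the mask) back toward the
    # foreground latent to preserve object identity. The paper uses designated
    # energy functions; this stand-in is only for illustration.
    def toy_energy(latent):
        return ((mask * (latent - fg)) ** 2).mean()

    for _ in range(5):
        z = energy_guided_step(z, toy_energy)
    print(z.shape)

In an actual denoising loop, a step like energy_guided_step would be interleaved between sampler updates at selected timesteps, so the latent manipulation sets the starting point and the energy guidance keeps intermediate latents on track.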