{"title":"TALE: Training-free Cross-domain Image Composition via Adaptive Latent Manipulation and Energy-guided Optimization","authors":"Kien T. Pham, Jingye Chen, Qifeng Chen","doi":"arxiv-2408.03637","DOIUrl":null,"url":null,"abstract":"We present TALE, a novel training-free framework harnessing the generative\ncapabilities of text-to-image diffusion models to address the cross-domain\nimage composition task that focuses on flawlessly incorporating user-specified\nobjects into a designated visual contexts regardless of domain disparity.\nPrevious methods often involve either training auxiliary networks or finetuning\ndiffusion models on customized datasets, which are expensive and may undermine\nthe robust textual and visual priors of pre-trained diffusion models. Some\nrecent works attempt to break the barrier by proposing training-free\nworkarounds that rely on manipulating attention maps to tame the denoising\nprocess implicitly. However, composing via attention maps does not necessarily\nyield desired compositional outcomes. These approaches could only retain some\nsemantic information and usually fall short in preserving identity\ncharacteristics of input objects or exhibit limited background-object style\nadaptation in generated images. In contrast, TALE is a novel method that\noperates directly on latent space to provide explicit and effective guidance\nfor the composition process to resolve these problems. Specifically, we equip\nTALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided\nLatent Optimization. The former formulates noisy latents conducive to\ninitiating and steering the composition process by directly leveraging\nbackground and foreground latents at corresponding timesteps, and the latter\nexploits designated energy functions to further optimize intermediate latents\nconforming to specific conditions that complement the former to generate\ndesired final results. Our experiments demonstrate that TALE surpasses prior\nbaselines and attains state-of-the-art performance in image-guided composition\nacross various photorealistic and artistic domains.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"100 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.03637","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
We present TALE, a novel training-free framework harnessing the generative
capabilities of text-to-image diffusion models to address the cross-domain
image composition task, which focuses on flawlessly incorporating user-specified
objects into a designated visual context regardless of domain disparity.
Previous methods often involve either training auxiliary networks or fine-tuning
diffusion models on customized datasets, both of which are expensive and may undermine
the robust textual and visual priors of pre-trained diffusion models. Some
recent works attempt to break the barrier by proposing training-free
workarounds that rely on manipulating attention maps to tame the denoising
process implicitly. However, composing via attention maps does not necessarily
yield the desired compositional outcomes. These approaches retain only partial
semantic information and often fail to preserve the identity characteristics of
input objects, or exhibit limited background-object style adaptation in the
generated images. In contrast, TALE operates directly in the latent space,
providing explicit and effective guidance for the composition process and
resolving these problems. Specifically, we equip
TALE with two mechanisms dubbed Adaptive Latent Manipulation and Energy-guided
Latent Optimization. The former constructs noisy latents that initiate and
steer the composition process by directly leveraging background and foreground
latents at corresponding timesteps, while the latter exploits designated energy
functions to further optimize intermediate latents toward specific conditions,
complementing the former to produce the desired final results. Our experiments
demonstrate that TALE surpasses prior
baselines and attains state-of-the-art performance in image-guided composition
across various photorealistic and artistic domains.
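
To make the two mechanisms concrete, below is a minimal, hypothetical PyTorch sketch of what latent-space compositing and an energy-guided latent update could look like inside a standard denoising loop. All names here (adaptive_latent_manipulation, energy_guided_step, energy_fn, the mask convention) are illustrative assumptions based only on the abstract, not the authors' actual implementation.

    import torch

    def adaptive_latent_manipulation(bg_latent, fg_latent, mask):
        # Directly compose a noisy latent at the current timestep: keep the
        # background latent outside the object mask and inject the foreground
        # latent inside it (mask is 1 in the object region, 0 elsewhere).
        return bg_latent * (1.0 - mask) + fg_latent * mask

    def energy_guided_step(latent, energy_fn, step_size=0.1):
        # One energy-guided update: nudge the intermediate latent down the
        # gradient of a designated (differentiable) energy function so it
        # better satisfies a composition condition, e.g. semantic or
        # appearance alignment with the inputs.
        latent = latent.detach().requires_grad_(True)
        energy = energy_fn(latent)                   # scalar; lower is better
        (grad,) = torch.autograd.grad(energy, latent)
        return (latent - step_size * grad).detach()

    # Hypothetical usage at one timestep of a denoising loop
    # (scheduler and U-Net calls omitted):
    # z_t = adaptive_latent_manipulation(bg_z_t, fg_z_t, mask)
    # z_t = energy_guided_step(z_t, energy_fn)

The point of such a design, as the abstract argues, is that operating on the latents themselves gives explicit control over the composition, whereas attention-map manipulation only steers the denoising process implicitly.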