Improving Text-guided Object Inpainting with Semantic Pre-inpainting

Yifu Chen, Jingwen Chen, Yingwei Pan, Yehao Li, Ting Yao, Zhineng Chen, Tao Mei
{"title":"Improving Text-guided Object Inpainting with Semantic Pre-inpainting","authors":"Yifu Chen, Jingwen Chen, Yingwei Pan, Yehao Li, Ting Yao, Zhineng Chen, Tao Mei","doi":"arxiv-2409.08260","DOIUrl":null,"url":null,"abstract":"Recent years have witnessed the success of large text-to-image diffusion\nmodels and their remarkable potential to generate high-quality images. The\nfurther pursuit of enhancing the editability of images has sparked significant\ninterest in the downstream task of inpainting a novel object described by a\ntext prompt within a designated region in the image. Nevertheless, the problem\nis not trivial from two aspects: 1) Solely relying on one single U-Net to align\ntext prompt and visual object across all the denoising timesteps is\ninsufficient to generate desired objects; 2) The controllability of object\ngeneration is not guaranteed in the intricate sampling space of diffusion\nmodel. In this paper, we propose to decompose the typical single-stage object\ninpainting into two cascaded processes: 1) semantic pre-inpainting that infers\nthe semantic features of desired objects in a multi-modal feature space; 2)\nhigh-fieldity object generation in diffusion latent space that pivots on such\ninpainted semantic features. To achieve this, we cascade a Transformer-based\nsemantic inpainter and an object inpainting diffusion model, leading to a novel\nCAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object\ninpainting. Technically, the semantic inpainter is trained to predict the\nsemantic features of the target object conditioning on unmasked context and\ntext prompt. The outputs of the semantic inpainter then act as the informative\nvisual prompts to guide high-fieldity object generation through a reference\nadapter layer, leading to controllable object inpainting. Extensive evaluations\non OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion against\nthe state-of-the-art methods. Code is available at\n\\url{https://github.com/Nnn-s/CATdiffusion}.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"26 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.08260","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Recent years have witnessed the success of large text-to-image diffusion models and their remarkable potential to generate high-quality images. The further pursuit of enhancing the editability of images has sparked significant interest in the downstream task of inpainting a novel object described by a text prompt within a designated region of an image. Nevertheless, the problem is non-trivial in two respects: 1) relying solely on a single U-Net to align the text prompt with the visual object across all denoising timesteps is insufficient to generate the desired objects; 2) the controllability of object generation is not guaranteed in the intricate sampling space of the diffusion model. In this paper, we propose to decompose the typical single-stage object inpainting into two cascaded processes: 1) semantic pre-inpainting, which infers the semantic features of the desired object in a multi-modal feature space; 2) high-fidelity object generation in the diffusion latent space, which pivots on those inpainted semantic features. To achieve this, we cascade a Transformer-based semantic inpainter with an object-inpainting diffusion model, yielding a novel CAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object inpainting. Technically, the semantic inpainter is trained to predict the semantic features of the target object conditioned on the unmasked context and the text prompt. The outputs of the semantic inpainter then serve as informative visual prompts that guide high-fidelity object generation through a reference adapter layer, leading to controllable object inpainting. Extensive evaluations on OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion over state-of-the-art methods. Code is available at https://github.com/Nnn-s/CATdiffusion.
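To make the two-stage design concrete, below is a minimal PyTorch sketch of the pipeline the abstract describes: a Transformer-based semantic inpainter that fills in features for the masked region, and a reference adapter layer that injects those features into intermediate U-Net activations via cross-attention. All module names, feature dimensions, and the adapter wiring here are illustrative assumptions inferred from the abstract, not the authors' implementation (see the linked repository for that).

```python
# Hypothetical sketch of the two-stage CAT-Diffusion pipeline described in
# the abstract. Shapes, dimensions, and wiring are illustrative assumptions.
import torch
import torch.nn as nn


class SemanticInpainter(nn.Module):
    """Stage 1: predict semantic features of the masked object from the
    unmasked visual context and the text prompt (Transformer-based)."""

    def __init__(self, dim: int = 768, depth: int = 6, heads: int = 8):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, context_tokens, text_tokens, num_masked: int):
        # Stand in for the masked region with learnable mask tokens, then let
        # the Transformer fill them in, conditioned on context + prompt tokens.
        b = context_tokens.size(0)
        masked = self.mask_token.expand(b, num_masked, -1)
        tokens = torch.cat([context_tokens, masked, text_tokens], dim=1)
        out = self.encoder(tokens)
        start = context_tokens.size(1)
        return out[:, start:start + num_masked]  # inpainted semantic features


class ReferenceAdapter(nn.Module):
    """Stage 2 hook: cross-attend U-Net features to the inpainted semantic
    features so they steer denoising toward the desired object."""

    def __init__(self, unet_dim: int = 320, sem_dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            unet_dim, heads, kdim=sem_dim, vdim=sem_dim, batch_first=True)
        self.norm = nn.LayerNorm(unet_dim)

    def forward(self, unet_feats, semantic_feats):
        attended, _ = self.attn(unet_feats, semantic_feats, semantic_feats)
        return self.norm(unet_feats + attended)  # residual injection


# Usage: stage-1 features become the visual prompt for stage-2 denoising.
inpainter = SemanticInpainter()
adapter = ReferenceAdapter()
ctx = torch.randn(1, 196, 768)          # unmasked image tokens (e.g. CLIP patches)
txt = torch.randn(1, 77, 768)           # text prompt tokens
sem = inpainter(ctx, txt, num_masked=60)
unet_feats = torch.randn(1, 1024, 320)  # intermediate U-Net features
guided = adapter(unet_feats, sem)       # (1, 1024, 320), now object-aware
```

In this sketch, the stage-1 output supplements text conditioning as an additional visual prompt; cross-attending at a single U-Net resolution is a simplification, as an actual adapter would typically hook several layers of the denoising network.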