Improving Text-guided Object Inpainting with Semantic Pre-inpainting

Yifu Chen, Jingwen Chen, Yingwei Pan, Yehao Li, Ting Yao, Zhineng Chen, Tao Mei
{"title":"Improving Text-guided Object Inpainting with Semantic Pre-inpainting","authors":"Yifu Chen, Jingwen Chen, Yingwei Pan, Yehao Li, Ting Yao, Zhineng Chen, Tao Mei","doi":"arxiv-2409.08260","DOIUrl":null,"url":null,"abstract":"Recent years have witnessed the success of large text-to-image diffusion\nmodels and their remarkable potential to generate high-quality images. The\nfurther pursuit of enhancing the editability of images has sparked significant\ninterest in the downstream task of inpainting a novel object described by a\ntext prompt within a designated region in the image. Nevertheless, the problem\nis not trivial from two aspects: 1) Solely relying on one single U-Net to align\ntext prompt and visual object across all the denoising timesteps is\ninsufficient to generate desired objects; 2) The controllability of object\ngeneration is not guaranteed in the intricate sampling space of diffusion\nmodel. In this paper, we propose to decompose the typical single-stage object\ninpainting into two cascaded processes: 1) semantic pre-inpainting that infers\nthe semantic features of desired objects in a multi-modal feature space; 2)\nhigh-fieldity object generation in diffusion latent space that pivots on such\ninpainted semantic features. To achieve this, we cascade a Transformer-based\nsemantic inpainter and an object inpainting diffusion model, leading to a novel\nCAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object\ninpainting. Technically, the semantic inpainter is trained to predict the\nsemantic features of the target object conditioning on unmasked context and\ntext prompt. The outputs of the semantic inpainter then act as the informative\nvisual prompts to guide high-fieldity object generation through a reference\nadapter layer, leading to controllable object inpainting. Extensive evaluations\non OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion against\nthe state-of-the-art methods. Code is available at\n\\url{https://github.com/Nnn-s/CATdiffusion}.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"26 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.08260","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Recent years have witnessed the success of large text-to-image diffusion models and their remarkable potential to generate high-quality images. The further pursuit of enhancing the editability of images has sparked significant interest in the downstream task of inpainting a novel object described by a text prompt within a designated region of an image. Nevertheless, the problem is non-trivial in two respects: 1) relying solely on a single U-Net to align the text prompt with the visual object across all denoising timesteps is insufficient to generate the desired objects; 2) the controllability of object generation is not guaranteed in the intricate sampling space of the diffusion model. In this paper, we propose to decompose the typical single-stage object inpainting into two cascaded processes: 1) semantic pre-inpainting, which infers the semantic features of the desired object in a multi-modal feature space; 2) high-fidelity object generation in the diffusion latent space, which pivots on those inpainted semantic features. To achieve this, we cascade a Transformer-based semantic inpainter with an object-inpainting diffusion model, yielding a novel CAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object inpainting. Technically, the semantic inpainter is trained to predict the semantic features of the target object conditioned on the unmasked context and the text prompt. The outputs of the semantic inpainter then serve as informative visual prompts that guide high-fidelity object generation through a reference adapter layer, leading to controllable object inpainting. Extensive evaluations on OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion over state-of-the-art methods. Code is available at https://github.com/Nnn-s/CATdiffusion.
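To make the two-stage design concrete, below is a minimal PyTorch sketch of the pipeline the abstract describes: a Transformer-based semantic inpainter that fills in features for the masked region, and a reference adapter layer that injects those features into intermediate U-Net activations via cross-attention. All module names, feature dimensions, and the adapter wiring here are illustrative assumptions inferred from the abstract, not the authors' implementation (see the linked repository for that).

```python
# Hypothetical sketch of the two-stage CAT-Diffusion pipeline described in
# the abstract. Shapes, dimensions, and wiring are illustrative assumptions.
import torch
import torch.nn as nn


class SemanticInpainter(nn.Module):
    """Stage 1: predict semantic features of the masked object from the
    unmasked visual context and the text prompt (Transformer-based)."""

    def __init__(self, dim: int = 768, depth: int = 6, heads: int = 8):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, context_tokens, text_tokens, num_masked: int):
        # Stand in for the masked region with learnable mask tokens, then let
        # the Transformer fill them in, conditioned on context + prompt tokens.
        b = context_tokens.size(0)
        masked = self.mask_token.expand(b, num_masked, -1)
        tokens = torch.cat([context_tokens, masked, text_tokens], dim=1)
        out = self.encoder(tokens)
        start = context_tokens.size(1)
        return out[:, start:start + num_masked]  # inpainted semantic features


class ReferenceAdapter(nn.Module):
    """Stage 2 hook: cross-attend U-Net features to the inpainted semantic
    features so they steer denoising toward the desired object."""

    def __init__(self, unet_dim: int = 320, sem_dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            unet_dim, heads, kdim=sem_dim, vdim=sem_dim, batch_first=True)
        self.norm = nn.LayerNorm(unet_dim)

    def forward(self, unet_feats, semantic_feats):
        attended, _ = self.attn(unet_feats, semantic_feats, semantic_feats)
        return self.norm(unet_feats + attended)  # residual injection


# Usage: stage-1 features become the visual prompt for stage-2 denoising.
inpainter = SemanticInpainter()
adapter = ReferenceAdapter()
ctx = torch.randn(1, 196, 768)          # unmasked image tokens (e.g. CLIP patches)
txt = torch.randn(1, 77, 768)           # text prompt tokens
sem = inpainter(ctx, txt, num_masked=60)
unet_feats = torch.randn(1, 1024, 320)  # intermediate U-Net features
guided = adapter(unet_feats, sem)       # (1, 1024, 320), now object-aware
```

In this sketch, the stage-1 output supplements text conditioning as an additional visual prompt; cross-attending at a single U-Net resolution is a simplification, as an actual adapter would typically hook several layers of the denoising network.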