EZIGen: Enhancing zero-shot subject-driven image generation with precise subject encoding and decoupled guidance

Zicheng Duan, Yuxuan Ding, Chenhui Gou, Ziqin Zhou, Ethan Smith, Lingqiao Liu
Source: arXiv - CS - Computer Vision and Pattern Recognition
Published: 2024-09-12
DOI: arxiv-2409.08091 (https://doi.org/arxiv-2409.08091)
Citations: 0

Abstract

Zero-shot subject-driven image generation aims to produce images that incorporate a subject from a given example image. The challenge lies in preserving the subject's identity while aligning with the text prompt, which often requires modifying certain aspects of the subject's appearance. Despite advances in diffusion-model-based methods, existing approaches still struggle to balance identity preservation with text prompt alignment. In this study, we investigate this issue in depth and identify key insights for preserving identity effectively while keeping that balance. Our key findings are: (1) the design of the subject image encoder significantly impacts identity preservation quality, and (2) generating an initial layout is crucial for both text alignment and identity preservation. Building on these insights, we introduce a new approach called EZIGen, which employs two main strategies: a carefully crafted subject image encoder, based on the UNet architecture of the pretrained Stable Diffusion model, to ensure high-quality identity transfer; and a process that decouples the guidance stages and iteratively refines the initial image layout. Through these strategies, EZIGen achieves state-of-the-art results on multiple subject-driven benchmarks with a unified model and 100 times less training data.
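
The abstract gives no implementation details, so the following is only a toy sketch of the control flow that "decoupled guidance stages" and "iterative refinement of an initial layout" might suggest. It is not the authors' code. Every name here (`denoise_step`, `add_noise`, `generate`, `subject_mask`, the step counts and strengths) is a hypothetical stand-in, and plain lists of floats replace real image latents and the UNet.

```python
import random


def denoise_step(latent, target, strength):
    """Toy stand-in for one denoising step: move each value toward its target."""
    return [x + strength * (t - x) for x, t in zip(latent, target)]


def add_noise(latent, scale, rng):
    """Perturb the latent so that a refinement round has something to correct."""
    return [x + rng.gauss(0.0, scale) for x in latent]


def generate(text_target, subject_target, subject_mask,
             layout_steps=10, transfer_steps=10, refine_rounds=3, seed=0):
    """Hypothetical two-stage loop (an assumption, not the published algorithm).

    Stage 1 uses text guidance only, to settle an initial layout.
    Stage 2 adds subject guidance at the positions flagged in `subject_mask`
    and is repeated `refine_rounds` times, re-noising before each round.
    """
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in text_target]

    # Stage 1: layout from the text prompt alone.
    for _ in range(layout_steps):
        latent = denoise_step(latent, text_target, 0.3)

    # Stage 2: subject appearance where the mask is set, text guidance elsewhere.
    guided = [s if m else t
              for s, t, m in zip(subject_target, text_target, subject_mask)]
    for _ in range(refine_rounds):
        latent = add_noise(latent, 0.1, rng)
        for _ in range(transfer_steps):
            latent = denoise_step(latent, guided, 0.3)
    return latent


if __name__ == "__main__":
    text = [1.0, 1.0, 1.0, 1.0]        # toy "prompt" target
    subject = [-1.0, -1.0, -1.0, -1.0]  # toy "subject" target
    mask = [0, 1, 1, 0]                 # positions the subject occupies
    print([round(v, 2) for v in generate(text, subject, mask)])
```

In this toy setting the masked positions end up near the subject values and the remaining positions near the text values. The point is only that the two guidance signals are applied in separate stages rather than mixed in a single pass.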