FastComposer: Tuning-Free Multi-subject Image Generation with Localized Attention

IF 11.6 · CAS Tier 2 (Computer Science) · Q1 (Computer Science, Artificial Intelligence) · International Journal of Computer Vision · Pub Date: 2024-09-19 · DOI: 10.1007/s11263-024-02227-z
Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, Song Han
{"title":"FastComposer: Tuning-Free Multi-subject Image Generation with Localized Attention","authors":"Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, Song Han","doi":"10.1007/s11263-024-02227-z","DOIUrl":null,"url":null,"abstract":"<p>Diffusion models excel at text-to-image generation, especially in subject-driven generation for personalized images. However, existing methods are inefficient due to the subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation as they often blend identity among subjects. We present FastComposer which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation based on subject images and textual instructions <i>with only forward passes</i>. To address the identity blending problem in the multi-subject generation, FastComposer proposes <i>cross-attention localization</i> supervision during training, enforcing the attention of reference subjects localized to the correct regions in the target images. Naively conditioning on subject embeddings results in subject overfitting. FastComposer proposes <i>delayed subject conditioning</i> in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves 300<span>\\(\\times \\)</span>–2500<span>\\(\\times \\)</span> speedup compared to fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. Code, model, and dataset are available here (https://github.com/mit-han-lab/fastcomposer).</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":null,"pages":null},"PeriodicalIF":11.6000,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Computer Vision","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11263-024-02227-z","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Diffusion models excel at text-to-image generation, especially at subject-driven generation for personalized images. However, existing methods are inefficient because they rely on subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation because they often blend identities among subjects. We present FastComposer, which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation from subject images and textual instructions with only forward passes. To address identity blending in multi-subject generation, FastComposer introduces cross-attention localization supervision during training, which constrains each reference subject's attention to the correct region of the target image. Naively conditioning on subject embeddings causes subject overfitting, so FastComposer also introduces delayed subject conditioning in the denoising process to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals in different styles, actions, and contexts. It achieves a 300×–2500× speedup over fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, high-quality multi-subject image creation. Code, model, and dataset are available at https://github.com/mit-han-lab/fastcomposer.
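The two mechanisms named in the abstract are easiest to see in code. Below is a minimal PyTorch-style sketch, not the authors' implementation: the `unet`/`scheduler` interface is assumed to follow the common diffusers convention, `augmented_embeds` stands for the text embedding sequence with subject tokens replaced by image-encoder subject embeddings, and the exact form of the localization loss is an assumption; consult the linked repository for the real code.

```python
# Illustrative sketch of FastComposer's two key ideas (not the authors' code).
# Assumptions: `unet` and `scheduler` follow a diffusers-style interface;
# all array arguments are PyTorch tensors.
import torch


def denoise_with_delayed_conditioning(unet, scheduler, latents,
                                      text_embeds, augmented_embeds,
                                      alpha=0.2):
    """Delayed subject conditioning: run the first `alpha` fraction of
    denoising steps with plain text conditioning (early steps fix the
    layout, keeping the prompt editable), then switch to the
    subject-augmented conditioning (which injects identity)."""
    switch_step = int(alpha * len(scheduler.timesteps))
    for i, t in enumerate(scheduler.timesteps):
        cond = text_embeds if i < switch_step else augmented_embeds
        noise_pred = unet(latents, t, encoder_hidden_states=cond).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents


def localization_loss(cross_attn_map, subject_masks, subject_token_idx):
    """Cross-attention localization (one plausible form of the paper's
    supervision): for each subject token, reward attention inside that
    subject's segmentation mask and penalize attention outside it, so
    identities stay in their own image regions.

    cross_attn_map:    (H, W, num_tokens) cross-attention scores
    subject_masks:     list of (H, W) binary float masks, one per subject
    subject_token_idx: prompt position of each subject's token
    """
    total = 0.0
    for mask, idx in zip(subject_masks, subject_token_idx):
        attn = cross_attn_map[..., idx]          # (H, W) map for this token
        inside = (attn * mask).sum() / mask.sum().clamp(min=1.0)
        outside = (attn * (1.0 - mask)).sum() / (1.0 - mask).sum().clamp(min=1.0)
        total = total + (outside - inside)       # want high inside, low outside
    return total / max(len(subject_token_idx), 1)
```

During training, a loss of this shape would be added to the standard denoising objective with a small weight; at inference time only the delayed-conditioning loop applies.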


Source journal: International Journal of Computer Vision (Engineering & Technology / Computer Science: Artificial Intelligence)
CiteScore: 29.80
Self-citation rate: 2.10%
Articles per year: 163
Review time: 6 months
Journal description: The International Journal of Computer Vision (IJCV) is a platform for new research findings in the rapidly growing field of computer vision. It publishes 12 issues annually, presenting high-quality, original contributions to the science and engineering of computer vision. Regular articles (up to 25 journal pages) present significant technical advances of broad interest to the field. Short articles (up to 10 pages) offer a faster publication path for novel research results. Survey articles (up to 30 pages) critically evaluate the current state of the art or give tutorial presentations of relevant topics. The journal also publishes book reviews, position papers, and editorials by prominent scientific figures, and encourages authors to include supplementary material online, such as images, video sequences, data sets, and software, to improve the understanding and reproducibility of the published research.
Latest articles from this journal:
MVTN: Learning Multi-view Transformations for 3D Understanding
Adaptive Middle Modality Alignment Learning for Visible-Infrared Person Re-identification
Day2Dark: Pseudo-Supervised Activity Recognition Beyond Silent Daylight
Rethinking Contemporary Deep Learning Techniques for Error Correction in Biometric Data
EfficientDeRain+: Learning Uncertainty-Aware Filtering via RainMix Augmentation for High-Efficiency Deraining