IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation

Yinwei Wu, Xianpan Zhou, Bing Ma, Xuefeng Su, Kai Ma, Xinchao Wang
arXiv:2409.08240 · arXiv - CS - Computer Vision and Pattern Recognition
Published: 2024-09-12 · Citations: 0
DOI: https://doi.org/arxiv-2409.08240

Abstract

While Text-to-Image (T2I) diffusion models excel at generating visually appealing images of individual instances, they struggle to accurately position and control the feature generation of multiple instances. The Layout-to-Image (L2I) task was introduced to address the positioning challenges by incorporating bounding boxes as spatial control signals, but it still falls short in generating precise instance features. In response, we propose the Instance Feature Generation (IFG) task, which aims to ensure both positional accuracy and feature fidelity in generated instances. To address the IFG task, we introduce the Instance Feature Adapter (IFAdapter). The IFAdapter enhances feature depiction by incorporating additional appearance tokens and utilizing an Instance Semantic Map to align instance-level features with spatial locations. The IFAdapter guides the diffusion process as a plug-and-play module, making it adaptable to various community models. For evaluation, we contribute an IFG benchmark and develop a verification pipeline to objectively compare models' abilities to generate instances with accurate positioning and features. Experimental results demonstrate that IFAdapter outperforms other models in both quantitative and qualitative evaluations.
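The abstract describes the Instance Semantic Map only at a high level. As a rough, hypothetical sketch of the underlying idea (rasterizing each instance's bounding box onto a latent-resolution grid so that per-location features can be tied to specific instances), one might write the following. The function name, grid size, and the smaller-box-wins overlap rule are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def instance_semantic_map(boxes, grid=16):
    """Rasterize normalized (x0, y0, x1, y1) boxes onto a grid x grid map.

    Returns an integer map in which cell (i, j) holds the 1-based index of
    the instance covering it (0 = background). Overlaps are resolved in
    favor of the smaller box, so small instances are not swallowed by
    large ones. This overlap rule is an assumption for illustration only.
    """
    areas = [(b[2] - b[0]) * (b[3] - b[1]) for b in boxes]
    # Paint larger boxes first so smaller ones overwrite them on overlap.
    order = sorted(range(len(boxes)), key=lambda k: -areas[k])
    m = np.zeros((grid, grid), dtype=np.int64)
    for k in order:
        x0, y0, x1, y1 = boxes[k]
        r0 = int(y0 * grid)
        r1 = max(r0 + 1, int(np.ceil(y1 * grid)))  # cover at least one row
        c0 = int(x0 * grid)
        c1 = max(c0 + 1, int(np.ceil(x1 * grid)))  # cover at least one col
        m[r0:r1, c0:c1] = k + 1
    return m
```

Painting larger boxes first means a smaller overlapping instance overwrites them and keeps its own label; the paper may resolve overlaps differently.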
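The abstract also does not specify how the "plug-and-play" guidance enters the diffusion process. One plausible reading is a residual injection of per-instance appearance embeddings at the spatial locations the semantic map assigns to each instance. The sketch below is a minimal illustration under that assumption and should not be read as IFAdapter's actual architecture:

```python
import numpy as np

def inject_instance_features(features, inst_map, inst_embed, scale=1.0):
    """Add each instance's appearance embedding to the feature cells it
    occupies, leaving background cells untouched.

    features:   (H, W, C) latent feature map
    inst_map:   (H, W) integer map, 0 = background, k = instance k
    inst_embed: (N, C) one appearance embedding per instance
    """
    out = features.copy()
    for k in range(inst_embed.shape[0]):
        mask = inst_map == (k + 1)          # cells owned by instance k
        out[mask] += scale * inst_embed[k]  # residual, so it is removable
    return out
```

Keeping the injection residual (and scalable via `scale`) is a common adapter design choice: the base model's behavior is recovered exactly when the adapter's contribution is zeroed out.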