CustomContrast: A Multilevel Contrastive Perspective For Subject-Driven Text-to-Image Customization

Nan Chen, Mengqi Huang, Zhuowei Chen, Yang Zheng, Lei Zhang, Zhendong Mao
{"title":"自定义对比:多层次对比视角,实现主题驱动的文本到图像定制","authors":"Nan Chen, Mengqi Huang, Zhuowei Chen, Yang Zheng, Lei Zhang, Zhendong Mao","doi":"arxiv-2409.05606","DOIUrl":null,"url":null,"abstract":"Subject-driven text-to-image (T2I) customization has drawn significant\ninterest in academia and industry. This task enables pre-trained models to\ngenerate novel images based on unique subjects. Existing studies adopt a\nself-reconstructive perspective, focusing on capturing all details of a single\nimage, which will misconstrue the specific image's irrelevant attributes (e.g.,\nview, pose, and background) as the subject intrinsic attributes. This\nmisconstruction leads to both overfitting or underfitting of irrelevant and\nintrinsic attributes of the subject, i.e., these attributes are\nover-represented or under-represented simultaneously, causing a trade-off\nbetween similarity and controllability. In this study, we argue an ideal\nsubject representation can be achieved by a cross-differential perspective,\ni.e., decoupling subject intrinsic attributes from irrelevant attributes via\ncontrastive learning, which allows the model to focus more on intrinsic\nattributes through intra-consistency (features of the same subject are\nspatially closer) and inter-distinctiveness (features of different subjects\nhave distinguished differences). Specifically, we propose CustomContrast, a\nnovel framework, which includes a Multilevel Contrastive Learning (MCL)\nparadigm and a Multimodal Feature Injection (MFI) Encoder. The MCL paradigm is\nused to extract intrinsic features of subjects from high-level semantics to\nlow-level appearance through crossmodal semantic contrastive learning and\nmultiscale appearance contrastive learning. To facilitate contrastive learning,\nwe introduce the MFI encoder to capture cross-modal representations. Extensive\nexperiments show the effectiveness of CustomContrast in subject similarity and\ntext controllability.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"44 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CustomContrast: A Multilevel Contrastive Perspective For Subject-Driven Text-to-Image Customization\",\"authors\":\"Nan Chen, Mengqi Huang, Zhuowei Chen, Yang Zheng, Lei Zhang, Zhendong Mao\",\"doi\":\"arxiv-2409.05606\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Subject-driven text-to-image (T2I) customization has drawn significant\\ninterest in academia and industry. This task enables pre-trained models to\\ngenerate novel images based on unique subjects. Existing studies adopt a\\nself-reconstructive perspective, focusing on capturing all details of a single\\nimage, which will misconstrue the specific image's irrelevant attributes (e.g.,\\nview, pose, and background) as the subject intrinsic attributes. This\\nmisconstruction leads to both overfitting or underfitting of irrelevant and\\nintrinsic attributes of the subject, i.e., these attributes are\\nover-represented or under-represented simultaneously, causing a trade-off\\nbetween similarity and controllability. 
In this study, we argue an ideal\\nsubject representation can be achieved by a cross-differential perspective,\\ni.e., decoupling subject intrinsic attributes from irrelevant attributes via\\ncontrastive learning, which allows the model to focus more on intrinsic\\nattributes through intra-consistency (features of the same subject are\\nspatially closer) and inter-distinctiveness (features of different subjects\\nhave distinguished differences). Specifically, we propose CustomContrast, a\\nnovel framework, which includes a Multilevel Contrastive Learning (MCL)\\nparadigm and a Multimodal Feature Injection (MFI) Encoder. The MCL paradigm is\\nused to extract intrinsic features of subjects from high-level semantics to\\nlow-level appearance through crossmodal semantic contrastive learning and\\nmultiscale appearance contrastive learning. To facilitate contrastive learning,\\nwe introduce the MFI encoder to capture cross-modal representations. Extensive\\nexperiments show the effectiveness of CustomContrast in subject similarity and\\ntext controllability.\",\"PeriodicalId\":501480,\"journal\":{\"name\":\"arXiv - CS - Multimedia\",\"volume\":\"44 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Multimedia\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.05606\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.05606","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Subject-driven text-to-image (T2I) customization has drawn significant interest in academia and industry. This task enables pre-trained models to generate novel images based on unique subjects. Existing studies adopt a self-reconstructive perspective, focusing on capturing all details of a single image, which misconstrues the specific image's irrelevant attributes (e.g., view, pose, and background) as the subject's intrinsic attributes. This misconstruction leads to both overfitting and underfitting of the subject's irrelevant and intrinsic attributes, i.e., these attributes are over-represented and under-represented simultaneously, causing a trade-off between similarity and controllability. In this study, we argue that an ideal subject representation can be achieved by a cross-differential perspective, i.e., decoupling the subject's intrinsic attributes from irrelevant attributes via contrastive learning, which allows the model to focus more on intrinsic attributes through intra-consistency (features of the same subject are spatially closer) and inter-distinctiveness (features of different subjects are clearly distinguished). Specifically, we propose CustomContrast, a novel framework comprising a Multilevel Contrastive Learning (MCL) paradigm and a Multimodal Feature Injection (MFI) encoder. The MCL paradigm extracts intrinsic features of subjects, from high-level semantics to low-level appearance, through cross-modal semantic contrastive learning and multiscale appearance contrastive learning. To facilitate contrastive learning, we introduce the MFI encoder to capture cross-modal representations. Extensive experiments demonstrate the effectiveness of CustomContrast in subject similarity and text controllability.
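
The abstract describes the contrastive objective only at a high level. As a rough illustration of intra-consistency and inter-distinctiveness, the sketch below implements a generic InfoNCE-style contrastive loss in PyTorch; the function name, tensor shapes, and temperature are assumptions made for illustration and are not taken from the paper or its released code.

```python
# Illustrative sketch only: a generic single-level contrastive loss of the
# kind the abstract describes. All names and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def subject_contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style loss encouraging intra-consistency and inter-distinctiveness.

    anchor:    (D,) feature of a subject from one view/caption.
    positive:  (D,) feature of the *same* subject under a different view,
               pose, or background (intra-consistency target).
    negatives: (N, D) features of *different* subjects
               (inter-distinctiveness targets).
    """
    # Cosine similarity is the usual choice for contrastive objectives.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_sim = anchor @ positive / temperature   # scalar: same-subject similarity
    neg_sim = negatives @ anchor / temperature  # (N,): cross-subject similarities

    # Treat the positive as class 0 among (1 + N) candidates: minimizing the
    # loss pulls same-subject features together and pushes different subjects apart.
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim]).unsqueeze(0)  # (1, 1+N)
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))
```

Per the abstract, CustomContrast applies such objectives at multiple levels (cross-modal semantics and multiscale appearance); the sketch shows only the generic single-level form of the idea.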