CustomContrast: A Multilevel Contrastive Perspective For Subject-Driven Text-to-Image Customization

Nan Chen, Mengqi Huang, Zhuowei Chen, Yang Zheng, Lei Zhang, Zhendong Mao
{"title":"自定义对比:多层次对比视角,实现主题驱动的文本到图像定制","authors":"Nan Chen, Mengqi Huang, Zhuowei Chen, Yang Zheng, Lei Zhang, Zhendong Mao","doi":"arxiv-2409.05606","DOIUrl":null,"url":null,"abstract":"Subject-driven text-to-image (T2I) customization has drawn significant\ninterest in academia and industry. This task enables pre-trained models to\ngenerate novel images based on unique subjects. Existing studies adopt a\nself-reconstructive perspective, focusing on capturing all details of a single\nimage, which will misconstrue the specific image's irrelevant attributes (e.g.,\nview, pose, and background) as the subject intrinsic attributes. This\nmisconstruction leads to both overfitting or underfitting of irrelevant and\nintrinsic attributes of the subject, i.e., these attributes are\nover-represented or under-represented simultaneously, causing a trade-off\nbetween similarity and controllability. In this study, we argue an ideal\nsubject representation can be achieved by a cross-differential perspective,\ni.e., decoupling subject intrinsic attributes from irrelevant attributes via\ncontrastive learning, which allows the model to focus more on intrinsic\nattributes through intra-consistency (features of the same subject are\nspatially closer) and inter-distinctiveness (features of different subjects\nhave distinguished differences). Specifically, we propose CustomContrast, a\nnovel framework, which includes a Multilevel Contrastive Learning (MCL)\nparadigm and a Multimodal Feature Injection (MFI) Encoder. The MCL paradigm is\nused to extract intrinsic features of subjects from high-level semantics to\nlow-level appearance through crossmodal semantic contrastive learning and\nmultiscale appearance contrastive learning. To facilitate contrastive learning,\nwe introduce the MFI encoder to capture cross-modal representations. Extensive\nexperiments show the effectiveness of CustomContrast in subject similarity and\ntext controllability.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"44 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CustomContrast: A Multilevel Contrastive Perspective For Subject-Driven Text-to-Image Customization\",\"authors\":\"Nan Chen, Mengqi Huang, Zhuowei Chen, Yang Zheng, Lei Zhang, Zhendong Mao\",\"doi\":\"arxiv-2409.05606\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Subject-driven text-to-image (T2I) customization has drawn significant\\ninterest in academia and industry. This task enables pre-trained models to\\ngenerate novel images based on unique subjects. Existing studies adopt a\\nself-reconstructive perspective, focusing on capturing all details of a single\\nimage, which will misconstrue the specific image's irrelevant attributes (e.g.,\\nview, pose, and background) as the subject intrinsic attributes. This\\nmisconstruction leads to both overfitting or underfitting of irrelevant and\\nintrinsic attributes of the subject, i.e., these attributes are\\nover-represented or under-represented simultaneously, causing a trade-off\\nbetween similarity and controllability. 
In this study, we argue an ideal\\nsubject representation can be achieved by a cross-differential perspective,\\ni.e., decoupling subject intrinsic attributes from irrelevant attributes via\\ncontrastive learning, which allows the model to focus more on intrinsic\\nattributes through intra-consistency (features of the same subject are\\nspatially closer) and inter-distinctiveness (features of different subjects\\nhave distinguished differences). Specifically, we propose CustomContrast, a\\nnovel framework, which includes a Multilevel Contrastive Learning (MCL)\\nparadigm and a Multimodal Feature Injection (MFI) Encoder. The MCL paradigm is\\nused to extract intrinsic features of subjects from high-level semantics to\\nlow-level appearance through crossmodal semantic contrastive learning and\\nmultiscale appearance contrastive learning. To facilitate contrastive learning,\\nwe introduce the MFI encoder to capture cross-modal representations. Extensive\\nexperiments show the effectiveness of CustomContrast in subject similarity and\\ntext controllability.\",\"PeriodicalId\":501480,\"journal\":{\"name\":\"arXiv - CS - Multimedia\",\"volume\":\"44 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Multimedia\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.05606\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.05606","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Subject-driven text-to-image (T2I) customization has drawn significant interest in academia and industry. This task enables pre-trained models to generate novel images based on unique subjects. Existing studies adopt a self-reconstructive perspective, focusing on capturing all details of a single image, which misconstrues the specific image's irrelevant attributes (e.g., view, pose, and background) as the subject's intrinsic attributes. This misconstruction leads to both overfitting and underfitting of the subject's irrelevant and intrinsic attributes, i.e., these attributes are over-represented and under-represented simultaneously, causing a trade-off between similarity and controllability. In this study, we argue that an ideal subject representation can be achieved by a cross-differential perspective, i.e., decoupling the subject's intrinsic attributes from irrelevant attributes via contrastive learning, which allows the model to focus more on intrinsic attributes through intra-consistency (features of the same subject are spatially closer) and inter-distinctiveness (features of different subjects are clearly distinguished). Specifically, we propose CustomContrast, a novel framework comprising a Multilevel Contrastive Learning (MCL) paradigm and a Multimodal Feature Injection (MFI) encoder. The MCL paradigm extracts intrinsic features of subjects, from high-level semantics to low-level appearance, through cross-modal semantic contrastive learning and multiscale appearance contrastive learning. To facilitate contrastive learning, we introduce the MFI encoder to capture cross-modal representations. Extensive experiments demonstrate the effectiveness of CustomContrast in subject similarity and text controllability.
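
The abstract describes the contrastive objective only at a high level. As a rough illustration of intra-consistency and inter-distinctiveness, the sketch below implements a generic InfoNCE-style contrastive loss in PyTorch; the function name, tensor shapes, and temperature are assumptions made for illustration and are not taken from the paper or its released code.

```python
# Illustrative sketch only: a generic single-level contrastive loss of the
# kind the abstract describes. All names and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def subject_contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style loss encouraging intra-consistency and inter-distinctiveness.

    anchor:    (D,) feature of a subject from one view/caption.
    positive:  (D,) feature of the *same* subject under a different view,
               pose, or background (intra-consistency target).
    negatives: (N, D) features of *different* subjects
               (inter-distinctiveness targets).
    """
    # Cosine similarity is the usual choice for contrastive objectives.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_sim = anchor @ positive / temperature   # scalar: same-subject similarity
    neg_sim = negatives @ anchor / temperature  # (N,): cross-subject similarities

    # Treat the positive as class 0 among (1 + N) candidates: minimizing the
    # loss pulls same-subject features together and pushes different subjects apart.
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim]).unsqueeze(0)  # (1, 1+N)
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))
```

Per the abstract, CustomContrast applies such objectives at multiple levels (cross-modal semantics and multiscale appearance); the sketch shows only the generic single-level form of the idea.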