FICE: Text-conditioned fashion-image editing with guided GAN inversion

IF 7.6 | Zone 1, Computer Science | Q1, Computer Science, Artificial Intelligence | Pattern Recognition, vol. 158, Article 111022 | Pub Date: 2024-09-14 | DOI: 10.1016/j.patcog.2024.111022

Martin Pernuš, Clinton Fookes, Vitomir Štruc, Simon Dobrišek



Abstract

Fashion-image editing is a challenging computer-vision task in which the goal is to incorporate selected apparel into a given input image. Most existing techniques, known as virtual try-on methods, approach this task by first selecting an example image of the desired apparel and then transferring the clothing onto the target person. In contrast, in this paper we consider editing fashion images with text descriptions. Such an approach has several advantages over example-based virtual try-on techniques: (i) it does not require an image of the target fashion item, and (ii) it allows a wide variety of visual concepts to be expressed in natural language. Existing image-editing methods that work with language inputs either require training sets with rich attribute annotations or can handle only simple text descriptions. We address these constraints by proposing a novel text-conditioned editing model called FICE (Fashion Image CLIP Editing) that can handle a wide variety of text descriptions to guide the editing procedure. Specifically, with FICE, we extend the common GAN-inversion process by including semantic, pose-related, and image-level constraints when generating images. We use the CLIP model to enforce the text-provided semantics because of its strong image–text association capabilities. Furthermore, we propose a latent-code regularization technique that provides better control over the fidelity of the synthesized images. We validate FICE through rigorous experiments on a combination of VITON images and Fashion-Gen text descriptions, comparing it with several state-of-the-art text-conditioned image-editing approaches. Experimental results demonstrate that FICE generates very realistic fashion images and leads to better editing than competing approaches. The source code is publicly available from: https://github.com/MartinPernus/FICE.
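
The idea of extending GAN inversion with extra constraints can be illustrated with a toy optimization loop. The sketch below is not the authors' implementation and is not taken from the FICE repository. It assumes a hypothetical linear "generator" and "encoder" in place of a real GAN and CLIP, finite-difference gradients in place of automatic differentiation, and made-up loss weights (`lam_img`, `lam_reg`). It omits the pose-related constraint and shows only how a semantic term, an image-level preservation term, and a latent-code regularizer might be combined into one objective over the latent code.

```python
import math

# All functions below are hypothetical stand-ins for illustration only.

def generator(w):
    """Toy generator: maps a 3-D latent code to a 4-pixel 'image'."""
    return [w[0] + w[1], w[1] - w[2], 0.5 * w[0], w[2]]

def embed(image):
    """Toy image encoder (stand-in for CLIP): 4 pixels -> 2-D embedding."""
    return [image[0] + image[1], image[2] + image[3]]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def total_loss(w, text_emb, source_img, keep_mask, w_mean,
               lam_img=1.0, lam_reg=0.1):
    """Weighted sum of the three illustrative terms (weights are assumed)."""
    img = generator(w)
    # Semantic term: pull the image embedding toward the text embedding.
    semantic = 1.0 - cosine(embed(img), text_emb)
    # Image-level term: keep masked pixels close to the source image.
    image_level = sum(m * (p - s) ** 2
                      for m, p, s in zip(keep_mask, img, source_img))
    # Latent regularizer: keep the code close to a reference (mean) code.
    latent_reg = sum((a - b) ** 2 for a, b in zip(w, w_mean))
    return semantic + lam_img * image_level + lam_reg * latent_reg

def numerical_grad(f, w, eps=1e-5):
    """Central finite differences; a real system would use autodiff."""
    grad = []
    for i in range(len(w)):
        hi, lo = w[:], w[:]
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

def invert(text_emb, source_img, keep_mask, w_mean, steps=200, lr=0.05):
    """Plain gradient descent on the latent code, starting from w_mean."""
    def objective(v):
        return total_loss(v, text_emb, source_img, keep_mask, w_mean)

    w = list(w_mean)
    history = [objective(w)]
    for _ in range(steps):
        g = numerical_grad(objective, w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
        history.append(objective(w))
    return w, history

if __name__ == "__main__":
    w_mean = [1.0, 0.5, 0.2]
    source = generator(w_mean)
    w_opt, hist = invert([0.0, 1.0], source, [1, 0, 0, 1], w_mean)
    print("loss before:", hist[0], "loss after:", hist[-1])
```

In this toy setup the mask `[1, 0, 0, 1]` marks the pixels that should stay unchanged, so the optimizer can only move the remaining pixels toward the text embedding, while the regularizer discourages latent codes far from the starting point.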


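Source journal

Pattern Recognition (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 14.40

Self-citation rate: 16.20%

Articles published: 683

Review time: 5.6 months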

About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.
