Conditioned Image Retrieval for Fashion using Contrastive Learning and CLIP-based Features
Alberto Baldrati, Marco Bertini, Tiberio Uricchio, Alberto del Bimbo
ACM Multimedia Asia, December 2021. DOI: 10.1145/3469877.3493593
Building on recent advances in multimodal zero-shot representation learning, in this paper we explore the use of features obtained from the recent CLIP model to perform conditioned image retrieval. Starting from a reference image and an additive textual description of what the user wants with respect to the reference image, we learn a Combiner network that is able to understand the image content, integrate the textual description, and provide a combined feature used to perform the conditioned image retrieval. Starting from the bare CLIP features and a simple baseline, we show that a carefully crafted Combiner network, based on such multimodal features, is extremely effective and outperforms more complex state-of-the-art approaches on the popular FashionIQ dataset.
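The core idea is to fuse the CLIP embedding of the reference image with the CLIP embedding of the modifying text, and to train the fusion module with a contrastive objective so that the combined feature lands close to the target image's embedding. The sketch below illustrates this pipeline in PyTorch; the layer sizes, fusion scheme, and loss temperature are illustrative assumptions rather than the paper's exact Combiner architecture, and random tensors stand in for precomputed CLIP features.

```python
# Minimal sketch of a CLIP-feature Combiner trained with a contrastive loss.
# Dimensions, layers, and the fusion scheme are illustrative assumptions,
# not the authors' published architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Combiner(nn.Module):
    """Fuses a reference-image CLIP feature with a modifying-text CLIP feature."""

    def __init__(self, clip_dim: int = 512, hidden_dim: int = 1024):
        super().__init__()
        self.image_proj = nn.Sequential(nn.Linear(clip_dim, hidden_dim), nn.ReLU())
        self.text_proj = nn.Sequential(nn.Linear(clip_dim, hidden_dim), nn.ReLU())
        self.mixer = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, clip_dim),
        )

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        # Concatenate the projected image and text features, then mix them
        # into a single query embedding; retrieval uses cosine similarity.
        fused = torch.cat([self.image_proj(img_feat), self.text_proj(txt_feat)], dim=-1)
        return F.normalize(self.mixer(fused), dim=-1)


def contrastive_loss(query: torch.Tensor, target: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: each combined query should match its own target image in the batch."""
    logits = query @ F.normalize(target, dim=-1).t() / tau
    labels = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    # Stand-ins for precomputed CLIP features: reference image, relative caption, target image.
    ref_img = torch.randn(32, 512)
    caption = torch.randn(32, 512)
    tgt_img = torch.randn(32, 512)

    combiner = Combiner()
    loss = contrastive_loss(combiner(ref_img, caption), tgt_img)
    loss.backward()
    print(f"toy contrastive loss: {loss.item():.4f}")
```

At retrieval time, the same combined query embedding would be compared by cosine similarity against the CLIP embeddings of all gallery images, ranking them for the conditioned query.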