CoVR-2:用于合成视频检索的自动数据构建。

Lucas Ventura;Antoine Yang;Cordelia Schmid;Gül Varol
{"title":"CoVR-2:用于合成视频检索的自动数据构建。","authors":"Lucas Ventura;Antoine Yang;Cordelia Schmid;Gül Varol","doi":"10.1109/TPAMI.2024.3463799","DOIUrl":null,"url":null,"abstract":"Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers \n<italic>both</i>\n text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR \n<italic>triplets</i>\n is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption \n<italic>pairs</i>\n, while also expanding the scope of the task to include Composed \n<italic>Video</i>\n Retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. We further validate that our methodology is equally applicable to image-caption pairs, by generating 3.3 million CoIR training triplets using the Conceptual Captions dataset. Our model builds on BLIP-2 pretraining, adapting it to composed video (or image) retrieval, and incorporates an additional caption retrieval loss to exploit extra supervision beyond the triplet, which is possible since captions are readily available for our training data by design. We provide extensive ablations to analyze the design choices on our new CoVR benchmark. Our experiments also demonstrate that training a CoVR model on our datasets effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on the CIRR, FashionIQ, and CIRCO benchmarks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"11409-11421"},"PeriodicalIF":0.0000,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CoVR-2: Automatic Data Construction for Composed Video Retrieval\",\"authors\":\"Lucas Ventura;Antoine Yang;Cordelia Schmid;Gül Varol\",\"doi\":\"10.1109/TPAMI.2024.3463799\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers \\n<italic>both</i>\\n text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR \\n<italic>triplets</i>\\n is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption \\n<italic>pairs</i>\\n, while also expanding the scope of the task to include Composed \\n<italic>Video</i>\\n Retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. We further validate that our methodology is equally applicable to image-caption pairs, by generating 3.3 million CoIR training triplets using the Conceptual Captions dataset. Our model builds on BLIP-2 pretraining, adapting it to composed video (or image) retrieval, and incorporates an additional caption retrieval loss to exploit extra supervision beyond the triplet, which is possible since captions are readily available for our training data by design. We provide extensive ablations to analyze the design choices on our new CoVR benchmark. Our experiments also demonstrate that training a CoVR model on our datasets effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on the CIRR, FashionIQ, and CIRCO benchmarks.\",\"PeriodicalId\":94034,\"journal\":{\"name\":\"IEEE transactions on pattern analysis and machine intelligence\",\"volume\":\"46 12\",\"pages\":\"11409-11421\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on pattern analysis and machine intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10685001/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10685001/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

合成图像检索(CoIR)作为一种同时考虑文本和图像查询,在数据库中搜索相关图像的任务,最近越来越受欢迎。大多数 CoIR 方法都需要人工标注数据集,包括图像-文本-图像三元组,其中文本描述了从查询图像到目标图像的修改。然而,人工标注 CoIR 三元组不仅成本高昂,而且不具备可扩展性。在这项工作中,我们提出了一种可扩展的自动数据集创建方法,该方法可根据视频-字幕配对生成三元组,同时还将任务范围扩展到组合视频检索(CoVR)。为此,我们从大型数据库中挖掘具有相似标题的配对视频,并利用大型语言模型生成相应的修改文本。将这一方法应用于广泛的 WebVid2M 数据集,我们自动构建了 WebVid-CoVR 数据集,产生了 160 万个三元组。此外,我们还为 CoVR 引入了一个新的基准,即人工标注的评估集和基准结果。通过使用概念字幕数据集生成 330 万个 CoIR 训练三元组,我们进一步验证了我们的方法同样适用于图像字幕对。我们的模型建立在 BLIP-2 预训练的基础上,使其适用于视频(或图像)检索,并加入了额外的字幕检索损失,以利用三元组之外的额外监督。我们在新的 CoVR 基准上提供了广泛的消减来分析设计选择。我们的实验还证明,在我们的数据集上训练的 CoVR 模型可以有效地转移到 CoIR 上,从而在 CIRR、FashionIQ 和 CIRCO 基准的零镜头设置中提高最先进的性能。我们的代码、数据集和模型可在 https://imagine.enpc.fr/ ventural/covr 上公开获取。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
CoVR-2: Automatic Data Construction for Composed Video Retrieval
Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs , while also expanding the scope of the task to include Composed Video Retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. We further validate that our methodology is equally applicable to image-caption pairs, by generating 3.3 million CoIR training triplets using the Conceptual Captions dataset. Our model builds on BLIP-2 pretraining, adapting it to composed video (or image) retrieval, and incorporates an additional caption retrieval loss to exploit extra supervision beyond the triplet, which is possible since captions are readily available for our training data by design. We provide extensive ablations to analyze the design choices on our new CoVR benchmark. Our experiments also demonstrate that training a CoVR model on our datasets effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on the CIRR, FashionIQ, and CIRCO benchmarks.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Language-Inspired Relation Transfer for Few-Shot Class-Incremental Learning. Multi-Modality Multi-Attribute Contrastive Pre-Training for Image Aesthetics Computing. 360SFUDA++: Towards Source-Free UDA for Panoramic Segmentation by Learning Reliable Category Prototypes. Anti-Forgetting Adaptation for Unsupervised Person Re-Identification. Evolved Hierarchical Masking for Self-Supervised Learning.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1