A Unified Framework based on Large-scale Momentum Contrastive Learning for Text-Video Retrieval

Eung-Ku Kim, Nahyun Lee, Yoon-Sik Cho
DOI: 10.1109/ICEIC57457.2023.10049967
Journal: 2023 International Conference on Electronics, Information, and Communication (ICEIC)
Publication date: 2023-02-05
Citations: 0

Abstract

Large-scale vision-language representation learning has driven dramatic advances in image-text retrieval. In particular, CLIP (Contrastive Language-Image Pre-training) [1] has contributed to this field by successfully pre-training on vast web multimedia data with a contrastive loss, one of the most competitive objectives for multimodal learning. CLIP4Clip [2] leveraged the pre-trained CLIP weights to learn video representations under textual supervision, with additional post-pretraining on HowTo100M [3]. However, from the perspective of contrastive learning, CLIP4Clip still has room for improvement: a sufficiently large pool of negative samples helps the model learn better representations, since the larger each training batch is, the more semantic representations the model can contrast against. Following this line of reasoning, we propose CwMTVR, which integrates Momentum Contrast (MoCo) [4], a cross-modal momentum contrastive learning framework, into CLIP4Clip. Experimental results show that the CwMTVR model achieves competitive results in text-video retrieval on the MSVD dataset.
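The key mechanism the abstract appeals to — decoupling the number of negatives from the batch size via a momentum-updated key encoder and a FIFO queue of past keys — can be sketched as follows. This is a minimal illustrative sketch in numpy, not the authors' CwMTVR implementation: the "encoders" are toy linear maps standing in for the CLIP text and video towers, and all names (`CrossModalMoCo`, `w_q`, `w_k`) are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

class CrossModalMoCo:
    """Toy sketch of MoCo-style cross-modal contrast: a momentum-updated
    key encoder plus a FIFO queue of past video keys lets the effective
    number of negatives exceed the current batch size."""

    def __init__(self, dim=8, queue_size=32, momentum=0.999,
                 temperature=0.07, seed=0):
        rng = np.random.default_rng(seed)
        # Toy linear "encoders" (stand-ins for the CLIP text/video towers).
        self.w_q = rng.standard_normal((dim, dim))  # query (text) encoder
        self.w_k = self.w_q.copy()                  # momentum key (video) encoder
        self.m = momentum
        self.t = temperature
        # Queue of past video keys, reused as extra negatives.
        self.queue = l2_normalize(rng.standard_normal((queue_size, dim)))

    def momentum_update(self):
        # EMA update: key-encoder weights slowly track the query encoder.
        self.w_k = self.m * self.w_k + (1.0 - self.m) * self.w_q

    def contrastive_logits(self, text_feats, video_feats):
        q = l2_normalize(text_feats @ self.w_q)       # (B, dim) queries
        k = l2_normalize(video_feats @ self.w_k)      # (B, dim) keys (no grad in practice)
        l_pos = np.sum(q * k, axis=1, keepdims=True)  # (B, 1) positive pair similarity
        l_neg = q @ self.queue.T                      # (B, K) similarities to queued negatives
        # InfoNCE logits; index 0 is the positive class for every row.
        return np.concatenate([l_pos, l_neg], axis=1) / self.t

    def dequeue_and_enqueue(self, video_feats):
        k = l2_normalize(video_feats @ self.w_k)
        # FIFO: drop the oldest keys, append the newest batch.
        self.queue = np.concatenate([self.queue[len(k):], k], axis=0)
```

In a real training loop, cross-entropy with target label 0 over these logits gives the InfoNCE loss, the momentum update runs once per step, and the queue grows the negative pool far beyond what a single GPU batch could hold — the property the abstract argues CLIP4Clip lacks.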