{"title":"基于大规模动量对比学习的文本视频检索统一框架","authors":"Eung-Ku Kim, Nahyun Lee, Yoon-Sik Cho","doi":"10.1109/ICEIC57457.2023.10049967","DOIUrl":null,"url":null,"abstract":"Large-scale vision-language representation learning has shown dramatic advances in image-text retrieval studies. On top of that, it is clear that the CLIP (Contrastive Language-Image Pre-training) [1] has contributed to this field in that it succeeded pre-training on tremendous web multimedia data with leveraging contrastive loss, one of the most competitive loss for multimodal learning. CLIP4CLIP [2] utilized the pre-trained CLIP weight to appeal its performance of learning video representations with textual supervision with additional post-pretraining on Howto100M [3]. However, there is still room for elevation in CLIP4Clip from perspective of contrastive learning - large enough instances of negative samples help to learn better representations, as the batch size in each training batch getting larger, the more chance of the model to be allowed to contrast more semantic representations. In the same queue of this point of view, we propose CwMTVR combining Momentum Contrast (MoCo) [4], cross-modal momentum contrastive learning framework into CLIP4Clip. Experimental results prove that the CwMTVR model achieve competitive results in text-video retrieval on MSVD dataset.","PeriodicalId":373752,"journal":{"name":"2023 International Conference on Electronics, Information, and Communication (ICEIC)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Unified framework based on Large-scale Momentum Contrastive learning for Text-Video Retrieval\",\"authors\":\"Eung-Ku Kim, Nahyun Lee, Yoon-Sik Cho\",\"doi\":\"10.1109/ICEIC57457.2023.10049967\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large-scale vision-language representation learning has shown dramatic advances in image-text retrieval studies. On top of that, it is clear that the CLIP (Contrastive Language-Image Pre-training) [1] has contributed to this field in that it succeeded pre-training on tremendous web multimedia data with leveraging contrastive loss, one of the most competitive loss for multimodal learning. CLIP4CLIP [2] utilized the pre-trained CLIP weight to appeal its performance of learning video representations with textual supervision with additional post-pretraining on Howto100M [3]. However, there is still room for elevation in CLIP4Clip from perspective of contrastive learning - large enough instances of negative samples help to learn better representations, as the batch size in each training batch getting larger, the more chance of the model to be allowed to contrast more semantic representations. In the same queue of this point of view, we propose CwMTVR combining Momentum Contrast (MoCo) [4], cross-modal momentum contrastive learning framework into CLIP4Clip. 
Experimental results prove that the CwMTVR model achieve competitive results in text-video retrieval on MSVD dataset.\",\"PeriodicalId\":373752,\"journal\":{\"name\":\"2023 International Conference on Electronics, Information, and Communication (ICEIC)\",\"volume\":\"32 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-02-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 International Conference on Electronics, Information, and Communication (ICEIC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICEIC57457.2023.10049967\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 International Conference on Electronics, Information, and Communication (ICEIC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICEIC57457.2023.10049967","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A Unified framework based on Large-scale Momentum Contrastive learning for Text-Video Retrieval
Abstract: Large-scale vision-language representation learning has driven dramatic advances in image-text retrieval. In particular, CLIP (Contrastive Language-Image Pre-training) [1] has contributed to this field by successfully pre-training on a tremendous amount of web multimedia data with a contrastive loss, one of the most competitive objectives for multimodal learning. CLIP4Clip [2] used the pre-trained CLIP weights to demonstrate strong performance in learning video representations under textual supervision, with additional post-pretraining on HowTo100M [3]. From the perspective of contrastive learning, however, there is still room for improvement in CLIP4Clip: a sufficiently large pool of negative samples helps the model learn better representations, and the larger the batch size in each training step, the more semantic representations the model can contrast. Following this line of reasoning, we propose CwMTVR, which integrates Momentum Contrast (MoCo) [4] into CLIP4Clip as a cross-modal momentum contrastive learning framework. Experimental results show that the CwMTVR model achieves competitive results in text-video retrieval on the MSVD dataset.
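The key mechanism the abstract appeals to, decoupling the number of negatives from the mini-batch size via momentum-updated key encoders and queues of past embeddings, can be sketched as follows. This is a minimal PyTorch illustration of MoCo-style cross-modal contrast, not the paper's actual CwMTVR implementation; the class and argument names, queue size, momentum, and temperature are all hypothetical assumptions.

```python
# Minimal sketch of MoCo-style cross-modal momentum contrast for
# text-video retrieval. Everything below (class/argument names, the
# queue size, momentum, and temperature) is an illustrative assumption,
# not the paper's actual CwMTVR implementation.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalMoCo(nn.Module):
    def __init__(self, text_encoder, video_encoder, dim=512,
                 queue_size=65536, momentum=0.999, temperature=0.07):
        super().__init__()
        self.m, self.t = momentum, temperature
        # Query encoders, updated by backprop (e.g. CLIP's two towers).
        self.text_q, self.video_q = text_encoder, video_encoder
        # Key encoders, updated only by momentum; initialized as copies.
        self.text_k = copy.deepcopy(text_encoder)
        self.video_k = copy.deepcopy(video_encoder)
        for p in list(self.text_k.parameters()) + list(self.video_k.parameters()):
            p.requires_grad = False
        # Queues of past keys act as a large negative pool, decoupling
        # the number of negatives from the batch size.
        self.register_buffer("text_queue",
                             F.normalize(torch.randn(dim, queue_size), dim=0))
        self.register_buffer("video_queue",
                             F.normalize(torch.randn(dim, queue_size), dim=0))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        for q, k in ((self.text_q, self.text_k), (self.video_q, self.video_k)):
            for pq, pk in zip(q.parameters(), k.parameters()):
                pk.data.mul_(self.m).add_(pq.data, alpha=1.0 - self.m)

    @torch.no_grad()
    def _enqueue(self, queue, keys):
        # Overwrite the oldest entries (assumes queue_size % batch == 0).
        ptr = int(self.ptr)
        queue[:, ptr:ptr + keys.shape[0]] = keys.T

    def forward(self, text, video):
        qt = F.normalize(self.text_q(text), dim=1)    # text queries
        qv = F.normalize(self.video_q(video), dim=1)  # video queries
        with torch.no_grad():
            self._momentum_update()
            kt = F.normalize(self.text_k(text), dim=1)
            kv = F.normalize(self.video_k(video), dim=1)
        # Text->video: positive is the paired momentum video key,
        # negatives come from the video queue.
        pos_t = torch.einsum("nc,nc->n", qt, kv).unsqueeze(-1)
        neg_t = qt @ self.video_queue.clone().detach()
        # Video->text: symmetric direction against the text queue.
        pos_v = torch.einsum("nc,nc->n", qv, kt).unsqueeze(-1)
        neg_v = qv @ self.text_queue.clone().detach()
        labels = torch.zeros(qt.shape[0], dtype=torch.long, device=qt.device)
        loss = (F.cross_entropy(torch.cat([pos_t, neg_t], 1) / self.t, labels) +
                F.cross_entropy(torch.cat([pos_v, neg_v], 1) / self.t, labels))
        with torch.no_grad():
            self._enqueue(self.video_queue, kv)
            self._enqueue(self.text_queue, kt)
            self.ptr[0] = (int(self.ptr) + kv.shape[0]) % self.video_queue.shape[1]
        return loss
```

In this sketch, the InfoNCE loss in each direction sees queue_size negatives regardless of the mini-batch size, which is exactly the property the abstract argues plain in-batch contrast lacks.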