{"title":"Hard Contrastive Learning for Video Captioning","authors":"Lilei Wu, Jie Liu","doi":"10.1109/ICECE56287.2022.10048594","DOIUrl":null,"url":null,"abstract":"Maximum likelihood estimation has been widely adopted along with the encoder-decoder framework for video captioning. However, it ignores the structure of sentences and restrains the diversity and distinction of generated captions. To address this issue, we propose a hard contrastive learning (HCL) method for video captioning. Specifically, built on the encoder-decoder framework, we introduce mismatched pairs to learn a reference distribution of video descriptions. The target model on the matched pairs is learned on top the reference model, which improves the distinctiveness of generated captions. In addition, we further boost the distinctiveness of the captions by developing a hard mining technique to select the hardest mismatched pairs within the contrastive learning framework. Finally, the relationships among multiple relevant captions for each video is consider to encourage the diversity of generated captions. The proposed method generates high quality captions which effectively capture the specialties in individual videos. Extensive experiments on two benchmark datasets, i.e., MSVD and MSR-VTT, show that our approach outperforms state-of-the-art methods.","PeriodicalId":358486,"journal":{"name":"2022 IEEE 5th International Conference on Electronics and Communication Engineering (ICECE)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 5th International Conference on Electronics and Communication Engineering (ICECE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICECE56287.2022.10048594","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Maximum likelihood estimation has been widely adopted along with the encoder-decoder framework for video captioning. However, it ignores the structure of sentences and restricts the diversity and distinctiveness of generated captions. To address this issue, we propose a hard contrastive learning (HCL) method for video captioning. Specifically, built on the encoder-decoder framework, we introduce mismatched pairs to learn a reference distribution of video descriptions. The target model on the matched pairs is learned on top of the reference model, which improves the distinctiveness of generated captions. In addition, we further boost the distinctiveness of the captions by developing a hard mining technique that selects the hardest mismatched pairs within the contrastive learning framework. Finally, the relationships among the multiple relevant captions for each video are considered to encourage the diversity of generated captions. The proposed method generates high-quality captions that effectively capture the distinctive content of individual videos. Extensive experiments on two benchmark datasets, i.e., MSVD and MSR-VTT, show that our approach outperforms state-of-the-art methods.
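The abstract does not give the paper's exact loss, but the core idea of "hard mining over mismatched pairs" is a standard contrastive pattern. Below is a minimal PyTorch sketch under assumptions: a margin-based triplet-style objective (one common instantiation, not necessarily the authors' formulation), with `hard_contrastive_loss`, the batch-as-negatives setup, and the margin value all illustrative. For each video, the hardest mismatched caption is the non-matching one with the highest similarity, and symmetrically for each caption.

```python
import torch
import torch.nn.functional as F

def hard_contrastive_loss(video_emb, caption_emb, margin=0.2):
    """Hard-negative contrastive loss sketch (assumed form, not the paper's).

    Row i of video_emb and caption_emb is a matched video/caption pair;
    all other rows in the batch serve as mismatched pairs.
    """
    # Cosine similarity matrix: sim[i, j] = sim(video_i, caption_j)
    v = F.normalize(video_emb, dim=1)
    c = F.normalize(caption_emb, dim=1)
    sim = v @ c.t()  # shape (B, B)

    pos = sim.diag()  # similarities of the matched pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    sim_neg = sim.masked_fill(mask, float("-inf"))  # exclude matched pairs

    # Hard mining: hardest mismatched caption per video (row-wise max)
    # and hardest mismatched video per caption (column-wise max).
    hard_cap = sim_neg.max(dim=1).values
    hard_vid = sim_neg.max(dim=0).values

    # Margin loss pushes matched pairs above the hardest mismatched ones.
    loss = (F.relu(margin + hard_cap - pos)
            + F.relu(margin + hard_vid - pos)).mean()
    return loss

if __name__ == "__main__":
    # Toy usage: random features standing in for encoder outputs.
    torch.manual_seed(0)
    videos = torch.randn(8, 256)
    captions = torch.randn(8, 256)
    print(hard_contrastive_loss(videos, captions).item())
```

The max over the masked similarity matrix is what makes the mining "hard": instead of averaging over all mismatched pairs, only the most confusable ones contribute gradient, which is the mechanism the abstract credits for more distinctive captions.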