MAPS: Joint Multimodal Attention and POS Sequence Generation for Video Captioning

2021 International Conference on Visual Communications and Image Processing (VCIP), Pub Date: 2021-12-05, DOI: 10.1109/VCIP53242.2021.9675348

Cong Zou, Xuchen Wang, Yaosi Hu, Zhenzhong Chen, Shan Liu


Citations: 1

Abstract

Video captioning is challenging because it combines video understanding with text generation. Recent progress in video captioning has come mainly from methods of visual feature extraction and sequential learning. However, the syntactic structure and semantic consistency of generated captions have not been fully explored. In this work, we therefore propose a novel multimodal attention-based framework with Part-of-Speech (POS) sequence guidance to generate more accurate video captions. In general, word sequence generation and POS sequence prediction are jointly modeled in a hierarchical manner within the framework. Specifically, different modalities, including visual, motion, object and syntactic features, are adaptively weighted and fused by the POS-guided attention mechanism when computing the probability distributions of the predicted words. Experimental results on two benchmark datasets, MSVD and MSR-VTT, demonstrate that the proposed method not only fully exploits the information in the video and text content, but also focuses on the decisive feature modality when generating a word of a given POS type. Thus, our approach improves video captioning performance and generates more idiomatic captions.
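The abstract gives no equations, so the following is only a minimal sketch of what POS-guided weighting and fusion of modality features could look like at a single decoding step. Everything in it is an assumption made for illustration rather than the authors' architecture: the function name `pos_guided_fusion`, the query formed by adding the decoder state to a POS tag embedding, the scaled dot-product scoring, and the toy 4-dimensional features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def pos_guided_fusion(modality_feats, pos_embedding, hidden):
    """Weight each modality against a POS-aware query, then fuse.

    Hypothetical helper: the query (decoder state + POS embedding) and
    the scaled dot-product score are assumptions, not the paper's method.
    """
    query = [h + p for h, p in zip(hidden, pos_embedding)]
    scale = math.sqrt(len(query))
    names = list(modality_feats)
    scores = [dot(query, modality_feats[n]) / scale for n in names]
    weights = softmax(scores)
    # Fused feature: attention-weighted sum of the modality vectors.
    fused = [
        sum(w * modality_feats[n][i] for w, n in zip(weights, names))
        for i in range(len(query))
    ]
    return dict(zip(names, weights)), fused


# Toy features, one per modality named in the abstract (values are made up).
feats = {
    "visual": [1.0, 0.0, 0.0, 0.0],
    "motion": [0.0, 1.0, 0.0, 0.0],
    "object": [0.0, 0.0, 1.0, 0.0],
    "syntactic": [0.0, 0.0, 0.0, 1.0],
}
hidden = [0.0, 0.0, 0.0, 0.0]
verb_pos = [0.0, 2.0, 0.0, 0.0]  # hypothetical embedding for a VERB tag

weights, fused = pos_guided_fusion(feats, verb_pos, hidden)
```

In this toy setup the hypothetical VERB embedding aligns with the motion feature, so `weights["motion"]` is the largest of the four weights, which mirrors the abstract's claim that the model focuses on the decisive modality for a given POS type. A real model would learn the embeddings and scoring function and would feed the fused vector into the word-prediction layer.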



