STSI: Efficiently Mine Spatio-Temporal Semantic Information between Different Multimodal for Video Captioning

Huiyu Xiong, Lanxiao Wang
2022 IEEE International Conference on Visual Communications and Image Processing (VCIP)
DOI: 10.1109/VCIP56404.2022.10008808
Published: 2022-12-13
Citation count: 0

Abstract

Video captioning, one of the more challenging tasks in computer vision, requires describing the content of a video in natural language. A video carries complex information, such as semantic and temporal information, so synthesizing sentences effectively from these rich and heterogeneous sources is important. Existing methods often fail to integrate multimodal features well enough to predict the associations between different objects in a video. In this paper, we improve the standard encoder-decoder structure and propose a network that deeply mines the spatio-temporal correlations between multimodal features. Guided by an analysis of sentence components, a spatio-temporal semantic information mining module fuses object, 2D, and 3D features in both time and space. Notably, the word output at the previous time step is added as an auxiliary prediction branch for conjunctions. A dynamic Gumbel scorer is then used to output caption sentences that are more consistent with the facts. Experimental results on two benchmark datasets show that our STSI outperforms state-of-the-art methods while generating more reasonable and semantically logical sentences.
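The abstract does not detail the "dynamic gumbel scorer," but scorers of this kind typically build on the Gumbel-softmax (Gumbel-max) trick for drawing a differentiable, near-one-hot sample over candidate words. The sketch below is a hypothetical, minimal NumPy illustration of that trick, not the authors' implementation; the function name and the temperature parameter `tau` are assumptions for illustration only.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Draw a relaxed (near-one-hot) sample over word logits via Gumbel-softmax.

    Lower `tau` pushes the output closer to a hard one-hot choice;
    higher `tau` makes it closer to a uniform distribution.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    u = rng.uniform(1e-10, 1.0, size=np.shape(logits))
    gumbel = -np.log(-np.log(u))
    # Temperature-scaled, numerically stable softmax over perturbed logits
    y = (np.asarray(logits) + gumbel) / tau
    y = np.exp(y - y.max())
    return y / y.sum()
```

Because the perturbed argmax distributes exactly like sampling from softmax(logits), such a scorer can stochastically pick words during decoding while remaining differentiable for training.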