{"title":"STSI: Efficiently Mine Spatio- Temporal Semantic Information between Different Multimodal for Video Captioning","authors":"Huiyu Xiong, Lanxiao Wang","doi":"10.1109/VCIP56404.2022.10008808","DOIUrl":null,"url":null,"abstract":"As one of the challenging tasks in computer vision, video captioning needs to use natural language to describe the content of video. Video contains complex information, such as semantic information, time information and so on. How to synthesize sentences effectively from rich and different kinds of information is very significant. The existing methods often cannot well integrate the multimodal feature to predict the association between different objects in video. In this paper, we improve the existing encoder-decoder structure and propose a network deeply mining the spatio-temporal correlation between multimodal features. Through the analysis of sentence components, we use spatio-temporal semantic information mining module to fuse the object, 2D and 3D features in both time and space. It is worth mentioning that the word output at the previous time is added as the prediction branch of auxiliary conjunctions. After that, a dynamic gumbel scorer is used to output caption sentences that are more consistent with the facts. The experimental results on two benchmark datasets show that our STSI is superior to the state-of-the-art methods while generating more reasonable and semantic-logical sentences.","PeriodicalId":269379,"journal":{"name":"2022 IEEE International Conference on Visual Communications and Image Processing (VCIP)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Conference on Visual Communications and Image Processing (VCIP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/VCIP56404.2022.10008808","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
As one of the challenging tasks in computer vision, video captioning requires describing the content of a video in natural language. Video carries complex information, such as semantic and temporal information, so effectively synthesizing sentences from these rich and heterogeneous sources is significant. Existing methods often fail to integrate multimodal features well enough to predict the associations between different objects in a video. In this paper, we improve the existing encoder-decoder structure and propose a network that deeply mines the spatio-temporal correlations between multimodal features. Guided by an analysis of sentence components, we use a spatio-temporal semantic information mining module to fuse object, 2D, and 3D features in both time and space. Notably, the word output at the previous time step is added as an auxiliary prediction branch for conjunctions. A dynamic Gumbel scorer then outputs caption sentences that are more consistent with the facts. Experimental results on two benchmark datasets show that our STSI is superior to state-of-the-art methods while generating more reasonable and semantically logical sentences.
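To make the fusion-and-scoring pipeline described above concrete, below is a minimal, hypothetical PyTorch sketch of one way a module fusing object, 2D, and 3D features and feeding a Gumbel-softmax word scorer could look. All module names, feature dimensions, and the attention-based fusion choice are assumptions introduced for illustration; the abstract does not specify the implementation at this level, and this is not the authors' actual STSI module.

```python
# Hypothetical sketch (not the paper's implementation): cross-modal attention
# fuses object, 2D-appearance, and 3D-motion features over time, and a
# Gumbel-softmax head scores the next word. Feature sizes are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionGumbelScorer(nn.Module):
    def __init__(self, d_model=512, vocab_size=10000):
        super().__init__()
        # Project the three feature streams into a shared embedding space.
        self.proj_obj = nn.Linear(2048, d_model)  # object (region) features
        self.proj_2d = nn.Linear(1536, d_model)   # 2D appearance features
        self.proj_3d = nn.Linear(1024, d_model)   # 3D motion features
        # Object features query the concatenated 2D/3D streams across time.
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, obj_feats, feats_2d, feats_3d, tau=1.0):
        # obj_feats: (B, T, 2048), feats_2d: (B, T, 1536), feats_3d: (B, T, 1024)
        q = self.proj_obj(obj_feats)
        kv = torch.cat([self.proj_2d(feats_2d), self.proj_3d(feats_3d)], dim=1)
        fused, _ = self.attn(q, kv, kv)       # spatio-temporal fusion: (B, T, d)
        logits = self.out(fused.mean(dim=1))  # pool over time -> (B, vocab_size)
        # Gumbel-softmax yields a differentiable, near-discrete word distribution.
        return F.gumbel_softmax(logits, tau=tau, hard=False)
```

A Gumbel-softmax head is a common choice in caption decoders because it approximates discrete word sampling while remaining differentiable, which may be one motivation behind a "dynamic Gumbel scorer" of this kind.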