{"title":"Keyword-aware Multi-modal Enhancement Attention for Video Question Answering","authors":"Duo Chen, Fuwei Zhang, Shirou Ou, Ruomei Wang","doi":"10.1145/3507548.3507567","DOIUrl":null,"url":null,"abstract":"Video question answering (VideoQA) is an intriguing topic in the field of visual language. Most of the current VideoQA models directly harness the global video information to answer questions. However, in VideoQA task, the answers associated with the questions merely appear in a few video contents, and other contents are invalid and redundant information. Therefore, VideoQA is vulnerable to be interfered by a large number of irrelevant contents. To address this challenge, we propose a Keyword-aware Multi-modal Enhancement Attention model for VideoQA. Specifically, a multi-factor keyword extraction (MFKE) algorithm is proposed to emphasize the crucial information in multimodal feature extraction. Furthermore, based on attention mechanisms, a keyword-aware enhancement attention (KAEA) module is designed to correlate the information associated with multiple modalities and fuse multimodal features. The experimental results on publicly available large VideoQA datasets, namely TVQA+ and LifeQA, demonstrate the effectiveness of our model.","PeriodicalId":414908,"journal":{"name":"Proceedings of the 2021 5th International Conference on Computer Science and Artificial Intelligence","volume":"74 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2021 5th International Conference on Computer Science and Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3507548.3507567","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Video question answering (VideoQA) is an intriguing topic in the field of visual language. Most current VideoQA models directly use global video information to answer questions. However, in the VideoQA task, the answer to a question typically appears in only a small portion of the video, while the remaining content is irrelevant and redundant. As a result, VideoQA is vulnerable to interference from large amounts of irrelevant content. To address this challenge, we propose a Keyword-aware Multi-modal Enhancement Attention model for VideoQA. Specifically, a multi-factor keyword extraction (MFKE) algorithm is proposed to emphasize the crucial information during multimodal feature extraction. Furthermore, a keyword-aware enhancement attention (KAEA) module, built on attention mechanisms, is designed to correlate information across modalities and fuse the multimodal features. Experimental results on publicly available large-scale VideoQA datasets, namely TVQA+ and LifeQA, demonstrate the effectiveness of our model.
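To make the keyword-aware fusion idea concrete, the sketch below shows one plausible way such a module could work: question keyword features attend over video and subtitle features, and the keyword-conditioned summaries of the two modalities are fused. This is a minimal illustration under assumed shapes, module names, and a simple concatenation-based fusion; the abstract does not specify the authors' actual KAEA design, so none of this should be read as their implementation.

```python
# Hypothetical sketch of keyword-aware attention fusion (not the paper's exact KAEA module).
import torch
import torch.nn as nn
import torch.nn.functional as F


class KeywordAwareAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)  # project keyword features into queries
        self.key_proj = nn.Linear(dim, dim)    # project modality features into keys
        self.fuse = nn.Linear(2 * dim, dim)    # fuse attended video and subtitle summaries

    def attend(self, keywords: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # keywords: (batch, n_kw, dim); modality: (batch, n_items, dim)
        q = self.query_proj(keywords)
        k = self.key_proj(modality)
        scores = torch.bmm(q, k.transpose(1, 2)) / k.size(-1) ** 0.5  # (batch, n_kw, n_items)
        weights = F.softmax(scores, dim=-1)
        attended = torch.bmm(weights, modality)  # keyword-conditioned modality summary
        return attended.mean(dim=1)              # pool over keywords -> (batch, dim)

    def forward(self, keywords, video_feats, subtitle_feats):
        v = self.attend(keywords, video_feats)
        s = self.attend(keywords, subtitle_feats)
        return self.fuse(torch.cat([v, s], dim=-1))  # (batch, dim) fused representation


if __name__ == "__main__":
    kaea = KeywordAwareAttention(dim=256)
    kw = torch.randn(2, 4, 256)      # e.g. 4 extracted question keywords
    video = torch.randn(2, 32, 256)  # 32 frame/region features
    subs = torch.randn(2, 16, 256)   # 16 subtitle token features
    print(kaea(kw, video, subs).shape)  # torch.Size([2, 256])
```

Conditioning the attention queries on extracted keywords, rather than on the whole question or global video features, is one straightforward way to suppress the irrelevant content the abstract describes; the illustrative design choice here is to pool over keywords and fuse modalities by concatenation.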