{"title":"基于层次注意策略的多模态情感分析","authors":"Yue Gu, Kangning Yang, Shiyu Fu, Shuhong Chen, Xinyu Li, I. Marsic","doi":"10.18653/v1/P18-1207","DOIUrl":null,"url":null,"abstract":"Multimodal affective computing, learning to recognize and interpret human affect and subjective information from multiple data sources, is still a challenge because: (i) it is hard to extract informative features to represent human affects from heterogeneous inputs; (ii) current fusion strategies only fuse different modalities at abstract levels, ignoring time-dependent interactions between modalities. Addressing such issues, we introduce a hierarchical multimodal architecture with attention and word-level fusion to classify utterance-level sentiment and emotion from text and audio data. Our introduced model outperforms state-of-the-art approaches on published datasets, and we demonstrate that our model is able to visualize and interpret synchronized attention over modalities.","PeriodicalId":74541,"journal":{"name":"Proceedings of the conference. Association for Computational Linguistics. Meeting","volume":"68 1","pages":"2225-2235"},"PeriodicalIF":0.0000,"publicationDate":"2018-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"100","resultStr":"{\"title\":\"Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment\",\"authors\":\"Yue Gu, Kangning Yang, Shiyu Fu, Shuhong Chen, Xinyu Li, I. Marsic\",\"doi\":\"10.18653/v1/P18-1207\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multimodal affective computing, learning to recognize and interpret human affect and subjective information from multiple data sources, is still a challenge because: (i) it is hard to extract informative features to represent human affects from heterogeneous inputs; (ii) current fusion strategies only fuse different modalities at abstract levels, ignoring time-dependent interactions between modalities. Addressing such issues, we introduce a hierarchical multimodal architecture with attention and word-level fusion to classify utterance-level sentiment and emotion from text and audio data. Our introduced model outperforms state-of-the-art approaches on published datasets, and we demonstrate that our model is able to visualize and interpret synchronized attention over modalities.\",\"PeriodicalId\":74541,\"journal\":{\"name\":\"Proceedings of the conference. Association for Computational Linguistics. Meeting\",\"volume\":\"68 1\",\"pages\":\"2225-2235\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-05-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"100\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the conference. Association for Computational Linguistics. Meeting\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18653/v1/P18-1207\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the conference. Association for Computational Linguistics. 
Meeting","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/P18-1207","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment
Multimodal affective computing, which learns to recognize and interpret human affect and subjective information from multiple data sources, remains a challenge because: (i) it is hard to extract informative features representing human affect from heterogeneous inputs; and (ii) current fusion strategies only fuse modalities at abstract levels, ignoring the time-dependent interactions between them. To address these issues, we introduce a hierarchical multimodal architecture with attention and word-level fusion that classifies utterance-level sentiment and emotion from text and audio data. Our model outperforms state-of-the-art approaches on published datasets, and we demonstrate that it can visualize and interpret synchronized attention across modalities.
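The abstract describes word-level fusion of text and audio with attention but gives no implementation detail. Below is a minimal, hedged PyTorch sketch of that general idea: word-aligned text and audio features are projected into a shared space, concatenated per word, encoded, and pooled with a learned attention over words. All module choices, feature dimensions (e.g. 300-d word embeddings, 74-d acoustic features), and the four-class output are assumptions for illustration, not the paper's actual architecture.

```python
# Illustrative sketch only: module names, dimensions, and the attention form
# are assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class WordLevelFusion(nn.Module):
    """Fuse word-aligned text and audio features, then attend over words."""

    def __init__(self, text_dim=300, audio_dim=74, hidden_dim=128, num_classes=4):
        super().__init__()
        # Project each modality into a shared space before fusion.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        # Utterance-level encoder over the fused word sequence.
        self.encoder = nn.GRU(2 * hidden_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # Word-level attention: scores how much each word contributes.
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, text_feats, audio_feats):
        # text_feats:  (batch, num_words, text_dim)   word embeddings
        # audio_feats: (batch, num_words, audio_dim)  acoustic features aligned to words
        t = torch.tanh(self.text_proj(text_feats))
        a = torch.tanh(self.audio_proj(audio_feats))
        fused = torch.cat([t, a], dim=-1)            # word-level fusion
        h, _ = self.encoder(fused)                   # (batch, num_words, 2*hidden)
        scores = torch.softmax(self.attn(h), dim=1)  # attention weights over words
        utterance = (scores * h).sum(dim=1)          # attention-weighted utterance vector
        return self.classifier(utterance), scores


# Usage with random tensors standing in for two 20-word utterances:
model = WordLevelFusion()
logits, attn = model(torch.randn(2, 20, 300), torch.randn(2, 20, 74))
```

The per-word concatenation before the encoder is what distinguishes this sketch from abstract-level (late) fusion, which would pool each modality separately and merge only the pooled vectors, discarding the time-dependent interactions the abstract highlights.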