{"title":"GCFormer: A Graph Convolutional Transformer for Speech Emotion Recognition","authors":"Yingxue Gao, Huan Zhao, Yufeng Xiao, Zixing Zhang","doi":"10.1145/3577190.3614177","DOIUrl":null,"url":null,"abstract":"Graph convolutional networks (GCNs) have achieved excellent results in image classification and natural language processing. However, at present, the application of GCNs in speech emotion recognition (SER) is not widely studied. Meanwhile, recent studies have shown that GCNs may not be able to adaptively capture the long-range context emotional information over the whole audio. To alleviate this problem, this paper proposes a Graph Convolutional Transformer (GCFormer) model which empowers the model to extract local and global emotional information. Specifically, we construct a cyclic graph and perform concise graph convolution operations to obtain spatial local features. Then, a consecutive transformer network further strives to learn more high-level representations and their global temporal correlation. Finally and sequentially, the learned serialized representations from the transformer are mapped into a vector through a gated recurrent unit (GRU) pooling layer for emotion classification. The experiment results obtained on two public emotional datasets demonstrate that the proposed GCFormer performs significantly better than other GCN-based models in terms of prediction accuracy, and surpasses the other state-of-the-art deep learning models in terms of prediction accuracy and model efficiency.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"44 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Companion Publication of the 2020 International Conference on Multimodal Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3577190.3614177","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Graph convolutional networks (GCNs) have achieved excellent results in image classification and natural language processing. However, their application to speech emotion recognition (SER) has not yet been widely studied. Moreover, recent studies have shown that GCNs may be unable to adaptively capture long-range contextual emotional information across an entire utterance. To alleviate this problem, this paper proposes a Graph Convolutional Transformer (GCFormer) model that extracts both local and global emotional information. Specifically, we construct a cyclic graph and apply lightweight graph convolution operations to obtain local spatial features. A subsequent transformer network then learns higher-level representations and their global temporal correlations. Finally, the serialized representations produced by the transformer are mapped into a single vector by a gated recurrent unit (GRU) pooling layer for emotion classification. Experimental results on two public emotion datasets demonstrate that the proposed GCFormer significantly outperforms other GCN-based models in prediction accuracy and surpasses other state-of-the-art deep learning models in both prediction accuracy and model efficiency.
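To make the described pipeline concrete, below is a minimal PyTorch sketch of a cyclic-graph GCN followed by a transformer encoder and GRU pooling. This is illustrative rather than the authors' implementation: the feature dimension, fixed frame count, hidden size, the symmetric normalization of the cyclic adjacency, and the use of the GRU's final hidden state as the pooled vector are all assumptions, and the names (`cyclic_adjacency`, `GCFormerSketch`) are hypothetical.

```python
import torch
import torch.nn as nn

def cyclic_adjacency(num_nodes: int) -> torch.Tensor:
    """Adjacency of a cyclic graph over frame nodes: each frame connects
    to its two neighbors, and the last frame wraps around to the first.
    Returned with self-loops and symmetric normalization D^-1/2 A D^-1/2."""
    adj = torch.eye(num_nodes)  # self-loops
    idx = torch.arange(num_nodes)
    adj[idx, (idx + 1) % num_nodes] = 1.0
    adj[idx, (idx - 1) % num_nodes] = 1.0
    deg_inv_sqrt = adj.sum(dim=1).pow(-0.5)
    return deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)

class GCFormerSketch(nn.Module):
    """Hypothetical sketch of the GCFormer pipeline: graph convolution for
    local spatial features -> transformer for global temporal correlations
    -> GRU pooling -> emotion classifier. Inputs are assumed padded or
    cropped to a fixed number of frames so the adjacency size matches."""

    def __init__(self, feat_dim=40, hidden=128, num_heads=4,
                 num_layers=2, num_classes=4, num_frames=300):
        super().__init__()
        self.register_buffer("adj", cyclic_adjacency(num_frames))
        self.gcn_weight = nn.Linear(feat_dim, hidden)  # graph conv weight
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)  # GRU pooling
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x):  # x: (batch, num_frames, feat_dim)
        # One-layer graph convolution: ReLU(A_hat X W), broadcast over batch.
        x = torch.relu(self.adj @ self.gcn_weight(x))
        x = self.transformer(x)            # global temporal correlations
        _, h = self.gru(x)                 # final hidden state as pooled vector
        return self.classifier(h.squeeze(0))

if __name__ == "__main__":
    model = GCFormerSketch()
    # 8 utterances, 300 frames, 40-dim frame features (e.g., mel filterbanks)
    logits = model(torch.randn(8, 300, 40))
    print(logits.shape)  # torch.Size([8, 4])
```

The GRU pooling step follows the abstract's description of mapping the transformer's serialized representations into a single vector; a simpler mean-pooling over frames would also work but would discard the sequential order the GRU can exploit.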