GCFormer:一种用于语音情感识别的图形卷积转换器

Companion Publication of the 2020 International Conference on Multimodal Interaction Pub Date : 2023-10-09 DOI:10.1145/3577190.3614177

Yingxue Gao, Huan Zhao, Yufeng Xiao, Zixing Zhang

{"title":"GCFormer:一种用于语音情感识别的图形卷积转换器","authors":"Yingxue Gao, Huan Zhao, Yufeng Xiao, Zixing Zhang","doi":"10.1145/3577190.3614177","DOIUrl":null,"url":null,"abstract":"Graph convolutional networks (GCNs) have achieved excellent results in image classification and natural language processing. However, at present, the application of GCNs in speech emotion recognition (SER) is not widely studied. Meanwhile, recent studies have shown that GCNs may not be able to adaptively capture the long-range context emotional information over the whole audio. To alleviate this problem, this paper proposes a Graph Convolutional Transformer (GCFormer) model which empowers the model to extract local and global emotional information. Specifically, we construct a cyclic graph and perform concise graph convolution operations to obtain spatial local features. Then, a consecutive transformer network further strives to learn more high-level representations and their global temporal correlation. Finally and sequentially, the learned serialized representations from the transformer are mapped into a vector through a gated recurrent unit (GRU) pooling layer for emotion classification. The experiment results obtained on two public emotional datasets demonstrate that the proposed GCFormer performs significantly better than other GCN-based models in terms of prediction accuracy, and surpasses the other state-of-the-art deep learning models in terms of prediction accuracy and model efficiency.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"44 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"GCFormer: A Graph Convolutional Transformer for Speech Emotion Recognition\",\"authors\":\"Yingxue Gao, Huan Zhao, Yufeng Xiao, Zixing Zhang\",\"doi\":\"10.1145/3577190.3614177\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Graph convolutional networks (GCNs) have achieved excellent results in image classification and natural language processing. However, at present, the application of GCNs in speech emotion recognition (SER) is not widely studied. Meanwhile, recent studies have shown that GCNs may not be able to adaptively capture the long-range context emotional information over the whole audio. To alleviate this problem, this paper proposes a Graph Convolutional Transformer (GCFormer) model which empowers the model to extract local and global emotional information. Specifically, we construct a cyclic graph and perform concise graph convolution operations to obtain spatial local features. Then, a consecutive transformer network further strives to learn more high-level representations and their global temporal correlation. Finally and sequentially, the learned serialized representations from the transformer are mapped into a vector through a gated recurrent unit (GRU) pooling layer for emotion classification. The experiment results obtained on two public emotional datasets demonstrate that the proposed GCFormer performs significantly better than other GCN-based models in terms of prediction accuracy, and surpasses the other state-of-the-art deep learning models in terms of prediction accuracy and model efficiency.\",\"PeriodicalId\":93171,\"journal\":{\"name\":\"Companion Publication of the 2020 International Conference on Multimodal Interaction\",\"volume\":\"44 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-10-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Companion Publication of the 2020 International Conference on Multimodal Interaction\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3577190.3614177\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Companion Publication of the 2020 International Conference on Multimodal Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3577190.3614177","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

图卷积网络(GCNs)在图像分类和自然语言处理方面取得了优异的成绩。然而，目前GCNs在语音情感识别(SER)中的应用研究并不广泛。与此同时，最近的研究表明，GCNs可能无法自适应地捕获整个音频的远程上下文情感信息。为了解决这一问题，本文提出了一种图形卷积变换(GCFormer)模型，该模型能够提取局部和全局情感信息。具体来说，我们构造了一个循环图，并进行了简洁的图卷积运算来获得空间局部特征。然后，连续变压器网络进一步努力学习更多的高级表示及其全局时间相关性。最后，通过门控循环单元(GRU)池化层将从变压器中学习到的序列化表示映射为向量，用于情感分类。在两个公共情感数据集上的实验结果表明，所提出的GCFormer在预测精度方面明显优于其他基于gcn的模型，在预测精度和模型效率方面优于其他最先进的深度学习模型。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

GCFormer: A Graph Convolutional Transformer for Speech Emotion Recognition

Graph convolutional networks (GCNs) have achieved excellent results in image classification and natural language processing. However, at present, the application of GCNs in speech emotion recognition (SER) is not widely studied. Meanwhile, recent studies have shown that GCNs may not be able to adaptively capture the long-range context emotional information over the whole audio. To alleviate this problem, this paper proposes a Graph Convolutional Transformer (GCFormer) model which empowers the model to extract local and global emotional information. Specifically, we construct a cyclic graph and perform concise graph convolution operations to obtain spatial local features. Then, a consecutive transformer network further strives to learn more high-level representations and their global temporal correlation. Finally and sequentially, the learned serialized representations from the transformer are mapped into a vector through a gated recurrent unit (GRU) pooling layer for emotion classification. The experiment results obtained on two public emotional datasets demonstrate that the proposed GCFormer performs significantly better than other GCN-based models in terms of prediction accuracy, and surpasses the other state-of-the-art deep learning models in terms of prediction accuracy and model efficiency.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Companion Publication of the 2020 International Conference on Multimodal Interaction

自引率

0.00%

发文量