Coordinate Information Concatenation: A Novel Method for Improving Vision Transformer-Based Speech Emotion Recognition

2023 International Conference on Electronics, Information, and Communication (ICEIC) Pub Date : 2023-02-05 DOI:10.1109/ICEIC57457.2023.10049941

Jeongho Kim, Seung-Ho Lee


Citations: 1

Abstract


CoordViT: A Novel Method of Improve Vision Transformer-Based Speech Emotion Recognition using Coordinate Information Concatenate

Recently, in speech emotion recognition, a Transformer-based method that uses spectrogram images instead of sound data has shown higher accuracy than Convolutional Neural Networks (CNNs). Vision Transformer (ViT), a Transformer-based method, achieves high classification accuracy by dividing the input image into patches, but pixel position information is not retained because of embedding layers such as the linear projection. Therefore, in this paper, we propose a novel method to improve ViT-based speech emotion recognition using coordinate information concatenation. Because the proposed method retains pixel position information by concatenating coordinate information to the input image, the accuracy on CREMA-D is greatly improved by 82.96% compared to the state of the art on CREMA-D. These results show that the coordinate information concatenation proposed in this paper is effective not only for CNNs but also for Transformers.
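
The central step described in the abstract, concatenating coordinate information to the input image so that pixel position is available before patch embedding, can be sketched as follows. This is only an illustration, not the authors' implementation: the abstract does not say how the coordinates are encoded, so the function name `concat_coordinates`, the two extra channels (row and column), the [-1, 1] normalization, and the 224×224×3 spectrogram size are all assumptions, loosely following the CoordConv convention used with CNNs.

```python
import numpy as np


def concat_coordinates(image: np.ndarray) -> np.ndarray:
    """Append normalized row and column coordinate channels to an (H, W, C) image.

    Hypothetical helper: the encoding below is an assumption, not taken from the paper.
    """
    if image.ndim != 3:
        raise ValueError("expected an array of shape (H, W, C)")
    height, width, _ = image.shape

    # Assumption: coordinates are scaled linearly to the range [-1, 1].
    rows = np.linspace(-1.0, 1.0, height, dtype=np.float32)
    cols = np.linspace(-1.0, 1.0, width, dtype=np.float32)

    # "ij" indexing yields two (H, W) grids: one varying along rows, one along columns.
    row_grid, col_grid = np.meshgrid(rows, cols, indexing="ij")
    coords = np.stack([row_grid, col_grid], axis=-1)  # shape (H, W, 2)

    # Concatenate along the channel axis, giving shape (H, W, C + 2).
    return np.concatenate([image.astype(np.float32), coords], axis=-1)


# Placeholder input standing in for a spectrogram image (assumed size).
spectrogram = np.zeros((224, 224, 3), dtype=np.float32)
augmented = concat_coordinates(spectrogram)
print(augmented.shape)
```

Under these assumptions, a ViT consuming the augmented input would need its patch-embedding layer to accept five channels instead of three; how the paper handles this is not stated in the abstract.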
