{"title":"基于小训练数据的情感识别并行网络学习","authors":"Arata Ochi, Xin Kang","doi":"10.1109/ICSAI57119.2022.10005394","DOIUrl":null,"url":null,"abstract":"Speech emotion recognition (SER) classifies speech into emotion categories such as “happy”, “sad”, and “angry”. Speech emotion recognition has attracted more and more attention in recent years as a challenging pattern recognition task, but its performance is limited by the amount of training data. In this paper, we propose a parallel network consisting of a CNN and a Transformer that receives two types of inputs. The Convolutional Neural Network (CNN) accurately recognizes emotions from the speech data using a mel-spectrogram feature. The transformer uses Multi-Attention from Mel-Frequency Cepstrum Coefficient (MFCC) to realize the extraction of emotional semantic information in a sequence. Experiments are carried out on the Ryerson Audio-Visual Database of Emotion Speech and Song (RAVDESS) dataset. The results demonstrate the effectiveness of the proposed method and show significant improvement over previous results with fewer data and less training time without data augmentation.","PeriodicalId":339547,"journal":{"name":"2022 8th International Conference on Systems and Informatics (ICSAI)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Learning a Parallel Network for Emotion Recognition Based on Small Training Data\",\"authors\":\"Arata Ochi, Xin Kang\",\"doi\":\"10.1109/ICSAI57119.2022.10005394\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Speech emotion recognition (SER) classifies speech into emotion categories such as “happy”, “sad”, and “angry”. Speech emotion recognition has attracted more and more attention in recent years as a challenging pattern recognition task, but its performance is limited by the amount of training data. In this paper, we propose a parallel network consisting of a CNN and a Transformer that receives two types of inputs. The Convolutional Neural Network (CNN) accurately recognizes emotions from the speech data using a mel-spectrogram feature. The transformer uses Multi-Attention from Mel-Frequency Cepstrum Coefficient (MFCC) to realize the extraction of emotional semantic information in a sequence. Experiments are carried out on the Ryerson Audio-Visual Database of Emotion Speech and Song (RAVDESS) dataset. 
The results demonstrate the effectiveness of the proposed method and show significant improvement over previous results with fewer data and less training time without data augmentation.\",\"PeriodicalId\":339547,\"journal\":{\"name\":\"2022 8th International Conference on Systems and Informatics (ICSAI)\",\"volume\":\"48 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 8th International Conference on Systems and Informatics (ICSAI)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICSAI57119.2022.10005394\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 8th International Conference on Systems and Informatics (ICSAI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICSAI57119.2022.10005394","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Learning a Parallel Network for Emotion Recognition Based on Small Training Data
Speech emotion recognition (SER) classifies speech into emotion categories such as “happy”, “sad”, and “angry”. As a challenging pattern-recognition task, SER has attracted increasing attention in recent years, but its performance is limited by the amount of available training data. In this paper, we propose a parallel network consisting of a convolutional neural network (CNN) and a Transformer, which receives two types of input. The CNN recognizes emotions from the speech data using mel-spectrogram features, while the Transformer applies multi-head attention to Mel-Frequency Cepstral Coefficient (MFCC) sequences to extract emotional semantic information. Experiments are carried out on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). The results demonstrate the effectiveness of the proposed method, showing a significant improvement over previous results with less data and shorter training time, and without data augmentation.
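
The abstract names only the two branches and their input features, so the following is a minimal PyTorch sketch of one plausible way to wire such a parallel network together. Every dimension below (channel counts, d_model, four attention heads, two encoder layers, mean-pooling, and concatenation fusion) is an illustrative assumption rather than the authors' configuration; only the eight output classes follow from RAVDESS itself.

```python
# Hypothetical sketch of a parallel CNN + Transformer SER model.
# The paper's actual layer sizes and fusion strategy are not given here.
import torch
import torch.nn as nn


class ParallelSER(nn.Module):
    def __init__(self, n_mfcc=40, n_classes=8, d_model=128):
        super().__init__()
        # CNN branch: treats the mel-spectrogram as a 1-channel image.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (batch, 32, 1, 1)
        )
        # Transformer branch: multi-head self-attention over MFCC frames.
        self.mfcc_proj = nn.Linear(n_mfcc, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Fusion by concatenation (one plausible choice) + classifier.
        self.classifier = nn.Linear(32 + d_model, n_classes)

    def forward(self, mel, mfcc):
        # mel:  (batch, 1, n_mels, time)  mel-spectrogram
        # mfcc: (batch, time, n_mfcc)     MFCC frame sequence
        cnn_feat = self.cnn(mel).flatten(1)           # (batch, 32)
        seq = self.transformer(self.mfcc_proj(mfcc))  # (batch, time, d_model)
        trans_feat = seq.mean(dim=1)                  # mean-pool over time
        return self.classifier(torch.cat([cnn_feat, trans_feat], dim=1))


if __name__ == "__main__":
    model = ParallelSER()
    mel = torch.randn(4, 1, 128, 200)   # dummy batch of mel-spectrograms
    mfcc = torch.randn(4, 200, 40)      # dummy batch of MFCC sequences
    print(model(mel, mfcc).shape)       # torch.Size([4, 8])
```

Concatenating a pooled CNN vector with a pooled Transformer vector is only one way the two branches could be fused; weighted summation or attention-based fusion would fit the same abstract equally well.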