An autoencoder-based feature level fusion for speech emotion recognition

Digital Communications and Networks · Impact Factor 7.5 · JCR Q1 (Telecommunications) · CAS Tier 2 (Computer Science) · Publication date: 2024-10-01 · DOI: 10.1016/j.dcan.2022.10.018
Peng Shixin, Chen Kai, Tian Tian, Chen Jingying
{"title":"An autoencoder-based feature level fusion for speech emotion recognition","authors":"Peng Shixin,&nbsp;Chen Kai,&nbsp;Tian Tian,&nbsp;Chen Jingying","doi":"10.1016/j.dcan.2022.10.018","DOIUrl":null,"url":null,"abstract":"<div><div>Although speech emotion recognition is challenging, it has broad application prospects in human-computer interaction. Building a system that can accurately and stably recognize emotions from human languages can provide a better user experience. However, the current unimodal emotion feature representations are not distinctive enough to accomplish the recognition, and they do not effectively simulate the inter-modality dynamics in speech emotion recognition tasks. This paper proposes a multimodal method that utilizes both audio and semantic content for speech emotion recognition. The proposed method consists of three parts: two high-level feature extractors for text and audio modalities, and an autoencoder-based feature fusion. For audio modality, we propose a structure called Temporal Global Feature Extractor (TGFE) to extract the high-level features of the time-frequency domain relationship from the original speech signal. Considering that text lacks frequency information, we use only a Bidirectional Long Short-Term Memory network (BLSTM) and attention mechanism to simulate an intra-modal dynamic. Once these steps have been accomplished, the high-level text and audio features are sent to the autoencoder in parallel to learn their shared representation for final emotion classification. We conducted extensive experiments on three public benchmark datasets to evaluate our method. The results on Interactive Emotional Motion Capture (IEMOCAP) and Multimodal EmotionLines Dataset (MELD) outperform the existing method. Additionally, the results of CMU Multi-modal Opinion-level Sentiment Intensity (CMU-MOSI) are competitive. Furthermore, experimental results show that compared to unimodal information and autoencoder-based feature level fusion, the joint multimodal information (audio and text) improves the overall performance and can achieve greater accuracy than simple feature concatenation.</div></div>","PeriodicalId":48631,"journal":{"name":"Digital Communications and Networks","volume":"10 5","pages":"Pages 1341-1351"},"PeriodicalIF":7.5000,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Digital Communications and Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2352864822002279","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"TELECOMMUNICATIONS","Score":null,"Total":0}
Citations: 0

Abstract

Although speech emotion recognition is challenging, it has broad application prospects in human-computer interaction. A system that can accurately and stably recognize emotions from human speech can provide a better user experience. However, current unimodal emotion feature representations are not distinctive enough for reliable recognition, and they do not effectively model the inter-modality dynamics of speech emotion recognition tasks. This paper proposes a multimodal method that utilizes both audio and semantic content for speech emotion recognition. The proposed method consists of three parts: two high-level feature extractors for the text and audio modalities, and an autoencoder-based feature fusion. For the audio modality, we propose a structure called the Temporal Global Feature Extractor (TGFE) to extract high-level features of the time-frequency relationship from the original speech signal. Because text lacks frequency information, we use only a Bidirectional Long Short-Term Memory network (BLSTM) with an attention mechanism to model the intra-modal dynamics. The high-level text and audio features are then fed to the autoencoder in parallel to learn a shared representation for final emotion classification. We conducted extensive experiments on three public benchmark datasets to evaluate the method. The results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database and the Multimodal EmotionLines Dataset (MELD) outperform existing methods, and the results on the CMU Multimodal Opinion-level Sentiment Intensity (CMU-MOSI) dataset are competitive. Furthermore, the experiments show that joint multimodal information (audio and text) improves overall performance compared to unimodal information, and that autoencoder-based feature-level fusion achieves greater accuracy than simple feature concatenation.
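
The abstract does not include reference code, but the described architecture can be sketched concretely. The snippet below is a minimal PyTorch illustration of the three parts named above: an audio branch standing in for the Temporal Global Feature Extractor, a BLSTM-with-attention text branch, and an autoencoder that fuses the two high-level feature vectors into a shared representation used for classification. All module names, layer sizes, the additive attention pooling, and the loss weighting are assumptions for illustration; in particular, the TGFE internals are reduced to a convolution-plus-BLSTM placeholder rather than the authors' actual design.

```python
# Minimal sketch of the described pipeline (assumed details, not the authors' code).
# Assumed inputs: audio as (batch, T_audio, n_mels) log-mel features,
# text as (batch, T_text, emb_dim) pre-trained word embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionPool(nn.Module):
    """Additive attention pooling over a sequence of hidden states (assumed detail)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                          # h: (B, T, dim)
        w = torch.softmax(self.score(h), dim=1)    # per-step attention weights
        return (w * h).sum(dim=1)                  # (B, dim)


class TextBranch(nn.Module):
    """Text modality: BLSTM + attention, as described in the abstract."""
    def __init__(self, emb_dim=300, hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.pool = AttentionPool(2 * hidden)

    def forward(self, x):                          # x: (B, T_text, emb_dim)
        h, _ = self.blstm(x)
        return self.pool(h)                        # (B, 2*hidden)


class AudioBranch(nn.Module):
    """Stand-in for the Temporal Global Feature Extractor (TGFE); internals assumed."""
    def __init__(self, n_mels=80, hidden=128):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, 2 * hidden, kernel_size=5, padding=2)
        self.blstm = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.pool = AttentionPool(2 * hidden)

    def forward(self, x):                          # x: (B, T_audio, n_mels)
        h = F.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)  # local time-frequency patterns
        h, _ = self.blstm(h)                       # temporal context
        return self.pool(h)                        # (B, 2*hidden)


class FusionAutoencoder(nn.Module):
    """Autoencoder over the concatenated high-level features: the bottleneck is the
    shared representation used for classification; the decoder reconstructs the input."""
    def __init__(self, feat_dim=512, bottleneck=128, n_classes=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, bottleneck), nn.ReLU())
        self.decoder = nn.Linear(bottleneck, feat_dim)
        self.classifier = nn.Linear(bottleneck, n_classes)

    def forward(self, audio_feat, text_feat):
        fused = torch.cat([audio_feat, text_feat], dim=-1)   # (B, feat_dim)
        z = self.encoder(fused)                              # shared representation
        return self.classifier(z), self.decoder(z), fused


# Usage: joint training with classification plus reconstruction (weighting assumed).
audio, text, fusion = AudioBranch(), TextBranch(), FusionAutoencoder()
a = audio(torch.randn(8, 300, 80))       # 8 utterances, 300 frames, 80 mel bins
t = text(torch.randn(8, 40, 300))        # 8 transcripts, 40 tokens, 300-d embeddings
logits, recon, fused = fusion(a, t)
loss = F.cross_entropy(logits, torch.randint(0, 4, (8,))) + 0.1 * F.mse_loss(recon, fused)
```

Training the classifier and the reconstruction jointly is one common way to force the bottleneck to retain information from both modalities; whether the paper uses this exact objective is not stated in the abstract.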
Source journal
Digital Communications and Networks (Computer Science - Hardware and Architecture)
CiteScore: 12.80
Self-citation rate: 5.10%
Publication volume: 915
Review time: 30 weeks
About the journal: Digital Communications and Networks focuses on communication systems and networks. It publishes original articles and authoritative reviews that undergo rigorous peer review, and all articles are fully Open Access on ScienceDirect. The journal is indexed in the Science Citation Index Expanded (SCIE) and Scopus. In addition to regular articles, it considers significantly expanded versions of exceptional conference papers and periodically publishes special issues on specific aspects of the field.