A dynamic-static feature fusion learning network for speech emotion recognition

IF 5.5 | JCR Q1 (Computer Science, Artificial Intelligence) | CAS Tier 2, Computer Science | Neurocomputing | Pub Date: 2025-02-27 | DOI: 10.1016/j.neucom.2025.129836
Peiyun Xue, Xiang Gao, Jing Bai, Zhenan Dong, Zhiyu Wang, Jiangshuai Xu
Journal: Neurocomputing, Volume 633, Article 129836. Full text: https://www.sciencedirect.com/science/article/pii/S0925231225005089
Citations: 0

Abstract

Speech is a paramount mode of human communication, and Speech Emotion Recognition (SER) contributes significantly to the quality and fluency of Human-Computer Interaction (HCI). Feature representation remains a persistent challenge in SER: a single feature can hardly represent speech emotion adequately, while directly concatenating multiple features may overlook their complementary nature and introduce interference from redundant information. To address these difficulties, this paper proposes a Multi-feature Learning network based on Dynamic-Static feature Fusion (ML-DSF) to obtain an effective hybrid feature representation for SER. First, a Time-Frequency domain Self-Calibration module (TFSC) is proposed to help traditional convolutional neural networks extract static image features from Log-Mel spectrograms. Then, a Lightweight Temporal Convolutional Network (L-TCNet) is used to acquire multi-scale dynamic temporal causal knowledge from Mel-Frequency Cepstral Coefficients (MFCC). Finally, both extracted feature groups are fed into a connection attention module, optimized by Principal Component Analysis (PCA), which facilitates emotion classification by reducing redundant information and enhancing the complementary information between features. To ensure the independence of feature extraction, this paper adopts a training separation strategy. Evaluation on two public datasets yielded a Weighted Accuracy (WA) of 93.33 % and an Unweighted Accuracy (UA) of 93.12 % on the RAVDESS dataset, and 94.95 % WA and 94.56 % UA on the EmoDB dataset, outperforming State-Of-The-Art (SOTA) results. The effectiveness of each module is validated by ablation experiments, and a generalization analysis is carried out on cross-corpus SER tasks.
Source journal: Neurocomputing (Engineering & Technology — Computer Science: Artificial Intelligence)
CiteScore: 13.10
Self-citation rate: 10.00%
Articles per year: 1382
Review time: 70 days
About the journal: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.