Explaining Deep Learning Embeddings for Speech Emotion Recognition by Predicting Interpretable Acoustic Features

Satvik Dixit, Daniel M. Low, Gasser Elbanna, Fabio Catania, Satrajit S. Ghosh
{"title":"Explaining Deep Learning Embeddings for Speech Emotion Recognition by Predicting Interpretable Acoustic Features","authors":"Satvik Dixit, Daniel M. Low, Gasser Elbanna, Fabio Catania, Satrajit S. Ghosh","doi":"arxiv-2409.09511","DOIUrl":null,"url":null,"abstract":"Pre-trained deep learning embeddings have consistently shown superior\nperformance over handcrafted acoustic features in speech emotion recognition\n(SER). However, unlike acoustic features with clear physical meaning, these\nembeddings lack clear interpretability. Explaining these embeddings is crucial\nfor building trust in healthcare and security applications and advancing the\nscientific understanding of the acoustic information that is encoded in them.\nThis paper proposes a modified probing approach to explain deep learning\nembeddings in the SER space. We predict interpretable acoustic features (e.g.,\nf0, loudness) from (i) the complete set of embeddings and (ii) a subset of the\nembedding dimensions identified as most important for predicting each emotion.\nIf the subset of the most important dimensions better predicts a given emotion\nthan all dimensions and also predicts specific acoustic features more\naccurately, we infer those acoustic features are important for the embedding\nmodel for the given task. We conducted experiments using the WavLM embeddings\nand eGeMAPS acoustic features as audio representations, applying our method to\nthe RAVDESS and SAVEE emotional speech datasets. Based on this evaluation, we\ndemonstrate that Energy, Frequency, Spectral, and Temporal categories of\nacoustic features provide diminishing information to SER in that order,\ndemonstrating the utility of the probing classifier method to relate embeddings\nto interpretable acoustic features.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09511","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Pre-trained deep learning embeddings have consistently shown superior performance over handcrafted acoustic features in speech emotion recognition (SER). However, unlike acoustic features with clear physical meaning, these embeddings lack clear interpretability. Explaining these embeddings is crucial for building trust in healthcare and security applications and for advancing scientific understanding of the acoustic information encoded in them. This paper proposes a modified probing approach to explain deep learning embeddings in the SER space. We predict interpretable acoustic features (e.g., f0, loudness) from (i) the complete set of embeddings and (ii) a subset of the embedding dimensions identified as most important for predicting each emotion. If the subset of most important dimensions predicts a given emotion better than all dimensions and also predicts specific acoustic features more accurately, we infer that those acoustic features are important to the embedding model for the given task. We conducted experiments using WavLM embeddings and eGeMAPS acoustic features as audio representations, applying our method to the RAVDESS and SAVEE emotional speech datasets. Based on this evaluation, we demonstrate that the Energy, Frequency, Spectral, and Temporal categories of acoustic features provide diminishing information to SER, in that order, showing the utility of the probing classifier method for relating embeddings to interpretable acoustic features.
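To make the two-stage procedure concrete, here is a minimal sketch in Python. The abstract does not specify which classifier ranks embedding dimensions or which probe predicts the acoustic features, so the logistic-regression importance scores, the ridge probes, the subset size of 100, and the synthetic arrays standing in for WavLM embeddings and eGeMAPS functionals are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the probing approach described in the abstract.
# Placeholder arrays stand in for real WavLM embeddings and eGeMAPS
# features; in practice these would be extracted from RAVDESS/SAVEE audio.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_clips, emb_dim, n_acoustic = 480, 768, 88   # WavLM-Base dim, eGeMAPS feature count
X = rng.normal(size=(n_clips, emb_dim))       # utterance-level embeddings (placeholder)
A = rng.normal(size=(n_clips, n_acoustic))    # eGeMAPS functionals, e.g. f0, loudness (placeholder)
y = rng.integers(0, 2, size=n_clips)          # binary label: one emotion vs. rest

# Step 1: rank embedding dimensions by importance for predicting the emotion
# (here via the magnitude of logistic-regression weights; an assumption).
clf = LogisticRegression(max_iter=1000).fit(X, y)
importance = np.abs(clf.coef_).ravel()
top = np.argsort(importance)[::-1][:100]      # subset of most important dims

# Check whether the subset predicts the emotion better than all dimensions.
acc_all = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
acc_top = cross_val_score(LogisticRegression(max_iter=1000), X[:, top], y, cv=5).mean()
print(f"emotion accuracy: all dims={acc_all:.3f}, top dims={acc_top:.3f}")

# Step 2: probe, i.e. predict each acoustic feature from all dims vs. the subset.
for j in range(3):                            # first few features, for brevity
    r2_all = cross_val_score(Ridge(), X, A[:, j], cv=5, scoring="r2").mean()
    r2_top = cross_val_score(Ridge(), X[:, top], A[:, j], cv=5, scoring="r2").mean()
    print(f"acoustic feature {j}: R2 all dims={r2_all:.3f}, top dims={r2_top:.3f}")
```

In a real run, X would come from pooling WavLM hidden states over each clip and A from an eGeMAPS extractor such as openSMILE; the comparison of interest is whether the subset simultaneously improves emotion prediction and the R² for a particular acoustic feature, which is the condition under which that feature is inferred to matter for the embedding model on this task.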