Explaining Deep Learning Embeddings for Speech Emotion Recognition by Predicting Interpretable Acoustic Features

Satvik Dixit, Daniel M. Low, Gasser Elbanna, Fabio Catania, Satrajit S. Ghosh

arXiv:2409.09511 (arXiv - CS - Sound), 2024-09-14
Abstract
Pre-trained deep learning embeddings have consistently outperformed handcrafted acoustic features in speech emotion recognition (SER). However, unlike acoustic features with clear physical meaning, these embeddings lack interpretability. Explaining these embeddings is crucial for building trust in healthcare and security applications and for advancing scientific understanding of the acoustic information they encode. This paper proposes a modified probing approach to explain deep learning embeddings in the SER space. We predict interpretable acoustic features (e.g., f0, loudness) from (i) the complete set of embeddings and (ii) a subset of the embedding dimensions identified as most important for predicting each emotion. If the subset of the most important dimensions predicts a given emotion better than all dimensions and also predicts specific acoustic features more accurately, we infer that those acoustic features are important to the embedding model for the given task. We conducted experiments using WavLM embeddings and eGeMAPS acoustic features as audio representations, applying our method to the RAVDESS and SAVEE emotional speech datasets. Based on this evaluation, we show that the Energy, Frequency, Spectral, and Temporal categories of acoustic features contribute diminishing information to SER in that order, demonstrating the utility of the probing-classifier method for relating embeddings to interpretable acoustic features.
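
For intuition, here is a minimal sketch of the probing idea described in the abstract. It is not the authors' exact pipeline: the synthetic arrays, the ridge-regression probe, the logistic-regression-based dimension ranking, and the choice of k=50 dimensions are all illustrative assumptions. In practice, X_emb would hold utterance-level WavLM embeddings, y_feat one eGeMAPS feature (e.g., f0), and y_emo the emotion labels from RAVDESS or SAVEE.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins so the sketch runs end to end (NOT real data):
rng = np.random.default_rng(0)
X_emb = rng.normal(size=(200, 768))           # stand-in for WavLM embeddings
y_feat = X_emb[:, :10].sum(axis=1)            # stand-in acoustic feature (e.g., f0)
y_emo = (y_feat > y_feat.mean()).astype(int)  # stand-in binary emotion labels

def probe_r2(X, y):
    """Cross-validated R^2 of a linear probe predicting an acoustic feature."""
    return cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2").mean()

def top_dims_for_emotion(X, y, k=50):
    """Rank embedding dimensions by importance for one emotion
    (absolute logistic-regression weights; an illustrative choice)."""
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    importance = np.abs(clf.coef_).sum(axis=0)
    return np.argsort(importance)[-k:]

# Probe the acoustic feature from (i) all dimensions and (ii) the
# emotion-specific subset of dimensions.
dims = top_dims_for_emotion(X_emb, y_emo)
r2_full = probe_r2(X_emb, y_feat)
r2_subset = probe_r2(X_emb[:, dims], y_feat)

# If the subset both classifies the emotion better than the full embedding
# and predicts this acoustic feature more accurately, the feature is
# inferred to be important for the model on this task.
print(f"R^2 full embedding: {r2_full:.3f}  R^2 top-{len(dims)} dims: {r2_subset:.3f}")
```

Repeating this comparison per emotion and per eGeMAPS feature, then aggregating over the Energy, Frequency, Spectral, and Temporal feature categories, yields the category-level ranking reported in the abstract.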