{"title":"Dynamic Speech Emotion Recognition Using A Conditional Neural Process","authors":"Luz Martinez-Lucas, Carlos Busso","doi":"10.1109/icassp48485.2024.10447805","DOIUrl":null,"url":null,"abstract":"The problem of predicting emotional attributes from speech has often focused on predicting a single value from a sentence or short speaking turn. These methods often ignore that natural emotions are both dynamic and dependent on context. To model the dynamic nature of emotions, we can treat the prediction of emotion from speech as a time-series problem. We refer to the problem of predicting these emotional traces as dynamic speech emotion recognition. Previous studies in this area have used models that treat all emotional traces as coming from the same underlying distribution. Since emotions are dependent on contextual information, these methods might obscure the context of an emotional interaction. This paper uses a neural process model with a segment-level speech emotion recognition (SER) model for this problem. This type of model leverages information from the time-series and predictions from the SER model to learn a prior that defines a distribution over emotional traces. Our proposed model performs 21% better than a bidirectional long short-term memory (BiLSTM) baseline when predicting emotional traces for valence.","PeriodicalId":517764,"journal":{"name":"ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"9 10","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/icassp48485.2024.10447805","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
The problem of predicting emotional attributes from speech has often focused on predicting a single value from a sentence or short speaking turn. These methods often ignore that natural emotions are both dynamic and dependent on context. To model the dynamic nature of emotions, we can treat the prediction of emotion from speech as a time-series problem. We refer to the problem of predicting these emotional traces as dynamic speech emotion recognition. Previous studies in this area have used models that treat all emotional traces as coming from the same underlying distribution. Since emotions are dependent on contextual information, these methods might obscure the context of an emotional interaction. This paper uses a neural process model with a segment-level speech emotion recognition (SER) model for this problem. This type of model leverages information from the time-series and predictions from the SER model to learn a prior that defines a distribution over emotional traces. Our proposed model performs 21% better than a bidirectional long short-term memory (BiLSTM) baseline when predicting emotional traces for valence.
从语音中预测情感属性的问题通常集中在从句子或简短的发言中预测单个值上。这些方法往往忽略了自然情感是动态的,而且依赖于语境。为了模拟情感的动态性质,我们可以将从语音中预测情感视为一个时间序列问题。我们将预测这些情绪轨迹的问题称为动态语音情绪识别。以往在这一领域的研究使用的模型将所有情绪轨迹视为来自相同的基本分布。由于情感依赖于上下文信息,这些方法可能会模糊情感交互的上下文。本文针对这一问题使用了带有分段级语音情感识别(SER)模型的神经过程模型。这种模型利用时间序列信息和 SER 模型的预测信息来学习先验,从而定义情绪痕迹的分布。与双向长短期记忆(BiLSTM)基线相比,我们提出的模型在预测情绪踪迹的价值时,表现要好 21%。