Speech emotion recognition in real static and dynamic human-robot interaction scenarios

Impact Factor: 3.0 | Region 3 (Computer Science) | JCR Q2 (Computer Science, Artificial Intelligence) | Computer Speech and Language, vol. 89, Article 101666 | Pub Date: 2025-01-01 | Epub Date: 2024-05-22 | DOI: 10.1016/j.csl.2024.101666

Nicolás Grágeda, Carlos Busso, Eduardo Alvarado, Ricardo García, Rodrigo Mahu, Fernando Huenupan, Néstor Becerra Yoma


Citations: 0

Abstract

The use of speech-based solutions is an appealing alternative for communication in human-robot interaction (HRI). An important challenge in this area is processing distant speech, which is often noisy and affected by reverberation and time-varying acoustic channels. It is important to investigate effective speech solutions, especially in dynamic environments where the robots and the users move, changing the distance and orientation between the speaker and the microphone. This paper addresses this problem in the context of speech emotion recognition (SER), which is an important task for understanding the intention of the message and the underlying mental state of the user. We propose a novel setup with a PR2 robot that moves while target speech and ambient noise are simultaneously recorded. Our study not only analyzes the detrimental effect of distant speech on speech emotion recognition in this dynamic robot-user setting but also provides solutions to attenuate its effect. We evaluate two beamforming schemes to spatially filter the speech signal: delay-and-sum (D&S) and minimum variance distortionless response (MVDR). We consider two training conditions: the original training speech recorded in controlled situations, and simulated conditions where the training utterances are processed to simulate the target acoustic environment. We also consider the cases where the robot is moving (dynamic case) and not moving (static case). For speech emotion recognition, we explore two state-of-the-art classifiers: one using hand-crafted features implemented with the ladder network strategy, and one using learned features implemented with the wav2vec 2.0 feature representation. MVDR led to a higher signal-to-noise ratio than the basic D&S method. However, both approaches provided very similar average concordance correlation coefficient (CCC) improvements, equal to 116%, on the HRI subsets using the ladder network trained with the original MSP-Podcast training utterances. For the wav2vec 2.0-based model, only D&S led to improvements. Surprisingly, the static and dynamic HRI testing subsets resulted in a similar average CCC. Finally, simulating the acoustic environment in the training dataset provided the highest average CCC scores on the HRI subsets, which are just 29% and 22% lower than those obtained with the original training/testing utterances, with the ladder network and wav2vec 2.0, respectively.
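
Two quantities named in the abstract can be illustrated with a minimal sketch: the concordance correlation coefficient used as the evaluation metric, and delay-and-sum beamforming in its simplest time-domain form. This is not the authors' implementation, which the abstract does not describe. The function names `ccc` and `delay_and_sum` are hypothetical, and the beamformer assumes that the per-microphone arrival delays are already known and are non-negative whole numbers of samples.

```python
import numpy as np


def ccc(pred, gold):
    """Concordance correlation coefficient between two 1-D sequences.

    Uses population (ddof=0) variance and covariance, following the
    standard definition:
        2 * cov / (var_pred + var_gold + (mean_pred - mean_gold)**2)
    """
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    mean_p, mean_g = pred.mean(), gold.mean()
    cov = np.mean((pred - mean_p) * (gold - mean_g))
    return 2.0 * cov / (pred.var() + gold.var() + (mean_p - mean_g) ** 2)


def delay_and_sum(channels, delays):
    """Time-domain delay-and-sum with integer sample delays (assumption).

    channels: array of shape (n_mics, n_samples).
    delays:   non-negative integer arrival delay of the target at each
              microphone, in samples (assumed known in this sketch).
    Each channel is advanced by its delay so the target lines up across
    microphones, then the aligned channels are averaged. Samples shifted
    past the end of the signal are treated as zeros.
    """
    channels = np.asarray(channels, dtype=float)
    n_mics, n_samples = channels.shape
    out = np.zeros(n_samples)
    for channel, delay in zip(channels, delays):
        out[: n_samples - delay] += channel[delay:]
    return out / n_mics


if __name__ == "__main__":
    # Perfect agreement gives CCC = 1; a constant offset lowers it.
    print(ccc([1, 2, 3], [1, 2, 3]))
    print(ccc([1, 2, 3], [2, 3, 4]))

    # Two microphones; the second hears the same signal 2 samples later.
    source = np.arange(10, dtype=float)
    late = np.concatenate([np.zeros(2), source[:-2]])
    print(delay_and_sum(np.stack([source, late]), [0, 2]))
```

Unlike Pearson correlation, CCC penalizes differences in mean and scale between predictions and labels, which is why a constant offset reduces the score in the second call. In a moving-robot setting the delays would change over time and would have to be estimated, which this sketch does not attempt.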


Source journal

Computer Speech and Language (Engineering and Technology - Computer Science: Artificial Intelligence)

CiteScore: 11.30

Self-citation rate: 4.70%

Review time: 22.9 weeks

Journal description: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but only relatively recently has large-scale implementation of, and experimentation with, complex models of speech and language processing become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.