Multimodal evaluation of customer satisfaction from voicemails using speech and language representations

Impact factor: 2.9 · Zone 3 (Engineering & Technology) · JCR Q2 (Engineering, Electrical & Electronic) · Digital Signal Processing · Pub Date: 2024-10-28 · DOI: 10.1016/j.dsp.2024.104820
Luis Felipe Parra-Gallego, Tomás Arias-Vergara, Juan Rafael Orozco-Arroyave
Digital Signal Processing, vol. 156, Article 104820. Journal Article. https://www.sciencedirect.com/science/article/pii/S1051200424004457

Abstract

Customer satisfaction (CS) evaluation in call centers is essential for assessing service quality but commonly relies on human evaluations. Automatic evaluation systems can perform CS analyses instead, enabling the evaluation of larger datasets. This paper addresses CS analysis through a multimodal approach that employs speech and language representations derived from real-world voicemails. Additionally, given the similarity between evaluating a provided service (which may elicit different emotions in customers) and automatically classifying emotions in speech, we also explore emotion recognition with the well-known IEMOCAP corpus, which comprises four classes corresponding to different emotional states. We incorporated a language representation with word embeddings based on a CNN-LSTM model, and three different self-supervised learning (SSL) speech encoders, namely Wav2Vec2.0, HuBERT, and WavLM. A bidirectional alignment network based on attention mechanisms is employed to synchronize the speech and language representations. Three different fusion strategies are also explored. According to our results, the GGF model outperformed both unimodal and other multimodal methods in the 4-class emotion recognition task on the IEMOCAP dataset and in the binary CS classification task on the KONECTADB dataset. The study also demonstrated superior performance of our methodology compared to previous works on KONECTADB in both unimodal and multimodal approaches.
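The abstract does not give implementation details, so the following is only a minimal sketch of what attention-based bidirectional alignment between a speech sequence and a word sequence could look like. Everything in it is an assumption made for illustration: the function names (`cross_attend`, `mean_pool`, `align_and_fuse`), the absence of learned projections, the toy inputs, and the use of mean pooling followed by concatenation as the fusion step. It is not the authors' network and not the GGF model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    """Scaled dot-product attention without learned projections (assumption):
    each query vector is replaced by a weighted average of the other
    modality's vectors."""
    scale = math.sqrt(len(queries[0]))
    aligned = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale
                           for k in keys_values])
        aligned.append([sum(w * k[d] for w, k in zip(weights, keys_values))
                        for d in range(len(q))])
    return aligned

def mean_pool(frames):
    """Average a sequence of vectors over time."""
    return [sum(col) / len(frames) for col in zip(*frames)]

def align_and_fuse(speech, text):
    """Attend in both directions, then concatenate the pooled results.
    Concatenation is a hypothetical fusion choice, not taken from the paper."""
    text_for_speech = cross_attend(speech, text)   # one vector per speech frame
    speech_for_text = cross_attend(text, speech)   # one vector per word
    return mean_pool(text_for_speech) + mean_pool(speech_for_text)

# Toy inputs: 3 speech frames and 2 word embeddings, both of dimension 2.
speech = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
fused = align_and_fuse(speech, text)
```

In a real system the two inputs would come from an SSL speech encoder and a word-embedding model, would be projected to a shared dimension by learned layers, and the fused vector would feed a classifier; none of those parts is shown here.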
Source journal
Digital Signal Processing (Engineering & Technology: Engineering, Electrical & Electronic)
CiteScore: 5.30
Self-citation rate: 17.20%
Articles published: 435
Review time: 66 days
Journal description: Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The Journal invites top quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal. The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing including seismic signal processing
• chemioinformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy