Ilja Popovic, D. Culibrk, Milan Mirković, S. Vukmirović
{"title":"Automatic Speech Recognition and Natural Language Understanding for Emotion Detection in Multi-party Conversations","authors":"Ilja Popovic, D. Culibrk, Milan Mirković, S. Vukmirović","doi":"10.1145/3423325.3423737","DOIUrl":null,"url":null,"abstract":"Conversational emotion and sentiment analysis approaches rely on Natural Language Understanding (NLU) and audio processing components to achieve the goal of detecting emotions and sentiment based on what is being said. While there has been marked progress in pushing the state-of-the-art of theses methods on benchmark multimodal data sets, such as the Multimodal EmotionLines Dataset (MELD), the advances still seem to lag behind what has been achieved in the domain of mainstream Automatic Speech Recognition (ASR) and NLU applications and we were unable to identify any widely used products, services or production-ready systems that would enable the user to reliably detect emotions from audio recordings of multi-party conversations. Published, state-of-the-art scientific studies of multi-view emotion recognition seem to take it for granted that a human-generated or edited transcript is available as input to the NLU modules, providing no information of what happens in a realistic application scenario, where audio only is available and the NLU processing has to rely on text generated by ASR. Motivated by this insight, we present a study designed to evaluate the possibility of applying widely-used state-of-the-art commercial ASR products as the initial audio processing component in an emotion-from-speech detection system. We propose an approach which relies on commercially available products and services, such as Google Speech-to-Text, Mozilla DeepSpeech and the NVIDIA NeMo toolkit to process the audio and applies state-of-the-art NLU approaches for emotion recognition, in order to quickly create a robust, production-ready emotion-from-speech detection system applicable to multi-party conversations.","PeriodicalId":142947,"journal":{"name":"Proceedings of the 1st International Workshop on Multimodal Conversational AI","volume":"40 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 1st International Workshop on Multimodal Conversational AI","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3423325.3423737","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3
Abstract
Conversational emotion and sentiment analysis approaches rely on Natural Language Understanding (NLU) and audio processing components to detect emotions and sentiment based on what is being said. While there has been marked progress in pushing the state of the art of these methods on benchmark multimodal datasets, such as the Multimodal EmotionLines Dataset (MELD), the advances still seem to lag behind what has been achieved in mainstream Automatic Speech Recognition (ASR) and NLU applications, and we were unable to identify any widely used products, services or production-ready systems that would enable the user to reliably detect emotions from audio recordings of multi-party conversations. Published, state-of-the-art scientific studies of multi-view emotion recognition seem to take it for granted that a human-generated or edited transcript is available as input to the NLU modules, providing no information about what happens in a realistic application scenario, where only audio is available and the NLU processing has to rely on text generated by ASR. Motivated by this insight, we present a study designed to evaluate the possibility of applying widely used, state-of-the-art commercial ASR products as the initial audio processing component in an emotion-from-speech detection system. We propose an approach that relies on commercially available products and services, such as Google Speech-to-Text, Mozilla DeepSpeech and the NVIDIA NeMo toolkit, to process the audio, and applies state-of-the-art NLU approaches for emotion recognition, in order to quickly create a robust, production-ready emotion-from-speech detection system applicable to multi-party conversations.
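To make the two-stage pipeline described in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation: it assumes the Google Cloud Speech-to-Text Python client for the ASR stage and an off-the-shelf Transformer-based text emotion classifier (the publicly available j-hartmann/emotion-english-distilroberta-base model, an assumed choice) for the NLU stage.

```python
# Illustrative ASR -> NLU emotion-detection sketch (not the paper's code).
# Assumes google-cloud-speech and transformers are installed and that the
# GOOGLE_APPLICATION_CREDENTIALS environment variable points to valid credentials.
from google.cloud import speech
from transformers import pipeline


def transcribe(audio_path: str) -> str:
    """Transcribe a short 16 kHz mono LINEAR16 WAV file with Google Speech-to-Text."""
    client = speech.SpeechClient()
    with open(audio_path, "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    response = client.recognize(config=config, audio=audio)
    return " ".join(result.alternatives[0].transcript for result in response.results)


def detect_emotion(utterance: str):
    """Score the ASR transcript with an off-the-shelf text emotion classifier."""
    classifier = pipeline(
        "text-classification",
        model="j-hartmann/emotion-english-distilroberta-base",  # assumed model choice
        top_k=None,  # return scores for all emotion labels, not just the top one
    )
    return classifier(utterance)


if __name__ == "__main__":
    transcript = transcribe("utterance.wav")  # hypothetical input file
    print(transcript)
    print(detect_emotion(transcript))
```

In a multi-party setting, the same sketch would be applied per utterance (e.g. after speaker diarization), so that each speaker turn is transcribed and scored separately rather than classifying the whole recording at once.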