Do Prosody and Embodiment Influence the Perceived Naturalness of Conversational Agents' Speech?

ACM Transactions on Applied Perception. Impact factor: 2.1. CAS ranking: Zone 4, Computer Science. JCR: Q3, Computer Science, Software Engineering. Pub Date: 2021-10-31. DOI: 10.1145/3486580

Jonathan Ehret, A. Bönsch, Lukas Aspöck, Christine T. Röhr, S. Baumann, M. Grice, J. Fels, T. Kuhlen

Publication type: Journal Article. Pages: 21:1-21:15. Open access: no.

Citations: 8

Abstract

For conversational agents’ speech, either all possible sentences have to be prerecorded by voice actors or the required utterances can be synthesized. While synthesizing speech is more flexible and economical in production, it also potentially reduces the perceived naturalness of the agents, among other reasons because of mistakes at various linguistic levels. In our article, we are interested in the impact of adequate and inadequate prosody, here particularly in terms of accent placement, on the perceived naturalness and aliveness of the agents. We compare (1) inadequate prosody, as generated by off-the-shelf text-to-speech (TTS) engines with synthetic output; (2) the same inadequate prosody imitated by trained human speakers; and (3) adequate prosody produced by those speakers. The speech was presented either as audio-only or by embodied, anthropomorphic agents, to investigate a potential masking effect of a simultaneous visual representation of those virtual agents. To this end, we conducted an online study with 40 participants listening to four different dialogues, each presented in the three Speech levels and the two Embodiment levels. Results confirmed that adequate prosody in human speech is perceived as more natural (and the agents are perceived as more alive) than inadequate prosody in both human (2) and synthetic speech (1). Thus, it is not sufficient to just use a human voice for an agent’s speech to be perceived as natural; what is decisive is whether the prosodic realisation is adequate. Furthermore, and surprisingly, we found no masking effect of speaker embodiment, since neither a human voice with inadequate prosody nor a synthetic voice was judged as more natural when a virtual agent was visible than in the audio-only condition. On the contrary, the human voice was even judged as less “alive” when accompanied by a virtual agent.
In sum, our results emphasize, on the one hand, the importance of adequate prosody for perceived naturalness, especially in terms of accents being placed on important words in the phrase, while showing, on the other hand, that the embodiment of virtual agents plays a minor role in the naturalness ratings of voices.
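
The design described in the abstract crosses three Speech levels with two Embodiment levels. As a rough illustration only, the resulting conditions can be enumerated as below. The label strings are paraphrases chosen for this sketch, not the authors' own condition names, and nothing here reflects how stimuli were actually assigned or ordered in the study.

```python
from itertools import product

# Hypothetical labels paraphrasing the abstract; not the authors' naming.
SPEECH_LEVELS = (
    "synthetic speech, inadequate prosody",  # (1) off-the-shelf TTS output
    "human speech, inadequate prosody",      # (2) TTS prosody imitated by speakers
    "human speech, adequate prosody",        # (3) adequate prosody by the same speakers
)
EMBODIMENT_LEVELS = ("audio-only", "embodied agent")

# Full crossing of the two factors: 3 x 2 = 6 conditions.
conditions = list(product(SPEECH_LEVELS, EMBODIMENT_LEVELS))

for speech, embodiment in conditions:
    print(f"{speech} | {embodiment}")
```

Each of the six pairs corresponds to one combination of voice condition and presentation mode compared in the reported naturalness and aliveness ratings.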

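Source journal

ACM Transactions on Applied Perception (Engineering & Technology: Computer Science, Software Engineering)

CiteScore: 3.70
Self-citation rate: 0.00%
Articles published: (value missing)
Review time: 12 months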

Journal description: ACM Transactions on Applied Perception (TAP) aims to strengthen the synergy between computer science and psychology/perception by publishing top-quality papers that help to unify research in these fields. The journal publishes interdisciplinary research of significant and lasting value in any topic area that spans both Computer Science and Perceptual Psychology. All papers must incorporate both perceptual and computer science components.