Do Prosody and Embodiment Influence the Perceived Naturalness of Conversational Agents' Speech?

ACM Transactions on Applied Perception. Pub Date: 2021-10-31. DOI: 10.1145/3486580. IF 1.9; Zone 4 (Computer Science); JCR Q3 (Computer Science, Software Engineering)
Jonathan Ehret, A. Bönsch, Lukas Aspöck, Christine T. Röhr, S. Baumann, M. Grice, J. Fels, T. Kuhlen
Publication type: Journal Article. Pages: 21:1-21:15. Open access: no. Metadata source: Semantic Scholar.
Citations: 8

Abstract

Do Prosody and Embodiment Influence the Perceived Naturalness of Conversational Agents' Speech?
For conversational agents’ speech, either all possible sentences have to be prerecorded by voice actors or the required utterances can be synthesized. While synthesizing speech is more flexible and economical in production, it can also reduce the perceived naturalness of the agents, among other reasons because of mistakes at various linguistic levels. In our article, we are interested in the impact of adequate and inadequate prosody, here particularly in terms of accent placement, on the perceived naturalness and aliveness of the agents. We compare (1) inadequate prosody, as generated by off-the-shelf text-to-speech (TTS) engines with synthetic output; (2) the same inadequate prosody imitated by trained human speakers; and (3) adequate prosody produced by those speakers. The speech was presented either as audio-only or by embodied, anthropomorphic agents, to investigate the potential masking effect of a simultaneous visual representation of those virtual agents. To this end, we conducted an online study with 40 participants listening to four different dialogues, each presented in the three Speech levels and the two Embodiment levels. Results confirmed that adequate prosody in human speech is perceived as more natural (and the agents are perceived as more alive) than inadequate prosody in both human (2) and synthetic speech (1). Thus, it is not sufficient to just use a human voice for an agent’s speech to be perceived as natural; the decisive factor is whether the prosodic realisation is adequate. Furthermore, and surprisingly, we found no masking effect of speaker embodiment, since neither a human voice with inadequate prosody nor a synthetic voice was judged as more natural when a virtual agent was visible than in the audio-only condition. On the contrary, the human voice was even judged as less “alive” when accompanied by a virtual agent.
In sum, our results emphasize, on the one hand, the importance of adequate prosody for perceived naturalness, especially in terms of accents being placed on important words in the phrase, while showing, on the other hand, that the embodiment of virtual agents plays a minor role in the naturalness ratings of voices.
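The study design described in the abstract (three Speech levels crossed with two Embodiment levels, over four dialogues) can be sketched as a small enumeration. This is only an illustration of the factor structure: the level labels and dialogue numbering are hypothetical, and the abstract does not say how stimuli were ordered or assigned to the 40 participants, so the sketch assumes a full crossing and nothing more.

```python
from itertools import product

# Factor levels as named in the abstract; the short labels are illustrative.
SPEECH_LEVELS = (
    "synthetic_inadequate",  # (1) off-the-shelf TTS output with inadequate prosody
    "human_inadequate",      # (2) human speakers imitating that inadequate prosody
    "human_adequate",        # (3) the same speakers producing adequate prosody
)
EMBODIMENT_LEVELS = ("audio_only", "embodied_agent")
DIALOGUES = (1, 2, 3, 4)  # four dialogues; the numbering is hypothetical


def design_cells():
    """Return every Speech x Embodiment combination (3 x 2 cells)."""
    return list(product(SPEECH_LEVELS, EMBODIMENT_LEVELS))


def stimuli():
    """Return every (dialogue, speech, embodiment) triple, assuming a full crossing."""
    return [(d, s, e) for d in DIALOGUES for s, e in design_cells()]


if __name__ == "__main__":
    for speech, embodiment in design_cells():
        print(f"{speech:22s} x {embodiment}")
    print(len(stimuli()), "dialogue-by-condition combinations")
```

Under the full-crossing assumption, the masking question in the abstract amounts to comparing, within each Speech level, the ratings of the `audio_only` cell against those of the `embodied_agent` cell.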
Source journal
ACM Transactions on Applied Perception (Engineering and Technology; Computer Science: Software Engineering)
CiteScore: 3.70
Self-citation rate: 0.00%
Publication volume: 22
Review time: 12 months
About the journal: ACM Transactions on Applied Perception (TAP) aims to strengthen the synergy between computer science and psychology/perception by publishing top-quality papers that help to unify research in these fields. The journal publishes interdisciplinary research of significant and lasting value in any topic area that spans both Computer Science and Perceptual Psychology. All papers must incorporate both perceptual and computer science components.