使用自然会话数据合成轮转提示

12th ISCA Speech Synthesis Workshop (SSW2023) Pub Date : 2023-08-26 DOI:10.21437/ssw.2023-12

Johannah O'Mahony, Catherine Lai, Simon King

{"title":"使用自然会话数据合成轮转提示","authors":"Johannah O'Mahony, Catherine Lai, Simon King","doi":"10.21437/ssw.2023-12","DOIUrl":null,"url":null,"abstract":"As speech synthesis quality reaches high levels of naturalness for isolated utterances, more work is focusing on the synthesis of context-dependent conversational speech. The role of context in conversation is still poorly understood and many contextual factors can affect an utterances’s prosodic realisation. Most studies incorporating context use rich acoustic or textual embeddings of the previous context, then demonstrate improvements in overall naturalness. Such studies are not informative about what the context embedding represents, or how it affects an utterance’s realisation. So instead, we narrow the focus to a single, explicit contextual factor. In the current work, this is turn-taking. We condition a speech synthesis model on whether an utterance is turn-final. Objective measures and targeted subjective evaluation are used to demonstrate that the model can synthesise turn-taking cues which are perceived by listeners, with results being speaker-dependent.","PeriodicalId":346639,"journal":{"name":"12th ISCA Speech Synthesis Workshop (SSW2023)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Synthesising turn-taking cues using natural conversational data\",\"authors\":\"Johannah O'Mahony, Catherine Lai, Simon King\",\"doi\":\"10.21437/ssw.2023-12\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As speech synthesis quality reaches high levels of naturalness for isolated utterances, more work is focusing on the synthesis of context-dependent conversational speech. The role of context in conversation is still poorly understood and many contextual factors can affect an utterances’s prosodic realisation. Most studies incorporating context use rich acoustic or textual embeddings of the previous context, then demonstrate improvements in overall naturalness. Such studies are not informative about what the context embedding represents, or how it affects an utterance’s realisation. So instead, we narrow the focus to a single, explicit contextual factor. In the current work, this is turn-taking. We condition a speech synthesis model on whether an utterance is turn-final. Objective measures and targeted subjective evaluation are used to demonstrate that the model can synthesise turn-taking cues which are perceived by listeners, with results being speaker-dependent.\",\"PeriodicalId\":346639,\"journal\":{\"name\":\"12th ISCA Speech Synthesis Workshop (SSW2023)\",\"volume\":\"2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-08-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"12th ISCA Speech Synthesis Workshop (SSW2023)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.21437/ssw.2023-12\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"12th ISCA Speech Synthesis Workshop (SSW2023)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/ssw.2023-12","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

随着孤立话语的语音合成质量达到高水平的自然度，更多的工作集中在上下文依赖性会话语音的合成上。人们对语境在会话中的作用仍然知之甚少，许多语境因素会影响话语的韵律实现。大多数结合上下文的研究使用前一个上下文的丰富的声学或文本嵌入，然后证明了整体自然度的改善。这样的研究并没有提供上下文嵌入代表什么，或者它如何影响话语的实现的信息。所以，我们把焦点缩小到一个单一的，明确的环境因素。在目前的工作中，这是轮流进行的。我们将一个语音合成模型设置为一个话语是否为turn-final。客观测量和有针对性的主观评价被用来证明该模型可以合成被听众感知的轮流线索，结果依赖于说话者。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Synthesising turn-taking cues using natural conversational data

As speech synthesis quality reaches high levels of naturalness for isolated utterances, more work is focusing on the synthesis of context-dependent conversational speech. The role of context in conversation is still poorly understood and many contextual factors can affect an utterances’s prosodic realisation. Most studies incorporating context use rich acoustic or textual embeddings of the previous context, then demonstrate improvements in overall naturalness. Such studies are not informative about what the context embedding represents, or how it affects an utterance’s realisation. So instead, we narrow the focus to a single, explicit contextual factor. In the current work, this is turn-taking. We condition a speech synthesis model on whether an utterance is turn-final. Objective measures and targeted subjective evaluation are used to demonstrate that the model can synthesise turn-taking cues which are perceived by listeners, with results being speaker-dependent.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

12th ISCA Speech Synthesis Workshop (SSW2023)

自引率

0.00%

发文量

期刊最新文献

Re-examining the quality dimensions of synthetic speech Synthesising turn-taking cues using natural conversational data Diffusion Transformer for Adaptive Text-to-Speech Adaptive Duration Modification of Speech using Masked Convolutional Networks and Open-Loop Time Warping Audiobook synthesis with long-form neural text-to-speech