Spell4TTS: Acoustically-informed spellings for improving text-to-speech pronunciations

12th ISCA Speech Synthesis Workshop (SSW2023). Pub Date: 2023-08-26. DOI: 10.21437/ssw.2023-2

Jason Fong, Hao Tang, Simon King

Abstract

Ensuring accurate pronunciation is critical for high-quality text-to-speech (TTS). This typically requires a phoneme-based pronunciation dictionary, which is labour-intensive and costly to create. Previous work has suggested using graphemes instead of phonemes, but the inevitable pronunciation errors that occur cannot be fixed, since there is no longer a pronunciation dictionary. As an alternative, speech-based self-supervised learning (SSL) models have been proposed for pronunciation control, but these models are computationally expensive to train, produce representations that are not easily interpretable, and capture unwanted non-phonemic information. To address these limitations, we propose Spell4TTS, a novel method that generates acoustically-informed word spellings. Spellings are both interpretable and easily edited. The method could be applied to any existing pre-built TTS system. Our experiments show that the method creates word spellings that lead to fewer TTS pronunciation errors than the original spellings, or an Automatic Speech Recognition baseline. Additionally, we observe that pronunciation can be further enhanced by ranking candidates in the space of SSL speech representations, and by incorporating Human-in-the-Loop screening over the top-ranked spellings devised by our method. By working with spellings of words (composed of characters), the method lowers the entry barrier for TTS system development for languages with limited pronunciation resources. It should reduce the time and cost involved in creating and maintaining pronunciation dictionaries.
