Spell4TTS: Acoustically-informed spellings for improving text-to-speech pronunciations

Jason Fong, Hao Tang, Simon King
{"title":"Spell4TTS: Acoustically-informed spellings for improving text-to-speech pronunciations","authors":"Jason Fong, Hao Tang, Simon King","doi":"10.21437/ssw.2023-2","DOIUrl":null,"url":null,"abstract":"Ensuring accurate pronunciation is critical for high-quality text-to-speech (TTS). This typically requires a phoneme-based pro-nunciation dictionary, which is labour-intensive and costly to create. Previous work has suggested using graphemes instead of phonemes, but the inevitable pronunciation errors that occur cannot be fixed, since there is no longer a pronunciation dictionary. As an alternative, speech-based self-supervised learning (SSL) models have been proposed for pronunciation control, but these models are computationally expensive to train, produce representations that are not easily interpretable, and capture unwanted non-phonemic information. To address these limitations, we propose Spell4TTS, a novel method that generates acoustically-informed word spellings. Spellings are both inter-pretable and easily edited. The method could be applied to any existing pre-built TTS system. Our experiments show that the method creates word spellings that lead to fewer TTS pronunciation errors than the original spellings, or an Automatic Speech Recognition baseline. Additionally, we observe that pronunciation can be further enhanced by ranking candidates in the space of SSL speech representations, and by incorporating Human-in-the-Loop screening over the top-ranked spellings devised by our method. By working with spellings of words (composed of characters), the method lowers the entry barrier for TTS sys-tem development for languages with limited pronunciation resources. It should reduce the time and cost involved in creating and maintaining pronunciation dictionaries.","PeriodicalId":346639,"journal":{"name":"12th ISCA Speech Synthesis Workshop (SSW2023)","volume":"93 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"12th ISCA Speech Synthesis Workshop (SSW2023)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/ssw.2023-2","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Ensuring accurate pronunciation is critical for high-quality text-to-speech (TTS). This typically requires a phoneme-based pro-nunciation dictionary, which is labour-intensive and costly to create. Previous work has suggested using graphemes instead of phonemes, but the inevitable pronunciation errors that occur cannot be fixed, since there is no longer a pronunciation dictionary. As an alternative, speech-based self-supervised learning (SSL) models have been proposed for pronunciation control, but these models are computationally expensive to train, produce representations that are not easily interpretable, and capture unwanted non-phonemic information. To address these limitations, we propose Spell4TTS, a novel method that generates acoustically-informed word spellings. Spellings are both inter-pretable and easily edited. The method could be applied to any existing pre-built TTS system. Our experiments show that the method creates word spellings that lead to fewer TTS pronunciation errors than the original spellings, or an Automatic Speech Recognition baseline. Additionally, we observe that pronunciation can be further enhanced by ranking candidates in the space of SSL speech representations, and by incorporating Human-in-the-Loop screening over the top-ranked spellings devised by our method. By working with spellings of words (composed of characters), the method lowers the entry barrier for TTS sys-tem development for languages with limited pronunciation resources. It should reduce the time and cost involved in creating and maintaining pronunciation dictionaries.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Spell4TTS:声学信息拼写,提高文本到语音的发音
确保准确的发音对于高质量的文本到语音(TTS)至关重要。这通常需要一个基于音素的亲发音词典,这是一项劳动密集型的工作,而且创建成本很高。先前的研究建议使用字素代替音素,但不可避免的发音错误无法修复,因为不再有发音字典。作为一种替代方案,基于语音的自监督学习(SSL)模型已经被提出用于发音控制,但是这些模型的训练计算成本很高,产生的表示不容易解释,并且捕获不需要的非音位信息。为了解决这些限制,我们提出了Spell4TTS,一种生成声学信息单词拼写的新方法。拼写既可解释又易于编辑。该方法可应用于任何现有的预建TTS系统。我们的实验表明,该方法创建的单词拼写导致的TTS发音错误比原始拼写或自动语音识别基线更少。此外,我们观察到,通过对SSL语音表示空间中的候选词进行排名,以及通过对我们的方法设计的排名靠前的拼写进行Human-in-the-Loop筛选,发音可以进一步增强。通过处理单词(由字符组成)的拼写,该方法降低了针对发音资源有限的语言开发TTS系统的入门门槛。它可以减少创建和维护发音字典所涉及的时间和成本。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Re-examining the quality dimensions of synthetic speech Synthesising turn-taking cues using natural conversational data Diffusion Transformer for Adaptive Text-to-Speech Adaptive Duration Modification of Speech using Masked Convolutional Networks and Open-Loop Time Warping Audiobook synthesis with long-form neural text-to-speech
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1