Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT
Kazuki Yamauchi, Yuki Saito, Hiroshi Saruwatari
arXiv - EE - Audio and Speech Processing, 2024-09-11, arxiv-2409.07265
We explore cross-dialect text-to-speech (CD-TTS), the task of synthesizing
learned speakers' voices in non-native dialects, especially in pitch-accent
languages. CD-TTS is important for developing voice agents that naturally
communicate with people across regions. We present a novel TTS model comprising
three sub-modules to perform competitively at this task. We first train a
backbone TTS model to synthesize dialect speech from text, conditioned on
phoneme-level accent latent variables (ALVs) extracted from speech by a
reference encoder. Then, we train an ALV predictor to predict ALVs tailored to
a target dialect from input text, leveraging our novel multi-dialect
phoneme-level BERT. We conduct multi-dialect TTS experiments and evaluate the
effectiveness of our model by comparing it with a baseline derived from
conventional dialect TTS methods. The results show that our model improves the
dialectal naturalness of synthetic speech in CD-TTS.
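
The two-stage data flow described above can be illustrated with a toy sketch. This is not the authors' implementation, and every name in it is hypothetical. It assumes an ALV is a single scalar per phoneme (the mean F0 over that phoneme's frames), replaces the multi-dialect phoneme-level BERT predictor with a plain lookup table, and replaces the backbone TTS model with a function that only pairs phonemes with their ALVs. It shows only which component consumes which input: ALVs come from speech during training and from text plus a target dialect at synthesis time.

```python
from collections import defaultdict
from statistics import mean


def reference_encoder(f0_frames, durations):
    """Toy stand-in: one scalar ALV per phoneme (mean F0 over its frames)."""
    alvs, start = [], 0
    for d in durations:
        alvs.append(mean(f0_frames[start:start + d]))
        start += d
    return alvs


def backbone_tts(phonemes, alvs):
    """Toy stand-in for the backbone: pairs each phoneme with its ALV."""
    if len(phonemes) != len(alvs):
        raise ValueError("need exactly one ALV per phoneme")
    return list(zip(phonemes, alvs))


class ALVPredictor:
    """Toy stand-in for the text-side predictor (a lookup table, not BERT)."""

    def __init__(self):
        self._seen = defaultdict(list)

    def fit(self, examples):
        # examples: iterable of (dialect, phonemes, ALVs from the reference encoder)
        for dialect, phonemes, alvs in examples:
            for p, a in zip(phonemes, alvs):
                self._seen[(dialect, p)].append(a)

    def predict(self, phonemes, dialect):
        # Unseen (dialect, phoneme) pairs fall back to 0.0 in this toy.
        return [
            mean(self._seen[(dialect, p)]) if (dialect, p) in self._seen else 0.0
            for p in phonemes
        ]


# Stage 1: ALVs are extracted from speech (hypothetical F0 values in Hz).
phonemes = ["a", "m", "e"]
f0 = [100.0, 110.0, 200.0, 220.0, 90.0, 110.0]
durations = [2, 2, 2]  # frames per phoneme
alvs = reference_encoder(f0, durations)

# Stage 2: the predictor learns to produce ALVs from text and a dialect label.
predictor = ALVPredictor()
predictor.fit([("dialect_A", phonemes, alvs)])

# Synthesis: no reference speech is needed, only text and the target dialect.
predicted = predictor.predict(phonemes, "dialect_A")
output = backbone_tts(phonemes, predicted)
```

In this sketch the backbone never sees a dialect label directly. Dialect-specific accent information reaches it only through the ALVs, which is the property that would let a predictor swap in ALVs for a different target dialect.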