Audiobook synthesis with long-form neural text-to-speech

Weicheng Zhang, Cheng-chieh Yeh, Will Beckman, T. Raitio, Ramya Rasipuram, L. Golipour, David Winarsky

12th ISCA Speech Synthesis Workshop (SSW2023), 2023-08-26. DOI: 10.21437/ssw.2023-22 (https://doi.org/10.21437/ssw.2023-22)
Abstract
Despite recent advances in text-to-speech (TTS) technology, auto-narration of long-form content such as books remains a challenge. The goal of this work is to enhance neural TTS so that it is suitable for long-form content such as audiobooks. In addition to high quality, we aim to provide a compelling and engaging listening experience, with expressivity that extends beyond a single sentence to the paragraph level, so that the user can not only follow the story but also enjoy listening to it. Towards that goal, we made four enhancements to our baseline TTS system: incorporation of BERT embeddings, explicit prosody prediction from text, long-context modeling over multiple sentences, and pre-training on long-form data. We propose an evaluation framework tailored to long-form content that evaluates synthesis on segments spanning multiple paragraphs and focuses on elements such as comprehension, ease of listening, ability to hold attention, and enjoyment. The evaluation results show that the proposed approach outperforms the baseline on all evaluated metrics, with an absolute 0.47 MOS gain in overall quality. Ablation studies further confirm the effectiveness of the proposed enhancements.
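The abstract does not detail how the enhancements are wired together, but a minimal sketch can illustrate how the first and third (BERT embeddings and long-context modeling over multiple sentences) might combine: project a sentence-level BERT embedding down to the encoder dimension and add it to the phoneme encoder states, with an averaged window over neighboring sentences' embeddings supplying paragraph-level context. All module names, dimensions, the context-window size, and the fusion-by-addition choice below are illustrative assumptions, not the authors' architecture; the BERT vectors are stood in for by random tensors.

```python
import torch
import torch.nn as nn


class LongContextConditioner(nn.Module):
    """Illustrative sketch (not the paper's architecture): fuse a
    sentence-level BERT embedding plus a window of neighboring-sentence
    embeddings into phoneme-level TTS encoder states."""

    def __init__(self, bert_dim: int = 768, enc_dim: int = 256, context: int = 2):
        super().__init__()
        self.context = context                      # sentences of left/right context (assumed)
        self.sent_proj = nn.Linear(bert_dim, enc_dim)
        self.ctx_proj = nn.Linear(bert_dim, enc_dim)

    def forward(self, enc_states, sent_emb, all_sent_embs, sent_idx):
        # enc_states:    (T, enc_dim) phoneme encoder outputs for one sentence
        # sent_emb:      (bert_dim,)  BERT embedding of the current sentence
        # all_sent_embs: (N, bert_dim) embeddings of every sentence in the chunk
        lo = max(0, sent_idx - self.context)
        hi = min(all_sent_embs.size(0), sent_idx + self.context + 1)
        ctx = all_sent_embs[lo:hi].mean(dim=0)      # average the context window
        cond = self.sent_proj(sent_emb) + self.ctx_proj(ctx)
        return enc_states + cond.unsqueeze(0)       # broadcast over time steps


# Toy usage with random tensors standing in for real BERT outputs.
torch.manual_seed(0)
cond = LongContextConditioner()
enc = torch.randn(40, 256)          # 40 phoneme frames for one sentence
sents = torch.randn(6, 768)         # 6 sentences in a paragraph-level chunk
out = cond(enc, sents[2], sents, sent_idx=2)
print(out.shape)                    # torch.Size([40, 256])
```

Additive fusion is chosen here only because it keeps the encoder's output shape unchanged; concatenation followed by a projection, or cross-attention over the sentence embeddings, would be equally plausible ways to realize the long-context conditioning the abstract describes.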