Influence of corpus size and content on the perceptual quality of a unit selection MaryTTS voice
Florian Hinterleitner, Benjamin Weiss, S. Möller
2016 IEEE Spoken Language Technology Workshop (SLT), published 2016-12-01
DOI: 10.1109/SLT.2016.7846336 (https://doi.org/10.1109/SLT.2016.7846336)
State-of-the-art approaches to text-to-speech (TTS) synthesis, such as unit selection and HMM synthesis, are data-driven: they build a voice from a prerecorded corpus of natural speech. This paper investigates the influence of the size of the speech corpus on five perceptual quality dimensions. Six German unit selection voices were created from subsets of different sizes of the same speech corpus using the MaryTTS synthesis platform. Statistical analysis showed a significant influence of corpus size on all five dimensions. Surprisingly, the voice created from the second-largest corpus reached the best ratings in almost all dimensions, and its rating on the fluency and intelligibility dimension was significantly higher than that of any other voice. Moreover, we could also verify a significant effect of the synthesized utterance on four of the five perceptual quality dimensions.