Florian Hinterleitner, C. Norrenbrock, S. Möller, U. Heute
2012 IEEE Spoken Language Technology Workshop (SLT), published 2012-12-01. DOI: 10.1109/SLT.2012.6424229 (https://doi.org/10.1109/SLT.2012.6424229)
What makes this voice sound so bad? A multidimensional analysis of state-of-the-art text-to-speech systems
This paper presents research on the perceptual quality dimensions of synthetic speech. We generated 57 stimuli from 16 female and 19 male German text-to-speech (TTS) systems and asked listeners to judge the perceptual distances between them in a sorting task. Applying a multidimensional scaling algorithm to these judgments, we extracted three dimensions. Through expert listening and a comparison with ratings gathered on 16 attribute scales, the three dimensions can be assigned to naturalness of voice, temporal distortions, and calmness. These dimensions are discussed in detail and compared to the perceptual quality dimensions found in previous multidimensional analyses. Moreover, the results are analyzed by type of TTS system. The identified dimensions will be used in future work to build a dimension-based quality predictor for synthetic speech.
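To make the sorting-plus-scaling pipeline described in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes that sorting data are converted to dissimilarities as the fraction of listeners who placed two stimuli in different groups, and it uses classical (Torgerson) multidimensional scaling as a simple stand-in, since the abstract does not say which MDS variant was applied. The function names `sorting_dissimilarity` and `classical_mds` and the toy data are hypothetical.

```python
import math
import random


def sorting_dissimilarity(sorts, n_items):
    """Turn sorting-task results into a dissimilarity matrix.

    sorts: one partition per listener; each partition is a list of groups,
    and each group is a list of stimulus indices. Every stimulus is assumed
    to appear in exactly one group per listener.
    Returns an n_items x n_items matrix whose entry (i, j) is the fraction
    of listeners who put stimuli i and j into different groups.
    """
    together = [[0] * n_items for _ in range(n_items)]
    for partition in sorts:
        for group in partition:
            for i in group:
                for j in group:
                    together[i][j] += 1
    n_listeners = len(sorts)
    return [[1.0 - together[i][j] / n_listeners for j in range(n_items)]
            for i in range(n_items)]


def classical_mds(d, n_dims=3, n_iter=1000, seed=0):
    """Classical (Torgerson) MDS using power iteration with deflation."""
    n = len(d)
    # Double-center the squared dissimilarities: B = -1/2 * J * D^2 * J.
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * n_dims for _ in range(n)]
    for k in range(n_dims):
        # Random unit-length start vector.
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _ in range(n_iter):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the signed eigenvalue estimate for v.
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigval = sum(v[i] * bv[i] for i in range(n))
        if eigval <= 1e-12:
            # No usable positive eigenvalue; remaining coordinates stay 0.
            break
        scale = math.sqrt(eigval)
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords


# Toy data (made up): three listeners sorting four stimuli.
toy_sorts = [
    [[0, 1], [2, 3]],
    [[0, 1], [2], [3]],
    [[0], [1], [2, 3]],
]
toy_d = sorting_dissimilarity(toy_sorts, 4)
for row in classical_mds(toy_d, n_dims=2):
    print([round(x, 3) for x in row])
```

In a study like the one described, the stimulus coordinates on each extracted dimension would then be compared with attribute-scale ratings to interpret the dimensions. Non-metric MDS is often preferred for sorting data, so the classical variant here is only an illustration of the idea.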