VLSP 2021 - TTS Challenge: Vietnamese Spontaneous Speech Synthesis
Authors: Nguyen Thi Thu Trang, H. Nguyen
Journal: VNU Journal of Science: Computer Science and Communication Engineering
Published: 2022-06-30
DOI: https://doi.org/10.25073/2588-1086/vnucsce.358
Citation count: 2
Abstract
Text-To-Speech (TTS) was one of nine shared tasks in the eighth annual international VLSP 2021 workshop. All three previous TTS shared tasks were conducted on read-speech datasets. However, the resulting synthetic voices were not natural enough for spoken dialog systems, where the computer must talk to a human in conversation. Speech datasets recorded in a spontaneous environment help a TTS system produce voices that are more natural in speaking style, speaking rate, and intonation. Therefore, in this shared task, participants were asked to build a TTS system from a spontaneous speech dataset. This 7.5-hour dataset was collected from the channel of a famous YouTuber, "Giang ơi...", and then pre-processed into utterances with their corresponding texts. The main challenges of this year's task were: (i) inconsistency in speaking rate, intensity, stress, and prosody across the dataset, (ii) background noise or overlapping voices, and (iii) inaccurate transcripts. A total of 43 teams registered for the shared task, and 8 submissions were finally evaluated online with perceptual tests. Two types of perceptual tests were conducted: (i) a MOS test for naturalness and (ii) a SUS (Semantically Unpredictable Sentences) test for intelligibility. The most intelligible TTS system in the SUS test had a syllable error rate of 15%, while the best MOS score on dialog utterances was 3.98 over 4.54 points on a 5-point MOS scale. The prosody and speaking rate of the synthetic voices were similar to those of the natural voice. However, most of the TTS systems still produced some distorted segments and background noise, and half of them had a syllable error rate of at least 30%.
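The two evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact scoring formula, so this assumes the common edit-distance definition of an error rate (substitutions, deletions, and insertions divided by the number of reference syllables) and treats whitespace-separated tokens as Vietnamese syllables. The function names `syllable_error_rate` and `mean_opinion_score` and the sample sentences are hypothetical, not taken from the challenge.

```python
from statistics import mean


def syllable_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between syllable sequences, divided by reference length.

    Assumption: syllables are whitespace-separated tokens, compared
    case-insensitively. This is not necessarily the organizers' formula.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one syllable")

    # Row-by-row Levenshtein distance over syllables.
    prev = list(range(len(hyp) + 1))
    for i, ref_syl in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_syl in enumerate(hyp, start=1):
            cost = 0 if ref_syl == hyp_syl else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def mean_opinion_score(ratings):
    """Average of listener ratings on a 1-to-5 scale."""
    return mean(ratings)


# Hypothetical listener transcript with one wrong syllable out of five.
print(syllable_error_rate("con mèo xanh ăn bàn", "con mèo xanh ăn bàng"))  # 0.2
```

In a SUS test, listeners transcribe what they hear, so the hypothesis string here stands for a listener's transcript and the reference for the sentence that was synthesized.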