Bao Pang, Jun Teng, Qingyang Xu, Yong Song, Xianfeng Yuan, Yibin Li
IET Cybersystems and Robotics, 5(3), published 2023-09-20. DOI: 10.1049/csy2.12098. https://onlinelibrary.wiley.com/doi/10.1049/csy2.12098
Chinese personalised text-to-speech synthesis for robot human–machine interaction
Speech interaction is an important means of robot interaction. With the rapid development of deep learning, end-to-end speech synthesis methods based on it have gradually become mainstream. However, deep learning-based speech synthesis for Chinese still suffers from unstable synthesised speech, poor naturalness and poor personalisation, and so falls short in some practical application scenarios. Hence, an F-MelGAN model is adopted to improve the performance of Chinese speech synthesis. A post-processing network refines the Mel-spectrum predicted by the decoder and alleviates Mel-spectrum distortion. A combined phoneme-level and sentence-level module is proposed to model the personalised style of speakers. An acoustic conditioning network, the speaker encoder network GCNet and feedback-constrained training are combined to address poor personalised synthesis and achieve personalised speech customisation in Chinese. Experimental results show that the whole model generates high-quality speech with high speaker similarity, both for speakers seen during training and for speakers never seen during training.