TTS - VLSP 2021: The NAVI’s Text-To-Speech System for Vietnamese
Nguyen Le Minh, An Quoc Do, Viet Q. Vu, Huyen Thuc Khanh Vo
VNU Journal of Science: Computer Science and Communication Engineering
DOI: 10.25073/2588-1086/vnucsce.347 · Published 2022-06-30
Abstract
The Association for Vietnamese Language and Speech Processing (VLSP) has organized a series of workshops that bring together researchers and professionals working in NLP and promote research on the Vietnamese language. One of the shared tasks at the eighth workshop is TTS [14], which uses a dataset consisting only of spontaneous speech. This poses a challenge for current TTS models, since they perform well only when synthesizing reading-style speech (e.g., audiobooks). Moreover, the quality of the audio in the dataset has a large impact on model performance: samples with noisy backgrounds or with multiple voices speaking at the same time degrade our model's output. In this paper, we describe our approach to this problem: we first preprocess the training data, then use it to train a FastSpeech2 [10] acoustic model with some replacements in the external aligner model, and finally use a HiFi-GAN [4] vocoder to generate the waveform. According to the official evaluation of the VLSP 2021 TTS task, our approach achieves an in-domain MOS of 3.729, an out-of-domain MOS of 3.557, and a SUS score of 79.70%. Audio samples are available at https://navi-tts.github.io/.
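For readers unfamiliar with this architecture, below is a minimal sketch of the two-stage inference flow the abstract describes: a FastSpeech2-style acoustic model maps phoneme IDs to a mel spectrogram, and a HiFi-GAN-style vocoder maps the spectrogram to a waveform. The `synthesize` function, the model interfaces, and the tensor shapes are illustrative assumptions, not the authors' released code.

```python
import torch

@torch.no_grad()
def synthesize(phoneme_ids, acoustic_model, vocoder, sample_rate=22050):
    """Hypothetical two-stage TTS inference sketch.

    phoneme_ids: list[int] -- phoneme token IDs for one utterance.
    acoustic_model: a FastSpeech2-like module; assumed to take a
        (1, T) LongTensor and return a mel spectrogram of shape
        (1, n_mels, frames), predicting durations/pitch/energy internally.
    vocoder: a HiFi-GAN-like module; assumed to take the mel
        spectrogram and return a waveform tensor.
    """
    tokens = torch.tensor([phoneme_ids], dtype=torch.long)
    mel = acoustic_model(tokens)          # (1, n_mels, frames)
    wav = vocoder(mel)                    # (1, 1, samples) or (1, samples)
    return wav.squeeze().cpu().numpy(), sample_rate
```

Any FastSpeech2 and HiFi-GAN implementation exposing such interfaces would slot into this flow; the competition system additionally replaces the external aligner used to supply phoneme durations during FastSpeech2 training.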