Federated Learning for Human-in-the-Loop Many-to-Many Voice Conversion
Ryunosuke Hirai, Yuki Saito, H. Saruwatari
We propose a method for training a many-to-many voice conversion (VC) model that can additionally learn users' voices while protecting the privacy of their data. Conventional many-to-many VC methods train a VC model on a publicly available or proprietary multi-speaker corpus, but they do not always achieve high-quality VC for input speech from a wide range of users. Our method is based on federated learning, a distributed machine learning framework in which a developer and users cooperatively train a model while protecting the privacy of user-owned data. We present a proof-of-concept method built on StarGANv2-VC (Fed-StarGANv2-VC) and demonstrate that it achieves speaker similarity comparable to that of conventional, non-federated StarGANv2-VC.
12th ISCA Speech Synthesis Workshop (SSW2023), doi: 10.21437/ssw.2023-15
Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications
Biel Tura Vecino, Adam Gabrys, Daniel Matwicki, Andrzej Pomirski, Tom Iddon, Marius Cotescu, Jaime Lorenzo-Trueba
Recent works have shown that modelling the raw waveform directly from text in an end-to-end (E2E) fashion produces more natural-sounding speech than traditional neural text-to-speech (TTS) systems based on a cascaded, two-stage approach. However, current state-of-the-art E2E models are computationally complex and memory-intensive, making them unsuitable for real-time offline on-device applications in low-resource scenarios. To address this issue, we propose a Lightweight E2E-TTS (LE2E) model that generates high-quality speech while requiring minimal computational resources. We evaluate the proposed model on the LJSpeech dataset and show that it achieves state-of-the-art performance while being up to 90% smaller in terms of model parameters and 10× faster in real-time factor. Furthermore, we demonstrate that the proposed E2E training paradigm achieves better quality than an equivalent architecture trained in a two-stage approach. Our results suggest that LE2E is a promising approach for developing real-time, high-quality TTS for low-resource on-device applications.
{"title":"Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications","authors":"Biel Tura Vecino, Adam Gabrys, Daniel Matwicki, Andrzej Pomirski, Tom Iddon, Marius Cotescu, Jaime Lorenzo-Trueba","doi":"10.21437/ssw.2023-35","DOIUrl":"https://doi.org/10.21437/ssw.2023-35","url":null,"abstract":"Recent works have shown that modelling raw waveform directly from text in an end-to-end (E2E) fashion produces more natural-sounding speech than traditional neural text-to-speech (TTS) systems based on a cascade or two-stage approach. However, current E2E state-of-the-art models are computationally complex and memory-consuming, making them unsuitable for real-time offline on-device applications in low-resource scenarios. To address this issue, we propose a Lightweight E2E-TTS (LE2E) model that generates high-quality speech requiring minimal computational resources. We evaluate the proposed model on the LJSpeech dataset and show that it achieves state-of-the-art performance while being up to 90% smaller in terms of model parameters and 10 × faster in real-time-factor. Furthermore, we demonstrate that the proposed E2E training paradigm achieves better quality compared to an equivalent architecture trained in a two-stage approach. Our results suggest that LE2E is a promising approach for developing real-time, high quality, low-resource TTS applications for on-device applications.","PeriodicalId":346639,"journal":{"name":"12th ISCA Speech Synthesis Workshop (SSW2023)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128172769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module
Ondřej Plátek, Ondřej Dušek
We present MooseNet, a trainable speech metric that predicts listeners' Mean Opinion Score (MOS). We propose a novel approach in which a Probabilistic Linear Discriminant Analysis (PLDA) generative model is used on top of embeddings obtained from a self-supervised learning (SSL) neural network (NN) model. We show that PLDA works well with a non-fine-tuned SSL model when trained on only 136 utterances (ca. one minute of training time) and that PLDA consistently improves various neural MOS prediction models, even state-of-the-art models with task-specific fine-tuning. Our ablation study shows that PLDA training is superior to SSL model fine-tuning in a low-resource scenario. We also improve SSL model fine-tuning through a suitable optimizer choice and additional contrastive and multi-task training objectives. The fine-tuned MooseNet NN with the PLDA module achieves the best results, surpassing the SSL baseline on the VoiceMOS Challenge data.
12th ISCA Speech Synthesis Workshop (SSW2023), doi: 10.21437/ssw.2023-8
Improving the quality of neural TTS using long-form content and multi-speaker multi-style modeling
T. Raitio, Javier Latorre, Andrea Davis, Tuuli H. Morrill, L. Golipour
Neural text-to-speech (TTS) can provide quality close to natural speech if an adequate amount of high-quality speech material is available for training. However, acquiring speech data for TTS training is costly and time-consuming, especially if the goal is to generate different speaking styles. In this work, we show that we can transfer speaking style across speakers and improve the quality of synthetic speech by training a multi-speaker multi-style (MSMS) model with long-form recordings in addition to regular TTS recordings. In particular, we show that 1) multi-speaker modeling improves overall TTS quality, 2) the proposed MSMS approach outperforms the pre-training and fine-tuning approach when utilizing additional multi-speaker data, and 3) the long-form speaking style is highly rated regardless of the target text domain.
{"title":"Improving the quality of neural TTS using long-form content and multi-speaker multi-style modeling","authors":"T. Raitio, Javier Latorre, Andrea Davis, Tuuli H. Morrill, L. Golipour","doi":"10.21437/ssw.2023-23","DOIUrl":"https://doi.org/10.21437/ssw.2023-23","url":null,"abstract":"Neural text-to-speech (TTS) can provide quality close to natural speech if an adequate amount of high-quality speech material is available for training. However, acquiring speech data for TTS training is costly and time-consuming, especially if the goal is to generate different speaking styles. In this work, we show that we can transfer speaking style across speakers and improve the quality of synthetic speech by training a multi-speaker multi-style (MSMS) model with long-form recordings, in addition to regular TTS recordings. In particular, we show that 1) multi-speaker modeling improves the overall TTS quality, 2) the proposed MSMS approach outperforms pre-training and fine-tuning approach when utilizing additional multi-speaker data, and 3) long-form speaking style is highly rated regardless of the target text domain.","PeriodicalId":346639,"journal":{"name":"12th ISCA Speech Synthesis Workshop (SSW2023)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133923655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion
Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, H. Saruwatari
We present a training method with linguistic speech regularization that improves the robustness of spontaneous speech synthesis with filled pause (FP) insertion. Spontaneous speech synthesis aims to produce speech with human-like disfluencies such as FPs. Because modeling the complex data distribution of spontaneous speech with a rich FP vocabulary is challenging, the quality of FP-inserted synthetic speech is often limited. To address this issue, our method uses regularization to stabilize the synthesis of the linguistic (i.e., non-FP) speech elements and, to further improve robustness to diverse FP insertions, utilizes pseudo-FPs sampled with an FP word prediction model in addition to ground-truth FPs. Our experiments demonstrate that the proposed method improves the naturalness of synthetic speech with ground-truth and predicted FPs by 0.24 and 0.26, respectively.
{"title":"Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion","authors":"Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, H. Saruwatari","doi":"10.21437/ssw.2023-10","DOIUrl":"https://doi.org/10.21437/ssw.2023-10","url":null,"abstract":"We present a training method with linguistic speech regularization that improves the robustness of spontaneous speech synthesis methods with filled pause (FP) insertion. Spontaneous speech synthesis is aimed at producing speech with human-like disfluencies, such as FPs. Because modeling the complex data distribution of spontaneous speech with a rich FP vocabulary is challenging, the quality of FP-inserted synthetic speech is often limited. To address this issue, we present a method for synthesizing spontaneous speech that improves robustness to diverse FP insertions. Regularization is used to stabilize the synthesis of the linguistic speech (i.e., non-FP) elements. To further improve robustness to diverse FP insertions, it utilizes pseudo-FPs sampled using an FP word prediction model as well as ground-truth FPs. Our experiments demonstrated that the proposed method improves the naturalness of synthetic speech with ground-truth and predicted FPs by 0.24 and 0.26, respectively.","PeriodicalId":346639,"journal":{"name":"12th ISCA Speech Synthesis Workshop (SSW2023)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114779381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}