HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks
Yingting Li, Rishabh Bhardwaj, Ambuj Mehrish, Bo Cheng, Soujanya Poria
arXiv:2404.04645 (arXiv - CS - Sound), published 2024-04-06
Abstract
Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal from the text domain to the speech domain. While TTS architectures that train and test on the same set of speakers have improved significantly, performance on out-of-domain speakers remains severely limited. Adaptation to a new set of speakers can be achieved by fine-tuning the whole model for each new domain, but this is parameter-inefficient. Adapters offer a parameter-efficient alternative for domain adaptation; however, although popular in NLP, they have so far brought little improvement to speech synthesis. In this work, we present HyperTTS, which comprises a small learnable network, a "hypernetwork", that generates the parameters of the Adapter blocks, allowing us to condition the Adapters on speaker representations and make them dynamic. Extensive evaluations in two domain adaptation settings demonstrate its effectiveness in achieving state-of-the-art performance in the parameter-efficient regime. We also compare different variants of HyperTTS against baselines in several studies. Promising results on the dynamic adaptation of adapter parameters using hypernetworks open up new avenues for domain-generic multi-speaker TTS systems. The audio samples and code are available at https://github.com/declare-lab/HyperTTS.
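
To make the core idea concrete, below is a minimal PyTorch sketch of a speaker-conditional Adapter in the spirit the abstract describes: a small hypernetwork maps a speaker embedding to the weights of a bottleneck Adapter, so the Adapter parameters become dynamic per speaker rather than fixed. This is an illustrative reconstruction, not the authors' implementation; all module names, dimensions, and the MLP hypernetwork design are assumptions (see the official repository for the actual code).

```python
import torch
import torch.nn as nn


class HyperAdapter(nn.Module):
    """Illustrative sketch: a hypernetwork generates the weights of a
    bottleneck Adapter from a speaker embedding, making the Adapter
    speaker-conditional ("dynamic"). Dimensions are hypothetical."""

    def __init__(self, d_model: int, d_bottleneck: int, d_speaker: int):
        super().__init__()
        self.d_model = d_model
        self.d_bottleneck = d_bottleneck
        # Total parameters of one bottleneck Adapter:
        # down-projection W/b plus up-projection W/b.
        n_params = 2 * d_model * d_bottleneck + d_bottleneck + d_model
        # The hypernetwork: a small MLP emitting the flattened Adapter weights.
        self.hyper = nn.Sequential(
            nn.Linear(d_speaker, 256),
            nn.ReLU(),
            nn.Linear(256, n_params),
        )

    def forward(self, h: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        # h:   (batch, time, d_model) hidden states from a frozen TTS backbone
        # spk: (batch, d_speaker)     speaker representation
        flat = self.hyper(spk)
        dm, db = self.d_model, self.d_bottleneck
        i = 0
        w_down = flat[:, i:i + dm * db].view(-1, dm, db); i += dm * db
        b_down = flat[:, i:i + db].unsqueeze(1); i += db
        w_up = flat[:, i:i + db * dm].view(-1, db, dm); i += db * dm
        b_up = flat[:, i:i + dm].unsqueeze(1)
        # Standard bottleneck Adapter with a residual connection, but with
        # per-speaker weights applied via batched matrix multiplication.
        z = torch.relu(torch.bmm(h, w_down) + b_down)
        return h + torch.bmm(z, w_up) + b_up


# Usage sketch: adapt frozen backbone states to a given speaker.
adapter = HyperAdapter(d_model=256, d_bottleneck=32, d_speaker=64)
h = torch.randn(4, 100, 256)   # hidden states (batch, time, d_model)
spk = torch.randn(4, 64)       # speaker embeddings
out = adapter(h, spk)          # (4, 100, 256), speaker-conditioned

```

Under this framing, only the hypernetwork is trained for adaptation: the backbone stays frozen, and a single hypernetwork serves all speakers by producing different Adapter weights per speaker embedding, which is what makes the approach parameter-efficient relative to per-domain fine-tuning.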