HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks

Yingting Li, Rishabh Bhardwaj, Ambuj Mehrish, Bo Cheng, Soujanya Poria
{"title":"HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks","authors":"Yingting Li, Rishabh Bhardwaj, Ambuj Mehrish, Bo Cheng, Soujanya Poria","doi":"arxiv-2404.04645","DOIUrl":null,"url":null,"abstract":"Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal\nfrom the text domain to the speech domain. While developing TTS architectures\nthat train and test on the same set of speakers has seen significant\nimprovements, out-of-domain speaker performance still faces enormous\nlimitations. Domain adaptation on a new set of speakers can be achieved by\nfine-tuning the whole model for each new domain, thus making it\nparameter-inefficient. This problem can be solved by Adapters that provide a\nparameter-efficient alternative to domain adaptation. Although famous in NLP,\nspeech synthesis has not seen much improvement from Adapters. In this work, we\npresent HyperTTS, which comprises a small learnable network, \"hypernetwork\",\nthat generates parameters of the Adapter blocks, allowing us to condition\nAdapters on speaker representations and making them dynamic. Extensive\nevaluations of two domain adaptation settings demonstrate its effectiveness in\nachieving state-of-the-art performance in the parameter-efficient regime. We\nalso compare different variants of HyperTTS, comparing them with baselines in\ndifferent studies. Promising results on the dynamic adaptation of adapter\nparameters using hypernetworks open up new avenues for domain-generic\nmulti-speaker TTS systems. The audio samples and code are available at\nhttps://github.com/declare-lab/HyperTTS.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2404.04645","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal from the text domain to the speech domain. While TTS architectures that train and test on the same set of speakers have improved significantly, performance on out-of-domain speakers remains severely limited. Domain adaptation to a new set of speakers can be achieved by fine-tuning the whole model for each new domain, which is parameter-inefficient. Adapters offer a parameter-efficient alternative for domain adaptation; although they are popular in NLP, speech synthesis has so far seen little benefit from them. In this work, we present HyperTTS, which uses a small learnable network, a "hypernetwork", to generate the parameters of the Adapter blocks, allowing us to condition the Adapters on speaker representations and make them dynamic. Extensive evaluations in two domain adaptation settings demonstrate its effectiveness in achieving state-of-the-art performance in the parameter-efficient regime. We also compare different variants of HyperTTS against baselines across several studies. Promising results on the dynamic adaptation of adapter parameters using hypernetworks open up new avenues for domain-generic multi-speaker TTS systems. The audio samples and code are available at https://github.com/declare-lab/HyperTTS.
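To make the core idea concrete, the sketch below shows, in PyTorch, one way a hypernetwork can generate the weights of a bottleneck adapter from a speaker embedding, so the adapter's transformation changes per speaker while only the shared hypernetwork is trained. This is a minimal illustration under assumed dimensions and names (HyperAdapter, d_speaker, the 256-unit hidden layer), not the actual implementation in the HyperTTS repository.

```python
import torch
import torch.nn as nn


class HyperAdapter(nn.Module):
    """Bottleneck adapter whose weights are produced by a hypernetwork
    conditioned on a speaker embedding (illustrative sketch only)."""

    def __init__(self, d_model: int, bottleneck: int, d_speaker: int):
        super().__init__()
        self.d_model = d_model
        self.bottleneck = bottleneck
        # Total number of adapter parameters to generate:
        # down-projection + up-projection weights, plus their biases.
        n_params = 2 * d_model * bottleneck + bottleneck + d_model
        # Hypernetwork: speaker embedding -> flattened adapter parameters.
        self.hypernet = nn.Sequential(
            nn.Linear(d_speaker, 256),
            nn.ReLU(),
            nn.Linear(256, n_params),
        )

    def forward(self, hidden: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, time, d_model); speaker_emb: (batch, d_speaker)
        params = self.hypernet(speaker_emb)          # (batch, n_params)
        d, b = self.d_model, self.bottleneck
        i = 0
        w_down = params[:, i:i + d * b].view(-1, d, b); i += d * b
        b_down = params[:, i:i + b].unsqueeze(1);       i += b
        w_up   = params[:, i:i + b * d].view(-1, b, d); i += b * d
        b_up   = params[:, i:i + d].unsqueeze(1)
        # Per-speaker bottleneck transformation with a residual connection.
        x = torch.relu(torch.bmm(hidden, w_down) + b_down)
        return hidden + torch.bmm(x, w_up) + b_up


# Usage: condition the adapter on a 192-dim speaker vector (dimensions assumed).
adapter = HyperAdapter(d_model=256, bottleneck=32, d_speaker=192)
h = torch.randn(4, 100, 256)   # backbone hidden states
spk = torch.randn(4, 192)      # speaker embeddings
out = adapter(h, spk)          # (4, 100, 256)
```

The design point illustrated here is that the adapter weights are an output of the hypernetwork rather than trainable parameters themselves, so the same small hypernetwork yields a different, speaker-conditioned adapter for each input speaker.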