U-Style: Cascading U-Nets With Multi-Level Speaker and Style Modeling for Zero-Shot Voice Cloning

IF 4.1 | Zone 2, Computer Science | JCR Q1, Acoustics | IEEE/ACM Transactions on Audio, Speech, and Language Processing | Pub Date: 2024-09-06 | DOI: 10.1109/TASLP.2024.3453606

Tao Li;Zhichao Wang;Xinfa Zhu;Jian Cong;Qiao Tian;Yuping Wang;Lei Xie

IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 4026-4035, 2024. https://ieeexplore.ieee.org/document/10669040/

Citations: 0

Abstract

Zero-shot speaker cloning aims to synthesize speech for any target speaker unseen during TTS system building, given only a single speech reference of the speaker at hand. Although more practical in real applications, the current zero-shot methods still produce speech with undesirable naturalness and speaker similarity. Moreover, endowing the target speaker with arbitrary speaking styles in the zero-shot setup has not been considered. This is because the unique challenge of zero-shot speaker and style cloning is to learn the disentangled speaker and style representations from only short references representing an arbitrary speaker and an arbitrary style. To address this challenge, we propose U-Style, which employs Grad-TTS as the backbone, particularly cascading a speaker-specific encoder and a style-specific encoder between the text encoder and the diffusion decoder. Thus, leveraging signal perturbation, U-Style is explicitly decomposed into speaker- and style-specific modeling parts, achieving better speaker and style disentanglement. To improve unseen speaker and style modeling ability, these two encoders conduct multi-level speaker and style modeling by skip-connected U-nets, incorporating the representation extraction and information reconstruction process. Besides, to improve the naturalness of synthetic speech, we adopt mean-based instance normalization and style adaptive layer normalization in these encoders to perform representation extraction and condition adaptation, respectively. Experiments show that U-Style significantly surpasses the state-of-the-art methods in unseen speaker cloning regarding naturalness and speaker similarity. Notably, U-Style can transfer the style from an unseen source speaker to another unseen target speaker, achieving flexible combinations of desired speaker timbre and style in zero-shot voice cloning.
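The two normalization operations named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that mean-based instance normalization subtracts each channel's mean over time and returns that mean as the extracted representation, and that style adaptive layer normalization applies a gain and bias (here passed in directly, hypothetically predicted from a style embedding) after ordinary layer normalization. The function names and data layout are illustrative only.

```python
import math


def mean_instance_norm(x):
    """Mean-only instance normalization (assumed formulation).

    x is a list of channels, each a list of frame values.
    Returns the mean-removed features and the per-channel means,
    which act as the extracted representation in this sketch.
    """
    means = [sum(channel) / len(channel) for channel in x]
    normalized = [[v - m for v in channel] for channel, m in zip(x, means)]
    return normalized, means


def style_adaptive_layer_norm(frame, gain, bias, eps=1e-5):
    """Layer-normalize one frame across channels, then scale and shift it.

    gain and bias are assumed to come from a style embedding
    (hypothetical here; no predictor network is modeled).
    """
    mu = sum(frame) / len(frame)
    var = sum((v - mu) ** 2 for v in frame) / len(frame)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mu) * inv_std + b for v, g, b in zip(frame, gain, bias)]


# Two channels, three frames each.
features = [[1.0, 2.0, 3.0], [10.0, 10.0, 13.0]]
normalized, means = mean_instance_norm(features)

# One frame with two channels, conditioned by an assumed gain and bias.
adapted = style_adaptive_layer_norm([1.0, 3.0], gain=[2.0, 2.0], bias=[0.5, 0.5])
```

In this sketch the extraction step separates a time-invariant summary (the means) from the residual content, while the adaptation step reinjects a condition through the gain and bias; how U-Style arranges these inside its U-nets is not specified in the abstract.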

Source journal

IEEE/ACM Transactions on Audio, Speech, and Language Processing (Acoustics; Engineering, Electrical & Electronic)

CiteScore: 11.30

Self-citation rate: 11.10%

Articles published: 217

Journal description: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.

Latest articles in this journal

List of Reviewers
IPDnet: A Universal Direct-Path IPD Estimation Network for Sound Source Localization
MO-Transformer: Extract High-Level Relationship Between Words for Neural Machine Translation
Online Neural Speaker Diarization With Target Speaker Tracking
Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach