AccentBox: Towards High-Fidelity Zero-Shot Accent Generation

arXiv - CS - Sound | Pub Date: 2024-09-13 | DOI: arxiv-2409.09098

Jinzuomu Zhong, Korin Richmond, Zhiba Su, Siqi Sun

Abstract

While recent Zero-Shot Text-to-Speech (ZS-TTS) models have achieved high naturalness and speaker similarity, they fall short in accent fidelity and control. To address this issue, we propose zero-shot accent generation, which unifies Foreign Accent Conversion (FAC), accented TTS, and ZS-TTS through a novel two-stage pipeline. In the first stage, we achieve state-of-the-art (SOTA) results on Accent Identification (AID), with an F1 score of 0.56 on unseen speakers. In the second stage, we condition a ZS-TTS system on the pretrained speaker-agnostic accent embeddings extracted by the AID model. The proposed system achieves higher accent fidelity on inherent/cross accent generation and enables unseen accent generation.
