Zero-shot voice conversion based on feature disentanglement

IF 2.4 | Zone 3 (Computer Science) | Q2 ACOUSTICS | Speech Communication | Pub Date: 2024-09-27 | DOI: 10.1016/j.specom.2024.103143

Na Guo, Jianguo Wei, Yongwei Li, Wenhuan Lu, Jianhua Tao

Speech Communication, vol. 165, Article 103143. Full text: https://www.sciencedirect.com/science/article/pii/S0167639324001146

Citations: 0

Abstract

Voice conversion (VC) aims to convert the voice of a source speaker into that of a target speaker without modifying the linguistic content. Zero-shot voice conversion has attracted significant attention within VC because it can convert speech for speakers who did not appear during training. Despite the significant progress made by previous zero-shot VC methods, there is still room for improvement in separating speaker information from content information. In this paper, we propose a zero-shot VC method based on feature disentanglement. The proposed model uses a speaker encoder to extract speaker embeddings, introduces mixed speaker layer normalization to eliminate residual speaker information in the content encoding, and employs adaptive attention weight normalization for conversion. Furthermore, dynamic convolution is introduced to improve speech content modeling while requiring only a small number of parameters. Experiments show that the proposed model outperforms several state-of-the-art models, achieving both high similarity to the target speaker and high intelligibility. In addition, our model decodes much faster than existing state-of-the-art models.
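As a rough illustration of the normalization idea named in the abstract, the sketch below shows a generic speaker-conditioned layer normalization in plain Python. The abstract does not give the paper's formulation, so everything here is an assumption rather than the authors' method: the function names, the linear interpolation used to "mix" two speaker embeddings, and the linear projections that produce the scale and shift are all hypothetical.

```python
import math

def layer_norm(frame, eps=1e-5):
    """Normalize one feature frame to zero mean and (near) unit variance."""
    mean = sum(frame) / len(frame)
    var = sum((x - mean) ** 2 for x in frame) / len(frame)
    return [(x - mean) / math.sqrt(var + eps) for x in frame]

def mix_embeddings(emb_a, emb_b, lam):
    """Interpolate two speaker embeddings (hypothetical mixing rule)."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(emb_a, emb_b)]

def project(emb, weights, bias):
    """Linear projection of an embedding; weights is a list of rows."""
    return [sum(w * e for w, e in zip(row, emb)) + b
            for row, b in zip(weights, bias)]

def speaker_conditioned_layer_norm(frame, emb, w_gamma, b_gamma, w_beta, b_beta):
    """Normalize the frame, then scale and shift it with values derived
    from the speaker embedding (assumed conditioning scheme)."""
    normed = layer_norm(frame)
    gamma = project(emb, w_gamma, b_gamma)
    beta = project(emb, w_beta, b_beta)
    return [g * x + b for g, x, b in zip(gamma, normed, beta)]

# Toy usage: a 4-dimensional content frame and 2-dimensional embeddings.
frame = [1.0, 2.0, 3.0, 4.0]
mixed = mix_embeddings([1.0, 0.0], [0.0, 1.0], 0.5)

# Zero projection weights give gamma = 1 and beta = 0, so the output
# reduces to plain layer normalization of the frame.
zero_rows = [[0.0, 0.0] for _ in range(4)]
out = speaker_conditioned_layer_norm(
    frame, mixed, zero_rows, [1.0] * 4, zero_rows, [0.0] * 4
)
```

In a trained model the projection weights would be learned, and the normalization would be applied across all frames of an utterance. The toy values here only show the data flow.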


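Source journal

Speech Communication (Engineering & Technology: Computer Science, Interdisciplinary Applications)

CiteScore

6.80

Self-citation rate

6.20%

Publication count

Review time

19.2 weeks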

About the journal: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.