Comprehensive Source-Target Speaker Voice Conversion Analysis

He Pan, Yangjie Wei, Nan Guan, Yi Wang
DOI: 10.5013/ijssst.a.15.06.05 (https://doi.org/10.5013/ijssst.a.15.06.05)
Journal: International journal of simulation: systems, science & technology (ISSN: 1473-804x online, 1473-8031 print)
Publication date: 2020-12-30
Publication type: Journal Article
Citation count: 0

Abstract

A voice conversion system modifies a speaker’s voice so that it is perceived as having been uttered by another speaker, and it is now widely used in real applications. However, most research focuses on only one aspect of voice conversion performance; theoretical analysis and experimental comparison of the whole source-target speaker voice conversion process are rare. Therefore, this paper conducts a comprehensive analysis of source-target speaker voice conversion based on three key steps (acoustic feature selection and extraction, voice conversion model construction, and target speech synthesis) and proposes a complete and optimal source-target speaker voice conversion process. First, a simple and direct serial feature fusion form consisting of prosodic features, spectrum parameters, and spectral envelope characteristics is proposed. Then, to avoid discontinuity and spectrum distortion in the converted speech, D_GMM (Dynamic Gaussian Mixture Model), which considers dynamic information between frames, is presented. Subsequently, for speech synthesis, the STRAIGHT synthesizer is modified to work with the feature combination. Finally, an objective comparison experiment shows that the new source-target voice conversion process achieves better performance than conventional methods. In addition, both objective evaluation (a speaker recognition system) and subjective evaluation are used to assess the quality of the converted speech, and the experimental results show that the converted speech has higher target speaker individuality and speech quality.

Keywords: Voice Conversion, Serial Feature Fusion, D_GMM, STRAIGHT Synthesis, Speaker Recognition.

I. INTRODUCTION

Source-target speaker voice conversion is a technique that modifies a source speaker’s speech so that it sounds as if it were uttered by a target speaker, without changing the speech content [1]. Over the last two decades it has attracted much attention because of its wide range of potential applications, such as dubbing in films, restoring damaged voices, and disguising a personal voice [2]. Normally, source-target speaker voice conversion consists of three key steps: (1) selection and extraction of representative acoustic features; (2) construction of the voice conversion model; (3) synthesis of the target speech.

First, acoustic feature selection and extraction means selecting the acoustic features that can represent a speaker’s individual identity and extracting them correctly. Many researchers have shown that prosodic features, formant frequency, and spectral parameters are the most important features used in real applications [3-4]. However, most studies use only one or two of them to represent a speaker’s individual identity [5-6], so the acoustic features covered are incomplete and the final synthesized speech is inaccurate.

Second, a voice conversion model consists of a sequence of mapping rules between the source speaker and the target speaker. The Gaussian mixture model (GMM), which converts acoustic parameters based on the minimum mean square error, is one of the most widely used models in voice conversion research [5-7] because of its statistical framework and stable performance. But the GMM also has limitations. For example, GMM conversion operates on single frames, so the sequential information between frames is ignored, which may lead to spectrum distortion and deterioration of speech quality.

Finally, there are many methods of speech synthesis [8], among which speech transformation and representation using adaptive interpolation of weighted spectrum, or STRAIGHT, is a comparatively mature algorithm [9]. STRAIGHT can decompose a speech signal into independent spectral parameters and F0 parameters, and during synthesis it can flexibly modify duration, F0, and spectral parameters. In addition, it does not cause obvious deterioration of speech quality.

Although source-target speaker voice conversion methods are widely used in real applications, most research focuses on the performance of one aspect, such as the conversion model; theoretical analysis and experimental comparison of the whole source-target speaker voice conversion process are rare. Therefore, in this paper, a comprehensive analysis of source-target speaker voice conversion is conducted based on the three key steps above, a sequence of evaluation results based on theoretical analysis and experiments is obtained, and a complete and optimal source-target speaker voice conversion process is proposed.

This paper is organized as follows. Section II introduces the basic principle of voice conversion, including feature extraction, the GMM model, and speech synthesis. Section III proposes a comprehensive and improved source-target speaker voice conversion process, as well as the voice conversion evaluation criteria. Section IV presents the experimental results and evaluations. Section V concludes the paper.

II. BASIC VOICE CONVERSION PROCESS

A. Acoustic Feature Extraction

Psychophysical studies have shown that human perception of the frequency content of speech signals does not follow a linear scale. Thus, for each tone with an actual frequency f, in Hz, a subjective pitch is measured on a scale called the ‘Mel’ scale.
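The relation between frequency in Hz and subjective pitch in mel can be illustrated with a short sketch. The formula used by the paper is not present in this excerpt, so the code below assumes the commonly used form mel(f) = 2595 log10(1 + f / 700); the constants and the function names `hz_to_mel` and `mel_to_hz` are illustrative assumptions, not taken from the source.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Map a frequency in Hz to the mel scale.

    Assumption: uses the common 2595 * log10(1 + f / 700) form,
    which may differ from the variant used in the paper.
    """
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel under the same assumed formula."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


if __name__ == "__main__":
    # Equal steps in Hz shrink in mel as frequency rises,
    # reflecting the non-linear perception described above.
    for f in (100.0, 1000.0, 4000.0, 8000.0):
        print(f"{f:7.1f} Hz -> {hz_to_mel(f):8.2f} mel")
```

Under this assumed formula, 1000 Hz maps to very nearly 1000 mel, and a 1000 Hz step at high frequencies spans fewer mels than the same step at low frequencies.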