{"title":"综合源-目标说话人语音转换分析","authors":"He Pan, Yangjie Wei, Nan Guan, Yi Wang","doi":"10.5013/ijssst.a.15.06.05","DOIUrl":null,"url":null,"abstract":"Voice conversion system modifies a speaker’s voice to be perceived as another speaker uttered, and now it is widely used in many real applications. However, most research only focuses on one aspect performance of voice conversion system, rare theoretical analysis and experimental comparison on the whole source-target speaker voice conversion process has been introduced. Therefore, in this paper, a comprehensive analysis on source-target speaker voice conversion is conducted based on three key steps, including acoustic features selection and extraction, voice conversion model construction, and target speech synthesis, and a complete and optimal source-target speaker voice conversion is proposed. First, a simple and direct serial feature fusion form consisting of prosodic feature, spectrum parameter and spectral envelope characteristic, is proposed. Then, to void the discontinuity and spectrum distortion of a converted speech, D_GMM (Dynamic Gaussian Mixture Model) considering dynamic information between frames is presented. Subsequently, for speech synthesis, STRAIGHT algorithm synthesizer with feature combination is modified. Finally, the objective contrast experiment shows that our new source-target voice conversion process achieves better performance than the conventional methods. In addition, both objective evaluation (speaker recognition system) and subjective evaluation are used to evaluate the quality of converted speech, and experimental result shows that the converted speech has higher target speaker individuality and speech quality. Keywords-Voice Conversion, Serial Feature Fusion, D_GMM, STRAIGHT Synthesis, Speaker Recognition. I.! INTRODUCTION Source-target speaker voice conversion is a technique that modifies a source speaker’s speech to make it sound like that uttered by a target speaker without changing the speech content [1]. In the last two decades, much attention has been attracted to it due to its wide potential application areas, such as dubbing in films, restoring damaged voice and disguising personal voice [2]. Normally, source-target speaker voice conversion consists of three key steps: (1) selection and extraction of representative acoustic features; (2) construction of voice conversion model; (3) synthesis of the target speech. Firstly, acoustic features selection and extraction is to select acoustic features those can represent a speaker’s individual identity, and extract them correctly. Many researchers have proved that prosodic features, formant frequency and spectral parameters are the most important features used in real applications [3-4]. However, most study only focus on one or two of them to represent a speaker’s individual identity [5-6], that causing the covered acoustic features to be incomprehensive and the final synthesized speech to be inaccurate. Secondly, a voice conversion model consists of a sequence of mapping rules between the source speaker and the target speaker. Gaussian mixture model, which converts acoustic parameters based on the minimum mean square error, is one of the most widely used models in voice conversion research [5-7] because of its statistical framework and stable performance. But GMM's usefulness also has some limitations. For example, GMM is on the base of single conversion frame, where the sequential information between frames is ignored. 
Finally, among the many speech synthesis methods [8], speech transformation and representation using adaptive interpolation of weighted spectrum (STRAIGHT) is a comparatively mature algorithm [9]. STRAIGHT decomposes the speech signal into independent spectral parameters and F0 parameters, and during synthesis it can modify duration, F0 and the spectral parameters flexibly without causing obvious deterioration of speech quality.

Although source-target speaker voice conversion methods are widely used in real applications, most research focuses on the performance of a single aspect, such as the conversion model; little theoretical analysis or experimental comparison of the whole source-target speaker voice conversion process has been reported. Therefore, in this paper a comprehensive analysis of source-target speaker voice conversion is conducted around the three key steps above, a series of evaluation results based on theoretical analysis and experiments is obtained, and a complete, optimized source-target speaker voice conversion process is proposed.

This paper is organized as follows. Section II introduces the basic principles of voice conversion, including feature extraction, the GMM model and speech synthesis. Section III presents the comprehensive, improved source-target speaker voice conversion process, together with the voice conversion evaluation criteria. Section IV presents the experimental results and evaluations. Section V concludes the paper.

II. BASIC VOICE CONVERSION PROCESS

A. Acoustic Feature Extraction

Psychophysical studies have shown that human perception of the frequency content of speech signals does not follow a linear scale. Thus, for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the mel scale.
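The excerpt breaks off before reproducing its equation (1). A commonly used form of the Hz-to-mel mapping, given here for reference rather than as the paper's exact definition, is

$$ m_{\mathrm{mel}} = 2595 \,\log_{10}\!\left(1 + \frac{f}{700}\right) $$

where f is the frequency in Hz and m_mel is the corresponding subjective pitch in mels.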