{"title":"一种基于神经网络的最优非线性融合语音基音检测算法","authors":"Ziba Imani, S. J. Kabudian","doi":"10.1109/KBEI.2019.8734917","DOIUrl":null,"url":null,"abstract":"Fundamental frequency estimation is one of the most important issues in the field of speech processing. An accurate estimate of the fundamental frequency plays a key role in the field of speech and music analysis. So far, various methods have been proposed in the time- and frequency-domain. However, the main challenge is the strong noises in speech signals. In this paper, to improve the accuracy of fundamental frequency estimation, we propose a method for optimal nonlinear combination of fundamental frequency estimation methods, in noisy signals. In this method, to discriminate voiced frames from unvoiced frames in a better way, the Voiced/Unvoiced (V/U) scores of four pitch detection methods are combined with nonlinear fusion. These methods are: Autocorrelation (AC), Yin, YAAPT and SWIPE. After identifying the Voiced/Unvoiced label of each frame, the fundamental frequency (F0) of the frame is estimated using the SWIPE method. The optimal function for nonlinear combination is determined using Multi-Layer Perceptron (MLP) neural network (NN). To evaluate the proposed method, 10 speech files (5 female and 5 male voices) are selected from the PTDB-TUG standard database and the results are presented in terms of GPE, VDE, PTE and FFE standard error criteria. The results indicate that our proposed method relatively reduced the aforementioned criteria (averaged in various SNRs) by 25.06%, 20.92%, 13.94%, and 25.94% respectively, which demonstrate the effectiveness of the proposed method in comparison to state-of-the-art methods.","PeriodicalId":339990,"journal":{"name":"2019 5th Conference on Knowledge Based Engineering and Innovation (KBEI)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"A Neural Network-Based Optimal Nonlinear Fusion of Speech Pitch Detection Algorithms\",\"authors\":\"Ziba Imani, S. J. Kabudian\",\"doi\":\"10.1109/KBEI.2019.8734917\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Fundamental frequency estimation is one of the most important issues in the field of speech processing. An accurate estimate of the fundamental frequency plays a key role in the field of speech and music analysis. So far, various methods have been proposed in the time- and frequency-domain. However, the main challenge is the strong noises in speech signals. In this paper, to improve the accuracy of fundamental frequency estimation, we propose a method for optimal nonlinear combination of fundamental frequency estimation methods, in noisy signals. In this method, to discriminate voiced frames from unvoiced frames in a better way, the Voiced/Unvoiced (V/U) scores of four pitch detection methods are combined with nonlinear fusion. These methods are: Autocorrelation (AC), Yin, YAAPT and SWIPE. After identifying the Voiced/Unvoiced label of each frame, the fundamental frequency (F0) of the frame is estimated using the SWIPE method. The optimal function for nonlinear combination is determined using Multi-Layer Perceptron (MLP) neural network (NN). To evaluate the proposed method, 10 speech files (5 female and 5 male voices) are selected from the PTDB-TUG standard database and the results are presented in terms of GPE, VDE, PTE and FFE standard error criteria. 
The results indicate that our proposed method relatively reduced the aforementioned criteria (averaged in various SNRs) by 25.06%, 20.92%, 13.94%, and 25.94% respectively, which demonstrate the effectiveness of the proposed method in comparison to state-of-the-art methods.\",\"PeriodicalId\":339990,\"journal\":{\"name\":\"2019 5th Conference on Knowledge Based Engineering and Innovation (KBEI)\",\"volume\":\"2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-02-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 5th Conference on Knowledge Based Engineering and Innovation (KBEI)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/KBEI.2019.8734917\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 5th Conference on Knowledge Based Engineering and Innovation (KBEI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/KBEI.2019.8734917","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Fundamental frequency estimation is one of the most important problems in speech processing. An accurate estimate of the fundamental frequency plays a key role in speech and music analysis. Various methods have been proposed in both the time and frequency domains; however, the main challenge remains strong noise in speech signals. In this paper, to improve the accuracy of fundamental frequency estimation in noisy signals, we propose a method for the optimal nonlinear combination of fundamental frequency estimation methods. To better discriminate voiced frames from unvoiced frames, the Voiced/Unvoiced (V/U) scores of four pitch detection methods, Autocorrelation (AC), Yin, YAAPT, and SWIPE, are combined via nonlinear fusion. After the Voiced/Unvoiced label of each frame is determined, the fundamental frequency (F0) of the frame is estimated using the SWIPE method. The optimal function for the nonlinear combination is learned by a Multi-Layer Perceptron (MLP) neural network. To evaluate the proposed method, 10 speech files (5 female and 5 male voices) were selected from the standard PTDB-TUG database, and the results are reported in terms of the standard error criteria GPE, VDE, PTE, and FFE. The results indicate that the proposed method yields relative reductions in these criteria (averaged over various SNRs) of 25.06%, 20.92%, 13.94%, and 25.94%, respectively, which demonstrates its effectiveness in comparison to state-of-the-art methods.
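A minimal sketch of the fusion idea described in the abstract, not the authors' implementation: per-frame V/U scores from the four detectors (AC, Yin, YAAPT, SWIPE) are assumed to be precomputed and stacked into a feature matrix, an MLP learns the nonlinear combination mapping them to a voiced/unvoiced decision, and SWIPE's F0 estimate is kept only for frames classified as voiced. The function names, hidden-layer size, and data layout are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def fuse_vu_scores(train_scores, train_labels, test_scores):
    """Nonlinear V/U fusion via an MLP (hypothetical helper).

    train_scores, test_scores: (n_frames, 4) arrays of per-frame V/U
    scores from AC, Yin, YAAPT, and SWIPE.
    train_labels: ground-truth 0/1 voicing labels for the training frames.
    Returns 0/1 voicing decisions for the test frames.
    """
    # Hidden-layer size is an assumption; the paper does not fix it here.
    mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000)
    mlp.fit(train_scores, train_labels)
    return mlp.predict(test_scores)

def fused_pitch_track(vu_decisions, swipe_f0):
    """Keep SWIPE's F0 for frames judged voiced; set unvoiced frames to 0."""
    return np.where(vu_decisions == 1, swipe_f0, 0.0)
```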