Title: Joint Source-Filter Optimization for Accurate Vocal Tract Estimation Using Differential Evolution
Authors: O. Schleusing, T. Kinnunen, B. Story, J. Vesin
Journal: IEEE Transactions on Audio, Speech and Language Processing
Publication date: 2013-08-01
DOI: 10.1109/TASL.2013.2255275 (https://doi.org/10.1109/TASL.2013.2255275)
Citations: 14
Abstract
In this work, we present a joint source-filter optimization approach for separating voiced speech into vocal tract (VT) and voice source components. The presented method is pitch-synchronous and thereby exhibits high robustness to vocal jitter, shimmer, and other glottal variations while covering a range of voice qualities. The voice source is modeled using the Liljencrants-Fant (LF) model, which is integrated into a time-varying autoregressive speech production model with exogenous input (ARX). The non-convex problem of finding the optimal model parameters is addressed by a heuristic, evolutionary optimization method called differential evolution. The optimization method is first validated in a series of experiments with synthetic speech. Estimated glottal source and VT parameters are the criteria used for comparison with the iterative adaptive inverse filtering (IAIF) method and the linear prediction (LP) method under varying conditions such as jitter, fundamental frequency (f0), and environmental and glottal noise. The results show that the proposed method substantially reduces the bias and standard deviation of estimated VT coefficients and glottal source parameters. Furthermore, the performance of the source-filter separation is evaluated in experiments using speech generated with a physical model of speech production. The proposed method reliably estimates glottal flow waveforms and lower formant frequencies. Results obtained for higher formant frequencies indicate that research on more accurate voice source models and their interaction with the VT is necessary to improve the source-filter separation. The proposed optimization approach promises to be a useful tool for future research addressing this topic.
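The abstract does not detail the optimizer itself, but the differential evolution scheme it names is a standard population-based heuristic. The sketch below is a minimal, generic DE/rand/1/bin minimizer in Python, not the authors' implementation; the quadratic toy objective stands in for the paper's ARX/LF model-fit error, and all parameter names and bounds here are illustrative assumptions.

```python
import numpy as np

def differential_evolution(objective, bounds, pop_size=20, F=0.8, CR=0.9,
                           max_gens=200, seed=0):
    """Minimal DE/rand/1/bin minimizer (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    dim = len(bounds)
    lo = np.array([b[0] for b in bounds], dtype=float)
    hi = np.array([b[1] for b in bounds], dtype=float)
    # Random initial population within the box constraints.
    pop = lo + rng.random((pop_size, dim)) * (hi - lo)
    fitness = np.array([objective(x) for x in pop])
    for _ in range(max_gens):
        for i in range(pop_size):
            # Mutation: combine three distinct members other than i.
            idx = rng.choice([j for j in range(pop_size) if j != i],
                             size=3, replace=False)
            a, b, c = pop[idx]
            mutant = np.clip(a + F * (b - c), lo, hi)
            # Binomial crossover with at least one mutant component.
            cross = rng.random(dim) < CR
            cross[rng.integers(dim)] = True
            trial = np.where(cross, mutant, pop[i])
            # Greedy selection: keep the trial if it is no worse.
            f_trial = objective(trial)
            if f_trial <= fitness[i]:
                pop[i], fitness[i] = trial, f_trial
    best = int(np.argmin(fitness))
    return pop[best], fitness[best]

# Toy stand-in objective: in the paper's setting this would instead be the
# ARX/LF waveform-matching error of candidate source and VT parameters.
x_best, f_best = differential_evolution(lambda x: float(np.sum(x**2)),
                                        bounds=[(-5.0, 5.0)] * 4)
```

Because DE compares candidates only through the objective value, it handles the non-convex, non-differentiable cost surfaces that arise when glottal-shape and filter parameters are fitted jointly, which is presumably why the authors chose it over gradient-based search.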
About the journal:
The IEEE Transactions on Audio, Speech and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language. In particular, audio processing also covers auditory modeling, acoustic modeling and source separation. Speech processing also covers speech production and perception, adaptation, lexical modeling and speaker recognition. Language processing also covers spoken language understanding, translation, summarization, mining, general language modeling, as well as spoken dialog systems.