This paper presents Bayesian speech duration modeling and learning for hidden Markov model (HMM) based speech recognition. We focus on the sequential learning of HMM state duration using the quasi-Bayes (QB) estimate. The adapted duration models are robust to nonstationary speaking rates and noise conditions. In this study, the Gaussian, Poisson, and gamma distributions are investigated to characterize the duration models, and the maximum a posteriori (MAP) estimate of the gamma duration model is developed. To enable sequential learning, we adopt the Poisson duration model with a gamma prior density, which belongs to the conjugate prior family. When adaptation data are observed sequentially, the gamma posterior density is produced with twofold advantages: it determines the optimal QB duration parameter, which can be merged into the HMMs for speech recognition, and it provides the updating mechanism of the gamma prior statistics for sequential learning. The EM algorithm is applied to carry out QB parameter estimation, and the overall HMM parameters can be adapted simultaneously. In the experiments, the proposed adaptive duration model improves speech recognition performance on Mandarin broadcast news and noisy connected digits. Batch and sequential learning are investigated for the MAP and QB duration models, respectively.
{"title":"Bayesian learning of speech duration models","authors":"Jen-Tzung Chien, Chih-Hsien Huang","doi":"10.1109/TSA.2003.818114","DOIUrl":"https://doi.org/10.1109/TSA.2003.818114","url":null,"abstract":"This paper presents the Bayesian speech duration modeling and learning for hidden Markov model (HMM) based speech recognition. We focus on the sequential learning of HMM state duration using quasi-Bayes (QB) estimate. The adapted duration models are robust to nonstationary speaking rates and noise conditions. In this study, the Gaussian, Poisson, and gamma distributions are investigated to characterize the duration models. The maximum a posteriori (MAP) estimate of gamma duration model is developed. To exploit the sequential learning, we adopt the Poisson duration model incorporated with gamma prior density, which belongs to the conjugate prior family. When the adaptation data are sequentially observed, the gamma posterior density is produced with twofold advantages. One is to determine the optimal QB duration parameter, which can be merged in HMMs for speech recognition. The other one is to build the updating mechanism of gamma prior statistics for sequential learning. EM algorithm is applied to fulfill QB parameter estimation. The adaptation of overall HMM parameters can be performed simultaneously. In the experiments, the proposed adaptive duration model improves the speech recognition performance of Mandarin broadcast news and noisy connected digits. The batch and sequential learning are respectively investigated for MAP and QB duration models.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"75 1","pages":"558-567"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72710935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we present two efficient strategies for likelihood computation and decoding in a continuous speech recognizer that uses an underlying nonlinear state-space dynamic model for the hidden speech dynamics. The state-space model is constructed specifically for the conversational or casual style of speech, where phonetic reduction abounds. Two decoding algorithms, based on optimal state-sequence estimation for the nonlinear state-space model, are derived, implemented, and evaluated. They overcome the exponential growth of the original search paths by using path-merging approaches derived from Bayes' rule. We have tested and compared the two algorithms on speech data from the Switchboard corpus, confirming their effectiveness. Conversational speech recognition experiments on the Switchboard corpus further demonstrated that the new decoding strategies reduce the recognizer's word error rate compared with two baseline recognizers (an HMM system and the nonlinear state-space model using HMM-produced phonetic boundaries) under identical test conditions.
{"title":"Efficient decoding strategies for conversational speech recognition using a constrained nonlinear state-space model","authors":"Jeff Z. Ma, L. Deng","doi":"10.1109/TSA.2003.818075","DOIUrl":"https://doi.org/10.1109/TSA.2003.818075","url":null,"abstract":"In this paper, we present two efficient strategies for likelihood computation and decoding in a continuous speech recognizer using an underlying nonlinear state-space dynamic model for the hidden speech dynamics. The state-space model has been specially constructed so as to be suitable for the conversational or casual style of speech where phonetic reduction abounds. Two specific decoding algorithms, based on optimal state-sequence estimation for the nonlinear state-space model, are derived, implemented, and evaluated. They successfully overcome the exponential growth in the original search paths by using the path-merging approaches derived from Bayes' rule. We have tested and compared the two algorithms using the speech data from the Switchboard corpus, confirming their effectiveness. Conversational speech recognition experiments using the Switchboard corpus further demonstrated that the use of the new decoding strategies is capable of reducing the recognizer's word error rate compared with two baseline recognizers, including the HMM system and the nonlinear state-space model using the HMM-produced phonetic boundaries, under identical test conditions.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"376 1","pages":"590-602"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74596321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Binaural Cue Coding (BCC) is a method for multichannel spatial rendering based on one down-mixed audio channel and BCC side information. The BCC side information has a low data rate and it is derived from the multichannel encoder input signal. A natural application of BCC is multichannel audio data rate reduction since only a single down-mixed audio channel needs to be transmitted. An alternative BCC scheme for efficient joint transmission of independent source signals supports flexible spatial rendering at the decoder. This paper (Part I) discusses the most relevant binaural perception phenomena exploited by BCC. Based on that, it presents a psychoacoustically motivated approach for designing a BCC analyzer and synthesizer. This leads to a reference implementation for analysis and synthesis of stereophonic audio signals based on a cochlear filter bank. BCC synthesizer implementations based on the FFT are presented as low-complexity alternatives. A subjective audio quality assessment of these implementations shows the robust performance of BCC for critical speech and audio material. Moreover, the results suggest that the performance given by the reference synthesizer is not significantly compromised when using a low-complexity FFT-based synthesizer. The companion paper (Part II) generalizes BCC analysis and synthesis for multichannel audio and proposes complete BCC schemes including quantization and coding. Part II also describes an alternative BCC scheme with flexible rendering capability at the decoder and proposes several applications for both BCC schemes.
{"title":"Binaural cue coding-Part I: psychoacoustic fundamentals and design principles","authors":"F. Baumgarte, C. Faller","doi":"10.1109/TSA.2003.818109","DOIUrl":"https://doi.org/10.1109/TSA.2003.818109","url":null,"abstract":"Binaural Cue Coding (BCC) is a method for multichannel spatial rendering based on one down-mixed audio channel and BCC side information. The BCC side information has a low data rate and it is derived from the multichannel encoder input signal. A natural application of BCC is multichannel audio data rate reduction since only a single down-mixed audio channel needs to be transmitted. An alternative BCC scheme for efficient joint transmission of independent source signals supports flexible spatial rendering at the decoder. This paper (Part I) discusses the most relevant binaural perception phenomena exploited by BCC. Based on that, it presents a psychoacoustically motivated approach for designing a BCC analyzer and synthesizer. This leads to a reference implementation for analysis and synthesis of stereophonic audio signals based on a Cochlear Filter Bank. BCC synthesizer implementations based on the FFT are presented as low-complexity alternatives. A subjective audio quality assessment of these implementations shows the robust performance of BCC for critical speech and audio material. Moreover, the results suggest that the performance given by the reference synthesizer is not significantly compromised when using a low-complexity FFT-based synthesizer. The companion paper (Part II) generalizes BCC analysis and synthesis for multichannel audio and proposes complete BCC schemes including quantization and coding. Part II also describes an alternative BCC scheme with flexible rendering capability at the decoder and proposes several applications for both BCC schemes.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"90 1","pages":"509-519"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87478222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we analyze a two-channel generalized sidelobe canceller with post-filtering in nonstationary noise environments. The post-filtering includes detection of transients at the beamformer output and reference signal, a comparison of their transient power, estimation of the signal presence probability, estimation of the noise spectrum, and spectral enhancement for minimizing the mean-square error of the log-spectra. Transients are detected based on a measure of their local nonstationarity, and classified as desired or interfering based on the transient beam-to-reference ratio. We introduce a transient discrimination quality measure, which quantifies the beamformer's capability to recognize noise transients as distinct from signal transients. Evaluating this measure in various noise fields shows that desired and interfering transients can generally be differentiated within a wide range of frequencies. To further improve the transient noise reduction at low and high frequencies when the signal is wideband, we estimate for each time frame a global likelihood of signal presence. The global likelihood is associated with the transient beam-to-reference ratios at frequencies where the transient discrimination quality is high. Experimental results demonstrate the usefulness of the proposed approach in various car environments.
{"title":"Analysis of two-channel generalized sidelobe canceller (GSC) with post-filtering","authors":"I. Cohen","doi":"10.1109/TSA.2003.818105","DOIUrl":"https://doi.org/10.1109/TSA.2003.818105","url":null,"abstract":"In this paper, we analyze a two-channel generalized sidelobe canceller with post-filtering in nonstationary noise environments. The post-filtering includes detection of transients at the beamformer output and reference signal, a comparison of their transient power, estimation of the signal presence probability, estimation of the noise spectrum, and spectral enhancement for minimizing the mean-square error of the log-spectra. Transients are detected based on a measure of their local nonstationarity, and classified as desired or interfering based on the transient beam-to-reference ratio. We introduce a transient discrimination quality measure, which quantifies the beamformer's capability to recognize noise transients as distinct from signal transients. Evaluating this measure in various noise fields shows that desired and interfering transients can generally be differentiated within a wide range of frequencies. To further improve the transient noise reduction at low and high frequencies in case the signal is wideband, we estimate for each time frame a global likelihood of signal presence. The global likelihood is associated with the transient beam-to-reference ratios in frequencies, where the transient discrimination quality is high. Experimental results demonstrate the usefulness of the proposed approach in various car environments.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"51 1","pages":"684-699"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86447726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper introduces new recognition strategies based on reasoning about results obtained with different Language Models (LMs). Strategies are built following the conjecture that the consensus among the results obtained with different models gives rise to different situations in which hypothesized sentences have different word error rates (WER) and may be further processed with other LMs. New LMs are built by data augmentation using ideas from latent semantic analysis and trigram analogy. Situations are defined by expressing the consensus among the recognition results produced with different LMs and by the amount of unobserved trigrams in the hypothesized sentence. The diagnostic power of the use of observed trigrams or their corresponding class trigrams is compared with that of situations based on values of sentence posterior probabilities. In order to avoid or correct errors due to syntactic inconsistency of the recognized sentence, automata obtained by explanation-based learning are introduced and used under certain conditions. Semantic Classification Trees are introduced to provide sentence patterns expressing constraints of long-distance syntactic coherence. Results on a dialogue corpus provided by France Telecom R&D show that, starting with a WER of 21.87% on a test set of 1422 sentences, the sentences can be subdivided into three sets characterized by automatically recognized situations. The first covers 68% of the sentences with a WER of 7.44%. The second contains various types of sentences with a WER around 20%; these should be processed with particular care by the dialogue interpreter, possibly by asking the user for confirmation. The third contains the 13% of sentences that should be rejected, with a WER around 49%.
{"title":"On the use of linguistic consistency in systems for human-computer dialogues","authors":"Y. Estève, C. Raymond, R. Mori, D. Janiszek","doi":"10.1109/TSA.2003.818318","DOIUrl":"https://doi.org/10.1109/TSA.2003.818318","url":null,"abstract":"This paper introduces new recognition strategies based on reasoning about results obtained with different Language Models (LMs). Strategies are built following the conjecture that the consensus among the results obtained with different models gives rise to different situations in which hypothesized sentences have different word error rates (WER) and may be further processed with other LMs. New LMs are built by data augmentation using ideas from latent semantic analysis and trigram analogy. Situations are defined by expressing the consensus among the recognition results produced with different LMs and by the amount of unobserved trigrams in the hypothesized sentence. The diagnostic power of the use of observed trigrams or their corresponding class trigrams is compared with that of situations based on values of sentence posterior probabilities. In order to avoid or correct errors due to syntactic inconsistence of the recognized sentence, automata, obtained by explanation-based learning, are introduced and used in certain conditions. Semantic Classification Trees are introduced to provide sentence patterns expressing constraints of long distance syntactic coherence. Results on a dialogue corpus provided by France Telecom R&D have shown that starting with a WER of 21.87% on a test set of 1422 sentences, it is possible to subdivide the sentences into three sets characterized by automatically recognized situations. The first one has a coverage of 68% with a WER of 7.44%. The second one has various types of sentences with a WER around 20%. The third one contains 13% of the sentences that should be rejected with a WER around 49%. The second set characterizes sentences that should be processed with particular care by the dialogue interpreter with the possibility of asking a confirmation from the user.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"6 1","pages":"746-756"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76143608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper proposes a cascaded RLS-LMS predictor for lossless audio coding. In the proposed predictor, a high-order LMS predictor is employed to model the abundant tonal and harmonic components of the audio signal for optimal prediction gain. To address the slow convergence of the LMS algorithm with colored inputs, a low-order RLS predictor is cascaded before the LMS predictor to remove the spectral tilt of the audio signal. This cascaded RLS-LMS structure effectively mitigates the slow convergence of the LMS algorithm and provides higher prediction gain than the conventional LMS predictor, resulting in better overall compression performance.
{"title":"Lossless compression of digital audio using cascaded RLS-LMS prediction","authors":"R. Yu, C. Ko","doi":"10.1109/TSA.2003.818111","DOIUrl":"https://doi.org/10.1109/TSA.2003.818111","url":null,"abstract":"This paper proposes a cascaded RLS-LMS predictor for lossless audio coding. In this proposed predictor, a high-order LMS predictor is employed to model the ample tonal and harmonic components of the audio signal for optimal prediction gain performance. To solve the slow convergence problem of the LMS algorithm with colored inputs, a low-order RLS predictor is cascaded prior to the LMS predictor to remove the spectral tilt of the audio signal. This cascaded RLS-LMS structure effectively mitigates the slow convergence problem of the LMS algorithm and provides superior prediction gain performance compared with the conventional LMS predictor, resulting in a better overall compression performance.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"31 1","pages":"532-537"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78952899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper introduces a novel technique for estimating the signal power spectral density to be used in the transfer function of a microphone array post-filter. The technique is a generalization of the existing Zelinski post-filter, which uses the auto- and cross-spectral densities of the array inputs to estimate the signal and noise spectral densities. The Zelinski technique, however, assumes zero cross-correlation between the noise on different sensors. This assumption is inaccurate, particularly at low frequencies and for arrays with closely spaced sensors, and thus the corresponding post-filter is suboptimal in realistic noise conditions. In this paper, a more general expression of the post-filter estimation is developed based on an assumed knowledge of the complex coherence of the noise field. This general expression can be used to construct a more appropriate post-filter in a variety of different noise fields. In experiments using real noise recordings from a computer office, the modified post-filter results in significant improvement in terms of objective speech quality measures and speech recognition performance using a diffuse noise model.
{"title":"Microphone array post-filter based on noise field coherence","authors":"I. McCowan, H. Bourlard","doi":"10.1109/TSA.2003.818212","DOIUrl":"https://doi.org/10.1109/TSA.2003.818212","url":null,"abstract":"This paper introduces a novel technique for estimating the signal power spectral density to be used in the transfer function of a microphone array post-filter. The technique is a generalization of the existing Zelinski post-filter, which uses the auto- and cross-spectral densities of the array inputs to estimate the signal and noise spectral densities. The Zelinski technique, however, assumes zero cross-correlation between the noise on different sensors. This assumption is inaccurate, particularly at low frequencies and for arrays with closely spaced sensors, and thus the corresponding post-filter is suboptimal in realistic noise conditions. In this paper, a more general expression of the post-filter estimation is developed based on an assumed knowledge of the complex coherence of the noise field. This general expression can be used to construct a more appropriate post-filter in a variety of different noise fields. In experiments using real noise recordings from a computer office, the modified post-filter results in significant improvement in terms of objective speech quality measures and speech recognition performance using a diffuse noise model.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"48 1","pages":"709-716"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85464928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A method for computing linear digital filter networks containing delay-free loops is proposed. Unlike existing techniques, the proposed method does not require rearranging the network structure; instead, it uses matrices that describe this structure and specify the connections between the filter blocks forming the network. The method is therefore particularly efficient when the filter blocks are densely interconnected. The triangular waveguide mesh is an example of such a "dense" filter network: using the proposed method, we can compute a transformed, delay-free version of the mesh, obtaining simulations that are significantly more accurate than those provided by the traditional, explicitly computable formulation of the triangular mesh.
{"title":"Computation of linear filter networks containing delay-free loops, with an application to the waveguide mesh","authors":"Federico Fontana","doi":"10.1109/TSA.2003.818033","DOIUrl":"https://doi.org/10.1109/TSA.2003.818033","url":null,"abstract":"A method that computes linear digital filter networks containing delay-free loops is proposed. Compared to existing techniques the proposed method does not require a rearrangement of the network structure, conversely it makes use of matrices describing this structure and specifying the connections between the filter blocks forming the network. For this reason the efficiency of the method becomes interesting when the filter blocks are densely interconnected. The Triangular Waveguide Mesh is an example of \"dense\" filter network: Using the proposed method we can compute a transformed, delay-free version of this mesh, obtaining simulations that are significantly more accurate compared to those provided by the traditional, explicitly computable formulation of the triangular mesh.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"42 1","pages":"774-782"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89066728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The autocorrelation method of linear prediction (LP) analysis relies on a window for data extraction. We propose a gradient-descent approach to optimizing this window. It is shown that the optimized window can enhance the performance of LP-based speech coding algorithms; in most instances, the improvement comes at no additional computational cost, since it merely requires replacing the window.
{"title":"Window optimization in linear prediction analysis","authors":"W. Chu","doi":"10.1109/TSA.2003.818213","DOIUrl":"https://doi.org/10.1109/TSA.2003.818213","url":null,"abstract":"The autocorrelation method of linear prediction (LP) analysis relies on a window for data extraction. We propose an approach to optimize the window which is based on gradient-descent. It is shown that the optimized window can enhance the performance of LP-based speech coding algorithms; in most instances, improvement in performance comes at no additional computational cost, since it merely requires a window replacement.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"24 1","pages":"626-635"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87186278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper describes a nonlinear acoustic echo cancellation algorithm focused mainly on loudspeaker distortion. The proposed system is composed of two distinct modules in a cascaded structure: a nonlinear module based on polynomial Volterra filters models the loudspeaker, and a second module of standard linear filtering identifies the impulse response of the acoustic path. The overall system model is tracked by a modified normalized least-mean-square algorithm, for which the update equations are derived. Stability conditions are given, and particular attention is paid to the transient behavior of the cascaded filters. Finally, results on real data recorded with Alcatel GSM equipment are presented.
{"title":"Nonlinear acoustic echo cancellation based on Volterra filters","authors":"A. Guérin, G. Faucon, R. Bouquin-Jeannès","doi":"10.1109/TSA.2003.818077","DOIUrl":"https://doi.org/10.1109/TSA.2003.818077","url":null,"abstract":"This paper describes a nonlinear acoustic echo cancellation algorithm, mainly focused on loudspeaker distortions. The proposed system is composed of two distinct modules organized in a cascaded structure: a nonlinear module based on polynomial Volterra filters models the loudspeaker, and a second module of standard linear filtering identifies the impulse response of the acoustic path. The tracking of the overall system model is achieved by a modified normalized-least mean square algorithm for which equations are derived. Stability conditions are given, and particular attention is placed on the transient behavior of cascaded filters. Finally, results of real data recorded with Alcatel GSM material are presented.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"14 1","pages":"672-683"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73131099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}