Presented is a new coding paradigm, multimode transform predictive coding (MTPC), which combines speech and audio coding principles in a single coding structure. The paradigm is adaptive: it automatically adjusts how different coding modules are used based on the input signal. This allows MTPC coders to robustly handle a wider range of signals than single-configuration (single-mode) transform predictive coding (TPC) designs. A wideband MTPC coder design targeting two-way communication applications and bitrates from 13 to 40 kbit/s is also presented. Subjective absolute category rating test results on speech, speech in noise and music demonstrate that the performance at 16, 24 and 32 kbit/s meets or exceeds that of ITU-T Rec. G.722 at 48, 56 and 64 kbit/s respectively for many coding conditions. Subjective Reference-ABx (R-ABx) tests are also included to show the potential advantages of the multimode coder over a single-mode TPC coder. Finally, possible improvements in the MTPC coder design for applications such as broadcasting, which are less sensitive to delay and encoder complexity, are discussed.
{"title":"The multimode transform predictive coding paradigm","authors":"S. Ramprashad","doi":"10.1109/TSA.2003.809195","DOIUrl":"https://doi.org/10.1109/TSA.2003.809195","url":null,"abstract":"Presented is a new coding paradigm, multimode transform predictive coding (MTPC), which combines speech and audio coding principles in a single coding structure. The paradigm is an adaptive coding paradigm which automatically adjusts how different coding modules are used based on the input signal. This allows MTPC coders to robustly handle a wider range of signals than single configuration (mode) transform predictive coding (TPC) designs. A wideband MTPC coder design targeting two-way communication applications and bitrates from 13 to 40 kbit/s is also presented. Subjective absolute category rating test results on speech, speech in noise and music demonstrate that the performance at 16, 24 and 32 kbit/s meets or exceeds that of ITU-T Rec. G.722 at 48, 56 and 64 kbit/s respectively for many coding conditions. Subjective Reference-ABx (R-ABx) tests are also included to show the potential advantages of the multimode coder over a single mode TPC coder. Finally, possible improvements in the MTPC coder design for applications such as broadcasting, which are less sensitive to delay and encoder complexity, are discussed.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"69 1","pages":"117-129"},"PeriodicalIF":0.0,"publicationDate":"2003-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80283918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
For high-quality acoustic echo cancellation, long echoes have to be suppressed. Classical LMS-based adaptive filters are not attractive as they are suboptimal from a computational point of view. Multirate adaptive filters such as the partitioned block frequency-domain adaptive filter (PBFDAF) are good alternatives and are widely used in commercial echo cancellers nowadays. In this paper the PBFDRAP is analyzed, which combines frequency-domain adaptive filtering with so-called "row action projection." Fast versions of the algorithm are derived and it is shown that the PBFDRAP outperforms the PBFDAF in a realistic echo cancellation setup.
{"title":"Iterated partitioned block frequency-domain adaptive filtering for acoustic echo cancellation","authors":"K. Eneman, M. Moonen","doi":"10.1109/TSA.2003.809194","DOIUrl":"https://doi.org/10.1109/TSA.2003.809194","url":null,"abstract":"For high quality acoustic echo cancellation long echoes have to be suppressed. classical LMS-based adaptive filters are not attractive as they are suboptimal from a computational point of view. Multirate adaptive filters such as the partitioned block frequency-domain adaptive filter (PBFDAF) are good alternatives and are widely used in commercial echo cancellers nowadays. In this paper the PBFDRAP is analyzed, which combines frequency-domain adaptive filtering with so-called \"row action projection.\" Fast versions of the algorithm are derived and it is shown that the PBFDRAP outperforms the PBFDAF in a realistic echo cancellation setup.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"9 1","pages":"143-158"},"PeriodicalIF":0.0,"publicationDate":"2003-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79552227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
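The abstract above contrasts frequency-domain schemes with classical LMS-type filters. As a point of reference only, the following is a minimal sketch of a time-domain normalized LMS (NLMS) echo canceller; it is not the PBFDAF or PBFDRAP from the paper. The function name, step size and toy echo path are illustrative assumptions. Each sample costs on the order of `n_taps` multiplications, which is what makes long echo paths expensive in this form.

```python
import random

def nlms_echo_canceller(far_end, microphone, n_taps, step=0.5, eps=1e-8):
    """Time-domain NLMS: adapt an FIR estimate of the echo path sample by sample.

    Returns the final filter weights and the residual (echo-cancelled) signal.
    """
    weights = [0.0] * n_taps
    buffer = [0.0] * n_taps          # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, microphone):
        buffer = [x] + buffer[:-1]
        echo_estimate = sum(w * b for w, b in zip(weights, buffer))
        e = d - echo_estimate        # what remains after subtracting the estimate
        energy = sum(b * b for b in buffer) + eps
        weights = [w + step * e * b / energy for w, b in zip(weights, buffer)]
        residual.append(e)
    return weights, residual

# Toy usage: a hypothetical 3-tap echo path driven by white noise.
random.seed(0)
echo_path = [0.5, -0.3, 0.1]
far_end = [random.gauss(0.0, 1.0) for _ in range(2000)]
microphone = [
    sum(h * far_end[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
    for n in range(len(far_end))
]
weights, residual = nlms_echo_canceller(far_end, microphone, n_taps=4)
print(weights)
```

Block frequency-domain filters reduce this per-sample cost by processing blocks of samples with FFTs, which is the motivation the abstract gives for the PBFDAF family.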
Chulhee Lee, Donghoon Hyun, E. Choi, Jinwook Go, Chungyong Lee
We propose a method to minimize the loss of information during the feature extraction stage in speech recognition by optimizing the parameters of the mel-cepstrum transformation, a transform which is widely used in speech recognition. Typically, the mel-cepstrum is obtained by critical band filters whose characteristics play an important role in converting a speech signal into a sequence of vectors. First, we analyze the performance of the mel-cepstrum by changing the parameters of the filters such as shape, center frequency, and bandwidth. Then we propose an algorithm to optimize the parameters of the filters using the simplex method. Experiments with Korean digit words show that the recognition rate improved by about 4-7%.
{"title":"Optimizing feature extraction for speech recognition","authors":"Chulhee Lee, Donghoon Hyun, E. Choi, Jinwook Go, Chungyong Lee","doi":"10.1109/TSA.2002.805644","DOIUrl":"https://doi.org/10.1109/TSA.2002.805644","url":null,"abstract":"We propose a method to minimize the loss of information during the feature extraction stage in speech recognition by optimizing the parameters of the mel-cepstrum transformation, a transform which is widely used in speech recognition. Typically, the mel-cepstrum is obtained by critical band filters whose characteristics play an important role in converting a speech signal into a sequence of vectors. First, we analyze the performance of the mel-cepstrum by changing the parameters of the filters such as shape, center frequency, and bandwidth. Then we propose an algorithm to optimize the parameters of the filters using the simplex method. Experiments with Korean digit words show that the recognition rate improved by about 4-7%.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"45 1","pages":"80-87"},"PeriodicalIF":0.0,"publicationDate":"2003-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86923435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
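To make the tunable quantities in the abstract above concrete, here is a minimal sketch of a mel-cepstrum front end in which each filter's center frequency and bandwidth are explicit parameters. Everything in it is an assumption for illustration: the symmetric triangular shape, the common 2595/700 mel formula, the default of 20 filters, and all function names. It does not reproduce the authors' filters or their simplex-based optimization.

```python
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def default_parameters(n_filters=20, f_low=0.0, f_high=8000.0):
    """Centers equally spaced on the mel scale; each bandwidth spans the two neighbouring centers."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    edges = [mel_to_hz(lo + i * step) for i in range(n_filters + 2)]
    centers = edges[1:-1]
    bandwidths = [edges[i + 2] - edges[i] for i in range(n_filters)]
    return centers, bandwidths

def triangular_filterbank(centers, bandwidths, n_bins=257, f_max=8000.0):
    """One symmetric triangle per filter: weight 1 at the center, 0 at center +/- bandwidth / 2."""
    freqs = [f_max * k / (n_bins - 1) for k in range(n_bins)]
    return [
        [max(0.0, 1.0 - abs(f - c) / (b / 2.0)) for f in freqs]
        for c, b in zip(centers, bandwidths)
    ]

def mel_cepstrum(power_spectrum, bank, n_ceps=12):
    """Log filter-bank energies followed by a DCT-II."""
    log_e = [
        math.log(sum(w * p for w, p in zip(row, power_spectrum)) + 1e-10)
        for row in bank
    ]
    n = len(log_e)
    return [
        sum(e * math.cos(math.pi * k * (j + 0.5) / n) for j, e in enumerate(log_e))
        for k in range(n_ceps)
    ]

# Usage: cepstrum of a flat power spectrum with the default parameters.
centers, bandwidths = default_parameters()
bank = triangular_filterbank(centers, bandwidths)
print(mel_cepstrum([1.0] * 257, bank))
```

Because `centers` and `bandwidths` are plain lists, an optimizer can perturb them and re-score recognition accuracy, which is the kind of parameter search the abstract describes.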
We report on our research concerning the calibration of physical models for sound synthesis. We combine waveguide physical modeling synthesis with formant filtering, by dividing the nonlinear description of the reed mechanism into a nonlinear part and an input-dependent linear filter. We elaborate on the calibration of the model and assess its performance by comparing it to a single-reed, cylindrical bore instrument, the clarinet.
{"title":"A formant filtered physical model for wind instruments","authors":"A. Nackaerts, B. Moor, R. Lauwereins","doi":"10.1109/TSA.2002.807351","DOIUrl":"https://doi.org/10.1109/TSA.2002.807351","url":null,"abstract":"We report on our research concerning the calibration of physical models for sound synthesis. We combine waveguide physical modeling synthesis with formant filtering, by dividing the nonlinear description of the reed mechanism into a nonlinear part and an input-dependent linear filter. We elaborate on the calibration of the model and assess its performance by comparing it to a single-reed, cylindrical bore instrument, the clarinet.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"12 1","pages":"36-44"},"PeriodicalIF":0.0,"publicationDate":"2003-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74452203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In many practical active noise control (ANC) applications, online secondary path modeling methods that use auxiliary noise are applied. However, the auxiliary noise contributes to the residual noise and thus degrades the noise control performance of ANC systems. Moreover, a sudden, large change in the secondary path can easily cause existing online secondary path modeling methods to diverge. To mitigate these problems, this paper proposes a new online secondary path modeling method with auxiliary noise power scheduling and adaptive filter norm manipulation. The auxiliary noise power is scheduled based on the convergence status of the ANC system, taking into account the variation of the primary noise. The purpose is to limit the increase in residual noise caused by the auxiliary noise. In addition, norm manipulation is applied to the adaptive filters in the ANC system. The objective is to avoid over-updating the adaptive filters after a sudden, large change in the secondary path and thus prevent the ANC system from diverging. Computer simulations show the effectiveness and robustness of the proposed method.
{"title":"A robust online secondary path modeling method with auxiliary noise power scheduling strategy and norm constraint manipulation","authors":"Ming Zhang, H. Lan, W. Ser","doi":"10.1109/TSA.2003.805643","DOIUrl":"https://doi.org/10.1109/TSA.2003.805643","url":null,"abstract":"In many practical cases for active noise control (ANC), the online secondary path modeling methods that use auxiliary noise are often applied. However, the auxiliary noise contributes to residual noise, and thus deteriorates the noise control performance of ANC systems. Moreover, a sudden and large change in the secondary path leads to easy divergence of the existing online secondary path modeling methods. To mitigate these problems, this paper proposes a new online secondary path modeling method with auxiliary noise power scheduling and adaptive filter norm manipulation. The auxiliary noise power is scheduled based on the convergence status of an ANC system with consideration of the variation of the primary noise. The purpose is to alleviate the increment of the residual noise due to the auxiliary noise. In addition, the norm manipulation is applied to adaptive filters in the ANC system. The objective is to avoid over-updates of adaptive filters due to the sudden large change in the secondary path and thus prevent the ANC system from diverging. Computer simulations show the effectiveness and robustness of the proposed method.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"88 1","pages":"45-53"},"PeriodicalIF":0.0,"publicationDate":"2003-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81244473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present an enhancement front-end for speech codecs, which consists of the integrated elements of noise reduction and echo cancellation. By including these elements, the front-end performs the task of mitigating the objectionable effects of the two major factors, i.e., noise and echo, which adversely affect the quality of most transmission systems, especially when low bit rate codecs are used. The use of this front-end is demonstrated with the 7.4 kbps IS-641 codec (enhanced full-rate standard for IS-136 systems). The integrated speech-processing unit has the advantage of utilizing the synergy among its components: the voice activity detector in the speech codec, the noise reduction, and the echo canceller. This synergy manifests itself both in the form of a reduction of the overall computational complexity by the use of a number of shared elements among the unit's various components, as well as an improved performance resulting from these components working together. The system displays high performance in both clean and noisy environments and it works well with low bit rate codecs.
{"title":"Noise reduction and echo cancellation front-end for speech codecs","authors":"F. Basbug, K. Swaminathan, S. Nandkumar","doi":"10.1109/TSA.2002.807350","DOIUrl":"https://doi.org/10.1109/TSA.2002.807350","url":null,"abstract":"We present an enhancement front-end for speech codecs, which consists of the integrated elements of noise reduction and echo cancellation. By including these elements, the front-end performs the task of mitigating the objectionable effects of the two major factors, i.e., noise and echo, which adversely affect the quality of most transmission systems, especially when low bit rate codecs are used. The use of this front-end is demonstrated with the 7.4 kbps IS-641 codec (enhanced full-rate standard for IS-136 systems). The integrated speech-processing unit has the advantage of utilizing the synergy among its components: the voice activity detector in the speech codec, the noise reduction, and the echo canceller. This synergy manifests itself both in the form of a reduction of the overall computational complexity by the use of a number of shared elements among the unit's various components, as well as an improved performance resulting from these components working together. The system displays high performance in both clean and noisy environments and it works well with low bit rate codecs.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"178 1","pages":"1-13"},"PeriodicalIF":0.0,"publicationDate":"2003-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79964723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient algorithms for the exact and approximate computation of the symmetrical Kullback-Leibler (1998) measure for spectral distances are presented for linear predictive coding (LPC) spectra. An interpretation of this measure is given in terms of the poles of the spectra. The performance of the algorithms in terms of accuracy and computational complexity is assessed for the application of computing concatenation costs in unit-selection-based speech synthesis. With the same complexity and storage requirements, the exact method is superior in terms of accuracy.
{"title":"On the computation of the Kullback-Leibler measure for spectral distances","authors":"R. Veldhuis, E. Klabbers","doi":"10.1109/TSA.2002.805641","DOIUrl":"https://doi.org/10.1109/TSA.2002.805641","url":null,"abstract":"Efficient algorithms for the exact and approximate computation of the symmetrical Kullback-Leibler (1998) measure for spectral distances are presented for linear predictive coding (LPC) spectra. A interpretation of this measure is given in terms of the poles of the spectra. The performances of the algorithms in terms of accuracy and computational complexity are assessed for the application of computing concatenation costs in unit-selection-based speech synthesis. With the same complexity and storage requirements, the exact method is superior in terms of accuracy.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"53 39 1","pages":"100-103"},"PeriodicalIF":0.0,"publicationDate":"2003-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80481225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
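As a rough illustration of the quantity discussed in the abstract above, the sketch below evaluates a symmetric Kullback-Leibler distance between two LPC spectra by brute force on a frequency grid. This is not one of the efficient algorithms from the paper. The normalization (each sampled spectrum scaled to sum to one), the grid size and the function names are assumptions made for the example.

```python
import cmath
import math

def lpc_power_spectrum(a, n_points=512):
    """Sample 1 / |A(e^{jw})|^2 on [0, pi), with A(z) = 1 + a[0] z^-1 + a[1] z^-2 + ..."""
    spectrum = []
    for k in range(n_points):
        w = math.pi * k / n_points
        response = 1.0 + sum(c * cmath.exp(-1j * w * (i + 1)) for i, c in enumerate(a))
        spectrum.append(1.0 / abs(response) ** 2)
    return spectrum

def normalize(spectrum):
    """Scale the samples to sum to one so they can be treated like a distribution."""
    total = sum(spectrum)
    return [s / total for s in spectrum]

def symmetric_kl(p, q):
    """Sum of (p - q) * log(p / q), which equals KL(p||q) + KL(q||p) for normalized p and q."""
    return sum((x - y) * math.log(x / y) for x, y in zip(p, q))

# Usage: two first-order LPC models with poles at 0.9 and 0.5.
p = normalize(lpc_power_spectrum([-0.9]))
q = normalize(lpc_power_spectrum([-0.5]))
print(symmetric_kl(p, q))
```

The cost of this direct evaluation grows with the grid size for every pair of spectra, which is the kind of overhead that motivates faster exact or approximate computations when many concatenation costs are needed.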
This paper shows how discriminative training can significantly improve classifiers used in natural language processing, using as an example the task of natural language call routing, where callers are transferred to desired departments based on natural spoken responses to an open-ended "How may I direct your call?" prompt. With vector-based natural language call routing, callers are transferred using a routing matrix trained on statistics of occurrence of words and word sequences in a training corpus. By re-training the routing matrix parameters using a minimum classification error criterion, a relative error rate reduction of 10-30% was achieved on a banking task. Increased robustness was demonstrated in that with 10% rejection, the error rate was reduced by 40%. Discriminative training also improves portability; we were able to train call routers with the highest known performance using as input only text transcription of routed calls, without any human intervention or knowledge about what terms are important or irrelevant for the routing task. This strategy was validated with both the banking task and a more difficult task involving calls to operators in the UK. The proposed formulation is applicable to algorithms addressing a broad range of speech understanding, information retrieval, and topic identification problems.
{"title":"Discriminative training of natural language call routers","authors":"H. Kuo, Chin-Hui Lee","doi":"10.1109/TSA.2002.807352","DOIUrl":"https://doi.org/10.1109/TSA.2002.807352","url":null,"abstract":"This paper shows how discriminative training can significantly improve classifiers used in natural language processing, using as an example the task of natural language call routing, where callers are transferred to desired departments based on natural spoken responses to an open-ended \"How may I direct your call?\" prompt. With vector-based natural language call routing, callers are transferred using a routing matrix trained on statistics of occurrence of words and word sequences in a training corpus. By re-training the routing matrix parameters using a minimum classification error criterion, a relative error rate reduction of 10-30% was achieved on a banking task. Increased robustness was demonstrated in that with 10% rejection, the error rate was reduced by 40%. Discriminative training also improves portability; we were able to train call routers with the highest known performance using as input only text transcription of routed calls, without any human intervention or knowledge about what terms are important or irrelevant for the routing task. This strategy was validated with both the banking task and a more difficult task involving calls to operators in the UK. The proposed formulation is applicable to algorithms addressing a broad range of speech understanding, information retrieval, and topic identification problems.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"73 1","pages":"24-35"},"PeriodicalIF":0.0,"publicationDate":"2003-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82043531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
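The vector-based routing described in the abstract above can be sketched in a few lines. This is a toy illustration under stated assumptions: raw word counts as matrix entries, cosine similarity as the score, and invented training sentences and destination names. It shows only the baseline routing step; the minimum classification error re-training that the paper contributes is not implemented here.

```python
import math
from collections import Counter

def build_routing_matrix(training_calls):
    """training_calls: list of (transcript, destination) pairs.

    Returns one term-count vector per destination, i.e. the rows of a routing matrix.
    """
    matrix = {}
    for text, destination in training_calls:
        matrix.setdefault(destination, Counter()).update(text.lower().split())
    return matrix

def cosine(u, v):
    dot = sum(count * v.get(term, 0) for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def route(matrix, request):
    """Score the request against every destination and return the best one with all scores."""
    query = Counter(request.lower().split())
    scores = {dest: cosine(query, row) for dest, row in matrix.items()}
    return max(scores, key=scores.get), scores

# Usage with made-up banking transcripts.
calls = [
    ("i lost my credit card", "cards"),
    ("report a stolen card", "cards"),
    ("what is my checking balance", "accounts"),
    ("transfer money between accounts", "accounts"),
]
matrix = build_routing_matrix(calls)
destination, scores = route(matrix, "my card was stolen")
print(destination, scores)
```

A discriminative scheme of the kind the abstract describes would treat the entries of `matrix` as trainable parameters and adjust them to reduce routing errors on the training calls, rather than leaving them as raw counts.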
Jan Mark de Haan, N. Grbic, I. Claesson, S. Nordholm
This paper presents a new method for the design of oversampled uniform DFT-filter banks for the special application of subband adaptive beamforming with microphone arrays. Since array applications rely on the fact that different source positions give rise to different signal delays, a beamformer alters the phase information of the signals. This in turn leads to signal degradations when perfect reconstruction filter banks are used for the subband decomposition and reconstruction. The objective of the filter bank design is to minimize the magnitude of all aliasing components individually, such that aliasing distortion is minimized although phase alterations occur in the subbands. The proposed method is evaluated in a car hands-free mobile telephony environment and the results show that the proposed method offers better performance regarding suppression levels of disturbing signals and much less distortion to the source speech.
{"title":"Filter bank design for subband adaptive microphone arrays","authors":"Jan Mark de Haan, N. Grbic, I. Claesson, S. Nordholm","doi":"10.1109/TSA.2002.807353","DOIUrl":"https://doi.org/10.1109/TSA.2002.807353","url":null,"abstract":"This paper presents a new method for the design of oversampled uniform DFT-filter banks for the special application of subband adaptive beamforming with microphone arrays. Since array applications rely on the fact that different source positions give rise to different signal delays, a beamformer alters the phase information of the signals. This in turn leads to signal degradations when perfect reconstruction filter banks are used for the subband decomposition and reconstruction. The objective of the filter bank design is to minimize the magnitude of all aliasing components individually, such that aliasing distortion is minimized although phase alterations occur in the subbands. The proposed method is evaluated in a car hands-free mobile telephony environment and the results show that the proposed method offers better performance regarding suppression levels of disturbing signals and much less distortion to the source speech.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"3 1","pages":"14-23"},"PeriodicalIF":0.0,"publicationDate":"2003-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90146045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The uncertainty in parameter estimation due to adverse environments degrades classification performance in speech recognition. It becomes crucial to incorporate the parameter uncertainty into the decision so that classification robustness can be assured. We propose a novel linear regression based Bayesian predictive classification (LRBPC) for robust speech recognition. This framework is constructed under the paradigm of linear regression adaptation of speech hidden Markov models (HMMs). Because the regression mapping between HMMs and adaptation data is ill posed, we characterize the uncertainty of the regression parameters using a joint Gaussian distribution. A closed-form predictive distribution can be derived to set up the LRBPC decision for speech recognition. Such a decision is robust compared to the plug-in maximum a posteriori (MAP) decision adopted in maximum likelihood linear regression (MLLR) and MAP linear regression (MAPLR). Since the specified distribution belongs to the conjugate prior family, the evolutionary hyperparameters are established. With the statistically rich hyperparameters, the LRBPC achieves decision robustness. In the experiments, we find that the LRBPC decision, for both general linear regression and single variable linear regression, attains significantly better recognition performance than MLLR and MAPLR adaptation.
{"title":"Linear regression based Bayesian predictive classification for speech recognition","authors":"Jen-Tzung Chien","doi":"10.1109/TSA.2002.805640","DOIUrl":"https://doi.org/10.1109/TSA.2002.805640","url":null,"abstract":"The uncertainty in parameter estimation due to the adverse environments deteriorates the classification performance for speech recognition. It becomes crucial to incorporate the parameter uncertainty into decision so that the classification robustness can be assured. We propose a novel linear regression based Bayesian predictive classification (LRBPC) for robust speech recognition. This framework is constructed under the paradigm of linear regression adaptation of speech hidden Markov models (HMMs). Because the regression mapping between HMMs and adaptation data is ill posed, we properly characterize the uncertainty of regression parameters using a joint Gaussian distribution . A closed-form predictive distribution can be derived to set up the LRBPC decision for speech recognition. Such decision is robust compared to the plug-in maximum a posteriori (MAP) decision adopted in the maximum likelihood linear regression (MLLR) and MAP linear regression (MAPLR). Since the specified distribution belongs to the conjugate prior family, the evolutionary hyperparameters are established. With the statistically rich hyperparameters, the LRBPC achieves decision robustness. In the experiments, we find that LRBPC decision in cases of general linear regression as well as single variable linear regression attains significantly better recognition performance than MLLR and MAPLR adaptation.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"63 1","pages":"70-79"},"PeriodicalIF":0.0,"publicationDate":"2003-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90590519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
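The difference between a plug-in decision and a Bayesian predictive decision, as contrasted in the abstract above, can be written compactly. The notation below is assumed for illustration and is not taken from the paper: $X$ is the observed utterance, $W$ a candidate word sequence, $\Lambda$ the HMM parameters, $\eta$ the linear regression parameters, $\hat{\eta}$ their point estimate, and $\varphi$ the hyperparameters of the prior over $\eta$.

```latex
% Plug-in decision: a single point estimate \hat{\eta} stands in for the regression parameters.
\hat{W}_{\text{plug-in}} = \arg\max_{W} \; p(W)\, p(X \mid W, \Lambda, \hat{\eta})

% Predictive decision: the regression parameters are integrated out under their prior.
\hat{W}_{\text{BPC}} = \arg\max_{W} \; p(W) \int p(X \mid W, \Lambda, \eta)\, p(\eta \mid \varphi)\, d\eta
```

When $p(\eta \mid \varphi)$ is Gaussian and the observation densities are Gaussian, the integral can admit a closed form, which is consistent with the abstract's statement that a closed-form predictive distribution is derived.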