A model-based framework for classification error rate estimation is proposed for speech and speaker recognition. It aims to predict the run-time performance of a hidden Markov model (HMM) based recognition system for a given task vocabulary and grammar without running recognition experiments on a separate set of test samples, which is highly desirable both in theory and in practice. However, the error rate expression in HMM-based speech recognition systems has no closed-form solution, owing to the complexity of the multi-class comparison process and the need for dynamic time warping to handle variable speech patterns. To alleviate this difficulty, we propose a one-dimensional model-based misclassification measure that evaluates the distance between a particular model of interest and a combination of many of its competing models. The error rate for a class characterized by an HMM is then the value of a smoothed zero-one error function evaluated at the misclassification measure, and the overall error rate of the task vocabulary can be computed as a function of all the available class error rates. The key is to evaluate the misclassification measure in terms of the parameters of environment-matched models without running recognition experiments, where the models are adapted using very limited data that can be just the test utterance itself. In this paper, we show how the misclassification measure can be approximated by first computing the distance between two Gaussian mixture densities, then between two HMMs with Gaussian mixture state observation densities, and finally between two sequences of HMMs. The misclassification measure is then converted into a classification error rate. When the error rates obtained from actual recognition experiments are compared with those produced by the new framework, the proposed algorithm accurately estimates the classification error rate for many types of speech and speaker recognition problems. Based on the same framework, it is also demonstrated that the error rate of a recognition system in a noisy environment can be predicted.
{"title":"A study on model-based error rate estimation for automatic speech recognition","authors":"C. Huang, Hsiao-Chuan Wang, Chin-Hui Lee","doi":"10.1109/TSA.2003.818030","DOIUrl":"https://doi.org/10.1109/TSA.2003.818030","url":null,"abstract":"A model-based framework of classification error rate estimation is proposed for speech and speaker recognition. It aims at predicting the run-time performance of a hidden Markov model (HMM) based recognition system for a given task vocabulary and grammar without the need of running recognition experiments using a separate set of testing samples. This is highly desirable both in theory and in practice. However, the error rate expression in HMM-based speech recognition systems has no closed form solution due to the complexity of the multi-class comparison process and the need for dynamic time warping to handle various speech patterns. To alleviate the difficulty, we propose a one-dimensional model-based misclassification measure to evaluate the distance between a particular model of interest and a combination of many of its competing models. The error rate for a class characterized by the HMM is then the value of a smoothed zero-one error function given the misclassification measure. The overall error rate of the task vocabulary could then be computed as a function of all the available class error rates. The key here is to evaluate the misclassification measure in terms of the parameters of environmental-matched models without running recognition experiments, where the models are adapted by very limited data that could be just the testing utterance itself. In this paper, we show how the misclassification measure could be approximated by first computing the distance between two mixture Gaussian densities, then between two HMMs with mixture Gaussian state observation densities and finally between two sequences of HMMs. The misclassification measure is then converted into classification error rate. When comparing the error rate obtained in running actual experiments and that of the new framework, the proposed algorithm accurately estimates the classification error rate for many types of speech and speaker recognition problems. Based on the same framework, it is also demonstrated that the error rate of a recognition system in a noisy environment could also be predicted.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"18 1","pages":"581-589"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74614774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Coloration is a phenomenon in which timbre changes when reflected and direct sounds are mixed. We studied the relationship between the perception of coloration and the directions of two sounds. Our psychological experiments with 11 subjects suggested that the 50% coloration threshold does not differ with direction. When the level ratio of the two sounds is closer to 0 dB, however, a difference appears: if the direct sound comes from a lateral direction and the reflected sound comes from the opposite direction, coloration perception does not increase monotonically even as the ratio approaches 0 dB. We assumed that this directional difference results from the directional dependence of the spectrum, including the head-related transfer function (HRTF), and propose a numerical model for predicting the psychological results using the comb structure of the spectrum observed at the eardrum. We measured spectra with a head and torso simulator (HATS), calculated the area associated with this comb structure, found a quantitative relationship between that area and the psychological results, and propose a prediction model based on this relationship.
{"title":"Coloration perception depending on sound direction","authors":"Y. Seki, Kiyohide Ito","doi":"10.1109/TSA.2003.818032","DOIUrl":"https://doi.org/10.1109/TSA.2003.818032","url":null,"abstract":"Coloration is a phenomenon in which timbre changes when reflected and direct sounds are mixed. We studied the relationship between the perception of coloration and direction for two sounds. Our psychological experiments using 11 subjects suggested that a 50% threshold of coloration appears to have no difference depending on direction. When the level ratio of two sounds is closer to 0 dB, a difference appears: If direct sound comes from a lateral direction and reflected sound comes from the opposite direction, coloration perception does not increase monotonically even if the ratio approaches 0 dB. We assumed that the difference depending on direction resulted from the directional dependence of the spectrum including the head-related transfer function (HRTF) and proposed a numerical model for predicting psychological results using the comb structure on the spectrum observed at the eardrum. We measured spectra using a head and torso simulator (HATS) and calculated the area, eventually finding a quantitative relationship between the area and psychological results and proposing a prediction model based on this relationship.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"52 1","pages":"817-825"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79821828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Various interpolated three-dimensional (3-D) digital waveguide mesh algorithms are elaborated. We introduce an optimized technique that improves a previously proposed trilinearly interpolated 3-D mesh and makes the mesh more homogeneous across directions. Furthermore, various sparse versions of the interpolated mesh algorithm are investigated, which reduce the computational complexity at the expense of accuracy. Frequency-warping techniques are used to shift the frequencies of the output signal of the mesh in order to cancel the effect of dispersion error. The extensions improve the accuracy of 3-D digital waveguide mesh simulations enough that the method can, in the future, be used for the acoustical simulations needed in the design of listening rooms, for example.
{"title":"Interpolated rectangular 3-D digital waveguide mesh algorithms with frequency warping","authors":"L. Savioja, V. Välimäki","doi":"10.1109/TSA.2003.818028","DOIUrl":"https://doi.org/10.1109/TSA.2003.818028","url":null,"abstract":"Various interpolated three-dimensional (3-D) digital waveguide mesh algorithms are elaborated. We introduce an optimized technique that improves a formerly proposed trilinearly interpolated 3-D mesh and renders the mesh more homogeneous in different directions. Furthermore, various sparse versions of the interpolated mesh algorithm are investigated, which reduce the computational complexity at the expense of accuracy. Frequency-warping techniques are used to shift the frequencies of the output signal of the mesh in order to cancel the effect of dispersion error. The extensions improve the accuracy of 3-D digital waveguide mesh simulations enough so that in the future it can be used for acoustical simulations needed in the design of listening rooms, for example.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"3 1","pages":"783-790"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88902457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The major drawback of most noise reduction methods in speech applications is the annoying residual noise known as musical noise. A potential remedy for this artifact is the incorporation of a human hearing model into the suppression filter design. However, since the available models are usually developed in the frequency domain, it is not clear how they can be applied in the signal subspace approach to speech enhancement. In this paper, we present a Frequency to Eigendomain Transformation (FET) that permits the calculation of a perceptually based eigenfilter. This filter yields improved results, achieving better shaping of the residual noise from a perceptual perspective. The proposed method also handles the general case of colored noise. Spectrogram illustrations and listening test results are given to show the superiority of the proposed method over the conventional signal subspace approach.
{"title":"Incorporating the human hearing properties in the signal subspace approach for speech enhancement","authors":"F. Jabloun, B. Champagne","doi":"10.1109/TSA.2003.818031","DOIUrl":"https://doi.org/10.1109/TSA.2003.818031","url":null,"abstract":"The major drawback of most noise reduction methods in speech applications is the annoying residual noise known as musical noise. A potential solution to this artifact is the incorporation of a human hearing model in the suppression filter design. However, since the available models are usually developed in the frequency domain, it is not clear how they can be applied in the signal subspace approach for speech enhancement. In this paper, we present a Frequency to Eigendomain Transformation (FET) which permits to calculate a perceptually based eigenfilter. This filter yields an improved result where better shaping of the residual noise, from a perceptual perspective, is achieved. The proposed method can also be used with the general case of colored noise. Spectrogram illustrations and listening test results are given to show the superiority of the proposed method over the conventional signal subspace approach.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"2070 1","pages":"700-708"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91329942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents the results and conclusions of a thorough study on automatic phonetic segmentation. It starts with a review of the state of the art in this field. It then analyzes the most frequently used approach, which is based on a modified hidden Markov model (HMM) phonetic recognizer. For this approach, a statistical correction procedure is proposed to compensate for the systematic errors produced by context-dependent HMMs, and the use of speaker adaptation techniques is considered to increase segmentation precision. Finally, the paper explores the possibility of locally refining the boundaries obtained with the former techniques. A general framework is proposed for the local refinement of boundaries, and the performance of several pattern classification approaches (fuzzy logic, neural networks, and Gaussian mixture models) is compared within this framework. The resulting phonetic segmentation scheme increased the performance of a baseline HMM segmentation tool from 27.12%, 79.27%, and 97.75% of automatic boundary marks with errors smaller than 5, 20, and 50 ms, respectively, to 65.86%, 96.01%, and 99.31% in speaker-dependent mode, which is a reasonably good approximation to manual segmentation.
{"title":"Automatic phonetic segmentation","authors":"D. Toledano, L. A. H. Gómez, Luis Villarrubia Grande","doi":"10.1109/TSA.2003.813579","DOIUrl":"https://doi.org/10.1109/TSA.2003.813579","url":null,"abstract":"This paper presents the results and conclusions of a thorough study on automatic phonetic segmentation. It starts with a review of the state of the art in this field. Then, it analyzes the most frequently used approach-based on a modified Hidden Markov Model (HMM) phonetic recognizer. For this approach, a statistical correction procedure is proposed to compensate for the systematic errors produced by context-dependent HMMs, and the use of speaker adaptation techniques is considered to increase segmentation precision. Finally, this paper explores the possibility of locally refining the boundaries obtained with the former techniques. A general framework is proposed for the local refinement of boundaries, and the performance of several pattern classification approaches (fuzzy logic, neural networks and Gaussian mixture models) is compared within this framework. The resulting phonetic segmentation scheme was able to increase the performance of a baseline HMM segmentation tool from 27.12%, 79.27%, and 97.75% of automatic boundary marks with errors smaller than 5, 20, and 50 ms, respectively, to 65.86%, 96.01%, and 99.31% in speaker-dependent mode, which is a reasonably good approximation to manual segmentation.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"445 1","pages":"617-625"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77852288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A novel paradigm based on pitch-adaptive windows is proposed for encoding the fixed codebook (FCB) excitation in low bit-rate CELP coders. In this method, the nonzero excitation in the fixed codebook is substantially localized to a set of time intervals called windows. The positions of the windows adapt to the pitch peaks in the linear prediction residual signal. High coding efficiency is thus achieved by allocating most of the available FCB bits to the perceptually important segments of the excitation signal. The pitch-adaptive method is adopted in the design of a novel multimode variable-rate speech coder applicable to CDMA-based cellular telephony. Results demonstrate that the adaptive-windows method yields excellent voice quality and intelligibility at average bit rates in the range of 2.5-4.0 kbps.
{"title":"Pitch adaptive windows for improved excitation coding in low-rate CELP coders","authors":"A. Rao, S. Ahmadi, J. Linden, A. Gersho, V. Cuperman, R. Heidari","doi":"10.1109/TSA.2003.815530","DOIUrl":"https://doi.org/10.1109/TSA.2003.815530","url":null,"abstract":"A novel paradigm based on pitch-adaptive windows is proposed for solving the problem of encoding the fixed codebook excitation in low bit-rate CELP coders. In this method, the nonzero excitation in the fixed codebook is substantially localized to a set of time intervals called windows. The positions of the windows are adaptive to the pitch peaks in the linear prediction residual signal. Thus, high coding efficiency is achieved by allocating most of the available FCB bits to the perceptually important segments of the excitation signal. The pitch-adaptive method is adopted in the design of a novel multimode variable-rate speech coder applicable to CDMA-based cellular telephony. Results demonstrate that the adaptive windows method yields excellent voice quality and intelligibility at average bit-rates in the range of 2.5-4.0 kbps.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"33 1","pages":"648-659"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82933760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Broadband microphone arrays have important applications such as hands-free mobile telephony, voice interfaces to personal computers, and video-conferencing equipment. The design of such arrays can be tackled in different ways. In this paper, a general broadband beamformer design problem is considered. The problem is posed as a Chebyshev minimax problem. Using the l1-norm measure or the real rotation theorem, we show that it can be converted into a semi-infinite linear programming problem. A numerical scheme using a set of adaptive grids is applied. The scheme is proven to be convergent when a certain grid refinement is used. The method can be applied to the design of multidimensional digital finite-impulse-response (FIR) filters with arbitrarily specified amplitude and phase.
{"title":"Near-field broadband beamformer design via multidimensional semi-infinite-linear programming techniques","authors":"K. Yiu, Xiaoqi Yang, S. Nordholm, K. Teo","doi":"10.1109/TSA.2003.815527","DOIUrl":"https://doi.org/10.1109/TSA.2003.815527","url":null,"abstract":"Broadband microphone arrays has important applications such as hands-free mobile telephony, voice interface to personal computers and video conference equipment. This problem can be tackled in different ways. In this paper, a general broadband beamformer design problem is considered. The problem is posed as a Chebyshev minimax problem. Using the l/sub 1/-norm measure or the real rotation theorem, we show that it can be converted into a semi-infinite linear programming problem. A numerical scheme using a set of adaptive grids is applied. The scheme is proven to be convergent when a certain grid refinement is used. The method can be applied to the design of multidimensional digital finite-impulse response (FIR) filters with arbitrarily specified amplitude and phase.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"11 1","pages":"725-732"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89495175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We describe a novel algorithm for recursive estimation of nonstationary acoustic noise that corrupts clean speech, and a successful application of the algorithm in the speech feature enhancement framework of noise-normalized SPLICE for robust speech recognition. The noise estimation algorithm makes use of a nonlinear model of the acoustic environment in the cepstral domain. Central to the algorithm is an innovative iterative stochastic approximation technique that improves the piecewise linear approximation to the nonlinearity involved and thereby increases the accuracy of the noise estimate. We report comprehensive experiments on SPLICE-based, noise-robust speech recognition for the AURORA2 task using the results of iterative stochastic approximation. The effectiveness of the new technique is demonstrated in comparison with a more traditional MMSE noise estimation algorithm under otherwise identical conditions. The word error rate reduction achieved by iterative stochastic approximation for recursive noise estimation in the framework of noise-normalized SPLICE is 27.9% for the multicondition training mode and 67.4% for the clean-only training mode, respectively, compared with the results using the standard cepstra with no speech enhancement and the baseline HMM supplied by AURORA2. These represent the best performance in the clean-training category of the September 2001 AURORA2 evaluation. The relative error rate reduction achieved with the same noise estimate increases to 48.40% and 76.86%, respectively, for the two training modes after using a better-designed HMM system. The experimental results demonstrate the crucial importance of the newly introduced iterations in improving the earlier stochastic approximation technique, and show the sensitivity of the noise estimation algorithm's performance to the forgetting factor embedded in the algorithm.
{"title":"Recursive estimation of nonstationary noise using iterative stochastic approximation for robust speech recognition","authors":"L. Deng, J. Droppo, A. Acero","doi":"10.1109/TSA.2003.818076","DOIUrl":"https://doi.org/10.1109/TSA.2003.818076","url":null,"abstract":"We describe a novel algorithm for recursive estimation of nonstationary acoustic noise which corrupts clean speech, and a successful application of the algorithm in the speech feature enhancement framework of noise-normalized SPLICE for robust speech recognition. The noise estimation algorithm makes use of a nonlinear model of the acoustic environment in the cepstral domain. Central to the algorithm is the innovative iterative stochastic approximation technique that improves piecewise linear approximation to the nonlinearity involved and that subsequently increases the accuracy for noise estimation. We report comprehensive experiments on SPLICE-based, noise-robust speech recognition for the AURORA2 task using the results of iterative stochastic approximation. The effectiveness of the new technique is demonstrated in comparison with a more traditional, MMSE noise estimation algorithm under otherwise identical conditions. The word error rate reduction achieved by iterative stochastic approximation for recursive noise estimation in the framework of noise-normalized SPLICE is 27.9% for the multicondition training mode, and 67.4% for the clean-only training mode, respectively, compared with the results using the standard cepstra with no speech enhancement and using the baseline HMM supplied by AURORA2. These represent the best performance in the clean-training category of the September-2001 AURORA2 evaluation. The relative error rate reduction achieved by using the same noise estimate is increased to 48.40% and 76.86%, respectively, for the two training modes after using a better designed HMM system. The experimental results demonstrated the crucial importance of using the newly introduced iterations in improving the earlier stochastic approximation technique, and showed sensitivity of the noise estimation algorithm's performance to the forgetting factor embedded in the algorithm.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"110 1","pages":"568-580"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84247680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Binaural Cue Coding (BCC) is a method for multichannel spatial rendering based on one down-mixed audio channel and side information. The companion paper (Part I) covers the psychoacoustic fundamentals of this method and outlines principles for the design of BCC schemes. The BCC analysis and synthesis methods of Part I are motivated and presented in the framework of stereophonic audio coding. This paper, Part II, generalizes the basic BCC schemes presented in Part I: it covers BCC for multichannel signals and employs an enhanced set of perceptual spatial cues for BCC synthesis. A scheme for multichannel audio coding is presented. Moreover, a modified scheme is derived that allows flexible rendering of the spatial image at the receiver, supporting dynamic control. All aspects of complete BCC encoder and decoder implementations are discussed, such as down-mixing of the input signals, low-complexity estimation of the spatial cues, and quantization and coding of the side information. Application examples are given, and the performance of the coder implementations is evaluated and discussed based on subjective listening test results.
{"title":"Binaural cue coding-Part II: Schemes and applications","authors":"C. Faller, F. Baumgarte","doi":"10.1109/TSA.2003.818108","DOIUrl":"https://doi.org/10.1109/TSA.2003.818108","url":null,"abstract":"Binaural Cue Coding (BCC) is a method for multichannel spatial rendering based on one down-mixed audio channel and side information. The companion paper (Part I) covers the psychoacoustic fundamentals of this method and outlines principles for the design of BCC schemes. The BCC analysis and synthesis methods of Part I are motivated and presented in the framework of stereophonic audio coding. This paper, Part II, generalizes the basic BCC schemes presented in Part I. It includes BCC for multichannel signals and employs an enhanced set of perceptual spatial cues for BCC synthesis. A scheme for multichannel audio coding is presented. Moreover, a modified scheme is derived that allows flexible rendering of the spatial image at the receiver supporting dynamic control. All aspects of complete BCC encoder and decoder implementations are discussed, such as down-mixing of the input signals, low complexity estimation of the spatial cues, and quantization and coding of the side information. Application examples are given and the performance of the coder implementations are evaluated and discussed based on subjective listening test results.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"65 1","pages":"520-531"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90485495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Traditional acoustic source localization algorithms attempt to find the current location of an acoustic source using only data collected at an array of sensors at the current time. In the presence of strong multipath, these traditional algorithms often erroneously locate a multipath reflection rather than the true source location. A recently proposed approach that appears promising in overcoming this drawback is a state-space approach using particle filtering. In this paper, we formulate a general framework for tracking an acoustic source using particle filters. We discuss four specific algorithms that fit within this framework and demonstrate their performance using both simulated reverberant data and data recorded in a moderately reverberant office room (with a measured reverberation time of 0.39 s). The results indicate that the proposed family of algorithms is able to accurately track a moving source in a moderately reverberant room.
{"title":"Particle filtering algorithms for tracking an acoustic source in a reverberant environment","authors":"D. Ward, E. Lehmann, R. C. Williamson","doi":"10.1109/TSA.2003.818112","DOIUrl":"https://doi.org/10.1109/TSA.2003.818112","url":null,"abstract":"Traditional acoustic source localization algorithms attempt to find the current location of the acoustic source using data collected at an array of sensors at the current time only. In the presence of strong multipath, these traditional algorithms often erroneously locate a multipath reflection rather than the true source location. A recently proposed approach that appears promising in overcoming this drawback of traditional algorithms, is a state-space approach using particle filtering. In this paper we formulate a general framework for tracking an acoustic source using particle filters. We discuss four specific algorithms that fit within this framework, and demonstrate their performance using both simulated reverberant data and data recorded in a moderately reverberant office room (with a measured reverberation time of 0.39 s). The results indicate that the proposed family of algorithms are able to accurately track a moving source in a moderately reverberant room.","PeriodicalId":13155,"journal":{"name":"IEEE Trans. Speech Audio Process.","volume":"83 1","pages":"826-836"},"PeriodicalIF":0.0,"publicationDate":"2003-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77281096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}