A polynomial interpolation-based scheme for reducing bandwidth in distributed speech recognition system
A. Touazi, M. Debyeche
Pub Date: 2013-10-01 | DOI: 10.1109/WASPAA.2013.6701880
In this paper, we propose a low bit-rate compression scheme for distributed speech recognition (DSR) systems based on polynomial interpolation. Dimensionality reduction of a set of successive Mel-frequency cepstral coefficients (MFCCs) is achieved by polynomial least-squares fitting. Conventional vector quantization (VQ) is then applied to the polynomial coefficients, achieving more than 58% bandwidth reduction compared to the ETSI advanced front-end (ETSI-AFE) encoder. Performance evaluation was conducted on the Aurora-2 database in clean and multi-condition training modes. With respect to ETSI-AFE, the proposed encoder shows no significant degradation in terms of overall recognition accuracy.
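The core compression idea above can be sketched in a few lines: fit a low-order least-squares polynomial to the trajectory of each MFCC dimension over a block of successive frames, and transmit only the polynomial coefficients. This is a minimal illustration, not the authors' exact codec; the block length and polynomial order below are illustrative choices.

```python
import numpy as np

def compress_trajectory(x, order=3):
    """Fit a degree-`order` polynomial to the frame trajectory x."""
    t = np.linspace(-1.0, 1.0, len(x))   # normalized frame index
    return np.polyfit(t, x, order)        # coefficients to quantize and send

def decompress_trajectory(coeffs, n_frames):
    t = np.linspace(-1.0, 1.0, n_frames)
    return np.polyval(coeffs, t)          # reconstructed trajectory

rng = np.random.default_rng(0)
block = np.cumsum(rng.normal(size=10))    # smooth-ish mock MFCC track, 10 frames
coeffs = compress_trajectory(block)       # 4 numbers instead of 10
recon = decompress_trajectory(coeffs, len(block))
print(coeffs.shape, np.mean((block - recon) ** 2))
```

The bandwidth saving comes from the ratio of polynomial order plus one to block length; the actual codec additionally vector-quantizes the coefficients.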
Hierarchical and coupled non-negative dynamical systems with application to audio modeling
Umut Simsekli, Jonathan Le Roux, J. Hershey
Pub Date: 2013-10-01 | DOI: 10.1109/WASPAA.2013.6701891
Many kinds of non-negative data, such as power spectra and count data, have been modeled using non-negative matrix factorization. Even though this modeling paradigm has yielded successful applications, it falls short when the data have certain hierarchical and temporal structure. In this study, we propose a novel dynamical system model that can handle these kinds of complex structures that often arise in non-negative data. We show that our model can be extended to handle heterogeneous data for data-driven regularization. We present convergence-guaranteed update rules for each latent factor. In order to assess the performance, we evaluate our model on the transcription of classical piano pieces, and show that it outperforms related models. We also illustrate that the performance can be further improved by making use of symbolic data.
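As background for the model above: plain NMF approximates a non-negative matrix V as WH and is classically fit with multiplicative updates that monotonically decrease the squared Euclidean error (Lee and Seung). The sketch below shows that baseline, which the paper's dynamical-system model extends with hierarchical and temporal structure; shapes and iteration count are illustrative.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9):
    """Baseline NMF via multiplicative updates for squared Euclidean error."""
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, rank)) + eps   # non-negative dictionary
    H = rng.random((rank, T)) + eps   # non-negative activations
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.abs(np.random.default_rng(1).normal(size=(20, 30))) ** 2  # mock power spectra
W, H = nmf(V, rank=5)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(err)
```

The multiplicative form keeps W and H non-negative by construction, which is why variants of these updates reappear in the convergence-guaranteed rules the paper derives.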
A probabilistic line spectrum model for musical instrument sounds and its application to piano tuning estimation
François Rigaud, Angélique Dremeau, B. David, L. Daudet
Pub Date: 2013-10-01 | DOI: 10.1109/WASPAA.2013.6701879
The paper introduces a probabilistic model for the analysis of line spectra, defined here as a set of frequencies of spectral peaks with significant energy. This model is detailed in a general polyphonic audio framework and assumes that, for a time frame of the signal, the observations have been generated by a mixture of notes composed of partial and noise components. Observations corresponding to partial frequencies can provide some information on the musical instrument that generated them. In the case of piano music, the fundamental frequency and the inharmonicity coefficient are introduced as parameters for each note, and can be estimated from the line spectra parameters by means of an Expectation-Maximization algorithm. This technique is finally applied to the unsupervised estimation of the tuning and inharmonicity along the whole compass of a piano, from the recording of a musical piece.
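The note parameters mentioned above follow the standard stiff-string model: partial k of a note with fundamental f0 and inharmonicity coefficient B lies near f_k = k · f0 · sqrt(1 + B·k²). A small sketch, with illustrative values not taken from the paper:

```python
import numpy as np

def partial_frequencies(f0, B, n_partials):
    """Inharmonic partial frequencies of a stiff string: f_k = k*f0*sqrt(1+B*k^2)."""
    k = np.arange(1, n_partials + 1)
    return k * f0 * np.sqrt(1.0 + B * k**2)

f = partial_frequencies(f0=220.0, B=3e-4, n_partials=8)
# String stiffness stretches the upper partials sharp of the harmonic series:
ratios = f / (220.0 * np.arange(1, 9))
print(ratios)   # all above 1.0, and growing with partial index
```

Fitting f0 and B to observed peak frequencies (here via EM over the line-spectrum model) is what yields the tuning and inharmonicity estimates across the piano's compass.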
An LCMV filter for single-channel noise cancellation and reduction in the time domain
J. Jensen, J. Benesty, M. G. Christensen, Jingdong Chen
Pub Date: 2013-10-01 | DOI: 10.1109/WASPAA.2013.6701870
In this paper, we consider a recent class of optimal rectangular filtering matrices for single-channel speech enhancement. This class of filters exploits the fact that the dimension of the signal subspace is lower than that of the full space. The extra degrees of freedom in the filters, which are otherwise reserved for preserving the signal subspace, can then be used to achieve an improved output signal-to-noise ratio (SNR). Interestingly, these filters unify the ideas of optimal filtering and subspace methods. We propose an optimal LCMV filter in this framework with minimum output power that passes the desired signal undistorted and cancels correlated noise; such cancellation was not facilitated by the filters derived so far in this framework. The results show that the proposed filter can achieve output SNRs similar to those of competing filter designs, while having a much higher output signal-to-interference ratio. This is shown for both synthetic and real speech signals.
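For readers unfamiliar with the LCMV form: minimizing the output power h'Rh subject to linear constraints C'h = f has the closed-form solution h = R⁻¹C (C'R⁻¹C)⁻¹ f. The sketch below shows this generic solution (not the paper's specific rectangular filtering matrix); dimensions are illustrative.

```python
import numpy as np

def lcmv(R, C, f):
    """Generic LCMV solution: argmin h'Rh s.t. C'h = f."""
    Ri_C = np.linalg.solve(R, C)                   # R^{-1} C
    return Ri_C @ np.linalg.solve(C.T @ Ri_C, f)   # R^{-1}C (C'R^{-1}C)^{-1} f

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8))
R = A @ A.T + 8 * np.eye(8)       # SPD noisy-signal covariance
C = rng.normal(size=(8, 2))       # two constraint vectors
f = np.array([1.0, 0.0])          # pass the desired component, cancel the other
h = lcmv(R, C, f)
print(C.T @ h)                    # constraints are met: [1, 0] up to numerics
```

In the paper's setting one constraint passes the desired speech undistorted while another nulls the correlated noise component, which is where the distortionless response and cancellation both come from.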
Comparison of windowing in speech and audio coding
Tomas Bäckström
Pub Date: 2013-10-01 | DOI: 10.1109/WASPAA.2013.6701853
Over the last decade, speech and audio coding have converged toward an increasingly unified technology. This contribution discusses one of the remaining fundamental differences between the speech and audio paradigms, namely, windowing of the input signal. Audio codecs generally use lapped transforms and apply a perceptual model in the transform domain, whereby temporal continuity is achieved by windowing and overlap-add. Speech codecs, on the other hand, achieve temporal continuity by using linear predictive filtering, whereby windowing is applied in the residual domain. Despite these fundamental differences, we demonstrate that the two windowing approaches, combined with perceptual modeling, perform very similarly both in terms of perceptual quality and theoretical properties.
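The transform-codec windowing discussed above can be demonstrated concretely: a sine window at 50% overlap satisfies the Princen-Bradley condition w[n]² + w[n+N/2]² = 1, so windowed analysis followed by windowed overlap-add reconstructs the signal exactly away from the edges. A minimal sketch with an illustrative frame length:

```python
import numpy as np

N = 64                                            # frame length, 50% overlap
hop = N // 2
w = np.sin(np.pi * (np.arange(N) + 0.5) / N)      # sine window (Princen-Bradley)

x = np.random.default_rng(0).normal(size=N * 4)
y = np.zeros_like(x)
for start in range(0, len(x) - N + 1, hop):
    frame = w * x[start:start + N]                # analysis windowing
    y[start:start + N] += w * frame               # synthesis windowing + overlap-add

# Interior samples receive two overlapping contributions summing to w^2 + w^2 = 1:
err = np.max(np.abs(y[hop:-hop] - x[hop:-hop]))
print(err)                                        # numerically zero
```

Speech codecs reach the same continuity differently, by running windowing on the LPC residual, which is exactly the contrast the paper evaluates.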
Spectral feature-based nonlinear residual echo suppression
A. Schwarz, Christian Hofmann, Walter Kellermann
Pub Date: 2013-10-01 | DOI: 10.1109/WASPAA.2013.6701825
We propose a method for nonlinear residual echo suppression that consists of extracting spectral features from the far-end signal, and using an artificial neural network to model the residual echo magnitude spectrum from these features. We compare the modeling accuracy achieved by realizations with different features and network topologies, evaluating the mean squared error of the estimated residual echo magnitude spectrum. We also present a low-complexity real-time implementation combining an offline-trained network with online adaptation, and investigate its performance in terms of echo suppression and speech distortion for real mobile phone recordings.
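Once the residual echo magnitude spectrum is estimated (in the paper, by the neural network), the suppression stage is typically a per-bin spectral gain. The sketch below shows a generic Wiener-like gain with a spectral floor; the estimator is simply assumed given, and the floor value is an illustrative choice, not the paper's.

```python
import numpy as np

def suppress(Y_mag, R_hat, floor=0.1):
    """Attenuate the mic magnitude spectrum Y_mag per frequency bin,
    given an estimate R_hat of the residual-echo magnitude spectrum."""
    gain = 1.0 - (R_hat / np.maximum(Y_mag, 1e-12))
    return np.maximum(gain, floor) * Y_mag        # floor limits musical noise

Y = np.array([1.0, 0.8, 0.5, 0.2])                # observed magnitudes
R = np.array([0.2, 0.7, 0.1, 0.0])                # estimated residual echo
print(suppress(Y, R))                             # -> [0.8, 0.1, 0.4, 0.2]
```

The trade-off the paper measures — echo suppression versus speech distortion — is governed by how aggressive this gain is and how accurate the magnitude estimate is.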
Loudspeaker placement for sound field reproduction by constrained matching pursuit
H. Khalilian, I. Bajić, R. Vaughan
Pub Date: 2013-10-01 | DOI: 10.1109/WASPAA.2013.6701838
We describe a method for approximating a desired sound field in a cubic region using a planar array of omnidirectional loudspeakers. For this purpose, a constrained matching pursuit algorithm is employed to find the appropriate locations of the loudspeakers. Unlike previously proposed methods for sound field approximation, this iterative procedure attempts to approximate the residual error vector at each iteration, leading to a more efficient representation of the desired field as a linear combination of the Acoustic Transfer Functions (ATFs) of the selected loudspeakers. Simulations suggest that the new method offers considerable improvement in approximation accuracy compared to uniformly placed loudspeakers, as well as another recent method for loudspeaker placement.
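The greedy residual-fitting loop above can be sketched as matching-pursuit-style selection over candidate positions: at each iteration pick the ATF column most correlated with the current residual, then refit the gains of all selected loudspeakers by least squares. This is a simplified sketch with illustrative dimensions; the paper's specific constraints are omitted.

```python
import numpy as np

def select_loudspeakers(A, d, n_select):
    """A: (control points x candidate positions) ATF matrix; d: desired field."""
    chosen, r = [], d.copy()
    for _ in range(n_select):
        scores = np.abs(A.conj().T @ r)       # correlation with residual
        scores[chosen] = -np.inf              # never reuse a position
        chosen.append(int(np.argmax(scores)))
        g, *_ = np.linalg.lstsq(A[:, chosen], d, rcond=None)
        r = d - A[:, chosen] @ g              # residual after refitting gains
    return chosen, r

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 30))                 # 50 control points, 30 candidates
d = rng.normal(size=50)                       # mock desired field
chosen, r = select_loudspeakers(A, d, n_select=8)
print(len(chosen), np.linalg.norm(r) / np.linalg.norm(d))
```

Re-fitting against the residual at each step, rather than fixing earlier gains, is what makes the representation more efficient than a one-shot placement.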
Microphone multiplexing with diffuse noise model-based principal component analysis
Sonia Badar, Nobutaka Ono, L. Daudet
Pub Date: 2013-10-01 | DOI: 10.1109/WASPAA.2013.6701877
Reducing the total data throughput for microphone arrays is often necessary, especially when using very large arrays. However, what information can be lost depends on the processing task at the decoder level. In this paper, we investigate simple ways of linearly down-mixing the microphone signals into a reduced number of channels, using non-adaptive coefficients derived from a diffuse noise model, based only on the geometry of the array. In source separation experiments, this multiplexing scheme provides no significant loss in quality even with a high reduction in the number of transmission channels, and outperforms a multiplexing scheme with random coefficients. It furthermore introduces some robustness with respect to the microphone gains and angle from the sources.
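A geometry-only down-mix of the kind described above can be sketched as follows: under a diffuse noise model the coherence between two microphones at spacing d is sin(2πfd/c)/(2πfd/c), so a fixed mixing matrix can be taken as the top eigenvectors of this model coherence matrix, computed from the array geometry alone. Frequency, spacing, and channel counts below are illustrative, not the paper's.

```python
import numpy as np

c = 343.0                                      # speed of sound, m/s
f = 1000.0                                     # analysis frequency, Hz
pos = np.array([[0.05 * i, 0.0, 0.0] for i in range(8)])  # 8-mic line, 5 cm pitch

D = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)  # pairwise distances
Gamma = np.sinc(2 * f * D / c)                 # np.sinc(x) = sin(pi*x)/(pi*x)
eigval, eigvec = np.linalg.eigh(Gamma)         # eigenvalues in ascending order
T = eigvec[:, -3:].T                           # top-3 components: 8 -> 3 channels

x = np.random.default_rng(0).normal(size=(8, 100))  # mock microphone frames
y = T @ x                                      # down-mixed transmission channels
print(T.shape, y.shape)
```

Because T depends only on geometry, it never needs adaptation or transmission, which is what keeps the scheme simple and robust to gain mismatches.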
On the misalignment of stereophonic acoustic echo cancellation with decorrelation by resampling
Jason Wung, Ted S. Wada, M. Souden, B. Juang
Pub Date: 2013-10-01 | DOI: 10.1109/WASPAA.2013.6701811
It is well established that a decorrelation procedure is required in a multi-channel acoustic echo control system to mitigate the so-called non-uniqueness problem. A recently proposed technique that accomplishes decorrelation by resampling (DBR) has been shown to be advantageous; it achieves superior performance in echo reduction gain and offers the possibility of frequency-selective decorrelation to further preserve the sound quality of the system. In this paper, we rigorously analyze the performance of DBR in terms of coherence reduction and the resultant misalignment of an adaptive filter. We derive closed-form expressions for the performance bounds and validate the theoretical analysis with simulation.
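The DBR idea itself is simple to illustrate: slightly resample one channel of the far-end stereo signal so that the two loudspeaker signals are no longer perfectly coherent, which breaks the non-uniqueness of the stereo echo paths. The sketch below uses linear interpolation as a stand-in for a proper resampler, and the resampling ratio is an illustrative choice.

```python
import numpy as np

def resample(x, ratio):
    """Fractionally resample x by `ratio` using linear interpolation."""
    t_out = np.arange(0, len(x) - 1, ratio)
    return np.interp(t_out, np.arange(len(x)), x)

rng = np.random.default_rng(0)
left = rng.normal(size=2000)                 # mock far-end channel
right_dbr = resample(left, ratio=1.001)      # ~0.1% resampling offset

n = min(len(left), len(right_dbr))
rho = np.corrcoef(left[:n], right_dbr[:n])[0, 1]
print(rho)                                   # well below 1: coherence reduced
```

The paper's contribution is quantifying exactly how much coherence reduction such a ratio buys, and what filter misalignment it costs, in closed form.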
Adaptive distance and near-field compensation applied to microphones
W. Etter
Pub Date: 2013-10-01 | DOI: 10.1109/WASPAA.2013.6701856
In voice acquisition, variations of the microphone distance introduce not only level changes, but also frequency response changes due to the near-field effect. This paper presents a method for adaptive distance and near-field compensation based on the talker-to-microphone distance and the microphone polar pattern. If available, the microphone orientation and the critical distance associated with the room acoustics can be taken into account to further improve compensation accuracy. Aimed at teleconference use, the significance of the critical distance for compensation is discussed for office and conference rooms. An example of the performance of the algorithm is provided, in which a sensor is applied to continuously measure a varying microphone distance.
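The two effects named above can be sketched for an ideal pressure-gradient capsule: an overall level change proportional to 1/r, and the near-field bass boost of magnitude sqrt(1 + (c/(2πfr))²) (the proximity effect). A compensation gain undoes both relative to a reference distance. This is a hedged illustration of the physics, not the paper's algorithm; the reference distance and values are illustrative.

```python
import numpy as np

C_SOUND = 343.0  # speed of sound, m/s

def compensation_gain(f, r, r_ref=0.5):
    """Magnitude gain at frequency f (Hz) that normalizes a pressure-gradient
    mic at distance r (m) toward its response at reference distance r_ref."""
    level = r / r_ref                                             # undo 1/r level change
    boost = np.sqrt(1.0 + (C_SOUND / (2 * np.pi * f * r)) ** 2)      # proximity boost at r
    boost_ref = np.sqrt(1.0 + (C_SOUND / (2 * np.pi * f * r_ref)) ** 2)
    return level * boost_ref / boost                              # undo the excess bass boost

f = np.array([100.0, 1000.0, 10000.0])
g = compensation_gain(f, r=0.1)
print(g)   # close talker: strong low-frequency cut, mild cut at high frequencies
```

As the measured distance varies, the gain curve is recomputed continuously, which is the "adaptive" part of the method; the polar pattern determines how strong the gradient (and thus proximity) term is.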