Maximizing Phoneme Recognition Accuracy for Enhanced Speech Intelligibility in Noise
Petko N. Petkov, G. Henter, W. Kleijn
Pub Date: 2013-05-01. DOI: 10.1109/TASL.2013.2244089
An effective measure of speech intelligibility is the probability of correct recognition of the transmitted message. We propose a speech pre-enhancement method based on matching the recognized text to the text of the original message. The selected criterion is accurately approximated by the probability of the correct transcription given an estimate of the noisy speech features. In the presence of environmental noise, and with a decrease in the signal-to-noise ratio, speech intelligibility declines. We implement a speech pre-enhancement system that optimizes the proposed criterion over the parameters of two distinct speech modification strategies under an energy-preservation constraint. The proposed method requires prior knowledge in the form of a transcription of the transmitted message and acoustic speech models from an automatic speech recognition system. Performance results from an open-set subjective intelligibility test indicate a significant improvement over natural speech and over a reference system that optimizes a perceptual-distortion-based objective intelligibility measure. The computational complexity of the approach permits use in online applications.
A Two-Stage Beamforming Approach for Noise Reduction and Dereverberation
Emanuël Habets, J. Benesty
Pub Date: 2013-05-01. DOI: 10.1109/TASL.2013.2239292
In general, the signal-to-noise ratio and the signal-to-reverberation ratio of speech received by a microphone decrease as the distance between the talker and the microphone increases. Dereverberation and noise reduction algorithms are essential for many applications, such as videoconferencing, hearing aids, and automatic speech recognition, to improve the quality and intelligibility of the received desired speech that is corrupted by reverberation and noise. In the last decade, researchers have aimed at estimating the reverberant desired speech signal as received by one of the microphones. Although this approach has led to practical noise reduction algorithms, the spatial diversity of the received desired signal is not exploited to dereverberate the speech signal. In this paper, a two-stage beamforming approach is presented for dereverberation and noise reduction. In the first stage, a signal-independent beamformer generates a reference signal that contains a dereverberated version of the desired speech signal as received at the microphones, plus residual noise. In the second stage, the filtered microphone signals and the noisy reference signal are used to obtain an estimate of the dereverberated desired speech signal. In this stage, different signal-dependent beamformers can be used depending on the desired operating point in terms of noise reduction and speech distortion. The presented performance evaluation demonstrates the effectiveness of the proposed two-stage approach.
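The first stage described above is signal-independent. As a hedged illustration only (not the paper's actual filters), the sketch below shows the simplest beamformer of that kind, delay-and-sum, where the delays, array geometry, and signals are all made up for the example; averaging M time-aligned channels attenuates uncorrelated sensor noise power by roughly 1/M.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: one desired signal reaching 4 microphones with
# known integer-sample delays, plus independent sensor noise.
n = 4000
delays = [0, 3, 5, 8]                             # assumed known propagation delays
s = np.sin(2 * np.pi * 5 * np.arange(n) / 1000)   # desired signal (periodic in n)
mics = [np.roll(s, d) + 0.5 * rng.standard_normal(n) for d in delays]

# Signal-independent (delay-and-sum) beamformer: undo each delay, then average.
aligned = np.stack([np.roll(x, -d) for x, d in zip(mics, delays)])
y = aligned.mean(axis=0)

noise_single = np.var(mics[0] - s)   # mic 0 has zero delay, so this is its noise
noise_beam = np.var(y - s)
print(noise_single, noise_beam)
```

With four microphones the residual noise power at the beamformer output is about a quarter of that at a single microphone, which is the spatial-diversity gain the two-stage approach builds on.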
Machine Learning Paradigms for Speech Recognition: An Overview
L. Deng, Xiao Li
Pub Date: 2013-05-01. DOI: 10.1109/TASL.2013.2244083
Automatic Speech Recognition (ASR) has historically been a driving force behind many machine learning (ML) techniques, including the ubiquitously used hidden Markov model, discriminative learning, structured sequence learning, Bayesian learning, and adaptive learning. Moreover, ML can and occasionally does use ASR as a large-scale, realistic application to rigorously test the effectiveness of a given technique, and to inspire new problems arising from the inherently sequential and dynamic nature of speech. On the other hand, even though ASR is available commercially for some applications, it remains largely an unsolved problem: for almost all applications, the performance of ASR is not on par with human performance. New insights from modern ML methodology show great promise to advance the state of the art in ASR technology. This article provides readers with an overview of modern ML techniques as utilized in current ASR research and systems, and as relevant to future ones. The intent is to foster greater cross-pollination between the ML and ASR communities than has occurred in the past. The article is organized according to the major ML paradigms that are either already popular in ASR or have the potential to make significant contributions to ASR technology. The paradigms presented and elaborated in this overview include: generative and discriminative learning; supervised, unsupervised, semi-supervised, and active learning; adaptive and multi-task learning; and Bayesian learning. These learning paradigms are motivated and discussed in the context of ASR technology and applications. We finally present and analyze recent developments in deep learning and learning with sparse representations, focusing on their direct relevance to advancing ASR technology.
A Graph-Partitioning Framework for Aligning Hierarchical Topic Structures to Presentations
Xiao-Dan Zhu, Colin Cherry, Gerald Penn
Pub Date: 2013-05-01. DOI: 10.1109/TASL.2013.2244084
This paper studies the problem of imposing an existing hierarchical semantic structure onto a corresponding spoken document in which the structures are embedded, with the goal of indexing such documents for easier access. We propose a graph-partitioning framework to solve a semantic tree-to-string alignment problem through optimizing a normalized-cut criterion. We present models with different modeling capabilities and time complexities in this framework and provide experimental evidence of their performance. We relate graph partitioning to conventional dynamic time warping (DTW) as it applies to this problem, and show that the proposed framework can naturally include topic segmentation to accommodate cohesion constraints.
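The normalized-cut criterion mentioned above is commonly handled via a spectral relaxation. As a minimal sketch (a toy graph invented for illustration, not the paper's alignment graphs), thresholding the second eigenvector of the normalized graph Laplacian bipartitions the graph along a weak cut:

```python
import numpy as np

# Toy graph: two tight groups {0,1,2} and {3,4,5} joined by one weak edge.
A = np.zeros((6, 6))
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
         (2, 3, 0.1)]                       # the weak cross link
for i, j, w in edges:
    A[i, j] = A[j, i] = w

d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_sym = np.eye(6) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian

# Eigenvectors come back in ascending eigenvalue order; column 1 is the
# relaxed normalized-cut indicator (Fiedler direction).
vals, vecs = np.linalg.eigh(L_sym)
fiedler = D_inv_sqrt @ vecs[:, 1]
labels = (fiedler > 0).astype(int)
print(labels)
```

The sign pattern of the relaxed indicator separates the two groups, cutting only the weight-0.1 edge, which is exactly what the normalized-cut objective favors.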
A New Variable Regularized QR Decomposition-Based Recursive Least M-Estimate Algorithm—Performance Analysis and Acoustic Applications
S. Chan, Y. Chu, Z. G. Zhang, K. Tsui
Pub Date: 2013-05-01. DOI: 10.1109/TASL.2012.2236315
This paper proposes a new variable regularized QR decomposition (QRD)-based recursive least M-estimate (VR-QRRLM) adaptive filter and studies its convergence performance and acoustic applications. First, variable L2 regularization is introduced into an efficient QRD-based implementation of the conventional RLM algorithm to reduce its variance and improve its numerical stability. Difference equations describing the convergence behavior of this algorithm for Gaussian inputs and additive contaminated-Gaussian noise are derived, from which new expressions for the steady-state excess mean square error (EMSE) are obtained. They suggest that regularization helps to reduce the variance, especially when the input covariance matrix is ill-conditioned due to a lack of excitation, at the cost of a slightly increased bias. Moreover, the advantage of the M-estimation algorithm over its least squares counterpart is analytically quantified. For white Gaussian inputs, a new formula for selecting the regularization parameter is derived from the MSE analysis, which leads to the proposed VR-QRRLM algorithm. Its application to acoustic path identification and active noise control (ANC) problems is then studied, and a new filtered-x (FX) VR-QRRLM ANC algorithm is derived. The performance of this new ANC algorithm under impulsive noise and regularization can be characterized by the proposed theoretical analysis. Simulation results show that the VR-QRRLM-based algorithms considerably outperform traditional algorithms when the input signal level is low or in the presence of impulsive noise, and the theoretical predictions are in good agreement with the simulation results.
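The robustness advantage of M-estimation over least squares claimed above can be seen in the simplest possible setting. The sketch below is a hypothetical illustration only, using Huber-weighted location estimation rather than adaptive filtering: the bounded influence function caps what impulsive outliers can do to the estimate, whereas the plain mean is dragged off target.

```python
import numpy as np

rng = np.random.default_rng(4)

# 200 well-behaved samples around 1.0 plus three large impulses.
data = np.concatenate([rng.normal(1.0, 0.1, 200),
                       np.array([50.0, -40.0, 60.0])])

def huber_location(x, k=0.5, iters=50):
    """Iteratively reweighted Huber M-estimate of location (illustrative)."""
    mu = np.median(x)
    for _ in range(iters):
        r = x - mu
        # Huber weights: 1 inside the threshold, k/|r| outside (bounded influence).
        w = np.where(np.abs(r) <= k, 1.0, k / np.abs(r))
        mu = np.sum(w * x) / np.sum(w)
    return mu

ls = data.mean()            # least squares estimate, pulled away by the impulses
rob = huber_location(data)  # M-estimate, stays near the true value 1.0
print(ls, rob)
```

The same bounded-influence idea, applied to the recursive residuals of an adaptive filter, is what distinguishes the RLM family from ordinary recursive least squares.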
Binaural Integrated Active Noise Control and Noise Reduction in Hearing Aids
R. Serizel, M. Moonen, J. Wouters, S. H. Jensen
Pub Date: 2013-05-01. DOI: 10.1109/TASL.2012.2234111
This paper presents a binaural approach to integrated active noise control and noise reduction in hearing aids and aims at demonstrating that a binaural setup indeed provides significant advantages in terms of the number of noise sources that can be compensated for and in terms of the causality margins.
Automatic Adaptation of the Time-Frequency Resolution for Sound Analysis and Re-Synthesis
M. Liuni, A. Röbel, E. Matusiak, M. Romito, X. Rodet
Pub Date: 2013-05-01. DOI: 10.1109/TASL.2013.2239989
We present an algorithm for sound analysis and re-synthesis with local automatic adaptation of the time-frequency resolution. The reconstruction formula we propose is highly efficient and gives a good approximation of the original signal from analyses with different time-varying resolutions within complementary frequency bands. This is a typical case in which perfect reconstruction cannot, in general, be achieved with fast algorithms, leaving a reconstruction error to be minimized. We provide a theoretical upper bound on the reconstruction error of our method, and an example of automatic adaptive analysis and re-synthesis of a music sound.
Memory and Computation Trade-Offs for Efficient I-Vector Extraction
Sandro Cumani, P. Laface
Pub Date: 2013-05-01. DOI: 10.1109/TASL.2013.2239291
This work aims at reducing the memory demand of the data structures that are usually pre-computed and stored for fast computation of i-vectors, a compact representation of spoken utterances used by most state-of-the-art speaker recognition systems. We propose two new approaches that allow accurate i-vector extraction while requiring less memory, and we show their relations with the standard computation method introduced for eigenvoices and with the recently proposed fast eigen-decomposition technique. The first approach computes an i-vector in a Variational Bayes (VB) framework by iteratively estimating one sub-block of i-vector elements at a time while keeping all the others fixed; it can obtain i-vectors as accurate as those obtained by the standard technique while requiring only 25% of its memory. The second technique is based on the Conjugate Gradient solution of a linear system, which is accurate and uses even less memory, but is slower than the VB approach. We analyze and compare the time and memory resources required by these solutions, which are suited to different applications, and we show that it is possible to obtain accurate results with a greatly reduced memory demand compared to the standard solution, at almost the same speed.
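The second technique above rests on the fact that conjugate gradient solves a symmetric positive-definite system using only matrix-vector products, so large pre-computed factorizations never need to be stored. A minimal sketch, with a stand-in system rather than the actual i-vector normal equations:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the i-vector posterior equations: a symmetric
# positive-definite system A @ w = b. CG only ever needs the action
# v -> A @ v, which is what saves memory relative to storing factorizations.
d = 50
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)   # SPD by construction
b = rng.standard_normal(d)

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=200):
    """Plain conjugate gradient for SPD systems, given only a matvec."""
    x = np.zeros_like(b)
    r = b - matvec(x)          # initial residual
    p = r.copy()               # initial search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)  # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p   # A-conjugate direction update
        rs = rs_new
    return x

w = conjugate_gradient(lambda v: A @ v, b)
print(np.linalg.norm(A @ w - b))
```

In exact arithmetic CG terminates in at most d iterations; in practice it is stopped early at a residual tolerance, which is the speed-for-memory trade the abstract describes.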
Elimination of Impulsive Disturbances From Archive Audio Signals Using Bidirectional Processing
M. Niedźwiecki, M. Ciołek
Pub Date: 2013-05-01. DOI: 10.1109/TASL.2013.2244090
In this application-oriented paper we consider the problem of eliminating impulsive disturbances, such as clicks, pops, and record scratches, from archive audio recordings. The proposed approach is based on bidirectional processing: noise pulses are localized by combining the results of forward-time and backward-time signal analysis. Based on the results of specially designed empirical tests (rather than on theoretical analysis), incorporating real audio files corrupted by real impulsive disturbances, we work out a set of local, case-dependent fusion rules that can be used to combine forward and backward detection alarms. This allows us to localize noise pulses more accurately and more reliably, yielding noticeable performance improvements compared to traditional methods based on unidirectional processing. The proposed approach is carefully validated using both artificially corrupted audio files and real archive gramophone recordings.
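The bidirectional idea above can be sketched in a deliberately simplified form. The paper's detectors are model-based and its fusion rules are case-dependent; here, as an assumption-laden toy, one-sample differences stand in for forward- and backward-time prediction residuals, and a single AND rule stands in for the fusion: each direction alarms on the click and on one neighboring sample, so intersecting the two directions pins down exactly the corrupted sample.

```python
import numpy as np

rng = np.random.default_rng(3)

# Smooth signal with light noise, corrupted by three impulsive clicks.
n = 2000
x = np.sin(2 * np.pi * 3 * np.arange(n) / n) + 0.01 * rng.standard_normal(n)
clicks = [300, 900, 1500]
x[clicks] += 4.0

# Forward-time and backward-time residual proxies (one-sample differences).
fwd = np.abs(np.diff(x, prepend=x[0]))               # |x[t] - x[t-1]|
bwd = np.abs(np.diff(x[::-1], prepend=x[-1]))[::-1]  # |x[t] - x[t+1]|
thr = 1.0

# Fusion rule (toy AND): only the click sample is anomalous in BOTH directions;
# its neighbors trigger one direction each and are correctly ignored.
fused = (fwd > thr) & (bwd > thr)
detected = np.flatnonzero(fused)

# Repair flagged samples by interpolating from their clean neighbors.
y = x.copy()
y[detected] = 0.5 * (x[detected - 1] + x[detected + 1])
print(detected)
```

A purely forward detector would flag the click and the sample after it; a purely backward one, the click and the sample before it. Fusing the two directions is what sharpens the localization, which is the effect the paper engineers with far more care.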
Single-Channel Speech-Music Separation for Robust ASR With Mixture Models
Cemil Demir, M. Saraçlar, A. Cemgil
Pub Date: 2013-04-01. DOI: 10.1109/TASL.2012.2231072
In this study, we describe a mixture-model-based single-channel speech-music separation method. Given a catalog of background music material, we propose a generative model for the superposed speech and music spectrograms. The background music signal is assumed to be generated by a jingle in the catalog, and the background music component is modeled by a scaled conditional mixture model representing that jingle. The speech signal is modeled by a probabilistic model similar to the probabilistic interpretation of the Non-negative Matrix Factorization (NMF) model. The parameters of the speech model are estimated in a semi-supervised manner from the mixed signal. The approach is tested with Poisson and complex Gaussian observation models, which correspond to the Kullback-Leibler (KL) and Itakura-Saito (IS) divergence measures, respectively. Our experiments show that the proposed mixture model outperforms a standard NMF method in both speech-music separation and automatic speech recognition (ASR) tasks. These results are further improved using Markovian prior structures that enforce temporal continuity between the jingle frames. Our test results with real data show that our method increases speech recognition performance.
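The standard NMF baseline the study compares against can be sketched compactly. This is a hedged illustration only: the spectrogram is random, both dictionaries are learned jointly and split arbitrarily between "speech" and "music" components, whereas in the paper the music side would come from the jingle catalog. The multiplicative updates below minimize the generalized KL divergence (the Poisson-model case), and a Wiener-style mask splits the mixture.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy nonnegative "spectrogram": F frequency bins, T frames, K components.
F, T, K = 30, 40, 4
V = rng.random((F, T)) + 1e-3
W = rng.random((F, K)) + 1e-3   # spectral dictionary
H = rng.random((K, T)) + 1e-3   # activations
eps = 1e-12

def kl_div(V, WH):
    """Generalized KL divergence D(V || WH)."""
    return np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH)

d0 = kl_div(V, W @ H)
for _ in range(100):
    # Standard Lee-Seung multiplicative updates for the KL cost.
    WH = W @ H + eps
    H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
    WH = W @ H + eps
    W *= ((V / WH) @ H.T) / (np.ones_like(V) @ H.T + eps)
d1 = kl_div(V, W @ H)

# Wiener-style reconstruction: assign 2 components per source (arbitrary here).
WH = W @ H + eps
speech = V * (W[:, :2] @ H[:2]) / WH
music = V * (W[:, 2:] @ H[2:]) / WH
print(d0, d1)
```

The updates are guaranteed not to increase the KL cost, and the two masked estimates sum back to the mixture by construction, which is the property that makes masking a convenient separation front end for ASR.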