Pub Date: 2013-07-01, DOI: 10.1109/TASL.2013.2248719
Jingdong Chen, J. Benesty
This paper deals with the problem of noise reduction in stereo sound systems where the objective is not only to reduce noise, but also to preserve the spatial information of both the desired speech and noise sources so that the listener can still localize the speech and noise sources by listening to the enhanced binaural outputs. To achieve this objective, we use the widely linear (WL) framework developed previously and convert the problem of binaural noise reduction into one of monaural filtering with complex signals. We then present a way to decompose both the complex speech and noise signal vectors into two orthogonal components: one correlated and the other uncorrelated with the corresponding current signal sample. With this decomposition, the problem of noise reduction with preservation of the spatial information of speech and noise sources is formulated as an optimization problem with two constraints: one on the desired speech and the other on the preservation of the noise signal. We then derive a WL linearly constrained minimum variance (LCMV) filter, which can take advantage of the statistics and noncircularity of the complex speech signal to achieve noise reduction. In contrast to the WL Wiener and minimum variance distortionless response (MVDR) filters developed previously that can only preserve the characteristics and spatial information of the desired sound source, this new WL LCMV filter has the potential to reduce noise while preserving the characteristics and spatial information of both the desired and noise sources at the same time. Experimental results are provided to justify the claimed merits of the proposed WL LCMV filter.
Title: On the Time-Domain Widely Linear LCMV Filter for Noise Reduction With a Stereo System (IEEE Transactions on Audio, Speech, and Language Processing)
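The two-constraint optimization the abstract describes admits the standard LCMV closed form w = R⁻¹C(CᴴR⁻¹C)⁻¹f. The sketch below is an illustration of that closed form only, not the paper's widely linear time-domain derivation; the covariance and constraint matrices are hypothetical random values.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 8  # filter length

# Hypothetical noisy-signal covariance (Hermitian, positive definite);
# in practice it would be estimated from the complex observations.
A = rng.standard_normal((L, L)) + 1j * rng.standard_normal((L, L))
R = A @ A.conj().T + np.eye(L)

# Hypothetical constraint matrix C and response vector f: one constraint
# for the desired speech, one for preserving the noise component.
C = rng.standard_normal((L, 2)) + 1j * rng.standard_normal((L, 2))
f = np.array([1.0, 1.0])

# LCMV: minimize w^H R w subject to C^H w = f.
Rinv_C = np.linalg.solve(R, C)
w = Rinv_C @ np.linalg.solve(C.conj().T @ Rinv_C, f)

print(np.allclose(C.conj().T @ w, f))  # True: both constraints hold
```

The noise-preservation constraint is what distinguishes the LCMV form from the MVDR filter, which keeps only the speech constraint.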
Pub Date: 2013-07-01, DOI: 10.1109/TASL.2013.2253098
Xiaoyan Cai, Wenjie Li
Multi-document summarization aims to create a condensed summary while retaining the main characteristics of the original set of documents. Against this background, sentence ranking has hitherto been the issue of most concern. Since documents often cover a number of topic themes, with each theme represented by a cluster of highly related sentences, sentence clustering has been explored in the literature in order to provide more informative summaries. For each topic theme, the rank of terms conditional on that theme should be very distinct, and quite different from the rank of terms in other topic themes. Existing cluster-based summarization approaches apply clustering and ranking in isolation, which leads to incomplete, or sometimes rather biased, analytical results. A newly emerged framework uses sentence clustering results to improve or refine the sentence ranking results. Under this framework, we propose in this paper a novel approach that directly generates clusters integrated with ranking. The basic idea is that the ranking distributions of sentences in different clusters should be quite distinct from one another; these distributions can serve as features of the clusters, from which new clustering measures for sentences can be calculated. Meanwhile, better clustering results can yield better ranking results. As a result, ranking and clustering are performed by mutually and simultaneously updating each other, so that the performance of both can be improved. The effectiveness of the proposed approach is demonstrated by both the cluster quality analysis and the summarization evaluation conducted on the DUC 2004-2007 datasets.
Title: Ranking Through Clustering: An Integrated Approach to Multi-Document Summarization
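The mutual updating of ranking and clustering can be sketched as a toy alternation in which per-cluster ranking scores both assign sentences to clusters and weight the centroid updates. This is a loose illustration of the idea, not the paper's algorithm: the cosine scoring, the score-weighted centroid rule, and the data are all made-up assumptions.

```python
import numpy as np

def rank_in_cluster(X, centroid):
    # Ranking score: cosine similarity of each sentence vector to a centroid.
    return X @ centroid / (np.linalg.norm(X, axis=1)
                           * np.linalg.norm(centroid) + 1e-12)

def cluster_and_rank(X, k, iters=10, seed=0):
    # Alternate: (1) score sentences against each centroid, (2) assign each
    # sentence to its best-scoring cluster, (3) recompute centroids as
    # ranking-score-weighted means, so ranking and clustering update each other.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        scores = np.stack([rank_in_cluster(X, c) for c in centroids])  # k x n
        assign = scores.argmax(axis=0)
        for j in range(k):
            members = X[assign == j]
            if len(members):
                w = scores[j, assign == j]
                centroids[j] = (w[:, None] * members).sum(0) / (w.sum() + 1e-12)
    return assign, scores

# Two well-separated toy "topic themes"
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
assign, _ = cluster_and_rank(X, k=2)
print(assign[0] == assign[1] and assign[2] == assign[3])  # True
```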
Pub Date: 2013-07-01, DOI: 10.1109/TASL.2013.2255280
S. M. Golan, S. Gannot, I. Cohen
Beamforming with wireless acoustic sensor networks (WASNs) has recently drawn the attention of the research community. As the number of microphones grows it is difficult, and in some applications impossible, to determine their layout beforehand. A common practice in analyzing the expected performance is to utilize statistical considerations. In the current contribution, we consider applying the speech distortion weighted multi-channel Wiener filter (SDW-MWF) to enhance a desired source propagating in a reverberant enclosure where the microphones are randomly located with a uniform distribution. Two noise fields are considered, namely, multiple coherent interference signals and a diffuse sound field. Utilizing the statistics of the acoustic transfer function (ATF), we derive a statistical model for two important criteria of the beamformer (BF): the signal to interference ratio (SIR), and the white noise gain. Moreover, we propose reliability functions, which determine the probability of the SIR and white noise gain to exceed a predefined level. We verify the proposed model with an extensive simulative study.
Title: Performance of the SDW-MWF With Randomly Located Microphones in a Reverberant Enclosure
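For a single desired source, the SDW-MWF has the well-known closed form w = (Φs + μΦn)⁻¹Φs e_ref, where μ trades speech distortion against noise reduction (μ = 1 recovers the plain multichannel Wiener filter). A minimal numerical sketch with hypothetical covariances, separate from the paper's statistical ATF analysis:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 4          # microphones (randomly placed, in the paper's setting)
mu = 2.0       # speech-distortion weight; mu = 1 gives the plain MWF

# Hypothetical rank-1 speech covariance: one source with ATF vector a
a = rng.standard_normal(M) + 1j * rng.standard_normal(M)
Rs = 2.0 * np.outer(a, a.conj())

# Hypothetical noise covariance (random Hermitian positive definite)
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Rn = B @ B.conj().T + 0.1 * np.eye(M)

# SDW-MWF with microphone 0 as reference: w = (Rs + mu*Rn)^(-1) Rs e0
w = np.linalg.solve(Rs + mu * Rn, Rs[:, 0])

# With a rank-1 speech model, w is collinear with Rn^{-1} a (the
# maximum-SNR direction) for any mu; mu only changes its scale.
u = np.linalg.solve(Rn, a)
cos = abs(u.conj() @ w) / (np.linalg.norm(u) * np.linalg.norm(w))
print(round(float(cos), 6))  # 1.0
```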
Pub Date: 2013-06-01, DOI: 10.1109/TASL.2013.2245648
D. Ying, Yonghong Yan
The temporal correlation of speech presence/absence is widely used in noise estimation. The most popular technique for exploiting temporal correlation is the smoothing of noisy spectra using a time-recursive filter, in which the forgetting factor is controlled by speech presence probability. However, this technique is not unified into a theoretical framework that enables optimal noise estimation. In theory, hidden Markov models (HMMs) are superior to this technique in modeling temporal correlation. HMMs can model a time sequence of presence/absence of speech signal as a dynamic process of the transition between speech and non-speech states. Moreover, a number of methods, such as maximum likelihood, are available for optimal estimation of HMM parameters. This paper presents a constrained sequential HMM for modeling the log-power sequence on each frequency band. The emission probability of each HMM state is represented by a Gaussian model. The Gaussian mean of the non-speech state is considered as the optimal estimate of noise logarithmic power. The HMM parameter set is sequentially estimated from one frame to another on the basis of maximum likelihood. The proposed method is compared with well-established algorithms through various experiments. Our method delivers more accurate results and does not rely on the assumption of the “non-speech signal onset” as do most algorithms.
Title: Noise Estimation Using a Constrained Sequential Hidden Markov Model in the Log-Spectral Domain
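The sequential two-state idea can be sketched as a per-band forward recursion with Gaussian emissions on the log power, where the noise-state mean acts as the running noise estimate. This is a simplified toy, not the paper's constrained maximum-likelihood estimator; the transition probability, the fixed speech offset, and the mean-update rule are illustrative assumptions.

```python
import numpy as np

def track_noise_logpower(logpow, stay=0.9, var=1.0, delta=5.0, alpha=0.05):
    # Two states: 0 = noise only, 1 = speech + noise. The noise-state
    # Gaussian mean mu_n is the running noise log-power estimate; the
    # speech-state mean sits `delta` above it.
    A = np.array([[stay, 1 - stay], [1 - stay, stay]])
    mu_n = logpow[0]
    p = np.array([0.9, 0.1])               # state posterior [noise, speech]
    est = np.empty(len(logpow))
    for t, x in enumerate(logpow):
        means = np.array([mu_n, mu_n + delta])
        lik = np.exp(-0.5 * (x - means) ** 2 / var)
        p = lik * (A.T @ p)                # one forward-recursion step
        p /= p.sum()
        mu_n += alpha * p[0] * (x - mu_n)  # update only as "noise" is likely
        est[t] = mu_n
    return est

frames = np.zeros(200)
frames[80:120] = 10.0                      # loud speech burst over 0-dB noise
est = track_noise_logpower(frames)
print(abs(est[-1]) < 0.1, est.max() < 1.0)  # True True: the burst is ignored
```

Because the speech-state posterior gates the mean update, the estimate does not ramp up during speech activity, which is the behavior the recursive-smoothing baselines approximate with a probability-controlled forgetting factor.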
Pub Date: 2013-06-01, DOI: 10.1109/TASL.2013.2245652
I. Bayram, M. Kamasak
We propose a simple prior for restoration problems involving oscillatory signals. The prior makes use of an underlying analytic frame decomposition with narrow subbands. Other than this, the prior does not have any other parameters, which makes it simple to use and apply. We demonstrate the utility of the proposed prior through some real audio restoration experiments.
Title: A Simple Prior for Audio Signals
Pub Date: 2013-06-01, DOI: 10.1109/TASL.2013.2245651
Bin Zhang, Alex Marin, Brian Hutchinson, Mari Ostendorf
This paper introduces methods to discriminatively learn phrase patterns for use as features in text classification. An efficient solution is described using a recursive algorithm with a mutual information selection criterion. The algorithm automatically determines when word classes are useful in specific locations of a phrase pattern, allowing for variable specificity depending on the amount of labeled data available. Experiments are carried out on three text classification tasks in both English and Chinese, resulting in improved performance when adding the phrase patterns to the existing n-gram features.
Title: Learning Phrase Patterns for Text Classification
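The mutual-information selection criterion can be illustrated by scoring each candidate pattern by I(F;Y) between its binary presence indicator F and the class label Y, then keeping the top k. A sketch under simplifying assumptions: plain substring matching stands in for the paper's class-based phrase patterns, and the recursive expansion over word classes is omitted.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    # I(F;Y) in nats between a binary feature and the class labels.
    n = len(labels)
    joint = Counter(zip(feature, labels))
    pf, py = Counter(feature), Counter(labels)
    return sum(c / n * math.log(c * n / (pf[f] * py[y]))
               for (f, y), c in joint.items())

def select_patterns(docs, labels, candidates, k):
    # Score each candidate by the MI of its presence indicator; keep top k.
    scored = sorted(((mutual_information([int(p in d) for d in docs], labels), p)
                     for p in candidates), reverse=True)
    return [p for _, p in scored[:k]]

docs = ["refund please now", "i want a refund please",
        "great product", "love this product"]
labels = [1, 1, 0, 0]
print(select_patterns(docs, labels, ["refund please", "love"], k=1))
# ['refund please']  (perfectly class-separating, so maximal MI)
```

A pattern present in every document gets MI 0, so uninformative but frequent phrases are filtered out automatically.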
Pub Date: 2013-06-01, DOI: 10.1109/TASL.2013.2248716
B. Lecouteux, G. Linarès, Y. Estève, G. Gravier
Combining automatic speech recognition (ASR) systems generally relies on the posterior merging of the outputs or on acoustic cross-adaptation. In this paper, we propose an integrated approach where the outputs of secondary systems are integrated in the search algorithm of a primary one. In this driven decoding algorithm (DDA), the secondary systems are viewed as observation sources that are evaluated and combined with others by the primary search algorithm. DDA is evaluated on a subset of the ESTER I corpus consisting of 4 hours of French radio broadcast news. Results demonstrate that DDA significantly outperforms vote-based approaches: we obtain an improvement of 14.5% relative word error rate over the best single system, as opposed to 6.7% with a ROVER combination. An in-depth analysis of DDA shows its ability to improve robustness (gains are greater in adverse conditions) and a relatively low dependency on the search algorithm: applying DDA to decoders based on different search strategies, including beam search, yields similar performance.
Title: Dynamic Combination of Automatic Speech Recognition Systems by Driven Decoding
Pub Date: 2013-06-01, DOI: 10.1109/TASL.2013.2248723
David Rybach, H. Ney, R. Schlüter
Dynamic network decoders have the advantage of significantly lower memory consumption compared to static network decoders, especially when huge vocabularies and complex language models are required. This paper compares the properties of two well-known search strategies for dynamic network decoding, namely history conditioned lexical tree search and weighted finite-state transducer-based search using on-the-fly transducer composition. The two search strategies share many common principles like the use of dynamic programming, beam search, and many more. We point out the similarities of both approaches and investigate the implications of their differing features, both formally and experimentally, with a focus on implementation independent properties. Therefore, experimental results are obtained with a single decoder by representing the history conditioned lexical tree search strategy in the transducer framework. The properties analyzed cover structure and size of the search space, differences in hypotheses recombination, language model look-ahead techniques, and lattice generation.
The properties analyzed cover structure and size of the search space, differences in hypotheses recombination, language model look-ahead techniques, and lattice generation.
Title: Lexical Prefix Tree and WFST: A Comparison of Two Dynamic Search Concepts for LVCSR
Pub Date: 2013-06-01, DOI: 10.1109/TASL.2013.2248713
Ki-Seung Lee
The present study tested a new stereo playback system that effectively cancels cross-talk signals at an arbitrary listening position. Such a playback system was implemented by integrating listener position tracking techniques and crosstalk cancellation techniques. The entire listening space was partitioned into a number of non-overlapped cells and a crosstalk cancellation filter was assigned to each cell. The listening space partitions and the corresponding crosstalk cancellation filters were constructed by maximizing the average channel separation ratio (CSR). Since the proposed method employed cell-based crosstalk cancellation, estimation of the exact position of the listener was not necessary. Instead, it was only necessary to determine the cell in which the listener was located. This was achieved by simply employing an artificial neural network (ANN) where the time delay to each pair of microphones was used as the ANN input and the ANN output corresponded to the index of cells. The experimental results showed that more than 95% of the experimental listening space had a CSR ≥ 10 dB when the number of clusters exceeded 12. Under these conditions, the correlation between the true directions of the virtual sound sources and the directions recognized by the subjects was greater than 0.9.
Title: Position-Dependent Crosstalk Cancellation Using Space Partitioning
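The cell-determination step can be illustrated by matching a measured TDOA vector against stored per-cell delay signatures. Here a nearest-neighbor lookup stands in for the paper's ANN classifier, and all delay values are hypothetical.

```python
import numpy as np

def locate_cell(tdoa, cell_signatures):
    # Index of the listening-space cell whose stored delay signature is
    # nearest (Euclidean) to the measured TDOA vector over mic pairs.
    return int(np.argmin(np.linalg.norm(cell_signatures - tdoa, axis=1)))

# Hypothetical signatures: 4 cells x 3 microphone pairs (delays in ms)
sigs = np.array([[0.0, 0.1, 0.2],
                 [0.3, 0.0, 0.1],
                 [0.1, 0.3, 0.0],
                 [0.2, 0.2, 0.2]])

measured = np.array([0.28, 0.02, 0.12])   # noisy delays near cell 1
print(locate_cell(measured, sigs))        # 1
```

The returned index would then select the crosstalk cancellation filter precomputed for that cell, so no exact listener position is needed.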
Pub Date: 2013-06-01, DOI: 10.1109/TASL.2013.2248715
Yusuke Hioka, K. Furuya, Kazunori Kobayashi, K. Niwa, Y. Haneda
A method for separating underdetermined sound sources based on a novel power spectral density (PSD) estimation is proposed. The method enables up to M(M-1)+1 sources to be separated when we use a microphone array of M sensors and a Wiener post-filter calculated by the estimated PSDs. The PSD of a beamformer's output is modelled by a mixture of source PSDs multiplied by the beamformer's directivity gain in the particular angle where each source is located. Based on this model, the PSD of each sound source is estimated from the PSD of multiple fixed beamformers' outputs using the difference in the combination of directivity gains. Simulation results proved that the proposed method effectively separated up to M(M-1)+1 sound sources if the fixed beamformers were appropriately selected. Experiments were also conducted in a reverberant chamber to ensure the proposed method was also effective in practical use.
Title: Underdetermined Sound Source Separation Using Power Spectrum Density Estimated by Combination of Directivity Gain
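The core of the estimator is a linear model q = Gp relating the output PSDs q of the fixed beamformers to the source PSDs p through the beamformers' directivity power gains G. A minimal sketch that inverts this model by least squares with hypothetical gain values (the paper's estimator is more elaborate):

```python
import numpy as np

# Hypothetical directivity power gains: 4 fixed beamformers x 3 sources
# (with M = 2 microphones, M(M-1)+1 = 3 sources is the separable limit).
G = np.array([[1.00, 0.10, 0.05],
              [0.08, 1.00, 0.12],
              [0.06, 0.15, 1.00],
              [0.30, 0.30, 0.40]])

true_psd = np.array([2.0, 0.5, 1.0])      # per-source PSDs in one band
q = G @ true_psd                          # observed beamformer-output PSDs

# Invert the mixing model by least squares; clip tiny negatives so the
# estimates remain valid PSDs.
p_hat, *_ = np.linalg.lstsq(G, q, rcond=None)
p_hat = np.clip(p_hat, 0.0, None)
print(np.allclose(p_hat, true_psd))       # True
```

The estimated PSDs would then drive the Wiener post-filter applied to the array output; separability depends on the gain rows being sufficiently distinct, which is why the choice of fixed beamformers matters.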