Range Guided Depth Refinement and Uncertainty-Aware Aggregation for View Synthesis
Pub Date: 2021-06-06 | DOI: 10.1109/ICASSP39728.2021.9413981
Yuan Chang, Yisong Chen, Guoping Wang
In this paper, we present a view synthesis framework comprising range-guided depth refinement and uncertainty-aware aggregation for novel view synthesis. We first propose a novel depth refinement method to improve the quality and robustness of depth map reconstruction. To that end, we use a range prior to constrain the estimated depth, which yields more accurate depth information. We then propose an uncertainty-aware aggregation method for novel view synthesis: we compute the uncertainty of the estimated depth at each pixel and reduce the influence of pixels with large uncertainty when synthesizing novel views. This step suppresses artifacts such as ghosting and blur. We validate our algorithm experimentally and show that it achieves state-of-the-art performance.
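As a rough illustration of the aggregation step, the sketch below blends warped source views with per-pixel weights that shrink as depth uncertainty grows. The exponential weighting and the `beta` parameter are assumptions for illustration, not the authors' formulation.

```python
import numpy as np

def aggregate_views(warped_views, uncertainties, beta=1.0, eps=1e-8):
    """Blend warped source views into a novel view, down-weighting
    pixels whose depth estimates are uncertain.

    warped_views  : (N, H, W, 3) source views warped to the target pose
    uncertainties : (N, H, W) per-pixel depth uncertainty per view
    beta          : hypothetical sharpness parameter for the weighting
    """
    # Larger uncertainty -> exponentially smaller contribution.
    weights = np.exp(-beta * uncertainties)                 # (N, H, W)
    weights = weights / (weights.sum(axis=0) + eps)         # normalize over views
    return (weights[..., None] * warped_views).sum(axis=0)  # (H, W, 3)

# Toy usage: two warped views of a 4x4 image.
views = np.random.rand(2, 4, 4, 3)
unc = np.random.rand(2, 4, 4)
novel = aggregate_views(views, unc)
```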
{"title":"Range Guided Depth Refinement and Uncertainty-Aware Aggregation for View Synthesis","authors":"Yuan Chang, Yisong Chen, Guoping Wang","doi":"10.1109/ICASSP39728.2021.9413981","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413981","url":null,"abstract":"In this paper, we present a framework of view synthesis, including range guided depth refinement and uncertainty-aware aggregation based novel view synthesis. We first propose a novel depth refinement method to improve the quality and robustness of the depth map reconstruction. To that end, we use a range prior to constrain the estimated depth, which helps us to get more accurate depth information. Then we propose an uncertainty-aware aggregation method for novel view synthesis. We compute the uncertainty of the estimated depth for each pixel, and reduce the influence of pixels whose uncertainty are large when synthesizing novel views. This step helps to reduce some artifacts such as ghost and blur. We validate the performance of our algorithm experimentally, and we show that our approach achieves state-of-the-art performance.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123999494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RGLN: Robust Residual Graph Learning Networks via Similarity-Preserving Mapping on Graphs
Pub Date: 2021-06-06 | DOI: 10.1109/ICASSP39728.2021.9414792
Jiaxiang Tang, Xiang Gao, Wei Hu
Graph Convolutional Neural Networks (GCNNs) extend CNNs to irregular graph data domains, such as brain networks, citation networks, and 3D point clouds. Identifying an appropriate graph for the basic operations in GCNNs is critical. Existing methods often manually construct, or learn, one fixed graph based on known connectivities, which may be sub-optimal. To this end, we propose a residual graph learning paradigm to infer edge connectivities and weights, cast as distance metric learning under a low-rank assumption and a similarity-preserving regularization. In particular, we learn the underlying graph via a similarity-preserving mapping that keeps similar nodes close and pushes dissimilar nodes apart. Extensive experiments on semi-supervised learning over citation networks and 3D point clouds show that we achieve state-of-the-art performance in both accuracy and robustness.
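The sketch below illustrates the general shape of such a graph-learning step: a low-rank metric defines pairwise distances, a kernel turns them into learned edge weights, and the result is mixed with the known connectivity as a residual. The random factor `P` and the mixing weight `alpha` are placeholders; in the paper the metric is optimized under the similarity-preserving loss.

```python
import numpy as np

def residual_graph(features, base_adj, rank=4, sigma=1.0, alpha=0.5, seed=0):
    """Sketch: infer a residual graph from a low-rank distance metric.

    d(i, j)^2 = (x_i - x_j)^T P P^T (x_i - x_j), with P low-rank;
    A_learned = exp(-d^2 / sigma); the final graph mixes the known
    connectivity with the learned one.
    """
    n, d = features.shape
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(d, rank)) / np.sqrt(d)   # low-rank metric factor
    z = features @ P                              # project into metric space
    sq = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    learned = np.exp(-sq / sigma)                 # kernelized edge weights
    np.fill_diagonal(learned, 0.0)
    return (1 - alpha) * base_adj + alpha * learned  # residual combination

X = np.random.rand(5, 8)
A0 = (np.random.rand(5, 5) > 0.5).astype(float)
A = residual_graph(X, A0)
```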
{"title":"RGLN: Robust Residual Graph Learning Networks via Similarity-Preserving Mapping on Graphs","authors":"Jiaxiang Tang, Xiang Gao, Wei Hu","doi":"10.1109/ICASSP39728.2021.9414792","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414792","url":null,"abstract":"Graph Convolutional Neural Networks (GCNNs) extend CNNs to irregular graph data domain, such as brain networks, citation networks and 3D point clouds. It is critical to identify an appropriate graph for basic operations in GCNNs. Existing methods often manually construct or learn one fixed graph based on known connectivities, which may be sub-optimal. To this end, we propose a residual graph learning paradigm to infer edge connectivities and weights in graphs, which is cast as distance metric learning under a low-rank assumption and a similarity-preserving regularization. In particular, we learn the underlying graph based on similarity-preserving mapping on graphs, which keeps similar nodes close and pushes dissimilar nodes away. Extensive experiments on semi-supervised learning of citation networks and 3D point clouds show that we achieve the state-of-the-art performance in terms of both accuracy and robustness.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124009029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving Ultrasound Tongue Contour Extraction Using U-Net and Shape Consistency-Based Regularizer
Pub Date: 2021-06-06 | DOI: 10.1109/ICASSP39728.2021.9414420
Ming Feng, Yin Wang, Kele Xu, Huaimin Wang, Bo Ding
B-mode ultrasound tongue imaging is widely used to visualize tongue motion due to its appealing properties. Extracting the tongue surface contour from a B-mode ultrasound image remains a challenge, yet it is a prerequisite for further quantitative analysis. Recently, deep learning-based approaches have been adopted for this task. However, standard deep models fail to recover faint contours that arise when the ultrasound wave travels parallel to the tongue surface. To address faint or missing contours in a sequence, we explore a shape consistency-based regularizer that takes sequential information into account. By incorporating the regularizer, the deep model not only extracts frame-specific contours but also enforces similarity between contours extracted from adjacent frames. Extensive experiments on both synthetic and real ultrasound tongue imaging datasets demonstrate the effectiveness of the proposed method. To promote research in this field, we have released our code.1
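One plausible form of such a regularizer is sketched below: an L2 penalty on differences between contour probability maps of adjacent frames, added to the usual per-frame segmentation loss. The exact loss in the paper may differ.

```python
import torch

def shape_consistency_loss(pred_maps, lam=0.1):
    """Sketch of a sequence-level regularizer (assumed form, not the
    authors' exact loss): penalize differences between contour
    probability maps predicted for adjacent ultrasound frames.

    pred_maps : (T, H, W) per-frame contour probabilities from a U-Net
    lam       : hypothetical regularization weight
    """
    # L2 distance between consecutive frame predictions.
    temporal_diff = pred_maps[1:] - pred_maps[:-1]
    return lam * (temporal_diff ** 2).mean()

# Usage inside a training step (seg_loss is the usual per-frame term):
# loss = seg_loss(pred_maps, labels) + shape_consistency_loss(pred_maps)
```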
{"title":"Improving Ultrasound Tongue Contour Extraction Using U-Net and Shape Consistency-Based Regularizer","authors":"Ming Feng, Yin Wang, Kele Xu, Huaimin Wang, Bo Ding","doi":"10.1109/ICASSP39728.2021.9414420","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414420","url":null,"abstract":"B-mode ultrasound tongue imaging is widely used to visualize the tongue motion, due to its appearing properties. Extracting the tongue surface contour in the B-mode ultrasound image is still a challenge, while it is a prerequisite for further quantitative analysis. Recently, deep learning-based approach has been adopted in this task. However, the standard deep models fail to address faint contour when the ultrasound wave goes parallel to the tongue surface. To address the faint or missing contours in the sequence, we explore the shape consistency-based regularizer, which can take sequential information into account. By incorporating the regularizer, the deep model not only can extract frame-specific contours, but also can enforce the similarity between the contours extracted from adjacent frames. Extensive experiments are conducted both on the synthetic and real ultrasound tongue imaging dataset and the results demonstrate the effectiveness of proposed method. To better promote the research in this field, we have released our code at1.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124056902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On Loss Functions for Deep-Learning Based T60 Estimation
Pub Date: 2021-06-06 | DOI: 10.1109/ICASSP39728.2021.9414442
Yuying Li, Yuchen Liu, D. Williamson
Reverberation time, T60, directly influences the amount of reverberation in a signal, and estimating it directly may aid dereverberation. Traditionally, T60 estimation has relied on signal-processing or probabilistic approaches; only recently have deep-learning approaches been developed. Unfortunately, the appropriate loss function for training such networks has not been adequately determined. In this paper, we propose a composite classification- and regression-based cost function for training a deep neural network that predicts T60 for a variety of reverberant signals. We investigate pure-classification, pure-regression, and combined classification-regression loss functions, additionally incorporating computational measures of success. Our results reveal that our composite loss function leads to the best performance compared to other loss functions and comparison approaches. We also show that this combined loss function helps with generalization.
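A hedged sketch of one way to combine the two terms is shown below: cross-entropy over discretized T60 bins plus mean-squared error on the continuous prediction, mixed by a weight `w`. The bin edges and the weighting are assumptions, not the paper's exact composition.

```python
import torch
import torch.nn.functional as F

def composite_t60_loss(class_logits, reg_pred, t60_true, bin_edges, w=0.5):
    """Sketch of a combined classification + regression objective for
    T60 estimation (assumed weighting; the paper's composition may differ).

    class_logits : (B, K) logits over K discretized T60 bins
    reg_pred     : (B,) continuous T60 predictions in seconds
    t60_true     : (B,) ground-truth T60 in seconds
    bin_edges    : (K+1,) edges that discretize the T60 range
    """
    # Map each continuous target to its bin index for the CE term.
    bins = torch.bucketize(t60_true, bin_edges[1:-1])
    ce = F.cross_entropy(class_logits, bins)
    mse = F.mse_loss(reg_pred, t60_true)
    return w * ce + (1.0 - w) * mse

# Toy usage: 8 samples, 10 bins covering 0-1.5 s.
logits = torch.randn(8, 10)
pred = torch.rand(8) * 1.5
true = torch.rand(8) * 1.5
edges = torch.linspace(0.0, 1.5, 11)
loss = composite_t60_loss(logits, pred, true, edges)
```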
{"title":"On Loss Functions for Deep-Learning Based T60 Estimation","authors":"Yuying Li, Yuchen Liu, D. Williamson","doi":"10.1109/ICASSP39728.2021.9414442","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414442","url":null,"abstract":"Reverberation time, T60, directly influences the amount of reverberation in a signal, and its direct estimation may help with dereverberation. Traditionally, T60 estimation has been done using signal processing or probabilistic approaches, until recently where deep-learning approaches have been developed. Unfortunately, the appropriate loss function for training the network has not been adequately determined. In this paper, we propose a composite classification- and regression-based cost function for training a deep neural network that predicts T60 for a variety of reverberant signals. We investigate pure-classification, pure-regression, and combined classification-regression based loss functions, where we additionally incorporate computational measures of success. Our results reveal that our composite loss function leads to the best performance as compared to other loss functions and comparison approaches. We also show that this combined loss function helps with generalization.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124125739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improved Intra Mode Coding Beyond AV1
Pub Date: 2021-06-06 | DOI: 10.1109/ICASSP39728.2021.9413420
Yize Jin, Liang Zhao, Xin Zhao, Shangyi Liu, A. Bovik
In AOMedia Video 1 (AV1), directional intra prediction modes model local texture patterns that exhibit directionality. Each intra prediction direction is represented by a nominal mode index and a delta angle. The delta angle is entropy coded using a context shared between luma and chroma, and the context is derived from the associated nominal mode. In this paper, two methods are proposed to further reduce the signaling cost of delta angles: cross-component delta angle coding and context-adaptive delta angle coding, which exploit the cross-component and spatial correlation of the delta angles, respectively. The proposed methods were implemented on top of a recent version of libaom. Experimental results show that cross-component delta angle coding alone achieved an average 0.4% BD-rate reduction with a 4% encoding-time saving under all-intra configurations. Combining both methods achieves an average 1.2% BD-rate reduction.
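The toy sketch below conveys the cross-component idea only: coding the chroma delta angle under a probability model conditioned on the collocated luma delta angle makes agreeing deltas cheap. The context derivation and the probability tables are hypothetical, not the libaom implementation.

```python
import numpy as np

def code_cost_bits(symbol, probs):
    """Ideal entropy-coding cost (in bits) of a symbol under a model."""
    return -np.log2(probs[symbol])

def chroma_delta_cost(chroma_delta, luma_delta, cond_probs, max_delta=3):
    """Toy cross-component delta angle coding: the chroma delta angle
    is coded under a probability model conditioned on the collocated
    luma delta angle, so correlated luma/chroma directions cost fewer bits.

    cond_probs : (2*max_delta+1, 2*max_delta+1) hypothetical tables,
                 rows indexed by luma delta, columns by chroma delta
    """
    ctx = luma_delta + max_delta    # context chosen from the luma delta
    sym = chroma_delta + max_delta  # chroma symbol to code
    return code_cost_bits(sym, cond_probs[ctx])

# A row peaked on the diagonal makes matching chroma deltas cheap.
table = np.full((7, 7), 0.05)
np.fill_diagonal(table, 0.70)
table /= table.sum(axis=1, keepdims=True)
print(chroma_delta_cost(1, 1, table))   # ~0.5 bits: agrees with luma
print(chroma_delta_cost(-2, 1, table))  # ~4.3 bits: disagrees
```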
{"title":"Improved Intra Mode Coding Beyond Av1","authors":"Yize Jin, Liang Zhao, Xin Zhao, Shangyi Liu, A. Bovik","doi":"10.1109/ICASSP39728.2021.9413420","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413420","url":null,"abstract":"In AOMedia Video 1 (AV1), directional intra prediction modes are applied to model local texture patterns that present certain directionality. Each intra prediction direction is represented with a nominal mode index and a delta angle. The delta angle is entropy coded using shared context between luma and chroma, and the context is derived using the associated nominal mode. In this paper, two methods are proposed to further reduce the signaling cost of delta angles: cross-component delta angle coding, and context-adaptive delta angle coding, whereby the cross-component and spatial correlation of the delta angles are explored, respectively. The proposed methods were implemented on top of a recent version of libaom. Experimental results show that the proposed cross-component delta angle coding achieved average 0.4% BD-rate reduction with 4% encoding time saving over all intra configurations. By combining both methods, an average 1.2% BD-rate reduction is achieved.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126482629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HSAN: A Hierarchical Self-Attention Network for Multi-Turn Dialogue Generation
Pub Date: 2021-06-06 | DOI: 10.1109/ICASSP39728.2021.9413753
Yawei Kong, Lu Zhang, Can Ma, Cong Cao
In a multi-turn dialogue system, response generation depends not only on the sentences in the context but also on the words within each utterance. Although many methods model words and utterances with attention, problems remain, such as a tendency to generate generic responses. In this paper, we propose a hierarchical self-attention network, named HSAN, which attends to the important words and utterances in the context simultaneously. First, a hierarchical encoder updates the word and utterance representations together with their respective position information. Second, the response representations are updated by the masked self-attention module in the decoder. Finally, the relevance between utterances and the response is computed by another self-attention module and used in the next response decoding step. In terms of both automatic metrics and human judgements, experimental results show that HSAN significantly outperforms all baselines on two common public datasets.
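The sketch below reduces the two-level idea to single-query attention pooling: word-level attention produces an utterance vector, and utterance-level attention produces a context vector. HSAN itself uses full self-attention blocks; the `Linear` scorers here are simplified stand-ins.

```python
import torch
import torch.nn.functional as F

def attention_pool(h, scorer):
    """Softmax-weighted pooling: h is (..., L, D); scorer maps D -> 1."""
    scores = scorer(h).squeeze(-1)            # (..., L) attention logits
    w = F.softmax(scores, dim=-1)             # attention weights
    return (w.unsqueeze(-1) * h).sum(dim=-2)  # (..., D) pooled vector

D = 16
word_scorer = torch.nn.Linear(D, 1)  # hypothetical word-level scorer
utt_scorer = torch.nn.Linear(D, 1)   # hypothetical utterance-level scorer

# 3 utterances of 5 words each, already embedded with positions.
words = torch.randn(3, 5, D)
utt_repr = attention_pool(words, word_scorer)   # (3, D) per-utterance
context = attention_pool(utt_repr, utt_scorer)  # (D,) dialogue context
```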
{"title":"HSAN: A Hierarchical Self-Attention Network for Multi-Turn Dialogue Generation","authors":"Yawei Kong, Lu Zhang, Can Ma, Cong Cao","doi":"10.1109/ICASSP39728.2021.9413753","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413753","url":null,"abstract":"In the multi-turn dialogue system, response generation is not only related to the sentences in context but also relies on the words in each utterance. Although there are lots of methods that pay attention to model words and utterances, there still exist problems such as tending to generate common responses. In this paper, we propose a hierarchical self-attention network, named HSAN, which attends to the important words and utterances in context simultaneously. Firstly, we use the hierarchical encoder to update the word and utterance representations with their position information respectively. Secondly, the response representations are updated by the mask self-attention module in the decoder. Finally, the relevance between utterances and response is computed by another self-attention module and used for the next response decoding process. In terms of automatic metrics and human judgements, experimental results show that HSAN significantly outperforms all baselines on two common public datasets.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125657885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Seizure Detection Using Power Spectral Density via Hyperdimensional Computing
Pub Date: 2021-06-06 | DOI: 10.1109/ICASSP39728.2021.9414083
Lulu Ge, K. Parhi
Hyperdimensional (HD) computing holds promise for two-class classification. This paper explores seizure detection from the electroencephalogram (EEG) of subjects with epilepsy using HD computing based on power spectral density (PSD) features. We use publicly available intracranial EEG (iEEG) data collected from 4 dogs and 8 human patients in the Kaggle seizure detection contest. Two classification methods are explored. In the first, a few top-ranked PSD features from a small number of channels, selected by a prior classification, are used for HD classification. In the second, all PSD features extracted from all channels are used. We show that for about half the subjects the small feature set outperforms the full set in HD classification, while for the other half the full set performs better. HD classification achieves above 95% accuracy for six of the 12 subjects and between 85% and 95% accuracy for 4 subjects. For two subjects, the HD classification accuracy does not match classical approaches such as support vector machine classifiers.
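A generic HD-classification sketch along these lines appears below: each PSD feature's identity hypervector is bound to a quantized-value hypervector, the results are bundled per sample, and class prototypes are formed by bundling training encodings. This encoding scheme is a common HD recipe, not necessarily the paper's exact one, and the input is assumed to be PSD features normalized to [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, n_feats, n_levels = 10_000, 32, 16
feat_hvs = rng.choice([-1, 1], size=(n_feats, DIM))    # feature identities
level_hvs = rng.choice([-1, 1], size=(n_levels, DIM))  # quantized values

def encode(sample):
    """Bind each feature's id-hypervector with its quantized-value
    hypervector, bundle across features, and binarize."""
    levels = np.clip((sample * n_levels).astype(int), 0, n_levels - 1)
    return np.sign((feat_hvs * level_hvs[levels]).sum(axis=0))

# Train: bundle encodings per class (non-seizure = 0, seizure = 1).
X = rng.random((20, n_feats))
y = rng.integers(0, 2, 20)
enc = np.array([encode(x) for x in X])
protos = np.array([np.sign(enc[y == c].sum(axis=0)) for c in (0, 1)])

# Classify a query by similarity to the class prototypes.
pred = np.argmax(protos @ encode(rng.random(n_feats)))
```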
{"title":"Seizure Detection Using Power Spectral Density via Hyperdimensional Computing","authors":"Lulu Ge, K. Parhi","doi":"10.1109/ICASSP39728.2021.9414083","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414083","url":null,"abstract":"Hyperdimensional (HD) computing holds promise for classifying two groups of data. This paper explores seizure detection from electroencephalogram (EEG) from subjects with epilepsy using HD computing based on power spectral density (PSD) features. Publicly available intra-cranial EEG (iEEG) data collected from 4 dogs and 8 human patients in the Kaggle seizure detection contest are used in this paper. This paper explores two methods for classification. First, few ranked PSD features from small number of channels from a prior classification are used in the context of HD classification. Second, all PSD features extracted from all channels are used as features for HD classification. It is shown that for about half the subjects small number features outperform all features in the context of HD classification, and for the other half, all features outperform small number of features. HD classification achieves above 95% accuracy for six of the 12 subjects, and between 85-95% accuracy for 4 subjects. For two subjects, the classification accuracy using HD computing is not as good as classical approaches such as support vector machine classifiers.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"46 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125704055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Adaptive Non-Linear Process for Under-Determined Virtual Microphone Beamforming
Pub Date: 2021-06-06 | DOI: 10.1109/ICASSP39728.2021.9413813
M. Bekrani, Anh H. T. Nguyen, Andy W. H. Khong
Virtual microphone beamforming techniques are attractive for devices limited by space constraints. These techniques synthesize virtual microphone signals via interpolation algorithms. We extend existing virtual microphone signal interpolation by employing an adaptive non-linear (ANL) process for acoustic beamforming. The proposed ANL-based interpolation uses a target-presence probability criterion to determine the degree of non-linearity. The beamformer output is then derived by combining interpolations from target-inactive and target-active zones, offering a trade-off between interference reduction and target-signal distortion. We apply the proposed ANL-based interpolator to the maximum signal-to-noise ratio (MSNR) beamformer and compare its performance against conventional beamforming and virtual microphone-based beamforming methods in under-determined scenarios.
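One assumed form of such an interpolator is sketched below: a virtual microphone whose amplitude-interpolation exponent grows as the target-presence probability falls, so the interpolation behaves more non-linearly when the target is inactive. The exponent schedule, `beta_max`, and `alpha` are illustrative, not the authors' exact formulation.

```python
import numpy as np

def virtual_mic(X1, X2, p_target, beta_max=2.0, alpha=0.5):
    """Sketch of a virtual microphone with an adaptive non-linearity.

    X1, X2   : (F, T) complex STFTs of the two real microphones
    X2       : positioned so a point between them is synthesized
    p_target : (F, T) target-presence probability in [0, 1]
    """
    # Degree of non-linearity rises as target presence falls.
    beta = 1.0 + (beta_max - 1.0) * (1.0 - p_target)
    # Non-linear amplitude interpolation; linear phase interpolation.
    mag = (alpha * np.abs(X1) ** beta
           + (1 - alpha) * np.abs(X2) ** beta) ** (1.0 / beta)
    phase = np.angle(alpha * X1 + (1 - alpha) * X2)
    return mag * np.exp(1j * phase)

# Toy usage on random spectra with a random presence map.
F_, T_ = 257, 100
X1 = np.random.randn(F_, T_) + 1j * np.random.randn(F_, T_)
X2 = np.random.randn(F_, T_) + 1j * np.random.randn(F_, T_)
V = virtual_mic(X1, X2, np.random.rand(F_, T_))
```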
{"title":"An Adaptive Non-Linear Process for Under-Determined Virtual Microphone Beamforming","authors":"M. Bekrani, Anh H. T. Nguyen, Andy W. H. Khong","doi":"10.1109/ICASSP39728.2021.9413813","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413813","url":null,"abstract":"Virtual microphone beamforming techniques are attractive for devices limited by space constraints. These techniques synthesize virtual microphone signals via interpolation algorithms. We propose to extend existing virtual microphone signal interpolation by employing an adaptive non-linear (ANL) process for acoustic beamforming. The proposed ANL based interpolation utilizes a target-presence probability criteria to determine the degree of non-linearity. The beamformer output is then derived using a combination between interpolations during target inactive zones and target active zones. Such combination offers a trade-off between reducing interference and target signal distortion. We apply the proposed ANL-based interpolator to the maximum signal-to-noise ratio (MSNR) beamformer and compare its performance against conventional beamforming and virtual microphone based beamforming methods in under-determined situations.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126011349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dependence-Guided Multi-View Clustering
Pub Date: 2021-06-06 | DOI: 10.1109/ICASSP39728.2021.9414971
Xia Dong, Danyang Wu, F. Nie, Rong Wang, Xuelong Li
In this paper, we propose a novel approach called dependence-guided multi-view clustering (DGMC). Our model strengthens the dependence between unified embedding learning and clustering, and promotes the dependence between the unified embedding and the embedding of each view. Specifically, DGMC learns a unified embedding and partitions the data jointly, so clustering results are obtained directly. A kernel dependence measure is employed to learn the unified embedding by forcing it to stay close to the individual views, capturing the complex dependence among them. Moreover, an implicit weight-learning mechanism ensures the diversity of the views. An efficient algorithm with rigorous convergence analysis is derived to solve the proposed model. Experimental results demonstrate the advantages of the proposed method over state-of-the-art methods on real-world datasets.
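A standard instance of such a kernel dependence measure is the empirical HSIC, sketched below with Gaussian kernels; the paper's exact measure may differ. Maximizing it between the unified embedding and a view's embedding pulls the two toward the same structure.

```python
import numpy as np

def hsic(X, Y, sigma=1.0):
    """Empirical Hilbert-Schmidt Independence Criterion with Gaussian
    kernels: a common kernel dependence measure between two sets of
    embeddings X (n, dx) and Y (n, dy) over the same n samples."""
    n = X.shape[0]

    def gram(Z):
        sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma ** 2))

    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    K, L = gram(X), gram(Y)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# Toy usage: dependence between a unified embedding and one view.
unified = np.random.rand(30, 4)
view = unified @ np.random.rand(4, 6)  # a dependent view scores high
print(hsic(unified, view))
```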
{"title":"Dependence-Guided Multi-View Clustering","authors":"Xia Dong, Danyang Wu, F. Nie, Rong Wang, Xuelong Li","doi":"10.1109/ICASSP39728.2021.9414971","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414971","url":null,"abstract":"In this paper, we propose a novel approach called dependence-guided multi-view clustering (DGMC). Our model enhances the dependence between unified embedding learning and clustering, as well as promotes the dependence between unified embedding and embedding of each view. Specifically, DGMC learns a unified embedding and partitions data in a joint fashion, thus the clustering results can be directly obtained. A kernel dependence measure is employed to learn a unified embedding by forcing it to be close to different views, thus the complex dependence among different views can be captured. Moreover, an implicit-weight learning mechanism is provided to ensure the diversity of different views. An efficient algorithm with rigorous convergence analysis is derived to solve the proposed model. Experimental results demonstrate the advantages of the proposed method over the state of the arts on real-world datasets.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"184 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121922384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linear Multichannel Blind Source Separation based on Time-Frequency Mask Obtained by Harmonic/Percussive Sound Separation
Pub Date: 2021-06-06 | DOI: 10.1109/ICASSP39728.2021.9413494
Soichiro Oyabu, Daichi Kitamura, K. Yatabe
Determined blind source separation (BSS) extracts source signals by linear multichannel filtering. Its performance depends on the accuracy of the source model, and hence existing BSS methods have proposed several source models. Recently, a determined BSS algorithm that incorporates a time-frequency mask was proposed. It enables very flexible source modeling because the model is implicitly defined by a mask-generating function. Building on this framework, we propose a unification of determined BSS and harmonic/percussive sound separation (HPSS). HPSS is an important preprocessing step for musical applications; by incorporating it, both harmonic and percussive instruments can be accurately modeled within determined BSS. The resulting algorithm estimates the demixing filter using information obtained from an HPSS method. We also propose a stabilization method that is essential for the proposed algorithm. Our experiments show that the proposed method outperforms both HPSS and determined BSS methods, including independent low-rank matrix analysis.
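For concreteness, the sketch below obtains harmonic/percussive soft masks with Fitzgerald-style median filtering, one common HPSS method that could supply the time-frequency mask such a BSS algorithm consumes; the paper's own HPSS choice may differ.

```python
import numpy as np
from scipy.ndimage import median_filter
from scipy.signal import stft

def hpss_masks(x, fs, win=1024, kernel=17):
    """Median-filtering HPSS: harmonic energy is smooth along time,
    percussive energy is smooth along frequency, so median filters in
    those directions separate the two and yield soft masks."""
    _, _, X = stft(x, fs, nperseg=win)
    S = np.abs(X)                              # (F, T) magnitude
    harm = median_filter(S, size=(1, kernel))  # smooth along time
    perc = median_filter(S, size=(kernel, 1))  # smooth along frequency
    mask_h = harm / (harm + perc + 1e-12)      # soft harmonic mask
    return mask_h, 1.0 - mask_h                # harmonic, percussive

# Toy usage on a 1-second random signal.
fs = 16_000
x = np.random.randn(fs)
mask_h, mask_p = hpss_masks(x, fs)
```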
{"title":"Linear Multichannel Blind Source Separation based on Time-Frequency Mask Obtained by Harmonic/Percussive Sound Separation","authors":"Soichiro Oyabu, Daichi Kitamura, K. Yatabe","doi":"10.1109/ICASSP39728.2021.9413494","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413494","url":null,"abstract":"Determined blind source separation (BSS) extracts the source signals by linear multichannel filtering. Its performance depends on the accuracy of source modeling, and hence existing BSS methods have proposed several source models. Recently, a new determined BSS algorithm that incorporates a time-frequency mask has been proposed. It enables very flexible source modeling because the model is implicitly defined by a mask-generating function. Building up on this framework, in this paper, we propose a unification of determined BSS and harmonic/percussive sound separation (HPSS). HPSS is an important preprocessing for musical applications. By incorporating HPSS, both harmonic and percussive instruments can be accurately modeled for determined BSS. The resultant algorithm estimates the demixing filter using the information obtained by an HPSS method. We also propose a stabilization method that is essential for the proposed algorithm. Our experiments showed that the proposed method outperformed both HPSS and determined BSS methods including independent low-rank matrix analysis.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122275024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}