Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-673
Amir Ivry, I. Cohen, B. Berdugo
Speech quality is most accurately assessed by subjective human ratings. The objective acoustic echo cancellation mean opinion score (AECMOS) metric was recently introduced and achieves high accuracy in predicting human perception during double-talk. Residual-echo suppression (RES) systems, however, employ the signal-to-distortion ratio (SDR) metric to quantify speech quality in double-talk. In this study, we focus on stereophonic acoustic echo cancellation and show that the stereo SDR (SSDR) correlates poorly with subjective human ratings as captured by the AECMOS, since the SSDR is influenced both by distortion of the desired speech and by the presence of residual echo. We introduce a pair of objective metrics that separately assess the stereo desired-speech maintained level (SDSML) and the stereo residual-echo suppression level (SRESL) during double-talk. Using a tunable deep-learning-based RES system and 100 hours of real and simulated recordings, the SDSML and SRESL metrics show high correlation with the AECMOS across various setups. We also investigate how the design parameter governs the SDSML-SRESL tradeoff, and harness this relation to achieve optimal performance under frequently changing user demands in practical cases.
{"title":"Objective Metrics to Evaluate Residual-Echo Suppression During Double-Talk in the Stereophonic Case","authors":"Amir Ivry, I. Cohen, B. Berdugo","doi":"10.21437/interspeech.2022-673","DOIUrl":"https://doi.org/10.21437/interspeech.2022-673","url":null,"abstract":"Speech quality, as evaluated by humans, is most accurately as-sessed by subjective human ratings. The objective acoustic echo cancellation mean opinion score (AECMOS) metric was re-cently introduced and achieved high accuracy in predicting human perception during double-talk. Residual-echo suppression (RES) systems, however, employ the signal-to-distortion ratio (SDR) metric to quantify speech-quality in double-talk. In this study, we focus on stereophonic acoustic echo cancellation, and show that the stereo SDR (SSDR) poorly correlates with subjective human ratings according to the AECMOS, since the SSDR is influenced by both distortion of desired speech and presence of residual-echo. We introduce a pair of objective metrics that distinctly assess the stereo desired-speech maintained level (SDSML) and stereo residual-echo suppression level (SRESL) during double-talk. By employing a tunable RES system based on deep learning and using 100 hours of real and simulated recordings, the SDSML and SRESL metrics show high correlation with the AECMOS across various setups. We also investi-gate into how the design parameter governs the SDSML-SRESL tradeoff, and harness this relation to allow optimal performance for frequently-changing user demands in practical cases.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"5348-5352"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42839527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-10904
Zhouyuan Huo, DongSeon Hwang, K. Sim, Shefali Garg, Ananya Misra, Nikhil Siddhartha, Trevor Strohman, F. Beaufays
Streaming end-to-end speech recognition models have been widely deployed on mobile devices and show significant gains in efficiency. These models are typically trained on the server using transcribed speech data. However, the server data distribution can be very different from the data distribution on user devices, which can degrade model performance. There are two main challenges for on-device training: limited reliable labels and limited training memory. While self-supervised learning algorithms can mitigate the mismatch between domains using unlabeled data, they are not directly applicable on mobile devices because of the memory constraint. In this paper, we propose an incremental layer-wise self-supervised learning algorithm for efficient unsupervised speech domain adaptation on mobile devices, in which only one layer is updated at a time. Extensive experimental results demonstrate that the proposed algorithm achieves a 24.2% relative word error rate (WER) improvement on the target domain compared to a supervised baseline and uses 95.7% less training memory than end-to-end self-supervised learning.
{"title":"Incremental Layer-Wise Self-Supervised Learning for Efficient Unsupervised Speech Domain Adaptation On Device","authors":"Zhouyuan Huo, DongSeon Hwang, K. Sim, Shefali Garg, Ananya Misra, Nikhil Siddhartha, Trevor Strohman, F. Beaufays","doi":"10.21437/interspeech.2022-10904","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10904","url":null,"abstract":"Streaming end-to-end speech recognition models have been widely applied to mobile devices and show significant improvement in efficiency. These models are typically trained on the server using transcribed speech data. However, the server data distribution can be very different from the data distribution on user devices, which could affect the model performance. There are two main challenges for on device training, limited reliable labels and limited training memory. While self-supervised learning algorithms can mitigate the mismatch between domains using unlabeled data, they are not applicable on mobile devices directly because of the memory constraint. In this paper, we propose an incremental layer-wise self-supervised learning algorithm for efficient unsupervised speech domain adaptation on mobile devices, in which only one layer is updated at a time. Extensive experimental results demonstrate that the proposed algorithm achieves a 24 . 2% relative Word Error Rate (WER) improvement on the target domain compared to a supervised baseline and costs 95 . 7% less training memory than the end-to-end self-supervised learning algorithm.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4845-4849"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41326867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-961
Wenjing Liu, Chuan Xie
Target speaker extraction aims to extract the target speaker's voice from mixed utterances based on auxiliary reference speech of the target speaker. A speaker embedding is usually extracted from the reference speech and fused with the learned acoustic representation. Most existing works perform a simple operation-based fusion such as concatenation. However, this naive approach, which directly fuses the speaker embedding into the acoustic representation, may not effectively exploit potential cross-modal correlation. In this work, we propose a gated convolutional fusion approach that uses global conditional modeling and a trainable gating mechanism to learn sophisticated interactions between the speaker embedding and the acoustic representation. Experiments on the WSJ0-2mix-extr dataset prove the efficacy of the proposed fusion approach, which performs favorably against other fusion methods with considerable improvements in terms of SDRi and SI-SDRi. Moreover, our method can be flexibly incorporated into similar time-domain speaker extraction networks to attain better performance.
{"title":"Gated Convolutional Fusion for Time-Domain Target Speaker Extraction Network","authors":"Wenjing Liu, Chuan Xie","doi":"10.21437/interspeech.2022-961","DOIUrl":"https://doi.org/10.21437/interspeech.2022-961","url":null,"abstract":"Target speaker extraction aims to extract the target speaker’s voice from mixed utterances based on auxillary reference speech of the target speaker. A speaker embedding is usually extracted from the reference speech and fused with the learned acoustic representation. The majority of existing works perform simple operation-based fusion of concatenation. However, potential cross-modal correlation may not be effectively explored by this naive approach that directly fuse the speaker embedding into the acoustic representation. In this work, we propose a gated convolutional fusion approach by exploring global conditional modeling and trainable gating mechanism for learning so-phisticated interaction between speaker embedding and acoustic representation. Experiments on WSJ0-2mix-extr dataset proves the efficacy of the proposed fusion approach, which performs favorably against other fusion methods with considerable improvement in terms of SDRi and SI-SDRi. Moreover, our method can be flexibly incorporated into similar time-domain speaker extraction networks to attain better performance.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"5368-5372"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49331936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-209
Dan Zhang, Ashwinkumar Ganesan, Sarah Campbell
In this paper, we study the problem of generating mispronounced speech mimicking non-native (L2) speakers learning English as a Second Language (ESL) for the mispronunciation detection and diagnosis (MDD) task. The paper is motivated by the widely observed yet not well addressed data sparsity issue in MDD research, where both L2 speech audio and its fine-grained phonetic annotations are difficult to obtain, leading to unsatisfactory mispronunciation feedback accuracy. We propose L2-GEN, a new data augmentation framework that generates L2 phoneme sequences capturing realistic mispronunciation patterns by devising a unique machine translation-based sequence paraphrasing model. A novel diversified and preference-aware decoding algorithm is proposed to generalize L2-GEN to both unseen words and new learner populations with very limited L2 training data. A contrastive augmentation technique is further designed to maximize the MDD performance gains obtained from the generated synthetic L2 data. We evaluate L2-GEN on the public L2-ARCTIC and SpeechOcean762 datasets. The results show that L2-GEN yields MDD F1-score improvements of up to 3.9% and 5.0% in in-domain and out-of-domain scenarios, respectively.
{"title":"L2-GEN: A Neural Phoneme Paraphrasing Approach to L2 Speech Synthesis for Mispronunciation Diagnosis","authors":"Dan Zhang, Ashwinkumar Ganesan, Sarah Campbell","doi":"10.21437/interspeech.2022-209","DOIUrl":"https://doi.org/10.21437/interspeech.2022-209","url":null,"abstract":"In this paper, we study the problem of generating mispronounced speech mimicking non-native (L2) speakers learning English as a Second Language (ESL) for the mispronunciation detection and diagnosis (MDD) task. The paper is motivated by the widely observed yet not well addressed data sparsity is-sue in MDD research where both L2 speech audio and its fine-grained phonetic annotations are difficult to obtain, leading to unsatisfactory mispronunciation feedback accuracy. We pro-pose L2-GEN, a new data augmentation framework to generate L2 phoneme sequences that capture realistic mispronunciation patterns by devising an unique machine translation-based sequence paraphrasing model. A novel diversified and preference-aware decoding algorithm is proposed to generalize L2-GEN to handle both unseen words and new learner population with very limited L2 training data. A contrastive augmentation technique is further designed to optimize MDD performance improvements with the generated synthetic L2 data. We evaluate L2-GEN on public L2-ARCTIC and SpeechOcean762 datasets. The results have shown that L2-GEN leads to up to 3.9%, and 5.0% MDD F1-score improvements in in-domain and out-of-domain scenarios respectively.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4317-4321"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49388203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-846
Christian Bergler, Alexander Barnhill, Dominik Perrin, M. Schmitt, A. Maier, E. Nöth
Even today, the understanding and interpretation of animal-specific vocalization paradigms is largely based on historical and manual analysis of comparatively small data corpora, primarily because of time and human-resource limitations, alongside the scarcity of species-specific machine-learning techniques. Partial human-based data inspections represent neither the overall real-world vocal repertoire nor the variations within intra- and inter-animal call type portfolios, typically resulting only in small collections of category-specific ground truth data. Modern machine (deep) learning concepts are essential for identifying statistically significant animal-related vocalization patterns within massive bioacoustic data archives. However, purely supervised training approaches are difficult to apply, due to limited call-specific ground truth data combined with strong class imbalances between individual call type events. The current study is the first to present a deep bioacoustic signal generation framework, entitled ORCA-WHISPER: a Generative Adversarial Network (GAN) trained on low-resource killer whale (Orcinus orca) call type data. Besides audiovisual inspection, supervised call type classification, and model transferability, the promising quality of the generated vocalizations was further demonstrated by visualizing, representing, and enhancing the real-world orca signal data manifold. Moreover, previous orca/noise segmentation results were outperformed by adding the generated signals to the original data partition.
{"title":"ORCA-WHISPER: An Automatic Killer Whale Sound Type Generation Toolkit Using Deep Learning","authors":"Christian Bergler, Alexander Barnhill, Dominik Perrin, M. Schmitt, A. Maier, E. Nöth","doi":"10.21437/interspeech.2022-846","DOIUrl":"https://doi.org/10.21437/interspeech.2022-846","url":null,"abstract":"Even today, the current understanding and interpretation of animal-specific vocalization paradigms is largely based on his-torical and manual data analysis considering comparatively small data corpora, primarily because of time- and human-resource limitations, next to the scarcity of available species-related machine-learning techniques. Partial human-based data inspections neither represent the overall real-world vocal reper-toire, nor the variations within intra- and inter animal-specific call type portfolios, typically resulting only in small collections of category-specific ground truth data. Modern machine (deep) learning concepts are an essential requirement to identify sta-tistically significant animal-related vocalization patterns within massive bioacoustic data archives. However, the applicability of pure supervised training approaches is challenging, due to limited call-specific ground truth data, combined with strong class-imbalances between individual call type events. The current study is the first presenting a deep bioacoustic signal generation framework, entitled ORCA-WHISPER, a Generative Adversarial Network (GAN), trained on low-resource killer whale ( Orcinus Orca ) call type data. Besides audiovisual in-spection, supervised call type classification, and model transferability, the auspicious quality of generated fake vocalizations was further demonstrated by visualizing, representing, and en-hancing the real-world orca signal data manifold. Moreover, previous orca/noise segmentation results were outperformed by integrating fake signals to the original data partition.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2413-2417"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49492375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-11307
Akihiko Takashima, Ryo Masumura, Atsushi Ando, Yoshihiro Yamazaki, Mihiro Uchida, Shota Orihashi
This paper proposes a novel modeling method for audio-visual emotion recognition. Since human emotions are expressed multi-modally, jointly capturing audio and visual cues is a promising approach. In conventional multi-modal modeling methods, a recognition model is trained on an audio-visual paired dataset solely to enhance audio-visual emotion recognition performance. However, such a model fails to estimate emotions from single-modal inputs, indicating that it overfits to combinations of the individual modal features. Our supposition is that an ideal emotion recognizer should accurately perform both audio-visual multimodal processing and single-modal processing with a single model. This is expected to promote the use of individual modal knowledge for improving audio-visual emotion recognition. Therefore, our proposed method employs a cross-modal transformer model that can handle different types of inputs. In addition, we introduce a novel training method named interactive co-learning, which allows the model to learn from both modalities jointly and from each modality alone. Experiments on a multi-label emotion recognition task demonstrate the effectiveness of the proposed method.
{"title":"Interactive Co-Learning with Cross-Modal Transformer for Audio-Visual Emotion Recognition","authors":"Akihiko Takashima, Ryo Masumura, Atsushi Ando, Yoshihiro Yamazaki, Mihiro Uchida, Shota Orihashi","doi":"10.21437/interspeech.2022-11307","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11307","url":null,"abstract":"This paper proposes a novel modeling method for audio-visual emotion recognition. Since human emotions are expressed multi-modally, jointly capturing audio and visual cues is a potentially promising approach. In conventional multi-modal modeling methods, a recognition model was trained from an audio-visual paired dataset so as to only enhance audio-visual emotion recognition performance. However, it fails to estimate emotions from single-modal inputs, which indicates they are degraded by overfitting the combinations of the individual modal features. Our supposition is that the ideal form of the emotion recognition is to accurately perform both audio-visual multimodal processing and single-modal processing with a single model. This is expected to promote utilization of individual modal knowledge for improving audio-visual emotion recognition. Therefore, our proposed method employs a cross-modal transformer model that enables different types of inputs to be handled. In addition, we introduce a novel training method named interactive co-learning; it allows the model to learn knowledge from both and either of the modals. Experiments on a multi-label emotion recognition task demonstrate the ef-fectiveness of the proposed method.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4740-4744"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49527957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-10685
K. V. V. Girish, Srikanth Konjeti, Jithendra Vepa
Speech emotion recognition (SER) is useful in many applications and has been approached with signal processing techniques in the past and, more recently, with deep learning. Human emotions are complex in nature and can vary widely within an utterance. SER accuracy has improved with various multimodal techniques, but there is still a gap in understanding model behaviour and expressing these complex emotions in a human-interpretable form. In this work, we propose and define interpretability measures, represented as a Human Level Indicator Matrix for an utterance, and showcase their effectiveness in both qualitative and quantitative terms. Word-level interpretability is provided using attention-based sequence modelling of self-supervised speech and text pre-trained embeddings. Prosody features are also combined with the proposed model to examine their efficacy at the word and utterance level. We provide insights into sub-utterance-level emotion predictions for complex utterances in which the emotion class changes within the utterance. We evaluate the model and provide interpretations on the publicly available IEMOCAP dataset.
{"title":"Interpretabilty of Speech Emotion Recognition modelled using Self-Supervised Speech and Text Pre-Trained Embeddings","authors":"K. V. V. Girish, Srikanth Konjeti, Jithendra Vepa","doi":"10.21437/interspeech.2022-10685","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10685","url":null,"abstract":"Speech emotion recognition (SER) is useful in many applications and is approached using signal processing techniques in the past and deep learning techniques recently. Human emotions are complex in nature and can vary widely within an utterance. The SER accuracy has improved using various multimodal techniques but there is still some gap in understanding the model behaviour and expressing these complex emotions in a human interpretable form. In this work, we propose and define interpretability measures represented as a Human Level Indicator Matrix for an utterance and showcase it’s effective-ness in both qualitative and quantitative terms. A word level interpretability is presented using an attention based sequence modelling of self-supervised speech and text pre-trained embeddings. Prosody features are also combined with the proposed model to see the efficacy at the word and utterance level. We provide insights into sub-utterance level emotion predictions for complex utterances where the emotion classes change within the utterance. We evaluate the model and provide the interpretations on the publicly available IEMOCAP dataset.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4496-4500"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49559750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-320
Jucai Lin, Tingwei Chen, Jingbiao Huang, Ruidong Fang, Jun Yin, Yuanping Yin, W. Shi, Wei Huang, Yapeng Mao
In this paper, a spoofing-aware speaker verification (SASV) system that integrates an automatic speaker verification (ASV) system and a countermeasure (CM) system is developed. First, a modified re-parameterized VGG (ARepVGG) module is utilized to extract a high-level representation from the multi-scale feature learned from the raw waveform through sinc-filters, and then a spectro-temporal graph attention network is used to decide whether the audio is spoofed. Second, a new network inspired by Max-Feature-Map (MFM) layers is constructed to fine-tune the CM system while keeping the ASV system fixed. Our proposed SASV system significantly improves the SASV equal error rate (SASV-EER) from 6.73% to 1.36% on the evaluation dataset and from 4.85% to 0.98% on the development dataset of the 2022 Spoofing-Aware Speaker Verification Challenge (2022 SASV).
{"title":"The CLIPS System for 2022 Spoofing-Aware Speaker Verification Challenge","authors":"Jucai Lin, Tingwei Chen, Jingbiao Huang, Ruidong Fang, Jun Yin, Yuanping Yin, W. Shi, Wei Huang, Yapeng Mao","doi":"10.21437/interspeech.2022-320","DOIUrl":"https://doi.org/10.21437/interspeech.2022-320","url":null,"abstract":"In this paper, a spoofing-aware speaker verification (SASV) system that integrates the automatic speaker verification (ASV) system and countermeasure (CM) system is developed. Firstly, a modified re-parameterized VGG (ARepVGG) module is utilized to extract high-level representation from the multi-scale feature that learns from the raw waveform though sinc-filters, and then a spectra-temporal graph attention network is used to learn the final decision information whether the audio is spoofed or not. Secondly, a new network that is inspired from the Max-Feature-Map (MFM) layers is constructed to fine-tune the CM system while keeping the ASV system fixed. Our proposed SASV system significantly improves the SASV equal error rate (SASV-EER) from 6.73 % to 1.36 % on the evaluation dataset and 4.85 % to 0.98 % on the development dataset in the 2022 Spoofing-Aware Speaker Verification Challenge(2022 SASV).","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4367-4370"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42514937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-740
Chenhui Zhang, Xiang Pan
This paper presents a combination of the Graph Fourier Transform (GFT) and U-net, proposing a deep neural network (DNN) named G-Unet for single-channel speech enhancement. The GFT is applied to speech data to create the U-net inputs. The GFT outputs are combined with the mask estimated by the U-net in the time-graph (T-G) domain, and the enhanced speech is reconstructed in the time domain via the inverse GFT. G-Unet outperforms the combination of the short-time Fourier transform (STFT) and a magnitude-estimation U-net in improving speech quality and dereverberation, and in some cases outperforms the combination of the STFT and a complex U-net in improving speech quality, as validated on the LibriSpeech and NOISEX-92 datasets.
{"title":"Single-channel speech enhancement using Graph Fourier Transform","authors":"Chenhui Zhang, Xiang Pan","doi":"10.21437/interspeech.2022-740","DOIUrl":"https://doi.org/10.21437/interspeech.2022-740","url":null,"abstract":"This paper presents combination of Graph Fourier Transform (GFT) and U-net, proposes a deep neural network (DNN) named G-Unet for single channel speech enhancement. GFT is carried out over speech data for creating inputs of U-net. The GFT outputs are combined with the mask estimated by Unet in time-graph (T-G) domain to reconstruct enhanced speech in time domain by Inverse GFT. The G-Unet outperforms the combination of Short time Fourier Transform (STFT) and magnitude estimation U-net in improving speech quality and de-reverberation, and outperforms the combination of STFT and complex U-net in improving speech quality in some cases, which is validated by testing on LibriSpeech and NOISEX92 dataset.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"946-950"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45499958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}