Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-470
Peng Zhang, Peng Hu, Xueliang Zhang
In this paper, we present the Elevoc systems submitted to the Spoofing Aware Speaker Verification Challenge (SASVC) 2022. Our submissions focus on bridging the gap between automatic speaker verification (ASV) and countermeasure (CM) systems. We investigate a general and efficient norm-constrained score-level ensemble method that jointly processes the scores extracted from the ASV and CM subsystems, improving robustness to both zero-effort impostors and spoofing attacks. Furthermore, we show that the ensemble system performs better when both the ASV and CM subsystems are optimized. Experimental results show that our primary system yields 0.45% SV-EER, 0.26% SPF-EER and 0.37% SASV-EER, corresponding to relative improvements of more than 96.08%, 66.67% and 94.19% over the best-performing baseline systems on the SASVC 2022 evaluation set. All of our code and pre-trained model weights are publicly available and reproducible.
{"title":"Norm-constrained Score-level Ensemble for Spoofing Aware Speaker Verification","authors":"Peng Zhang, Peng Hu, Xueliang Zhang","doi":"10.21437/interspeech.2022-470","DOIUrl":"https://doi.org/10.21437/interspeech.2022-470","url":null,"abstract":"In this paper, we present the Elevoc systems submitted to the Spoofing Aware Speaker Verification Challenge (SASVC) 2022. Our submissions focus on bridge the gap between the automatic speaker verification (ASV) and countermeasure (CM) systems. We investigate a general and efficient norm-constrained score-level ensemble method which jointly processes the scores extracted from ASV and CM subsystems, improving robustness to both zero-effect imposters and spoof-ing attacks. Furthermore, we explore that the ensemble system can provide better performance when both ASV and CM subsystems are optimized. Experimental results show that our primary system yields 0.45% SV-EER, 0.26% SPF-EER and 0.37% SASV-EER, and obtains more than 96.08%, 66.67% and 94.19% relative improvements over the best performing baseline systems on the SASVC 2022 evaluation set. All of our code and pre-trained models weights are publicly available and reproducible 1 .","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4371-4375"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44747898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-10053
Mayank Sharma, Tarun Gupta, Kenny Qiu, Xiang Hao, Raffay Hamid
Automated violence detection in Digital Entertainment Content (DEC) uses computer vision and natural language processing methods on the visual and textual modalities. These methods struggle to detect violence because of the diversity, ambiguity and multilingual nature of the data. Hence, we introduce an audio-based method to augment existing methods for violence and rating classification. We develop a generic Audio Event Detector (AED) model using open-source and Prime Video proprietary corpora and use it as a feature extractor. Our feature set includes a global semantic embedding and sparse local audio event probabilities extracted from the AED. We demonstrate that this global-local feature view of audio yields the best detection performance. Next, we present a multi-modal detector that fuses several learners across modalities. Our training and evaluation sets are also at least an order of magnitude larger than those in previous literature. Furthermore, we show that (a) the audio-based approach outperforms the other baselines, (b) the benefit of the audio model is more pronounced on global multilingual data than on English data, and (c) the multi-modal model achieves 63% rating accuracy and can backfill the top 90% of Stream Weighted Coverage titles in the PV catalog with 88% coverage at 91% accuracy.
Title: CNN-based Audio Event Recognition for Automated Violence Classification and Rating for Prime Video Content. Interspeech 2022, pp. 2758-2762.
Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-478
Bei Liu, Zhengyang Chen, Y. Qian
As the most widely used technique, deep speaker embedding learning has recently become predominant in the speaker verification task. This approach utilizes deep neural networks to extract fixed-dimension embedding vectors that represent different speaker identities. Two network architectures, ResNet and ECAPA-TDNN, have been commonly adopted in prior studies and achieve state-of-the-art performance. One omnipresent component, feature fusion, plays an important role in both of them. For example, shortcut connections in ResNet fuse the identity mapping of the inputs with the outputs of the residual blocks, and ECAPA-TDNN employs multi-layer feature aggregation to integrate shallow feature maps with deep ones. Traditional feature fusion is often implemented via simple operations, such as element-wise addition or concatenation. In this paper, we propose a more effective feature fusion scheme, namely Attentive Feature Fusion (AFF), to render dynamic weighted fusion of different features. It utilizes attention modules to learn fusion weights based on the feature contents. Additionally, two fusion strategies are designed: sequential fusion and parallel fusion. Experiments on the VoxCeleb dataset show that our proposed attentive feature fusion scheme yields up to 40% relative improvement over the baseline systems.
{"title":"Attentive Feature Fusion for Robust Speaker Verification","authors":"Bei Liu, Zhengyang Chen, Y. Qian","doi":"10.21437/interspeech.2022-478","DOIUrl":"https://doi.org/10.21437/interspeech.2022-478","url":null,"abstract":"As the most widely used technique, deep speaker embedding learning has become predominant in speaker verification task recently. This approach utilizes deep neural networks to extract fixed dimension embedding vectors which represent different speaker identities. Two network architectures such as ResNet and ECAPA-TDNN have been commonly adopted in prior studies and achieved the state-of-the-art performance. One omnipresent part, feature fusion, plays an important role in both of them. For example, shortcut connections are designed to fuse the identity mapping of inputs and outputs of residual blocks in ResNet. ECAPA-TDNN employs the multi-layer feature aggregation to integrate shallow feature maps with deep ones. Traditional feature fusion is often implemented via simple operations, such as element-wise addition or concatena-tion. In this paper, we propose a more effective feature fusion scheme, namely A ttentive F eature F usion (AFF), to render dynamic weighted fusion of different features. It utilizes attention modules to learn fusion weights based on the feature contents. Additionally, two fusion strategies are designed: sequential fusion and parallel fusion. Experiments on Voxceleb dataset show that our proposed attentive feature fusion scheme can result in up to 40% relative improvement over the baseline systems.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"286-290"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41894666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-11131
Juliana N. Saba, J. Hansen
Previous research has demonstrated techniques for improving automatic speech recognition and speech-in-noise intelligibility for normal-hearing (NH) and cochlear implant (CI) listeners by synthesizing Lombard Effect (LE) speech. In this study, we emulate and evaluate segment-specific modifications based on speech production characteristics observed in natural LE speech in order to improve intelligibility for CI listeners. Two speech processing approaches were designed to modify the representation of vowels, consonants, and their combination using amplitude-based compression techniques in the “electric domain”, referring to the stimulation sequence delivered to the intracochlear electrode array that corresponds to the acoustic signal. Performance with CI listeners showed no significant difference between the consonant-boosting and the combined consonant- and vowel-boosting strategies, even with better representation of the mid-frequency and high-frequency content corresponding to formant and consonant structure, respectively. Spectral smearing and decreased amplitude variation were also observed, which may have negatively impacted intelligibility. The segmental perturbations in this study, implemented with weighted logarithmic and sigmoid compression functions, demonstrated the ability to improve the representation of frequency content but disrupted amplitude-based cues, despite comparable speech intelligibility. While there are infinitely many acoustic-domain modifications characterizing LE speech, this study demonstrates a basic framework for emulating segmental differences in the electric domain.
{"title":"Speech Modification for Intelligibility in Cochlear Implant Listeners: Individual Effects of Vowel- and Consonant-Boosting","authors":"Juliana N. Saba, J. Hansen","doi":"10.21437/interspeech.2022-11131","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11131","url":null,"abstract":"Previous research has demonstrated techniques to improve automatic speech recognition and speech-in-noise intelligibility for normal hearing (NH) and cochlear implant (CI) listeners by synthesizing Lombard Effect (LE) speech. In this study, we emulate and evaluate segment-specific modifications based on speech production characteristics observed in natural LE speech in order to improve intelligibility for CI listeners. Two speech processing approaches were designed to modify representation of vowels, consonants, and the combination using amplitude-based compression techniques in the “ electric domain ” – referring to the stimulation sequence delivered to the intracochlear electrode array that corresponds to the acoustic signal. Performance with CI listeners resulted in no significant difference using consonant-boosting and consonant- and vowel-boosting strategies with better representation of mid-frequency and high-frequency content corresponding to both formant and consonant structure, respectively. Spectral smearing and decreased amplitude variation were also observed which may have negatively impacted intelligibility. Segmental perturbations using a weighted logarithmic and sigmoid compression functions in this study demonstrated the ability to improve representation of frequency content but disrupted amplitude-based cues, regardless of comparable speech intelligibility. While there are an infinite number of acoustic domain modifications characterizing LE speech, this study demonstrates a basic framework for emulating segmental differences in the electric domain.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"5473-5477"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41805320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-306
Y. Sanada, Takumi Nakagawa, Yuichiro Wada, K. Takanashi, Yuhui Zhang, Kiichi Tokuyama, T. Kanamori, Tomonori Yamada
In the last few years, unsupervised learning methods that take advantage of Deep Neural Networks (DNNs) have been proposed for speech denoising, since such unsupervised methods are more practical than their supervised counterparts. In our scenario, we are given a set of noisy speech data in which no two utterances share the same clean source. Our goal is to obtain a denoiser by training a DNN-based model. Using this set, we train the model in two steps: 1) from each noisy speech signal, construct another noisy speech signal via our proposed masking technique; 2) minimize our proposed loss, defined from the DNN output and the two noisy speech signals. We evaluate our method with Gaussian and real-world noises in our numerical experiments. Our method outperforms the state-of-the-art method on average for both noise types. In addition, we provide a theoretical explanation of why our method is effective when the noise follows a Gaussian distribution.
{"title":"Deep Self-Supervised Learning of Speech Denoising from Noisy Speeches","authors":"Y. Sanada, Takumi Nakagawa, Yuichiro Wada, K. Takanashi, Yuhui Zhang, Kiichi Tokuyama, T. Kanamori, Tomonori Yamada","doi":"10.21437/interspeech.2022-306","DOIUrl":"https://doi.org/10.21437/interspeech.2022-306","url":null,"abstract":"In the last few years, unsupervised learning methods have been proposed in speech denoising by taking advantage of Deep Neural Networks (DNNs). The reason is that such unsupervised methods are more practical than the supervised counterparts. In our scenario, we are given a set of noisy speech data, where any two data do not share the same clean data. Our goal is to obtain the denoiser by training a DNN based model. Using the set, we train the model via the following two steps: 1) From the noisy speech data, construct another noisy speech data via our proposed masking technique. 2) Minimize our proposed loss defined from the DNN and the two noisy speech data. We evaluate our method using Gaussian and real-world noises in our numerical experiments. As a result, our method outperforms the state-of-the-art method on average for both noises. In addi-tion, we provide the theoretical explanation of why our method can be efficient if the noise has Gaussian distribution.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"1178-1182"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41830258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-698
Xuyi Zhuang, Lu Zhang, Zehua Zhang, Yukun Qian, Mingjiang Wang
Joint training of speech enhancement and automatic speech recognition (ASR) can make a model work robustly in noisy environments. However, most of these models operate strictly in series, and the information in the noisy speech is not reused by the ASR model, leading to a large amount of feature distortion. To address the distortion problem at its root, we propose a complex speech enhancement network that enhances speech by combining masking and mapping in the complex domain. Second, we propose a coarse-grained attention fusion (CAF) mechanism to fuse the features of the noisy speech and the enhanced speech. In addition, a perceptual loss is introduced to constrain the output of the CAF module against the multi-layer outputs of a pre-trained model, so that the feature space of the CAF is more consistent with the ASR model. Our models are trained and tested on a dataset generated from the AISHELL-1 corpus and the DNS-3 noise dataset. The experimental results show character error rates (CERs) of 13.42% and 20.67% for the 0 dB and -5 dB noisy conditions. The proposed joint training model also exhibits good generalization performance (5.98% relative CER degradation) on a mismatched test set generated from the AISHELL-2 corpus and the MUSAN noise dataset.
{"title":"Coarse-Grained Attention Fusion With Joint Training Framework for Complex Speech Enhancement and End-to-End Speech Recognition","authors":"Xuyi Zhuang, Lu Zhang, Zehua Zhang, Yukun Qian, Mingjiang Wang","doi":"10.21437/interspeech.2022-698","DOIUrl":"https://doi.org/10.21437/interspeech.2022-698","url":null,"abstract":"Joint training of speech enhancement and automatic speech recognition (ASR) can make the model work robustly in noisy environments. However, most of these models work directly in series, and the information of noisy speech is not reused by the ASR model, leading to a large amount of feature distortion. In order to solve the distortion problem from the root, we propose a complex speech enhancement network which is used to enhance the speech by combining the masking and mapping in the complex domain. Secondly, we propose a coarse-grained attention fusion (CAF) mechanism to fuse the features of noisy speech and enhanced speech. In addition, perceptual loss is further introduced to constrain the output of the CAF module and the multi-layer output of the pre-trained model so that the feature space of the CAF is more consistent with the ASR model. Our experiments are trained and tested on the dataset generated by AISHELL-1 corpus and DNS-3 noise dataset. The experimental results show that the character error rates (CERs) of the model are 13.42% and 20.67% for the noisy cases of 0 dB and -5 dB. And the proposed joint training model exhibits good generalization performance (5.98% relative CER degradation) on the mismatch test dataset generated by AISHELL-2 corpus and MUSAN noise dataset.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3794-3798"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41833974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-11232
H. Kameoka, Takuhiro Kaneko, Shogo Seki, Kou Tanaka
This paper proposes a task and method for estimating a sequence of facial action units (AUs) solely from speech. AUs were introduced in the facial action coding system to objectively describe facial muscle activations. Our motivation is that AUs can serve as useful continuous quantities for representing a speaker’s subtle emotional states, attitudes, and moods in a variety of applications such as expressive speech synthesis and emotional voice conversion. We hypothesize that information about the speaker’s facial muscle movements is expressed in the generated speech and can be predicted from speech alone. To verify this, we devise a neural network model that predicts an AU sequence from the mel-spectrogram of input speech and train it using a large-scale audio-visual dataset consisting of many speaking face-tracks. We call our method and model “crossmodal AU sequence estimation/estimator (CAUSE)”. We implemented several of the most basic architectures for CAUSE and quantitatively confirmed that the fully convolutional architecture performed best. Furthermore, by combining CAUSE with an AU-conditioned image-to-image translation method, we implemented a system that animates a given still face image from speech. Using this system, we confirmed through subjective evaluations the potential usefulness of AUs as a representation of non-linguistic features.
{"title":"CAUSE: Crossmodal Action Unit Sequence Estimation from Speech","authors":"H. Kameoka, Takuhiro Kaneko, Shogo Seki, Kou Tanaka","doi":"10.21437/interspeech.2022-11232","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11232","url":null,"abstract":"This paper proposes a task and method for estimating a sequence of facial action units (AUs) solely from speech. AUs were introduced in the facial action coding system to objectively describe facial muscle activations. Our motivation is that AUs can be useful continuous quantities for represent-ing speaker’s subtle emotional states, attitudes, and moods in a variety of applications such as expressive speech synthesis and emotional voice conversion. We hypothesize that the information about the speaker’s facial muscle movements is expressed in the generated speech and can somehow be predicted from speech alone. To verify this, we devise a neural network model that predicts an AU sequence from the mel-spectrogram of input speech and train it using a large-scale audio-visual dataset consisting of many speaking face-tracks. We call our method and model “crossmodal AU sequence es-timation/estimator (CAUSE)”. We implemented several of the most basic architectures for CAUSE, and quantitatively confirmed that the fully convolutional architecture performed best. Furthermore, by combining CAUSE with an AU-conditioned image-to-image translation method, we implemented a system that animates a given still face image from speech. Using this system, we confirmed the potential usefulness of AUs as a representation of non-linguistic features via subjective evaluations.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"506-510"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43127714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-10428
Baiyun Liu, Qi Song, Mingxue Yang, Wuwen Yuan, Tianbao Wang
Packet loss is one of the main causes of speech quality degradation in voice over internet protocol (VoIP) calls. However, existing packet loss concealment (PLC) algorithms struggle to generate high-quality speech while maintaining low computational complexity. In this paper, a causal wave-to-wave non-autoregressive lightweight PLC model (PLCNet) is proposed that performs real-time streaming processing with low latency. In addition, we introduce multiple multi-resolution discriminators and a semi-supervised training strategy to improve the encoder's ability to extract global features while enabling the decoder to accurately reconstruct waveforms where packets are lost. In contrast to autoregressive models, PLCNet guarantees the smoothness and continuity of the speech phase before and after packet loss without any smoothing operations. Experimental results show that PLCNet achieves significant improvements in perceptual quality and intelligibility over three classical PLC methods and three state-of-the-art deep PLC methods. In the INTERSPEECH 2022 PLC Challenge, our approach ranked 3rd on PLCMOS (3.829) and 3rd on the final score (0.798).
{"title":"PLCNet: Real-time Packet Loss Concealment with Semi-supervised Generative Adversarial Network","authors":"Baiyun Liu, Qi Song, Mingxue Yang, Wuwen Yuan, Tianbao Wang","doi":"10.21437/interspeech.2022-10428","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10428","url":null,"abstract":"Packet loss is one of the main reasons for speech quality degradation in voice over internet phone (VOIP) calls. However, the existing packet loss concealment (PLC) algorithms are hard to generate high-quality speech signal while maintaining low computational complexity. In this paper, a causal wave-to-wave non-autoregressive lightweight PLC model (PLCNet) is proposed, which can do real-time streaming process with low latency. In addition, we introduce multiple multi-resolution discriminators and semi-supervised training strategy to improve the ability of the encoder part to extract global features while enabling the decoder part to accurately reconstruct waveforms where packets are lost. Contrary to autoregressive model, PLCNet can guarantee the smoothness and continuity of the speech phase before and after packet loss without any smoothing operations. Experimental results show that PLCNet achieves significant improvements in perceptual quality and intelligibility over three classical PLC methods and three state-of-the-art deep PLC methods. In the INTERSPEECH 2022 PLC Challenge, our approach has ranked the 3rd place on PLCMOS (3.829) and the 3rd place on the final score (0.798).","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"575-579"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43369716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-10097
Lingyun Feng, Jianwei Yu, Yan Wang, Songxiang Liu, Deng Cai, Haitao Zheng
{"title":"ASR-Robust Natural Language Understanding on ASR-GLUE dataset","authors":"Lingyun Feng, Jianwei Yu, Yan Wang, Songxiang Liu, Deng Cai, Haitao Zheng","doi":"10.21437/interspeech.2022-10097","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10097","url":null,"abstract":"","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"1101-1105"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43470194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-291
Ye Du, J. Zhang, Qiu-shi Zhu, Lirong Dai, Ming Wu, Xin Fang, Zhouwang Yang
{"title":"A Complementary Joint Training Approach Using Unpaired Speech and Text A Complementary Joint Training Approach Using Unpaired Speech and Text","authors":"Ye Du, J. Zhang, Qiu-shi Zhu, Lirong Dai, Ming Wu, Xin Fang, Zhouwang Yang","doi":"10.21437/interspeech.2022-291","DOIUrl":"https://doi.org/10.21437/interspeech.2022-291","url":null,"abstract":"","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2613-2617"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43520324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}