
Latest Publications in Interspeech

Spectro-Temporal SubNet for Real-Time Monaural Speech Denoising and Dereverberation
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-468
Feifei Xiong, Weiguang Chen, P. Wang, Xiaofei Li, Jinwei Feng
This paper presents an improved subband neural network applied to joint speech denoising and dereverberation for online single-channel scenarios. Preserving the advantages of the subband model (SubNet), which processes each frequency band independently and requires only a small amount of resources for good generalization, the proposed framework, named STSubNet, exploits sufficient spectro-temporal receptive fields (STRFs) from the speech spectrum via a two-dimensional convolution network cooperating with a bi-directional long short-term memory network across frequency bands, to further improve the network's discrimination between the desired speech component and undesired interference, including noise and reverberation. The importance of this STRF extractor is analyzed by evaluating the contribution of each individual module to the STSubNet performance for simultaneous denoising and dereverberation. Experimental results show that STSubNet outperforms other subband variants and achieves competitive performance compared to state-of-the-art models on two public benchmark test sets.
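To give a concrete picture of the cross-band modelling the abstract describes, the following is a minimal PyTorch sketch, not the authors' implementation: a 2-D convolution over the time-frequency plane followed by a BiLSTM that runs across frequency bands. All layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class STRFExtractor(nn.Module):
    def __init__(self, conv_channels=16, hidden=64):
        super().__init__()
        # 2-D convolution captures local spectro-temporal patterns.
        self.conv = nn.Conv2d(1, conv_channels, kernel_size=(3, 3), padding=(1, 1))
        # BiLSTM scans along the frequency axis for each time frame.
        self.freq_lstm = nn.LSTM(conv_channels, hidden, batch_first=True,
                                 bidirectional=True)

    def forward(self, spec):  # spec: (batch, time, freq) magnitude spectrogram
        b, t, f = spec.shape
        x = self.conv(spec.unsqueeze(1))                  # (b, c, t, f)
        x = x.permute(0, 2, 3, 1).reshape(b * t, f, -1)   # one sequence per time frame
        x, _ = self.freq_lstm(x)                          # (b*t, f, 2*hidden)
        return x.reshape(b, t, f, -1)                     # cross-band features per T-F bin

feats = STRFExtractor()(torch.randn(2, 100, 257))
print(feats.shape)  # torch.Size([2, 100, 257, 128])
```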
{"title":"Spectro-Temporal SubNet for Real-Time Monaural Speech Denoising and Dereverberation","authors":"Feifei Xiong, Weiguang Chen, P. Wang, Xiaofei Li, Jinwei Feng","doi":"10.21437/interspeech.2022-468","DOIUrl":"https://doi.org/10.21437/interspeech.2022-468","url":null,"abstract":"This paper presents an improved subband neural network applied to joint speech denoising and dereverberation for online single-channel scenarios. Preserving the advantages of subband model (SubNet) that processes each frequency band in-dependently and requires small amount of resources for good generalization, the proposed framework named STSubNet ex-ploits sufficient spectro-temporal receptive fields (STRFs) from speech spectrum via a two-dimensional convolution network cooperating with a bi-directional long short-term memory network across frequency bands, to further improve the neural network discrimination between desired speech component and undesired interference including noise and reverberation. The importance of this STRF extractor is analyzed by evaluating the contribution of individual module to the STSubNet performance for simultaneously denoising and dereverberation. Experimental results show that STSubNet outperforms other subband variants and achieves competitive performance compared to state-of-the-art models on two publicly benchmark test sets.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"931-935"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46609959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Linguistically Informed Post-processing for ASR Error correction in Sanskrit
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11189
Rishabh Kumar, D. Adiga, R. Ranjan, A. Krishna, Ganesh Ramakrishnan, Pawan Goyal, P. Jyothi
We propose an ASR system for Sanskrit, a low-resource language, that effectively combines subword tokenisation strategies and search space enrichment with linguistic information. More specifically, to address the challenges posed by the high degree of out-of-vocabulary entries present in the language, we first use a subword-based language model and acoustic model to generate a search space. The resulting search space is converted into a word-based search space and further enriched with morphological and lexical information based on a shallow parser. Finally, the transitions in the search space are rescored using a supervised morphological parser proposed for Sanskrit. Our proposed approach currently reports the state-of-the-art results in Sanskrit ASR, with a 7.18 absolute point reduction in WER compared to the previous state-of-the-art.
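The rescoring step can be pictured with the small, hypothetical sketch below, in which each arc of the word-based search space keeps its ASR score and is interpolated with a score from a morphological analysis. The arc format, the stand-in morphological scorer, and the weight lambda are all assumptions for illustration, not the paper's actual components.

```python
from dataclasses import dataclass

@dataclass
class Arc:
    word: str
    asr_score: float   # e.g. log-probability from the subword AM + LM

def rescore(arcs, morph_score, lam=0.3):
    """Interpolate the ASR score with a morphological score for every arc."""
    return [(a, (1 - lam) * a.asr_score + lam * morph_score(a.word)) for a in arcs]

# Toy usage with a stand-in morphological scorer.
arcs = [Arc("rāmaḥ", -1.2), Arc("rāmam", -1.5)]
print(rescore(arcs, morph_score=lambda w: 0.0 if w.endswith("ḥ") else -2.0))
```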
{"title":"Linguistically Informed Post-processing for ASR Error correction in Sanskrit","authors":"Rishabh Kumar, D. Adiga, R. Ranjan, A. Krishna, Ganesh Ramakrishnan, Pawan Goyal, P. Jyothi","doi":"10.21437/interspeech.2022-11189","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11189","url":null,"abstract":"We propose an ASR system for Sanskrit, a low-resource language, that effectively combines subword tokenisation strategies and search space enrichment with linguistic information. More specifically, to address the challenges due to the high degree of out-of-vocabulary entries present in the language, we first use a subword-based language model and acoustic model to generate a search space. The search space, so obtained, is converted into a word-based search space and is further enriched with morphological and lexical information based on a shallow parser. Finally, the transitions in the search space are rescored using a supervised morphological parser proposed for Sanskrit. Our proposed approach currently reports the state-of-the-art results in Sanskrit ASR, with a 7.18 absolute point reduction in WER than the previous state-of-the-art.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2293-2297"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46613564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Bring dialogue-context into RNN-T for streaming ASR
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-697
Junfeng Hou, Jinkun Chen, Wanyu Li, Yufeng Tang, Jun Zhang, Zejun Ma
Recently, conversational end-to-end (E2E) automatic speech recognition (ASR) models, which directly integrate dialogue-context such as historical utterances into E2E models, have shown superior performance to single-utterance E2E models. However, few works have investigated how to inject the dialogue-context into the recurrent neural network transducer (RNN-T) model. In this work, we bring dialogue-context into a streaming RNN-T model and explore various structures of the contextual RNN-T model as well as training strategies to better utilize the dialogue-context. Firstly, we propose a deep fusion architecture which efficiently integrates the dialogue-context within the encoder and predictor of RNN-T. Secondly, we propose contextual & non-contextual model joint training as regularization, and propose context perturbation to relieve the context mismatch between training and inference. Moreover, we adopt a context-aware language model (CLM) for contextual RNN-T decoding to take full advantage of the dialogue-context for conversational ASR. We conduct experiments on the Switchboard-2000h task and observe performance gains from the proposed techniques. Compared with non-contextual RNN-T, our contextual RNN-T model yields 4.8% / 6.0% relative improvement on the Switchboard and Callhome Hub5'00 test sets. By additionally integrating a CLM, the gain is further increased to 10.6% / 7.8%.
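The "deep fusion" of dialogue-context into the predictor can be illustrated with a hedged PyTorch sketch: a context vector derived from history utterances is concatenated with the previous-label embedding before the prediction network. The dimensions and the concatenation scheme are assumptions; the paper explores several structures.

```python
import torch
import torch.nn as nn

class ContextualPredictor(nn.Module):
    def __init__(self, vocab=1000, emb=256, ctx_dim=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb + ctx_dim, hidden, batch_first=True)

    def forward(self, prev_labels, ctx):        # ctx: (batch, ctx_dim) from history utterances
        x = self.embed(prev_labels)             # (batch, U, emb)
        ctx = ctx.unsqueeze(1).expand(-1, x.size(1), -1)
        out, _ = self.rnn(torch.cat([x, ctx], dim=-1))
        return out                              # fed to the RNN-T joint network

pred = ContextualPredictor()
h = pred(torch.randint(0, 1000, (2, 5)), torch.randn(2, 256))
print(h.shape)  # torch.Size([2, 5, 512])
```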
{"title":"Bring dialogue-context into RNN-T for streaming ASR","authors":"Junfeng Hou, Jinkun Chen, Wanyu Li, Yufeng Tang, Jun Zhang, Zejun Ma","doi":"10.21437/interspeech.2022-697","DOIUrl":"https://doi.org/10.21437/interspeech.2022-697","url":null,"abstract":"Recently the conversational end-to-end (E2E) automatic speech recognition (ASR) models, which directly integrate dialogue-context such as historical utterances into E2E models, have shown superior performance than single-utterance E2E models. However, few works investigate how to inject the dialogue-context into the recurrent neural network transducer (RNN-T) model. In this work, we bring dialogue-context into a streaming RNN-T model and explore various structures of contextual RNN-T model as well as training strategies to better utilize the dialogue-context. Firstly, we propose a deep fusion architecture which efficiently integrates the dialogue-context within the encoder and predictor of RNN-T. Secondly, we propose contextual & non-contextual model joint training as regularization, and propose context perturbation to relieve the context mismatch between training and inference. Moreover, we adopt a context-aware language model (CLM) for contextual RNN-T decoding to take full advantage of the dialogue-context for conversational ASR. We conduct experiments on the Switchboard-2000h task and observe performance gains from the proposed techniques. Compared with non-contextual RNN-T, our contextual RNN-T model yields 4.8% / 6.0% relative improvement on Switchboard and Callhome Hub5’00 testsets. By additionally integrating a CLM, the gain is further increased to 10.6% / 7.8%.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2048-2052"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46695012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
MISRNet: Lightweight Neural Vocoder Using Multi-Input Single Shared Residual Blocks
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11152
Takuhiro Kaneko, H. Kameoka, Kou Tanaka, Shogo Seki
Neural vocoders have recently become popular in text-to-speech synthesis and voice conversion, increasing the demand for efficient neural vocoders. One successful approach is HiFi-GAN, which achieves high-fidelity audio synthesis using a relatively small model. This characteristic is obtained using a generator incorporating multi-receptive field fusion (MRF) with multiple branches of residual blocks, allowing the expansion of the description capacity with few-channel convolutions. However, MRF requires the model size to increase with the number of branches. As an alternative, we propose a network called MISRNet, which incorporates a novel module called the multi-input single shared residual block (MISR). MISR enlarges the description capacity by enriching the input variation using lightweight convolutions with a kernel size of 1 and, in turn, reduces the variation of residual blocks from multiple to single. Because the model size of the input convolutions is significantly smaller than that of the residual blocks, MISR reduces the model size compared with MRF. Furthermore, we introduce an implementation technique for MISR, in which we accelerate the processing speed by adopting tensor reshaping. We experimentally applied our ideas to lightweight variants of HiFi-GAN and iSTFTNet, making the models more lightweight with comparable speech quality and without compromising speed.
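A rough sketch of the MISR idea as stated in the abstract is shown below: several kernel-size-1 convolutions create cheap input variants, and a single shared residual block processes all of them instead of one residual block per branch. The channel counts and the combination rule are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MISRBlock(nn.Module):
    def __init__(self, channels=64, n_inputs=3, kernel=3, dilation=1):
        super().__init__()
        # Lightweight kernel-size-1 convolutions enrich the input variation.
        self.inputs = nn.ModuleList([nn.Conv1d(channels, channels, 1)
                                     for _ in range(n_inputs)])
        # One residual block shared by all input variants.
        pad = (kernel - 1) // 2 * dilation
        self.shared = nn.Sequential(
            nn.LeakyReLU(0.1),
            nn.Conv1d(channels, channels, kernel, padding=pad, dilation=dilation),
        )

    def forward(self, x):                        # x: (batch, channels, time)
        out = x
        for proj in self.inputs:
            out = out + self.shared(proj(out))   # residual connection around the shared block
        return out

y = MISRBlock()(torch.randn(2, 64, 128))
print(y.shape)  # torch.Size([2, 64, 128])
```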
{"title":"MISRNet: Lightweight Neural Vocoder Using Multi-Input Single Shared Residual Blocks","authors":"Takuhiro Kaneko, H. Kameoka, Kou Tanaka, Shogo Seki","doi":"10.21437/interspeech.2022-11152","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11152","url":null,"abstract":"Neural vocoders have recently become popular in text-to-speech synthesis and voice conversion, increasing the demand for efficient neural vocoders. One successful approach is HiFi-GAN, which archives high-fidelity audio synthesis using a relatively small model. This characteristic is obtained using a generator incorporating multi-receptive field fusion (MRF) with multiple branches of residual blocks, allowing the expansion of the description capacity with few-channel convolutions. How-ever, MRF requires the model size to increase with the number of branches. Alternatively, we propose a network called MISRNet , which incorporates a novel module called multi-input single shared residual block (MISR) . MISR enlarges the description capacity by enriching the input variation using lightweight convolutions with a kernel size of 1 and, alternatively, reduces the variation of residual blocks from multiple to single. Because the model size of the input convolutions is significantly smaller than that of the residual blocks, MISR reduces the model size compared with that of MRF. Furthermore, we introduce an implementation technique for MISR, where we accelerate the processing speed by adopting tensor reshaping. We experimentally applied our ideas to lightweight variants of HiFi-GAN and iSTFTNet, making the models more lightweight with comparable speech quality and without compromising speed. 1","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"1631-1635"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41526941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Objective Metrics to Evaluate Residual-Echo Suppression During Double-Talk in the Stereophonic Case
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-673
Amir Ivry, I. Cohen, B. Berdugo
Speech quality is most accurately assessed by subjective human ratings. The objective acoustic echo cancellation mean opinion score (AECMOS) metric was recently introduced and achieved high accuracy in predicting human perception during double-talk. Residual-echo suppression (RES) systems, however, employ the signal-to-distortion ratio (SDR) metric to quantify speech quality in double-talk. In this study, we focus on stereophonic acoustic echo cancellation and show that the stereo SDR (SSDR) correlates poorly with subjective human ratings according to the AECMOS, since the SSDR is influenced by both distortion of the desired speech and the presence of residual echo. We introduce a pair of objective metrics that distinctly assess the stereo desired-speech maintained level (SDSML) and stereo residual-echo suppression level (SRESL) during double-talk. By employing a tunable RES system based on deep learning and using 100 hours of real and simulated recordings, the SDSML and SRESL metrics show high correlation with the AECMOS across various setups. We also investigate how the design parameter governs the SDSML-SRESL tradeoff, and harness this relation to allow optimal performance for frequently-changing user demands in practical cases.
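For context, the SDR-style quantity the abstract refers to can be computed as in the scale-invariant variant sketched below; the paper's own SDSML and SRESL definitions are given in the paper itself, not here, and the snippet only illustrates the general idea of scoring an estimate against a clean reference.

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference                  # reference scaled to match the estimate
    noise = estimate - target                   # everything that is not the target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

ref = np.random.randn(16000)
print(si_sdr(ref, ref + 0.1 * np.random.randn(16000)))  # high value: estimate close to reference
```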
{"title":"Objective Metrics to Evaluate Residual-Echo Suppression During Double-Talk in the Stereophonic Case","authors":"Amir Ivry, I. Cohen, B. Berdugo","doi":"10.21437/interspeech.2022-673","DOIUrl":"https://doi.org/10.21437/interspeech.2022-673","url":null,"abstract":"Speech quality, as evaluated by humans, is most accurately as-sessed by subjective human ratings. The objective acoustic echo cancellation mean opinion score (AECMOS) metric was re-cently introduced and achieved high accuracy in predicting human perception during double-talk. Residual-echo suppression (RES) systems, however, employ the signal-to-distortion ratio (SDR) metric to quantify speech-quality in double-talk. In this study, we focus on stereophonic acoustic echo cancellation, and show that the stereo SDR (SSDR) poorly correlates with subjective human ratings according to the AECMOS, since the SSDR is influenced by both distortion of desired speech and presence of residual-echo. We introduce a pair of objective metrics that distinctly assess the stereo desired-speech maintained level (SDSML) and stereo residual-echo suppression level (SRESL) during double-talk. By employing a tunable RES system based on deep learning and using 100 hours of real and simulated recordings, the SDSML and SRESL metrics show high correlation with the AECMOS across various setups. We also investi-gate into how the design parameter governs the SDSML-SRESL tradeoff, and harness this relation to allow optimal performance for frequently-changing user demands in practical cases.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"5348-5352"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42839527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Incremental Layer-Wise Self-Supervised Learning for Efficient Unsupervised Speech Domain Adaptation On Device
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10904
Zhouyuan Huo, DongSeon Hwang, K. Sim, Shefali Garg, Ananya Misra, Nikhil Siddhartha, Trevor Strohman, F. Beaufays
Streaming end-to-end speech recognition models have been widely applied to mobile devices and show significant improvement in efficiency. These models are typically trained on the server using transcribed speech data. However, the server data distribution can be very different from the data distribution on user devices, which could affect the model performance. There are two main challenges for on-device training: limited reliable labels and limited training memory. While self-supervised learning algorithms can mitigate the mismatch between domains using unlabeled data, they are not directly applicable on mobile devices because of the memory constraint. In this paper, we propose an incremental layer-wise self-supervised learning algorithm for efficient unsupervised speech domain adaptation on mobile devices, in which only one layer is updated at a time. Extensive experimental results demonstrate that the proposed algorithm achieves a 24.2% relative Word Error Rate (WER) improvement on the target domain compared to a supervised baseline and costs 95.7% less training memory than the end-to-end self-supervised learning algorithm.
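The memory-saving mechanism, updating a single layer at a time, can be illustrated with the sketch below. The round-robin schedule, the stand-in encoder, and the dummy self-supervised loss are assumptions for illustration only, not the paper's algorithmic details.

```python
import torch
import torch.nn as nn

model = nn.Sequential(*[nn.Linear(80, 80) for _ in range(6)])  # stand-in encoder layers

def adapt_one_layer(layer_idx, batches, ssl_loss):
    """Update only the parameters of layer `layer_idx`; everything else stays frozen."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model[layer_idx].parameters():
        p.requires_grad = True
    opt = torch.optim.SGD(model[layer_idx].parameters(), lr=1e-3)
    for x in batches:
        opt.zero_grad()
        ssl_loss(model(x)).backward()   # self-supervised loss on unlabeled device speech
        opt.step()

# Round-robin over layers: each adaptation step touches a single layer.
for step in range(12):
    batch = [torch.randn(4, 80)]
    adapt_one_layer(step % len(model), batch, ssl_loss=lambda y: y.pow(2).mean())
```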
{"title":"Incremental Layer-Wise Self-Supervised Learning for Efficient Unsupervised Speech Domain Adaptation On Device","authors":"Zhouyuan Huo, DongSeon Hwang, K. Sim, Shefali Garg, Ananya Misra, Nikhil Siddhartha, Trevor Strohman, F. Beaufays","doi":"10.21437/interspeech.2022-10904","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10904","url":null,"abstract":"Streaming end-to-end speech recognition models have been widely applied to mobile devices and show significant improvement in efficiency. These models are typically trained on the server using transcribed speech data. However, the server data distribution can be very different from the data distribution on user devices, which could affect the model performance. There are two main challenges for on device training, limited reliable labels and limited training memory. While self-supervised learning algorithms can mitigate the mismatch between domains using unlabeled data, they are not applicable on mobile devices directly because of the memory constraint. In this paper, we propose an incremental layer-wise self-supervised learning algorithm for efficient unsupervised speech domain adaptation on mobile devices, in which only one layer is updated at a time. Extensive experimental results demonstrate that the proposed algorithm achieves a 24 . 2% relative Word Error Rate (WER) improvement on the target domain compared to a supervised baseline and costs 95 . 7% less training memory than the end-to-end self-supervised learning algorithm.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4845-4849"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41326867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Gated Convolutional Fusion for Time-Domain Target Speaker Extraction Network
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-961
Wenjing Liu, Chuan Xie
Target speaker extraction aims to extract the target speaker's voice from mixed utterances based on auxiliary reference speech of the target speaker. A speaker embedding is usually extracted from the reference speech and fused with the learned acoustic representation. The majority of existing works perform simple operation-based fusion such as concatenation. However, potential cross-modal correlation may not be effectively explored by this naive approach, which directly fuses the speaker embedding into the acoustic representation. In this work, we propose a gated convolutional fusion approach that explores global conditional modeling and a trainable gating mechanism for learning sophisticated interaction between the speaker embedding and the acoustic representation. Experiments on the WSJ0-2mix-extr dataset prove the efficacy of the proposed fusion approach, which performs favorably against other fusion methods with considerable improvement in terms of SDRi and SI-SDRi. Moreover, our method can be flexibly incorporated into similar time-domain speaker extraction networks to attain better performance.
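A minimal sketch of gated convolutional fusion under assumed details (the paper's exact gating design may differ): a 1-D convolution over the concatenation of acoustic features and a broadcast speaker embedding produces a sigmoid gate that modulates the acoustic representation.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, feat_dim=256, spk_dim=256):
        super().__init__()
        self.gate_conv = nn.Conv1d(feat_dim + spk_dim, feat_dim, kernel_size=1)

    def forward(self, acoustic, spk_emb):        # acoustic: (B, feat, T), spk_emb: (B, spk)
        spk = spk_emb.unsqueeze(-1).expand(-1, -1, acoustic.size(-1))
        gate = torch.sigmoid(self.gate_conv(torch.cat([acoustic, spk], dim=1)))
        return gate * acoustic                   # speaker-conditioned gating of each channel

fused = GatedFusion()(torch.randn(2, 256, 100), torch.randn(2, 256))
print(fused.shape)  # torch.Size([2, 256, 100])
```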
{"title":"Gated Convolutional Fusion for Time-Domain Target Speaker Extraction Network","authors":"Wenjing Liu, Chuan Xie","doi":"10.21437/interspeech.2022-961","DOIUrl":"https://doi.org/10.21437/interspeech.2022-961","url":null,"abstract":"Target speaker extraction aims to extract the target speaker’s voice from mixed utterances based on auxillary reference speech of the target speaker. A speaker embedding is usually extracted from the reference speech and fused with the learned acoustic representation. The majority of existing works perform simple operation-based fusion of concatenation. However, potential cross-modal correlation may not be effectively explored by this naive approach that directly fuse the speaker embedding into the acoustic representation. In this work, we propose a gated convolutional fusion approach by exploring global conditional modeling and trainable gating mechanism for learning so-phisticated interaction between speaker embedding and acoustic representation. Experiments on WSJ0-2mix-extr dataset proves the efficacy of the proposed fusion approach, which performs favorably against other fusion methods with considerable improvement in terms of SDRi and SI-SDRi. Moreover, our method can be flexibly incorporated into similar time-domain speaker extraction networks to attain better performance.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"5368-5372"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49331936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
L2-GEN: A Neural Phoneme Paraphrasing Approach to L2 Speech Synthesis for Mispronunciation Diagnosis
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-209
Dan Zhang, Ashwinkumar Ganesan, Sarah Campbell
In this paper, we study the problem of generating mispronounced speech mimicking non-native (L2) speakers learning English as a Second Language (ESL) for the mispronunciation detection and diagnosis (MDD) task. The paper is motivated by the widely observed yet not well addressed data sparsity issue in MDD research, where both L2 speech audio and its fine-grained phonetic annotations are difficult to obtain, leading to unsatisfactory mispronunciation feedback accuracy. We propose L2-GEN, a new data augmentation framework to generate L2 phoneme sequences that capture realistic mispronunciation patterns by devising a unique machine translation-based sequence paraphrasing model. A novel diversified and preference-aware decoding algorithm is proposed to generalize L2-GEN to handle both unseen words and new learner populations with very limited L2 training data. A contrastive augmentation technique is further designed to optimize MDD performance improvements with the generated synthetic L2 data. We evaluate L2-GEN on the public L2-ARCTIC and SpeechOcean762 datasets. The results show that L2-GEN leads to up to 3.9% and 5.0% MDD F1-score improvements in in-domain and out-of-domain scenarios, respectively.
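The sampling-based generation of varied mispronunciation patterns can be pictured with the hypothetical sketch below. The stand-in per-step distribution, the tiny phoneme set, and the temperature are assumptions and do not reflect the paper's trained paraphrasing model or its diversified, preference-aware decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
PHONES = ["th", "s", "t", "d", "ih", "iy"]

def fake_step_distribution(canonical_phone):
    """Stand-in for the paraphrasing model's per-step output distribution."""
    probs = np.full(len(PHONES), 0.05)
    probs[PHONES.index(canonical_phone)] = 0.75   # mostly keep the canonical phone
    return probs / probs.sum()

def sample_l2_sequence(canonical, temperature=1.2):
    out = []
    for ph in canonical:
        p = fake_step_distribution(ph) ** (1.0 / temperature)
        out.append(rng.choice(PHONES, p=p / p.sum()))  # sample instead of greedy decoding
    return out

print(sample_l2_sequence(["th", "ih", "s"]))  # one possible sampled L2 variant
```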
{"title":"L2-GEN: A Neural Phoneme Paraphrasing Approach to L2 Speech Synthesis for Mispronunciation Diagnosis","authors":"Dan Zhang, Ashwinkumar Ganesan, Sarah Campbell","doi":"10.21437/interspeech.2022-209","DOIUrl":"https://doi.org/10.21437/interspeech.2022-209","url":null,"abstract":"In this paper, we study the problem of generating mispronounced speech mimicking non-native (L2) speakers learning English as a Second Language (ESL) for the mispronunciation detection and diagnosis (MDD) task. The paper is motivated by the widely observed yet not well addressed data sparsity is-sue in MDD research where both L2 speech audio and its fine-grained phonetic annotations are difficult to obtain, leading to unsatisfactory mispronunciation feedback accuracy. We pro-pose L2-GEN, a new data augmentation framework to generate L2 phoneme sequences that capture realistic mispronunciation patterns by devising an unique machine translation-based sequence paraphrasing model. A novel diversified and preference-aware decoding algorithm is proposed to generalize L2-GEN to handle both unseen words and new learner population with very limited L2 training data. A contrastive augmentation technique is further designed to optimize MDD performance improvements with the generated synthetic L2 data. We evaluate L2-GEN on public L2-ARCTIC and SpeechOcean762 datasets. The results have shown that L2-GEN leads to up to 3.9%, and 5.0% MDD F1-score improvements in in-domain and out-of-domain scenarios respectively.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4317-4321"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49388203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
ORCA-WHISPER: An Automatic Killer Whale Sound Type Generation Toolkit Using Deep Learning
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-846
Christian Bergler, Alexander Barnhill, Dominik Perrin, M. Schmitt, A. Maier, E. Nöth
Even today, the current understanding and interpretation of animal-specific vocalization paradigms is largely based on historical and manual data analysis of comparatively small data corpora, primarily because of time and human-resource limitations, next to the scarcity of available species-related machine-learning techniques. Partial human-based data inspections neither represent the overall real-world vocal repertoire, nor the variations within intra- and inter-animal-specific call type portfolios, typically resulting only in small collections of category-specific ground truth data. Modern machine (deep) learning concepts are an essential requirement to identify statistically significant animal-related vocalization patterns within massive bioacoustic data archives. However, the applicability of pure supervised training approaches is challenging, due to limited call-specific ground truth data, combined with strong class imbalances between individual call type events. The current study is the first to present a deep bioacoustic signal generation framework, entitled ORCA-WHISPER, a Generative Adversarial Network (GAN) trained on low-resource killer whale (Orcinus orca) call type data. Besides audiovisual inspection, supervised call type classification, and model transferability, the auspicious quality of generated fake vocalizations was further demonstrated by visualizing, representing, and enhancing the real-world orca signal data manifold. Moreover, previous orca/noise segmentation results were outperformed by integrating fake signals into the original data partition.
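As background, a generic GAN training step on spectrogram-shaped data looks like the sketch below. This is not the ORCA-WHISPER architecture; all shapes, layer sizes, and the flattened-spectrogram representation are placeholders.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 64 * 64), nn.Tanh())
D = nn.Sequential(nn.Linear(64 * 64, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(8, 64 * 64)       # stand-in for real call-type spectrogram patches
z = torch.randn(8, 100)             # latent noise

# Discriminator step: push real samples towards 1, generated samples towards 0.
opt_d.zero_grad()
d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(G(z).detach()), torch.zeros(8, 1))
d_loss.backward()
opt_d.step()

# Generator step: try to make the discriminator label fakes as real.
opt_g.zero_grad()
g_loss = bce(D(G(z)), torch.ones(8, 1))
g_loss.backward()
opt_g.step()
```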
{"title":"ORCA-WHISPER: An Automatic Killer Whale Sound Type Generation Toolkit Using Deep Learning","authors":"Christian Bergler, Alexander Barnhill, Dominik Perrin, M. Schmitt, A. Maier, E. Nöth","doi":"10.21437/interspeech.2022-846","DOIUrl":"https://doi.org/10.21437/interspeech.2022-846","url":null,"abstract":"Even today, the current understanding and interpretation of animal-specific vocalization paradigms is largely based on his-torical and manual data analysis considering comparatively small data corpora, primarily because of time- and human-resource limitations, next to the scarcity of available species-related machine-learning techniques. Partial human-based data inspections neither represent the overall real-world vocal reper-toire, nor the variations within intra- and inter animal-specific call type portfolios, typically resulting only in small collections of category-specific ground truth data. Modern machine (deep) learning concepts are an essential requirement to identify sta-tistically significant animal-related vocalization patterns within massive bioacoustic data archives. However, the applicability of pure supervised training approaches is challenging, due to limited call-specific ground truth data, combined with strong class-imbalances between individual call type events. The current study is the first presenting a deep bioacoustic signal generation framework, entitled ORCA-WHISPER, a Generative Adversarial Network (GAN), trained on low-resource killer whale ( Orcinus Orca ) call type data. Besides audiovisual in-spection, supervised call type classification, and model transferability, the auspicious quality of generated fake vocalizations was further demonstrated by visualizing, representing, and en-hancing the real-world orca signal data manifold. Moreover, previous orca/noise segmentation results were outperformed by integrating fake signals to the original data partition.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2413-2417"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49492375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Predicting label distribution improves non-intrusive speech quality estimation
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11186
A. Faridee, H. Gamper
{"title":"Predicting label distribution improves non-intrusive speech quality estimation","authors":"A. Faridee, H. Gamper","doi":"10.21437/interspeech.2022-11186","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11186","url":null,"abstract":"","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"406-410"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49515191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1