
2022 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

TEA-PSE 2.0: Sub-Band Network for Real-Time Personalized Speech Enhancement
Pub Date : 2023-01-09 DOI: 10.1109/SLT54892.2023.10023174
Yukai Ju, Shimin Zhang, Wei Rao, Yannan Wang, Tao Yu, Lei Xie, Shidong Shang
Personalized speech enhancement (PSE) utilizes additional cues, such as speaker embeddings, to remove background noise and interfering speech and extract the speech of the target speaker. Previous work, the Tencent-Ethereal-Audio-Lab personalized speech enhancement (TEA-PSE) system, ranked 1st in the ICASSP 2022 deep noise suppression (DNS2022) challenge. In this paper, we extend TEA-PSE to its sub-band version, TEA-PSE 2.0, to reduce computational complexity and further improve performance. Specifically, we adopt finite impulse response filter banks and spectrum splitting to reduce computational complexity. We introduce a time-frequency convolution module (TFCM) into the system to increase the receptive field with small convolution kernels. Besides, we explore several training strategies to optimize the two-stage network and investigate various loss functions for the PSE task. TEA-PSE 2.0 significantly outperforms TEA-PSE in both speech enhancement performance and computational complexity. Experimental results on the DNS2022 blind test set show that TEA-PSE 2.0 brings a 0.102 OVRL personalized DNSMOS improvement with only 21.9% of the multiply-accumulate operations of the previous TEA-PSE.
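To make the TFCM idea concrete, here is a minimal PyTorch sketch of a time-frequency convolution module: stacked depthwise 2D convolutions with small kernels whose time dilation doubles from block to block, so the receptive field grows without large kernels. The channel count, kernel size, and dilation schedule are illustrative assumptions, not the values used in TEA-PSE 2.0.

```python
import torch
import torch.nn as nn

class TFCMBlock(nn.Module):
    """One residual block: depthwise 3x3 conv, dilated along time, then 1x1 mix."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=(3, 3),
                      padding=(dilation, 1), dilation=(dilation, 1),
                      groups=channels),                  # depthwise, time-dilated
            nn.BatchNorm2d(channels),
            nn.PReLU(),
            nn.Conv2d(channels, channels, kernel_size=1),  # pointwise channel mix
        )

    def forward(self, x):
        return x + self.net(x)                           # residual connection

class TFCM(nn.Module):
    """Stack of blocks with dilation 1, 2, 4, ... to enlarge the receptive field."""
    def __init__(self, channels: int = 16, n_blocks: int = 4):
        super().__init__()
        self.blocks = nn.Sequential(
            *[TFCMBlock(channels, 2 ** i) for i in range(n_blocks)]
        )

    def forward(self, x):                                # x: (batch, ch, time, freq)
        return self.blocks(x)

feats = torch.randn(1, 16, 100, 64)                      # dummy (B, C, T, F) features
print(TFCM()(feats).shape)                               # torch.Size([1, 16, 100, 64])
```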
Citations: 9
Context-Aware Neural Confidence Estimation for Rare Word Speech Recognition
Pub Date : 2023-01-09 DOI: 10.1109/SLT54892.2023.10023411
David Qiu, Tsendsuren Munkhdalai, Yanzhang He, K. Sim
Confidence estimation for automatic speech recognition (ASR) is important for many downstream tasks. Recently, neural confidence estimation models (CEMs) have been shown to produce accurate confidence scores for predicting word-level errors. These models are built on top of an end-to-end (E2E) ASR system, and the acoustic embeddings are part of the input features. However, practical E2E ASR systems often incorporate contextual information in the decoder to improve rare word recognition. The CEM is not aware of this and underestimates the confidence of rare words that have been corrected by the context. In this paper, we propose a context-aware CEM that incorporates context into the encoder using a neural associative memory (NAM) model. It uses attention to detect the presence of biasing phrases and modify the encoder features. Experiments show that the proposed context-aware CEM with NAM-augmented training improves the AUC-ROC for word error prediction from 0.837 to 0.892.
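The evaluation metric quoted here, AUC-ROC for word error prediction, can be computed directly from word-level confidence scores and binary correctness labels. A minimal sketch with dummy values follows; the scores and labels are stand-ins, not data from the paper.

```python
from sklearn.metrics import roc_auc_score

word_is_error = [0, 0, 1, 0, 1, 0, 0, 1]        # 1 = misrecognized word
confidence    = [0.97, 0.91, 0.35, 0.88, 0.42, 0.93, 0.76, 0.22]

# Low confidence should predict errors, so score each word by (1 - confidence).
auc = roc_auc_score(word_is_error, [1 - c for c in confidence])
print(f"word-error AUC-ROC: {auc:.3f}")
```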
Citations: 1
Dual Learning for Large Vocabulary On-Device ASR
Pub Date : 2023-01-09 DOI: 10.1109/SLT54892.2023.10023407
Cal Peyser, W. R. Huang, Tara N. Sainath, Rohit Prabhavalkar, M. Picheny, K. Cho
Dual learning is a paradigm for semi-supervised machine learning that seeks to leverage unsupervised data by solving two opposite tasks at once. In this scheme, each model is used to generate pseudo-labels for unlabeled examples that are used to train the other model. Dual learning has seen some use in speech processing by pairing ASR and TTS as dual tasks. However, these results mostly address only the case of using unpaired examples to compensate for very small supervised datasets, and mostly with large, non-streaming models. Dual learning has not yet been proven effective at using unsupervised data to improve realistic on-device streaming models that are already trained on large supervised corpora. We provide this missing piece through an analysis of an on-device-sized streaming conformer trained on the entirety of LibriSpeech, showing relative WER improvements of 10.7%/5.2% without an LM and 11.7%/16.4% with an LM.
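A schematic sketch of one dual-learning update pairing ASR and TTS is shown below; the `asr` and `tts` objects, with their `transcribe`, `synthesize`, and `train_step` methods, are hypothetical stand-ins for illustration, not the interfaces of the models used in the paper.

```python
# Schematic dual-learning step: each model pseudo-labels unlabeled data for
# the other. The asr/tts objects and their methods are hypothetical.
def dual_learning_step(asr, tts, unpaired_audio, unpaired_text):
    # ASR pseudo-labels unlabeled audio; TTS trains on (pseudo-text -> audio).
    pseudo_text = asr.transcribe(unpaired_audio)
    tts_loss = tts.train_step(text=pseudo_text, target_audio=unpaired_audio)

    # TTS pseudo-labels unpaired text; ASR trains on (synthetic audio -> text).
    pseudo_audio = tts.synthesize(unpaired_text)
    asr_loss = asr.train_step(audio=pseudo_audio, target_text=unpaired_text)

    return asr_loss, tts_loss
```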
Citations: 1
Multi-Stage Progressive Audio Bandwidth Extension
Pub Date : 2023-01-09 DOI: 10.1109/SLT54892.2023.10022989
Liang Wen, Lizhong Wang, Y. Zhang, K. Choi
Audio bandwidth extension can enhance subjective sound quality by increasing the bandwidth of the audio signal. This paper presents a novel multi-stage progressive method for time-domain causal bandwidth extension. Each stage of the progressive model contains a lightweight scale-up module that generates the high-frequency signal and a supervised attention module that guides feature propagation between stages. A time-frequency two-step training method with a weighted loss over the progressive outputs is adopted so that bandwidth extension performance improves from stage to stage. Test results show that the multi-stage model improves both objective results and perceptual quality progressively. The multi-stage progressive model makes bandwidth extension performance adjustable according to energy consumption, computing capacity, and user preferences.
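A minimal sketch of a weighted loss over progressive stage outputs follows, assuming an L1 per-stage loss and hand-picked stage weights; both are illustrative assumptions, not the paper's exact loss or weighting.

```python
import torch
import torch.nn.functional as F

def progressive_loss(stage_outputs, target, weights=(0.2, 0.3, 0.5)):
    # Each stage's output is scored against the full-band target; later
    # stages get larger weights, pushing quality to improve stage by stage.
    assert len(stage_outputs) == len(weights)
    return sum(w * F.l1_loss(out, target)
               for w, out in zip(weights, stage_outputs))

stage_outputs = [torch.randn(1, 16000) for _ in range(3)]  # dummy waveforms
target = torch.randn(1, 16000)
print(progressive_loss(stage_outputs, target))
```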
Citations: 0
Panel Discussion
Pub Date : 2023-01-09 DOI: 10.1109/slt54892.2023.10022919
{"title":"Panel Discussion","authors":"","doi":"10.1109/slt54892.2023.10022919","DOIUrl":"https://doi.org/10.1109/slt54892.2023.10022919","url":null,"abstract":"","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135062009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Accelerator-Aware Training for Transducer-Based Speech Recognition
Pub Date : 2023-01-09 DOI: 10.1109/SLT54892.2023.10022592
Suhaila M. Shakiah, R. Swaminathan, H. Nguyen, Raviteja Chinta, Tariq Afzal, Nathan Susanj, A. Mouchtaris, Grant P. Strimel, A. Rastrow
Machine learning model weights and activations are represented in full precision during training. This leads to performance degradation at runtime when models are deployed on neural network accelerator (NNA) chips, which leverage highly parallelized fixed-point arithmetic to improve runtime memory and latency. In this work, we replicate the NNA operators during the training phase, accounting in back-propagation for the degradation due to low-precision inference on the NNA. Our proposed method efficiently emulates NNA operations, thus foregoing the need to transfer quantization error-prone data to the Central Processing Unit (CPU), ultimately reducing the user-perceived latency (UPL). We apply our approach to the Recurrent Neural Network-Transducer (RNN-T), an attractive architecture for on-device streaming speech recognition tasks. We train and evaluate models on 270K hours of English data and show a 5-7% improvement in engine latency while avoiding up to 10% relative degradation in WER.
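The core mechanism, emulating low-precision NNA arithmetic inside the training graph, is commonly realized as fake quantization with a straight-through estimator; a minimal sketch follows. The 8-bit symmetric per-tensor scheme here is an illustrative assumption, not the NNA's actual number format.

```python
import torch

def fake_quantize(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    # Symmetric per-tensor quantization: round to a fixed-point grid, then
    # dequantize, so the forward pass sees low-precision values.
    qmax = 2 ** (bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: forward uses quantized values, while the
    # backward pass treats the rounding as identity.
    return x + (q - x).detach()

w = torch.randn(4, 4, requires_grad=True)
y = fake_quantize(w).sum()
y.backward()                     # gradients flow as if no rounding happened
print(w.grad)
```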
Citations: 0
Exploiting Information From Native Data for Non-Native Automatic Pronunciation Assessment
Pub Date : 2023-01-09 DOI: 10.1109/SLT54892.2023.10022486
Binghuai Lin, Liyuan Wang
This paper proposes an end-to-end pronunciation assessment method that exploits abundant native data and reduces the need for non-native data, which is costly to label. To obtain discriminative acoustic representations at the phoneme level, the pretrained wav2vec 2.0 is re-trained with a connectionist temporal classification (CTC) loss for phoneme recognition using native data. These acoustic representations are fused with phoneme representations derived from a phoneme encoder to obtain the final pronunciation scores. An efficient fusion mechanism aligns each phoneme with the acoustic frames based on attention, where all blank frames recognized by the CTC-based phoneme recognizer are masked. Finally, the whole network is optimized by a multi-task learning framework combining the CTC loss and the mean square error loss between predicted and human scores. Extensive experiments demonstrate that it outperforms previous baselines in Pearson correlation coefficient even with much less labeled non-native data.
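A minimal sketch of the CTC objective on phoneme targets, using PyTorch's built-in CTCLoss; the tensor shapes and vocabulary size are dummy values, and the wav2vec 2.0 encoder that would produce the log-probabilities is omitted.

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

T, B, V = 50, 2, 40                        # frames, batch, phoneme vocab (incl. blank)
log_probs = torch.randn(T, B, V).log_softmax(-1)   # stand-in for encoder outputs
targets = torch.randint(1, V, (B, 12))     # phoneme label sequences (no blanks)

loss = ctc(log_probs, targets,
           input_lengths=torch.full((B,), T),
           target_lengths=torch.full((B,), 12))
print(loss.item())
```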
Citations: 6
Flow-ER: A Flow-Based Embedding Regularization Strategy for Robust Speech Representation Learning
Pub Date : 2023-01-09 DOI: 10.1109/SLT54892.2023.10022986
Woohyun Kang, J. Alam, A. Fathan
In recent years, various deep learning-based embedding methods have been proposed. Although deep learning-based embedding extraction methods have shown good performance in numerous tasks, including speaker verification, language identification, and anti-spoofing, their performance is limited under mismatched conditions because the embeddings carry variability unrelated to the main task. To alleviate this problem, we propose a novel training strategy that regularizes the embedding network to retain minimal information about nuisance attributes. To achieve this, our proposed method directly incorporates the information bottleneck scheme into the training process, where the mutual information is estimated using an auxiliary normalizing flow network. The performance of the proposed method is evaluated on different speech processing tasks and is found to improve over the standard training strategy in all experiments.
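Schematically, the resulting objective is the task loss plus a weighted mutual-information penalty between embeddings and nuisance attributes. The sketch below assumes a hypothetical `mi_estimator` callable standing in for the auxiliary normalizing flow network, and the weight `beta` is an illustrative choice.

```python
def flow_er_loss(task_loss, embeddings, nuisance_labels, mi_estimator, beta=0.1):
    # Information-bottleneck-style objective: penalize the estimated mutual
    # information I(embedding; nuisance). mi_estimator stands in for the
    # paper's normalizing-flow-based estimator; beta is an assumed weight.
    return task_loss + beta * mi_estimator(embeddings, nuisance_labels)
```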
Citations: 0
A Multi-Modal Array of Interpretable Features to Evaluate Language and Speech Patterns in Different Neurological Disorders
Pub Date : 2023-01-09 DOI: 10.1109/SLT54892.2023.10022435
A. Favaro, C. Motley, Tianyu Cao, Miguel Iglesias, A. Butala, E. Oh, R. Stevens, J. Villalba, N. Dehak, L. Moro-Velázquez
Speech-based automatic approaches for evaluating neurological disorders (NDs) depend on feature extraction before the classification pipeline. It is preferable for these features to be interpretable, to facilitate their development as diagnostic tools. This study focuses on the analysis of interpretable features obtained from the spoken responses of 88 subjects with NDs and controls (CN). Subjects with NDs have Alzheimer's disease (AD), Parkinson's disease (PD), or Parkinson's disease mimics (PDM). We configured three complementary sets of features related to cognition, speech, and language, and conducted a statistical analysis to examine which features differed between NDs and CN. Results suggested that features capturing response informativeness, reaction times, vocabulary richness, and syntactic complexity provided separability between AD and CN. Similarly, fundamental frequency variability helped differentiate PD from CN, while the number of salient informational units helped differentiate PDM from CN.
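As an illustration of the per-feature group comparison such a statistical analysis implies, the sketch below applies a nonparametric two-sample test to one interpretable feature. The choice of the Mann-Whitney U test and the numeric values are assumptions, not the paper's actual test or data.

```python
from scipy.stats import mannwhitneyu

# Hypothetical reaction times (seconds) for an ND group vs. controls.
ad_reaction_times = [1.9, 2.4, 2.1, 2.8, 2.5]
cn_reaction_times = [1.2, 1.4, 1.1, 1.5, 1.3]

# Nonparametric test of whether the two groups differ on this feature.
stat, p = mannwhitneyu(ad_reaction_times, cn_reaction_times,
                       alternative="two-sided")
print(f"U={stat:.1f}, p={p:.4f}")
```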
Citations: 4
Vsameter: Evaluation of a New Open-Source Tool to Measure Vowel Space Area and Related Metrics
Pub Date : 2023-01-09 DOI: 10.1109/SLT54892.2023.10022637
Tianyu Cao, L. Moro-Velázquez, Piotr Żelasko, J. Villalba, N. Dehak
Vowel space area (VSA) is an applicable metric for studying speech production deficits and intelligibility. Previous works suggest that the VSA accounts for almost 50% of the intelligibility variance, making it an essential component of global intelligibility estimates. However, almost no study publishes a tool to estimate VSA automatically with publicly available code. In this paper, we propose an open-source tool called VSAmeter to measure VSA and the vowel articulation index (VAI) automatically and validate it against the VSA and VAI obtained from a dataset in which the formants and phone segments have been annotated manually. The results show that the VSA and VAI values obtained by our proposed method strongly correlate with those computed from manually extracted F1 and F2 values and alignments. Such a method can be utilized in speech applications, e.g., the automatic measurement of VAI for the evaluation of speakers with dysarthria.
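For reference, both metrics have simple closed forms: corner-vowel VSA is the area of the vowel polygon in the F1-F2 plane (shoelace formula), and the vowel articulation index is commonly defined as VAI = (F2i + F1a) / (F1i + F1u + F2u + F2a). The sketch below uses illustrative formant values for the /i, a, u/ triangle; VSAmeter's exact implementation may differ.

```python
def vowel_space_area(points):
    # points: list of (F1, F2) corner-vowel formants; shoelace polygon area.
    n = len(points)
    area = 0.0
    for k in range(n):
        x1, y1 = points[k]
        x2, y2 = points[(k + 1) % n]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

# Illustrative corner-vowel formants in Hz (not measured data).
f1_i, f2_i = 300.0, 2300.0      # /i/
f1_a, f2_a = 750.0, 1300.0      # /a/
f1_u, f2_u = 320.0, 800.0       # /u/

vsa = vowel_space_area([(f1_i, f2_i), (f1_a, f2_a), (f1_u, f2_u)])
vai = (f2_i + f1_a) / (f1_i + f1_u + f2_u + f2_a)
print(f"VSA = {vsa:.0f} Hz^2, VAI = {vai:.2f}")
```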
Citations: 0