Interspeech: Latest Papers

Biometric Russian Audio-Visual Extended MASKS (BRAVE-MASKS) Corpus: Multimodal Mask Type Recognition Task
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10240
M. Markitantov, E. Ryumina, D. Ryumin, A. Karpov
In this paper, we present a new multimodal corpus called Biometric Russian Audio-Visual Extended MASKS (BRAVE-MASKS), which is designed to analyze voice and facial characteristics of persons wearing various masks, as well as to develop automatic systems for bimodal verification and identification of speakers. In particular, we tackle the multimodal mask type recognition task (6 classes). As a result, audio, visual and multimodal systems were developed, which showed UAR of 54.83%, 72.02% and 82.01%, respectively, on the Test set. These results serve as the baseline for the BRAVE-MASKS corpus, against which follow-up approaches can be compared.
Source: Interspeech, pp. 1756-1760
Citations: 3
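The BRAVE-MASKS entry above reports its baselines as UAR (unweighted average recall), which is the mean of the per-class recalls, so small classes count as much as large ones. A minimal sketch of that metric follows; the class names and labels are made up for illustration and are not taken from the corpus.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class weighs the same regardless of size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels (two mask classes only, for brevity).
y_true = ["none", "none", "none", "none", "ffp2", "ffp2"]
y_pred = ["none", "none", "none", "ffp2", "ffp2", "none"]

# Recall is 3/4 for "none" and 1/2 for "ffp2", so UAR = 0.625,
# whereas plain accuracy would be 4/6.
print(unweighted_average_recall(y_true, y_pred))
```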
Vector-quantized Variational Autoencoder for Phase-aware Speech Enhancement
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-443
Tuan Vu Ho, Q. Nguyen, M. Akagi, M. Unoki
Speech-enhancement methods based on the complex ideal ratio mask (cIRM) have achieved promising results. These methods often deploy a deep neural network to jointly estimate the real and imaginary components of the cIRM defined in the complex domain. However, the unbounded property of the cIRM poses difficulties when it comes to effectively training a neural network. To alleviate this problem, this paper proposes a phase-aware speech-enhancement method through estimating the magnitude and phase of a complex adaptive Wiener filter. With this method, a noise-robust vector-quantized variational autoencoder is used for estimating the magnitude of the Wiener filter by using the Itakura-Saito divergence in the time-frequency domain, while the phase of the Wiener filter is estimated using a convolutional recurrent network using the scale-invariant signal-to-noise-ratio constraint in the time domain. The proposed method was evaluated on the open Voice Bank+DEMAND dataset to provide a direct comparison with other speech-enhancement methods and achieved a Perceptual Evaluation of Speech Quality score of 2.85 and a Short-Time Objective Intelligibility score of 0.94, which is better than the state-of-the-art method based on cIRM estimation during the 2020 Deep Noise Challenge.
Source: Interspeech, pp. 176-180
Citations: 1
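The speech-enhancement entry above names two training criteria: the Itakura-Saito divergence on spectra and a scale-invariant signal-to-noise ratio (SI-SNR) on waveforms. The sketch below shows the textbook definitions of both in plain Python. It is not the authors' implementation; the function names, the `eps` stabiliser, and the toy signals are assumptions made for illustration.

```python
import math


def itakura_saito(p, q):
    """Itakura-Saito divergence between two strictly positive power spectra."""
    return sum(pi / qi - math.log(pi / qi) - 1.0 for pi, qi in zip(p, q))


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length waveforms."""
    # Remove the mean so that a DC offset does not affect the measure.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target; the residual is treated as noise.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    proj = [dot / t_energy * b for b in t]
    noise = [a - b for a, b in zip(e, proj)]
    num = sum(x * x for x in proj)
    den = sum(x * x for x in noise) + eps
    return 10.0 * math.log10(num / den + eps)


# Hypothetical toy waveforms.
target = [1.0, -1.0, 2.0, -2.0]
estimate = [1.1, -0.9, 2.2, -1.8]
louder = [2.0 * x for x in estimate]

# Rescaling the estimate leaves SI-SNR (almost exactly) unchanged.
print(si_snr(estimate, target), si_snr(louder, target))
```

The Itakura-Saito divergence is zero only when the two spectra are identical and, unlike a squared error, depends only on the ratio of the spectra.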
Speech Acoustics in Mild Cognitive Impairment and Parkinson's Disease With and Without Concurrent Drawing Tasks
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10772
Tanya Talkar, Christina Manxhari, James Williamson, Kara M. Smith, T. Quatieri
Parkinson’s disease (PD) is characterized by motor dysfunction; however, non-motor symptoms such as cognitive decline also have a dramatic impact on quality of life. Current assessments to diagnose cognitive impairment take many hours and require high clinician involvement. Thus, there is a need to develop new tools leading to quick and accurate determination of cognitive impairment to allow for appropriate, timely interventions. In this paper, individuals with PD, designated as either having no cognitive impairment (NCI) or mild cognitive impairment (MCI), undergo a speech-based protocol, involving reading or listing items within a category, performed either with or without a concurrent drawing task. From the speech recordings, we extract motor coordination-based features, derived from correlations across acoustic features representative of speech production subsystems. The correlation-based features are utilized in Gaussian mixture models to discriminate between individuals designated NCI or MCI in both the single and dual task paradigms. Features derived from the laryngeal and respiratory subsystems, in particular, discriminate between these two groups with AUCs > 0.80. These results suggest that cognitive impairment can be detected using speech from both single and dual task paradigms, and that cognitive impairment may manifest as differences in vocal fold vibration stability.
Source: Interspeech, pp. 2258-2262
Citations: 0
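The entry above summarises discrimination between the NCI and MCI groups as AUC values. As a reminder of what that number means, the sketch below computes the area under the ROC curve as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one (ties count half). The scores are invented for illustration; nothing here reproduces the study's models or data.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical classifier scores for the two groups.
mci_scores = [0.9, 0.8, 0.4]
nci_scores = [0.5, 0.3, 0.2]

# 8 of the 9 pairs are ordered correctly, so AUC = 8/9.
print(roc_auc(mci_scores, nci_scores))
```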
Speech2Slot: A Limited Generation Framework with Boundary Detection for Slot Filling from Speech
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11347
Pengwei Wang, Yinpei Su, Xiaohuan Zhou, Xin Ye, Liangchen Wei, Ming Liu, Yuan You, Feijun Jiang
Slot filling is an essential component of Spoken Language Understanding. In contrast to conventional pipeline approaches, which extract slots from the ASR output, end-to-end approaches directly get slots from speech within a classification or generation framework. However, classification relies on predefined categories, which is not scalable, and the generative model is decoding in an open-domain space, suffering from blurred boundaries of slots in speech. To address the shortcomings of these two formulations, we propose a new encoder-decoder framework for slot filling, named Speech2Slot, leveraging a limited generation method with boundary detection. We also released a large-scale Chinese spoken slot filling dataset named Voice Navigation Dataset in Chinese (VNDC). Experiments on VNDC show that our model is markedly superior to other approaches, outperforming the state-of-the-art slot filling approach with 6.65% accuracy improvement. We make our code publicly available for researchers to replicate and build on our work.
Source: Interspeech, pp. 2748-2752
Citations: 1
Cross-Modal Decision Regularization for Simultaneous Speech Translation
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10617
Mohd Abbas Zaidi, Beomseok Lee, Sangha Kim, Chanwoo Kim
Simultaneous translation systems start producing the output while processing the partial source sentence in the incoming input stream. These systems need to decide when to read more input and when to write the output. The decisions taken by the model depend on the structure of source/target language and the information contained in the partial input sequence. Hence, the read/write decision policy remains the same across different input modalities, i.e., speech and text. This motivates us to leverage the text transcripts corresponding to the speech input for improving simultaneous speech-to-text translation (SimulST). We propose Cross-Modal Decision Regularization (CMDR) to improve the decision policy of SimulST systems by using the simultaneous text-to-text translation (SimulMT) task. We also extend several techniques from the offline speech translation domain to explore the role of the SimulMT task in improving SimulST performance. Overall, we achieve 34.66% / 4.5 BLEU improvement over the baseline model across different latency regimes for the MuST-C English-German (EnDe) SimulST task.
Source: Interspeech, pp. 116-120
Citations: 3
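The entry above turns on a read/write decision policy: at each step a simultaneous system either consumes more source input or emits a target token. The abstract does not say which policy the authors use, so the sketch below swaps in the simplest fixed schedule, usually called wait-k, purely to make the read/write idea concrete. The function name and token counts are hypothetical.

```python
def wait_k_actions(num_source, num_target, k):
    """Return the READ/WRITE sequence of a fixed wait-k schedule."""
    actions = []
    read = written = 0
    while written < num_target:
        # Stay k source tokens ahead of the output until the source runs out.
        if read < min(written + k, num_source):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions


# With 4 source tokens, 4 target tokens and k = 2 this yields:
# READ READ WRITE READ WRITE READ WRITE WRITE
print(" ".join(wait_k_actions(4, 4, 2)))
```

A learned policy replaces the fixed rule inside the loop with a model decision, which is the part a regulariser such as CMDR would act on.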
Acoustic Representation Learning on Breathing and Speech Signals for COVID-19 Detection
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10376
Debottam Dutta, Debarpan Bhattacharya, Sriram Ganapathy, A. H. Poorjam, Deepak Mittal, M. Singh
In this paper, we describe an approach for representation learning of audio signals for the task of COVID-19 detection. The raw audio samples are processed with a bank of 1-D convolutional filters that are parameterized as cosine modulated Gaussian functions. The choice of these kernels allows the interpretation of the filterbanks as smooth band-pass filters. The filtered outputs are pooled, log-compressed and used in a self-attention based relevance weighting mechanism. The relevance weighting emphasizes the key regions of the time-frequency decomposition that are important for the downstream task. The subsequent layers of the model consist of a recurrent architecture and the models are trained for a COVID-19 detection task. In our experiments on the Coswara data set, we show that the proposed model achieves significant performance improvements over the baseline system as well as other representation learning approaches. Further, the approach proposed is shown to be uniformly applicable for speech and breathing signals and for transfer learning from a larger data set. Copyright © 2022 ISCA.
Source: Interspeech, pp. 2863-2867
Citations: 1
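The COVID-19 detection entry above describes 1-D convolution kernels parameterised as cosine-modulated Gaussian functions. A minimal sketch of one such kernel follows. The exact parameterisation in the paper is not given in the abstract, so the separate `center_hz` and `sigma_s` parameters, the function name, and the example values are assumptions.

```python
import math


def cos_gauss_kernel(center_hz, sigma_s, sample_rate, length):
    """Sample a cosine carrier at center_hz under a Gaussian envelope of width sigma_s."""
    half = length // 2
    kernel = []
    for i in range(length):
        t = (i - half) / sample_rate  # time in seconds, centred on zero
        carrier = math.cos(2.0 * math.pi * center_hz * t)
        envelope = math.exp(-(t * t) / (2.0 * sigma_s ** 2))
        kernel.append(carrier * envelope)
    return kernel


# Hypothetical settings: a 1 kHz filter at 16 kHz sampling, 101 taps.
kernel = cos_gauss_kernel(center_hz=1000.0, sigma_s=0.001, sample_rate=16000, length=101)
print(kernel[50])  # the centre tap equals 1.0
```

Because the Fourier transform of a Gaussian is again a Gaussian, modulating it with a cosine shifts that bump to the carrier frequency, which is why such kernels can be read as smooth band-pass filters.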
Japanese ASR-Robust Pre-trained Language Model with Pseudo-Error Sentences Generated by Grapheme-Phoneme Conversion
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-327
Yasuhito Ohsugi, Itsumi Saito, Kyosuke Nishida, Sen Yoshida
Spoken language understanding systems typically consist of a pipeline of automatic speech recognition (ASR) and natural language processing (NLP) modules. Although pre-trained language models (PLMs) have been successful in NLP by training on large corpora of written texts, spoken language with serious ASR errors that change its meaning is difficult to understand. We propose a method for pre-training Japanese LMs robust against ASR errors without using ASR. With the proposed method using written texts, sentences containing pseudo-ASR errors are generated using a pseudo-error dictionary constructed using grapheme-to-phoneme and phoneme-to-grapheme models based on neural networks. Experiments on spoken dialogue summarization showed that the ASR-robust LM pre-trained with the proposed method outperformed the LM pre-trained with standard masked language modeling by 3.17 points on ROUGE-L when fine-tuning with dialogues including ASR errors.
Source: Interspeech, pp. 2688-2692
Citations: 0
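The entry above generates pseudo-ASR errors by looking words up in a pseudo-error dictionary. The sketch below illustrates only the substitution step, taking the dictionary as given. In the paper the dictionary is built with neural grapheme-to-phoneme and phoneme-to-grapheme models, which are not reproduced here; the function name, the `rate` parameter, and the two homophone entries are assumptions for illustration.

```python
import random


def inject_pseudo_errors(tokens, error_dict, rate, rng):
    """Replace each token that has dictionary candidates with probability `rate`."""
    out = []
    for tok in tokens:
        candidates = error_dict.get(tok)
        if candidates and rng.random() < rate:
            out.append(rng.choice(candidates))
        else:
            out.append(tok)
    return out


# Hypothetical dictionary of same-sounding confusions
# (kikai: machine vs. opportunity; kouen: park vs. lecture).
error_dict = {"機械": ["機会"], "公園": ["講演"]}

rng = random.Random(0)
print(inject_pseudo_errors(["機械", "を", "使う"], error_dict, rate=1.0, rng=rng))
```

Passing the random generator in explicitly keeps the corruption reproducible, which helps when comparing pre-training runs.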
Bottom-up discovery of structure and variation in response tokens ('backchannels') across diverse languages
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11288
Andreas Liesenfeld, Mark Dingemanse
Response tokens (also known as backchannels, continuers, or feedback) are a frequent feature of human interaction, where they serve to display understanding and streamline turn-taking. We propose a bottom-up method to study responsive behaviour across 16 languages (8 language families). We use sequential context and recurrence of turn formats to identify candidate response tokens in a language-agnostic way across diverse conversational corpora. We then use UMAP clustering directly on speech signals to represent structure and variation. We find that (i) written orthographic annotations underrepresent the attested variation, (ii) distinctions between formats can be gradient rather than discrete, (iii) most languages appear to make available a broad distinction between a minimal nasal format ‘mm’ and a fuller ‘yeah’-like format. Charting this aspect of human interaction contributes to our understanding of interactional infrastructure across languages and can inform the design of speech technologies.
Source: Interspeech, pp. 1126-1130
Citations: 3
FlowCPCVC: A Contrastive Predictive Coding Supervised Flow Framework for Any-to-Any Voice Conversion
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-577
Jiahong Huang, Wen Xu, Yule Li, Junshi Liu, Dongpeng Ma, Wei Xiang
Recently, research on any-to-any voice conversion (VC) has developed rapidly. However, existing systems often suffer from unsatisfactory quality and require two stages for training, in which a spectrum generation process is indispensable. In this paper, we propose the FlowCPCVC system, which results in higher speech naturalness and timbre similarity. FlowCPCVC is, to our knowledge, the first one-stage training system for the any-to-any task, taking advantage of VAE and contrastive learning. We employ a speaker encoder to extract timbre information, and a contrastive predictive coding (CPC) based content extractor to guide the flow module to discard the timbre and keep the linguistic information. Our method directly incorporates the vocoder into the training, thus avoiding the loss of spectral information as in two-stage training. With a fancy method in training any-to-any task, we can also get robust results when using it in any-to-many conversion. Experiments show that FlowCPCVC achieves obvious improvement when compared to VQMIVC, which is the current state-of-the-art any-to-any voice conversion system. Our demo is available online.
Source: Interspeech, pp. 2558-2562
Citations: 3
Reducing Domain mismatch in Self-supervised speech pre-training
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-736
M. Baskar, A. Rosenberg, B. Ramabhadran, Yu Zhang, Nicolás Serrano
Masked speech modeling (MSM) methods such as wav2vec2 or w2v-BERT learn representations over speech frames that are randomly masked within an utterance. While these methods improve the performance of Automatic Speech Recognition (ASR) systems, they have one major limitation: they treat all unsupervised speech samples with equal weight, which hinders learning because not all samples carry information relevant to learning meaningful representations. In this work, we address this limitation. We propose ask2mask (ATM), a novel approach that focuses on specific samples during MSM pre-training. ATM employs an external ASR model, or scorer, to weight unsupervised input samples by performing a fine-grained data selection. ATM masks the highly confident input frames chosen by the scorer, which allows the model to learn meaningful representations. We conduct fine-tuning experiments on well-benchmarked corpora: LibriSpeech (matching the pre-training data), and AMI and CHiME-6 (not matching the pre-training data). The results substantiate the efficacy of ATM in significantly improving recognition performance under mismatched conditions while still yielding modest improvements under matched conditions.
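The confidence-guided masking described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the external scorer has already produced one confidence value per frame; the function name `select_mask_frames` and the parameters `num_starts` and `span` are hypothetical, chosen only for illustration.

```python
from typing import List


def select_mask_frames(confidences: List[float], num_starts: int, span: int) -> List[bool]:
    """Return a per-frame mask; True marks a frame to be masked.

    Assumption: confidences[i] is the external scorer's confidence for frame i.
    The `num_starts` most confident frames become span starts, and each start
    masks `span` consecutive frames (clipped at the end of the utterance).
    """
    n = len(confidences)
    # Rank frame indices by confidence, highest first; ties keep the earlier frame.
    ranked = sorted(range(n), key=lambda i: (-confidences[i], i))
    mask = [False] * n
    for start in ranked[:num_starts]:
        for t in range(start, min(start + span, n)):
            mask[t] = True
    return mask


# Usage: the two most confident frames (indices 1 and 3) each start a 2-frame span.
frame_conf = [0.1, 0.9, 0.2, 0.8, 0.3, 0.4]
print(select_mask_frames(frame_conf, num_starts=2, span=2))
```

Random masking would instead draw the span starts uniformly; here the only change is that the starts are ranked by scorer confidence, which is the selection step the abstract describes.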
Citations: 0