
Interspeech: Latest Publications

Cross-modal Transfer Learning via Multi-grained Alignment for End-to-End Spoken Language Understanding
Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-11378
Yi Zhu, Zexun Wang, Hang Liu, Pei-Hsin Wang, Mingchao Feng, Meng Chen, Xiaodong He
End-to-end spoken language understanding (E2E-SLU) has witnessed impressive improvements through cross-modal (text-to-audio) transfer learning. However, current methods mostly focus on coarse-grained, sequence-level text-to-audio knowledge transfer with a simple loss, neglecting the fine-grained temporal alignment between the two modalities. In this work, we propose a novel multi-grained cross-modal transfer learning framework for E2E-SLU. Specifically, we devise a cross-attention module to align text tokens with speech frame features, encouraging the model to attend to the salient acoustic features associated with each token when transferring semantic information. We also leverage contrastive learning to facilitate cross-modal representation learning at the sentence level. Finally, we explore various data augmentation methods to mitigate the scarcity of labelled data for training E2E-SLU. Extensive experiments are conducted on both English and Chinese SLU datasets to verify the effectiveness of our proposed approach. Experimental results and detailed analyses demonstrate the superiority and competitiveness of our model.
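The token-to-frame cross attention described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the module name, dimensions, and the use of PyTorch's nn.MultiheadAttention (batch_first requires PyTorch 1.9+) are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class TokenFrameAligner(nn.Module):
    """Hypothetical module: text tokens attend over speech frame features."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, token_emb, frame_feats):
        # token_emb: (B, T_text, dim), frame_feats: (B, T_audio, dim)
        # Each text token queries the speech frames; the attention weights give
        # a soft token-to-frame alignment that a fine-grained transfer loss could use.
        aligned, weights = self.attn(query=token_emb, key=frame_feats, value=frame_feats)
        return aligned, weights

# Toy usage: 8 text tokens attending over 120 speech frames.
aligner = TokenFrameAligner()
tokens = torch.randn(2, 8, 256)
frames = torch.randn(2, 120, 256)
aligned, w = aligner(tokens, frames)   # w: (2, 8, 120) soft alignment weights
```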
Citations: 5
Neural correlates of acoustic and semantic cues during speech segmentation in French
Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10986
Maria del Mar Cordero, Ambre Denis-Noël, E. Spinelli, F. Meunier
Natural speech is highly complex and variable. In particular, spoken language, in contrast to written language, has no clear word boundaries. Adult listeners can exploit different types of information, such as acoustic and semantic cues, to segment the continuous stream. However, the relative weight of these cues when they co-occur remains to be determined. Behavioural tasks are not conclusive on this point, as they focus participants' attention on certain sources of information and thus bias the results. Here, we looked at the processing of homophonic utterances such as l'amie vs. la mie (both /lami/), which contain fine acoustic differences and whose meaning changes depending on segmentation. To examine the perceptual resolution of such ambiguities when semantic information is available, we measured the online processing of sentences containing such sequences in an ERP experiment involving no active task. In the congruent condition, semantic information matched the acoustic signal of the word amie, while in the incongruent condition the semantic information carried by the sentence and the acoustic signal pointed to different lexical candidates. No clear neural markers for the use of acoustic cues were found. Our results suggest a preponderant weight of semantic information over acoustic information during natural spoken sentence processing.
Citations: 0
Complex sounds and cross-language influence: The case of ejectives in Omani Mehri
Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10199
Rachid Ridouane, Philipp Buech
Ejective consonants are known to vary considerably both cross-linguistically and within individual languages. This variability is often considered a consequence of the complex articulatory strategies involved in their production. Because they are complex, they might be particularly prone to sound change, especially under cross-language influence. In this study, we consider the production of ejectives in Mehri, an endangered Semitic language spoken in Oman, where considerable influence from Arabic is expected. We provide acoustic data from seven speakers producing a list of items contrasting ejective and pulmonic alveolar and velar stops in word-initial (/#—/), word-medial (V—V), and word-final (V—#) positions. Different durational and non-durational correlates were examined, and their relative importance was quantified by computing D-prime values for each. The key empirical finding is that the parameters used to signal ejectivity differ depending mainly on whether the stop is alveolar or velar. Specifically, ejective alveolar stops display characteristics of pharyngealization, similar to Arabic, but velars still maintain attributes of ejectivity in some word positions. We interpret these results as diagnostic of a sound change currently in progress, coupled with an ongoing context-dependent neutralization.
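The D-prime values used to rank the acoustic correlates can be sketched as follows. The abstract does not give the exact formulation, so the pooled-standard-deviation sensitivity index below is an assumption, and the VOT numbers are placeholder values for illustration only.

```python
import numpy as np

def d_prime(cue_ejective, cue_pulmonic):
    """Separation of a cue's distributions for ejective vs. pulmonic stops.
    Assumed formulation: |mean difference| / pooled standard deviation."""
    m1, m2 = np.mean(cue_ejective), np.mean(cue_pulmonic)
    s1, s2 = np.std(cue_ejective, ddof=1), np.std(cue_pulmonic, ddof=1)
    return abs(m1 - m2) / np.sqrt((s1**2 + s2**2) / 2.0)

# Toy example: a durational cue (in ms) measured for the two stop series
# (placeholder numbers, not data from the paper).
cue_ejective = np.array([85.0, 92.0, 78.0, 88.0, 95.0])
cue_pulmonic = np.array([60.0, 55.0, 67.0, 58.0, 62.0])
print(d_prime(cue_ejective, cue_pulmonic))
```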
Citations: 0
Syllable sequence of /a/+/ta/ can be heard as /atta/ in Japanese with visual or tactile cues
Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10099
T. Arai, Miho Yamada, Megumi Okusawa
In our previous work, we reported that the word /atta/ with a geminate consonant differs from the syllable sequence /a/+pause+/ta/ in Japanese; specifically, there are formant transitions at the end of the first syllable in /atta/ but not in /a/+pause+/ta/. We also showed that native Japanese speakers perceived /atta/ when a facial video of /atta/ was played synchronously with an audio signal of /a/+pause+/ta/. In that study, we used two video clips of the two utterances in which the speaker was asked to control only the timing of the articulatory closing. In that case, there was no guarantee that the videos would be exactly the same except for the timing. Therefore, in the current study, we use a physical model of the human vocal tract with a miniature robot hand unit to produce the articulatory movements for the visual cues. We also provide tactile cues to the listener's finger, because we want to test whether cues from another modality affect this perception within the same framework. Our findings showed that when either visual or tactile cues were presented with an audio stimulus, listeners responded more frequently that they heard /atta/ compared to audio-only presentations.
Citations: 0
Cross-Lingual Transfer Learning Approach to Phoneme Error Detection via Latent Phonetic Representation
Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10228
Jovan M. Dalhouse, K. Itou
Extensive research has been conducted on CALL systems for pronunciation error detection to automate language improvement through self-evaluation. However, many previous approaches have relied on HMM or neural-network hybrid models which, although effective, often require phonetically labelled L2 speech data that is expensive and often scarce. This paper discusses a "zero-shot" transfer learning approach to detect phonetic errors in the L2 English speech of Japanese native speakers using solely unaligned, phonetically labelled native-language speech. The proposed method introduces a simple base architecture which utilizes the XLSR-Wav2Vec2.0 model pre-trained on unlabelled multilingual speech. Phoneme mapping for each language is determined based on differences in the articulation of similar phonemes. This method achieved a Phonetic Error Rate of 0.214 on erroneous L2 speech after fine-tuning on 70 hours of speech with low-resource automated phonetic labelling, and proved to additionally model phonemes of the L2 speaker's native language effectively without the need for L2 speech fine-tuning.
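The reported Phonetic Error Rate is presumably an edit-distance-based metric. The sketch below assumes the standard Levenshtein formulation (substitutions, deletions and insertions normalised by reference length); the phoneme sequences in the example are hypothetical.

```python
def phoneme_error_rate(reference, hypothesis):
    """Levenshtein distance over phoneme sequences, normalised by reference length.
    Assumes the usual PER definition; alignment details in the paper may differ."""
    n, m = len(reference), len(hypothesis)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[n][m] / max(n, 1)

# Hypothetical example with phoneme-like labels.
ref = ["k", "a", "t", "a", "k", "a", "n", "a"]
hyp = ["k", "a", "d", "a", "k", "a", "n"]
print(phoneme_error_rate(ref, hyp))  # 2 edits / 8 phonemes = 0.25
```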
Citations: 1
Biometric Russian Audio-Visual Extended MASKS (BRAVE-MASKS) Corpus: Multimodal Mask Type Recognition Task
Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10240
M. Markitantov, E. Ryumina, D. Ryumin, A. Karpov
In this paper, we present a new multimodal corpus called Biometric Russian Audio-Visual Extended MASKS (BRAVE-MASKS), which is designed to analyze the voice and facial characteristics of persons wearing various masks, as well as to develop automatic systems for bimodal verification and identification of speakers. In particular, we tackle the multimodal mask type recognition task (6 classes). As a result, audio, visual and multimodal systems were developed, showing UARs of 54.83%, 72.02% and 82.01%, respectively, on the Test set. These results serve as baselines for the BRAVE-MASKS corpus against which follow-up approaches can be compared.
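UAR (unweighted average recall) is conventionally the mean of per-class recalls, i.e. the macro-averaged recall. A minimal sketch with scikit-learn follows, using placeholder labels rather than BRAVE-MASKS data.

```python
from sklearn.metrics import recall_score

def uar(y_true, y_pred):
    """Unweighted average recall: mean of per-class recalls (macro average),
    the conventional metric for class-imbalanced tasks such as the 6 mask classes."""
    return recall_score(y_true, y_pred, average="macro")

# Toy 3-class example (placeholder labels, not corpus data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(uar(y_true, y_pred))  # (0.5 + 1.0 + 0.5) / 3 = 0.667
```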
Citations: 3
Vector-quantized Variational Autoencoder for Phase-aware Speech Enhancement
Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-443
Tuan Vu Ho, Q. Nguyen, M. Akagi, M. Unoki
Speech-enhancement methods based on the complex ideal ratio mask (cIRM) have achieved promising results. These methods often deploy a deep neural network to jointly estimate the real and imaginary components of the cIRM defined in the complex domain. However, the unbounded property of the cIRM makes it difficult to train a neural network effectively. To alleviate this problem, this paper proposes a phase-aware speech-enhancement method that estimates the magnitude and phase of a complex adaptive Wiener filter. In this method, a noise-robust vector-quantized variational autoencoder estimates the magnitude of the Wiener filter using the Itakura-Saito divergence in the time-frequency domain, while the phase of the Wiener filter is estimated by a convolutional recurrent network with a scale-invariant signal-to-noise-ratio constraint in the time domain. The proposed method was evaluated on the open Voice Bank+DEMAND dataset to provide a direct comparison with other speech-enhancement methods, achieving a Perceptual Evaluation of Speech Quality score of 2.85 and a Short-Time Objective Intelligibility score of 0.94, better than the state-of-the-art method based on cIRM estimation from the 2020 Deep Noise Challenge.
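The scale-invariant signal-to-noise ratio used as the time-domain constraint can be written compactly. The NumPy sketch below follows the standard SI-SNR definition (zero-mean signals, projection of the estimate onto the target) and is not taken from the paper, whose exact loss may differ in sign or normalisation.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Standard scale-invariant SNR in dB (assumed formulation)."""
    estimate = estimate - np.mean(estimate)
    target = target - np.mean(target)
    # Project the estimate onto the target to remove scale differences.
    s_target = np.dot(estimate, target) / (np.dot(target, target) + eps) * target
    e_noise = estimate - s_target
    return 10 * np.log10((np.sum(s_target**2) + eps) / (np.sum(e_noise**2) + eps))

# Toy check: a scaled copy of the target scores very high SI-SNR.
t = np.random.randn(16000)
print(si_snr(0.5 * t, t))                      # large positive value
print(si_snr(t + np.random.randn(16000), t))   # much lower value
```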
Citations: 1
Speech Acoustics in Mild Cognitive Impairment and Parkinson's Disease With and Without Concurrent Drawing Tasks
Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10772
Tanya Talkar, Christina Manxhari, James Williamson, Kara M. Smith, T. Quatieri
Parkinson's disease (PD) is characterized by motor dysfunction; however, non-motor symptoms such as cognitive decline also have a dramatic impact on quality of life. Current assessments to diagnose cognitive impairment take many hours and require high clinician involvement. Thus, there is a need for new tools that determine cognitive impairment quickly and accurately, allowing appropriate and timely interventions. In this paper, individuals with PD, designated as having either no cognitive impairment (NCI) or mild cognitive impairment (MCI), undergo a speech-based protocol involving reading or listing items within a category, performed either with or without a concurrent drawing task. From the speech recordings, we extract motor coordination-based features derived from correlations across acoustic features representative of the speech production subsystems. The correlation-based features are used in Gaussian mixture models to discriminate between individuals designated NCI or MCI in both the single- and dual-task paradigms. Features derived from the laryngeal and respiratory subsystems, in particular, discriminate between these two groups with AUCs > 0.80. These results suggest that cognitive impairment can be detected using speech from both single- and dual-task paradigms, and that cognitive impairment may manifest as differences in vocal fold vibration stability.
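A GMM-based two-class discrimination of this kind can be sketched with scikit-learn: fit one mixture per group and score test samples with the log-likelihood ratio. The feature matrices below are random placeholders and the train/test handling is deliberately simplified; the paper's actual protocol (features, folds, model sizes) is not reproduced here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import roc_auc_score

# Hypothetical feature matrices: rows are speakers, columns are
# correlation-based coordination features (placeholder random data).
rng = np.random.default_rng(0)
feats_nci = rng.normal(0.0, 1.0, size=(40, 10))
feats_mci = rng.normal(0.5, 1.0, size=(40, 10))

# One GMM per group, used as a likelihood-ratio classifier (a common setup,
# not necessarily the authors' exact configuration).
gmm_nci = GaussianMixture(n_components=2, random_state=0).fit(feats_nci)
gmm_mci = GaussianMixture(n_components=2, random_state=0).fit(feats_mci)

test = np.vstack([feats_nci, feats_mci])          # in practice: held-out speakers
labels = np.array([0] * len(feats_nci) + [1] * len(feats_mci))
scores = gmm_mci.score_samples(test) - gmm_nci.score_samples(test)
print("AUC:", roc_auc_score(labels, scores))
```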
Citations: 0
Speech2Slot: A Limited Generation Framework with Boundary Detection for Slot Filling from Speech
Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-11347
Pengwei Wang, Yinpei Su, Xiaohuan Zhou, Xin Ye, Liangchen Wei, Ming Liu, Yuan You, Feijun Jiang
Slot filling is an essential component of spoken language understanding. In contrast to conventional pipeline approaches, which extract slots from the ASR output, end-to-end approaches get slots directly from speech within a classification or generation framework. However, classification relies on predefined categories, which is not scalable, and a generative model decodes in an open-domain space, suffering from the blurred boundaries of slots in speech. To address the shortcomings of these two formulations, we propose a new encoder-decoder framework for slot filling, named Speech2Slot, leveraging a limited generation method with boundary detection. We also released a large-scale Chinese spoken slot filling dataset named Voice Navigation Dataset in Chinese (VNDC). Experiments on VNDC show that our model is markedly superior to other approaches, outperforming the state-of-the-art slot filling approach with a 6.65% accuracy improvement. We make our code publicly available for researchers to replicate and build on our work.
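The limited-generation idea, decoding constrained so that the output stays within an inventory of known slot entries, can be sketched generically as prefix-constrained decoding. This is only an illustration of the general mechanism, not the Speech2Slot architecture; build_prefix_index, next_token_scores and the toy entries are hypothetical.

```python
def build_prefix_index(slot_entries):
    """Map each prefix (tuple of tokens) to the set of tokens allowed next."""
    allowed = {}
    for entry in slot_entries:            # entry: tuple of tokens
        for i in range(len(entry)):
            allowed.setdefault(entry[:i], set()).add(entry[i])
    return allowed

def constrained_decode(next_token_scores, slot_entries, max_len=10):
    """Greedy decoding restricted to prefixes of known slot entries.
    next_token_scores(prefix) -> {token: score} is a hypothetical model hook."""
    allowed = build_prefix_index(slot_entries)
    prefix = ()
    while len(prefix) < max_len and prefix in allowed:
        scores = next_token_scores(prefix)
        candidates = {t: s for t, s in scores.items() if t in allowed[prefix]}
        if not candidates:
            break
        prefix = prefix + (max(candidates, key=candidates.get),)
    return list(prefix)

# Toy usage: two candidate slot entries and a dummy scoring function.
entries = [("bei", "jing"), ("shang", "hai")]
def toy_scores(prefix):
    return {"bei": 0.9, "shang": 0.4, "jing": 0.8, "hai": 0.1}
print(constrained_decode(toy_scores, entries))  # ['bei', 'jing']
```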
Citations: 1
Cross-Modal Decision Regularization for Simultaneous Speech Translation
Pub Date: 2022-09-18 DOI: 10.21437/interspeech.2022-10617
Mohd Abbas Zaidi, Beomseok Lee, Sangha Kim, Chanwoo Kim
Simultaneous translation systems start producing output while processing the partial source sentence in the incoming input stream. These systems need to decide when to read more input and when to write output. The decisions taken by the model depend on the structure of the source/target language and the information contained in the partial input sequence. Hence, the read/write decision policy remains the same across different input modalities, i.e., speech and text. This motivates us to leverage the text transcripts corresponding to the speech input to improve simultaneous speech-to-text translation (SimulST). We propose Cross-Modal Decision Regularization (CMDR) to improve the decision policy of SimulST systems by using the simultaneous text-to-text translation (SimulMT) task. We also extend several techniques from the offline speech translation domain to explore the role of the SimulMT task in improving SimulST performance. Overall, we achieve a 34.66% / 4.5 BLEU improvement over the baseline model across different latency regimes for the MuST-C English-German (EnDe) SimulST task.
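The read/write decision loop that such systems implement can be illustrated with the classic fixed wait-k policy. CMDR learns and regularizes its policy rather than fixing k, so the sketch below only shows the decision mechanics; translate_step is a hypothetical stand-in for the underlying translation model.

```python
def waitk_simultaneous_decode(source_stream, k, translate_step, max_len=100):
    """Illustrative wait-k read/write loop (not the learned CMDR policy):
    read k source units first, then alternate one WRITE per READ.
    translate_step(prefix_src, prefix_tgt) is a hypothetical callable that
    returns the next target token given the partial source and target."""
    src_prefix, tgt = [], []
    stream = iter(source_stream)
    while len(tgt) < max_len:
        if len(src_prefix) < k + len(tgt):          # READ decision
            nxt = next(stream, None)
            if nxt is not None:
                src_prefix.append(nxt)
                continue                            # keep reading while the policy says READ
        token = translate_step(src_prefix, tgt)     # WRITE decision
        if token == "<eos>":
            break
        tgt.append(token)
    return tgt

# Toy usage with a dummy model that simply copies source tokens.
def dummy_step(src, tgt):
    return src[len(tgt)] if len(tgt) < len(src) else "<eos>"

print(waitk_simultaneous_decode(["ich", "bin", "hier"], k=2, translate_step=dummy_step))
```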
Citations: 3