
Latest Publications in Interspeech

Interactive Co-Learning with Cross-Modal Transformer for Audio-Visual Emotion Recognition
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11307
Akihiko Takashima, Ryo Masumura, Atsushi Ando, Yoshihiro Yamazaki, Mihiro Uchida, Shota Orihashi
This paper proposes a novel modeling method for audio-visual emotion recognition. Since human emotions are expressed multi-modally, jointly capturing audio and visual cues is a promising approach. In conventional multi-modal modeling methods, a recognition model is trained on an audio-visual paired dataset solely to enhance audio-visual emotion recognition performance. However, such a model fails to estimate emotions from single-modal inputs, which indicates that it overfits to combinations of the individual modal features. Our supposition is that the ideal form of emotion recognition is a single model that accurately performs both audio-visual multi-modal processing and single-modal processing. This is expected to promote the use of individual modal knowledge for improving audio-visual emotion recognition. Our proposed method therefore employs a cross-modal transformer model that can handle different types of inputs. In addition, we introduce a novel training method named interactive co-learning, which allows the model to learn from both modalities jointly and from each modality individually. Experiments on a multi-label emotion recognition task demonstrate the effectiveness of the proposed method.
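The cross-modal transformer and the interactive co-learning procedure are specified in the paper; the following is only a minimal PyTorch sketch of the general idea, namely a single encoder that accepts audio features, visual features, or both, trained on the paired input and on each modality alone. All layer sizes, feature dimensions and the eight-label output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalEmotionModel(nn.Module):
    """Toy cross-modal transformer that accepts audio-only, visual-only or audio-visual input."""
    def __init__(self, d_audio=40, d_visual=512, d_model=256, n_labels=8):
        super().__init__()
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.visual_proj = nn.Linear(d_visual, d_model)
        self.mod_emb = nn.Embedding(2, d_model)   # lets the shared encoder tell the streams apart
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_labels)

    def forward(self, audio=None, visual=None):
        tokens = []
        if audio is not None:
            tokens.append(self.audio_proj(audio) + self.mod_emb.weight[0])
        if visual is not None:
            tokens.append(self.visual_proj(visual) + self.mod_emb.weight[1])
        x = torch.cat(tokens, dim=1)              # concatenate the token streams along time
        return self.head(self.encoder(x).mean(dim=1))   # pooled utterance-level multi-label logits

model = CrossModalEmotionModel()
criterion = nn.BCEWithLogitsLoss()                # multi-label emotion targets
audio = torch.randn(2, 100, 40)                   # (batch, audio frames, feature dim)
visual = torch.randn(2, 25, 512)                  # (batch, video frames, feature dim)
labels = torch.randint(0, 2, (2, 8)).float()

# co-learning-style step: one loss on the paired input plus one per single modality
loss = (criterion(model(audio, visual), labels)
        + criterion(model(audio=audio), labels)
        + criterion(model(visual=visual), labels))
loss.backward()
```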
{"title":"Interactive Co-Learning with Cross-Modal Transformer for Audio-Visual Emotion Recognition","authors":"Akihiko Takashima, Ryo Masumura, Atsushi Ando, Yoshihiro Yamazaki, Mihiro Uchida, Shota Orihashi","doi":"10.21437/interspeech.2022-11307","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11307","url":null,"abstract":"This paper proposes a novel modeling method for audio-visual emotion recognition. Since human emotions are expressed multi-modally, jointly capturing audio and visual cues is a potentially promising approach. In conventional multi-modal modeling methods, a recognition model was trained from an audio-visual paired dataset so as to only enhance audio-visual emotion recognition performance. However, it fails to estimate emotions from single-modal inputs, which indicates they are degraded by overfitting the combinations of the individual modal features. Our supposition is that the ideal form of the emotion recognition is to accurately perform both audio-visual multimodal processing and single-modal processing with a single model. This is expected to promote utilization of individual modal knowledge for improving audio-visual emotion recognition. Therefore, our proposed method employs a cross-modal transformer model that enables different types of inputs to be handled. In addition, we introduce a novel training method named interactive co-learning; it allows the model to learn knowledge from both and either of the modals. Experiments on a multi-label emotion recognition task demonstrate the ef-fectiveness of the proposed method.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4740-4744"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49527957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 0
Interpretability of Speech Emotion Recognition modelled using Self-Supervised Speech and Text Pre-Trained Embeddings
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10685
K. V. V. Girish, Srikanth Konjeti, Jithendra Vepa
Speech emotion recognition (SER) is useful in many applications and has been approached with signal processing techniques in the past and, more recently, with deep learning techniques. Human emotions are complex in nature and can vary widely within an utterance. SER accuracy has improved with various multimodal techniques, but there is still a gap in understanding model behaviour and expressing these complex emotions in a human-interpretable form. In this work, we propose and define interpretability measures, represented as a Human Level Indicator Matrix for an utterance, and showcase their effectiveness in both qualitative and quantitative terms. Word-level interpretability is obtained using attention-based sequence modelling of self-supervised speech and text pre-trained embeddings. Prosody features are also combined with the proposed model to assess the efficacy at the word and utterance levels. We provide insights into sub-utterance-level emotion predictions for complex utterances where the emotion class changes within the utterance. We evaluate the model and provide the interpretations on the publicly available IEMOCAP dataset.
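The Human Level Indicator Matrix is the paper's own construct; as a loose illustration of the underlying mechanism, attention weights from a simple attention-pooling layer over pre-trained word-level embeddings can be read as per-word saliency scores. Dimensions, the number of emotion classes, and the random stand-in embeddings below are assumptions.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Attention pooling over a sequence of pre-trained embeddings; the attention
    weights double as per-token (e.g. per-word) saliency scores."""
    def __init__(self, dim=768, n_emotions=4):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.classifier = nn.Linear(dim, n_emotions)

    def forward(self, embeddings):                              # (batch, tokens, dim)
        alpha = torch.softmax(self.score(embeddings), dim=1)    # (batch, tokens, 1)
        pooled = (alpha * embeddings).sum(dim=1)                # (batch, dim)
        return self.classifier(pooled), alpha.squeeze(-1)

# word-aligned embeddings from a pre-trained text (or speech) encoder; random stand-ins here
word_embeddings = torch.randn(1, 12, 768)                       # one utterance, 12 words
logits, word_weights = AttentivePooling()(word_embeddings)
print(word_weights)                                             # larger weight ~ more influential word
```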
{"title":"Interpretabilty of Speech Emotion Recognition modelled using Self-Supervised Speech and Text Pre-Trained Embeddings","authors":"K. V. V. Girish, Srikanth Konjeti, Jithendra Vepa","doi":"10.21437/interspeech.2022-10685","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10685","url":null,"abstract":"Speech emotion recognition (SER) is useful in many applications and is approached using signal processing techniques in the past and deep learning techniques recently. Human emotions are complex in nature and can vary widely within an utterance. The SER accuracy has improved using various multimodal techniques but there is still some gap in understanding the model behaviour and expressing these complex emotions in a human interpretable form. In this work, we propose and define interpretability measures represented as a Human Level Indicator Matrix for an utterance and showcase it’s effective-ness in both qualitative and quantitative terms. A word level interpretability is presented using an attention based sequence modelling of self-supervised speech and text pre-trained embeddings. Prosody features are also combined with the proposed model to see the efficacy at the word and utterance level. We provide insights into sub-utterance level emotion predictions for complex utterances where the emotion classes change within the utterance. We evaluate the model and provide the interpretations on the publicly available IEMOCAP dataset.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4496-4500"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49559750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 1
The CLIPS System for 2022 Spoofing-Aware Speaker Verification Challenge
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-320
Jucai Lin, Tingwei Chen, Jingbiao Huang, Ruidong Fang, Jun Yin, Yuanping Yin, W. Shi, Wei Huang, Yapeng Mao
In this paper, a spoofing-aware speaker verification (SASV) system that integrates an automatic speaker verification (ASV) system and a countermeasure (CM) system is developed. Firstly, a modified re-parameterized VGG (ARepVGG) module is utilized to extract a high-level representation from the multi-scale feature learned from the raw waveform through sinc-filters, and then a spectro-temporal graph attention network is used to learn the final decision on whether the audio is spoofed or not. Secondly, a new network inspired by Max-Feature-Map (MFM) layers is constructed to fine-tune the CM system while keeping the ASV system fixed. Our proposed SASV system significantly improves the SASV equal error rate (SASV-EER) from 6.73% to 1.36% on the evaluation dataset and from 4.85% to 0.98% on the development dataset in the 2022 Spoofing-Aware Speaker Verification Challenge (2022 SASV).
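SASV-EER is an equal error rate computed over the challenge's combined target/non-target/spoof trial protocol; the official scoring tool should be used to reproduce the reported numbers. A generic EER computation, which is the core of the metric, can be sketched as follows (the trial scores below are synthetic).

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where the false-acceptance and false-rejection
    rates meet. labels: 1 for target (bona fide) trials, 0 for non-target or
    spoofed trials; scores: higher means 'more likely target'."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))       # threshold where the two rates cross
    return (fpr[idx] + fnr[idx]) / 2.0

# synthetic trial scores: a well-separated system yields a low EER
rng = np.random.default_rng(0)
labels = np.r_[np.ones(500), np.zeros(500)]
scores = np.r_[rng.normal(2.0, 1.0, 500), rng.normal(-2.0, 1.0, 500)]
print(f"EER = {100 * equal_error_rate(labels, scores):.2f}%")
```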
{"title":"The CLIPS System for 2022 Spoofing-Aware Speaker Verification Challenge","authors":"Jucai Lin, Tingwei Chen, Jingbiao Huang, Ruidong Fang, Jun Yin, Yuanping Yin, W. Shi, Wei Huang, Yapeng Mao","doi":"10.21437/interspeech.2022-320","DOIUrl":"https://doi.org/10.21437/interspeech.2022-320","url":null,"abstract":"In this paper, a spoofing-aware speaker verification (SASV) system that integrates the automatic speaker verification (ASV) system and countermeasure (CM) system is developed. Firstly, a modified re-parameterized VGG (ARepVGG) module is utilized to extract high-level representation from the multi-scale feature that learns from the raw waveform though sinc-filters, and then a spectra-temporal graph attention network is used to learn the final decision information whether the audio is spoofed or not. Secondly, a new network that is inspired from the Max-Feature-Map (MFM) layers is constructed to fine-tune the CM system while keeping the ASV system fixed. Our proposed SASV system significantly improves the SASV equal error rate (SASV-EER) from 6.73 % to 1.36 % on the evaluation dataset and 4.85 % to 0.98 % on the development dataset in the 2022 Spoofing-Aware Speaker Verification Challenge(2022 SASV).","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4367-4370"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42514937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 2
Single-channel speech enhancement using Graph Fourier Transform
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-740
Chenhui Zhang, Xiang Pan
This paper combines the Graph Fourier Transform (GFT) with U-Net and proposes a deep neural network (DNN) named G-Unet for single-channel speech enhancement. The GFT is applied to the speech data to create the inputs of the U-Net. The GFT outputs are combined with the mask estimated by the U-Net in the time-graph (T-G) domain to reconstruct enhanced speech in the time domain via the inverse GFT. G-Unet outperforms the combination of the Short-Time Fourier Transform (STFT) and a magnitude-estimation U-Net in improving speech quality and de-reverberation, and outperforms the combination of the STFT and a complex U-Net in improving speech quality in some cases, as validated by tests on the LibriSpeech and NOISEX92 datasets.
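The graph construction used by G-Unet is specific to the paper; the sketch below only illustrates what the GFT/inverse-GFT pair is mathematically, using a simple path graph over signal samples as an assumed example.

```python
import numpy as np

# Graph Fourier Transform on a path graph over signal samples:
# L = D - W is the graph Laplacian, its eigenvectors U form the graph Fourier basis,
# the forward transform is U.T @ x and the inverse is U @ x_hat.
n = 8
W = np.zeros((n, n))
for i in range(n - 1):                      # path graph: each sample connected to its neighbour
    W[i, i + 1] = W[i + 1, i] = 1.0
L = np.diag(W.sum(axis=1)) - W              # combinatorial Laplacian
eigvals, U = np.linalg.eigh(L)              # ascending eigenvalues ~ graph frequencies

x = np.random.randn(n)                      # a short signal frame treated as a graph signal
x_hat = U.T @ x                             # forward GFT (time domain -> graph-frequency domain)
x_rec = U @ x_hat                           # inverse GFT
assert np.allclose(x, x_rec)
```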
{"title":"Single-channel speech enhancement using Graph Fourier Transform","authors":"Chenhui Zhang, Xiang Pan","doi":"10.21437/interspeech.2022-740","DOIUrl":"https://doi.org/10.21437/interspeech.2022-740","url":null,"abstract":"This paper presents combination of Graph Fourier Transform (GFT) and U-net, proposes a deep neural network (DNN) named G-Unet for single channel speech enhancement. GFT is carried out over speech data for creating inputs of U-net. The GFT outputs are combined with the mask estimated by Unet in time-graph (T-G) domain to reconstruct enhanced speech in time domain by Inverse GFT. The G-Unet outperforms the combination of Short time Fourier Transform (STFT) and magnitude estimation U-net in improving speech quality and de-reverberation, and outperforms the combination of STFT and complex U-net in improving speech quality in some cases, which is validated by testing on LibriSpeech and NOISEX92 dataset.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"946-950"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45499958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 0
Streamable Speech Representation Disentanglement and Multi-Level Prosody Modeling for Live One-Shot Voice Conversion
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10277
Haoquan Yang, Liqun Deng, Y. Yeung, Nianzu Zheng, Yong Xu
This paper tackles the challenge of "live" one-shot voice conversion (VC), which performs conversion across arbitrary speakers in a streaming manner while retaining high intelligibility and naturalness. We propose a hybrid unsupervised and supervised learning based VC model with a two-stage model training strategy. Specifically, we first employ an unsupervised disentanglement framework to separate speech representations of different granularities. Experimental results demonstrate that our proposed method achieves speech naturalness, intelligibility and speaker similarity comparable with offline VC solutions, with sufficient efficiency for practical real-time applications. Audio samples are available online for demonstration.
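The disentanglement model itself is beyond an abstract-level sketch, but the "streaming" aspect boils down to chunk-wise processing with a cached left context so that streamed and offline outputs match. Below is a minimal, self-contained illustration with a single causal convolution; chunk and kernel sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# Chunk-wise streaming with a causal convolution and a cached left context:
# the concatenation of streamed outputs matches the offline (whole-signal) output.
torch.manual_seed(0)
conv = nn.Conv1d(1, 1, kernel_size=5, bias=False)
x = torch.randn(1, 1, 1600)                         # e.g. 100 ms of 16 kHz audio

offline = conv(nn.functional.pad(x, (4, 0)))        # causal: pad kernel_size - 1 on the left

cache = torch.zeros(1, 1, 4)                        # left-context cache
streamed = []
for chunk in torch.split(x, 400, dim=-1):           # 25 ms chunks arriving one by one
    inp = torch.cat([cache, chunk], dim=-1)
    streamed.append(conv(inp))
    cache = inp[..., -4:]                           # carry the last samples to the next chunk
streamed = torch.cat(streamed, dim=-1)
assert torch.allclose(offline, streamed, atol=1e-6)
```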
{"title":"Streamable Speech Representation Disentanglement and Multi-Level Prosody Modeling for Live One-Shot Voice Conversion","authors":"Haoquan Yang, Liqun Deng, Y. Yeung, Nianzu Zheng, Yong Xu","doi":"10.21437/interspeech.2022-10277","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10277","url":null,"abstract":"This paper takes efforts to tackle the challenge of “live” oneshot voice conversion (VC), which performs conversion across arbitrary speakers in a streaming way while retaining high intelligibility and naturalness. We propose a hybrid unsupervised and supervised learning based VC model with a two-stage model training strategy. Specially, we first employ an unsupervised disentanglement framework to separate speech representations of different granularities Experimental results demonstrate that our proposed method achieves comparable performance on speech naturalness, intelligibility and speaker similarity with offline VC solutions, with sufficient efficiency for practical real-time applications. Audio samples are available online for demonstration.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2578-2582"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45650340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 4
A Unified System for Voice Cloning and Voice Conversion through Diffusion Probabilistic Modeling
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10879
T. Sadekova, Vladimir Gogoryan, Ivan Vovk, Vadim Popov, M. Kudinov, Jiansheng Wei
Text-to-speech and voice conversion are two common speech generation tasks typically solved with different models. In this paper, we present a novel approach to voice cloning and any-to-any voice conversion that relies on a single diffusion probabilistic model with two encoders, each operating on its own input domain, and a shared decoder. Extensive human evaluation shows that the proposed model copies a target speaker's voice by means of speaker adaptation better than other known multimodal systems of this kind, and the quality of the speech synthesized by our system in both voice cloning and voice conversion modes is comparable with that of recently proposed algorithms for the corresponding single tasks. Besides, it takes as little as 3 minutes of GPU time to adapt our model to a new speaker with only 15 seconds of untranscribed audio, which makes it attractive for practical applications.
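The paper builds on a score-based diffusion formulation with separate content and speaker encoders feeding a shared decoder; the snippet below is only a generic denoising-diffusion training step (a noise-prediction loss on conditioned acoustic frames), with the network, feature sizes and noise schedule as stand-in assumptions.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # assumed linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

class NoisePredictor(nn.Module):
    """Stand-in for eps_theta(x_t, t, condition); a real decoder would be far larger."""
    def __init__(self, dim=80, cond_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + cond_dim + 1, 512), nn.ReLU(),
                                 nn.Linear(512, dim))
    def forward(self, x_t, t, cond):
        t_feat = (t.float() / T).unsqueeze(-1)       # crude timestep embedding
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

model = NoisePredictor()
x0 = torch.randn(16, 80)                            # clean acoustic frames (e.g. mel bins)
cond = torch.randn(16, 256)                         # content + target-speaker conditioning
t = torch.randint(0, T, (16,))
eps = torch.randn_like(x0)
a = alpha_bar[t].unsqueeze(-1)
x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps          # forward (noising) process
loss = nn.functional.mse_loss(model(x_t, t, cond), eps)   # noise-prediction objective
loss.backward()
```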
{"title":"A Unified System for Voice Cloning and Voice Conversion through Diffusion Probabilistic Modeling","authors":"T. Sadekova, Vladimir Gogoryan, Ivan Vovk, Vadim Popov, M. Kudinov, Jiansheng Wei","doi":"10.21437/interspeech.2022-10879","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10879","url":null,"abstract":"Text-to-speech and voice conversion are two common speech generation tasks typically solved using different models. In this paper, we present a novel approach to voice cloning and any-to-any voice conversion relying on a single diffusion probabilistic model with two encoders each operating on its input domain and a shared decoder. Extensive human evaluation shows that the proposed model can copy a target speaker’s voice by means of speaker adaptation better than other known multimodal systems of such kind and the quality of the speech synthesized by our system in both voice cloning and voice conversion modes is comparable with that of recently proposed algorithms for the corresponding single tasks. Besides, it takes as few as 3 minutes of GPU time to adapt our model to a new speaker with only 15 seconds of untranscribed audio which makes it attractive for practical applications.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3003-3007"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45997376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 5
SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-378
Yuhang He, A. Markham
A fundamental task for an agent that seeks to understand an environment acoustically is to detect sound source location (such as direction of arrival (DoA)) and semantic label. It is a challenging task: firstly, sound sources overlap in time, frequency and space; secondly, while semantics are largely conveyed through time-frequency energy (amplitude) contours, DoA is encoded in the inter-channel phase difference; lastly, although the microphone sensors are spatially sparse, the recorded sound waveform is temporally dense due to the high sampling rate. Existing methods for predicting DoA mostly depend on pre-extracted 2D acoustic features such as GCC-PHAT and Mel-spectrograms so as to benefit from the success of mature 2D image based deep neural networks. We instead propose a novel end-to-end trainable framework, named SoundDoA, that is capable of learning sound source DoA and semantics directly from raw sound waveforms. We first use a learnable front-end filter bank to dynamically encode sound source semantics and DoA-relevant features into a compact representation. A backbone network consisting of two identical sub-networks with a layerwise communication strategy is then proposed to further learn the semantic label and DoA both separately and jointly. Finally, a permutation-invariant multi-track head is added to regress DoA and classify the semantic label. Extensive experimental results on the DCASE 2020 sound event localization and detection (SELD) dataset demonstrate the superiority of SoundDoA over other existing methods.
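The SoundDoA front end and backbone are detailed in the paper; the sketch below only illustrates the last ingredient named in the abstract, a permutation-invariant loss over a fixed number of output tracks that jointly covers DoA regression and semantic classification. Track count, DoA dimensionality and class count are assumptions.

```python
import itertools
import torch
import torch.nn as nn

def pit_loss(pred_doa, pred_cls, true_doa, true_cls):
    """Permutation-invariant loss over a fixed number of output tracks: try every
    assignment of predicted tracks to reference sources and keep the cheapest one."""
    n_tracks = pred_doa.shape[1]
    best = None
    for perm in map(list, itertools.permutations(range(n_tracks))):
        loss = (nn.functional.mse_loss(pred_doa[:, perm], true_doa)
                + nn.functional.cross_entropy(pred_cls[:, perm].flatten(0, 1),
                                              true_cls.flatten()))
        best = loss if best is None else torch.minimum(best, loss)
    return best

# 2 output tracks, DoA as 3-D unit vectors, 13 sound-event classes (all assumptions)
pred_doa = torch.randn(4, 2, 3, requires_grad=True)
pred_cls = torch.randn(4, 2, 13, requires_grad=True)
true_doa = torch.nn.functional.normalize(torch.randn(4, 2, 3), dim=-1)
true_cls = torch.randint(0, 13, (4, 2))
print(pit_loss(pred_doa, pred_cls, true_doa, true_cls))
```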
{"title":"SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms","authors":"Yuhang He, A. Markham","doi":"10.21437/interspeech.2022-378","DOIUrl":"https://doi.org/10.21437/interspeech.2022-378","url":null,"abstract":"A fundamental task for an agent to understand an environment acoustically is to detect sound source location (like direction of arrival (DoA)) and semantic label. It is a challenging task: firstly, sound sources overlap in time, frequency and space; secondly, while semantics are largely conveyed through time-frequency energy (amplitude) contours, DoA is encoded in inter-channel phase difference; lastly, although the number of microphone sensors are sparse, recorded sound waveform is temporally dense due to the high sampling rates. Existing methods for predicting DoA mostly depend on pre-extracted 2D acoustic feature such as GCC-PHAT and Mel-spectrograms so as to benefit from the success of mature 2D image based deep neural networks. We instead propose a novel end-to-end trainable framework, named SoundDoA , that is capable of learning sound source DoA and semantics directly from sound raw waveforms. We first use a learnable front-end filter bank to dynamically encode sound source semantics and DoA relevant features into a compact representation. A backbone network consisting of two identical sub-networks with layerwise communication strategy is then proposed to further learn semantic label and DoA both separately and jointly. Finally, a permutation invariant multi-track head is added to regress DoA and classify semantic label. Extensive experimental results on DCASE 2020 sound event detection and localization dataset (SELD) demonstrate the superiority of SoundDoA , when comparing with other existing methods.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2408-2412"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47392486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 3
Overlapped Frequency-Distributed Network: Frequency-Aware Voice Spoofing Countermeasure
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-657
Sunmook Choi, Il-Youp Kwak, Seungsang Oh
Numerous IT companies around the world are developing and deploying artificial voice assistants in their products, but these are still vulnerable to spoofing attacks. Since 2015, the Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof) has been held every two years to encourage the design of systems that can detect spoofing attacks. In this paper, we focus on developing spoofing countermeasure systems based mainly on Convolutional Neural Networks (CNNs). However, CNNs have a translation-invariance property, which may cause loss of frequency information when a spectrogram is used as input. Hence, we propose models that split the input along the frequency axis: 1) an Overlapped Frequency-Distributed (OFD) model and 2) a Non-overlapped Frequency-Distributed (Non-OFD) model. Using the ASVspoof 2019 dataset, we measured their performance with two different activations, ReLU and Max-Feature-Map (MFM). The best-performing model on the LA dataset is the Non-OFD model with ReLU, which achieves an equal error rate (EER) of 1.35%, and the best-performing model on the PA dataset is the OFD model with MFM, which achieves an EER of 0.35%.
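As a rough illustration of the two building blocks named here, the sketch below implements the Max-Feature-Map activation (as in LightCNN) and a frequency split of the input spectrogram; channel counts and band boundaries are assumptions, and the paper's actual branch design differs.

```python
import torch
import torch.nn as nn

class MaxFeatureMap(nn.Module):
    """MFM activation (as in LightCNN): split the channel dimension in half and
    take the element-wise maximum, halving the number of channels."""
    def forward(self, x):                           # x: (batch, channels, freq, time)
        a, b = torch.chunk(x, 2, dim=1)
        return torch.maximum(a, b)

# Frequency-distributed idea in miniature: split the spectrogram along the frequency
# axis and run a CNN branch per band (one shared branch here; the paper uses separate ones).
spec = torch.randn(8, 1, 64, 200)                   # (batch, 1, freq bins, frames)
low, high = spec[:, :, :32], spec[:, :, 32:]        # non-overlapped split; an OFD split would share bins
branch = nn.Sequential(nn.Conv2d(1, 32, kernel_size=3, padding=1), MaxFeatureMap())
out_low, out_high = branch(low), branch(high)       # each: (8, 16, 32, 200)
```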
{"title":"Overlapped Frequency-Distributed Network: Frequency-Aware Voice Spoofing Countermeasure","authors":"Sunmook Choi, Il-Youp Kwak, Seungsang Oh","doi":"10.21437/interspeech.2022-657","DOIUrl":"https://doi.org/10.21437/interspeech.2022-657","url":null,"abstract":"Numerous IT companies around the world are developing and deploying artificial voice assistants via their products, but they are still vulnerable to spoofing attacks. Since 2015, the competition “Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof)” has been held every two years to encourage people to design systems that can detect spoofing attacks. In this paper, we focused on developing spoofing countermeasure systems mainly based on Convolutional Neural Networks (CNNs). However, CNNs have translation invariant property, which may cause loss of frequency information when a spectrogram is used as input. Hence, we pro-pose models which split inputs along the frequency axis: 1) Overlapped Frequency-Distributed (OFD) model and 2) Non-overlapped Frequency-Distributed (Non-OFD) model. Using ASVspoof 2019 dataset, we measured their performances with two different activations; ReLU and Max feature map (MFM). The best performing model on LA dataset is the Non-OFD model with ReLU which achieved an equal error rate (EER) of 1.35%, and the best performing model on PA dataset is the OFD model with MFM which achieved an EER of 0.35%.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3558-3562"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47675680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 3
How do our eyebrows respond to masks and whispering? The case of Persians
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10867
Nasim Mahdinazhad Sardhaei, Marzena Żygis, H. Sharifzadeh
Whispering is one of the mechanisms of human communication for conveying linguistic information. Due to the lack of vocal-fold vibration, whispering differs acoustically from voiced speech in the absence of fundamental frequency, which is one of the main prosodic correlates of intonation. This study addresses the importance of facial cues with respect to the acoustic cues of intonation. Specifically, we aim to probe how eyebrow velocity and furrowing change when people whisper and wear face masks, and also when they are expected to produce a prosodic modulation, as is the case in polar questions with rising intonation. To this end, we ran an experiment with 10 Persian speakers. The results show a greater mean speed when speakers whisper, indicating a compensation effect for the lack of F0 in whispering. We also found a more pronounced movement of both eyebrows when the speakers wear a mask. Finally, our results reveal greater eyebrow motion in questions, suggesting that the question is a more marked utterance type than the statement. No significant effect was found for eyebrow furrowing. However, eyebrow movements were positively correlated with eyebrow widening, suggesting a mutual link between these two movement types.
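The abstract does not spell out how eyebrow velocity is measured; as an assumed, minimal illustration, the speed of a tracked eyebrow landmark can be obtained by differentiating its position trajectory over time (the frame rate and the synthetic trajectory below are placeholders).

```python
import numpy as np

fps = 50.0                                   # assumed video frame rate
t = np.arange(0, 2.0, 1.0 / fps)
y = 1.5 * np.sin(2 * np.pi * 1.2 * t)        # synthetic vertical eyebrow-landmark track (mm)
velocity = np.gradient(y, 1.0 / fps)         # first derivative of position: mm per second
print(f"mean eyebrow speed: {np.abs(velocity).mean():.2f} mm/s")
```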
{"title":"How do our eyebrows respond to masks and whispering? The case of Persians","authors":"Nasim Mahdinazhad Sardhaei, Marzena Żygis, H. Sharifzadeh","doi":"10.21437/interspeech.2022-10867","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10867","url":null,"abstract":"Whispering is one of the mechanisms of human communication to convey linguistic information. Due to the lack of vocal fold vibration, whispering acoustically differs from the voiced speech in the absence of fundamental frequency which is one of the main prosodic correlates of intonation. This study addresses the importance of facial cues with respect to acoustic cues of intonation. Specifically, we aim to probe how eyebrow velocity and furrowing change when people whisper and wear face masks, also, when they are supposed to produce a prosodic modulation as it is the case in polar questions with rising intonation. To this end, we run an experiment with 10 Persian speakers. The results show the greater mean speed when speakers whisper indicating a compensation effect for the lack of F0 in whispering. We also found a more pronounced movement of both eyebrows when the speakers wear a mask. Finally, our results reveal greater eyebrow motions in questions suggesting the question is a more marked utterance type than a statement. No significant effect of eyebrow furrowing was found. However, eyebrow movements were positively correlated with the eyebrow widening suggesting a mutual link between these two movement types.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2023-2027"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47894190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 0
End-to-End Joint Modeling of Conversation History-Dependent and Independent ASR Systems with Multi-History Training
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11357
Ryo Masumura, Yoshihiro Yamazaki, Saki Mizuno, Naoki Makishima, Mana Ihori, Mihiro Uchida, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Shota Orihashi, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando
This paper proposes end-to-end joint modeling of conversation history-dependent and independent automatic speech recognition (ASR) systems. Conversation histories are available in ASR systems such as meeting transcription applications but not in those such as voice search applications. So far, these two types of ASR system have been constructed individually with different models, which is inefficient for each application. In fact, conventional conversation history-dependent ASR systems can perform both history-dependent and history-independent processing; however, their performance is inferior to that of history-independent ASR systems. This is because the model architecture and training criterion of conventional conversation history-dependent ASR systems are specialized for the case where conversational histories are available. To address this problem, our proposed end-to-end joint modeling method uses a cross-modal transformer-based architecture that can flexibly switch between using and not using conversation histories. In addition, we propose multi-history training, which simultaneously utilizes a dataset without histories and datasets with various histories, to effectively improve both types of ASR processing within a unified architecture. Experiments on Japanese ASR tasks demonstrate the effectiveness of the proposed method: multi-history training produces an ASR model that is robust across a variety of conversational contexts as well as none, and the proposed E2E joint model outperforms conventional E2E-ASR systems in both history-dependent and history-independent processing.
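The authors' cross-modal transformer is specified in the paper; the sketch below only shows the switching idea in miniature, an encoder that fuses embedded conversation history via cross-attention when it is provided and runs unchanged otherwise. Under multi-history-style training one would sample batches both with and without the `history` argument. All sizes are assumptions.

```python
import torch
import torch.nn as nn

class HistorySwitchableEncoder(nn.Module):
    """Toy ASR encoder that runs with or without conversation-history context:
    when history embeddings are supplied they are fused via cross-attention,
    otherwise the speech encoding is used as is."""
    def __init__(self, d_model=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.speech_enc = nn.TransformerEncoder(layer, num_layers=2)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, speech_feats, history=None):
        h = self.speech_enc(speech_feats)
        if history is not None:                     # history-dependent path
            fused, _ = self.cross_attn(h, history, history)
            h = h + fused
        return h                                     # would feed an attention decoder / CTC head

enc = HistorySwitchableEncoder()
speech = torch.randn(2, 120, 256)                    # (batch, frames, feature dim)
history = torch.randn(2, 40, 256)                    # embedded previous utterances
out_dependent = enc(speech, history)                 # history-dependent processing
out_independent = enc(speech)                        # history-independent processing
```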
{"title":"End-to-End Joint Modeling of Conversation History-Dependent and Independent ASR Systems with Multi-History Training","authors":"Ryo Masumura, Yoshihiro Yamazaki, Saki Mizuno, Naoki Makishima, Mana Ihori, Mihiro Uchida, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Shota Orihashi, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando","doi":"10.21437/interspeech.2022-11357","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11357","url":null,"abstract":"This paper proposes end-to-end joint modeling of conversation history-dependent and independent automatic speech recognition (ASR) systems. Conversation histories are available in ASR systems such as meeting transcription applications but not available in those such as voice search applications. So far, these two ASR systems have been individually constructed using different models, but this is inefficient for each application. In fact, conventional conversation history-dependent ASR systems can perform both history-dependent and independent processing. However, their performance is inferior to history-independent ASR systems. This is because the model architecture and its training criterion in the conventional conversation history-dependent ASR systems are specialized in the case where conversational histories are available. To address this problem, our proposed end-to-end joint modeling method uses a crossmodal transformer-based architecture that can flexibly switch to use the conversation histories or not. In addition, we propose multi-history training that simultaneously utilizes a dataset without histories and datasets with various histories to effectively improve both types of ASR processing by introduc-ing unified architecture. Experiments on Japanese ASR tasks demonstrate the effectiveness of the proposed method. multi-history training which can produce a robust ASR model against both a variety of conversational contexts and none. Experimental results showed that the proposed E2E joint model provides superior performance in both history-dependent and independent ASR processing compared with conventional E2E-ASR systems.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3218-3222"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47910133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 0