
Interspeech: Latest Publications

BiCAPT: Bidirectional Computer-Assisted Pronunciation Training with Normalizing Flows
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-878 | Pages: 4332-4336
Zhan Zhang, Yuehai Wang, Jianyi Yang
Computer-Assisted Pronunciation Training (CAPT) plays an important role in language learning. So far, most existing CAPT methods are discriminative and focus on detecting where the mispronunciation is. Although learners receive feedback about their current pronunciation, they may still not be able to learn the correct pronunciation. Nevertheless, there has been little discussion about speech-based teaching in CAPT. To fill this gap, we propose a novel bidirectional CAPT method to detect mispronunciations and generate the corrected pronunciations simultaneously. This correction-based feedback can better preserve the speaking style to make the learning process more personalized. In addition, we propose to adopt normalizing flows to share the latent space for these two mirrored discriminative-generative tasks, making the whole model more compact. Experiments show that our method is efficient for mispronunciation detection and can naturally correct the speech under different CAPT granularity requirements.
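To make the shared-latent idea concrete, the sketch below shows a single affine-coupling block, the invertible building unit of typical normalizing flows: the forward pass maps acoustic features to a latent that a mispronunciation detector could score, and the exact inverse maps latents back to features for correction. This is a minimal illustration, not the BiCAPT architecture; the class name, dimensions, and usage are assumptions.
```python
# Illustrative sketch only; module names and sizes are not from the paper.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible affine-coupling block, a standard normalizing-flow component."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))  # predicts scale and shift

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        s, t = self.net(x1).chunk(2, dim=-1)
        return torch.cat([x1, x2 * torch.exp(s) + t], dim=-1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        s, t = self.net(y1).chunk(2, dim=-1)
        return torch.cat([y1, (y2 - t) * torch.exp(-s)], dim=-1)

# Forward: acoustic features -> shared latent (a discriminative head could score it);
# inverse: latent -> features, i.e. the generative direction used for correction.
flow = AffineCoupling(dim=80)
feats = torch.randn(4, 80)          # e.g., 80-dim frame features
z = flow(feats)
reconstructed = flow.inverse(z)
print(torch.allclose(feats, reconstructed, atol=1e-4))  # True up to float error
```
Because the same parameters serve both directions, the detection and generation sides share one compact model, which is the property the abstract highlights.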
Citations: 0
Homophone Disambiguation Profits from Durational Information
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-10109 | Pages: 3198-3202
Barbara Schuppler, Emil Berger, Xenia Kogler, F. Pernkopf
Given the high degree of segmental reduction in conversational speech, a large number of words that are not homophonous in read speech become homophonous. For instance, the tokens considered in this study, ah, ach, auch, eine and er, may all be reduced to [a] in conversational Austrian German. Homophones pose a serious problem for automatic speech recognition (ASR), where homophone disambiguation is typically solved using lexical context. In contrast, we propose two approaches to disambiguate homophones on the basis of prosodic and spectral features. First, we build a Random Forest classifier with a large set of acoustic features, which reaches good performance given the small data size, and allows us to gain insight into how these homophones are distinct with respect to phonetic detail. Since annotations are required for the extraction of these features, this approach would not be practical for integration into an ASR system. We thus explored a second, convolutional neural network (CNN) based approach. Its performance is on par with that of the Random Forest, and the results indicate a high potential of this approach to facilitate homophone disambiguation when combined with a stochastic language model as part of an ASR system.
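As a toy illustration of the first approach, the snippet below trains a Random Forest on a handful of prosodic/spectral descriptors (duration among them) to separate the homophone classes. The feature set and data are invented placeholders, not the paper's annotated Austrian German material.
```python
# Toy sketch: random data, illustrative feature names; not the paper's feature set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.uniform(0.03, 0.20, n),   # duration (s) of the reduced [a] token
    rng.uniform(80, 300, n),      # mean F0 (Hz)
    rng.uniform(50, 80, n),       # energy (dB)
    rng.uniform(500, 900, n),     # F1 (Hz)
    rng.uniform(1100, 1600, n),   # F2 (Hz)
])
y = rng.choice(["ah", "ach", "auch", "eine", "er"], size=n)   # homophone classes

clf = RandomForestClassifier(n_estimators=300, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())   # ~chance on random data
clf.fit(X, y)
# Feature importances are what make the Random Forest useful for phonetic insight.
print(dict(zip(["duration", "F0", "energy", "F1", "F2"], clf.feature_importances_)))
```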
Citations: 1
Frame-Level Stutter Detection
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-204 | Pages: 2843-2847
John Harvill, M. Hasegawa-Johnson, C. Yoo
Previous studies on the detection of stuttered speech have focused on classification at the utterance level (e.g., for speech therapy applications), and on the correct insertion of stutter events in sequence into an orthographic transcript. In this paper, we propose the task of frame-level stutter detection, which seeks to identify the time alignment of stutter events in a speech utterance, and we evaluate our approach on the stutter correction task. Limited previous work on stutter correction has relied on simple signal processing techniques and has only been evaluated on small datasets. Our approach is the first large-scale data-driven technique proposed to identify stuttering probabilistically at the frame level, and we make use of the largest stuttering dataset available to date during training. Predicted frame-level probabilities of different stuttering events can be used in downstream applications for Automatic Speech Recognition (ASR), either as additional features or as part of a speech preprocessing pipeline to clean speech before analysis by an ASR system.
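A natural way to consume such frame-level outputs is to threshold them into time-aligned segments; the sketch below shows one such post-processing step. The threshold and frame shift are our assumptions, not part of the paper.
```python
# Hypothetical post-processing: per-frame stutter probabilities -> (start, end) segments.
import numpy as np

def frames_to_segments(probs, frame_shift=0.02, threshold=0.5):
    """probs: per-frame stutter probabilities; returns segments in seconds."""
    active = probs >= threshold
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            segments.append((round(start * frame_shift, 3), round(i * frame_shift, 3)))
            start = None
    if start is not None:
        segments.append((round(start * frame_shift, 3), round(len(active) * frame_shift, 3)))
    return segments

probs = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.1, 0.6, 0.9, 0.2])
print(frames_to_segments(probs))   # two stutter segments with their time spans
```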
Citations: 4
Soft-label Learn for No-Intrusive Speech Quality Assessment
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-10400 | Pages: 3303-3307
Junyong Hao, Shunzhou Ye, Cheng Lu, Fei Dong, Jingang Liu, Dong Pi
Mean opinion score (MOS) is a widely used subjective metric to assess the quality of speech, and usually involves multiple human listeners judging each speech file. To reduce the labor cost of MOS, no-intrusive speech quality assessment methods have been extensively studied. However, due to the highly subjective bias of speech quality labels, it is difficult to train models that accurately represent speech quality scores. In this paper, we propose a convolutional self-attention neural network (Conformer) for MOS prediction of conference speech to effectively alleviate the disadvantage of subjective bias in model training. In addition to this novel architecture, we further improve the generalization and accuracy of the predictor by utilizing attention label pooling and soft-label learning. We demonstrate that our proposed method achieves an RMSE of 0.458 and a PLCC of 0.792 on the evaluation test dataset of the ConferencingSpeech 2022 Challenge.
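The soft-label idea can be pictured by turning the individual listener ratings of a file into a distribution over score bins rather than a single averaged MOS target; a model can then be trained against that distribution and its prediction read off as the expected score. The bin layout and smoothing below are illustrative assumptions, not the paper's recipe.
```python
# Illustrative soft-label construction; bins, sigma, and ratings are assumptions.
import numpy as np

BINS = np.arange(1.0, 5.5, 0.5)            # score bins from 1.0 to 5.0

def soft_label(ratings, sigma=0.25):
    """Spread each listener rating over the bins with a small Gaussian, then normalize."""
    ratings = np.asarray(ratings, dtype=float)
    weights = np.exp(-0.5 * ((BINS[None, :] - ratings[:, None]) / sigma) ** 2)
    dist = weights.sum(axis=0)
    return dist / dist.sum()

ratings = [3.0, 3.5, 4.0, 3.0, 2.5]        # one file, five judges
dist = soft_label(ratings)                 # target for a cross-entropy-style loss
print(dist.round(3))
print("expected MOS:", round(float(np.dot(dist, BINS)), 2))
```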
Citations: 1
Norm-constrained Score-level Ensemble for Spoofing Aware Speaker Verification
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-470 | Pages: 4371-4375
Peng Zhang, Peng Hu, Xueliang Zhang
In this paper, we present the Elevoc systems submitted to the Spoofing Aware Speaker Verification Challenge (SASVC) 2022. Our submissions focus on bridging the gap between the automatic speaker verification (ASV) and countermeasure (CM) systems. We investigate a general and efficient norm-constrained score-level ensemble method which jointly processes the scores extracted from the ASV and CM subsystems, improving robustness to both zero-effort imposters and spoofing attacks. Furthermore, we find that the ensemble system can provide better performance when both the ASV and CM subsystems are optimized. Experimental results show that our primary system yields 0.45% SV-EER, 0.26% SPF-EER and 0.37% SASV-EER, and obtains more than 96.08%, 66.67% and 94.19% relative improvements over the best performing baseline systems on the SASVC 2022 evaluation set. All of our code and pre-trained model weights are publicly available and reproducible.
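The general flavor of a score-level ensemble can be sketched as follows: ASV and CM scores are first put on a comparable scale (here a simple min-max normalization over development scores, purely as a stand-in for the paper's norm constraint) and then combined so that a trial is accepted only when both subsystems agree.
```python
# Hedged sketch; the normalization and fusion rule are illustrative, not the paper's method.
import numpy as np

def minmax_norm(scores, ref):
    lo, hi = ref.min(), ref.max()
    return np.clip((scores - lo) / (hi - lo + 1e-12), 0.0, 1.0)

# Calibration scores from development trials (toy numbers).
asv_dev = np.array([-1.2, 0.3, 2.5, 1.1, -0.7])
cm_dev = np.array([0.1, 0.9, 0.95, 0.8, 0.05])

def fuse(asv_score, cm_score):
    a = minmax_norm(np.atleast_1d(asv_score), asv_dev)
    c = minmax_norm(np.atleast_1d(cm_score), cm_dev)
    return a * c   # high only if the speaker matches AND the audio looks bona fide

print(fuse(2.0, 0.92))   # target speaker, bona fide audio -> high fused score
print(fuse(2.0, 0.10))   # target speaker but spoofed audio -> low fused score
```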
Citations: 5
CNN-based Audio Event Recognition for Automated Violence Classification and Rating for Prime Video Content
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-10053 | Pages: 2758-2762
Mayank Sharma, Tarun Gupta, Kenny Qiu, Xiang Hao, Raffay Hamid
Automated violence detection in Digital Entertainment Content (DEC) uses computer vision and natural language processing methods on visual and textual modalities. These methods face difficulty in detecting violence due to the diversity, ambiguity and multilingual nature of the data. Hence, we introduce an audio-based method to augment existing methods for violence and rating classification. We develop a generic Audio Event Detector (AED) model using open-source and Prime Video proprietary corpora, which is used as a feature extractor. Our feature set includes a global semantic embedding and sparse local audio event probabilities extracted from the AED. We demonstrate that a global-local feature view of audio results in the best detection performance. Next, we present a multi-modal detector by fusing several learners across modalities. Our training and evaluation set is also at least an order of magnitude larger than in previous literature. Furthermore, we show that (a) the audio-based approach results in superior performance compared to other baselines, (b) the benefit of the audio model is more pronounced on global multilingual data than on English data, and (c) the multi-modal model achieves 63% rating accuracy and provides the ability to backfill the top 90% Stream Weighted Coverage titles in the PV catalog with 88% coverage at 91% accuracy.
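The global-local feature view can be pictured as concatenating a clip-level embedding with pooled statistics of the frame-level event probabilities; the dimensions and pooling choices below are assumptions for illustration only.
```python
# Toy sketch of a global-local audio feature vector; sizes and pooling are invented.
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.normal(size=128)             # global semantic embedding of the clip
frame_probs = rng.uniform(size=(500, 10))    # per-frame probabilities for 10 audio event classes

local_stats = np.concatenate([frame_probs.mean(axis=0),   # how often each event occurs
                              frame_probs.max(axis=0)])    # strongest evidence per event
features = np.concatenate([embedding, local_stats])        # input to a violence/rating classifier
print(features.shape)   # (148,)
```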
Citations: 2
Attentive Feature Fusion for Robust Speaker Verification
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-478 | Pages: 286-290
Bei Liu, Zhengyang Chen, Y. Qian
As the most widely used technique, deep speaker embedding learning has recently become predominant in the speaker verification task. This approach utilizes deep neural networks to extract fixed-dimension embedding vectors which represent different speaker identities. Two network architectures, ResNet and ECAPA-TDNN, have been commonly adopted in prior studies and have achieved state-of-the-art performance. One omnipresent component, feature fusion, plays an important role in both of them. For example, shortcut connections are designed to fuse the identity mapping of the inputs with the outputs of residual blocks in ResNet. ECAPA-TDNN employs multi-layer feature aggregation to integrate shallow feature maps with deep ones. Traditional feature fusion is often implemented via simple operations, such as element-wise addition or concatenation. In this paper, we propose a more effective feature fusion scheme, namely Attentive Feature Fusion (AFF), to render dynamic weighted fusion of different features. It utilizes attention modules to learn fusion weights based on the feature contents. Additionally, two fusion strategies are designed: sequential fusion and parallel fusion. Experiments on the VoxCeleb dataset show that our proposed attentive feature fusion scheme can result in up to 40% relative improvement over the baseline systems.
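A minimal PyTorch sketch of attention-weighted fusion of two equally shaped feature maps is given below; it learns a per-channel gate from the pair and mixes them accordingly. The module layout is illustrative and not the paper's exact AFF design.
```python
# Illustrative attention-based fusion block; not the published AFF implementation.
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                   # squeeze spatial dims
            nn.Conv2d(2 * channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x, y):
        w = self.gate(torch.cat([x, y], dim=1))   # per-channel fusion weight in (0, 1)
        return w * x + (1 - w) * y                # learned, content-dependent mix

fuse = AttentiveFusion(channels=64)
x = torch.randn(8, 64, 20, 40)    # e.g., the identity-mapped input of a residual block
y = torch.randn(8, 64, 20, 40)    # the block's transformed output
print(fuse(x, y).shape)           # torch.Size([8, 64, 20, 40])
```
Compared with plain addition or concatenation, the gate lets the network decide, per channel and per utterance, how much each branch contributes.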
Citations: 2
Speech Modification for Intelligibility in Cochlear Implant Listeners: Individual Effects of Vowel- and Consonant-Boosting
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-11131 | Pages: 5473-5477
Juliana N. Saba, J. Hansen
Previous research has demonstrated techniques to improve automatic speech recognition and speech-in-noise intelligibility for normal hearing (NH) and cochlear implant (CI) listeners by synthesizing Lombard Effect (LE) speech. In this study, we emulate and evaluate segment-specific modifications based on speech production characteristics observed in natural LE speech in order to improve intelligibility for CI listeners. Two speech processing approaches were designed to modify the representation of vowels, consonants, and their combination using amplitude-based compression techniques in the "electric domain", referring to the stimulation sequence delivered to the intracochlear electrode array that corresponds to the acoustic signal. For CI listeners, the consonant-boosting and the consonant- and vowel-boosting strategies resulted in no significant difference in performance, with better representation of mid-frequency and high-frequency content corresponding to formant and consonant structure, respectively. Spectral smearing and decreased amplitude variation were also observed, which may have negatively impacted intelligibility. Segmental perturbations using weighted logarithmic and sigmoid compression functions in this study demonstrated the ability to improve the representation of frequency content but disrupted amplitude-based cues, despite comparable speech intelligibility. While there are an infinite number of acoustic-domain modifications characterizing LE speech, this study demonstrates a basic framework for emulating segmental differences in the electric domain.
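The two compression shapes mentioned above can be visualized with a toy example: both map a normalized amplitude envelope onto a compressed range that lifts weaker segments, but with different curvature. The parameter values are our own and carry no claim about the study's settings.
```python
# Toy logarithmic and sigmoid compression curves; parameters are illustrative only.
import numpy as np

def log_compress(x, alpha=100.0):
    return np.log1p(alpha * x) / np.log1p(alpha)          # 0 -> 0, 1 -> 1

def sigmoid_compress(x, k=10.0, x0=0.3):
    y = 1.0 / (1.0 + np.exp(-k * (x - x0)))
    y0 = 1.0 / (1.0 + np.exp(k * x0))                     # value at x = 0
    y1 = 1.0 / (1.0 + np.exp(-k * (1 - x0)))              # value at x = 1
    return (y - y0) / (y1 - y0)                           # rescaled so 0 -> 0 and 1 -> 1

envelope = np.linspace(0.0, 1.0, 6)                       # normalized channel amplitudes
print(log_compress(envelope).round(2))
print(sigmoid_compress(envelope).round(2))
```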
Citations: 1
Deep Self-Supervised Learning of Speech Denoising from Noisy Speeches
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-306 | Pages: 1178-1182
Y. Sanada, Takumi Nakagawa, Yuichiro Wada, K. Takanashi, Yuhui Zhang, Kiichi Tokuyama, T. Kanamori, Tomonori Yamada
In the last few years, unsupervised learning methods have been proposed for speech denoising by taking advantage of Deep Neural Networks (DNNs). The reason is that such unsupervised methods are more practical than their supervised counterparts. In our scenario, we are given a set of noisy speech data in which no two items share the same clean data. Our goal is to obtain the denoiser by training a DNN-based model. Using this set, we train the model via the following two steps: 1) from the noisy speech data, construct another noisy speech dataset via our proposed masking technique; 2) minimize our proposed loss defined from the DNN and the two noisy speech datasets. We evaluate our method using Gaussian and real-world noise in our numerical experiments. As a result, our method outperforms the state-of-the-art method on average for both noise types. In addition, we provide a theoretical explanation of why our method can be efficient if the noise has a Gaussian distribution.
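The masking step can be pictured as hiding random time-frequency bins of the noisy spectrogram and asking the denoiser to predict the held-out bins of the original noisy input; the mask pattern and loss below are our assumptions, shown only to make the two-step recipe concrete, and the tiny network is a stand-in for the real DNN.
```python
# Hedged sketch of a masking-based self-supervised training step; not the paper's exact loss.
import torch
import torch.nn as nn

def random_mask(spec, p=0.2):
    mask = (torch.rand_like(spec) > p).float()   # 1 = kept bin, 0 = hidden bin
    return spec * mask, mask

denoiser = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 1, 3, padding=1))     # stand-in denoising DNN

noisy = torch.randn(4, 1, 257, 100)              # batch of noisy magnitude spectrograms
masked, mask = random_mask(noisy)                # the "second" noisy view
pred = denoiser(masked)
# Penalize errors only on the hidden bins, which the network never saw directly.
loss = (((pred - noisy) ** 2) * (1 - mask)).sum() / (1 - mask).sum().clamp(min=1)
loss.backward()                                   # one training step (optimizer step omitted)
print(float(loss))
```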
Citations: 1
Coarse-Grained Attention Fusion With Joint Training Framework for Complex Speech Enhancement and End-to-End Speech Recognition
Pub Date: 2022-09-18 | DOI: 10.21437/interspeech.2022-698 | Pages: 3794-3798
Xuyi Zhuang, Lu Zhang, Zehua Zhang, Yukun Qian, Mingjiang Wang
Joint training of speech enhancement and automatic speech recognition (ASR) can make the model work robustly in noisy environments. However, most of these models work directly in series, and the information in the noisy speech is not reused by the ASR model, leading to a large amount of feature distortion. In order to solve the distortion problem at its root, we propose a complex speech enhancement network which enhances the speech by combining masking and mapping in the complex domain. Secondly, we propose a coarse-grained attention fusion (CAF) mechanism to fuse the features of the noisy speech and the enhanced speech. In addition, a perceptual loss is further introduced to constrain the output of the CAF module and the multi-layer output of the pre-trained model so that the feature space of the CAF is more consistent with the ASR model. Our experiments are trained and tested on datasets generated from the AISHELL-1 corpus and the DNS-3 noise dataset. The experimental results show that the character error rates (CERs) of the model are 13.42% and 20.67% for the noisy cases of 0 dB and -5 dB. The proposed joint training model also exhibits good generalization performance (5.98% relative CER degradation) on the mismatched test dataset generated from the AISHELL-2 corpus and the MUSAN noise dataset.
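Combining masking and mapping in the complex domain can be illustrated in a few lines: a complex mask rescales and rotates each noisy time-frequency bin, and a mapped complex residual is added on top. The combination rule here is an assumption for illustration, not the paper's exact formulation, and the random tensors stand in for network outputs.
```python
# Illustrative complex-domain enhancement step; tensors stand in for real network outputs.
import torch

noisy = torch.randn(257, 100, dtype=torch.cfloat)             # complex STFT of noisy speech
mask = torch.randn(257, 100, dtype=torch.cfloat)              # complex ratio mask (predicted)
residual = 0.1 * torch.randn(257, 100, dtype=torch.cfloat)    # mapped complex correction (predicted)

# Masking scales and rotates each bin; mapping adds what masking alone cannot recover.
enhanced = mask * noisy + residual
print(enhanced.shape, enhanced.dtype)
```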
Citations: 1