
Interspeech: Latest Publications

Objective Metrics to Evaluate Residual-Echo Suppression During Double-Talk in the Stereophonic Case
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-673
Amir Ivry, I. Cohen, B. Berdugo
Speech quality, as evaluated by humans, is most accurately assessed by subjective human ratings. The objective acoustic echo cancellation mean opinion score (AECMOS) metric was recently introduced and achieved high accuracy in predicting human perception during double-talk. Residual-echo suppression (RES) systems, however, employ the signal-to-distortion ratio (SDR) metric to quantify speech quality in double-talk. In this study, we focus on stereophonic acoustic echo cancellation and show that the stereo SDR (SSDR) correlates poorly with subjective human ratings according to the AECMOS, since the SSDR is influenced by both distortion of the desired speech and the presence of residual echo. We introduce a pair of objective metrics that distinctly assess the stereo desired-speech maintained level (SDSML) and the stereo residual-echo suppression level (SRESL) during double-talk. By employing a tunable RES system based on deep learning and using 100 hours of real and simulated recordings, the SDSML and SRESL metrics show high correlation with the AECMOS across various setups. We also investigate how the design parameter governs the SDSML-SRESL tradeoff, and harness this relation to allow optimal performance for frequently changing user demands in practical cases.
{"title":"Objective Metrics to Evaluate Residual-Echo Suppression During Double-Talk in the Stereophonic Case","authors":"Amir Ivry, I. Cohen, B. Berdugo","doi":"10.21437/interspeech.2022-673","DOIUrl":"https://doi.org/10.21437/interspeech.2022-673","url":null,"abstract":"Speech quality, as evaluated by humans, is most accurately as-sessed by subjective human ratings. The objective acoustic echo cancellation mean opinion score (AECMOS) metric was re-cently introduced and achieved high accuracy in predicting human perception during double-talk. Residual-echo suppression (RES) systems, however, employ the signal-to-distortion ratio (SDR) metric to quantify speech-quality in double-talk. In this study, we focus on stereophonic acoustic echo cancellation, and show that the stereo SDR (SSDR) poorly correlates with subjective human ratings according to the AECMOS, since the SSDR is influenced by both distortion of desired speech and presence of residual-echo. We introduce a pair of objective metrics that distinctly assess the stereo desired-speech maintained level (SDSML) and stereo residual-echo suppression level (SRESL) during double-talk. By employing a tunable RES system based on deep learning and using 100 hours of real and simulated recordings, the SDSML and SRESL metrics show high correlation with the AECMOS across various setups. We also investi-gate into how the design parameter governs the SDSML-SRESL tradeoff, and harness this relation to allow optimal performance for frequently-changing user demands in practical cases.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"5348-5352"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42839527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
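The exact definitions of the SDSML and SRESL metrics are given in the paper; as background for the comparison the abstract draws, the sketch below computes the standard signal-to-distortion ratio averaged over the two channels of a stereo signal. This is a minimal NumPy illustration of the baseline SDR concept only, with illustrative function names, not the paper's proposed metrics.

```python
import numpy as np

def sdr_db(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-12) -> float:
    """Standard signal-to-distortion ratio in dB for a single channel."""
    distortion = reference - estimate
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(distortion ** 2) + eps))

def stereo_sdr_db(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Average SDR over the channels of (2, T) stereo reference/estimate signals."""
    return float(np.mean([sdr_db(reference[ch], estimate[ch]) for ch in range(reference.shape[0])]))

# Toy usage: a clean stereo target and a lightly distorted estimate of it.
rng = np.random.default_rng(0)
target = rng.standard_normal((2, 16000))
estimate = target + 0.1 * rng.standard_normal((2, 16000))
print(f"stereo SDR: {stereo_sdr_db(target, estimate):.1f} dB")
```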
Incremental Layer-Wise Self-Supervised Learning for Efficient Unsupervised Speech Domain Adaptation On Device
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10904
Zhouyuan Huo, DongSeon Hwang, K. Sim, Shefali Garg, Ananya Misra, Nikhil Siddhartha, Trevor Strohman, F. Beaufays
Streaming end-to-end speech recognition models have been widely applied to mobile devices and show significant improvements in efficiency. These models are typically trained on the server using transcribed speech data. However, the server data distribution can be very different from the data distribution on user devices, which can affect model performance. There are two main challenges for on-device training: limited reliable labels and limited training memory. While self-supervised learning algorithms can mitigate the mismatch between domains using unlabeled data, they are not directly applicable on mobile devices because of the memory constraint. In this paper, we propose an incremental layer-wise self-supervised learning algorithm for efficient unsupervised speech domain adaptation on mobile devices, in which only one layer is updated at a time. Extensive experimental results demonstrate that the proposed algorithm achieves a 24.2% relative Word Error Rate (WER) improvement on the target domain compared to a supervised baseline and costs 95.7% less training memory than the end-to-end self-supervised learning algorithm.
{"title":"Incremental Layer-Wise Self-Supervised Learning for Efficient Unsupervised Speech Domain Adaptation On Device","authors":"Zhouyuan Huo, DongSeon Hwang, K. Sim, Shefali Garg, Ananya Misra, Nikhil Siddhartha, Trevor Strohman, F. Beaufays","doi":"10.21437/interspeech.2022-10904","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10904","url":null,"abstract":"Streaming end-to-end speech recognition models have been widely applied to mobile devices and show significant improvement in efficiency. These models are typically trained on the server using transcribed speech data. However, the server data distribution can be very different from the data distribution on user devices, which could affect the model performance. There are two main challenges for on device training, limited reliable labels and limited training memory. While self-supervised learning algorithms can mitigate the mismatch between domains using unlabeled data, they are not applicable on mobile devices directly because of the memory constraint. In this paper, we propose an incremental layer-wise self-supervised learning algorithm for efficient unsupervised speech domain adaptation on mobile devices, in which only one layer is updated at a time. Extensive experimental results demonstrate that the proposed algorithm achieves a 24 . 2% relative Word Error Rate (WER) improvement on the target domain compared to a supervised baseline and costs 95 . 7% less training memory than the end-to-end self-supervised learning algorithm.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4845-4849"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41326867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
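The core idea of updating only one layer at a time can be sketched in a few lines of PyTorch. The sketch below is a minimal illustration under assumed placeholders: the toy linear encoder, the feature-reconstruction loss, and the batch data all stand in for the paper's streaming ASR model and its self-supervised objective.

```python
import torch
import torch.nn as nn

# Toy stand-in for a speech encoder; the paper adapts a streaming ASR model.
layers = nn.ModuleList([nn.Linear(80, 80) for _ in range(4)])
model = nn.Sequential(*layers)

def set_trainable_layer(idx: int) -> None:
    """Freeze every layer except the one being adapted in the current phase."""
    for i, layer in enumerate(layers):
        for p in layer.parameters():
            p.requires_grad = (i == idx)

unlabeled_batches = [torch.randn(8, 80) for _ in range(12)]  # placeholder on-device data

for layer_idx in range(len(layers)):              # adapt layers incrementally, one at a time
    set_trainable_layer(layer_idx)
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=1e-4)       # optimizer state exists only for the active layer
    for x in unlabeled_batches:
        # Placeholder self-supervised objective (feature reconstruction);
        # the paper uses its own SSL loss on unlabeled speech.
        loss = nn.functional.mse_loss(model(x), x)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"finished layer {layer_idx}, last loss {loss.item():.4f}")
```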
Gated Convolutional Fusion for Time-Domain Target Speaker Extraction Network
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-961
Wenjing Liu, Chuan Xie
Target speaker extraction aims to extract the target speaker's voice from mixed utterances based on auxiliary reference speech of the target speaker. A speaker embedding is usually extracted from the reference speech and fused with the learned acoustic representation. The majority of existing works perform simple operation-based fusion such as concatenation. However, potential cross-modal correlation may not be effectively explored by this naive approach, which directly fuses the speaker embedding into the acoustic representation. In this work, we propose a gated convolutional fusion approach that explores global conditional modeling and a trainable gating mechanism to learn the sophisticated interaction between the speaker embedding and the acoustic representation. Experiments on the WSJ0-2mix-extr dataset prove the efficacy of the proposed fusion approach, which performs favorably against other fusion methods with considerable improvement in terms of SDRi and SI-SDRi. Moreover, our method can be flexibly incorporated into similar time-domain speaker extraction networks to attain better performance.
{"title":"Gated Convolutional Fusion for Time-Domain Target Speaker Extraction Network","authors":"Wenjing Liu, Chuan Xie","doi":"10.21437/interspeech.2022-961","DOIUrl":"https://doi.org/10.21437/interspeech.2022-961","url":null,"abstract":"Target speaker extraction aims to extract the target speaker’s voice from mixed utterances based on auxillary reference speech of the target speaker. A speaker embedding is usually extracted from the reference speech and fused with the learned acoustic representation. The majority of existing works perform simple operation-based fusion of concatenation. However, potential cross-modal correlation may not be effectively explored by this naive approach that directly fuse the speaker embedding into the acoustic representation. In this work, we propose a gated convolutional fusion approach by exploring global conditional modeling and trainable gating mechanism for learning so-phisticated interaction between speaker embedding and acoustic representation. Experiments on WSJ0-2mix-extr dataset proves the efficacy of the proposed fusion approach, which performs favorably against other fusion methods with considerable improvement in terms of SDRi and SI-SDRi. Moreover, our method can be flexibly incorporated into similar time-domain speaker extraction networks to attain better performance.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"5368-5372"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49331936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
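To make the gating idea concrete, the sketch below shows one way a learned sigmoid gate can control how much speaker information is injected into each acoustic frame. It is an illustrative PyTorch module with assumed dimensions and layer choices, not the paper's exact fusion block.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse a speaker embedding into a frame-level acoustic representation
    through a learned sigmoid gate (illustrative sketch only)."""

    def __init__(self, acoustic_dim: int, speaker_dim: int):
        super().__init__()
        self.gate = nn.Conv1d(acoustic_dim + speaker_dim, acoustic_dim, kernel_size=1)
        self.proj = nn.Conv1d(acoustic_dim + speaker_dim, acoustic_dim, kernel_size=1)

    def forward(self, acoustic: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # acoustic: (B, C, T); spk_emb: (B, D), broadcast over the time axis.
        spk = spk_emb.unsqueeze(-1).expand(-1, -1, acoustic.size(-1))
        joint = torch.cat([acoustic, spk], dim=1)
        gate = torch.sigmoid(self.gate(joint))   # per-frame amount of speaker conditioning
        return gate * self.proj(joint) + (1 - gate) * acoustic

fusion = GatedFusion(acoustic_dim=256, speaker_dim=128)
mixed = torch.randn(4, 256, 200)       # encoded mixture frames
target_emb = torch.randn(4, 128)       # speaker embedding from the reference speech
print(fusion(mixed, target_emb).shape)  # torch.Size([4, 256, 200])
```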
L2-GEN: A Neural Phoneme Paraphrasing Approach to L2 Speech Synthesis for Mispronunciation Diagnosis
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-209
Dan Zhang, Ashwinkumar Ganesan, Sarah Campbell
In this paper, we study the problem of generating mispronounced speech mimicking non-native (L2) speakers learning English as a Second Language (ESL) for the mispronunciation detection and diagnosis (MDD) task. The paper is motivated by the widely observed yet not well addressed data sparsity issue in MDD research, where both L2 speech audio and its fine-grained phonetic annotations are difficult to obtain, leading to unsatisfactory mispronunciation feedback accuracy. We propose L2-GEN, a new data augmentation framework that generates L2 phoneme sequences capturing realistic mispronunciation patterns by devising a unique machine translation-based sequence paraphrasing model. A novel diversified and preference-aware decoding algorithm is proposed to generalize L2-GEN to handle both unseen words and new learner populations with very limited L2 training data. A contrastive augmentation technique is further designed to optimize MDD performance improvements with the generated synthetic L2 data. We evaluate L2-GEN on the public L2-ARCTIC and SpeechOcean762 datasets. The results show that L2-GEN leads to up to 3.9% and 5.0% MDD F1-score improvements in in-domain and out-of-domain scenarios, respectively.
{"title":"L2-GEN: A Neural Phoneme Paraphrasing Approach to L2 Speech Synthesis for Mispronunciation Diagnosis","authors":"Dan Zhang, Ashwinkumar Ganesan, Sarah Campbell","doi":"10.21437/interspeech.2022-209","DOIUrl":"https://doi.org/10.21437/interspeech.2022-209","url":null,"abstract":"In this paper, we study the problem of generating mispronounced speech mimicking non-native (L2) speakers learning English as a Second Language (ESL) for the mispronunciation detection and diagnosis (MDD) task. The paper is motivated by the widely observed yet not well addressed data sparsity is-sue in MDD research where both L2 speech audio and its fine-grained phonetic annotations are difficult to obtain, leading to unsatisfactory mispronunciation feedback accuracy. We pro-pose L2-GEN, a new data augmentation framework to generate L2 phoneme sequences that capture realistic mispronunciation patterns by devising an unique machine translation-based sequence paraphrasing model. A novel diversified and preference-aware decoding algorithm is proposed to generalize L2-GEN to handle both unseen words and new learner population with very limited L2 training data. A contrastive augmentation technique is further designed to optimize MDD performance improvements with the generated synthetic L2 data. We evaluate L2-GEN on public L2-ARCTIC and SpeechOcean762 datasets. The results have shown that L2-GEN leads to up to 3.9%, and 5.0% MDD F1-score improvements in in-domain and out-of-domain scenarios respectively.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4317-4321"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49388203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
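As a much simpler point of reference for what "generating mispronounced phoneme sequences" means as data augmentation, the sketch below perturbs a canonical phoneme sequence with a hand-written confusion table. This rule-based stand-in is not L2-GEN; the paper instead learns such substitution patterns with a machine translation-style paraphrasing model, and the confusion pairs here are illustrative assumptions.

```python
import random

# Illustrative confusion pairs for common L2 English substitutions (assumed, not
# taken from the paper); L2-GEN learns realistic patterns from data instead.
CONFUSIONS = {"TH": ["S", "T"], "DH": ["D", "Z"], "R": ["L"], "V": ["W"]}

def naive_mispronounce(phonemes: list, prob: float = 0.3, seed: int = 0) -> list:
    """Rule-based stand-in: randomly swap phonemes using a confusion table."""
    rng = random.Random(seed)
    out = []
    for ph in phonemes:
        if ph in CONFUSIONS and rng.random() < prob:
            out.append(rng.choice(CONFUSIONS[ph]))
        else:
            out.append(ph)
    return out

canonical = ["DH", "IH", "S", "TH", "IH", "NG"]   # "this thing"
print(naive_mispronounce(canonical))
```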
ORCA-WHISPER: An Automatic Killer Whale Sound Type Generation Toolkit Using Deep Learning
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-846
Christian Bergler, Alexander Barnhill, Dominik Perrin, M. Schmitt, A. Maier, E. Nöth
Even today, the current understanding and interpretation of animal-specific vocalization paradigms is largely based on historical and manual data analysis of comparatively small data corpora, primarily because of time and human-resource limitations, next to the scarcity of available species-related machine-learning techniques. Partial human-based data inspections represent neither the overall real-world vocal repertoire nor the variations within intra- and inter-animal-specific call type portfolios, typically resulting only in small collections of category-specific ground truth data. Modern machine (deep) learning concepts are an essential requirement to identify statistically significant animal-related vocalization patterns within massive bioacoustic data archives. However, the applicability of purely supervised training approaches is challenging, due to limited call-specific ground truth data combined with strong class imbalances between individual call type events. The current study is the first to present a deep bioacoustic signal generation framework, entitled ORCA-WHISPER, a Generative Adversarial Network (GAN) trained on low-resource killer whale (Orcinus orca) call type data. Besides audiovisual inspection, supervised call type classification, and model transferability, the auspicious quality of the generated fake vocalizations was further demonstrated by visualizing, representing, and enhancing the real-world orca signal data manifold. Moreover, previous orca/noise segmentation results were outperformed by integrating fake signals into the original data partition.
{"title":"ORCA-WHISPER: An Automatic Killer Whale Sound Type Generation Toolkit Using Deep Learning","authors":"Christian Bergler, Alexander Barnhill, Dominik Perrin, M. Schmitt, A. Maier, E. Nöth","doi":"10.21437/interspeech.2022-846","DOIUrl":"https://doi.org/10.21437/interspeech.2022-846","url":null,"abstract":"Even today, the current understanding and interpretation of animal-specific vocalization paradigms is largely based on his-torical and manual data analysis considering comparatively small data corpora, primarily because of time- and human-resource limitations, next to the scarcity of available species-related machine-learning techniques. Partial human-based data inspections neither represent the overall real-world vocal reper-toire, nor the variations within intra- and inter animal-specific call type portfolios, typically resulting only in small collections of category-specific ground truth data. Modern machine (deep) learning concepts are an essential requirement to identify sta-tistically significant animal-related vocalization patterns within massive bioacoustic data archives. However, the applicability of pure supervised training approaches is challenging, due to limited call-specific ground truth data, combined with strong class-imbalances between individual call type events. The current study is the first presenting a deep bioacoustic signal generation framework, entitled ORCA-WHISPER, a Generative Adversarial Network (GAN), trained on low-resource killer whale ( Orcinus Orca ) call type data. Besides audiovisual in-spection, supervised call type classification, and model transferability, the auspicious quality of generated fake vocalizations was further demonstrated by visualizing, representing, and en-hancing the real-world orca signal data manifold. Moreover, previous orca/noise segmentation results were outperformed by integrating fake signals to the original data partition.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2413-2417"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49492375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
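For readers unfamiliar with GAN training, the sketch below shows the generic adversarial loop that a spectrogram-generation framework of this kind builds on. The network sizes, the flattened 64×64 patch representation, and the random stand-in data are all assumptions; they do not reproduce the ORCA-WHISPER architecture or its call-type data.

```python
import torch
import torch.nn as nn

# Minimal GAN loop on flattened spectrogram patches (illustrative only).
PATCH = 64 * 64
G = nn.Sequential(nn.Linear(100, 512), nn.ReLU(), nn.Linear(512, PATCH), nn.Tanh())
D = nn.Sequential(nn.Linear(PATCH, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real_calls = torch.rand(256, PATCH) * 2 - 1   # stand-in for real call-type spectrograms

for step in range(100):
    real = real_calls[torch.randint(0, 256, (16,))]
    fake = G(torch.randn(16, 100))

    # Discriminator step: separate real patches from generated ones.
    d_loss = bce(D(real), torch.ones(16, 1)) + bce(D(fake.detach()), torch.zeros(16, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: produce patches the discriminator accepts as real.
    g_loss = bce(D(fake), torch.ones(16, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```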
Predicting label distribution improves non-intrusive speech quality estimation
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11186
A. Faridee, H. Gamper
{"title":"Predicting label distribution improves non-intrusive speech quality estimation","authors":"A. Faridee, H. Gamper","doi":"10.21437/interspeech.2022-11186","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11186","url":null,"abstract":"","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"406-410"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49515191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Interactive Co-Learning with Cross-Modal Transformer for Audio-Visual Emotion Recognition
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11307
Akihiko Takashima, Ryo Masumura, Atsushi Ando, Yoshihiro Yamazaki, Mihiro Uchida, Shota Orihashi
This paper proposes a novel modeling method for audio-visual emotion recognition. Since human emotions are expressed multi-modally, jointly capturing audio and visual cues is a potentially promising approach. In conventional multi-modal modeling methods, a recognition model is trained from an audio-visual paired dataset so as to only enhance audio-visual emotion recognition performance. However, such a model fails to estimate emotions from single-modal inputs, which indicates that it is degraded by overfitting to the combinations of the individual modal features. Our supposition is that the ideal form of emotion recognition is to accurately perform both audio-visual multimodal processing and single-modal processing with a single model. This is expected to promote the utilization of individual modal knowledge for improving audio-visual emotion recognition. Therefore, our proposed method employs a cross-modal transformer model that enables different types of inputs to be handled. In addition, we introduce a novel training method named interactive co-learning; it allows the model to learn knowledge from both or either of the modalities. Experiments on a multi-label emotion recognition task demonstrate the effectiveness of the proposed method.
{"title":"Interactive Co-Learning with Cross-Modal Transformer for Audio-Visual Emotion Recognition","authors":"Akihiko Takashima, Ryo Masumura, Atsushi Ando, Yoshihiro Yamazaki, Mihiro Uchida, Shota Orihashi","doi":"10.21437/interspeech.2022-11307","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11307","url":null,"abstract":"This paper proposes a novel modeling method for audio-visual emotion recognition. Since human emotions are expressed multi-modally, jointly capturing audio and visual cues is a potentially promising approach. In conventional multi-modal modeling methods, a recognition model was trained from an audio-visual paired dataset so as to only enhance audio-visual emotion recognition performance. However, it fails to estimate emotions from single-modal inputs, which indicates they are degraded by overfitting the combinations of the individual modal features. Our supposition is that the ideal form of the emotion recognition is to accurately perform both audio-visual multimodal processing and single-modal processing with a single model. This is expected to promote utilization of individual modal knowledge for improving audio-visual emotion recognition. Therefore, our proposed method employs a cross-modal transformer model that enables different types of inputs to be handled. In addition, we introduce a novel training method named interactive co-learning; it allows the model to learn knowledge from both and either of the modals. Experiments on a multi-label emotion recognition task demonstrate the ef-fectiveness of the proposed method.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4740-4744"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49527957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
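The basic building block behind "cross-modal transformer" is cross-attention, where one modality provides the queries and the other provides the keys and values. The sketch below is a minimal PyTorch version of such a block with assumed dimensions; it does not reproduce the paper's full model or the interactive co-learning procedure.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One cross-attention block: the query modality attends to the context
    modality, followed by a feed-forward layer (simplified sketch)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, query_mod: torch.Tensor, context_mod: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query_mod, context_mod, context_mod)
        x = self.norm1(query_mod + attended)
        return self.norm2(x + self.ff(x))

audio = torch.randn(2, 300, 256)   # (batch, audio frames, dim)
video = torch.randn(2, 75, 256)    # (batch, video frames, dim)
block = CrossModalBlock()
audio_enriched = block(audio, video)   # audio queries attend to visual cues
video_enriched = block(video, audio)   # and vice versa when stacked symmetrically
print(audio_enriched.shape, video_enriched.shape)
```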
Interpretabilty of Speech Emotion Recognition modelled using Self-Supervised Speech and Text Pre-Trained Embeddings
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10685
K. V. V. Girish, Srikanth Konjeti, Jithendra Vepa
Speech emotion recognition (SER) is useful in many applications and has been approached using signal processing techniques in the past and deep learning techniques more recently. Human emotions are complex in nature and can vary widely within an utterance. SER accuracy has improved with various multimodal techniques, but there is still a gap in understanding model behaviour and expressing these complex emotions in a human-interpretable form. In this work, we propose and define interpretability measures represented as a Human Level Indicator Matrix for an utterance and showcase its effectiveness in both qualitative and quantitative terms. Word-level interpretability is presented using attention-based sequence modelling of self-supervised speech and text pre-trained embeddings. Prosody features are also combined with the proposed model to examine their efficacy at the word and utterance level. We provide insights into sub-utterance-level emotion predictions for complex utterances where the emotion classes change within the utterance. We evaluate the model and provide the interpretations on the publicly available IEMOCAP dataset.
{"title":"Interpretabilty of Speech Emotion Recognition modelled using Self-Supervised Speech and Text Pre-Trained Embeddings","authors":"K. V. V. Girish, Srikanth Konjeti, Jithendra Vepa","doi":"10.21437/interspeech.2022-10685","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10685","url":null,"abstract":"Speech emotion recognition (SER) is useful in many applications and is approached using signal processing techniques in the past and deep learning techniques recently. Human emotions are complex in nature and can vary widely within an utterance. The SER accuracy has improved using various multimodal techniques but there is still some gap in understanding the model behaviour and expressing these complex emotions in a human interpretable form. In this work, we propose and define interpretability measures represented as a Human Level Indicator Matrix for an utterance and showcase it’s effective-ness in both qualitative and quantitative terms. A word level interpretability is presented using an attention based sequence modelling of self-supervised speech and text pre-trained embeddings. Prosody features are also combined with the proposed model to see the efficacy at the word and utterance level. We provide insights into sub-utterance level emotion predictions for complex utterances where the emotion classes change within the utterance. We evaluate the model and provide the interpretations on the publicly available IEMOCAP dataset.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4496-4500"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49559750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
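One common way attention-based sequence modelling yields word-level interpretability is attention pooling, where the per-word attention weights indicate which words drove the utterance-level prediction. The sketch below illustrates that generic mechanism with assumed names and dimensions; the paper's Human Level Indicator Matrix is defined differently and is not reproduced here.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Attention pooling over word-level embeddings; the returned per-word
    weights offer a simple form of word-level interpretability (sketch only)."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, word_embs: torch.Tensor):
        # word_embs: (B, num_words, dim), e.g. SSL speech/text features pooled per word.
        weights = torch.softmax(self.score(word_embs), dim=1)   # (B, num_words, 1)
        utterance = (weights * word_embs).sum(dim=1)            # (B, dim)
        return utterance, weights.squeeze(-1)

pool = AttentivePooling(dim=768)
words = torch.randn(1, 12, 768)               # 12 words in one utterance
utt_vec, word_weights = pool(words)
emotion_logits = nn.Linear(768, 4)(utt_vec)   # e.g. 4 emotion classes as in IEMOCAP setups
print(word_weights)                           # which words contributed most to the prediction
```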
The CLIPS System for 2022 Spoofing-Aware Speaker Verification Challenge
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-320
Jucai Lin, Tingwei Chen, Jingbiao Huang, Ruidong Fang, Jun Yin, Yuanping Yin, W. Shi, Wei Huang, Yapeng Mao
In this paper, a spoofing-aware speaker verification (SASV) system that integrates an automatic speaker verification (ASV) system and a countermeasure (CM) system is developed. Firstly, a modified re-parameterized VGG (ARepVGG) module is utilized to extract a high-level representation from the multi-scale feature learned from the raw waveform through sinc-filters, and then a spectro-temporal graph attention network is used to learn the final decision on whether the audio is spoofed or not. Secondly, a new network inspired by Max-Feature-Map (MFM) layers is constructed to fine-tune the CM system while keeping the ASV system fixed. Our proposed SASV system significantly improves the SASV equal error rate (SASV-EER) from 6.73% to 1.36% on the evaluation dataset and from 4.85% to 0.98% on the development dataset in the 2022 Spoofing-Aware Speaker Verification Challenge (2022 SASV).
{"title":"The CLIPS System for 2022 Spoofing-Aware Speaker Verification Challenge","authors":"Jucai Lin, Tingwei Chen, Jingbiao Huang, Ruidong Fang, Jun Yin, Yuanping Yin, W. Shi, Wei Huang, Yapeng Mao","doi":"10.21437/interspeech.2022-320","DOIUrl":"https://doi.org/10.21437/interspeech.2022-320","url":null,"abstract":"In this paper, a spoofing-aware speaker verification (SASV) system that integrates the automatic speaker verification (ASV) system and countermeasure (CM) system is developed. Firstly, a modified re-parameterized VGG (ARepVGG) module is utilized to extract high-level representation from the multi-scale feature that learns from the raw waveform though sinc-filters, and then a spectra-temporal graph attention network is used to learn the final decision information whether the audio is spoofed or not. Secondly, a new network that is inspired from the Max-Feature-Map (MFM) layers is constructed to fine-tune the CM system while keeping the ASV system fixed. Our proposed SASV system significantly improves the SASV equal error rate (SASV-EER) from 6.73 % to 1.36 % on the evaluation dataset and 4.85 % to 0.98 % on the development dataset in the 2022 Spoofing-Aware Speaker Verification Challenge(2022 SASV).","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4367-4370"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42514937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
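The Max-Feature-Map activation referenced here is the Light-CNN-style operation that splits the channel dimension in half and keeps the element-wise maximum. The sketch below is a generic MFM layer with an assumed convolutional front end, not the exact CLIPS fine-tuning network.

```python
import torch
import torch.nn as nn

class MaxFeatureMap(nn.Module):
    """Max-Feature-Map activation: split channels in half and keep the
    element-wise maximum (generic layer, illustrative sketch)."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = torch.chunk(x, 2, dim=1)   # split along the channel dimension
        return torch.max(a, b)

# Assumed toy front end: a conv layer whose 32 output channels are halved by MFM.
block = nn.Sequential(nn.Conv2d(1, 32, kernel_size=3, padding=1), MaxFeatureMap())
spec = torch.randn(4, 1, 257, 400)        # (batch, 1, frequency bins, frames)
print(block(spec).shape)                  # torch.Size([4, 16, 257, 400])
```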
Single-channel speech enhancement using Graph Fourier Transform
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-740
Chenhui Zhang, Xiang Pan
This paper presents a combination of the Graph Fourier Transform (GFT) and U-net, and proposes a deep neural network (DNN) named G-Unet for single-channel speech enhancement. The GFT is applied to speech data to create the inputs of the U-net. The GFT outputs are combined with the mask estimated by the U-net in the time-graph (T-G) domain to reconstruct the enhanced speech in the time domain via the inverse GFT. G-Unet outperforms the combination of the Short-Time Fourier Transform (STFT) and a magnitude-estimation U-net in improving speech quality and de-reverberation, and outperforms the combination of the STFT and a complex U-net in improving speech quality in some cases, as validated by testing on the LibriSpeech and NOISEX92 datasets.
{"title":"Single-channel speech enhancement using Graph Fourier Transform","authors":"Chenhui Zhang, Xiang Pan","doi":"10.21437/interspeech.2022-740","DOIUrl":"https://doi.org/10.21437/interspeech.2022-740","url":null,"abstract":"This paper presents combination of Graph Fourier Transform (GFT) and U-net, proposes a deep neural network (DNN) named G-Unet for single channel speech enhancement. GFT is carried out over speech data for creating inputs of U-net. The GFT outputs are combined with the mask estimated by Unet in time-graph (T-G) domain to reconstruct enhanced speech in time domain by Inverse GFT. The G-Unet outperforms the combination of Short time Fourier Transform (STFT) and magnitude estimation U-net in improving speech quality and de-reverberation, and outperforms the combination of STFT and complex U-net in improving speech quality in some cases, which is validated by testing on LibriSpeech and NOISEX92 dataset.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"946-950"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45499958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
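The forward and inverse GFT themselves reduce to an eigendecomposition of a graph Laplacian. The NumPy sketch below uses a simple path graph over the samples of one frame and an identity mask; the paper's graph construction and the U-net-predicted mask may differ, so treat this as an illustrative sketch of the transform only.

```python
import numpy as np

def path_graph_laplacian(n: int) -> np.ndarray:
    """Combinatorial Laplacian of a path graph with n nodes (one node per sample)."""
    adj = np.zeros((n, n))
    idx = np.arange(n - 1)
    adj[idx, idx + 1] = 1.0
    adj[idx + 1, idx] = 1.0
    return np.diag(adj.sum(axis=1)) - adj

frame = np.random.randn(256)                  # one time-domain speech frame
L = path_graph_laplacian(frame.size)
eigvals, U = np.linalg.eigh(L)                # graph frequencies and Fourier basis

gft_coeffs = U.T @ frame                      # forward GFT: time domain -> graph domain
mask = np.ones_like(gft_coeffs)               # placeholder for a U-net-predicted mask
enhanced = U @ (mask * gft_coeffs)            # inverse GFT back to the time domain

print(np.allclose(enhanced, frame))           # identity mask reconstructs the frame exactly
```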