
Latest Publications in Interspeech

Interactive Co-Learning with Cross-Modal Transformer for Audio-Visual Emotion Recognition
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11307
Akihiko Takashima, Ryo Masumura, Atsushi Ando, Yoshihiro Yamazaki, Mihiro Uchida, Shota Orihashi
This paper proposes a novel modeling method for audio-visual emotion recognition. Since human emotions are expressed multi-modally, jointly capturing audio and visual cues is a promising approach. In conventional multi-modal modeling methods, a recognition model is trained on an audio-visual paired dataset solely to enhance audio-visual emotion recognition performance. However, such a model fails to estimate emotions from single-modal inputs, which indicates that it overfits to combinations of the individual modal features. Our supposition is that the ideal form of emotion recognition is a single model that accurately performs both audio-visual multi-modal processing and single-modal processing. This is expected to promote the use of individual modal knowledge for improving audio-visual emotion recognition. Our proposed method therefore employs a cross-modal transformer model that can handle different types of inputs. In addition, we introduce a novel training method named interactive co-learning, which allows the model to learn from both modalities jointly and from each modality individually. Experiments on a multi-label emotion recognition task demonstrate the effectiveness of the proposed method.
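The cross-modal transformer and the interactive co-learning procedure are specified in the paper; the following is only a minimal PyTorch sketch of the general idea, namely a single encoder that accepts audio features, visual features, or both, trained on the paired input and on each modality alone. All layer sizes, feature dimensions and the eight-label output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalEmotionModel(nn.Module):
    """Toy cross-modal transformer that accepts audio-only, visual-only or audio-visual input."""
    def __init__(self, d_audio=40, d_visual=512, d_model=256, n_labels=8):
        super().__init__()
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.visual_proj = nn.Linear(d_visual, d_model)
        self.mod_emb = nn.Embedding(2, d_model)   # lets the shared encoder tell the streams apart
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_labels)

    def forward(self, audio=None, visual=None):
        tokens = []
        if audio is not None:
            tokens.append(self.audio_proj(audio) + self.mod_emb.weight[0])
        if visual is not None:
            tokens.append(self.visual_proj(visual) + self.mod_emb.weight[1])
        x = torch.cat(tokens, dim=1)              # concatenate the token streams along time
        return self.head(self.encoder(x).mean(dim=1))   # pooled utterance-level multi-label logits

model = CrossModalEmotionModel()
criterion = nn.BCEWithLogitsLoss()                # multi-label emotion targets
audio = torch.randn(2, 100, 40)                   # (batch, audio frames, feature dim)
visual = torch.randn(2, 25, 512)                  # (batch, video frames, feature dim)
labels = torch.randint(0, 2, (2, 8)).float()

# co-learning-style step: one loss on the paired input plus one per single modality
loss = (criterion(model(audio, visual), labels)
        + criterion(model(audio=audio), labels)
        + criterion(model(visual=visual), labels))
loss.backward()
```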
{"title":"Interactive Co-Learning with Cross-Modal Transformer for Audio-Visual Emotion Recognition","authors":"Akihiko Takashima, Ryo Masumura, Atsushi Ando, Yoshihiro Yamazaki, Mihiro Uchida, Shota Orihashi","doi":"10.21437/interspeech.2022-11307","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11307","url":null,"abstract":"This paper proposes a novel modeling method for audio-visual emotion recognition. Since human emotions are expressed multi-modally, jointly capturing audio and visual cues is a potentially promising approach. In conventional multi-modal modeling methods, a recognition model was trained from an audio-visual paired dataset so as to only enhance audio-visual emotion recognition performance. However, it fails to estimate emotions from single-modal inputs, which indicates they are degraded by overfitting the combinations of the individual modal features. Our supposition is that the ideal form of the emotion recognition is to accurately perform both audio-visual multimodal processing and single-modal processing with a single model. This is expected to promote utilization of individual modal knowledge for improving audio-visual emotion recognition. Therefore, our proposed method employs a cross-modal transformer model that enables different types of inputs to be handled. In addition, we introduce a novel training method named interactive co-learning; it allows the model to learn knowledge from both and either of the modals. Experiments on a multi-label emotion recognition task demonstrate the ef-fectiveness of the proposed method.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4740-4744"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49527957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 0
Interpretability of Speech Emotion Recognition modelled using Self-Supervised Speech and Text Pre-Trained Embeddings
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10685
K. V. V. Girish, Srikanth Konjeti, Jithendra Vepa
Speech emotion recognition (SER) is useful in many applications and has been approached with signal processing techniques in the past and, more recently, with deep learning techniques. Human emotions are complex in nature and can vary widely within an utterance. SER accuracy has improved with various multimodal techniques, but there is still a gap in understanding model behaviour and expressing these complex emotions in a human-interpretable form. In this work, we propose and define interpretability measures, represented as a Human Level Indicator Matrix for an utterance, and showcase their effectiveness in both qualitative and quantitative terms. Word-level interpretability is obtained using attention-based sequence modelling of self-supervised speech and text pre-trained embeddings. Prosody features are also combined with the proposed model to assess the efficacy at the word and utterance levels. We provide insights into sub-utterance-level emotion predictions for complex utterances where the emotion class changes within the utterance. We evaluate the model and provide the interpretations on the publicly available IEMOCAP dataset.
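The Human Level Indicator Matrix is the paper's own construct; as a loose illustration of the underlying mechanism, attention weights from a simple attention-pooling layer over pre-trained word-level embeddings can be read as per-word saliency scores. Dimensions, the number of emotion classes, and the random stand-in embeddings below are assumptions.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Attention pooling over a sequence of pre-trained embeddings; the attention
    weights double as per-token (e.g. per-word) saliency scores."""
    def __init__(self, dim=768, n_emotions=4):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.classifier = nn.Linear(dim, n_emotions)

    def forward(self, embeddings):                              # (batch, tokens, dim)
        alpha = torch.softmax(self.score(embeddings), dim=1)    # (batch, tokens, 1)
        pooled = (alpha * embeddings).sum(dim=1)                # (batch, dim)
        return self.classifier(pooled), alpha.squeeze(-1)

# word-aligned embeddings from a pre-trained text (or speech) encoder; random stand-ins here
word_embeddings = torch.randn(1, 12, 768)                       # one utterance, 12 words
logits, word_weights = AttentivePooling()(word_embeddings)
print(word_weights)                                             # larger weight ~ more influential word
```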
{"title":"Interpretabilty of Speech Emotion Recognition modelled using Self-Supervised Speech and Text Pre-Trained Embeddings","authors":"K. V. V. Girish, Srikanth Konjeti, Jithendra Vepa","doi":"10.21437/interspeech.2022-10685","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10685","url":null,"abstract":"Speech emotion recognition (SER) is useful in many applications and is approached using signal processing techniques in the past and deep learning techniques recently. Human emotions are complex in nature and can vary widely within an utterance. The SER accuracy has improved using various multimodal techniques but there is still some gap in understanding the model behaviour and expressing these complex emotions in a human interpretable form. In this work, we propose and define interpretability measures represented as a Human Level Indicator Matrix for an utterance and showcase it’s effective-ness in both qualitative and quantitative terms. A word level interpretability is presented using an attention based sequence modelling of self-supervised speech and text pre-trained embeddings. Prosody features are also combined with the proposed model to see the efficacy at the word and utterance level. We provide insights into sub-utterance level emotion predictions for complex utterances where the emotion classes change within the utterance. We evaluate the model and provide the interpretations on the publicly available IEMOCAP dataset.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4496-4500"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49559750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 1
The CLIPS System for 2022 Spoofing-Aware Speaker Verification Challenge
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-320
Jucai Lin, Tingwei Chen, Jingbiao Huang, Ruidong Fang, Jun Yin, Yuanping Yin, W. Shi, Wei Huang, Yapeng Mao
In this paper, a spoofing-aware speaker verification (SASV) system that integrates an automatic speaker verification (ASV) system and a countermeasure (CM) system is developed. Firstly, a modified re-parameterized VGG (ARepVGG) module is utilized to extract a high-level representation from the multi-scale feature learned from the raw waveform through sinc-filters, and then a spectro-temporal graph attention network is used to learn the final decision on whether the audio is spoofed or not. Secondly, a new network inspired by Max-Feature-Map (MFM) layers is constructed to fine-tune the CM system while keeping the ASV system fixed. Our proposed SASV system significantly improves the SASV equal error rate (SASV-EER) from 6.73% to 1.36% on the evaluation dataset and from 4.85% to 0.98% on the development dataset in the 2022 Spoofing-Aware Speaker Verification Challenge (2022 SASV).
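SASV-EER is an equal error rate computed over the challenge's combined target/non-target/spoof trial protocol; the official scoring tool should be used to reproduce the reported numbers. A generic EER computation, which is the core of the metric, can be sketched as follows (the trial scores below are synthetic).

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where the false-acceptance and false-rejection
    rates meet. labels: 1 for target (bona fide) trials, 0 for non-target or
    spoofed trials; scores: higher means 'more likely target'."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))       # threshold where the two rates cross
    return (fpr[idx] + fnr[idx]) / 2.0

# synthetic trial scores: a well-separated system yields a low EER
rng = np.random.default_rng(0)
labels = np.r_[np.ones(500), np.zeros(500)]
scores = np.r_[rng.normal(2.0, 1.0, 500), rng.normal(-2.0, 1.0, 500)]
print(f"EER = {100 * equal_error_rate(labels, scores):.2f}%")
```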
{"title":"The CLIPS System for 2022 Spoofing-Aware Speaker Verification Challenge","authors":"Jucai Lin, Tingwei Chen, Jingbiao Huang, Ruidong Fang, Jun Yin, Yuanping Yin, W. Shi, Wei Huang, Yapeng Mao","doi":"10.21437/interspeech.2022-320","DOIUrl":"https://doi.org/10.21437/interspeech.2022-320","url":null,"abstract":"In this paper, a spoofing-aware speaker verification (SASV) system that integrates the automatic speaker verification (ASV) system and countermeasure (CM) system is developed. Firstly, a modified re-parameterized VGG (ARepVGG) module is utilized to extract high-level representation from the multi-scale feature that learns from the raw waveform though sinc-filters, and then a spectra-temporal graph attention network is used to learn the final decision information whether the audio is spoofed or not. Secondly, a new network that is inspired from the Max-Feature-Map (MFM) layers is constructed to fine-tune the CM system while keeping the ASV system fixed. Our proposed SASV system significantly improves the SASV equal error rate (SASV-EER) from 6.73 % to 1.36 % on the evaluation dataset and 4.85 % to 0.98 % on the development dataset in the 2022 Spoofing-Aware Speaker Verification Challenge(2022 SASV).","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4367-4370"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42514937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 2
Single-channel speech enhancement using Graph Fourier Transform
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-740
Chenhui Zhang, Xiang Pan
This paper combines the Graph Fourier Transform (GFT) with U-Net and proposes a deep neural network (DNN) named G-Unet for single-channel speech enhancement. The GFT is applied to the speech data to create the inputs of the U-Net. The GFT outputs are combined with the mask estimated by the U-Net in the time-graph (T-G) domain to reconstruct enhanced speech in the time domain via the inverse GFT. G-Unet outperforms the combination of the Short-Time Fourier Transform (STFT) and a magnitude-estimation U-Net in improving speech quality and de-reverberation, and outperforms the combination of the STFT and a complex U-Net in improving speech quality in some cases, as validated by tests on the LibriSpeech and NOISEX92 datasets.
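The graph construction used by G-Unet is specific to the paper; the sketch below only illustrates what the GFT/inverse-GFT pair is mathematically, using a simple path graph over signal samples as an assumed example.

```python
import numpy as np

# Graph Fourier Transform on a path graph over signal samples:
# L = D - W is the graph Laplacian, its eigenvectors U form the graph Fourier basis,
# the forward transform is U.T @ x and the inverse is U @ x_hat.
n = 8
W = np.zeros((n, n))
for i in range(n - 1):                      # path graph: each sample connected to its neighbour
    W[i, i + 1] = W[i + 1, i] = 1.0
L = np.diag(W.sum(axis=1)) - W              # combinatorial Laplacian
eigvals, U = np.linalg.eigh(L)              # ascending eigenvalues ~ graph frequencies

x = np.random.randn(n)                      # a short signal frame treated as a graph signal
x_hat = U.T @ x                             # forward GFT (time domain -> graph-frequency domain)
x_rec = U @ x_hat                           # inverse GFT
assert np.allclose(x, x_rec)
```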
{"title":"Single-channel speech enhancement using Graph Fourier Transform","authors":"Chenhui Zhang, Xiang Pan","doi":"10.21437/interspeech.2022-740","DOIUrl":"https://doi.org/10.21437/interspeech.2022-740","url":null,"abstract":"This paper presents combination of Graph Fourier Transform (GFT) and U-net, proposes a deep neural network (DNN) named G-Unet for single channel speech enhancement. GFT is carried out over speech data for creating inputs of U-net. The GFT outputs are combined with the mask estimated by Unet in time-graph (T-G) domain to reconstruct enhanced speech in time domain by Inverse GFT. The G-Unet outperforms the combination of Short time Fourier Transform (STFT) and magnitude estimation U-net in improving speech quality and de-reverberation, and outperforms the combination of STFT and complex U-net in improving speech quality in some cases, which is validated by testing on LibriSpeech and NOISEX92 dataset.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"946-950"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45499958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 0
Streamable Speech Representation Disentanglement and Multi-Level Prosody Modeling for Live One-Shot Voice Conversion
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10277
Haoquan Yang, Liqun Deng, Y. Yeung, Nianzu Zheng, Yong Xu
This paper tackles the challenge of "live" one-shot voice conversion (VC), which performs conversion across arbitrary speakers in a streaming manner while retaining high intelligibility and naturalness. We propose a hybrid unsupervised and supervised learning based VC model with a two-stage model training strategy. Specifically, we first employ an unsupervised disentanglement framework to separate speech representations of different granularities. Experimental results demonstrate that our proposed method achieves speech naturalness, intelligibility and speaker similarity comparable with offline VC solutions, with sufficient efficiency for practical real-time applications. Audio samples are available online for demonstration.
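The disentanglement model itself is beyond an abstract-level sketch, but the "streaming" aspect boils down to chunk-wise processing with a cached left context so that streamed and offline outputs match. Below is a minimal, self-contained illustration with a single causal convolution; chunk and kernel sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# Chunk-wise streaming with a causal convolution and a cached left context:
# the concatenation of streamed outputs matches the offline (whole-signal) output.
torch.manual_seed(0)
conv = nn.Conv1d(1, 1, kernel_size=5, bias=False)
x = torch.randn(1, 1, 1600)                         # e.g. 100 ms of 16 kHz audio

offline = conv(nn.functional.pad(x, (4, 0)))        # causal: pad kernel_size - 1 on the left

cache = torch.zeros(1, 1, 4)                        # left-context cache
streamed = []
for chunk in torch.split(x, 400, dim=-1):           # 25 ms chunks arriving one by one
    inp = torch.cat([cache, chunk], dim=-1)
    streamed.append(conv(inp))
    cache = inp[..., -4:]                           # carry the last samples to the next chunk
streamed = torch.cat(streamed, dim=-1)
assert torch.allclose(offline, streamed, atol=1e-6)
```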
{"title":"Streamable Speech Representation Disentanglement and Multi-Level Prosody Modeling for Live One-Shot Voice Conversion","authors":"Haoquan Yang, Liqun Deng, Y. Yeung, Nianzu Zheng, Yong Xu","doi":"10.21437/interspeech.2022-10277","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10277","url":null,"abstract":"This paper takes efforts to tackle the challenge of “live” oneshot voice conversion (VC), which performs conversion across arbitrary speakers in a streaming way while retaining high intelligibility and naturalness. We propose a hybrid unsupervised and supervised learning based VC model with a two-stage model training strategy. Specially, we first employ an unsupervised disentanglement framework to separate speech representations of different granularities Experimental results demonstrate that our proposed method achieves comparable performance on speech naturalness, intelligibility and speaker similarity with offline VC solutions, with sufficient efficiency for practical real-time applications. Audio samples are available online for demonstration.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2578-2582"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45650340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 4
A Unified System for Voice Cloning and Voice Conversion through Diffusion Probabilistic Modeling
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10879
T. Sadekova, Vladimir Gogoryan, Ivan Vovk, Vadim Popov, M. Kudinov, Jiansheng Wei
Text-to-speech and voice conversion are two common speech generation tasks typically solved with different models. In this paper, we present a novel approach to voice cloning and any-to-any voice conversion that relies on a single diffusion probabilistic model with two encoders, each operating on its own input domain, and a shared decoder. Extensive human evaluation shows that the proposed model copies a target speaker's voice by means of speaker adaptation better than other known multimodal systems of this kind, and the quality of the speech synthesized by our system in both voice cloning and voice conversion modes is comparable with that of recently proposed algorithms for the corresponding single tasks. Besides, it takes as little as 3 minutes of GPU time to adapt our model to a new speaker with only 15 seconds of untranscribed audio, which makes it attractive for practical applications.
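The paper builds on a score-based diffusion formulation with separate content and speaker encoders feeding a shared decoder; the snippet below is only a generic denoising-diffusion training step (a noise-prediction loss on conditioned acoustic frames), with the network, feature sizes and noise schedule as stand-in assumptions.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # assumed linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

class NoisePredictor(nn.Module):
    """Stand-in for eps_theta(x_t, t, condition); a real decoder would be far larger."""
    def __init__(self, dim=80, cond_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + cond_dim + 1, 512), nn.ReLU(),
                                 nn.Linear(512, dim))
    def forward(self, x_t, t, cond):
        t_feat = (t.float() / T).unsqueeze(-1)       # crude timestep embedding
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

model = NoisePredictor()
x0 = torch.randn(16, 80)                            # clean acoustic frames (e.g. mel bins)
cond = torch.randn(16, 256)                         # content + target-speaker conditioning
t = torch.randint(0, T, (16,))
eps = torch.randn_like(x0)
a = alpha_bar[t].unsqueeze(-1)
x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps          # forward (noising) process
loss = nn.functional.mse_loss(model(x_t, t, cond), eps)   # noise-prediction objective
loss.backward()
```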
{"title":"A Unified System for Voice Cloning and Voice Conversion through Diffusion Probabilistic Modeling","authors":"T. Sadekova, Vladimir Gogoryan, Ivan Vovk, Vadim Popov, M. Kudinov, Jiansheng Wei","doi":"10.21437/interspeech.2022-10879","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10879","url":null,"abstract":"Text-to-speech and voice conversion are two common speech generation tasks typically solved using different models. In this paper, we present a novel approach to voice cloning and any-to-any voice conversion relying on a single diffusion probabilistic model with two encoders each operating on its input domain and a shared decoder. Extensive human evaluation shows that the proposed model can copy a target speaker’s voice by means of speaker adaptation better than other known multimodal systems of such kind and the quality of the speech synthesized by our system in both voice cloning and voice conversion modes is comparable with that of recently proposed algorithms for the corresponding single tasks. Besides, it takes as few as 3 minutes of GPU time to adapt our model to a new speaker with only 15 seconds of untranscribed audio which makes it attractive for practical applications.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3003-3007"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45997376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 5
SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-378
Yuhang He, A. Markham
A fundamental task for an agent that seeks to understand an environment acoustically is to detect sound source location (such as direction of arrival (DoA)) and semantic label. It is a challenging task: firstly, sound sources overlap in time, frequency and space; secondly, while semantics are largely conveyed through time-frequency energy (amplitude) contours, DoA is encoded in the inter-channel phase difference; lastly, although the microphone sensors are spatially sparse, the recorded sound waveform is temporally dense due to the high sampling rate. Existing methods for predicting DoA mostly depend on pre-extracted 2D acoustic features such as GCC-PHAT and Mel-spectrograms so as to benefit from the success of mature 2D image based deep neural networks. We instead propose a novel end-to-end trainable framework, named SoundDoA, that is capable of learning sound source DoA and semantics directly from raw sound waveforms. We first use a learnable front-end filter bank to dynamically encode sound source semantics and DoA-relevant features into a compact representation. A backbone network consisting of two identical sub-networks with a layerwise communication strategy is then proposed to further learn the semantic label and DoA both separately and jointly. Finally, a permutation-invariant multi-track head is added to regress DoA and classify the semantic label. Extensive experimental results on the DCASE 2020 sound event localization and detection (SELD) dataset demonstrate the superiority of SoundDoA over other existing methods.
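The SoundDoA front end and backbone are detailed in the paper; the sketch below only illustrates the last ingredient named in the abstract, a permutation-invariant loss over a fixed number of output tracks that jointly covers DoA regression and semantic classification. Track count, DoA dimensionality and class count are assumptions.

```python
import itertools
import torch
import torch.nn as nn

def pit_loss(pred_doa, pred_cls, true_doa, true_cls):
    """Permutation-invariant loss over a fixed number of output tracks: try every
    assignment of predicted tracks to reference sources and keep the cheapest one."""
    n_tracks = pred_doa.shape[1]
    best = None
    for perm in map(list, itertools.permutations(range(n_tracks))):
        loss = (nn.functional.mse_loss(pred_doa[:, perm], true_doa)
                + nn.functional.cross_entropy(pred_cls[:, perm].flatten(0, 1),
                                              true_cls.flatten()))
        best = loss if best is None else torch.minimum(best, loss)
    return best

# 2 output tracks, DoA as 3-D unit vectors, 13 sound-event classes (all assumptions)
pred_doa = torch.randn(4, 2, 3, requires_grad=True)
pred_cls = torch.randn(4, 2, 13, requires_grad=True)
true_doa = torch.nn.functional.normalize(torch.randn(4, 2, 3), dim=-1)
true_cls = torch.randint(0, 13, (4, 2))
print(pit_loss(pred_doa, pred_cls, true_doa, true_cls))
```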
{"title":"SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms","authors":"Yuhang He, A. Markham","doi":"10.21437/interspeech.2022-378","DOIUrl":"https://doi.org/10.21437/interspeech.2022-378","url":null,"abstract":"A fundamental task for an agent to understand an environment acoustically is to detect sound source location (like direction of arrival (DoA)) and semantic label. It is a challenging task: firstly, sound sources overlap in time, frequency and space; secondly, while semantics are largely conveyed through time-frequency energy (amplitude) contours, DoA is encoded in inter-channel phase difference; lastly, although the number of microphone sensors are sparse, recorded sound waveform is temporally dense due to the high sampling rates. Existing methods for predicting DoA mostly depend on pre-extracted 2D acoustic feature such as GCC-PHAT and Mel-spectrograms so as to benefit from the success of mature 2D image based deep neural networks. We instead propose a novel end-to-end trainable framework, named SoundDoA , that is capable of learning sound source DoA and semantics directly from sound raw waveforms. We first use a learnable front-end filter bank to dynamically encode sound source semantics and DoA relevant features into a compact representation. A backbone network consisting of two identical sub-networks with layerwise communication strategy is then proposed to further learn semantic label and DoA both separately and jointly. Finally, a permutation invariant multi-track head is added to regress DoA and classify semantic label. Extensive experimental results on DCASE 2020 sound event detection and localization dataset (SELD) demonstrate the superiority of SoundDoA , when comparing with other existing methods.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2408-2412"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47392486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 3
Overlapped Frequency-Distributed Network: Frequency-Aware Voice Spoofing Countermeasure
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-657
Sunmook Choi, Il-Youp Kwak, Seungsang Oh
Numerous IT companies around the world are developing and deploying artificial voice assistants in their products, but these are still vulnerable to spoofing attacks. Since 2015, the Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof) has been held every two years to encourage the design of systems that can detect spoofing attacks. In this paper, we focus on developing spoofing countermeasure systems based mainly on Convolutional Neural Networks (CNNs). However, CNNs have a translation-invariance property, which may cause loss of frequency information when a spectrogram is used as input. Hence, we propose models that split the input along the frequency axis: 1) an Overlapped Frequency-Distributed (OFD) model and 2) a Non-overlapped Frequency-Distributed (Non-OFD) model. Using the ASVspoof 2019 dataset, we measured their performance with two different activations, ReLU and Max-Feature-Map (MFM). The best-performing model on the LA dataset is the Non-OFD model with ReLU, which achieves an equal error rate (EER) of 1.35%, and the best-performing model on the PA dataset is the OFD model with MFM, which achieves an EER of 0.35%.
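As a rough illustration of the two building blocks named here, the sketch below implements the Max-Feature-Map activation (as in LightCNN) and a frequency split of the input spectrogram; channel counts and band boundaries are assumptions, and the paper's actual branch design differs.

```python
import torch
import torch.nn as nn

class MaxFeatureMap(nn.Module):
    """MFM activation (as in LightCNN): split the channel dimension in half and
    take the element-wise maximum, halving the number of channels."""
    def forward(self, x):                           # x: (batch, channels, freq, time)
        a, b = torch.chunk(x, 2, dim=1)
        return torch.maximum(a, b)

# Frequency-distributed idea in miniature: split the spectrogram along the frequency
# axis and run a CNN branch per band (one shared branch here; the paper uses separate ones).
spec = torch.randn(8, 1, 64, 200)                   # (batch, 1, freq bins, frames)
low, high = spec[:, :, :32], spec[:, :, 32:]        # non-overlapped split; an OFD split would share bins
branch = nn.Sequential(nn.Conv2d(1, 32, kernel_size=3, padding=1), MaxFeatureMap())
out_low, out_high = branch(low), branch(high)       # each: (8, 16, 32, 200)
```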
{"title":"Overlapped Frequency-Distributed Network: Frequency-Aware Voice Spoofing Countermeasure","authors":"Sunmook Choi, Il-Youp Kwak, Seungsang Oh","doi":"10.21437/interspeech.2022-657","DOIUrl":"https://doi.org/10.21437/interspeech.2022-657","url":null,"abstract":"Numerous IT companies around the world are developing and deploying artificial voice assistants via their products, but they are still vulnerable to spoofing attacks. Since 2015, the competition “Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof)” has been held every two years to encourage people to design systems that can detect spoofing attacks. In this paper, we focused on developing spoofing countermeasure systems mainly based on Convolutional Neural Networks (CNNs). However, CNNs have translation invariant property, which may cause loss of frequency information when a spectrogram is used as input. Hence, we pro-pose models which split inputs along the frequency axis: 1) Overlapped Frequency-Distributed (OFD) model and 2) Non-overlapped Frequency-Distributed (Non-OFD) model. Using ASVspoof 2019 dataset, we measured their performances with two different activations; ReLU and Max feature map (MFM). The best performing model on LA dataset is the Non-OFD model with ReLU which achieved an equal error rate (EER) of 1.35%, and the best performing model on PA dataset is the OFD model with MFM which achieved an EER of 0.35%.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3558-3562"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47675680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 3
How do our eyebrows respond to masks and whispering? The case of Persians
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10867
Nasim Mahdinazhad Sardhaei, Marzena Żygis, H. Sharifzadeh
Whispering is one of the mechanisms of human communication for conveying linguistic information. Due to the lack of vocal-fold vibration, whispering differs acoustically from voiced speech in the absence of fundamental frequency, which is one of the main prosodic correlates of intonation. This study addresses the importance of facial cues with respect to the acoustic cues of intonation. Specifically, we aim to probe how eyebrow velocity and furrowing change when people whisper and wear face masks, and also when they are expected to produce a prosodic modulation, as is the case in polar questions with rising intonation. To this end, we ran an experiment with 10 Persian speakers. The results show a greater mean speed when speakers whisper, indicating a compensation effect for the lack of F0 in whispering. We also found a more pronounced movement of both eyebrows when the speakers wear a mask. Finally, our results reveal greater eyebrow motion in questions, suggesting that the question is a more marked utterance type than the statement. No significant effect was found for eyebrow furrowing. However, eyebrow movements were positively correlated with eyebrow widening, suggesting a mutual link between these two movement types.
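The abstract does not spell out how eyebrow velocity is measured; as an assumed, minimal illustration, the speed of a tracked eyebrow landmark can be obtained by differentiating its position trajectory over time (the frame rate and the synthetic trajectory below are placeholders).

```python
import numpy as np

fps = 50.0                                   # assumed video frame rate
t = np.arange(0, 2.0, 1.0 / fps)
y = 1.5 * np.sin(2 * np.pi * 1.2 * t)        # synthetic vertical eyebrow-landmark track (mm)
velocity = np.gradient(y, 1.0 / fps)         # first derivative of position: mm per second
print(f"mean eyebrow speed: {np.abs(velocity).mean():.2f} mm/s")
```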
{"title":"How do our eyebrows respond to masks and whispering? The case of Persians","authors":"Nasim Mahdinazhad Sardhaei, Marzena Żygis, H. Sharifzadeh","doi":"10.21437/interspeech.2022-10867","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10867","url":null,"abstract":"Whispering is one of the mechanisms of human communication to convey linguistic information. Due to the lack of vocal fold vibration, whispering acoustically differs from the voiced speech in the absence of fundamental frequency which is one of the main prosodic correlates of intonation. This study addresses the importance of facial cues with respect to acoustic cues of intonation. Specifically, we aim to probe how eyebrow velocity and furrowing change when people whisper and wear face masks, also, when they are supposed to produce a prosodic modulation as it is the case in polar questions with rising intonation. To this end, we run an experiment with 10 Persian speakers. The results show the greater mean speed when speakers whisper indicating a compensation effect for the lack of F0 in whispering. We also found a more pronounced movement of both eyebrows when the speakers wear a mask. Finally, our results reveal greater eyebrow motions in questions suggesting the question is a more marked utterance type than a statement. No significant effect of eyebrow furrowing was found. However, eyebrow movements were positively correlated with the eyebrow widening suggesting a mutual link between these two movement types.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2023-2027"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47894190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 0
End-to-End Joint Modeling of Conversation History-Dependent and Independent ASR Systems with Multi-History Training
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11357
Ryo Masumura, Yoshihiro Yamazaki, Saki Mizuno, Naoki Makishima, Mana Ihori, Mihiro Uchida, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Shota Orihashi, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando
This paper proposes end-to-end joint modeling of conversation history-dependent and independent automatic speech recognition (ASR) systems. Conversation histories are available in ASR systems such as meeting transcription applications but not in those such as voice search applications. So far, these two types of ASR system have been constructed individually with different models, which is inefficient for each application. In fact, conventional conversation history-dependent ASR systems can perform both history-dependent and history-independent processing; however, their performance is inferior to that of history-independent ASR systems. This is because the model architecture and training criterion of conventional conversation history-dependent ASR systems are specialized for the case where conversational histories are available. To address this problem, our proposed end-to-end joint modeling method uses a cross-modal transformer-based architecture that can flexibly switch between using and not using conversation histories. In addition, we propose multi-history training, which simultaneously utilizes a dataset without histories and datasets with various histories, to effectively improve both types of ASR processing within a unified architecture. Experiments on Japanese ASR tasks demonstrate the effectiveness of the proposed method: multi-history training produces an ASR model that is robust across a variety of conversational contexts as well as none, and the proposed E2E joint model outperforms conventional E2E-ASR systems in both history-dependent and history-independent processing.
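The authors' cross-modal transformer is specified in the paper; the sketch below only shows the switching idea in miniature, an encoder that fuses embedded conversation history via cross-attention when it is provided and runs unchanged otherwise. Under multi-history-style training one would sample batches both with and without the `history` argument. All sizes are assumptions.

```python
import torch
import torch.nn as nn

class HistorySwitchableEncoder(nn.Module):
    """Toy ASR encoder that runs with or without conversation-history context:
    when history embeddings are supplied they are fused via cross-attention,
    otherwise the speech encoding is used as is."""
    def __init__(self, d_model=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.speech_enc = nn.TransformerEncoder(layer, num_layers=2)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, speech_feats, history=None):
        h = self.speech_enc(speech_feats)
        if history is not None:                     # history-dependent path
            fused, _ = self.cross_attn(h, history, history)
            h = h + fused
        return h                                     # would feed an attention decoder / CTC head

enc = HistorySwitchableEncoder()
speech = torch.randn(2, 120, 256)                    # (batch, frames, feature dim)
history = torch.randn(2, 40, 256)                    # embedded previous utterances
out_dependent = enc(speech, history)                 # history-dependent processing
out_independent = enc(speech)                        # history-independent processing
```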
{"title":"End-to-End Joint Modeling of Conversation History-Dependent and Independent ASR Systems with Multi-History Training","authors":"Ryo Masumura, Yoshihiro Yamazaki, Saki Mizuno, Naoki Makishima, Mana Ihori, Mihiro Uchida, Hiroshi Sato, Tomohiro Tanaka, Akihiko Takashima, Satoshi Suzuki, Shota Orihashi, Takafumi Moriya, Nobukatsu Hojo, Atsushi Ando","doi":"10.21437/interspeech.2022-11357","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11357","url":null,"abstract":"This paper proposes end-to-end joint modeling of conversation history-dependent and independent automatic speech recognition (ASR) systems. Conversation histories are available in ASR systems such as meeting transcription applications but not available in those such as voice search applications. So far, these two ASR systems have been individually constructed using different models, but this is inefficient for each application. In fact, conventional conversation history-dependent ASR systems can perform both history-dependent and independent processing. However, their performance is inferior to history-independent ASR systems. This is because the model architecture and its training criterion in the conventional conversation history-dependent ASR systems are specialized in the case where conversational histories are available. To address this problem, our proposed end-to-end joint modeling method uses a crossmodal transformer-based architecture that can flexibly switch to use the conversation histories or not. In addition, we propose multi-history training that simultaneously utilizes a dataset without histories and datasets with various histories to effectively improve both types of ASR processing by introduc-ing unified architecture. Experiments on Japanese ASR tasks demonstrate the effectiveness of the proposed method. multi-history training which can produce a robust ASR model against both a variety of conversational contexts and none. Experimental results showed that the proposed E2E joint model provides superior performance in both history-dependent and independent ASR processing compared with conventional E2E-ASR systems.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3218-3222"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47910133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 0