
Interspeech: Latest Publications

Norm-constrained Score-level Ensemble for Spoofing Aware Speaker Verification
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-470
Peng Zhang, Peng Hu, Xueliang Zhang
In this paper, we present the Elevoc systems submitted to the Spoofing Aware Speaker Verification Challenge (SASVC) 2022. Our submissions focus on bridging the gap between automatic speaker verification (ASV) and countermeasure (CM) systems. We investigate a general and efficient norm-constrained score-level ensemble method which jointly processes the scores extracted from the ASV and CM subsystems, improving robustness to both zero-effort imposters and spoofing attacks. Furthermore, we show that the ensemble system can provide better performance when both ASV and CM subsystems are optimized. Experimental results show that our primary system yields 0.45% SV-EER, 0.26% SPF-EER and 0.37% SASV-EER, and obtains more than 96.08%, 66.67% and 94.19% relative improvements over the best performing baseline systems on the SASVC 2022 evaluation set. All of our code and pre-trained model weights are publicly available and reproducible.
{"title":"Norm-constrained Score-level Ensemble for Spoofing Aware Speaker Verification","authors":"Peng Zhang, Peng Hu, Xueliang Zhang","doi":"10.21437/interspeech.2022-470","DOIUrl":"https://doi.org/10.21437/interspeech.2022-470","url":null,"abstract":"In this paper, we present the Elevoc systems submitted to the Spoofing Aware Speaker Verification Challenge (SASVC) 2022. Our submissions focus on bridge the gap between the automatic speaker verification (ASV) and countermeasure (CM) systems. We investigate a general and efficient norm-constrained score-level ensemble method which jointly processes the scores extracted from ASV and CM subsystems, improving robustness to both zero-effect imposters and spoof-ing attacks. Furthermore, we explore that the ensemble system can provide better performance when both ASV and CM subsystems are optimized. Experimental results show that our primary system yields 0.45% SV-EER, 0.26% SPF-EER and 0.37% SASV-EER, and obtains more than 96.08%, 66.67% and 94.19% relative improvements over the best performing baseline systems on the SASVC 2022 evaluation set. All of our code and pre-trained models weights are publicly available and reproducible 1 .","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4371-4375"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44747898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
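As a rough illustration of the score-level idea described above (not the authors' exact formulation), the sketch below rescales the ASV and CM score vectors to a common L2 norm before taking a weighted sum, so that neither subsystem dominates the fused SASV score. The function name, the weighting parameter `alpha`, and the unit-norm choice are assumptions.

```python
import numpy as np

def norm_constrained_ensemble(asv_scores, cm_scores, alpha=0.5):
    """Rescale each subsystem's score vector to unit L2 norm, then take a
    weighted sum. The name, `alpha`, and the unit-norm constraint are
    assumptions, not the authors' exact recipe."""
    asv = np.asarray(asv_scores, dtype=float)
    cm = np.asarray(cm_scores, dtype=float)
    asv = asv / (np.linalg.norm(asv) + 1e-12)   # norm constraint on ASV scores
    cm = cm / (np.linalg.norm(cm) + 1e-12)      # norm constraint on CM scores
    return alpha * asv + (1.0 - alpha) * cm     # fused SASV scores

# toy usage: three trials scored by both subsystems
print(norm_constrained_ensemble([1.2, -0.4, 0.9], [0.8, -1.1, 0.5]))
```

In practice the weight would be tuned on the development set; other norm constraints (e.g., per-trial min-max scaling) fit the same pattern.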
CNN-based Audio Event Recognition for Automated Violence Classification and Rating for Prime Video Content
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10053
Mayank Sharma, Tarun Gupta, Kenny Qiu, Xiang Hao, Raffay Hamid
Automated violence detection in Digital Entertainment Content (DEC) uses computer vision and natural language processing methods on visual and textual modalities. These methods face difficulty in detecting violence due to the diversity, ambiguity and multilingual nature of the data. Hence, we introduce an audio-based method to augment existing methods for violence and rating classification. We develop a generic Audio Event Detector model (AED) using open-source and Prime Video proprietary corpora, which is used as a feature extractor. Our feature set includes a global semantic embedding and sparse local audio event probabilities extracted from the AED. We demonstrate that this global-local feature view of audio yields the best detection performance. Next, we present a multi-modal detector by fusing several learners across modalities. Our training and evaluation set is also at least an order of magnitude larger than in previous literature. Furthermore, we show that (a) the audio-based approach outperforms the other baselines, (b) the benefit of the audio model is more pronounced on global multilingual data than on English data, and (c) the multi-modal model achieves 63% rating accuracy and can backfill the top 90% Stream Weighted Coverage titles in the PV catalog with 88% coverage at 91% accuracy.
{"title":"CNN-based Audio Event Recognition for Automated Violence Classification and Rating for Prime Video Content","authors":"Mayank Sharma, Tarun Gupta, Kenny Qiu, Xiang Hao, Raffay Hamid","doi":"10.21437/interspeech.2022-10053","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10053","url":null,"abstract":"Automated violence detection in Digital Entertainment Content (DEC) uses computer vision and natural language processing methods on visual and textual modalities. These methods face difficulty in detecting violence due to diversity, ambiguity and multilingual nature of data. Hence, we introduce a method based on audio to augment existing methods for violence and rating classification. We develop a generic Audio Event Detector model (AED) using open-source and Prime Video proprietary corpora which is used as a feature extractor. Our feature set in-cludes global semantic embedding and sparse local audio event probabilities extracted from AED. We demonstrate that a global-local feature view of audio results in best detection performance. Next, we present a multi-modal detector by fusing several learners across modalities. Our training and evaluation set is also at least an order of magnitude larger than previous literature. Furthermore, we show that, (a) audio based approach results in superior performance compared to other baselines, (b) benefit due to audio model is more pronounced on global multi-lingual data compared to English data and (c) the multi-modal model results in 63% rating accuracy and provides the ability to backfill top 90% Stream Weighted Coverage titles in PV catalog with 88% coverage at 91% accuracy.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2758-2762"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41753715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
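A minimal sketch of the "global-local feature view" mentioned in the abstract, under assumptions: the AED backbone yields per-frame embeddings and per-frame event probabilities, the global part is a mean-pooled embedding, and the local part keeps only the top-k clip-level event probabilities. All dimensions and the `top_k` parameter are illustrative.

```python
import numpy as np

def global_local_features(frame_embeddings, frame_event_probs, top_k=5):
    """Concatenate a clip-level semantic embedding with sparse clip-level
    audio-event probabilities (assumed construction, not the paper's exact one)."""
    emb = np.asarray(frame_embeddings)        # (T, D) frame embeddings from an AED backbone
    probs = np.asarray(frame_event_probs)     # (T, C) per-frame event probabilities
    global_emb = emb.mean(axis=0)             # (D,) global semantic embedding
    clip_probs = probs.max(axis=0)            # (C,) clip-level event probabilities
    sparse = np.zeros_like(clip_probs)
    top = np.argsort(clip_probs)[-top_k:]     # keep only the most active events
    sparse[top] = clip_probs[top]
    return np.concatenate([global_emb, sparse])

feat = global_local_features(np.random.randn(100, 128), np.random.rand(100, 50))
print(feat.shape)  # (178,)
```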
Attentive Feature Fusion for Robust Speaker Verification
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-478
Bei Liu, Zhengyang Chen, Y. Qian
As the most widely used technique, deep speaker embedding learning has recently become predominant in the speaker verification task. This approach utilizes deep neural networks to extract fixed-dimension embedding vectors that represent different speaker identities. Two network architectures, ResNet and ECAPA-TDNN, have been commonly adopted in prior studies and achieve state-of-the-art performance. One omnipresent component, feature fusion, plays an important role in both of them. For example, shortcut connections in ResNet fuse the identity mapping of a residual block's input with its output. ECAPA-TDNN employs multi-layer feature aggregation to integrate shallow feature maps with deep ones. Traditional feature fusion is often implemented via simple operations, such as element-wise addition or concatenation. In this paper, we propose a more effective feature fusion scheme, namely Attentive Feature Fusion (AFF), to render dynamic weighted fusion of different features. It utilizes attention modules to learn fusion weights based on the feature content. Additionally, two fusion strategies are designed: sequential fusion and parallel fusion. Experiments on the Voxceleb dataset show that our proposed attentive feature fusion scheme yields up to 40% relative improvement over the baseline systems.
{"title":"Attentive Feature Fusion for Robust Speaker Verification","authors":"Bei Liu, Zhengyang Chen, Y. Qian","doi":"10.21437/interspeech.2022-478","DOIUrl":"https://doi.org/10.21437/interspeech.2022-478","url":null,"abstract":"As the most widely used technique, deep speaker embedding learning has become predominant in speaker verification task recently. This approach utilizes deep neural networks to extract fixed dimension embedding vectors which represent different speaker identities. Two network architectures such as ResNet and ECAPA-TDNN have been commonly adopted in prior studies and achieved the state-of-the-art performance. One omnipresent part, feature fusion, plays an important role in both of them. For example, shortcut connections are designed to fuse the identity mapping of inputs and outputs of residual blocks in ResNet. ECAPA-TDNN employs the multi-layer feature aggregation to integrate shallow feature maps with deep ones. Traditional feature fusion is often implemented via simple operations, such as element-wise addition or concatena-tion. In this paper, we propose a more effective feature fusion scheme, namely A ttentive F eature F usion (AFF), to render dynamic weighted fusion of different features. It utilizes attention modules to learn fusion weights based on the feature contents. Additionally, two fusion strategies are designed: sequential fusion and parallel fusion. Experiments on Voxceleb dataset show that our proposed attentive feature fusion scheme can result in up to 40% relative improvement over the baseline systems.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"286-290"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41894666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
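The sketch below shows one plausible form of attention-based fusion in the spirit of AFF: a small gating network predicts per-channel weights from the two features and mixes them (the "parallel fusion" flavour). The bottleneck ratio, sigmoid gating, and module name are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    """Attention-weighted fusion of two equally sized feature vectors:
    a bottleneck MLP on the summed features predicts a per-channel gate,
    and the output is a convex combination of the two inputs."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x, y):          # x, y: (batch, channels)
        w = self.gate(x + y)          # fusion weights learned from feature content
        return w * x + (1.0 - w) * y  # parallel (weighted) fusion of the two views

fused = AttentiveFusion(256)(torch.randn(8, 256), torch.randn(8, 256))
print(fused.shape)  # torch.Size([8, 256])
```

A sequential variant would apply the same gating block repeatedly along the network depth instead of once at the end.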
Speech Modification for Intelligibility in Cochlear Implant Listeners: Individual Effects of Vowel- and Consonant-Boosting
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11131
Juliana N. Saba, J. Hansen
Previous research has demonstrated techniques that improve automatic speech recognition and speech-in-noise intelligibility for normal-hearing (NH) and cochlear implant (CI) listeners by synthesizing Lombard Effect (LE) speech. In this study, we emulate and evaluate segment-specific modifications based on speech production characteristics observed in natural LE speech in order to improve intelligibility for CI listeners. Two speech processing approaches were designed to modify the representation of vowels, consonants, and their combination using amplitude-based compression techniques in the “electric domain”, i.e., the stimulation sequence delivered to the intracochlear electrode array that corresponds to the acoustic signal. CI listener performance showed no significant difference between the consonant-boosting and the combined consonant- and vowel-boosting strategies, which provided better representation of mid-frequency and high-frequency content corresponding to formant and consonant structure, respectively. Spectral smearing and decreased amplitude variation were also observed, which may have negatively impacted intelligibility. Segmental perturbations using weighted logarithmic and sigmoid compression functions improved the representation of frequency content but disrupted amplitude-based cues, despite comparable speech intelligibility. While an infinite number of acoustic-domain modifications characterize LE speech, this study demonstrates a basic framework for emulating segmental differences in the electric domain.
{"title":"Speech Modification for Intelligibility in Cochlear Implant Listeners: Individual Effects of Vowel- and Consonant-Boosting","authors":"Juliana N. Saba, J. Hansen","doi":"10.21437/interspeech.2022-11131","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11131","url":null,"abstract":"Previous research has demonstrated techniques to improve automatic speech recognition and speech-in-noise intelligibility for normal hearing (NH) and cochlear implant (CI) listeners by synthesizing Lombard Effect (LE) speech. In this study, we emulate and evaluate segment-specific modifications based on speech production characteristics observed in natural LE speech in order to improve intelligibility for CI listeners. Two speech processing approaches were designed to modify representation of vowels, consonants, and the combination using amplitude-based compression techniques in the “ electric domain ” – referring to the stimulation sequence delivered to the intracochlear electrode array that corresponds to the acoustic signal. Performance with CI listeners resulted in no significant difference using consonant-boosting and consonant- and vowel-boosting strategies with better representation of mid-frequency and high-frequency content corresponding to both formant and consonant structure, respectively. Spectral smearing and decreased amplitude variation were also observed which may have negatively impacted intelligibility. Segmental perturbations using a weighted logarithmic and sigmoid compression functions in this study demonstrated the ability to improve representation of frequency content but disrupted amplitude-based cues, regardless of comparable speech intelligibility. While there are an infinite number of acoustic domain modifications characterizing LE speech, this study demonstrates a basic framework for emulating segmental differences in the electric domain.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"5473-5477"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41805320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
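Purely to illustrate the kind of amplitude compression curves the abstract refers to, the sketch below defines a normalized logarithmic curve and a sigmoid curve that could be applied to segment envelopes; the parameters and how these curves map onto electric-domain stimulation levels are assumptions, not the study's implementation.

```python
import numpy as np

def log_compress(env, eps=1e-3):
    """Normalized logarithmic compression of a [0, 1] amplitude envelope:
    small amplitudes are boosted, the maximum maps to 1 (illustrative only)."""
    env = np.clip(np.asarray(env, dtype=float), 0.0, 1.0)
    return np.log1p(env / eps) / np.log1p(1.0 / eps)

def sigmoid_compress(env, slope=8.0, midpoint=0.4):
    """Sigmoid compression of a [0, 1] amplitude envelope: mid-level
    amplitudes are pushed toward full scale (parameters are assumed)."""
    env = np.clip(np.asarray(env, dtype=float), 0.0, 1.0)
    raw = 1.0 / (1.0 + np.exp(-slope * (env - midpoint)))
    lo = 1.0 / (1.0 + np.exp(slope * midpoint))
    hi = 1.0 / (1.0 + np.exp(-slope * (1.0 - midpoint)))
    return (raw - lo) / (hi - lo)   # rescale so 0 -> 0 and 1 -> 1

levels = np.array([0.05, 0.2, 0.5, 0.9])
print(log_compress(levels))
print(sigmoid_compress(levels))
```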
Deep Self-Supervised Learning of Speech Denoising from Noisy Speeches
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-306
Y. Sanada, Takumi Nakagawa, Yuichiro Wada, K. Takanashi, Yuhui Zhang, Kiichi Tokuyama, T. Kanamori, Tomonori Yamada
In the last few years, unsupervised learning methods that take advantage of Deep Neural Networks (DNNs) have been proposed for speech denoising. The reason is that such unsupervised methods are more practical than their supervised counterparts. In our scenario, we are given a set of noisy speech data in which no two recordings share the same clean speech. Our goal is to obtain a denoiser by training a DNN-based model. Using this set, we train the model in two steps: 1) from the noisy speech data, construct another noisy version via our proposed masking technique; 2) minimize our proposed loss, defined from the DNN output and the two noisy speech signals. We evaluate our method using Gaussian and real-world noises in numerical experiments. Our method outperforms the state-of-the-art method on average for both noise types. In addition, we provide a theoretical explanation of why our method can be efficient when the noise has a Gaussian distribution.
{"title":"Deep Self-Supervised Learning of Speech Denoising from Noisy Speeches","authors":"Y. Sanada, Takumi Nakagawa, Yuichiro Wada, K. Takanashi, Yuhui Zhang, Kiichi Tokuyama, T. Kanamori, Tomonori Yamada","doi":"10.21437/interspeech.2022-306","DOIUrl":"https://doi.org/10.21437/interspeech.2022-306","url":null,"abstract":"In the last few years, unsupervised learning methods have been proposed in speech denoising by taking advantage of Deep Neural Networks (DNNs). The reason is that such unsupervised methods are more practical than the supervised counterparts. In our scenario, we are given a set of noisy speech data, where any two data do not share the same clean data. Our goal is to obtain the denoiser by training a DNN based model. Using the set, we train the model via the following two steps: 1) From the noisy speech data, construct another noisy speech data via our proposed masking technique. 2) Minimize our proposed loss defined from the DNN and the two noisy speech data. We evaluate our method using Gaussian and real-world noises in our numerical experiments. As a result, our method outperforms the state-of-the-art method on average for both noises. In addi-tion, we provide the theoretical explanation of why our method can be efficient if the noise has Gaussian distribution.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"1178-1182"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41830258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
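A minimal sketch of the described two-step recipe, under a Noise2Noise-style reading: a second noisy view is built by randomly masking samples of the input, and the loss compares the model output with the original noisy signal only at the masked positions. The masking scheme, mask probability, and loss form are assumptions, not the paper's exact definitions.

```python
import torch
import torch.nn as nn

def self_supervised_denoise_step(model, noisy, mask_prob=0.2):
    """One training step: build a masked noisy view, denoise it, and score
    the prediction against the original noisy signal at masked positions."""
    mask = (torch.rand_like(noisy) < mask_prob).float()
    masked_view = noisy * (1.0 - mask)            # second noisy view via masking
    estimate = model(masked_view)                 # denoised estimate
    loss = ((estimate - noisy) ** 2 * mask).sum() / mask.sum().clamp(min=1.0)
    return loss

# toy causal-free 1-D conv denoiser standing in for the DNN
net = nn.Sequential(nn.Conv1d(1, 16, 9, padding=4), nn.ReLU(), nn.Conv1d(16, 1, 9, padding=4))
loss = self_supervised_denoise_step(lambda x: net(x.unsqueeze(1)).squeeze(1),
                                    torch.randn(4, 16000))
loss.backward()
print(float(loss))
```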
Coarse-Grained Attention Fusion With Joint Training Framework for Complex Speech Enhancement and End-to-End Speech Recognition
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-698
Xuyi Zhuang, Lu Zhang, Zehua Zhang, Yukun Qian, Mingjiang Wang
Joint training of speech enhancement and automatic speech recognition (ASR) can make the model work robustly in noisy environments. However, most such models work directly in series, and the information in the noisy speech is not reused by the ASR model, leading to a large amount of feature distortion. To solve the distortion problem at its root, we propose a complex-domain speech enhancement network that enhances speech by combining masking and mapping in the complex domain. Secondly, we propose a coarse-grained attention fusion (CAF) mechanism to fuse the features of the noisy speech and the enhanced speech. In addition, a perceptual loss is introduced to constrain the output of the CAF module against the multi-layer outputs of a pre-trained model, so that the feature space of the CAF is more consistent with the ASR model. Our experiments are trained and tested on a dataset generated from the AISHELL-1 corpus and the DNS-3 noise dataset. The experimental results show that the character error rates (CERs) of the model are 13.42% and 20.67% for the noisy cases of 0 dB and -5 dB. The proposed joint training model also exhibits good generalization (5.98% relative CER degradation) on a mismatched test set generated from the AISHELL-2 corpus and the MUSAN noise dataset.
{"title":"Coarse-Grained Attention Fusion With Joint Training Framework for Complex Speech Enhancement and End-to-End Speech Recognition","authors":"Xuyi Zhuang, Lu Zhang, Zehua Zhang, Yukun Qian, Mingjiang Wang","doi":"10.21437/interspeech.2022-698","DOIUrl":"https://doi.org/10.21437/interspeech.2022-698","url":null,"abstract":"Joint training of speech enhancement and automatic speech recognition (ASR) can make the model work robustly in noisy environments. However, most of these models work directly in series, and the information of noisy speech is not reused by the ASR model, leading to a large amount of feature distortion. In order to solve the distortion problem from the root, we propose a complex speech enhancement network which is used to enhance the speech by combining the masking and mapping in the complex domain. Secondly, we propose a coarse-grained attention fusion (CAF) mechanism to fuse the features of noisy speech and enhanced speech. In addition, perceptual loss is further introduced to constrain the output of the CAF module and the multi-layer output of the pre-trained model so that the feature space of the CAF is more consistent with the ASR model. Our experiments are trained and tested on the dataset generated by AISHELL-1 corpus and DNS-3 noise dataset. The experimental results show that the character error rates (CERs) of the model are 13.42% and 20.67% for the noisy cases of 0 dB and -5 dB. And the proposed joint training model exhibits good generalization performance (5.98% relative CER degradation) on the mismatch test dataset generated by AISHELL-2 corpus and MUSAN noise dataset.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3794-3798"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41833974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
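The sketch below illustrates one way the perceptual constraint described above could look: the fused (CAF) features and the clean-speech features are passed through the same frozen layers of a pre-trained model and compared layer by layer. The L1 distance and the toy linear layers are assumptions for illustration only.

```python
import torch

def perceptual_loss(frozen_layers, fused_feat, clean_feat):
    """Feature-matching constraint: push the CAF output and the clean-speech
    features through the same frozen layers and sum layer-wise L1 distances."""
    loss = 0.0
    x, y = fused_feat, clean_feat
    for layer in frozen_layers:
        x, y = layer(x), layer(y)
        loss = loss + torch.mean(torch.abs(x - y))
    return loss

# toy usage with two frozen linear layers standing in for a pre-trained encoder
layers = [torch.nn.Linear(80, 80).requires_grad_(False) for _ in range(2)]
print(perceptual_loss(layers, torch.randn(4, 80), torch.randn(4, 80)))
```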
CAUSE: Crossmodal Action Unit Sequence Estimation from Speech
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11232
H. Kameoka, Takuhiro Kaneko, Shogo Seki, Kou Tanaka
This paper proposes a task and method for estimating a sequence of facial action units (AUs) solely from speech. AUs were introduced in the facial action coding system to objectively describe facial muscle activations. Our motivation is that AUs can be useful continuous quantities for representing a speaker's subtle emotional states, attitudes, and moods in a variety of applications such as expressive speech synthesis and emotional voice conversion. We hypothesize that information about the speaker's facial muscle movements is expressed in the generated speech and can somehow be predicted from speech alone. To verify this, we devise a neural network model that predicts an AU sequence from the mel-spectrogram of input speech and train it using a large-scale audio-visual dataset consisting of many speaking face-tracks. We call our method and model “crossmodal AU sequence estimation/estimator (CAUSE)”. We implemented several of the most basic architectures for CAUSE and quantitatively confirmed that the fully convolutional architecture performed best. Furthermore, by combining CAUSE with an AU-conditioned image-to-image translation method, we implemented a system that animates a given still face image from speech. Using this system, we confirmed the potential usefulness of AUs as a representation of non-linguistic features via subjective evaluations.
{"title":"CAUSE: Crossmodal Action Unit Sequence Estimation from Speech","authors":"H. Kameoka, Takuhiro Kaneko, Shogo Seki, Kou Tanaka","doi":"10.21437/interspeech.2022-11232","DOIUrl":"https://doi.org/10.21437/interspeech.2022-11232","url":null,"abstract":"This paper proposes a task and method for estimating a sequence of facial action units (AUs) solely from speech. AUs were introduced in the facial action coding system to objectively describe facial muscle activations. Our motivation is that AUs can be useful continuous quantities for represent-ing speaker’s subtle emotional states, attitudes, and moods in a variety of applications such as expressive speech synthesis and emotional voice conversion. We hypothesize that the information about the speaker’s facial muscle movements is expressed in the generated speech and can somehow be predicted from speech alone. To verify this, we devise a neural network model that predicts an AU sequence from the mel-spectrogram of input speech and train it using a large-scale audio-visual dataset consisting of many speaking face-tracks. We call our method and model “crossmodal AU sequence es-timation/estimator (CAUSE)”. We implemented several of the most basic architectures for CAUSE, and quantitatively confirmed that the fully convolutional architecture performed best. Furthermore, by combining CAUSE with an AU-conditioned image-to-image translation method, we implemented a system that animates a given still face image from speech. Using this system, we confirmed the potential usefulness of AUs as a representation of non-linguistic features via subjective evaluations.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"506-510"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43127714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
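A minimal fully convolutional sketch in the spirit of CAUSE: a stack of 1-D convolutions maps a mel-spectrogram to per-frame AU activations. The number of mel bins, the number of AUs, the hidden width, and the sigmoid output range are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MelToAU(nn.Module):
    """Map a mel-spectrogram (n_mels x frames) to per-frame AU intensities."""
    def __init__(self, n_mels=80, n_aus=17, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, n_aus, kernel_size=1),
        )

    def forward(self, mel):                   # mel: (batch, n_mels, frames)
        return torch.sigmoid(self.net(mel))   # (batch, n_aus, frames), AU activations in [0, 1]

aus = MelToAU()(torch.randn(2, 80, 300))
print(aus.shape)  # torch.Size([2, 17, 300])
```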
PLCNet: Real-time Packet Loss Concealment with Semi-supervised Generative Adversarial Network
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10428
Baiyun Liu, Qi Song, Mingxue Yang, Wuwen Yuan, Tianbao Wang
Packet loss is one of the main reasons for speech quality degradation in voice over Internet Protocol (VoIP) calls. However, existing packet loss concealment (PLC) algorithms struggle to generate high-quality speech while maintaining low computational complexity. In this paper, a causal wave-to-wave non-autoregressive lightweight PLC model (PLCNet) is proposed, which supports real-time streaming processing with low latency. In addition, we introduce multiple multi-resolution discriminators and a semi-supervised training strategy to improve the ability of the encoder to extract global features while enabling the decoder to accurately reconstruct waveforms where packets are lost. In contrast to autoregressive models, PLCNet can guarantee the smoothness and continuity of the speech phase before and after packet loss without any smoothing operations. Experimental results show that PLCNet achieves significant improvements in perceptual quality and intelligibility over three classical PLC methods and three state-of-the-art deep PLC methods. In the INTERSPEECH 2022 PLC Challenge, our approach ranked 3rd on PLCMOS (3.829) and 3rd on the final score (0.798).
{"title":"PLCNet: Real-time Packet Loss Concealment with Semi-supervised Generative Adversarial Network","authors":"Baiyun Liu, Qi Song, Mingxue Yang, Wuwen Yuan, Tianbao Wang","doi":"10.21437/interspeech.2022-10428","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10428","url":null,"abstract":"Packet loss is one of the main reasons for speech quality degradation in voice over internet phone (VOIP) calls. However, the existing packet loss concealment (PLC) algorithms are hard to generate high-quality speech signal while maintaining low computational complexity. In this paper, a causal wave-to-wave non-autoregressive lightweight PLC model (PLCNet) is proposed, which can do real-time streaming process with low latency. In addition, we introduce multiple multi-resolution discriminators and semi-supervised training strategy to improve the ability of the encoder part to extract global features while enabling the decoder part to accurately reconstruct waveforms where packets are lost. Contrary to autoregressive model, PLCNet can guarantee the smoothness and continuity of the speech phase before and after packet loss without any smoothing operations. Experimental results show that PLCNet achieves significant improvements in perceptual quality and intelligibility over three classical PLC methods and three state-of-the-art deep PLC methods. In the INTERSPEECH 2022 PLC Challenge, our approach has ranked the 3rd place on PLCMOS (3.829) and the 3rd place on the final score (0.798).","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"575-579"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43369716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
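To make the streaming PLC setting concrete (this is not PLCNet itself), the sketch below runs a causal concealment loop over fixed-size packets: received packets pass through unchanged, and a lost packet is filled from past output by whatever causal predictor is plugged in. The trivial "repeat the last packet" predictor in the usage lines is only a placeholder for a learned model.

```python
import numpy as np

def conceal_packets(frames, lost, predict_frame):
    """Streaming concealment loop: `frames` is a list of fixed-size waveform
    packets, `lost[i]` flags a dropped packet, and `predict_frame(history)`
    is any causal model that predicts the next packet from past output."""
    out = []
    for frame, is_lost in zip(frames, lost):
        if is_lost:
            out.append(predict_frame(out))      # fill the gap from past context only
        else:
            out.append(np.asarray(frame, dtype=float))
    return np.concatenate(out)

# toy usage: five 10 ms packets at 16 kHz, the third one lost
frames = [np.ones(160) * i for i in range(5)]
restored = conceal_packets(frames, [False, False, True, False, False],
                           lambda hist: hist[-1].copy() if hist else np.zeros(160))
print(restored.shape)  # (800,)
```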
ASR-Robust Natural Language Understanding on ASR-GLUE dataset
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10097
Lingyun Feng, Jianwei Yu, Yan Wang, Songxiang Liu, Deng Cai, Haitao Zheng
{"title":"ASR-Robust Natural Language Understanding on ASR-GLUE dataset","authors":"Lingyun Feng, Jianwei Yu, Yan Wang, Songxiang Liu, Deng Cai, Haitao Zheng","doi":"10.21437/interspeech.2022-10097","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10097","url":null,"abstract":"","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"1101-1105"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43470194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Complementary Joint Training Approach Using Unpaired Speech and Text
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-291
Ye Du, J. Zhang, Qiu-shi Zhu, Lirong Dai, Ming Wu, Xin Fang, Zhouwang Yang
{"title":"A Complementary Joint Training Approach Using Unpaired Speech and Text A Complementary Joint Training Approach Using Unpaired Speech and Text","authors":"Ye Du, J. Zhang, Qiu-shi Zhu, Lirong Dai, Ming Wu, Xin Fang, Zhouwang Yang","doi":"10.21437/interspeech.2022-291","DOIUrl":"https://doi.org/10.21437/interspeech.2022-291","url":null,"abstract":"","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2613-2617"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43520324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0