Deep Speaker Representation Using Orthogonal Decomposition and Recombination for Speaker Verification

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Pub Date: 2019-05-12. DOI: 10.1109/ICASSP.2019.8683332

I. Kim, Kyu-hong Kim, Ji-Whan Kim, Changkyu Choi

Pages: 6126-6130

Citations: 15

Abstract

A speech signal contains intrinsic and extrinsic variations such as accent, emotion, dialect, phoneme, speaking manner, noise, music, and reverberation. Some of these variations are unnecessary and act as unspecified factors of variation, which increase the variability of speaker representations. In this paper, we assume that such unspecified factors of variation exist in speaker representations, and we attempt to minimize the resulting variability. The key idea is that a primal speaker representation can be decomposed into orthogonal vectors, which are then recombined using deep neural networks (DNN) to reduce speaker representation variability and thereby improve speaker verification (SV) performance. Experimental results on the Vox-Celeb dataset show that the proposed approach yields a relative equal error rate (EER) reduction of 47.1% compared with using the same convolutional neural network (CNN) architecture. Furthermore, the proposed method provides significant improvement for short utterances.
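To make the decompose-then-recombine idea concrete, here is a minimal NumPy sketch. It is an illustration only and not the authors' method: the abstract does not say how the orthogonal vectors are obtained or how the DNN recombines them. The sketch assumes projection onto a random orthonormal basis and a fixed weight vector as a stand-in for whatever a trained network would produce. The names `orthogonal_decompose`, `recombine`, and `weights` are hypothetical.

```python
import numpy as np


def orthogonal_decompose(x, basis):
    """Split x into components along the orthonormal columns of `basis`.

    Returns an array of shape (k, d); row i is the projection of x onto
    basis column i, so the rows are mutually orthogonal.
    """
    coeffs = basis.T @ x               # projection coefficients, shape (k,)
    return coeffs[:, None] * basis.T   # one scaled basis vector per row


def recombine(components, weights):
    """Weighted sum of the components.

    `weights` is a hypothetical stand-in for what a trained DNN would
    produce; the paper's actual recombination network is not shown here.
    """
    return weights @ components


rng = np.random.default_rng(0)
d = 8  # toy embedding dimension (assumption)

# Assumption: a random orthonormal basis obtained from a QR factorization.
basis, _ = np.linalg.qr(rng.standard_normal((d, d)))

x = rng.standard_normal(d)  # stand-in for a primal speaker representation
components = orthogonal_decompose(x, basis)

# Off-diagonal entries of the Gram matrix are (numerically) zero,
# confirming that the components are mutually orthogonal.
gram = components @ components.T

# Unit weights over a full basis recover the original vector exactly.
reconstructed = recombine(components, np.ones(d))

# Non-uniform weights suppress some directions, which is the kind of
# re-weighting a learned recombination could apply.
weights = np.linspace(1.0, 0.0, d)
reweighted = recombine(components, weights)
```

With unit weights the recombination is the identity, so any reduction in variability has to come from how the weights (here fixed, in the paper learned) treat individual orthogonal directions.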


