Automatic Speaker Recognition with Limited Data

Proceedings of the 13th International Conference on Web Search and Data Mining. Pub Date: 2020-01-20. DOI: 10.1145/3336191.3371802

Ruirui Li, Jyun-Yu Jiang, Jiahao Liu, Chu-Cheng Hsieh, Wei Wang


Citations: 26

Abstract

Automatic speaker recognition (ASR) is a stepping-stone technology towards semantic multimedia understanding and benefits a wide range of downstream applications. In recent years, neural network-based ASR methods have achieved excellent recognition performance when sufficient training data is available. However, it is impractical to collect sufficient training data for every user, especially for new users, so a large portion of users usually has only a very limited number of training instances. As a consequence, the lack of training data prevents ASR systems from accurately learning users' acoustic biometrics, jeopardizes the downstream applications, and eventually impairs user experience. In this work, we propose an adversarial few-shot learning-based speaker identification framework (AFEASI) to develop robust speaker identification models with only a limited number of training instances. We first employ metric learning-based few-shot learning to learn speaker acoustic representations, where the limited instances are comprehensively utilized to improve the identification performance. In addition, adversarial learning with adversarial examples is applied to further enhance the generalization and robustness of speaker identification. Experiments conducted on a publicly available large-scale dataset demonstrate that AFEASI significantly outperforms eleven baseline methods. An in-depth analysis further indicates both the effectiveness and the robustness of the proposed method.
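The abstract does not specify AFEASI's architecture, distance function, or attack method, so the following is only a minimal sketch of the two general ideas it names: metric learning-based few-shot classification and adversarial examples. It assumes a prototype-style classifier (each speaker is represented by the mean of a few support embeddings, with squared Euclidean distance) and an FGSM-style perturbation applied directly to an embedding. The function names, the toy 2-D embeddings, and the speaker labels are all hypothetical and chosen for illustration.

```python
import numpy as np

def build_prototypes(support, labels):
    """Average the few support embeddings of each speaker into one prototype."""
    speakers = np.unique(labels)
    protos = np.stack([support[labels == s].mean(axis=0) for s in speakers])
    return speakers, protos

def class_probs(x, protos):
    """Softmax over negative squared Euclidean distances to the prototypes."""
    logits = -np.sum((protos - x) ** 2, axis=1)
    logits -= logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def fgsm_perturb(x, true_idx, protos, eps):
    """One FGSM-style step that raises the cross-entropy loss of the true speaker.

    For loss = -log p[true_idx] with the logits above, the gradient with
    respect to x simplifies to 2 * (sum_k p_k * c_k - c_true).
    """
    p = class_probs(x, protos)
    grad = 2.0 * (p @ protos - protos[true_idx])
    return x + eps * np.sign(grad)

# Toy data (assumed): two speakers with two support embeddings each.
support = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
labels = np.array(["alice", "alice", "bob", "bob"])
speakers, protos = build_prototypes(support, labels)

query = np.array([1.0, 1.0])
probs = class_probs(query, protos)
predicted = speakers[np.argmax(probs)]

# Adversarial variant of the same query; index 0 is the true speaker here.
adv = fgsm_perturb(query, 0, protos, eps=0.5)
adv_probs = class_probs(adv, protos)
```

In a training loop, perturbed embeddings such as `adv` would typically be fed back as additional training examples so the model learns to classify them correctly. Whether AFEASI perturbs embeddings, input features, or something else is not stated in the abstract.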


