E. Marchi, Stephen Shum, Kyuyeon Hwang, S. Kajarekar, Siddharth Sigtia, H. Richards, R. Haynes, Yoon Kim, J. Bridle
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5324-5328. Published 2018-04-23. DOI: 10.1109/ICASSP.2018.8461296 (https://doi.org/10.1109/ICASSP.2018.8461296)
Generalised Discriminative Transform via Curriculum Learning for Speaker Recognition
In this paper we introduce a speaker verification system deployed on mobile devices that can be used to personalise a keyword spotter. We describe a baseline DNN system that maps an utterance to a speaker embedding, which is used to measure speaker differences via cosine similarity. We then introduce an architectural modification which uses an LSTM system where the parameters are optimised via a curriculum learning procedure to reduce the detection error and improve its generalisability across various conditions. Experiments on our internal datasets show that the proposed approach outperforms the DNN baseline system and yields a relative EER reduction of 30-70% on both text-dependent and text-independent tasks under a variety of acoustic conditions.
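The abstract names two quantities without defining them: the cosine similarity used to score a pair of speaker embeddings, and the equal error rate (EER) in which results are reported. The following is a minimal sketch of both, not the authors' implementation. The function names, the threshold sweep used to approximate the EER, and all numbers in the usage note are illustrative assumptions; none of them come from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def equal_error_rate(target_scores: Sequence[float],
                     impostor_scores: Sequence[float]) -> float:
    """Approximate the EER by sweeping a threshold over the observed scores.

    A trial is accepted when its score is >= the threshold. The function
    returns the mean of the false-accept and false-reject rates at the
    threshold where the two rates are closest (an assumed, simple estimator).
    """
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(target_scores) | set(impostor_scores)):
        # False rejects: same-speaker trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False accepts: different-speaker trials scored at or above it.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer: float, new_eer: float) -> float:
    """Relative EER reduction, e.g. 0.5 means the EER was halved."""
    return (baseline_eer - new_eer) / baseline_eer
```

As a usage illustration with made-up scores, `equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])` gives 0.25, and `relative_reduction(0.10, 0.05)` gives 0.5, which is how a figure such as "30-70% relative EER reduction" is read: the proposed system's EER is 30-70% lower than the baseline's, as a fraction of the baseline EER.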