Improving Speaker Verification in Noisy Environment Using DNN Classifier
Authors: Chung Tran Quang, Quang Minh Nguyen, Pham Ngoc Phuong, Quoc Truong Do
Published in: 2021 RIVF International Conference on Computing and Communication Technologies (RIVF), pp. 1-5
Publication date: 2021-08-19
DOI: 10.1109/RIVF51545.2021.9642074 (https://doi.org/10.1109/RIVF51545.2021.9642074)
Citations: 0
Abstract
Speaker verification in noisy environments remains a challenging task. Previous studies have paired speaker embeddings (x-vectors, ThinResNet) with classifier models (PLDA, cosine scoring) to decide whether an utterance was spoken by a specific speaker. The verification process consists of three steps: training an embedding extractor, enrollment, and verification. Most studies have tried to mitigate the effect of noise by augmenting the training data of the embedding extractor with noise, which helps the extractor tolerate more types of noise at inference time. However, the classification model remains sensitive to noise. In this paper, we (1) evaluate the effectiveness of different speaker embedding models and classifiers under various conditions, and (2) propose a neural network classifier on top of the embedding vectors and train it with data augmentation. Experimental results indicate that the proposed pipeline outperforms the traditional pipeline by 5% F1 on a clean test set and 9% F1 on noisy test sets.
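The enrollment and verification steps of the traditional pipeline, together with the F1 metric used to report results, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 3-dimensional vectors stand in for the output of a trained embedding extractor, averaging is only one common way to enroll a speaker, and the function names and the 0.5 threshold are assumptions made for illustration. The abstract does not describe the proposed neural network classifier in detail, so only the cosine baseline is shown; a learned classifier would take the place of the scoring inside `verify`.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two embeddings (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def enroll(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Enrollment (assumed scheme): average a speaker's embeddings per dimension."""
    count = len(embeddings)
    return [sum(dim) / count for dim in zip(*embeddings)]


def verify(reference: Sequence[float], test: Sequence[float],
           threshold: float = 0.5) -> bool:
    """Verification: accept the trial if the cosine score reaches the threshold.

    The threshold value is hypothetical; in practice it is tuned on held-out data.
    """
    return cosine_similarity(reference, test) >= threshold


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """F1 = harmonic mean of precision and recall for the 'same speaker' class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy embeddings (made up for illustration, not real extractor output).
reference = enroll([[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]])
trials = [[0.8, 0.2, 0.1], [0.0, 1.0, 0.2]]  # target trial, impostor trial
labels = [True, False]
decisions = [verify(reference, trial) for trial in trials]
score = f1_score(labels, decisions)
```

Keeping the scoring step behind a single `verify` function keeps the enrollment and evaluation code unchanged when the fixed cosine threshold is swapped for a trained classifier.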