Raw Waveform Based Speaker Identification Using Deep Neural Networks
Banala Saritha, Mohammad Azharuddin Laskar, R. Laskar, Madhuchhanda Choudhury
2022 IEEE Silchar Subsection Conference (SILCON), 2022-11-04
DOI: https://doi.org/10.1109/SILCON55242.2022.10028890
Abstract
Deep learning is gaining prominence as a viable replacement for i-vectors in the speaker identification task, and deep neural networks (DNNs) have attracted much attention in end-to-end (E2E) speaker identification. Earlier, DNNs were trained on handcrafted speech features such as Mel-filter banks and Mel-frequency cepstral coefficients. Later, because the raw speech signal is lossless, processing raw waveforms became an active research area in E2E speaker identification, automatic music tagging, and speech recognition. Convolutional neural networks (CNNs) have recently shown promising results when fed directly with raw speech samples. A CNN analyzes the waveform to discover low-level speech representations rather than relying on conventional handcrafted features, which may enable the system to handle speaker properties such as pitch and formants more effectively. An efficient neural network design is vital to achieving this. The CNN architecture proposed in this paper encourages the deep convolutional layers to develop more efficient filters for end-to-end speaker identification systems. The proposed architecture converges quickly and outperforms a conventional CNN on raw waveforms. The approach was evaluated on the LibriSpeech dataset, improving speaker identification accuracy by 10% and decreasing validation loss by 32%.
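To make the raw-waveform idea concrete, below is a minimal PyTorch sketch of a 1-D CNN that classifies speakers directly from audio samples. This is not the architecture from the paper (the abstract gives no layer details); the layer sizes, kernel widths, and the speaker count of 251 (the number of speakers in LibriSpeech train-clean-100) are illustrative assumptions.

```python
# Minimal sketch of a raw-waveform speaker-identification CNN in PyTorch.
# Hyperparameters are assumptions for illustration, not the paper's design.

import torch
import torch.nn as nn


class RawWaveformCNN(nn.Module):
    """1-D CNN mapping raw audio samples to speaker posteriors."""

    def __init__(self, num_speakers: int = 251):  # assumed speaker count
        super().__init__()
        self.features = nn.Sequential(
            # A wide first kernel with a small stride lets the network learn
            # filter-bank-like responses directly from samples, replacing
            # handcrafted Mel features.
            nn.Conv1d(1, 64, kernel_size=251, stride=5),
            nn.BatchNorm1d(64), nn.ReLU(), nn.MaxPool1d(3),
            nn.Conv1d(64, 128, kernel_size=5),
            nn.BatchNorm1d(128), nn.ReLU(), nn.MaxPool1d(3),
            nn.Conv1d(128, 256, kernel_size=3),
            nn.BatchNorm1d(256), nn.ReLU(),
            # Average over time to get a fixed-size utterance-level embedding.
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(256, num_speakers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, samples), e.g. 3 s of 16 kHz audio -> 48000 samples.
        return self.classifier(self.features(x).squeeze(-1))


if __name__ == "__main__":
    model = RawWaveformCNN()
    waveform = torch.randn(4, 1, 48000)  # batch of 4 dummy utterances
    logits = model(waveform)             # (4, 251) speaker scores
    print(logits.shape)
```

Such a model is typically trained with cross-entropy loss over speaker labels; the key design choice the abstract highlights is feeding samples in directly so the first convolutional layer learns its own low-level filters instead of consuming precomputed spectral features.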