Raw Waveform Based Speaker Identification Using Deep Neural Networks
Banala Saritha, Mohammad Azharuddin Laskar, R. Laskar, Madhuchhanda Choudhury
2022 IEEE Silchar Subsection Conference (SILCON), 2022-11-04
DOI: https://doi.org/10.1109/SILCON55242.2022.10028890
Abstract
Deep learning is gaining prominence as a viable replacement for i-vectors in the speaker identification task, and deep neural networks (DNNs) have attracted much attention in end-to-end (E2E) speaker identification. Earlier, DNNs were trained on handcrafted speech features such as Mel-filter banks and Mel-frequency cepstral coefficients. Later, because the raw speech signal is lossless, processing raw waveforms became an active research area in E2E speaker identification, automatic music tagging, and speech recognition. Convolutional neural networks (CNNs) have recently shown promising results when fed directly with raw speech samples. A CNN analyzes the waveform to discover low-level speech representations rather than relying on conventional handcrafted features, which may enable the system to handle speaker properties such as pitch and formants more effectively. An efficient neural network design is vital to achieving this. The CNN architecture proposed in this paper encourages the deep convolutional layers to develop more efficient filters for end-to-end speaker identification systems. The proposed architecture converges quickly and outperforms a conventional CNN on raw waveforms. The approach was evaluated on the LibriSpeech dataset, improving speaker identification accuracy by 10% and decreasing validation loss by 32%.
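To make the raw-waveform idea concrete, below is a minimal PyTorch sketch of a 1-D CNN that classifies speakers directly from audio samples. This is not the architecture from the paper (the abstract gives no layer details); the layer sizes, kernel widths, and the speaker count of 251 (the number of speakers in LibriSpeech train-clean-100) are illustrative assumptions.

```python
# Minimal sketch of a raw-waveform speaker-identification CNN in PyTorch.
# Hyperparameters are assumptions for illustration, not the paper's design.

import torch
import torch.nn as nn


class RawWaveformCNN(nn.Module):
    """1-D CNN mapping raw audio samples to speaker posteriors."""

    def __init__(self, num_speakers: int = 251):  # assumed speaker count
        super().__init__()
        self.features = nn.Sequential(
            # A wide first kernel with a small stride lets the network learn
            # filter-bank-like responses directly from samples, replacing
            # handcrafted Mel features.
            nn.Conv1d(1, 64, kernel_size=251, stride=5),
            nn.BatchNorm1d(64), nn.ReLU(), nn.MaxPool1d(3),
            nn.Conv1d(64, 128, kernel_size=5),
            nn.BatchNorm1d(128), nn.ReLU(), nn.MaxPool1d(3),
            nn.Conv1d(128, 256, kernel_size=3),
            nn.BatchNorm1d(256), nn.ReLU(),
            # Average over time to get a fixed-size utterance-level embedding.
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(256, num_speakers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, samples), e.g. 3 s of 16 kHz audio -> 48000 samples.
        return self.classifier(self.features(x).squeeze(-1))


if __name__ == "__main__":
    model = RawWaveformCNN()
    waveform = torch.randn(4, 1, 48000)  # batch of 4 dummy utterances
    logits = model(waveform)             # (4, 251) speaker scores
    print(logits.shape)
```

Such a model is typically trained with cross-entropy loss over speaker labels; the key design choice the abstract highlights is feeding samples in directly so the first convolutional layer learns its own low-level filters instead of consuming precomputed spectral features.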