Raw Waveform Based Speaker Identification Using Deep Neural Networks

Banala Saritha, Mohammad Azharuddin Laskar, R. Laskar, Madhuchhanda Choudhury
DOI: 10.1109/SILCON55242.2022.10028890
Published in: 2022 IEEE Silchar Subsection Conference (SILCON), 2022-11-04
Citations: 1

Abstract

Deep learning is gaining prominence as an effective replacement for i-vectors in the speaker identification task, and deep neural networks (DNNs) have attracted much attention in the end-to-end (E2E) speaker identification domain. Earlier, DNNs were trained on handcrafted speech features such as Mel-filter banks and Mel-frequency cepstral coefficients. Later, because the raw speech signal is lossless, processing raw waveforms became an active research area in E2E speaker identification, automatic music tagging, and speech recognition. Convolutional neural networks (CNNs) have recently shown promising results when fed directly with raw speech samples. A CNN analyzes the waveform to discover low-level speech representations rather than relying on conventional handcrafted features, which may enable the system to handle speaker properties such as pitch and formants more efficiently. An efficient neural network design is vital to achieving this. The CNN architecture proposed in this paper encourages the deep convolutional layers to develop more efficient filters for end-to-end speaker identification systems. The proposed architecture converges quickly and outperforms a conventional CNN on raw waveforms. This work has been tested on the LibriSpeech dataset, improving speaker identification accuracy by 10% and decreasing validation loss by 32%.
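The paper's architecture details are not reproduced here, but the core idea the abstract describes — learning filters by convolving kernels directly with raw audio samples instead of computing handcrafted features — can be illustrated with a minimal sketch. All numbers below (32 kernels, 25 ms kernel length, 10 ms hop at 16 kHz) are illustrative assumptions chosen to mirror a typical MFCC frame geometry, not the authors' configuration:

```python
import numpy as np

def conv1d(signal, filters, stride):
    """Valid-mode 1D convolution of a raw waveform with a filter bank.

    signal:  (T,)   raw audio samples
    filters: (F, K) F learned kernels of length K
    returns: (F, T_out) feature map, one row per kernel
    """
    F, K = filters.shape
    T_out = (len(signal) - K) // stride + 1
    out = np.empty((F, T_out))
    for t in range(T_out):
        frame = signal[t * stride : t * stride + K]
        out[:, t] = filters @ frame  # one dot product per filter
    return out

# Illustrative setup: 1 s of 16 kHz audio, 32 kernels spanning 25 ms
# (400 samples), hopped by 10 ms (160 samples).
rng = np.random.default_rng(0)
waveform = rng.standard_normal(16000)   # stand-in for a raw utterance
kernels = rng.standard_normal((32, 400))  # stand-in for learned filters
features = conv1d(waveform, kernels, stride=160)
print(features.shape)  # (32, 98)
```

In a trained E2E system the kernels are learned by backpropagation from the speaker classification loss, so the first layer can adapt to speaker-discriminative cues (pitch, formants) rather than being fixed like a Mel filter bank; deeper layers and a classifier head would follow this front end.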