The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space

Anastasiia D. Matychenko, M. V. Polyakova
DOI: 10.15276/hait.06.2023.7
Journal: Herald of Advanced Information Technology, vol. 19, no. 1
Published: 2023-07-03 (Journal Article)

Abstract

Based on an analysis of the literature, the main methods for speaker identification from speech signals were identified. These are statistical methods based on a Gaussian mixture model with a universal background model, as well as neural network methods, in particular those using convolutional or Siamese neural networks. The main characteristics of these methods are recognition performance, the number of parameters, and the training time. High recognition performance is achieved by convolutional neural networks, but the number of parameters of these networks is much higher than for statistical methods, although lower than for Siamese neural networks. A large number of parameters requires a large training set, which is not always available to the researcher. In addition, despite the effectiveness of convolutional neural networks, model size and inference efficiency remain important for devices with limited computing power, such as peripheral or mobile devices. Therefore, tuning the structure of existing convolutional neural networks is a relevant research direction. In this work, we performed a structural tuning of an existing convolutional neural network based on the VGGNet architecture for speaker identification in the space of mel frequency cepstrum coefficients. The aim of the work was to reduce the number of neural network parameters and, as a result, the network training time, provided that the recognition performance remains sufficient (correct recognition above 95 %). The neural network proposed as a result of structural tuning has fewer layers than the basic neural network architecture. Instead of the ReLU activation function, the related Leaky ReLU function with a parameter of 0.1 was used. The number of filters and the kernel sizes in the convolutional layers were changed, and the kernel size of the max pooling layer was increased.
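The substitution of Leaky ReLU (slope 0.1) for ReLU described above can be sketched as follows. This is the standard definition of the function, not code from the paper:

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """Leaky ReLU: positive values pass unchanged; negative values are
    scaled by a small slope alpha instead of being zeroed as in ReLU."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))  # negative inputs are damped, not killed
```

Because negative inputs keep a nonzero gradient, Leaky ReLU avoids the "dying ReLU" problem, which is one common motivation for this substitution.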
It is proposed to average the result of each convolution, so that the two-dimensional convolution outputs are reduced to scalars before being passed to a fully connected layer with the Softmax activation function. The experiment showed that the proposed neural network has 29 % fewer parameters than the basic neural network, while speaker recognition performance remains almost the same. In addition, the training time of the proposed and basic neural networks was evaluated on five datasets of audio recordings corresponding to different numbers of speakers. The training time of the proposed network was reduced by 10-39 % compared to the basic neural network. The results of the research show the advisability of structural tuning of the convolutional neural network for devices with limited computing resources, namely peripheral or mobile devices.
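The averaging step described in the abstract (global average pooling of each 2D convolution output, followed by a fully connected Softmax layer) can be sketched as below. The shapes, the number of filters (8), and the number of speakers (5) are illustrative assumptions, not values from the paper:

```python
import numpy as np

def gap_softmax_head(feature_maps, weights, bias):
    """Average each 2D feature map to one scalar (global average pooling),
    then apply a fully connected layer with Softmax activation.
    feature_maps: (n_filters, H, W); weights: (n_classes, n_filters)."""
    pooled = feature_maps.mean(axis=(1, 2))   # (n_filters,)
    logits = weights @ pooled + bias          # (n_classes,)
    exp = np.exp(logits - logits.max())       # numerically stable Softmax
    return exp / exp.sum()                    # class probabilities

rng = np.random.default_rng(0)
maps = rng.normal(size=(8, 4, 4))             # 8 hypothetical filter outputs
W, b = rng.normal(size=(5, 8)), np.zeros(5)   # 5 hypothetical speakers
probs = gap_softmax_head(maps, W, b)
print(probs.sum())  # probabilities sum to 1
```

Replacing a flattening step with per-map averaging shrinks the fully connected layer's input from H×W×n_filters to n_filters, which is consistent with the parameter reduction the abstract reports.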