Text-independent speaker identification using modified SincNet with robust features from suitable acoustic region and appropriate optimizer for raw audio analysis

Computers & Electrical Engineering (IF 4.0; Zone 3, Computer Science; JCR Q1, Computer Science, Hardware & Architecture) Pub Date: 2024-11-26 DOI: 10.1016/j.compeleceng.2024.109915

Nirupam Shome, Richik Kashyap, Rabul Hussain Laskar

Computers & Electrical Engineering, vol. 121, Article 109915. Journal article; not open access. https://www.sciencedirect.com/science/article/pii/S0045790624008413

Citations: 0

Abstract

Speaker identification is the task of identifying an individual from a set of speakers, and text-independent speaker identification systems allow speakers to utter any phrase without constraints. This study focuses on raw audio analysis because processing raw waveforms preserves phase, fine-grained frequency patterns, timing cues, and other minute characteristics that are not fully retained by handcrafted features such as Mel-Frequency Cepstral Coefficients (MFCC) or by visual representations of audio such as spectrograms. This depth of information, which includes variations in speech rhythm, pitch, and vocal tract shape, is beneficial for identifying speakers. The deep learning architecture known as SincNet has gained popularity in speaker identification because its parametric sinc functions allow it to operate directly on the raw audio input. In this paper, we take SincNet as the baseline model for speaker identification and analyse the effects of proper speech boundary detection, the inclusion of high-level features, and effective optimizer selection. Precise identification of the start and end points of the signal is important for eliminating redundant non-speech regions, so we include an endpoint detection module as a pre-processing step in the system. Proper feature extraction and selection are crucial to the model's success. To extract more abstract features from the data, we add more convolution layers to the original SincNet model. Further, we investigate the sensitivity of the hyperparameter tuning protocol to the optimizer and select a suitable optimizer for raw audio analysis. With all the modifications to the system architecture, we achieve improvements of 12.76 %, 13.33 %, and 13.39 % for training, validation, and testing, respectively, over the original SincNet model. In terms of validation loss, our proposed approach attains 0.35, compared with 1.02 for the original SincNet. This improvement comes with a marginal increase of 20 minutes in total training time for our proposed model. We performed our investigation on the LibriSpeech dataset to check the effectiveness of our proposed system in comparison with the other model.
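
The "parametric sinc functions" mentioned in the abstract can be illustrated with a small sketch. In the original SincNet design, each first-layer filter is a band-pass kernel defined only by a low and a high cutoff frequency, built as the difference of two windowed sinc low-pass filters. The code below is a minimal, standard-library-only illustration of that construction with fixed cutoffs; it is not the authors' modified model. The function names (`sinc_bandpass`, `gain_at`), the 300 to 3400 Hz band, and the 251-tap length are assumptions chosen for the example, and in SincNet itself the two cutoffs are learned during training rather than set by hand.

```python
import math


def sinc(x: float) -> float:
    """Unnormalised sinc: sin(x) / x, with sinc(0) defined as 1."""
    return 1.0 if x == 0.0 else math.sin(x) / x


def sinc_bandpass(f_low: float, f_high: float, kernel_size: int,
                  sample_rate: float) -> list[float]:
    """Build a Hamming-windowed band-pass FIR kernel (cutoffs in Hz).

    An ideal band-pass is the difference of two ideal low-pass filters:
        g[n] = 2*f2*sinc(2*pi*f2*n) - 2*f1*sinc(2*pi*f1*n)
    where f1 and f2 are the cutoffs as fractions of the sample rate.
    """
    if not 0.0 <= f_low < f_high <= sample_rate / 2:
        raise ValueError("need 0 <= f_low < f_high <= Nyquist")
    if kernel_size < 3 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd and at least 3")
    f1 = f_low / sample_rate
    f2 = f_high / sample_rate
    half = kernel_size // 2
    kernel = []
    for i in range(kernel_size):
        n = i - half  # centre the kernel on n = 0
        ideal = (2 * f2 * sinc(2 * math.pi * f2 * n)
                 - 2 * f1 * sinc(2 * math.pi * f1 * n))
        # Hamming window smooths the truncation of the infinite sinc.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        kernel.append(ideal * window)
    return kernel


def gain_at(kernel: list[float], freq: float, sample_rate: float) -> float:
    """Magnitude of the kernel's frequency response at `freq` Hz."""
    w = 2 * math.pi * freq / sample_rate
    re = sum(h * math.cos(w * k) for k, h in enumerate(kernel))
    im = sum(h * math.sin(w * k) for k, h in enumerate(kernel))
    return math.hypot(re, im)


# Example: a telephone-band filter at a 16 kHz sampling rate (values assumed).
kernel = sinc_bandpass(300.0, 3400.0, kernel_size=251, sample_rate=16000.0)
in_band = gain_at(kernel, 1000.0, 16000.0)   # inside the pass band
out_band = gain_at(kernel, 6000.0, 16000.0)  # well outside the pass band
```

Because each filter is fully described by two numbers, a layer of such filters has far fewer free parameters than an unconstrained convolution of the same length, which is the property the abstract refers to when it credits SincNet's parametric filters for working directly on raw waveforms.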

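Source journal

Computers & Electrical Engineering (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 9.20

Self-citation rate: 7.00%

Articles published: 661

Review time: 47 days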

Journal description: The impact of computers has nowhere been more revolutionary than in electrical engineering. The design, analysis, and operation of electrical and electronic systems are now dominated by computers, a transformation that has been motivated by the natural ease of interface between computers and electrical systems, and the promise of spectacular improvements in speed and efficiency. Published since 1973, Computers & Electrical Engineering provides rapid publication of topical research into the integration of computer technology and computational techniques with electrical and electronic systems. The journal publishes papers featuring novel implementations of computers and computational techniques in areas like signal and image processing, high-performance computing, parallel processing, and communications. Special attention will be paid to papers describing innovative architectures, algorithms, and software tools.