Text-independent speaker identification using modified SincNet with robust features from suitable acoustic region and appropriate optimizer for raw audio analysis
Authors: Nirupam Shome, Richik Kashyap, Rabul Hussain Laskar
Journal: Computers & Electrical Engineering, vol. 121, Article 109915
Publication date: 2024-11-26
DOI: 10.1016/j.compeleceng.2024.109915
URL: https://www.sciencedirect.com/science/article/pii/S0045790624008413
Impact factor: 4.0; JCR: Q1 (Computer Science, Hardware & Architecture)
Citations: 0
Abstract
Speaker identification is the task of identifying an individual from a set of speakers, and text-independent speaker identification systems allow speakers to utter any phrase without constraints. This study focuses on raw audio analysis because phase, fine-grained frequency patterns, timing cues, and other minute characteristics are preserved when raw waveforms are processed, in contrast to handcrafted features such as Mel-Frequency Cepstral Coefficients (MFCC) and visual representations of audio such as spectrograms. This depth of information, which includes variations in speech rhythm, pitch, and vocal tract shape, is beneficial for identifying speakers. The deep learning architecture SincNet has gained popularity in speaker identification because its parametric sinc functions allow it to operate directly on the raw audio input. In this paper, we take SincNet as the baseline model for speaker identification. We analyse the effects of proper speech boundary detection, the inclusion of high-level features, and effective optimizer selection. Precise identification of the start and end points of the signal is important for eliminating redundant non-speech regions, so we include an endpoint detection module as a pre-processing step in the system. Proper feature extraction and selection are crucial to the model's success. To extract more abstract features from the data, we add more convolution layers to the original SincNet model. Further, we investigate the sensitivity of the hyperparameter tuning protocol to the optimizer and select a suitable optimizer for raw audio analysis. With all the modifications to the system architecture, we achieve improvements of 12.76 %, 13.33 %, and 13.39 % in training, validation, and testing, respectively, over the original SincNet model. In terms of validation loss, our proposed approach attains 0.35, compared with 1.02 for the original SincNet. This improvement comes with a marginal increase of 20 minutes in total training time for our proposed model. We performed our investigation on the LibriSpeech dataset to check the effectiveness of our proposed system in comparison with the other model.
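The parametric sinc functions mentioned in the abstract can be illustrated with a small sketch. In the SincNet formulation, each first-layer filter is a band-pass kernel defined only by a low and a high cutoff frequency, built as the difference of two sinc low-pass responses and smoothed with a window. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the function names `sinc_bandpass` and `gain`, the kernel length of 101 taps, the 16 kHz sample rate, and the fixed (not learned) cutoffs are all hypothetical choices made for the example.

```python
import math


def sinc(x):
    """Unnormalized sinc, sin(x)/x, with the removable singularity sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(x) / x


def sinc_bandpass(f1_hz, f2_hz, kernel_size=101, sample_rate=16000):
    """Hamming-windowed band-pass FIR kernel for the band [f1_hz, f2_hz].

    In SincNet the two cutoffs are trainable; here they are fixed inputs
    (an assumption made to keep the sketch dependency-free).
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd so the kernel is symmetric")
    if not 0 <= f1_hz < f2_hz <= sample_rate / 2:
        raise ValueError("need 0 <= f1_hz < f2_hz <= sample_rate / 2")

    # Normalize cutoffs to cycles per sample.
    f1 = f1_hz / sample_rate
    f2 = f2_hz / sample_rate
    half = kernel_size // 2

    kernel = []
    for i in range(kernel_size):
        n = i - half  # sample index centered on zero
        # Difference of two ideal low-pass responses gives a band-pass.
        ideal = (2 * f2 * sinc(2 * math.pi * f2 * n)
                 - 2 * f1 * sinc(2 * math.pi * f1 * n))
        # Hamming window reduces ripple caused by truncating the sinc.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (kernel_size - 1))
        kernel.append(ideal * window)
    return kernel


def gain(kernel, freq_hz, sample_rate=16000):
    """Magnitude of the kernel's frequency response at a single frequency."""
    w = 2 * math.pi * freq_hz / sample_rate
    re = sum(h * math.cos(w * k) for k, h in enumerate(kernel))
    im = sum(h * math.sin(w * k) for k, h in enumerate(kernel))
    return math.hypot(re, im)


if __name__ == "__main__":
    k = sinc_bandpass(300.0, 3400.0)
    # Compare the response inside the band with the response outside it.
    print("gain at 1000 Hz:", gain(k, 1000.0))
    print("gain at 6000 Hz:", gain(k, 6000.0))
```

Because each filter is described by two numbers instead of one weight per tap, a layer built this way has far fewer parameters than an ordinary convolution over the raw waveform. In a trainable version, the cutoffs would be updated by the optimizer while the sinc and window shapes stay fixed.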
About the journal:
The impact of computers has nowhere been more revolutionary than in electrical engineering. The design, analysis, and operation of electrical and electronic systems are now dominated by computers, a transformation that has been motivated by the natural ease of interface between computers and electrical systems, and the promise of spectacular improvements in speed and efficiency.
Published since 1973, Computers & Electrical Engineering provides rapid publication of topical research into the integration of computer technology and computational techniques with electrical and electronic systems. The journal publishes papers featuring novel implementations of computers and computational techniques in areas like signal and image processing, high-performance computing, parallel processing, and communications. Special attention will be paid to papers describing innovative architectures, algorithms, and software tools.