Multiclass Spoken Language Identification for Indian Languages using Deep Learning
Lakshmana Rao Arla, Sridevi Bonthu, Abhinav Dayal
2020 IEEE Bombay Section Signature Conference (IBSSC), published 2020-12-04
DOI: 10.1109/IBSSC51096.2020.9332161
Citations: 5
Abstract
Spoken Language Identification (SLID) aims at assigning language labels to speech in an audio file. This paper proposes an approach based on Convolutional Neural Networks (CNN) for the automatic identification of four Indian languages: Bengali, Gujarati, Tamil, and Telugu. The classifier is trained on 5 hours of audio data from each of the four languages. The CNN operates on MFCC spectrogram images generated from short two- to four-second splits of the raw audio input, which varies in audio quality and noise profile. The paper also analyzes the SLID system's performance as a function of different train and test audio sample durations. The proposed CNN model achieves 88.82% accuracy, which compares favorably with conventional machine learning models.
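The abstract describes segmenting raw audio into fixed-duration splits (two to four seconds) before computing MFCC spectrograms for the CNN. The paper does not provide code, but the splitting step can be sketched roughly as below; the function name `split_audio` and the remainder-dropping policy are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def split_audio(samples: np.ndarray, sr: int, split_seconds: float = 3.0) -> list:
    """Split a mono waveform into fixed-duration segments.

    Any trailing remainder shorter than the split length is dropped
    (hypothetical policy; the paper does not specify how remainders
    are handled).
    """
    seg_len = int(sr * split_seconds)       # samples per segment
    n_full = len(samples) // seg_len        # number of complete segments
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]

# Example: 10 s of synthetic audio at 16 kHz, split into 3 s segments
sr = 16_000
audio = np.random.randn(10 * sr).astype(np.float32)
segments = split_audio(audio, sr, split_seconds=3.0)
print(len(segments))         # 3 full segments; the 1 s remainder is dropped
print(segments[0].shape[0])  # 48000 samples per segment
```

Each resulting segment would then be converted to an MFCC spectrogram image (e.g., with a library such as librosa) and fed to the CNN classifier as a single training or test example.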