{"title":"A Novel Spec-CNN-CTC Model for End-to-End Speech Recognition","authors":"Jing Xue, Jun Zhang","doi":"10.1145/3457682.3457703","DOIUrl":null,"url":null,"abstract":"This paper discusses the application of a special data augmentation approach for end-to-end phone recognition system on the Deep Neural Networks. The system improves the performance of phone recognition and alleviates overfitting during training. Also, it offers a solution to the problem of few public datasets annotated at the phone level. And we propose the CNN-CTC structure as a baseline model. The model is based on Convolutional Neural Networks (CNNs) and Connectionist Temporal Classification (CTC) objective function. Which is an end-to-end structure, and there is no need to force alignment each frame of audio. The SpecAugment approach directly processes the feature of audio, such as the log Mel-spectrogram. In our experiment, the Spec-CNN-CTC system achieves a phone error rate of 16.11% on TIMIT corpus with no prior linguistic information. Which is outperforming the previous work Acoustic-State-Transition Model (ASTM) by 27.63%, the DNN-HMM with MFCC + IFCC features by 16.8%, the RNN-CRF model by 17.3% and the DBM-DNN model by 22.62%.","PeriodicalId":142045,"journal":{"name":"2021 13th International Conference on Machine Learning and Computing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 13th International Conference on Machine Learning and Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3457682.3457703","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2
Abstract
This paper presents a data augmentation approach for an end-to-end phone recognition system built on deep neural networks. The augmentation improves phone recognition performance, alleviates overfitting during training, and helps compensate for the scarcity of public datasets annotated at the phone level. We propose a CNN-CTC structure as the baseline model, combining Convolutional Neural Networks (CNNs) with the Connectionist Temporal Classification (CTC) objective function; the structure is end-to-end, so no forced frame-level alignment of the audio is required. The SpecAugment approach operates directly on the audio features, such as the log Mel-spectrogram. In our experiments, the Spec-CNN-CTC system achieves a phone error rate of 16.11% on the TIMIT corpus with no prior linguistic information, outperforming the previous Acoustic-State-Transition Model (ASTM) by 27.63%, the DNN-HMM with MFCC + IFCC features by 16.8%, the RNN-CRF model by 17.3%, and the DBM-DNN model by 22.62%.
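To illustrate the kind of masking SpecAugment applies directly to a log Mel-spectrogram, the following is a minimal NumPy sketch. The function name, mask counts, and mask widths here are illustrative assumptions and are not the specific settings used in the paper.

```python
# Minimal sketch of SpecAugment-style frequency/time masking on a
# log Mel-spectrogram. Parameter values are assumptions for illustration.
import numpy as np

def spec_augment(log_mel, num_freq_masks=2, freq_mask_width=15,
                 num_time_masks=2, time_mask_width=40, rng=None):
    """Mask random Mel-channel bands and frame spans of a (num_mels, num_frames) array."""
    rng = np.random.default_rng() if rng is None else rng
    augmented = log_mel.copy()
    num_mels, num_frames = augmented.shape

    # Frequency masking: zero out f consecutive Mel channels.
    for _ in range(num_freq_masks):
        f = rng.integers(0, freq_mask_width + 1)
        f0 = rng.integers(0, max(1, num_mels - f))
        augmented[f0:f0 + f, :] = 0.0

    # Time masking: zero out t consecutive frames.
    for _ in range(num_time_masks):
        t = rng.integers(0, time_mask_width + 1)
        t0 = rng.integers(0, max(1, num_frames - t))
        augmented[:, t0:t0 + t] = 0.0

    return augmented

if __name__ == "__main__":
    # Example: an 80-bin log Mel-spectrogram with 300 frames.
    spectrogram = np.random.randn(80, 300).astype(np.float32)
    masked = spec_augment(spectrogram)
    print(masked.shape)  # (80, 300), with random bands and spans zeroed out
```

Because the masking is applied to the spectrogram rather than the raw waveform, it can be inserted between feature extraction and the CNN-CTC model without changing the recognition pipeline itself.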