A Novel Spec-CNN-CTC Model for End-to-End Speech Recognition

2021 13th International Conference on Machine Learning and Computing. Pub Date: 2021-02-26. DOI: 10.1145/3457682.3457703

Jing Xue, Jun Zhang


Citations: 2

Abstract




This paper discusses the application of a data augmentation approach to an end-to-end phone recognition system built on deep neural networks. The resulting system improves phone recognition performance and alleviates overfitting during training. It also offers a solution to the scarcity of public datasets annotated at the phone level. We propose the CNN-CTC structure as a baseline model. The model is based on Convolutional Neural Networks (CNNs) and the Connectionist Temporal Classification (CTC) objective function. It is an end-to-end structure, so there is no need to force-align each frame of audio. The SpecAugment approach operates directly on audio features such as the log Mel-spectrogram. In our experiments, the Spec-CNN-CTC system achieves a phone error rate of 16.11% on the TIMIT corpus with no prior linguistic information. This outperforms the previous Acoustic-State-Transition Model (ASTM) by 27.63%, the DNN-HMM with MFCC + IFCC features by 16.8%, the RNN-CRF model by 17.3%, and the DBM-DNN model by 22.62%.
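As a rough illustration of what operating directly on a log Mel-spectrogram can look like, the sketch below applies the two masking operations commonly associated with SpecAugment (one frequency mask and one time mask) to a spectrogram stored as a list of frames. The abstract does not say which SpecAugment operations or mask sizes the authors used, so the function name `spec_augment`, the parameters `max_freq_width` and `max_time_width`, their default values, and the fill value are all assumptions made for illustration; time warping is left out.

```python
import random


def spec_augment(log_mel, max_freq_width=8, max_time_width=20, fill=0.0, rng=None):
    """Return a masked copy of a log Mel-spectrogram.

    log_mel is a list of time frames, each a list of mel-bin values.
    All parameter names and defaults are hypothetical, not taken from the paper.
    """
    if not log_mel:
        return []
    rng = rng or random.Random()
    n_frames = len(log_mel)
    n_mels = len(log_mel[0])

    # Work on a copy so the original features stay untouched.
    out = [list(frame) for frame in log_mel]

    # Frequency mask: overwrite a band of consecutive mel channels in every frame.
    f = rng.randint(0, min(max_freq_width, n_mels))  # band width, may be 0
    f0 = rng.randint(0, n_mels - f)                  # first masked channel
    for frame in out:
        for m in range(f0, f0 + f):
            frame[m] = fill

    # Time mask: overwrite a run of consecutive frames across all channels.
    t = rng.randint(0, min(max_time_width, n_frames))  # run length, may be 0
    t0 = rng.randint(0, n_frames - t)                  # first masked frame
    for i in range(t0, t0 + t):
        out[i] = [fill] * n_mels

    return out


if __name__ == "__main__":
    # Toy input: 100 frames of 40 mel bins, all ones, so masked cells are easy to count.
    spec = [[1.0] * 40 for _ in range(100)]
    augmented = spec_augment(spec, rng=random.Random(0))
    masked_cells = sum(v == 0.0 for frame in augmented for v in frame)
    print(f"masked {masked_cells} of {100 * 40} cells")
```

Augmentation of this kind is normally applied only to training examples, with a fresh random mask drawn each time an utterance is seen. A fill value of 0.0 assumes the features were normalized to zero mean beforehand; otherwise the mean feature value would be a more natural choice.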

