{"title":"语音识别的联合瓶颈特征与注意模型","authors":"Long Xingyan, Qu Dan","doi":"10.1145/3208788.3208798","DOIUrl":null,"url":null,"abstract":"Recently, attention based sequence-to-sequence model become a research hotspot in speech recognition. The attention model has the problem of slow convergence and poor robustness. In this paper, a model that jointed a bottleneck feature extraction network and attention model is proposed. The model is composed of a Deep Belief Network as bottleneck feature extraction network and an attention-based encoder-decoder model. DBN can store the priori information from Hidden Markov Model so that increasing convergence speed of and enhancing both robustness and discrimination of features. Attention model utilizes the temporal information of feature sequence to calculate the posterior probability of phoneme. Then the number of stack recurrent neural network layers in attention model is reduced in order to decrease the calculation of gradient. Experiments in the TIMIT corpus showed that the phoneme error rate is 17.80% in test set, the average training iteration decreased 52%, and the number of training iterations decreased from 139 to 89. The word error rate of WSJ eval92 is 12.9% without any external language model.","PeriodicalId":211585,"journal":{"name":"Proceedings of 2018 International Conference on Mathematics and Artificial Intelligence","volume":"44 3","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Joint bottleneck feature and attention model for speech recognition\",\"authors\":\"Long Xingyan, Qu Dan\",\"doi\":\"10.1145/3208788.3208798\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently, attention based sequence-to-sequence model become a research hotspot in speech recognition. The attention model has the problem of slow convergence and poor robustness. In this paper, a model that jointed a bottleneck feature extraction network and attention model is proposed. The model is composed of a Deep Belief Network as bottleneck feature extraction network and an attention-based encoder-decoder model. DBN can store the priori information from Hidden Markov Model so that increasing convergence speed of and enhancing both robustness and discrimination of features. Attention model utilizes the temporal information of feature sequence to calculate the posterior probability of phoneme. Then the number of stack recurrent neural network layers in attention model is reduced in order to decrease the calculation of gradient. Experiments in the TIMIT corpus showed that the phoneme error rate is 17.80% in test set, the average training iteration decreased 52%, and the number of training iterations decreased from 139 to 89. 
The word error rate of WSJ eval92 is 12.9% without any external language model.\",\"PeriodicalId\":211585,\"journal\":{\"name\":\"Proceedings of 2018 International Conference on Mathematics and Artificial Intelligence\",\"volume\":\"44 3\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-04-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of 2018 International Conference on Mathematics and Artificial Intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3208788.3208798\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of 2018 International Conference on Mathematics and Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3208788.3208798","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Joint bottleneck feature and attention model for speech recognition
Recently, attention-based sequence-to-sequence models have become a research hotspot in speech recognition. However, attention models suffer from slow convergence and poor robustness. In this paper, a model that joins a bottleneck feature extraction network with an attention model is proposed. The model consists of a Deep Belief Network (DBN) serving as the bottleneck feature extraction network and an attention-based encoder-decoder. The DBN retains prior information from a Hidden Markov Model, which speeds up convergence and improves both the robustness and the discriminative power of the features. The attention model uses the temporal information of the feature sequence to compute the posterior probability of each phoneme. The number of stacked recurrent neural network layers in the attention model is then reduced to lower the cost of gradient computation. Experiments on the TIMIT corpus show a phoneme error rate of 17.80% on the test set; the average training iteration decreased by 52%, and the number of training iterations fell from 139 to 89. The word error rate on WSJ eval92 is 12.9% without any external language model.
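To make the described architecture concrete, the following is a minimal sketch of a joint bottleneck-feature plus attention encoder-decoder model. It is not the authors' implementation: the layer sizes, the plain feed-forward stack standing in for a pre-trained DBN, the single-layer GRU decoder with additive attention, and the vocabulary size are all illustrative assumptions.

```python
# Hedged sketch of the joint architecture: bottleneck feature extractor
# feeding an attention-based encoder-decoder. All dimensions are assumed.
import torch
import torch.nn as nn

class BottleneckExtractor(nn.Module):
    """Stand-in for the DBN bottleneck network (not a trained DBN):
    maps acoustic frames to low-dimensional bottleneck features."""
    def __init__(self, input_dim=40, hidden_dim=512, bottleneck_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, bottleneck_dim),   # bottleneck layer
        )

    def forward(self, frames):                       # (batch, time, input_dim)
        return self.net(frames)                      # (batch, time, bottleneck_dim)

class AttentionDecoder(nn.Module):
    """Single-layer GRU decoder with additive (content-based) attention
    over the encoder states."""
    def __init__(self, enc_dim, dec_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dec_dim)
        self.attn_score = nn.Linear(enc_dim + dec_dim, 1)
        self.rnn = nn.GRUCell(enc_dim + dec_dim, dec_dim)
        self.out = nn.Linear(dec_dim, vocab_size)

    def forward(self, enc_states, targets):
        batch, T, _ = enc_states.shape
        state = enc_states.new_zeros(batch, self.rnn.hidden_size)
        logits = []
        for t in range(targets.size(1)):
            # Additive attention: score each encoder frame against the state.
            query = state.unsqueeze(1).expand(-1, T, -1)
            scores = self.attn_score(torch.cat([enc_states, query], dim=-1))
            weights = torch.softmax(scores, dim=1)           # (batch, T, 1)
            context = (weights * enc_states).sum(dim=1)      # (batch, enc_dim)
            emb = self.embed(targets[:, t])
            state = self.rnn(torch.cat([context, emb], dim=-1), state)
            logits.append(self.out(state))
        return torch.stack(logits, dim=1)                    # (batch, U, vocab)

class JointBottleneckAttentionASR(nn.Module):
    """Bottleneck extractor -> recurrent encoder -> attention decoder."""
    def __init__(self, input_dim=40, vocab_size=62):
        super().__init__()
        self.bottleneck = BottleneckExtractor(input_dim=input_dim)
        self.encoder = nn.LSTM(64, 256, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.decoder = AttentionDecoder(enc_dim=512, dec_dim=256,
                                        vocab_size=vocab_size)

    def forward(self, frames, targets):
        feats = self.bottleneck(frames)            # bottleneck features
        enc_states, _ = self.encoder(feats)        # temporal encoding
        return self.decoder(enc_states, targets)   # phoneme posteriors

if __name__ == "__main__":
    model = JointBottleneckAttentionASR()
    frames = torch.randn(2, 100, 40)           # 2 utterances, 100 frames of 40-dim features
    targets = torch.randint(0, 62, (2, 20))    # previous labels for teacher forcing
    print(model(frames, targets).shape)        # torch.Size([2, 20, 62])
```

In this sketch, shrinking the decoder to a single recurrent layer mirrors the paper's idea of reducing the number of stacked recurrent layers to cut the cost of gradient computation, while the bottleneck extractor plays the role of the DBN front end that supplies compact, discriminative features to the encoder.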