Authors: Imad Burhan Kadhim, Mahdi Fadil Khaleel, Zuhair Shakor Mahmood, Ali Nasret Najdet Coran
DOI: 10.1109/ASIANCON55314.2022.9908930
Published in: 2022 2nd Asian Conference on Innovation in Technology (ASIANCON), 2022-08-26
Reinforcement Learning for Speech Recognition using Recurrent Neural Networks
This work describes a voice recognition system that converts audio input directly to text, without an intermediate phonetic representation. The system combines the Connectionist Temporal Classification (CTC) objective function with a deep bidirectional LSTM recurrent neural network architecture. A new training method is proposed in which the network is trained to minimize the expectation of an arbitrary transcription loss function. Because no lexicon or language model is required, this permits direct optimization of the word error rate (WER). The system achieves a WER of 22 percent with no lexicon, 20 percent with only a lexicon of allowed words, and 9 percent with a trigram language model. The error rate drops to 7 percent when the network is combined with a baseline system.
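The CTC objective mentioned in the abstract scores a transcription by summing the probabilities of every frame-level alignment that collapses to it (repeated labels merged, blanks removed), computed with a forward recursion over a blank-interleaved label sequence. The sketch below is a minimal pure-Python illustration of that forward pass, not the paper's implementation; the label ids and blank index are illustrative assumptions.

```python
def ctc_forward_prob(frame_probs, target, blank=0):
    """Total probability, under CTC, that the per-frame output
    distributions collapse to `target`, summed over all alignments.

    frame_probs: list of per-frame probability lists, indexed by label id.
    target: list of non-blank label ids.
    """
    # Interleave blanks around the target: [a] -> [blank, a, blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S = len(ext)

    # alpha[s] = probability of all alignment prefixes ending at ext[s]
    alpha = [0.0] * S
    alpha[0] = frame_probs[0][blank]        # start on the leading blank
    if S > 1:
        alpha[1] = frame_probs[0][ext[1]]   # or on the first real label

    for t in range(1, len(frame_probs)):
        prev = alpha
        alpha = [0.0] * S
        for s in range(S):
            a = prev[s]                      # stay on the same position
            if s >= 1:
                a += prev[s - 1]             # advance by one
            # Skipping a blank is allowed only between distinct labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += prev[s - 2]
            alpha[s] = a * frame_probs[t][ext[s]]

    # Valid alignments end on the last label or the trailing blank
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)
```

For example, with two frames each assigning probability 0.5 to blank (id 0) and 0.5 to a single symbol (id 1), the three length-2 paths that collapse to that symbol are (1,1), (1,0), and (0,1), each with probability 0.25, so `ctc_forward_prob([[0.5, 0.5], [0.5, 0.5]], [1])` returns 0.75. Training maximizes the log of this quantity; the paper's novelty is then replacing it with a sampled expected-loss objective so that WER can be optimized directly.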