O. Mamyrbayev, Dina Oralbekova, A. Kydyrbekova, Tolganay Turdalykyzy, A. Bekarystankyzy
Title: End-to-End Model Based on RNN-T for Kazakh Speech Recognition
Published in: 2021 3rd International Conference on Computer Communication and the Internet (ICCCI)
Publication date: 2021-06-25
DOI: 10.1109/ICCCI51764.2021.9486811 (https://doi.org/10.1109/ICCCI51764.2021.9486811)
Citations: 7
Abstract
Automatic speech recognition is a rapidly developing area of machine learning. The most popular speech recognition systems today are end-to-end systems, particularly online end-to-end models, which directly output a sequence of words from the input audio in real time. Streaming speech recognition passes the audio stream to speech-to-text conversion and returns the recognition results in real time as the audio is processed. This article discusses and implements a popular RNN-T-based model for recognizing Kazakh speech. It also analyzes prior work on Kazakh speech recognition based on the CTC model. The findings demonstrate that an RNN-T-based model can work well without additional components such as a language model, and that it showed the best outcome on our dataset. The system reached a character error rate (CER) of 10.6%, which is the best result among end-to-end systems for recognizing Kazakh speech.
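For readers unfamiliar with the metric, the sketch below shows one common way to compute CER: the character-level Levenshtein (edit) distance between reference and hypothesis transcripts, divided by the total number of reference characters. This is an illustration under stated assumptions, not the authors' evaluation code; the function names `edit_distance` and `cer` are hypothetical, and the abstract does not say whether the reported figure used any text normalization (casing, punctuation, whitespace) before scoring.

```python
from __future__ import annotations


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def cer(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level CER: total character edits / total reference characters.

    Assumption: no normalization is applied; strings are compared as given.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


if __name__ == "__main__":
    # One substituted character out of five reference characters.
    print(cer(["сәлем"], ["салем"]))
```

Under this definition, a CER of 10.6% corresponds to roughly 106 character edits per 1,000 reference characters. Because insertions are counted, CER can exceed 100% for very poor hypotheses.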