Encoder-decoder models for recognition of Russian speech

Informatsionno-Upravliaiushchie Sistemy (JCR Q3, Mathematics). Pub Date: 2019-10-04. DOI: 10.31799/1684-8853-2019-4-45-53

Nikita Markovnikov, I. Kipyatkova


Citations: 1

Abstract

Problem: Classical automatic speech recognition systems are traditionally built from an acoustic model based on hidden Markov models and a statistical language model. Such systems achieve high recognition accuracy but consist of several independent, complex parts, which can cause problems when building models. Recently, end-to-end recognition using deep artificial neural networks has become widespread. This approach makes it easy to implement a model with just one neural network. End-to-end models often demonstrate better performance in terms of speed and accuracy of speech recognition.

Purpose: To implement end-to-end models for the recognition of continuous Russian speech, tune them, and compare them with hybrid baseline models in terms of recognition accuracy and computational characteristics such as training and decoding speed.

Methods: Creating an encoder-decoder speech recognition model with an attention mechanism; applying stabilization and regularization techniques for neural networks; augmenting the training data; using parts of words as the output units of the neural network.

Results: An encoder-decoder model with an attention mechanism was obtained for recognizing continuous Russian speech without feature extraction or a language model. Parts of words from the training set were used as elements of the output sequence. The resulting model could not surpass the baseline hybrid models, but it surpassed the other baseline end-to-end models in both recognition accuracy and decoding/training speed. The word error rate was 24.17% and decoding ran at 0.3 of real time, which is 6% faster than the baseline end-to-end model and 46% faster than the baseline hybrid model. We showed that end-to-end models can work without language models for Russian while decoding faster than hybrid models. The resulting model was trained on raw data without extracting any features. We found that, for Russian, a hybrid attention mechanism gives the best result compared with location-based or context-based attention mechanisms.

Practical relevance: The resulting models require less memory and less speech decoding time than traditional hybrid models, which can allow them to run locally on mobile devices without computation on remote servers.
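The two headline metrics in the results, word recognition error and decoding speed relative to real time, can be illustrated with a minimal Python sketch. It assumes the conventional definitions: word error rate as word-level edit distance divided by the number of reference words, and real-time factor as decoding time divided by audio duration. The abstract does not spell out the authors' scoring procedure, so the function names `word_error_rate` and `real_time_factor` and the sample inputs are hypothetical.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(reference)


def real_time_factor(decode_seconds: float, audio_seconds: float) -> float:
    """Decoding time per second of audio; values below 1 are faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return decode_seconds / audio_seconds


if __name__ == "__main__":
    # Hypothetical example: one substitution and one deletion in four words.
    print(word_error_rate("a b c d".split(), "a x c".split()))
    # Hypothetical example: 3 s of decoding for 10 s of audio.
    print(real_time_factor(3.0, 10.0))
```

Under these assumed definitions, a decoding speed of 0.3 of real time means roughly 3 seconds of computation per 10 seconds of audio.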

Source journal

Informatsionno-Upravliaiushchie Sistemy (Mathematics: Control and Optimization)

CiteScore

1.40

Self-citation rate

0.00%

Articles published