Arabic Speech Recognition Based on Encoder-Decoder Architecture of Transformer

Journal of Biomolecular Techniques (Q4; Biochemistry, Genetics and Molecular Biology) | Pub Date: 2023-03-21 | DOI: 10.51173/jt.v5i1.749

Mohanad Sameer, Ahmed Talib, Alla Hussein


Citations: 0

Abstract

Recognizing and transcribing human speech has become an increasingly important task. Recently, researchers have shown growing interest in automatic speech recognition (ASR) using end-to-end models. Previous architectures for Arabic ASR have included time-delay neural networks, recurrent neural networks (RNN), and long short-term memory (LSTM) networks. Previous end-to-end approaches have suffered from slow training and inference because of limited training parallelization, and they require a large amount of data to achieve acceptable results in recognizing Arabic speech. This research presents an Arabic speech recognition system based on a transformer encoder-decoder architecture with self-attention that transcribes Arabic speech segments into text and can be trained faster and more efficiently. The proposed model outperforms previous end-to-end approaches on the Common Voice dataset from Mozilla. We introduce a speech-transformer model that was trained for 110 epochs using only 112 hours of speech. Although Arabic is considered one of the more difficult languages for speech recognition systems, we achieved a best word error rate (WER) of 3.2, compared with other systems whose training requires a very large amount of data. The proposed system was evaluated on the Common Voice 8.0 dataset without a language model.
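The abstract reports its result as a word error rate. For readers unfamiliar with the metric, the following is a minimal sketch of the standard definition, WER = (S + D + I) / N, where S, D, and I are the word substitutions, deletions, and insertions needed to turn the reference into the hypothesis, and N is the number of reference words. The function name `word_error_rate` is hypothetical, and this is not the authors' scoring script: any text normalization they applied before scoring (for example, handling of Arabic diacritics) is not described in the abstract and is not modeled here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (S + D + I) / N using word-level edit distance.

    Assumption: words are separated by whitespace and compared exactly;
    no normalization (case, punctuation, diacritics) is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far (before the current one) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (b -> x) and one deletion (d) over 4 reference words.
    print(word_error_rate("a b c d", "a x c"))
```

The function returns a fraction; multiply by 100 to express it as a percentage. Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.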




Source journal

Journal of Biomolecular Techniques (Biochemistry, Genetics and Molecular Biology: Molecular Biology)

CiteScore

2.50

Self-citation rate

0.00%

Journal description: The Journal of Biomolecular Techniques is a peer-reviewed publication issued five times a year by the Association of Biomolecular Resource Facilities. The Journal was established to promote the central role biotechnology plays in contemporary research activities, to disseminate information among biomolecular resource facilities, and to communicate the biotechnology research conducted by the Association’s Research Groups and members, as well as other investigators.