On Comparison of Encoders for Attention based End to End Speech Recognition in Standalone and Rescoring Mode

Raviraj Joshi, Subodh Kumar
{"title":"On Comparison of Encoders for Attention based End to End Speech Recognition in Standalone and Rescoring Mode","authors":"Raviraj Joshi, Subodh Kumar","doi":"10.1109/SPCOM55316.2022.9840823","DOIUrl":null,"url":null,"abstract":"The streaming automatic speech recognition (ASR) models are more popular and suitable for voice-based applications. However, non-streaming models provide better performance as they look at the entire audio context. To leverage the benefits of the non-streaming model in streaming applications like voice search, it is commonly used in second pass re-scoring mode. The candidate hypothesis generated using steaming models is re-scored using a non-streaming model.In this work, we evaluate the non-streaming attention-based end-to-end ASR models on the Flipkart voice search task in both standalone and re-scoring modes. These models are based on Listen-Attend-Spell (LAS) encoder-decoder architecture. We experiment with different encoder variations based on LSTM, Transformer, and Conformer. We compare the latency requirements of these models along with their performance. Overall we show that the Transformer model offers acceptable WER with the lowest latency requirements. We report a relative WER improvement of around 16% with the second pass LAS rescoring with latency overhead under 5ms. We also highlight the importance of CNN front-end with Transformer architecture to achieve comparable word error rates (WER). Moreover, we observe that in the second pass re-scoring mode all the encoders provide similar benefits whereas the difference in performance is prominent in standalone text generation mode.","PeriodicalId":246982,"journal":{"name":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","volume":"128 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SPCOM55316.2022.9840823","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Streaming automatic speech recognition (ASR) models are more popular and better suited to voice-based applications. However, non-streaming models provide better performance because they attend to the entire audio context. To leverage the benefits of a non-streaming model in streaming applications such as voice search, it is commonly used in a second-pass re-scoring mode: the candidate hypotheses generated by a streaming model are re-scored using a non-streaming model. In this work, we evaluate non-streaming attention-based end-to-end ASR models on the Flipkart voice search task in both standalone and re-scoring modes. These models are based on the Listen-Attend-Spell (LAS) encoder-decoder architecture. We experiment with different encoder variants based on LSTM, Transformer, and Conformer, and compare the latency requirements of these models along with their performance. Overall, we show that the Transformer model offers acceptable WER with the lowest latency requirements. We report a relative WER improvement of around 16% with second-pass LAS re-scoring at a latency overhead of under 5 ms. We also highlight the importance of a CNN front-end with the Transformer architecture for achieving comparable word error rates (WER). Moreover, we observe that in second-pass re-scoring mode all the encoders provide similar benefits, whereas the difference in performance is prominent in standalone text generation mode.
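To make the second-pass setup described above concrete, the sketch below shows one way an N-best list from a streaming first pass could be re-ranked with a non-streaming LAS model. This is a minimal illustration, not the authors' implementation: the `encoder` and `decoder` modules, the hypothesis format, and the interpolation weight are assumptions made for the example.

```python
# Minimal sketch of second-pass LAS re-scoring (illustrative, not the paper's code).
# Assumptions: `encoder(features)` returns full-context encoder states, and
# `decoder(enc_out, tokens)` returns per-position log-probabilities over the
# vocabulary for the given hypothesis tokens (start-token shifting handled inside).
import torch


def rescore_nbest(encoder, decoder, audio_features, nbest, weight=0.5):
    """Re-rank first-pass hypotheses with a non-streaming LAS model.

    audio_features: (1, T, F) features for the full utterance, so the
                    second pass sees the entire audio context.
    nbest: list of (token_ids, first_pass_score) pairs from the streaming model.
    weight: interpolation weight between first- and second-pass scores.
    """
    with torch.no_grad():
        enc_out = encoder(audio_features)                    # (1, T', D)
        rescored = []
        for token_ids, first_pass_score in nbest:
            tokens = torch.tensor(token_ids).unsqueeze(0)    # (1, U)
            log_probs = decoder(enc_out, tokens)             # (1, U, V)
            # Sum the log-probability assigned to each hypothesis token.
            las_score = log_probs.gather(
                -1, tokens.unsqueeze(-1)).squeeze(-1).sum().item()
            combined = (1 - weight) * first_pass_score + weight * las_score
            rescored.append((token_ids, combined))
    # Hypotheses sorted by the interpolated score, best first.
    return sorted(rescored, key=lambda x: x[1], reverse=True)
```

Because the second pass only scores a small number of short hypotheses against an already-computed encoder output, its cost can remain low, which is consistent with the under-5 ms latency overhead reported in the abstract.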