Evaluation of real-time transcriptions using end-to-end ASR models

Carlos Arriaga, Alejandro Pozo, Javier Conde, Alvaro Alonso
{"title":"Evaluation of real-time transcriptions using end-to-end ASR models","authors":"Carlos Arriaga, Alejandro Pozo, Javier Conde, Alvaro Alonso","doi":"arxiv-2409.05674","DOIUrl":null,"url":null,"abstract":"Automatic Speech Recognition (ASR) or Speech-to-text (STT) has greatly\nevolved in the last few years. Traditional architectures based on pipelines\nhave been replaced by joint end-to-end (E2E) architectures that simplify and\nstreamline the model training process. In addition, new AI training methods,\nsuch as weak-supervised learning have reduced the need for high-quality audio\ndatasets for model training. However, despite all these advancements, little to\nno research has been done on real-time transcription. In real-time scenarios,\nthe audio is not pre-recorded, and the input audio must be fragmented to be\nprocessed by the ASR systems. To achieve real-time requirements, these\nfragments must be as short as possible to reduce latency. However, audio cannot\nbe split at any point as dividing an utterance into two separate fragments will\ngenerate an incorrect transcription. Also, shorter fragments provide less\ncontext for the ASR model. For this reason, it is necessary to design and test\ndifferent splitting algorithms to optimize the quality and delay of the\nresulting transcription. In this paper, three audio splitting algorithms are\nevaluated with different ASR models to determine their impact on both the\nquality of the transcription and the end-to-end delay. The algorithms are\nfragmentation at fixed intervals, voice activity detection (VAD), and\nfragmentation with feedback. The results are compared to the performance of the\nsame model, without audio fragmentation, to determine the effects of this\ndivision. The results show that VAD fragmentation provides the best quality\nwith the highest delay, whereas fragmentation at fixed intervals provides the\nlowest quality and the lowest delay. The newly proposed feedback algorithm\nexchanges a 2-4% increase in WER for a reduction of 1.5-2s delay, respectively,\nto the VAD splitting.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.05674","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Automatic Speech Recognition (ASR), or Speech-to-Text (STT), has evolved greatly in the last few years. Traditional pipeline-based architectures have been replaced by joint end-to-end (E2E) architectures that simplify and streamline the model training process. In addition, new training methods, such as weakly supervised learning, have reduced the need for high-quality audio datasets for model training. Despite all these advancements, however, little to no research has been done on real-time transcription. In real-time scenarios the audio is not pre-recorded, and the input audio must be fragmented before it can be processed by the ASR system. To meet real-time requirements, these fragments must be as short as possible to reduce latency. However, the audio cannot be split at an arbitrary point, since dividing an utterance into two separate fragments produces an incorrect transcription, and shorter fragments also provide less context for the ASR model. For this reason, it is necessary to design and test different splitting algorithms to optimize the quality and delay of the resulting transcription. In this paper, three audio splitting algorithms are evaluated with different ASR models to determine their impact on both transcription quality and end-to-end delay: fragmentation at fixed intervals, voice activity detection (VAD), and fragmentation with feedback. The results are compared with the performance of the same models without audio fragmentation to determine the effects of this division. They show that VAD fragmentation provides the best quality at the highest delay, whereas fragmentation at fixed intervals provides the lowest quality and the lowest delay. The newly proposed feedback algorithm trades a 2-4% increase in WER for a 1.5-2 s reduction in delay relative to VAD splitting.
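The abstract does not include code, so the sketch below is only a minimal illustration of the first two splitting strategies it names: fixed-interval fragmentation and VAD-style fragmentation. The function names, chunk lengths, and thresholds are illustrative assumptions, and a simple frame-energy heuristic stands in for a real VAD model; the paper's actual algorithms, including fragmentation with feedback, are not reproduced here.

```python
import numpy as np

def split_fixed(audio: np.ndarray, sr: int, chunk_s: float = 2.0):
    """Fixed-interval fragmentation: cut every chunk_s seconds,
    regardless of whether a word is being spoken at the boundary."""
    step = int(chunk_s * sr)
    return [audio[i:i + step] for i in range(0, len(audio), step)]

def split_energy_vad(audio: np.ndarray, sr: int, frame_s: float = 0.03,
                     threshold: float = 1e-3, min_silence_frames: int = 10):
    """Toy VAD-style splitter: cut only after a run of low-energy frames,
    so utterances are less likely to be divided mid-word. A real system
    would use a trained VAD model rather than this energy threshold."""
    frame = int(frame_s * sr)
    fragments, start, silent = [], 0, 0
    for i in range(0, len(audio) - frame + 1, frame):
        energy = float(np.mean(audio[i:i + frame] ** 2))
        silent = silent + 1 if energy < threshold else 0
        # close the current fragment once enough consecutive silent frames are seen
        if silent >= min_silence_frames:
            fragments.append(audio[start:i + frame])
            start, silent = i + frame, 0
    if start < len(audio):
        fragments.append(audio[start:])  # trailing audio after the last cut
    return fragments
```

The trade-off described in the abstract follows directly from these shapes: fixed-interval splitting bounds the delay at the chunk length but can cut through words, while silence-based splitting waits for a pause, which preserves utterances but delays the transcription until a pause occurs.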