Evaluation of real-time transcriptions using end-to-end ASR models

Carlos Arriaga, Alejandro Pozo, Javier Conde, Alvaro Alonso
{"title":"Evaluation of real-time transcriptions using end-to-end ASR models","authors":"Carlos Arriaga, Alejandro Pozo, Javier Conde, Alvaro Alonso","doi":"arxiv-2409.05674","DOIUrl":null,"url":null,"abstract":"Automatic Speech Recognition (ASR) or Speech-to-text (STT) has greatly\nevolved in the last few years. Traditional architectures based on pipelines\nhave been replaced by joint end-to-end (E2E) architectures that simplify and\nstreamline the model training process. In addition, new AI training methods,\nsuch as weak-supervised learning have reduced the need for high-quality audio\ndatasets for model training. However, despite all these advancements, little to\nno research has been done on real-time transcription. In real-time scenarios,\nthe audio is not pre-recorded, and the input audio must be fragmented to be\nprocessed by the ASR systems. To achieve real-time requirements, these\nfragments must be as short as possible to reduce latency. However, audio cannot\nbe split at any point as dividing an utterance into two separate fragments will\ngenerate an incorrect transcription. Also, shorter fragments provide less\ncontext for the ASR model. For this reason, it is necessary to design and test\ndifferent splitting algorithms to optimize the quality and delay of the\nresulting transcription. In this paper, three audio splitting algorithms are\nevaluated with different ASR models to determine their impact on both the\nquality of the transcription and the end-to-end delay. The algorithms are\nfragmentation at fixed intervals, voice activity detection (VAD), and\nfragmentation with feedback. The results are compared to the performance of the\nsame model, without audio fragmentation, to determine the effects of this\ndivision. The results show that VAD fragmentation provides the best quality\nwith the highest delay, whereas fragmentation at fixed intervals provides the\nlowest quality and the lowest delay. The newly proposed feedback algorithm\nexchanges a 2-4% increase in WER for a reduction of 1.5-2s delay, respectively,\nto the VAD splitting.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.05674","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Automatic Speech Recognition (ASR), or Speech-to-Text (STT), has evolved greatly in the last few years. Traditional pipeline-based architectures have been replaced by joint end-to-end (E2E) architectures that simplify and streamline the model training process. In addition, new training methods, such as weakly supervised learning, have reduced the need for high-quality audio datasets for model training. Despite all these advancements, however, little to no research has been done on real-time transcription. In real-time scenarios the audio is not pre-recorded, and the input audio must be fragmented before it can be processed by the ASR system. To meet real-time requirements, these fragments must be as short as possible to reduce latency. However, the audio cannot be split at an arbitrary point, since dividing an utterance into two separate fragments produces an incorrect transcription, and shorter fragments also provide less context for the ASR model. For this reason, it is necessary to design and test different splitting algorithms to optimize the quality and delay of the resulting transcription. In this paper, three audio splitting algorithms are evaluated with different ASR models to determine their impact on both transcription quality and end-to-end delay: fragmentation at fixed intervals, voice activity detection (VAD), and fragmentation with feedback. The results are compared with the performance of the same models without audio fragmentation to determine the effects of this division. They show that VAD fragmentation provides the best quality at the highest delay, whereas fragmentation at fixed intervals provides the lowest quality and the lowest delay. The newly proposed feedback algorithm trades a 2-4% increase in WER for a 1.5-2 s reduction in delay relative to VAD splitting.
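The abstract does not include code, so the sketch below is only a minimal illustration of the first two splitting strategies it names: fixed-interval fragmentation and VAD-style fragmentation. The function names, chunk lengths, and thresholds are illustrative assumptions, and a simple frame-energy heuristic stands in for a real VAD model; the paper's actual algorithms, including fragmentation with feedback, are not reproduced here.

```python
import numpy as np

def split_fixed(audio: np.ndarray, sr: int, chunk_s: float = 2.0):
    """Fixed-interval fragmentation: cut every chunk_s seconds,
    regardless of whether a word is being spoken at the boundary."""
    step = int(chunk_s * sr)
    return [audio[i:i + step] for i in range(0, len(audio), step)]

def split_energy_vad(audio: np.ndarray, sr: int, frame_s: float = 0.03,
                     threshold: float = 1e-3, min_silence_frames: int = 10):
    """Toy VAD-style splitter: cut only after a run of low-energy frames,
    so utterances are less likely to be divided mid-word. A real system
    would use a trained VAD model rather than this energy threshold."""
    frame = int(frame_s * sr)
    fragments, start, silent = [], 0, 0
    for i in range(0, len(audio) - frame + 1, frame):
        energy = float(np.mean(audio[i:i + frame] ** 2))
        silent = silent + 1 if energy < threshold else 0
        # close the current fragment once enough consecutive silent frames are seen
        if silent >= min_silence_frames:
            fragments.append(audio[start:i + frame])
            start, silent = i + frame, 0
    if start < len(audio):
        fragments.append(audio[start:])  # trailing audio after the last cut
    return fragments
```

The trade-off described in the abstract follows directly from these shapes: fixed-interval splitting bounds the delay at the chunk length but can cut through words, while silence-based splitting waits for a pause, which preserves utterances but delays the transcription until a pause occurs.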