Title: Proper Error Estimation and Calibration for Attention-Based Encoder-Decoder Models
Authors: Mun-Hak Lee; Joon-Hyuk Chang
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 4919-4930
DOI: 10.1109/TASLP.2024.3492799
Published: 2024-11-06
Citations: 0
Abstract
An attention-based automatic speech recognition (ASR) model generates a probability distribution over the token set at each time step. Recent studies have shown that calibration errors exist in the output probability distributions of attention-based ASR models trained to minimize the negative log-likelihood. This study analyzes the causes of calibration errors in ASR model outputs and their impact on model performance. Based on this analysis, we argue that conventional methods for estimating calibration errors at the token level are unsuitable for ASR tasks. Accordingly, we propose a new calibration measure that estimates the calibration error at the sequence level. Moreover, we present a new post-hoc calibration function and training objective to mitigate the calibration error of the ASR model at the sequence level. Through experiments on ASR benchmarks, we show that the proposed methods effectively alleviate the calibration error of the ASR model and improve its generalization performance.
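The paper's exact sequence-level measure is not reproduced in this abstract, but the underlying idea can be illustrated with a minimal sketch: score each hypothesis with a sequence-level confidence (here, simply the product of per-token probabilities), then bin hypotheses by confidence and compare mean confidence against the fraction of exactly-correct transcriptions, in the style of a binned expected calibration error. The function names and binning scheme below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def sequence_confidence(token_probs):
    # Illustrative sequence-level confidence: the product of the
    # per-token probabilities assigned to the decoded hypothesis.
    return float(np.prod(token_probs))


def sequence_ece(confidences, correct, n_bins=10):
    # Binned expected calibration error computed over whole sequences:
    # within each confidence bin, compare the mean confidence with the
    # empirical fraction of exactly-correct transcriptions, and weight
    # each bin by the fraction of hypotheses it contains.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        accuracy = correct[mask].mean()      # fraction correct in bin
        avg_conf = confidences[mask].mean()  # mean confidence in bin
        ece += mask.mean() * abs(accuracy - avg_conf)
    return ece
```

A perfectly calibrated model would place, e.g., hypotheses with confidence 0.5 in a bin where exactly half are correct, yielding an ECE of zero; overconfident ASR outputs inflate the gap between `avg_conf` and `accuracy`.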
Journal Introduction:
The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.