Listening and seeing again: Generative error correction for audio-visual speech recognition

IF 15.5 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Information Fusion | Pub Date: 2025-08-01 | Epub Date: 2025-03-15 | DOI: 10.1016/j.inffus.2025.103077

Rui Liu, Hongyu Yuan, Guanglai Gao, Haizhou Li

Information Fusion, volume 120, Article 103077. Journal Article. https://www.sciencedirect.com/science/article/pii/S1566253525001502

Citations: 0

Abstract

Unlike traditional Automatic Speech Recognition (ASR), Audio-Visual Speech Recognition (AVSR) takes audio and visual signals simultaneously to infer the transcription. Recent studies have shown that Large Language Models (LLMs) can be used effectively for Generative Error Correction (GER) in ASR by predicting the best transcription from ASR-generated N-best hypotheses. However, these LLMs cannot understand audio and visual signals simultaneously, which makes the GER approach difficult to apply to AVSR. In this work, we propose a novel GER paradigm for AVSR, termed AVGER, that follows the concept of “listening and seeing again”. Specifically, we first use a powerful AVSR system to read the audio and visual signals and produce the N-best hypotheses. We then use a Q-former-based Multimodal Synchronous Encoder to read the audio and visual information again and convert each into a compressed representation (one for audio, one for video) that the LLM can understand. The audio-visual compressed representations and the N-best hypotheses together constitute a Cross-modal Prompt that guides the LLM in producing the best transcription. In addition, we propose a Multi-Level Consistency Constraint training criterion, covering the logits level, the utterance level, and the representations level, to improve correction accuracy while enhancing the interpretability of the audio and visual compressed representations. Experimental results on the LRS3 dataset show that our method outperforms current mainstream AVSR systems: the proposed AVGER reduces the Word Error Rate (WER) by 27.59% compared with these systems. Code and models can be found at: https://github.com/AI-S2-Lab/AVGER.
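The abstract reports its result as a WER reduction, so a small worked example of the metric may help. The sketch below computes WER as the word-level edit distance divided by the number of reference words, and then a relative reduction between two WER values. It is not taken from the AVGER repository: the function names are hypothetical, the sample sentences and WER values are made up for illustration, and treating the reported 27.59% as a relative reduction is an assumption that the abstract does not state explicitly.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two lists of words."""
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def relative_wer_reduction(baseline_wer, corrected_wer):
    """Fraction by which the corrected WER is lower than the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - corrected_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative sentences only, not drawn from LRS3.
    reference = "the cat sat on the mat"
    hypothesis = "the cat sat on a mat"
    print(f"WER: {wer(reference, hypothesis):.4f}")

    # Illustrative WER values only, not results from the paper.
    print(f"Relative reduction: {relative_wer_reduction(0.020, 0.015):.2%}")
```

Under this reading, a baseline WER of 2.0% corrected to 1.5% would count as a 25% reduction, even though the absolute difference is only 0.5 percentage points.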

Source journal

Information Fusion (Engineering & Technology: Computer Science, Theory & Methods)

CiteScore

33.20

Self-citation rate

4.30%

Articles published

161

Review time

7.9 months

About the journal: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.