Look, listen, and decode: Multimodal speech recognition with images

2016 IEEE Spoken Language Technology Workshop (SLT). Pub Date: 2016-12-01. DOI: 10.1109/SLT.2016.7846320

Felix Sun, David F. Harwath, James R. Glass

Citations: 27

Abstract

In this paper, we introduce a multimodal speech recognition scenario, in which an image provides contextual information for a spoken caption to be decoded. We investigate a lattice rescoring algorithm that integrates information from the image at two different points: the image is used to augment the language model with the most likely words, and to rescore the top hypotheses using a word-level RNN. This rescoring mechanism decreases the word error rate by 3 absolute percentage points, compared to a baseline speech recognizer operating with only the speech recording.
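The two integration points described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it rescores an n-best list rather than a lattice, uses a unigram language model, and takes a caller-supplied `image_scorer` as a stand-in for the word-level RNN. All function names, weights, and the toy vocabulary are hypothetical and chosen only for illustration.

```python
import math


def boost_unigram_lm(base_lm, image_words, weight=0.3):
    """Interpolate a base unigram LM with a uniform distribution over
    words predicted from the image (assumed to be given as a set)."""
    image_p = 1.0 / len(image_words) if image_words else 0.0
    vocab = set(base_lm) | set(image_words)
    return {
        w: (1 - weight) * base_lm.get(w, 0.0)
        + weight * (image_p if w in image_words else 0.0)
        for w in vocab
    }


def lm_logprob(lm, words, floor=1e-6):
    """Sum of unigram log-probabilities, with a floor for unseen words."""
    return sum(math.log(max(lm.get(w, 0.0), floor)) for w in words)


def rescore(nbest, lm, image_scorer, lm_weight=1.0, image_weight=1.0):
    """Rank hypotheses by acoustic + LM + image-conditioned score.

    nbest: list of (words, acoustic_logprob) pairs.
    image_scorer: callable mapping a word list to a log-score; here it
    stands in for an image-conditioned word-level RNN.
    Returns the word lists sorted best first.
    """
    scored = []
    for words, acoustic in nbest:
        total = (
            acoustic
            + lm_weight * lm_logprob(lm, words)
            + image_weight * image_scorer(words)
        )
        scored.append((total, words))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [words for _, words in scored]


# Toy example: the acoustics slightly prefer "dock", the image suggests "dog".
base_lm = {"a": 0.4, "dog": 0.1, "dock": 0.1, "runs": 0.2, "on": 0.1, "grass": 0.1}
image_words = {"dog", "grass"}
nbest = [(["a", "dock", "runs"], -10.0), (["a", "dog", "runs"], -10.2)]

no_image = lambda words: 0.0  # baseline: no image information at all
overlap = lambda words: float(sum(w in image_words for w in words))

baseline_best = rescore(nbest, base_lm, no_image)[0]
boosted_lm = boost_unigram_lm(base_lm, image_words)
multimodal_best = rescore(nbest, boosted_lm, overlap)[0]
```

In this toy setup the speech-only ranking keeps "dock", while the image-boosted language model and the image-aware scorer together move "dog" to the top. The interpolation weight and score weights would need tuning on held-out data in any real system.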


