Speaker-Independent Visual Speech Recognition with the Inception V3 Model

2021 IEEE Spoken Language Technology Workshop (SLT). Pub Date: 2021-01-19. DOI: 10.1109/SLT48900.2021.9383540

Timothy Israel Santos, Andrew Abel, N. Wilson, Yan Xu

Citations: 4

Abstract

The natural process of understanding speech involves combining auditory and visual cues. CNN-based lip reading systems have become very popular in recent years. However, many of these systems treat lip reading as a black-box problem, with limited detailed performance analysis. In this paper, we perform transfer learning by training the Inception v3 CNN model, initialised with weights pre-trained on ImageNet, on the GRID corpus, delivering good speech recognition results, with 0.61 precision, 0.53 recall, and 0.51 F1-score. The lip reading model was able to automatically learn pertinent features, as demonstrated using visualisation, and to achieve speaker-independent results comparable to human lip readers on the GRID corpus. We also identify limitations that match those of humans, which therefore limit potential deep learning performance in real-world situations.
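
The reported F1-score (0.51) is lower than the harmonic mean of the reported precision and recall (about 0.57). That would be consistent with per-class (macro) averaging, although the abstract does not state which averaging scheme was used, so this reading is an assumption. The sketch below shows how macro-averaged precision, recall, and F1 could be computed for a word classifier, and how the macro F1 can fall below both macro precision and macro recall. The function name `macro_prf` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def macro_prf(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) over all classes seen."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1   # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for c in labels:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data (hypothetical): three words, four utterances each.
y_true = ["bin"] * 4 + ["lay"] * 4 + ["set"] * 4
y_pred = ["bin", "lay", "lay", "lay",   # "bin" is mostly mistaken for "lay"
          "lay", "lay", "lay", "lay",   # "lay" is always recognised
          "set", "set", "set", "lay"]   # "set" is missed once

p, r, f1 = macro_prf(y_true, y_pred)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.833 0.667 0.641
```

In this toy case the macro F1 (0.641) is below both the macro precision and the macro recall, because F1 is computed per class before averaging and the classes trade off precision and recall differently.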


