FYEO : A Character Level Model For Lip Reading

V. Joshi, Ebin Deni Raj
DOI: 10.1109/ICSCC51209.2021.9528104
Published in: 2021 8th International Conference on Smart Computing and Communications (ICSCC), July 2021
Citation count: 0

Abstract

The human mind is a remarkable creation that seamlessly handles multiple modalities of input and helps make sense of its surroundings. When it comes to understanding speech, the two main input features are sound and vision (although there are many other components). Since not every mind is alike, some people have trouble processing the auditory component of speech, so vision becomes their primary means of processing and understanding it. Lip reading is a skill used mainly by people with hearing impairments; it requires a large amount of language-specific knowledge as well as contextual awareness, i.e. using every available visual clue to make sense of what the other person is saying and thereby take part in the conversation. Recent breakthroughs in deep learning have shown clear promise, with models able to extract complex, intricate, and generalizable patterns in both the spatial and temporal dimensions. In this paper we present FYEO (For Your Eyes Only), an end-to-end deep-learning solution that uses vision as its single input modality and generates a single word, character by character. The model is a modified version of DeepMind's LipNet architecture, adapted to a subset of words curated from the Oxford-BBC Lip Reading in the Wild (LRW) dataset. As a novel contribution, FYEO is extended with an attention mechanism to further improve the model's contextual awareness and to observe where the model focuses while making a prediction. The standard FYEO model achieves a length-normalised test CER (character error rate) of 25.024%.
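The length-normalised CER reported above can be made concrete. The sketch below (plain stdlib Python, not the authors' code; the function names are illustrative) shows the standard way such a metric is computed: the Levenshtein edit distance between the predicted and reference character sequences, divided by the reference length.

```python
def levenshtein(ref, hyp):
    # Classic dynamic-programming edit distance between two character
    # sequences, keeping only one row of the DP table at a time.
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n]

def cer(ref, hyp):
    # Length-normalised character error rate: edits / reference length.
    return levenshtein(ref, hyp) / max(len(ref), 1)

# One substitution in a five-character LRW-style word -> CER 0.2
print(cer("ABOUT", "ABOWT"))
```

A test-set CER of 25.024% therefore means that, averaged over the test words, roughly one character in four had to be inserted, deleted, or substituted to recover the reference word.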