基于声学时序排列的连续唇语阅读

IF 1.7 3区 计算机科学 Q2 ACOUSTICS Eurasip Journal on Audio Speech and Music Processing Pub Date : 2024-05-06 DOI:10.1186/s13636-024-00345-7
David Gimeno-Gómez, Carlos-D. Martínez-Hinarejos
{"title":"基于声学时序排列的连续唇语阅读","authors":"David Gimeno-Gómez, Carlos-D. Martínez-Hinarejos","doi":"10.1186/s13636-024-00345-7","DOIUrl":null,"url":null,"abstract":"Visual speech recognition (VSR) is a challenging task that has received increasing interest during the last few decades. Current state of the art employs powerful end-to-end architectures based on deep learning which depend on large amounts of data and high computational resources for their estimation. We address the task of VSR for data scarcity scenarios with limited computational resources by using traditional approaches based on hidden Markov models. We present a novel learning strategy that employs information obtained from previous acoustic temporal alignments to improve the visual system performance. Furthermore, we studied multiple visual speech representations and how image resolution or frame rate affect its performance. All these experiments were conducted on the limited data VLRF corpus, a database which offers an audio-visual support to address continuous speech recognition in Spanish. The results show that our approach significantly outperforms the best results achieved on the task to date.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"9 1","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Continuous lipreading based on acoustic temporal alignments\",\"authors\":\"David Gimeno-Gómez, Carlos-D. Martínez-Hinarejos\",\"doi\":\"10.1186/s13636-024-00345-7\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Visual speech recognition (VSR) is a challenging task that has received increasing interest during the last few decades. Current state of the art employs powerful end-to-end architectures based on deep learning which depend on large amounts of data and high computational resources for their estimation. We address the task of VSR for data scarcity scenarios with limited computational resources by using traditional approaches based on hidden Markov models. We present a novel learning strategy that employs information obtained from previous acoustic temporal alignments to improve the visual system performance. Furthermore, we studied multiple visual speech representations and how image resolution or frame rate affect its performance. All these experiments were conducted on the limited data VLRF corpus, a database which offers an audio-visual support to address continuous speech recognition in Spanish. The results show that our approach significantly outperforms the best results achieved on the task to date.\",\"PeriodicalId\":49202,\"journal\":{\"name\":\"Eurasip Journal on Audio Speech and Music Processing\",\"volume\":\"9 1\",\"pages\":\"\"},\"PeriodicalIF\":1.7000,\"publicationDate\":\"2024-05-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Eurasip Journal on Audio Speech and Music Processing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1186/s13636-024-00345-7\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Eurasip Journal on Audio Speech and Music Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1186/s13636-024-00345-7","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
引用次数: 0

摘要

视觉语音识别(VSR)是一项极具挑战性的任务,在过去几十年中受到越来越多的关注。目前的技术采用了基于深度学习的强大端到端架构,这些架构的估算依赖于大量数据和高计算资源。我们通过使用基于隐马尔可夫模型的传统方法,解决了在计算资源有限的情况下数据稀缺场景下的 VSR 任务。我们提出了一种新颖的学习策略,利用从以前的声学时序排列中获得的信息来提高视觉系统的性能。此外,我们还研究了多种视觉语音表征以及图像分辨率或帧速率对其性能的影响。所有这些实验都是在数据有限的 VLRF 语料库上进行的,该语料库为西班牙语连续语音识别提供了视听支持。结果表明,我们的方法明显优于迄今为止在该任务上取得的最佳结果。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Continuous lipreading based on acoustic temporal alignments
Visual speech recognition (VSR) is a challenging task that has received increasing interest during the last few decades. Current state of the art employs powerful end-to-end architectures based on deep learning which depend on large amounts of data and high computational resources for their estimation. We address the task of VSR for data scarcity scenarios with limited computational resources by using traditional approaches based on hidden Markov models. We present a novel learning strategy that employs information obtained from previous acoustic temporal alignments to improve the visual system performance. Furthermore, we studied multiple visual speech representations and how image resolution or frame rate affect its performance. All these experiments were conducted on the limited data VLRF corpus, a database which offers an audio-visual support to address continuous speech recognition in Spanish. The results show that our approach significantly outperforms the best results achieved on the task to date.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Eurasip Journal on Audio Speech and Music Processing
Eurasip Journal on Audio Speech and Music Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC
CiteScore
4.10
自引率
4.20%
发文量
0
审稿时长
12 months
期刊介绍: The aim of “EURASIP Journal on Audio, Speech, and Music Processing” is to bring together researchers, scientists and engineers working on the theory and applications of the processing of various audio signals, with a specific focus on speech and music. EURASIP Journal on Audio, Speech, and Music Processing will be an interdisciplinary journal for the dissemination of all basic and applied aspects of speech communication and audio processes.
期刊最新文献
Compression of room impulse responses for compact storage and fast low-latency convolution Guest editorial: AI for computational audition—sound and music processing Physics-constrained adaptive kernel interpolation for region-to-region acoustic transfer function: a Bayesian approach Physics-informed neural network for volumetric sound field reconstruction of speech signals Optimal sensor placement for the spatial reconstruction of sound fields
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1