Enhancing quality and accuracy of speech recognition system by using multimodal audio-visual speech signal

Eslam E. El Maghraby, A. Gody, M. Farouk
Published in: 2016 12th International Computer Engineering Conference (ICENCO)
Publication date: 2016-12-01
DOI: 10.1109/ICENCO.2016.7856472
URL: https://doi.org/10.1109/ICENCO.2016.7856472
Citations: 3

Abstract

Most developments in automatic speech recognition have relied on acoustic speech as the sole input signal, disregarding its visual counterpart. However, recognition based on acoustic speech alone can suffer from deficiencies that prevent its use in many real-world applications, particularly under adverse conditions. This paper aims to build a connected-word audio-visual speech recognition (AV-ASR) system for English that uses both acoustic and visual speech information to improve recognition performance. Mel-frequency cepstral coefficients (MFCCs) are used to extract the audio features from the speech files. For the visual counterpart, Discrete Cosine Transform (DCT) coefficients are used to extract the visual features from the speaker's mouth region, and Principal Component Analysis (PCA) is used for dimensionality reduction. These features are then concatenated with the traditional audio features, and the resulting features are used to train hidden Markov model (HMM) parameters using word-level acoustic models. The system was developed with the Hidden Markov Model Toolkit (HTK), which uses HMMs for recognition. The potential of the suggested approach is demonstrated by a preliminary experiment on the GRID sentence database, one of the largest databases available for audio-visual recognition, which contains continuous English voice commands for a small-vocabulary task. The experimental results show that the proposed Audio Video Speech Recognizer (AV-ASR) system achieves a higher recognition rate than an audio-only recognizer and performs robustly. For the grammar-based word recognition system, the success rate increases by 3.9% over all speakers in the speaker-independent test; in the speaker-dependent test, the improvement varies from speaker to speaker, between 1% and 7%. Testing the system in a noisy environment also shows improved results.
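The visual front end and feature fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the ROI size (32x48), the number of retained DCT coefficients (a 6x6 low-frequency block), the PCA dimension (10), the 39-dimensional MFCC vectors, the frame counts, and the frame-repetition scheme used to align video frames with audio frames are all assumptions chosen for illustration, and synthetic random data stands in for real mouth images and MFCCs. All function names are hypothetical. HMM training with HTK on the fused vectors is not shown.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix."""
    k = np.arange(n)[:, None]   # frequency index
    i = np.arange(n)[None, :]   # sample index
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] /= np.sqrt(2.0)     # scale the DC row so the basis is orthonormal
    return m

def dct2_features(roi, keep=6):
    """2-D DCT of a grayscale mouth ROI; keep the low-frequency keep x keep block."""
    h, w = roi.shape
    coeffs = dct_matrix(h) @ roi @ dct_matrix(w).T
    return coeffs[:keep, :keep].ravel()

def pca_fit(x, n_components):
    """Estimate the mean and principal axes of an (n_frames, n_dims) matrix via SVD."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_apply(x, mean, components):
    """Project centered data onto the principal axes."""
    return (x - mean) @ components.T

def fuse(audio, visual):
    """Repeat visual frames to match the audio frame count (assumed scheme), then concatenate."""
    idx = np.arange(len(audio)) * len(visual) // len(audio)
    return np.hstack([audio, visual[idx]])

# Synthetic stand-ins (assumed sizes): 25 video frames and 100 audio frames.
rng = np.random.default_rng(0)
rois = rng.random((25, 32, 48))
mfcc = rng.standard_normal((100, 39))

dct_feats = np.stack([dct2_features(r) for r in rois])  # shape (25, 36)
mean, comps = pca_fit(dct_feats, n_components=10)
visual = pca_apply(dct_feats, mean, comps)              # shape (25, 10)
fused = fuse(mfcc, visual)                              # shape (100, 49)
print(fused.shape)
```

Each row of `fused` is one audio-visual observation vector of the kind that could be written to a feature file for HMM training. In practice the PCA basis would be estimated on training data only and reused unchanged on test data.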