{"title":"Listening with Your Eyes: Towards a Practical Visual Speech Recognition System Using Deep Boltzmann Machines","authors":"Chao Sui, Bennamoun, R. Togneri","doi":"10.1109/ICCV.2015.26","DOIUrl":null,"url":null,"abstract":"This paper presents a novel feature learning method for visual speech recognition using Deep Boltzmann Machines (DBM). Unlike all existing visual feature extraction techniques which solely extracts features from video sequences, our method is able to explore both acoustic information and visual information to learn a better visual feature representation in the training stage. During the test stage, instead of using both audio and visual signals, only the videos are used for generating the missing audio feature, and both the given visual and given audio features are used to obtain a joint representation. We carried out our experiments on a large scale audio-visual data corpus, and experimental results show that our proposed techniques outperforms the performance of the hadncrafted features and features learned by other commonly used deep learning techniques.","PeriodicalId":6633,"journal":{"name":"2015 IEEE International Conference on Computer Vision (ICCV)","volume":"36 6","pages":"154-162"},"PeriodicalIF":0.0000,"publicationDate":"2015-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"32","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE International Conference on Computer Vision (ICCV)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCV.2015.26","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 32
Abstract
This paper presents a novel feature learning method for visual speech recognition using Deep Boltzmann Machines (DBM). Unlike all existing visual feature extraction techniques which solely extracts features from video sequences, our method is able to explore both acoustic information and visual information to learn a better visual feature representation in the training stage. During the test stage, instead of using both audio and visual signals, only the videos are used for generating the missing audio feature, and both the given visual and given audio features are used to obtain a joint representation. We carried out our experiments on a large scale audio-visual data corpus, and experimental results show that our proposed techniques outperforms the performance of the hadncrafted features and features learned by other commonly used deep learning techniques.