Statistical lip modelling for visual speech recognition
J. Luettin, N. Thacker, S. W. Beet
1996 8th European Signal Processing Conference (EUSIPCO 1996)
Published: 1996-09-01
DOI: https://doi.org/10.5281/ZENODO.36365
Citations: 31
Abstract
We describe a speechreading (lipreading) system based purely on visual features extracted from grey-level image sequences of the speaker's lips. Active shape models are used to track the lip contours, and visual speech information is extracted from the shape of those contours. The distribution and temporal dependencies of the shape features are modelled by continuous-density hidden Markov models. Experiments are reported for speaker-independent recognition of isolated digits. The analysis of individual feature components suggests that speech-relevant information is embedded in a low-dimensional space and is fairly robust to inter- and intra-speaker variability.
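
The recognition stage summarised above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one hidden Markov model per word with diagonal-covariance Gaussian emissions, scores a sequence of shape-feature vectors with the forward algorithm, and picks the word whose model gives the highest likelihood. All names (`make_hmm`, `forward_loglik`, `classify`) and all parameter values are hypothetical, and the one-dimensional feature stands in for a single shape parameter.

```python
import math


def _log(p):
    """Log of a probability, mapping 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else -math.inf


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))


def make_hmm(pi, trans, means, variances):
    """Bundle HMM parameters, storing probabilities in the log domain."""
    return {
        "log_pi": [_log(p) for p in pi],
        "log_A": [[_log(p) for p in row] for row in trans],
        "means": means,
        "vars": variances,
    }


def forward_loglik(seq, hmm):
    """Log-likelihood of a feature sequence under one HMM (forward algorithm)."""
    n = len(hmm["log_pi"])
    # Initialisation: start probability times emission density of first frame.
    alpha = [
        hmm["log_pi"][j] + log_gauss(seq[0], hmm["means"][j], hmm["vars"][j])
        for j in range(n)
    ]
    # Recursion: sum over predecessor states, then apply the emission density.
    for x in seq[1:]:
        alpha = [
            logsumexp([alpha[i] + hmm["log_A"][i][j] for i in range(n)])
            + log_gauss(x, hmm["means"][j], hmm["vars"][j])
            for j in range(n)
        ]
    return logsumexp(alpha)


def classify(seq, models):
    """Return the label whose HMM assigns the sequence the highest likelihood."""
    return max(models, key=lambda label: forward_loglik(seq, models[label]))


# Toy two-state left-to-right models (invented values, for illustration only).
LEFT_TO_RIGHT = [[0.5, 0.5], [0.0, 1.0]]
MODELS = {
    "open": make_hmm([1.0, 0.0], LEFT_TO_RIGHT, [[0.0], [2.0]], [[1.0], [1.0]]),
    "closed": make_hmm([1.0, 0.0], LEFT_TO_RIGHT, [[0.0], [-2.0]], [[1.0], [1.0]]),
}

if __name__ == "__main__":
    frames = [[0.1], [0.2], [1.8], [2.1]]  # hypothetical shape-parameter track
    print(classify(frames, MODELS))
```

Working in the log domain avoids numerical underflow on long sequences. A real system would estimate the model parameters from training data rather than set them by hand.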