Audio-visual speech synchronization detection using a bimodal linear prediction model

Kshitiz Kumar, Jirí Navrátil, E. Marcheret, V. Libal, G. Ramaswamy, G. Potamianos

Published in: 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
Publication date: 2009-06-20
DOI: 10.1109/CVPRW.2009.5204303
Citations: 12
Abstract
In this work, we study the problem of detecting audio-visual (AV) synchronization in video segments containing a speaker in frontal head pose. The problem holds important applications in biometrics, for example spoofing detection, and it constitutes an important step in AV segmentation necessary for deriving AV fingerprints in multimodal speaker recognition. To attack the problem, we propose a time-evolution model for AV features and derive an analytical approach to capture the notion of synchronization between them. We report results on an appropriate AV database, using two types of visual features extracted from the speaker's facial area: geometric ones and features based on the discrete cosine image transform. Our results demonstrate that the proposed approach provides substantially better AV synchrony detection over a baseline method that employs mutual information, with the geometric visual features outperforming the image transform ones.
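The core idea of the paper's time-evolution model, predicting current audio features from recent audio history plus the concurrent visual features, and scoring synchrony by how small the prediction residual is, can be illustrated with a minimal sketch. This is an assumption-laden toy version, not the authors' actual formulation: the feature dimensions, prediction order, and least-squares fitting here are illustrative choices, and the `sync_score` helper is hypothetical.

```python
import numpy as np

def sync_score(audio, video, order=3):
    """Toy bimodal linear prediction synchrony score (illustrative only).

    audio: (T, d_a) array of per-frame audio features
    video: (T, d_v) array of per-frame visual features
    order: number of past audio frames used as regressors

    Fits a linear predictor of audio[t] from the previous `order`
    audio frames plus the current video frame, and returns the
    negative mean squared residual: higher means the video helps
    predict the audio, i.e. the streams look synchronous.
    """
    T = audio.shape[0]
    X, y = [], []
    for t in range(order, T):
        # Regressors: flattened recent audio history + current visual frame
        X.append(np.concatenate([audio[t - order:t].ravel(), video[t].ravel()]))
        y.append(audio[t])
    X, y = np.asarray(X), np.asarray(y)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit
    resid = y - X @ coef
    return -np.mean(resid ** 2)
```

Under this sketch, a video stream that is time-shifted relative to its audio contributes nothing to the prediction, so its residual (and thus a lower score) exposes the desynchronization, which is the detection principle the paper builds on analytically.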