Exploiting the Complementarity of Audio and Visual Data in Multi-speaker Tracking

Yutong Ban, Laurent Girin, Xavier Alameda-Pineda, R. Horaud

2017 IEEE International Conference on Computer Vision Workshops (ICCVW), 23 October 2017. DOI: 10.1109/ICCVW.2017.60
Multi-speaker tracking is a central problem in human-robot interaction. In this context, exploiting auditory and visual information is gratifying and challenging at the same time. Gratifying because the complementary nature of auditory and visual information allows us to be more robust against noise and outliers than unimodal approaches. Challenging because how to properly fuse auditory and visual information for multi-speaker tracking is far from being a solved problem. In this paper we propose a probabilistic generative model that tracks multiple speakers by jointly exploiting auditory and visual features in their own representation spaces. Importantly, the method is robust to missing data and is therefore able to track even when observations from one of the modalities are absent. Quantitative and qualitative results on the AVDIAR dataset are reported.
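To illustrate the kind of missing-data robustness the abstract describes, below is a minimal sketch, not the paper's actual model, of a fused observation likelihood: each modality scores a candidate speaker state in its own representation space, and an absent modality simply drops out of the sum, so tracking continues on whatever observations remain. All function names, state mappings, and noise parameters here are hypothetical illustrations.

```python
import numpy as np

def gaussian_log_lik(obs, pred, var):
    """Isotropic Gaussian log-likelihood (constant terms dropped)."""
    return -0.5 * np.sum((obs - pred) ** 2) / var

def fused_log_lik(state, audio_obs, visual_obs,
                  audio_map, visual_map,
                  audio_var=1.0, visual_var=1.0):
    """Score a candidate speaker state against whichever observations exist.

    audio_map / visual_map (hypothetical) project the state into each
    modality's own representation space, e.g. interaural cues for audio
    versus image coordinates for vision. A missing modality (None) is
    skipped rather than treated as an error, which is the graceful
    degradation the abstract refers to.
    """
    ll = 0.0
    if audio_obs is not None:      # audio may be absent in some frames
        ll += gaussian_log_lik(audio_obs, audio_map(state), audio_var)
    if visual_obs is not None:     # vision may be absent (occlusion, etc.)
        ll += gaussian_log_lik(visual_obs, visual_map(state), visual_var)
    return ll

# Toy usage: a 2-D state with identity mappings, visual observation missing.
state = np.array([0.3, 0.7])
print(fused_log_lik(state, np.array([0.28, 0.66]), None,
                    audio_map=lambda s: s, visual_map=lambda s: s))
```

Summing independent per-modality log-likelihoods is one standard way to fuse heterogeneous observations in probabilistic tracking; the paper's generative model is more elaborate, but the dropout-on-missing behavior sketched here captures the complementarity argument made above.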