Predicting User Confidence in Video Recordings with Spatio-Temporal Multimodal Analytics
Andrew Emerson, Patrick Houghton, Ke Chen, Vinay Basheerabad, Rutuja Ubale, C. W. Leong
Companion Publication of the 2022 International Conference on Multimodal Interaction, November 7, 2022
DOI: 10.1145/3536220.3558007 (https://doi.org/10.1145/3536220.3558007)
Abstract
A critical component of effective communication is the ability to project confidence. In video presentations (e.g., video interviews), many factors influence the confidence perceived by a listener. Advances in computer vision, speech processing, and natural language processing have enabled the automatic extraction of salient features that can be used to model a presenter's perceived confidence. Moreover, these multimodal features can be used to automatically provide a user with feedback on ways to improve their projected confidence. This paper introduces a multimodal approach to modeling user confidence in video presentations by leveraging features from visual cues (i.e., eye gaze) and speech patterns. We investigate the degree to which the extracted multimodal features are predictive of user confidence using a dataset of 48 two-minute videos, in which participants used a webcam and microphone to record themselves responding to a prompt. Comparative experimental results indicate that our modeling approach, which uses both visual and speech features, achieves improvements of 83% and 78% over the random and majority-label baselines, respectively. We discuss the implications of using multimodal features for modeling confidence, as well as the potential for automated feedback to users who want to improve their confidence in video presentations.
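The paper does not publish its implementation; the following is a minimal sketch, under assumptions not confirmed by the abstract, of how per-video gaze and speech features could be fused and compared against random and majority-label baselines. The feature names, dimensions, binary labels, and the logistic-regression classifier are illustrative choices, and the data below are synthetic stand-ins.

```python
# Hypothetical sketch of an early-fusion pipeline with random / majority baselines.
# Not the authors' code; feature sets, labels, and classifier are assumptions.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# One row per two-minute video (the paper's dataset has 48 videos).
n_videos = 48
gaze_features = rng.normal(size=(n_videos, 4))    # e.g., fixation ratio, gaze dispersion (assumed)
speech_features = rng.normal(size=(n_videos, 6))  # e.g., pitch variance, speaking rate, pauses (assumed)
labels = rng.integers(0, 2, size=n_videos)        # assumed binary confident / not-confident labels

# Early fusion: concatenate the per-video visual and speech feature vectors.
X = np.hstack([gaze_features, speech_features])

fused_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
fused_score = cross_val_score(fused_model, X, labels, cv=5).mean()

# Baselines analogous to the "random" and "majority label" comparisons in the abstract.
random_score = cross_val_score(
    DummyClassifier(strategy="uniform", random_state=0), X, labels, cv=5).mean()
majority_score = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, labels, cv=5).mean()

print(f"fused multimodal model: {fused_score:.2f}")
print(f"random baseline:        {random_score:.2f}")
print(f"majority baseline:      {majority_score:.2f}")
```

With real gaze and speech features in place of the synthetic arrays, the same cross-validated comparison would yield the kind of relative improvement over the two baselines that the abstract reports.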