A socially acceptable robot needs to make correct decisions and understand human intent in order to interact with and navigate around humans safely. Although research in computer vision and robotics has made huge advances in recent years, today's robotics systems still need a better understanding of human intent to be more effective and widely accepted. Currently, such inference is typically done using only one mode of perception, such as vision or human movement trajectory. In this extended abstract, I describe my PhD research plan for developing a novel multimodal and context-aware framework in which a robot infers human navigational intentions through multimodal perception comprising temporal facial, body pose and gaze features, human motion features, and environmental context. To facilitate this framework, a data collection experiment is designed to acquire multimodal human-robot interaction data. Our initial design of the framework is based on a temporal neural network model with human motion, body pose and head orientation features as input. We will increase the complexity of the neural network model, as well as of the input features, along the way. In the long term, this framework can benefit a variety of settings such as autonomous driving and service and household robots.
{"title":"Towards a Multimodal and Context-Aware Framework for Human Navigational Intent Inference","authors":"Z. Zhang","doi":"10.1145/3382507.3421156","DOIUrl":"https://doi.org/10.1145/3382507.3421156","journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction"},"publicationDate":"2020-10-21"}
Jasper J. van Beers, I. Stuldreher, Nattapong Thammasan, A. Brouwer
Measuring concurrent changes in autonomic physiological responses aggregated across individuals (physiological synchrony, PS) can provide insight into group-level cognitive or emotional processes. Utilizing cheap and easy-to-use wearable sensors to measure physiology rather than their high-end laboratory counterparts is desirable. Since it is currently ambiguous how different signal properties (arising from different types of measuring equipment) influence the detection of PS associated with mental processes, it is unclear whether, or to what extent, PS based on data from wearables compares to that from their laboratory equivalents. Existing literature has investigated PS using both types of equipment, but none has compared them directly. In this study, we measure PS in electrodermal activity (EDA) and inter-beat interval (IBI, the inverse of heart rate) of participants who listened to the same audio stream but were instructed to attend either to the presented narrative (n=13) or to the interspersed auditory events (n=13). Both laboratory and wearable sensors were used (ActiveTwo electrocardiogram (ECG) and EDA; Wahoo Tickr and EdaMove4). A participant's attentional condition was classified based on which attentional group they shared greater synchrony with. For both types of sensors, we found classification accuracies of 73% or higher in both EDA and IBI. We found no significant difference in classification accuracies between the laboratory and wearable sensors. These findings encourage the use of wearables for PS-based research and for in-the-field measurements.
{"title":"A Comparison between Laboratory and Wearable Sensors in the Context of Physiological Synchrony","authors":"Jasper J. van Beers, I. Stuldreher, Nattapong Thammasan, A. Brouwer","doi":"10.1145/3382507.3418837","DOIUrl":"https://doi.org/10.1145/3382507.3418837","journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction"},"publicationDate":"2020-10-21"}
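The classification rule described in the abstract above (assign a participant to the attentional group they share greater synchrony with) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: synchrony is approximated here by the mean Pearson correlation over the whole recording, the function names are hypothetical, and the signals are made-up toy values. In practice the participant being classified would be left out of their own group before averaging.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def mean_synchrony(signal, group):
    """Average correlation between one participant's signal and each group member's."""
    return sum(pearson(signal, other) for other in group) / len(group)


def classify_attention(signal, narrative_group, event_group):
    """Return the label of the group the participant is more synchronized with."""
    s_narrative = mean_synchrony(signal, narrative_group)
    s_event = mean_synchrony(signal, event_group)
    return "narrative" if s_narrative >= s_event else "events"


# Toy, made-up signals (assumption: not data from the study).
narrative_group = [[1, 2, 3, 4, 5], [2, 3, 4, 5, 7]]
event_group = [[5, 4, 3, 2, 1], [6, 4, 3, 3, 1]]
participant = [1, 3, 3, 5, 6]
label = classify_attention(participant, narrative_group, event_group)
print(label)
```

The same function can be applied separately to EDA and IBI signals from either sensor type, which is what makes a like-for-like accuracy comparison between devices possible.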
Abhinav Dhall, Garima Sharma, R. Goecke, Tom Gedeon
This paper introduces the Eighth Emotion Recognition in the Wild (EmotiW) challenge. EmotiW is a benchmarking effort run as a grand challenge of the 22nd ACM International Conference on Multimodal Interaction 2020. It comprises four tasks related to automatic human behavior analysis: a) driver gaze prediction; b) audio-visual group-level emotion recognition; c) engagement prediction in the wild; and d) physiological signal based emotion recognition. The motivation of EmotiW is to bring researchers in affective computing, computer vision, speech processing and machine learning to a common platform for evaluating techniques on shared test data. We discuss the challenge protocols, databases and their associated baselines.
{"title":"EmotiW 2020: Driver Gaze, Group Emotion, Student Engagement and Physiological Signal based Challenges","authors":"Abhinav Dhall, Garima Sharma, R. Goecke, Tom Gedeon","doi":"10.1145/3382507.3417973","DOIUrl":"https://doi.org/10.1145/3382507.3417973","journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction"},"publicationDate":"2020-10-21"}
Carla Viegas, Albert Lu, A. Su, Carter Strear, Yi Xu, Albert Topdjian, Daniel Limon, J. J. Xu
Enthusiasm in speech has a huge impact on listeners. Students of enthusiastic teachers show better performance. Leaders who are enthusiastic influence employees' innovative behavior and can also spark excitement in customers. We, at TalkMeUp, want to help people learn how to talk with enthusiasm in order to spark creativity among their listeners. In this work, we present a multimodal speech analysis platform. We provide feedback on enthusiasm by analyzing eye contact, facial expressions, voice prosody, and text content.
{"title":"Spark Creativity by Speaking Enthusiastically: Communication Training using an E-Coach","authors":"Carla Viegas, Albert Lu, A. Su, Carter Strear, Yi Xu, Albert Topdjian, Daniel Limon, J. J. Xu","doi":"10.1145/3382507.3421164","DOIUrl":"https://doi.org/10.1145/3382507.3421164","journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction"},"publicationDate":"2020-10-21"}
Audience perceptions of public speakers' performance change over time. Some speakers start strong but quickly transition to mundane delivery, while others may have a few impactful and engaging portions of their talk preceded and followed by more pedestrian delivery. In this work, we model the time-varying qualities of a presentation as perceived by the audience and use these models both to provide diagnostic information to presenters and to improve the quality of automated performance assessments. In particular, we use hidden Markov models (HMMs) to model various dimensions of perceived quality and how they change over time, and use the sequence of quality states to improve feedback and predictions. We evaluate this approach on a corpus of 74 presentations given in a controlled environment. Multimodal features, spanning acoustic qualities, speech disfluencies, and nonverbal behavior, were derived both automatically and manually using crowdsourcing. Ground truth on audience perceptions was obtained using judge ratings on both overall presentations (aggregate) and portions of presentations segmented by topic. We distilled the overall presentation quality into states representing the presenter's gaze, audio, gesture, audience interaction, and proxemic behaviors. We demonstrate that an HMM-based state representation of presentations improves the performance assessments.
{"title":"Multimodal Assessment of Oral Presentations using HMMs","authors":"Everlyne Kimani, Prasanth Murali, Ameneh Shamekhi, Dhaval Parmar, Sumanth Munikoti, T. Bickmore","doi":"10.1145/3382507.3418888","DOIUrl":"https://doi.org/10.1145/3382507.3418888","journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction"},"publicationDate":"2020-10-21"}
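To make the idea of a "sequence of quality states" in the abstract above concrete, the sketch below decodes the most likely hidden-state sequence of a small HMM with the Viterbi algorithm. Everything in it is an assumption for illustration: the two states, the observation symbols, and all probabilities are invented, and the abstract does not say which decoding or training procedure the authors used.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
            column[s] = (prob, prev)
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical two-state model of perceived delivery quality.
states = ("engaging", "mundane")
start_p = {"engaging": 0.6, "mundane": 0.4}
trans_p = {
    "engaging": {"engaging": 0.7, "mundane": 0.3},
    "mundane": {"engaging": 0.2, "mundane": 0.8},
}
emit_p = {
    "engaging": {"eye_contact": 0.8, "reading_notes": 0.2},
    "mundane": {"eye_contact": 0.3, "reading_notes": 0.7},
}
# One made-up observation per presentation segment.
observations = ["eye_contact", "eye_contact", "reading_notes", "reading_notes"]
path = viterbi(observations, states, start_p, trans_p, emit_p)
print(path)  # ['engaging', 'engaging', 'mundane', 'mundane']
```

A decoded sequence like this one describes a speaker who starts strong and then drifts into mundane delivery, which is the kind of per-segment diagnostic the abstract refers to. For long sequences, summing log-probabilities instead of multiplying raw probabilities avoids numerical underflow.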
My PhD project aims to contribute to affective computing by using micro-expression recognition to assist in depression diagnosis. My motivation is the similarity between the low-intensity facial expressions in micro-expressions and the low-intensity facial expressions ('frozen face') in people with psycho-motor retardation caused by depression. The project will focus on, firstly, investigating spatio-temporal modelling and attention systems for micro-expression recognition (MER) and, secondly, exploring the role of micro-expressions in automated depression analysis by improving deep learning architectures to detect low-intensity facial expressions. This work will investigate different deep learning architectures (e.g. Temporal Convolutional Networks (TCNN) or Gated Recurrent Units (GRU)) and validate the results on publicly available micro-expression benchmark datasets to quantitatively analyse the robustness and accuracy of MER's contribution to improving automatic depression analysis. Moreover, video magnification, as a way to enhance small movements, will be combined with the deep learning methods to address the low-intensity issues in MER.
{"title":"Detection of Micro-expression Recognition Based on Spatio-Temporal Modelling and Spatial Attention","authors":"Mengjiong Bai","doi":"10.1145/3382507.3421160","DOIUrl":"https://doi.org/10.1145/3382507.3421160","journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction"},"publicationDate":"2020-10-21"}
This work investigates the interplay between Child-Computer Interaction and attachment, a psychological construct that accounts for how children perceive their parents to be. In particular, the article uses a multimodal approach to test whether children with different attachment conditions tend to use the same interactive system differently. The experiments show that the accuracy in predicting usage behaviour changes, to a statistically significant extent, according to the attachment conditions of the 52 experiment participants (aged 5 to 9). Such a result suggests that attachment-relevant processes are actually at work when people interact with technology, at least when it comes to children.
{"title":"Did the Children Behave?: Investigating the Relationship Between Attachment Condition and Child Computer Interaction","authors":"Dong-Bach Vo, S. Brewster, A. Vinciarelli","doi":"10.1145/3382507.3418858","DOIUrl":"https://doi.org/10.1145/3382507.3418858","journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction"},"publicationDate":"2020-10-21"}
Andrew Emerson, Nathan L. Henderson, Jonathan P. Rowe, Wookhee Min, Seung Y. Lee, James Minogue, James C. Lester
Modeling visitor engagement is a key challenge in informal learning environments, such as museums and science centers. Devising predictive models of visitor engagement that accurately forecast salient features of visitor behavior, such as dwell time, holds significant potential for enabling adaptive learning environments and visitor analytics for museums and science centers. In this paper, we introduce a multimodal early prediction approach to modeling visitor engagement with interactive science museum exhibits. We utilize multimodal sensor data including eye gaze, facial expression, posture, and interaction log data captured during visitor interactions with an interactive museum exhibit for environmental science education, to induce predictive models of visitor dwell time. We investigate machine learning techniques (random forest, support vector machine, Lasso regression, gradient boosting trees, and multi-layer perceptron) to induce multimodal predictive models of visitor engagement with data from 85 museum visitors. Results from a series of ablation experiments suggest that incorporating additional modalities into predictive models of visitor engagement improves model accuracy. In addition, the models show improved predictive performance over time, demonstrating that increasingly accurate predictions of visitor dwell time can be achieved as more evidence becomes available from visitor interactions with interactive science museum exhibits. These findings highlight the efficacy of multimodal data for modeling museum exhibit visitor engagement.
{"title":"Early Prediction of Visitor Engagement in Science Museums with Multimodal Learning Analytics","authors":"Andrew Emerson, Nathan L. Henderson, Jonathan P. Rowe, Wookhee Min, Seung Y. Lee, James Minogue, James C. Lester","doi":"10.1145/3382507.3418890","DOIUrl":"https://doi.org/10.1145/3382507.3418890","journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction"},"publicationDate":"2020-10-21"}
This paper presents a hybrid network for audio-video group emotion recognition. The proposed architecture includes an audio stream, a facial emotion stream, an environmental object statistics stream (EOS) and a video stream. We adopted this method at the 8th Emotion Recognition in the Wild Challenge (EmotiW2020). According to the feedback on our submissions, the best result achieved 76.85% on the Video level Group AFfect (VGAF) Test Database, 26.89% higher than the baseline. Such improvements prove that our method is state-of-the-art.
{"title":"Group Level Audio-Video Emotion Recognition Using Hybrid Networks","authors":"Chuanhe Liu, Wenqian Jiang, Minghao Wang, Tianhao Tang","doi":"10.1145/3382507.3417968","DOIUrl":"https://doi.org/10.1145/3382507.3417968","journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction"},"publicationDate":"2020-10-21"}
Cigdem Beyan, Matteo Bustreo, Muhammad Shahid, Gianluca Bailo, N. Carissimi, Alessio Del Bue
We present the first publicly available annotations for the analysis of face-touching behavior. These annotations are for a dataset composed of audio-visual recordings of small group social interactions, with a total of 64 videos, each lasting between 12 and 30 minutes and showing a single person participating in a four-person meeting. The annotations were produced by 16 annotators in total, with almost perfect agreement on average (Cohen's Kappa=0.89). In total, 74K and 2M video frames were labelled as face-touch and no-face-touch, respectively. Given the dataset and the collected annotations, we also present an extensive evaluation of several methods for face-touching detection: rule-based, supervised learning with hand-crafted features, and feature learning and inference with a Convolutional Neural Network (CNN). Our evaluation indicates that, among all methods, the CNN performed best, reaching an 83.76% F1-score and a 0.84 Matthews Correlation Coefficient. To foster future research on this problem, code and dataset were made publicly available (github.com/IIT-PAVIS/Face-Touching-Behavior), providing all video frames, face-touch annotations, body pose estimations including face and hand key-point detections, face bounding boxes, as well as the baseline methods implemented and the cross-validation splits used for training and evaluating our models.
{"title":"Analysis of Face-Touching Behavior in Large Scale Social Interaction Dataset","authors":"Cigdem Beyan, Matteo Bustreo, Muhammad Shahid, Gianluca Bailo, N. Carissimi, Alessio Del Bue","doi":"10.1145/3382507.3418876","DOIUrl":"https://doi.org/10.1145/3382507.3418876","journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction"},"publicationDate":"2020-10-21"}
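The abstract above reports F1-score and Matthews Correlation Coefficient (MCC) on a heavily imbalanced label set (74K face-touch frames against 2M no-face-touch frames). The sketch below shows how both metrics are computed from confusion-matrix counts and why plain accuracy can mislead in that setting. The counts are hypothetical and chosen only for easy arithmetic; they are not results from the paper, and the helper names are illustrative.

```python
from math import sqrt


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC in [-1, 1]; returned as 0.0 when any marginal count is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp, tn, fp, fn):
    """Fraction of all frames that were labelled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical frame counts (assumption: not taken from the paper).
detector = dict(tp=80, tn=880, fp=20, fn=20)
always_negative = dict(tp=0, tn=900, fp=0, fn=100)

for name, counts in (("detector", detector), ("always_negative", always_negative)):
    f1 = f1_score(counts["tp"], counts["fp"], counts["fn"])
    print(name, round(accuracy(**counts), 3), round(f1, 3),
          round(matthews_corrcoef(**counts), 3))
```

With these counts, a classifier that never predicts a face touch still reaches 90% accuracy while scoring 0 on both F1 and MCC, which is why imbalance-aware metrics are the more informative choice for this kind of detection task.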