Eye-Tracking to Predict User Cognitive Abilities and Performance for User-Adaptive Narrative Visualizations
Oswald Barral, Sébastien Lallé, Grigorii Guz, A. Iranpour, C. Conati
We leverage eye-tracking data to predict user performance and levels of cognitive abilities while reading magazine-style narrative visualizations (MSNVs), a widespread form of multimodal document that combines text and visualizations. Such predictions are motivated by recent interest in devising user-adaptive MSNVs that can dynamically adapt to a user's needs. Our results provide evidence for the feasibility of real-time user modeling in MSNVs, as we are the first to consider eye-tracking data for predicting task comprehension and cognitive abilities while processing multimodal documents. We conclude with a discussion of the implications for the design of personalized MSNVs.
DOI: https://doi.org/10.1145/3382507.3418884
Predicting Video Affect via Induced Affection in the Wild
Yi Ding, Radha Kumaran, Tianjiao Yang, Tobias Höllerer
Curating large, high-quality datasets for studying affect is a costly and time-consuming process, especially when the labels are continuous. In this paper, we examine the potential of using unlabeled public reactions, in the form of textual comments, to aid in classifying video affect. We examine two popular datasets used for affect recognition and mine public reactions for these videos. We learn a representation of these reactions by using the video ratings as a weakly supervised signal. We show that our model can learn a fine-grained prediction of comment affect when given a video alone. Furthermore, we demonstrate how predicting the affective properties of a comment can be a potentially useful modality for multimodal affect modeling.
DOI: https://doi.org/10.1145/3382507.3418838
BreathEasy: Assessing Respiratory Diseases Using Mobile Multimodal Sensors
Md. Mahbubur Rahman, M. Y. Ahmed, Tousif Ahmed, Bashima Islam, Viswam Nathan, K. Vatanparvar, Ebrahim Nemati, Daniel McCaffrey, Jilong Kuang, J. Gao
Mobile respiratory assessment using commodity smartphones and smartwatches is an unmet need for patient monitoring at home. In this paper, we show the feasibility of using multimodal sensors embedded in consumer mobile devices for non-invasive, low-effort respiratory assessment. We conducted studies with 228 chronic respiratory patients and healthy subjects, and show that our model can estimate respiratory rate with a mean absolute error (MAE) of 0.72 ± 0.62 breaths per minute and differentiate respiratory patients from healthy subjects with 90% recall and 76% precision when the user breathes normally while holding the device on the chest or the abdomen for a minute. Holding the device on the chest or abdomen requires significantly less effort than traditional spirometry, which requires a specialized device and forceful, vigorous breathing. This paper shows the feasibility of a low-effort respiratory assessment that can be made available anywhere, anytime through users' own mobile devices.
DOI: https://doi.org/10.1145/3382507.3418852
Spark Creativity by Speaking Enthusiastically: Communication Training using an E-Coach
Carla Viegas, Albert Lu, A. Su, Carter Strear, Yi Xu, Albert Topdjian, Daniel Limon, J. J. Xu
Enthusiasm in speech has a strong impact on listeners. Students of enthusiastic teachers show better performance, and enthusiastic leaders influence employees' innovative behavior and can spark excitement in customers. We, at TalkMeUp, want to help people learn how to speak with enthusiasm in order to spark creativity among their listeners. In this work we present a multimodal speech analysis platform that provides feedback on enthusiasm by analyzing eye contact, facial expressions, voice prosody, and text content.
DOI: https://doi.org/10.1145/3382507.3421164
Multimodal Assessment of Oral Presentations using HMMs
Everlyne Kimani, Prasanth Murali, Ameneh Shamekhi, Dhaval Parmar, Sumanth Munikoti, T. Bickmore
Audience perceptions of public speakers' performance change over time. Some speakers start strong but quickly transition to mundane delivery, while others may have a few impactful and engaging portions of their talk preceded and followed by more pedestrian delivery. In this work, we model the time-varying qualities of a presentation as perceived by the audience and use these models both to provide diagnostic information to presenters and to improve the quality of automated performance assessments. In particular, we use HMMs to model various dimensions of perceived quality and how they change over time, and use the sequence of quality states to improve feedback and predictions. We evaluate this approach on a corpus of 74 presentations given in a controlled environment. Multimodal features spanning acoustic qualities, speech disfluencies, and nonverbal behavior were derived both automatically and manually using crowdsourcing. Ground truth on audience perceptions was obtained using judge ratings on both overall presentations (aggregate) and portions of presentations segmented by topic. We distilled the overall presentation quality into states representing the presenter's gaze, audio, gesture, audience interaction, and proxemic behaviors. We demonstrate that an HMM-based state representation of presentations improves performance assessments.
DOI: https://doi.org/10.1145/3382507.3418888
Detection of Micro-expression Recognition Based on Spatio-Temporal Modelling and Spatial Attention
Mengjiong Bai
My PhD project aims to contribute to affective computing applications that assist in depression diagnosis through micro-expression recognition. My motivation is the similarity between the low-intensity facial expressions that constitute micro-expressions and the low-intensity facial expressions ('frozen face') of people with psychomotor retardation caused by depression. The project will focus on, firstly, investigating spatio-temporal modelling and attention mechanisms for micro-expression recognition (MER) and, secondly, exploring the role of micro-expressions in automated depression analysis by improving deep learning architectures to detect low-intensity facial expressions. This work will investigate different deep learning architectures (e.g. Temporal Convolutional Networks (TCNN) or Gated Recurrent Units (GRU)) and validate the results on publicly available micro-expression benchmark datasets to quantitatively analyse the robustness and accuracy of MER's contribution to improving automatic depression analysis. Moreover, video magnification, as a way to enhance small movements, will be combined with the deep learning methods to address the low-intensity issues in MER.
DOI: https://doi.org/10.1145/3382507.3421160
Did the Children Behave?: Investigating the Relationship Between Attachment Condition and Child Computer Interaction
Dong-Bach Vo, S. Brewster, A. Vinciarelli
This work investigates the interplay between Child-Computer Interaction and attachment, a psychological construct that accounts for how children perceive their parents to be. In particular, the article uses a multimodal approach to test whether children with different attachment conditions tend to use the same interactive system differently. The experiments show that the accuracy in predicting usage behaviour changes, to a statistically significant extent, according to the attachment conditions of the 52 experiment participants (age range 5 to 9). Such a result suggests that attachment-relevant processes are actually at work when people interact with technology, at least when it comes to children.
DOI: https://doi.org/10.1145/3382507.3418858
Early Prediction of Visitor Engagement in Science Museums with Multimodal Learning Analytics
Andrew Emerson, Nathan L. Henderson, Jonathan P. Rowe, Wookhee Min, Seung Y. Lee, James Minogue, James C. Lester
Modeling visitor engagement is a key challenge in informal learning environments such as museums and science centers. Devising predictive models of visitor engagement that accurately forecast salient features of visitor behavior, such as dwell time, holds significant potential for enabling adaptive learning environments and visitor analytics for museums and science centers. In this paper, we introduce a multimodal early prediction approach to modeling visitor engagement with interactive science museum exhibits. We utilize multimodal sensor data, including eye gaze, facial expression, posture, and interaction log data captured during visitor interactions with an interactive museum exhibit for environmental science education, to induce predictive models of visitor dwell time. We investigate machine learning techniques (random forest, support vector machine, Lasso regression, gradient boosting trees, and multi-layer perceptron) to induce multimodal predictive models of visitor engagement with data from 85 museum visitors. Results from a series of ablation experiments suggest that incorporating additional modalities into predictive models of visitor engagement improves model accuracy. In addition, the models show improved predictive performance over time, demonstrating that increasingly accurate predictions of visitor dwell time can be achieved as more evidence becomes available from visitor interactions with interactive science museum exhibits. These findings highlight the efficacy of multimodal data for modeling museum exhibit visitor engagement.
DOI: https://doi.org/10.1145/3382507.3418890
Group Level Audio-Video Emotion Recognition Using Hybrid Networks
Chuanhe Liu, Wenqian Jiang, Minghao Wang, Tianhao Tang
This paper presents a hybrid network for audio-video group emotion recognition. The proposed architecture includes an audio stream, a facial emotion stream, an environmental object statistics stream (EOS), and a video stream. We adopted this method at the 8th Emotion Recognition in the Wild Challenge (EmotiW 2020). According to the feedback on our submissions, the best result achieved 76.85% on the Video level Group AFfect (VGAF) Test Database, 26.89% higher than the baseline. Such improvements demonstrate that our method is state-of-the-art.
DOI: https://doi.org/10.1145/3382507.3417968
Analysis of Face-Touching Behavior in Large Scale Social Interaction Dataset
Cigdem Beyan, Matteo Bustreo, Muhammad Shahid, Gianluca Bailo, N. Carissimi, Alessio Del Bue
We present the first publicly available annotations for the analysis of face-touching behavior. The annotations cover a dataset of audio-visual recordings of small-group social interactions comprising 64 videos, each lasting between 12 and 30 minutes and showing a single person participating in four-person meetings. The annotations were performed by 16 annotators in total, with almost perfect agreement (Cohen's Kappa = 0.89) on average. In total, 74K and 2M video frames were labelled as face-touch and no-face-touch, respectively. Given the dataset and the collected annotations, we also present an extensive evaluation of several methods for face-touching detection: rule-based, supervised learning with hand-crafted features, and feature learning and inference with a Convolutional Neural Network (CNN). Our evaluation indicates that the CNN performed best, reaching an F1-score of 83.76% and a Matthews Correlation Coefficient of 0.84. To foster future research on this problem, the code and dataset were made publicly available (github.com/IIT-PAVIS/Face-Touching-Behavior), providing all video frames, face-touch annotations, body pose estimations including face and hand key-point detections, face bounding boxes, as well as the implemented baseline methods and the cross-validation splits used for training and evaluating our models.
DOI: https://doi.org/10.1145/3382507.3418876