A system's ability to understand and model a human's engagement during an interactive task is important both for adapting its behavior to the moment and for achieving a coherent interaction over time. Standard practice for creating such a capability requires uncovering and modeling the multimodal cues that predict engagement in a given task environment. The first step in this methodology is to have human coders produce "gold standard" judgments of sample behavior. In this paper we report results from applying this first step to the complex and varied behavior of children playing a fast-paced, speech-controlled, side-scrolling game called Mole Madness. We introduce a concrete metric for engagement (willingness to continue the interaction) that leads to better inter-coder judgments for children playing in pairs, explore how coders perceive the relative contribution of audio and visual cues, and describe engagement trends and patterns in our population. We also examine how the measures change when the same children play Mole Madness with a robot instead of a peer. We conclude by discussing the implications of the differences within and across play conditions for the automatic estimation of engagement and for the extension of our autonomous robot player into a "buddy" that can individualize interaction for each player and game.
{"title":"Toward Better Understanding of Engagement in Multiparty Spoken Interaction with Children","authors":"S. Moubayed, J. Lehman","doi":"10.1145/2818346.2820733","DOIUrl":"https://doi.org/10.1145/2818346.2820733","url":null,"abstract":"A system's ability to understand and model a human's engagement during an interactive task is important for both adapting its behavior to the moment and achieving a coherent interaction over time. Standard practice for creating such a capability requires uncovering and modeling the multimodal cues that predict engagement in a given task environment. The first step in this methodology is to have human coders produce \"gold standard\" judgments of sample behavior. In this paper we report results from applying this first step to the complex and varied behavior of children playing a fast-paced, speech-controlled, side-scrolling game called Mole Madness. We introduce a concrete metric for engagement-willingness to continue the interaction--that leads to better inter-coder judgments for children playing in pairs, explore how coders perceive the relative contribution of audio and visual cues, and describe engagement trends and patterns in our population. We also examine how the measures change when the same children play Mole Madness with a robot instead of a peer. We conclude by discussing the implications of the differences within and across play conditions for the automatic estimation of engagement and the extension of our autonomous robot player into a \"buddy\" that can individualize interaction for each player and game.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"143 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75350975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nonverbal behaviors such as facial expressions, eye contact, gestures, and body movements in general have a strong impact on the process of communicative interaction. Gestures play an important role in interpersonal communication in the classroom between student and teacher. To assist teachers with exhibiting open and positive nonverbal signals in their actual classrooms, we have designed a multimodal teaching application with provisions for real-time feedback, in coordination with our TeachLivE test-bed environment and its reflective application, ReflectLivE. Individuals walk into this virtual environment and interact with five virtual students shown on a large-screen display. The study is designed with two settings (each seven minutes long). In each setting, the participants are provided with lesson plans from which they teach. All participants take part in both settings, with half receiving automated real-time feedback about their body poses in the first session (group 1) and the other half receiving such feedback in the second session (group 2). Feedback takes the form of a visual indication each time the participant exhibits a closed stance. To create this automated feedback application, a closed-posture corpus was collected from existing TeachLivE teaching records and used to train the posture classifier. After each session, the participants complete a post-questionnaire about their experience. We hypothesize that visual feedback improves positive body gestures for both groups during their feedback session, and that for group 1 this improvement persists into their second, unaided session, whereas for group 2 improvements occur only during the second session.
{"title":"Multimodal Assessment of Teaching Behavior in Immersive Rehearsal Environment-TeachLivE","authors":"R. Barmaki","doi":"10.1145/2818346.2823306","DOIUrl":"https://doi.org/10.1145/2818346.2823306","url":null,"abstract":"Nonverbal behaviors such as facial expressions, eye contact, gestures, and body movements in general have strong impacts on the process of communicative interactions. Gestures play an important role in interpersonal communication in the classroom between student and teacher. To assist teachers with exhibiting open and positive nonverbal signals in their actual classroom, we have designed a multimodal teaching application with provisions for real-time feedback in coordination with our TeachLivE test-bed environment and its reflective application; ReflectLivE. Individuals walk into this virtual environment and interact with five virtual students shown on a large screen display. The recent research study is designed to have two settings (7-minute long each). In each of the settings, the participants are provided lesson plans from which they teach. All the participants are asked to take part in both settings, with half receiving automated real-time feedback about their body poses in the first session (group 1) and the other half receiving such feedback in the second session (group 2). Feedback is in the form of a visual indication each time the participant exhibits a closed stance. To create this automated feedback application, a closed posture corpus was collected and trained based on the existing TeachLivE teaching records. After each session, the participants take a post-questionnaire about their experience. We hypothesize that visual feedback improves positive body gestures for both groups during the feedback session, and that, for group 2, this persists into their second unaided session but, for group 1, improvements occur only during the second session.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72720552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Touch is a primary nonverbal communication channel used to convey emotions and other social messages. Despite its importance, this channel remains little explored in affective computing, where much more focus has been placed on the visual and aural channels. In this paper, we investigate the possibility of automatically discriminating between different social touch types. We propose five distinct feature sets for describing touch behaviours captured by a grid of pressure sensors. These features are then combined using the Random Forest and Boosting methods to categorize the touch gesture type. The proposed methods were evaluated on both the HAART (7 gesture types over different surfaces) and the CoST (14 gesture types over the same surface) datasets made available by the Social Touch Gesture Challenge 2015. Performance well above chance level was achieved, with accuracies of 67% on the HAART and 59% on the CoST test sets.
{"title":"Social Touch Gesture Recognition using Random Forest and Boosting on Distinct Feature Sets","authors":"Y. F. A. Gaus, Temitayo A. Olugbade, Asim Jan, R. Qin, Jingxin Liu, Fan Zhang, H. Meng, N. Bianchi-Berthouze","doi":"10.1145/2818346.2830599","DOIUrl":"https://doi.org/10.1145/2818346.2830599","url":null,"abstract":"Touch is a primary nonverbal communication channel used to communicate emotions or other social messages. Despite its importance, this channel is still very little explored in the affective computing field, as much more focus has been placed on visual and aural channels. In this paper, we investigate the possibility to automatically discriminate between different social touch types. We propose five distinct feature sets for describing touch behaviours captured by a grid of pressure sensors. These features are then combined together by using the Random Forest and Boosting methods for categorizing the touch gesture type. The proposed methods were evaluated on both the HAART (7 gesture types over different surfaces) and the CoST (14 gesture types over the same surface) datasets made available by the Social Touch Gesture Challenge 2015. Well above chance level performances were achieved with a 67% accuracy for the HAART and 59% for the CoST testing datasets respectively.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"93 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83207166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This consortium paper outlines a research plan for investigating deep learning techniques as applied to multimodal multi-task learning and multimodal fusion. We discuss our prior research results in this area and how these results motivate us to explore this direction further. We also define concrete steps of enquiry we wish to undertake as a short-term goal, and outline further challenges of multimodal learning with deep neural networks, such as inter- and intra-modality synchronization, robustness to noise in modality data acquisition, and data insufficiency.
{"title":"Challenges in Deep Learning for Multimodal Applications","authors":"Sayan Ghosh","doi":"10.1145/2818346.2823313","DOIUrl":"https://doi.org/10.1145/2818346.2823313","url":null,"abstract":"This consortium paper outlines a research plan for investigating deep learning techniques as applied to multimodal multi-task learning and multimodal fusion. We discuss our prior research results in this area, and how these results motivate us to explore more in this direction. We also define concrete steps of enquiry we wish to undertake as a short-term goal, and further outline some other challenges of multimodal learning using deep neural networks, such as inter and intra-modality synchronization, robustness to noise in modality data acquisition, and data insufficiency.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"36 6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77677328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Emotion Recognition in the Wild challenge poses significant problems for state-of-the-art auditory and visual affect quantification systems. To overcome these challenges, we investigate supplementary meta features based on film semiotics. Movie scenes are often presented and arranged in such a way as to amplify the emotion interpreted by the viewing audience. This technique is referred to as mise en scene in the film industry and involves strict, intentional control of the color palette, light source color, and arrangement of actors and objects in the scene. To this end, two algorithms for extracting mise en scene information are proposed. Rule-of-thirds motion history histograms detect motion along rule-of-thirds guidelines. Rule-of-thirds color layout descriptors compactly describe a scene at the rule-of-thirds intersections. A comprehensive system is proposed that measures expression, emotion, vocalics, syntax, semantics, and film-based meta information. The proposed mise en scene features achieve a higher classification rate and ROC area than LBP-TOP features on the validation set of the EmotiW 2015 challenge. The complete system improves classification performance over the baseline algorithm by 3.17% on the testing set.
{"title":"Quantification of Cinematography Semiotics for Video-based Facial Emotion Recognition in the EmotiW 2015 Grand Challenge","authors":"Albert C. Cruz","doi":"10.1145/2818346.2830592","DOIUrl":"https://doi.org/10.1145/2818346.2830592","url":null,"abstract":"The Emotion Recognition in the Wild challenge poses significant problems to state of the art auditory and visual affect quantification systems. To overcome the challenges, we investigate supplementary meta features based on film semiotics. Movie scenes are often presented and arranged in such a way as to amplify the emotion interpreted by the viewing audience. This technique is referred to as mise en scene in the film industry and involves strict and intentional control of color palette, light source color, and arrangement of actors and objects in the scene. To this end, two algorithms for extracting mise en scene information are proposed. Rule of thirds based motion history histograms detect motion along rule of thirds guidelines. Rule of thirds color layout descriptors compactly describe a scene at rule of thirds intersections. A comprehensive system is proposed that measures expression, emotion, vocalics, syntax, semantics, and film-based meta information. The proposed mise en scene features have a higher classification rate and ROC area than LBP-TOP features on the validation set of the EmotiW 2015 challenge. The complete system improves classification performance over the baseline algorithm by 3.17% on the testing set.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84877247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We report on an investigation of acted and non-acted emotional speech and the resulting Non-/acted LAST MINUTE corpus (NaLMC) database. The database consists of newly recorded acted emotional speech samples that were designed to allow direct comparison of acted and non-acted emotional speech. The non-acted samples are taken from the LAST MINUTE corpus (LMC) [1]. Furthermore, emotional labels were added to selected passages of the LMC, and a self-rating of the LMC recordings was performed. Although the main objective of the NaLMC database is to enable comparative analysis of acted and non-acted emotional speech, both audio and video signals were recorded to allow multimodal investigations.
{"title":"NaLMC: A Database on Non-acted and Acted Emotional Sequences in HCI","authors":"Kim Hartmann, J. Krüger, J. Frommer, A. Wendemuth","doi":"10.1145/2818346.2820772","DOIUrl":"https://doi.org/10.1145/2818346.2820772","url":null,"abstract":"We report on the investigation on acted and non-acted emotional speech and the resulting Non-/acted LAST MINUTE corpus (NaLMC) database. The database consists of newly recorded acted emotional speech samples which were designed to allow the direct comparison of acted and non-acted emotional speech. The non-acted samples are taken from the LAST MINUTE corpus (LMC) [1]. Furthermore, emotional labels were added to selected passages of the LMC and a self-rating of the LMC recordings was performed. Although the main objective of the NaLMC database is to allow the comparative analysis of acted and non-acted emotional speech, both audio and video signals were recorded to allow multimodal investigations.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87975619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Task analysis using eye activity has previously been used to estimate cognitive load on a per-task basis. However, since pupil size is a continuous physiological signal, eye-based classification of cognitive load can be made more accurate by considering cognitive load at a higher temporal resolution and by incorporating models of the interactions between the task-evoked pupillary response (TEPR) and other pupillary responses, such as the pupillary light reflex, into the classification model. In this work, methods of using eye activity as a measure of continuous mental load will be investigated. Subsequently, pupillary light reflex models will be incorporated into task analysis to investigate the possibility of enhancing the reliability of cognitive load estimation under varied lighting conditions. This will culminate in the development and evaluation of a classification system that measures rapidly changing cognitive load. Task analysis of this calibre will enable interfaces in wearable optical devices to be constantly aware of the user's mental state and to control information flow to prevent information overload and interruptions.
{"title":"Instantaneous and Robust Eye-Activity Based Task Analysis","authors":"Hoe Kin Wong","doi":"10.1145/2818346.2823312","DOIUrl":"https://doi.org/10.1145/2818346.2823312","url":null,"abstract":"Task analysis using eye-activity has previously been used for estimating cognitive load on a per-task basis. However, since pupil size is a continuous physiological signal, eye-based classification accuracy of cognitive load can be improved by considering cognitive load at a higher temporal resolution and incorporating models of the interactions between the task-evoked pupillary response (TEPR) and other pupillary responses such as the Pupillary Light Reflex into the classification model. In this work, methods of using eye-activity as a measure of continuous mental load will be investigated. Subsequently pupil light reflex models will be incorporated into task analysis to investigate the possibility of enhancing the reliability of cognitive load estimation in varied lighting conditions. This will culminate in the development and evaluation of a classification system which measures rapidly changing cognitive load. Task analysis of this calibre will enable interfaces in wearable optical devices to be constantly aware of the user's mental state and control information flow to prevent information overload and interruptions.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"44 7","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91497943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detection of highlights in movies is a challenge for the affective understanding and implicit tagging of films. Under the hypothesis that synchronization of spectators' reactions indicates such highlights, we define a synchronization measure between spectators that is capable of extracting movie highlights. The intuitive idea of our approach is to define (a) a parameterization of one spectator's physiological data on a manifold, and (b) the synchronization measure between spectators as the Kolmogorov-Smirnov distance between local shape distributions of the underlying manifolds. We evaluate our approach using data collected in an experiment where the electro-dermal activity of spectators was recorded during the entire projection of a movie in a cinema. We compare our methodology with baseline synchronization measures such as correlation, Spearman's rank correlation, mutual information, and the Kolmogorov-Smirnov distance. Results indicate that the proposed approach accurately distinguishes highlight from non-highlight scenes.
{"title":"Spectators' Synchronization Detection based on Manifold Representation of Physiological Signals: Application to Movie Highlights Detection","authors":"Michal Muszynski, Theodoros Kostoulas, G. Chanel, Patrizia Lombardo, T. Pun","doi":"10.1145/2818346.2820773","DOIUrl":"https://doi.org/10.1145/2818346.2820773","url":null,"abstract":"Detection of highlights in movies is a challenge for the affective understanding and implicit tagging of films. Under the hypothesis that synchronization of the reaction of spectators indicates such highlights, we define a synchronization measure between spectators that is capable of extracting movie highlights. The intuitive idea of our approach is to define (a) a parameterization of one spectator's physiological data on a manifold; (b) the synchronization measure between spectators as the Kolmogorov-Smirnov distance between local shape distributions of the underlying manifolds. We evaluate our approach using data collected in an experiment where the electro-dermal activity of spectators was recorded during the entire projection of a movie in a cinema. We compare our methodology with baseline synchronization measures, such as correlation, Spearman's rank correlation, mutual information, Kolmogorov-Smirnov distance. Results indicate that the proposed approach allows to accurately distinguish highlight from non-highlight scenes.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"400 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80275527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Body language plays an important role in learning processes and communication. For example, communication research has produced evidence that mathematical knowledge can be embodied in the gestures made by teachers and students. Likewise, body postures and gestures are utilized by speakers in oral presentations to convey ideas and important messages. Consequently, capturing and analyzing non-verbal behaviors is an important aspect of multimodal learning analytics (MLA) research. With regard to sensing capabilities, the introduction of depth sensors such as the Microsoft Kinect has greatly facilitated research and development in this area. However, the rapid advancement in hardware and software capabilities is not always in sync with the expanding set of features reported in the literature. For example, though Anvil is a widely used state-of-the-art annotation and visualization toolkit for motion traces, its motion recording component based on OpenNI is outdated. As part of our research in developing multimodal educational assessments, we began an effort to develop and standardize algorithms for multimodal feature extraction and for creating automated scoring models. This paper provides an overview of relevant work in multimodal research on educational tasks, and then summarizes our work using multimodal sensors to develop assessments of communication skills, with attention to the use of depth sensors. Specifically, we focus on the task of public speaking assessment using the Microsoft Kinect. Additionally, we introduce an open-source Python package for computing expressive body language features from Kinect motion data, which we hope will benefit the MLA research community.
{"title":"Utilizing Depth Sensors for Analyzing Multimodal Presentations: Hardware, Software and Toolkits","authors":"C. W. Leong, L. Chen, G. Feng, Chong Min Lee, Matthew David Mulholland","doi":"10.1145/2818346.2830605","DOIUrl":"https://doi.org/10.1145/2818346.2830605","url":null,"abstract":"Body language plays an important role in learning processes and communication. For example, communication research produced evidence that mathematical knowledge can be embodied in gestures made by teachers and students. Likewise, body postures and gestures are also utilized by speakers in oral presentations to convey ideas and important messages. Consequently, capturing and analyzing non-verbal behaviors is an important aspect in multimodal learning analytics (MLA) research. With regard to sensing capabilities, the introduction of depth sensors such as the Microsoft Kinect has greatly facilitated research and development in this area. However, the rapid advancement in hardware and software capabilities is not always in sync with the expanding set of features reported in the literature. For example, though Anvil is a widely used state-of-the-art annotation and visualization toolkit for motion traces, its motion recording component based on OpenNI is outdated. As part of our research in developing multimodal educational assessments, we began an effort to develop and standardize algorithms for purposes of multimodal feature extraction and creating automated scoring models. This paper provides an overview of relevant work in multimodal research on educational tasks, and proceeds to summarize our work using multimodal sensors in developing assessments of communication skills, with attention on the use of depth sensors. Specifically, we focus on the task of public speaking assessment using Microsoft Kinect. Additionally, we introduce an open-source Python package for computing expressive body language features from Kinect motion data, which we hope will benefit the MLA research community.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83451695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hearings of witnesses and defendants play a crucial role in reaching court trial decisions. Given the high-stakes nature of trial outcomes, implementing accurate and effective computational methods to evaluate the honesty of court testimonies can offer valuable support during the decision-making process. In this paper, we address the identification of deception in real-life trial data. We introduce a novel dataset consisting of videos collected from public court trials. We explore the use of verbal and non-verbal modalities to build a multimodal deception detection system that aims to discriminate between truthful and deceptive statements provided by defendants and witnesses. We achieve classification accuracies in the range of 60-75% when using a model that extracts and fuses features from the linguistic and gesture modalities. In addition, we present a human deception detection study in which we evaluate the human capability to detect deception in trial hearings. The results show that our system outperforms humans at identifying deceit.
{"title":"Deception Detection using Real-life Trial Data","authors":"Verónica Pérez-Rosas, M. Abouelenien, Rada Mihalcea, Mihai Burzo","doi":"10.1145/2818346.2820758","DOIUrl":"https://doi.org/10.1145/2818346.2820758","url":null,"abstract":"Hearings of witnesses and defendants play a crucial role when reaching court trial decisions. Given the high-stake nature of trial outcomes, implementing accurate and effective computational methods to evaluate the honesty of court testimonies can offer valuable support during the decision making process. In this paper, we address the identification of deception in real-life trial data. We introduce a novel dataset consisting of videos collected from public court trials. We explore the use of verbal and non-verbal modalities to build a multimodal deception detection system that aims to discriminate between truthful and deceptive statements provided by defendants and witnesses. We achieve classification accuracies in the range of 60-75% when using a model that extracts and fuses features from the linguistic and gesture modalities. In addition, we present a human deception detection study where we evaluate the human capability of detecting deception in trial hearings. The results show that our system outperforms the human capability of identifying deceit.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":"61 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83785615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}